
Efficient goods handling is critical for both manufacturing plants and distribution centers, as it directly impacts the supply chain, lead times, and customer satisfaction. Traditionally, companies have relied on manual labor and human supervision to ensure that trucks are loaded and unloaded according to standard operating procedures (SOPs). However, this approach is prone to human error, inconsistencies, and delays, which can lead to significant losses in time and resources.
This blog describes a use case of deep learning in the logistics industry and shows how it transforms the way trucks are loaded and unloaded.

Use Case

This solution performs a qualitative analysis of an industrial process using video from a CCTV camera and determines whether the SOP (Standard Operating Procedure) has been followed correctly.
It also provides insights into the SOP itself, showing where the overall time taken to complete the process can be reduced or which specific steps have room for improvement.

The company produces aluminum rolls that are loaded onto trucks at the dock area. Presently, the company uses a CCTV camera to capture a live feed 24×7. For this use case, we collected two days of video files from the company.

CCTV Footage of loading/unloading dock with 3 bays

Current Process At Loading/Unloading Bay

A truck arrives at one of the bays, and the SOP then proceeds as follows:

  1. The driver uncovers the trailer.
  2. The driver and the technician release the trailer straps.
  3. The forklift arrives either to load the aluminum roll or to unload it.
  4. In case of loading, the forklift carries the aluminum roll and the truck trailer is empty.
  5. In case of unloading, the forklift is empty and the truck trailer is loaded with aluminum rolls.
  6. The driver and the technician put the trailer straps back on and inspect the area around the truck to remove any obstacles.
  7. The driver covers the trailer again.

Solution Design and Components

We analyzed the pain points and designed a decoupled serverless architecture following the approach mentioned in the diagram below.

Solution design

Camera Component :
This component allows either a live camera or a video file to stream the MJPEG encoded video data over the network. In this project, a video file is used to stream the data, at 1080p HD (High Definition) resolution and 30 FPS (Frames Per Second).
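As a rough illustration of how a consumer might pull individual images out of an MJPEG stream, the sketch below splits a byte stream on the JPEG start-of-image and end-of-image markers. It is a minimal sketch, not the project's implementation: the function name `iter_jpeg_frames` is hypothetical, and it assumes the JPEG frames do not contain embedded thumbnails (which would carry their own markers).

```python
SOI = b"\xff\xd8"  # JPEG start-of-image marker
EOI = b"\xff\xd9"  # JPEG end-of-image marker


def iter_jpeg_frames(chunks):
    """Yield complete JPEG frames from an iterable of byte chunks."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        while True:
            start = buffer.find(SOI)
            if start == -1:
                break
            end = buffer.find(EOI, start + 2)
            if end == -1:
                break  # frame is incomplete; wait for more data
            yield buffer[start:end + 2]
            buffer = buffer[end + 2:]


# Two frames arriving split across network chunks.
stream = [b"--boundary\xff\xd8AAA", b"A\xff\xd9--boundary\xff\xd8BB\xff\xd9"]
frames = list(iter_jpeg_frames(stream))
print(len(frames))  # 2
```

Each yielded frame is a standalone JPEG that can be handed to an image decoder before inference.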

Model Inference Logic :
This component houses the trained YOLOv4 model. It detects the following classes on each image decoded from the camera component's MJPEG stream and composes the JSON object consumed by the business logic component:

a. Truck covered

b. Empty Truck

c. Loaded Truck

d. Aluminum Roll

e. Forklift
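The exact JSON schema is not given here, so the following is only a sketch of what a per-frame object for the business logic component could look like. The field names (`frame`, `detections`, `label`, `confidence`, `box`) and the helper `compose_frame_payload` are assumptions made for illustration.

```python
import json
from dataclasses import dataclass, asdict

# The five classes the model is trained to detect.
CLASSES = ("Truck covered", "Empty Truck", "Loaded Truck", "Aluminum Roll", "Forklift")


@dataclass
class Detection:
    label: str         # one of CLASSES
    confidence: float  # detector score between 0 and 1
    box: tuple         # (x_min, y_min, x_max, y_max) in pixels (assumed layout)


def compose_frame_payload(frame_index, detections):
    """Serialize the detections of one frame into a JSON string."""
    for det in detections:
        if det.label not in CLASSES:
            raise ValueError(f"unknown class: {det.label}")
    return json.dumps({
        "frame": frame_index,
        "detections": [asdict(det) for det in detections],
    })


payload = compose_frame_payload(
    120,
    [Detection("Forklift", 0.91, (100, 200, 300, 400)),
     Detection("Aluminum Roll", 0.88, (150, 220, 250, 320))],
)
print(payload)
```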

Business Logic and SOP orchestration component :

This component orchestrates the SOP and business logic defined by the company, using the sequence of detected classes from the model inference. The sequence is analyzed against the SOP and checked for anomalies or deviations, if any. The results can also be used to fine-tune and optimize the SOP and increase throughput. The output of this component is compiled into a CSV file for use in business reports.
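To make the idea of checking a detected sequence against the SOP concrete, here is a minimal sketch. It assumes each frame has already been mapped to a step label; the step names in `SOP_STEPS` and the helper functions are hypothetical, and the real orchestration logic is likely more involved.

```python
import csv
import io
from itertools import groupby

# Hypothetical step labels, in the order the SOP expects them.
SOP_STEPS = ["uncover", "release_straps", "forklift_transfer", "fasten_straps", "cover"]


def collapse_runs(frame_steps):
    """Collapse per-frame step labels into (step, frame_count) runs."""
    return [(step, sum(1 for _ in group)) for step, group in groupby(frame_steps)]


def find_deviations(runs, expected=SOP_STEPS):
    """Report steps that are missing or appear out of the expected order."""
    observed = [step for step, _ in runs]
    deviations = [f"missing step: {step}" for step in expected if step not in observed]
    positions = [expected.index(step) for step in observed if step in expected]
    if positions != sorted(positions):
        deviations.append("steps out of order")
    return deviations


def write_report(runs, out):
    """Write one CSV row per observed step run."""
    writer = csv.writer(out)
    writer.writerow(["step", "frames"])
    writer.writerows(runs)


frame_steps = ["uncover"] * 3 + ["release_straps"] * 2 + ["cover"]
runs = collapse_runs(frame_steps)
print(find_deviations(runs))
report = io.StringIO()
write_report(runs, report)
print(report.getvalue())
```

The frame counts per run are what a time calculation would consume later; the deviation list is what a report would flag.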

Business Analytics and SOP reports component :

This component uses the CSV file generated in the business logic component and generates meaningful insights and analytics for the business users.


The machine learning operations for this project were carried out in the following parts:

  1. Data collection
  2. Data Annotation
  3. Model training
  4. Model validation

Data Collection :
Two days' worth of video files from the CCTV camera were received from the company. From these, video clips of 20 operations were extracted (10 loading and 10 unloading), and the clips were then annotated image by image.

Data Annotations :
For the annotations, CVAT was used to expedite the annotation process. CVAT (Computer Vision Annotation Tool) is a free, open-source, web-based image and video annotation tool used for labeling data for computer vision algorithms. A total of 40,000 images were annotated across the following classes:

  1. Truck covered
  2. Empty Truck
  3. Loaded Truck
  4. Aluminum Roll
  5. Forklift

Once done, the complete annotation package was exported for use in model training.

Model Training :
For model training, YOLOv4 was used. YOLO stands for You Only Look Once and is a single-pass, state-of-the-art algorithm for detecting objects in images. YOLOv4 was created by Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. It runs twice as fast as EfficientDet with comparable performance. YOLOv4's architecture is composed of CSPDarknet53 as the backbone, a spatial pyramid pooling additional module, a PANet path-aggregation neck, and a YOLOv3 head. YOLOv4 uses many new features and combines some of them to achieve state-of-the-art results: 43.5% AP (65.7% AP50) on the MS COCO dataset at a real-time speed of ~65 FPS on a Tesla V100. The new features used by YOLOv4 are:

  1. Weighted-Residual-Connections (WRC)
  2. Cross-Stage-Partial-connections (CSP)
  3. Cross mini-Batch Normalization (CmBN)
  4. Self-adversarial training (SAT)
  5. Mish activation
  6. Mosaic data augmentation
  7. DropBlock regularization
  8. Complete Intersection over Union loss (CIoU loss)

The model was trained on Google Colab with a Tesla K100 GPU runtime. Training ran for 10,000 iterations with a batch size of 64 and a learning rate of 0.0045. The following graph was obtained after training completed, with an average loss of 0.0236:

Model training
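The training configuration file itself is not shown here. Assuming the Darknet framework was used (an assumption, though it is the usual way to train YOLOv4), the sketch below derives the remaining cfg values from the reported settings using the conventions in the Darknet README: learning-rate steps at 80% and 90% of `max_batches`, and `filters = (classes + 5) * 3` in the convolutional layer before each `[yolo]` layer. The helper name `darknet_cfg_overrides` is hypothetical.

```python
def darknet_cfg_overrides(num_classes, max_batches, batch=64, learning_rate=0.0045):
    """Return Darknet cfg values derived from class count and iteration budget."""
    return {
        "batch": batch,
        "learning_rate": learning_rate,
        "max_batches": max_batches,
        # Learning-rate decay points at 80% and 90% of max_batches.
        "steps": (max_batches * 8 // 10, max_batches * 9 // 10),
        "classes": num_classes,
        # Filters in the convolutional layer before each [yolo] layer.
        "filters": (num_classes + 5) * 3,
    }


# Five classes and 10,000 iterations, as reported for this project.
for key, value in darknet_cfg_overrides(num_classes=5, max_batches=10000).items():
    print(f"{key} = {value}")
```

For the five classes used here, this gives 30 filters and decay steps at 8,000 and 9,000 iterations.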

Standard Operating Procedure

The standard operating procedure (SOP) for a particular process is issued by the business entity. It lists the steps to be followed in sequence to complete the process and the maximum time allowed for each step.

The SOP is designed with the business's daily, weekly, and monthly economic targets in mind, and all workers must abide by it; otherwise, the risk of economic losses increases.

The following table is the SOP issued by the business entity for the truck loading process in one of its Europe-based plants:

Truck loading SOP time

The following table is the SOP issued by the business entity for the truck unloading process in one of its Europe-based plants:

Truck unloading SOP time
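Since an SOP of this kind pairs each step with a maximum allowed time, a deviation check can be as simple as comparing measured durations against those limits. The sketch below illustrates this; the step names and limits in `MAX_MINUTES` are placeholders, not the values from the SOP tables, and `find_overruns` is a hypothetical helper.

```python
# Placeholder limits in minutes; the real values come from the SOP tables.
MAX_MINUTES = {"uncover": 5.0, "release_straps": 4.0}


def find_overruns(measured_minutes, max_minutes):
    """Return {step: minutes over the limit} for steps that exceeded it."""
    overruns = {}
    for step, limit in max_minutes.items():
        taken = measured_minutes.get(step, 0.0)
        if taken > limit:
            overruns[step] = round(taken - limit, 2)
    return overruns


measured = {"uncover": 6.5, "release_straps": 3.0}
print(find_overruns(measured, MAX_MINUTES))  # {'uncover': 1.5}
```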


The business logic we developed is the mediating and integrating layer between the classes defined in the model and the SOP defined by the business entity. The logic was fed sample CCTV recordings obtained from the business entity, and each frame of the video was analyzed to detect the following classes:

1. Truck covered

2. Empty Truck

3. Loaded Truck

4. Aluminum Roll

5. Forklift

Proximity calculator: A proximity calculator was developed specifically for Step 3 and Step 4 of the SOP, where it is necessary to determine whether the aluminum roll is on the forklift or on the truck trailer.

The following formula was used for the proximity calculation:

d = √[(x2 - x1)² + (y2 - y1)²]

d : Euclidean distance

x1, y1 : coordinates of the center of the bounding box for the Aluminum Roll class

x2, y2 : coordinates of the center of the bounding box for the Forklift/Empty Truck/Loaded Truck class
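The distance formula above translates directly into code. In this minimal sketch, bounding boxes are assumed to be `(x_min, y_min, x_max, y_max)` tuples in pixels, and the helper names, including `nearest_carrier`, are hypothetical. Picking the closest carrier is one plausible way to decide where the roll is, not necessarily the rule used in the project.

```python
import math


def box_center(box):
    """Center (x, y) of a box given as (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2, (y_min + y_max) / 2)


def center_distance(box_a, box_b):
    """Euclidean distance between the centers of two bounding boxes."""
    x1, y1 = box_center(box_a)
    x2, y2 = box_center(box_b)
    return math.hypot(x2 - x1, y2 - y1)


def nearest_carrier(roll_box, carriers):
    """Return the label of the carrier whose center is closest to the roll."""
    return min(carriers, key=lambda label: center_distance(roll_box, carriers[label]))


roll = (0, 0, 10, 10)  # center (5, 5)
carriers = {
    "Forklift": (0, 0, 16, 18),             # center (8, 9)
    "Empty Truck": (100, 100, 200, 200),    # center (150, 150)
}
print(center_distance(roll, carriers["Forklift"]))  # 5.0
print(nearest_carrier(roll, carriers))              # Forklift
```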

Below are the mapping steps that the algorithm follows with respect to the SOP to keep track of each step:

Mapping steps for the algorithm

Time instance calculator

Based on the detection of each mapped class with respect to the SOP, the time instance calculator uses the following logic to calculate the time taken for each step and, subsequently, the total time taken for the entire process:

Time for each step (in minutes) = (Total number of frames for the step / Frames per second of the video) / 60

Total time taken for the process = Time taken for step 1 + ... + Time taken for step 5

For example:

If the video runs at 30 frames per second and step 1 spans 3000 frames, then the time for step 1 = (3000 / 30) / 60 = 1.67 minutes.
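The same arithmetic can be written as two small functions. This is a minimal sketch: the function names and the frame counts for steps other than step 1 are made up for illustration.

```python
def step_minutes(frame_count, fps=30):
    """Duration of one step in minutes, from its frame count."""
    return frame_count / fps / 60


def total_minutes(frames_per_step, fps=30):
    """Total process duration in minutes, summed over all steps."""
    return sum(step_minutes(count, fps) for count in frames_per_step.values())


print(round(step_minutes(3000, fps=30), 2))  # 1.67

# Illustrative frame counts; only step 1 matches the worked example.
frames_per_step = {"step 1": 3000, "step 2": 1800}
print(round(total_minutes(frames_per_step), 2))  # 2.67
```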

By now, you should have a clear idea of the architecture and the business logic of this intelligent solution. In the next blog, we will discuss the working principle of the solution.

About the author 

Sailesh Patra is an AI enthusiast and passionate tech blogger, with a deep-rooted fondness for all things cutting-edge. With a background in AI, Computer Vision, and Robotics, he brings a unique perspective and innovative approach to solving real-world problems.
