
GSOC 2017 Project

Video: A Usual Day at JdeRobot

The above video shows the output of our trained YOLO-Depth model for the person detection task after being trained on just 256 unique training images.

  • Top left channel: Person detection using YOLO-D
  • Top right channel: RGB
  • Centre left channel: Depth
  • Centre right channel: Depth Colour Map
  • Bottom left channel: Linear Gray Scale Depth Map
  • Bottom right channel: Bit Interleaved Colour Map with RGB



Deep Learning to Detect Objects in Videos from RGB-D Input


I aim to solve the problem of detecting and recognizing objects in videos by using RGB and depth data as input. The deep learning algorithm used is an extension of the convolutional neural network YOLO (You Only Look Once). There are two phases to my project. In the first phase, since YOLO is natively written in Darknet, a framework written in C, I will explore ports of YOLO to a friendlier framework (e.g. Keras or TensorFlow) and test their accuracy against Darknet.

In the second phase, I will extend Darknet to accept RGB and depth data as input. An accuracy script will be written in Darknet to evaluate the accuracy of the created models, original YOLO (only RGB), extended YOLO (depth only) and YOLO-D (RGB + depth). The inclusion of depth should hopefully result in an increase in accuracy.

Importance of Person Detection to Robotics

The second phase of my project, in which I extend YOLO to train and test on a depth data set, involves images labeled with the person class only. Person detection is important to the vision system of a robot because a robot can learn to avoid people (e.g. in driverless cars) or follow people, depending on the use case. An output of our YOLO-Depth model trained for person detection is shown in the video at the top of this page.

Broad Steps

S.No. | Step | Status | Date | Remarks
Phase 1
1 | Explore existing neural architectures for object detection (R-CNN, YOLO, SSD) | DONE | June 2, 2017 |
2 | Read and learn about YOLO from research paper | DONE | June 9, 2017 |
3 | Install and experiment with YOLO on Darknet | DONE | June 16, 2017 |
4 | Explore and evaluate existing depth data sets | DONE | June 16, 2017 |
5 | Install and explore existing ports of YOLO in Keras or Tensorflow | DONE | June 23, 2017 |
6 | Test chosen Tensorflow port of YOLO on VOC2007 | DONE | June 30, 2017 |
7 | Integrate YOLO with JdeRobot's Camera Feed | DONE | July 7, 2017 |
Phase 2
8 | Prepare custom data set having RGB and depth maps | DONE | July 14, 2017 |
9 | Modify YOLO in Darknet to train and test on a custom data set (RGB only) | DONE | July 21, 2017 |
10 | Modify YOLO in Darknet to train and test on a custom data set (Depth only) | DONE | July 28, 2017 |
11 | Write an extension of YOLO to load 16 bit depth images | DONE | August 4, 2017 |
12 | Write an extension of YOLO to train and test on RGB + depth | DONE | August 18, 2017 |
13 | Write a script in Darknet to compare accuracies of different models | DONE | August 25, 2017 |


  • Nigel Fernandez (nigelsteven[dot]fernandez[at]iiitb[dot]org)
  • Francisco Rivas (franciscomiguel[dot]rivas[at]urjc[dot]es)
  • Alberto Martín (almartinflorido[at]gmail[dot]com)



I'm thankful to my open source organization JdeRobot for providing me with an Nvidia GeForce GTX 1080Ti GPU for training and testing my neural models. I had ssh access in a docker container hosted on the GPU server.

About YOLO

YOLO is the major focus of our GSoC project. YOLO breaks away from the approach of repurposing classifiers to detect objects and instead frames detection as a regression problem to bounding boxes and class probabilities. A single neural network is trained and used to predict bounding boxes and class probabilities in one evaluation.

YOLO is extremely fast, running at 45 fps, while a faster version runs at 150 fps. This is because the regression formulation does not require a complex pipeline. YOLO reasons globally about an image while making predictions, which enables it to encode contextual information. This differs from R-CNN, which sees only the proposed regions of the image and not the entire image. YOLO has also been shown to learn generalizable representations of objects: it gives decent accuracy on artwork images.

A brief explanation of how it works is as follows. The input image is divided into an S×S grid. A grid cell is responsible for detecting an object if the centre of the object lies in that cell. Each grid cell predicts B bounding boxes and confidence scores for these boxes. If no object lies in a box, its confidence score should be zero. Otherwise the confidence score should equal the IOU between the predicted box and the ground truth.

Each bounding box therefore consists of five predicted values: x, y (relative centre of the box), width, height and confidence. Each grid cell also predicts C class probabilities, conditioned on the grid cell containing an object. The paper uses S = 7 and B = 2, which results in the final layer predicting a 7×7×30 tensor. How did 7×7×30 come about? 7×7 is the number of grid cells and 30 is the number of predictions per grid cell. Each grid cell predicts 2 bounding boxes of 5 values each (5×2 = 10) and 20 class probabilities (PASCAL VOC has 20 labelled classes), for a total of 10 + 20 = 30 predictions per grid cell.
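The arithmetic above can be checked with a few lines of Python. This is only an illustrative sketch: the function names `yolo_output_shape` and `responsible_cell` are hypothetical and not part of Darknet, and the centre coordinates are assumed to be given relative to the whole image, in the range 0 to 1.

```python
def yolo_output_shape(S=7, B=2, C=20):
    """Shape of the final prediction tensor for an S x S grid."""
    values_per_box = 5  # x, y, width, height, confidence
    per_cell = B * values_per_box + C
    return (S, S, per_cell)


def responsible_cell(x_center, y_center, S=7):
    """(row, col) of the grid cell that contains an object's centre.

    Assumes the centre is given relative to the image size (0 to 1).
    """
    col = min(int(x_center * S), S - 1)
    row = min(int(y_center * S), S - 1)
    return row, col


print(yolo_output_shape())         # (7, 7, 30)
print(responsible_cell(0.5, 0.5))  # (3, 3)
```

With a single class, as in the person data set used later, the same formula gives 2×5 + 1 = 11 values per cell.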

The network design is inspired by GoogLeNet. The neural network consists of 24 convolutional layers followed by 2 fully connected layers. The difference from GoogLeNet is that, instead of inception modules, YOLO uses a 1×1 reduction layer followed by a 3×3 convolutional layer. A faster version of YOLO exists with 9 instead of 24 convolutional layers.

Drawbacks of YOLO include more localization errors compared to R-CNN. It struggles to localize small objects. Further, multiple small objects contained in one grid cell (of the 7×7 grid) will not all be detected because of the limit B: each grid cell can predict only B bounding boxes and therefore at most B objects, and B is set to 2. Another limitation lies in the loss function of YOLO, which treats an error in a small bounding box the same as in a large bounding box. This is not appropriate, since a small error in a small bounding box affects IOU much more than the same error in a large bounding box.

Phase 1

Week 1: May 30 to June 2

  • Literature review on research papers using neural networks, in particular CNN, for object detection
  • Read YOLO, SSD and R-CNN paper in depth
  • Exploratory search for existing code ports of YOLO from Darknet in Tensorflow or Keras
  • Weekly Report 1

Literature Review


R-CNN uses a different paradigm than the conventional sliding window approach. The sliding window approach uses a fixed size window to slide over the image and a CNN is applied at each instance for feature extraction followed by object classification. R-CNN optimizes by using selective search for region proposals. A high capacity CNN is then applied to only these regions to localize and segment objects.

Their object detection system consists of three major parts. Category-independent region proposals are generated using selective search. The second part uses a CNN to extract a fixed-length feature vector from each region. The third part consists of class-specific linear SVMs which use the feature vector for object recognition. The paper also illustrates the usefulness of supervised pre-training on an auxiliary task followed by domain-specific fine tuning.

Drawbacks of this approach include a complex training pipeline which is slow and hard to optimize, because each individual component must be trained separately. R-CNN falls short of real-time performance; even the later Faster R-CNN variant reaches only about 7 fps. Variants like Fast and Faster R-CNN exist.


The previously described R-CNN paper follows the paradigm of hypothesizing bounding boxes, generating a feature vector and applying a high-quality classifier. SSD, on the other hand, does not use this paradigm and eliminates bounding box proposals and the subsequent feature sampling stage. This results in simple end-to-end training and real-time performance of 59 fps.

The major novelty is the use of small convolutional filters applied to feature maps to produce predictions of different scales (more robust) on default bounding box offsets (no region proposal).

Week 2: June 3 to June 9

Keras Tutorial

During this week I spent part of my time learning Keras through its official documentation. There are two main paradigms for designing a neural architecture in Keras: the sequential paradigm and the functional paradigm. I designed a simple MLP for binary classification, based on the tutorial documentation, in both paradigms.

MLP for Binary Classification in the Sequential Paradigm

import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout

# Generate dummy data
x_train = np.random.random((1000, 20))
y_train = np.random.randint(2, size=(1000, 1))
x_test = np.random.random((100, 20))
y_test = np.random.randint(2, size=(100, 1))

model = Sequential()
model.add(Dense(64, input_dim=20, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

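# Compile and train (optimizer, epochs and batch size are assumed values)
model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])
model.fit(x_train, y_train,
          epochs=20,
          batch_size=64)
score = model.evaluate(x_test, y_test, batch_size=64)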

MLP for Binary Classification in the Functional API / Paradigm:

import numpy as np
from keras.layers import Input, Dense
from keras.models import Model

# Generate dummy data
x_train = np.random.random((1000, 20))
y_train = np.random.randint(2, size=(1000, 1))
x_test = np.random.random((100, 20))
y_test = np.random.randint(2, size=(100, 1))

# this returns a tensor matching the 20 input features
inputs = Input(shape=(20,))

# a layer instance is callable on a tensor, and returns a tensor
x = Dense(64, activation='relu')(inputs)
x = Dense(64, activation='relu')(x)
predictions = Dense(1, activation='sigmoid')(x)

# This creates a model that includes
# the Input layer and three Dense layers
model = Model(inputs=inputs, outputs=predictions)
# Compile and train (optimizer, epochs and batch size are assumed values)
model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])
model.fit(x_train, y_train,
          epochs=20,
          batch_size=64)
score = model.evaluate(x_test, y_test, batch_size=64)

Week 3: June 10 to June 16

  • Install and test original YOLO v2 from Darknet
  • Exploratory search for RGB-D data sets

RGB-D Datasets

  • The paper “RGBD Datasets: Past, Present and Future” lists the popular RGB-D datasets available to the research community. The website with additional information is available here. Datasets are presented across eight categories: semantics, object pose estimation, camera tracking, scene recognition, object tracking, human actions, faces and identification. The list is compiled by Michael Firman of University College London.
  • [1] 10,000 RGB-D images at a similar scale as PASCAL VOC
  • [2] 300 common household objects
  • [3] RGB-D SLAM Dataset and Benchmark
  • [4] NYU Depth Dataset V2 for indoor scene images
  • [5] Microsoft RGB-D Dataset 7-Scenes
  • [6] Cornell-RGBD-Dataset for Scene Understanding for Personal Robots
  • [7] Imperial College London: Living room and office room scenes
  • [8] A Large-Scale Hierarchical Multi-View RGB-D Object Dataset: 300 objects organized into 51 categories
  • [9] RGB-D People Dataset

Experimenting with YOLO on Darknet

I decided to experiment by changing the confidence threshold in YOLO. By default, YOLO displays only objects detected with a confidence above 0.25. Sample output detections are provided below for various confidence thresholds.

Confidence level | Predicted in | Detections
0.25 | 0.204s | Dog: 82%, Truck: 65%, Bicycle: 85%
0.15 | 0.196s | Motorbike: 15%, Dog: 82%, Truck: 65%, Bicycle: 85%
0.10 | 0.197s | Person: 11%, Motorbike: 15%, Dog: 82%, Truck: 65%, Bicycle: 85%
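The effect of the threshold can be illustrated with a small Python sketch. It is not Darknet's code: the function name `filter_by_confidence` is hypothetical, the scores are the rounded percentages from the sample output above, and a "greater than or equal" comparison is assumed so that the rounded values reproduce the same lists.

```python
# Rounded confidence scores taken from the sample detections above
detections = [
    ("person", 0.11),
    ("motorbike", 0.15),
    ("dog", 0.82),
    ("truck", 0.65),
    ("bicycle", 0.85),
]


def filter_by_confidence(detections, threshold):
    """Keep only detections whose confidence reaches the threshold."""
    return [(label, conf) for label, conf in detections if conf >= threshold]


for threshold in (0.25, 0.15, 0.10):
    kept = [label for label, _ in filter_by_confidence(detections, threshold)]
    print(threshold, kept)
```

Lowering the threshold only ever adds detections, so low-confidence boxes such as the 11% person appear without changing the confident ones.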

Week 4: June 17 to June 23

Tensorflow Port of YOLO

Written by Trinh Trieu, a former Google Brain employee, this port of YOLO to TensorFlow has been used in TensorFlow’s Android demo and in Udacity’s self-driving car course. The port is quite popular on GitHub, with 998 stars and 321 forks at the time of writing. It supports running pre-designed YOLO models such as the standard YOLO and tiny-YOLO by loading pre-trained weight files. In addition, it supports running custom modifications to the neural architecture by changing the configuration files. Another feature is training a model from scratch on the PASCAL VOC 2007 data set or on your own data set. The port is built on DarkFlow, which is the TensorFlow version of Darknet.

Comparison on Sample Image

Output from Original YOLO on Darknet

Output of Tensorflow Port of YOLO

Week 5: June 24 to June 30

  • Perform testing on chosen code ports of YOLO for future use
  • Complete search of RGB-D data sets
  • Complete search of existing code ports of YOLO
  • Weekly Report 4

Work Completed

In this week, my work consisted of evaluating the accuracy of YOLO-T (Trieu's Tensorflow port of YOLO) on the validation data set of PASCAL VOC 2012 main challenge. To this aim, I completed the following steps:

  • Installed YOLO-T on my system and experimented with different settings (e.g. changing the output to JSON format).
  • Read the documentation, especially the evaluation method used in the PASCAL VOC 2012 main challenge.
  • Downloaded the image set and corresponding annotations / labels from the VOC server.
  • Wrote a Python script to store only the validation images and corresponding annotations from the total image set for further accuracy testing.
  • Wrote a Python script which converts the original VOC annotations in XML to the same format used for prediction by YOLO-T, to enable a convenient accuracy evaluation.
  • Wrote a Python script to compute the Intersection Over Union (IOU) of two rectangles, taking care of the corner cases of overlapping and non-overlapping rectangles.
  • Ran YOLO-T on the validation set and stored the predicted outputs for accuracy evaluation.
  • Wrote a Python script which loads the correct labels of the validation set and the predicted labels from YOLO-T, and uses IOU to compute the accuracy score of YOLO-T.

Calculating IOU

The algorithm which takes two rectangles as input and calculates their IOU score is shown.
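As a minimal sketch of such a computation (not the script written for this project), the IOU of two axis-aligned rectangles can be computed as below. The function name `iou` and the box format `(x_min, y_min, x_max, y_max)` are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap; clamped to 0 when the boxes do not overlap
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection

    # Guard against degenerate (zero-area) boxes
    return intersection / union if union > 0 else 0.0


print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # overlap 1, union 7
print(iou((0, 0, 1, 1), (2, 2, 3, 3)))  # disjoint boxes
```

Clamping the overlap width and height at zero handles the non-overlapping case without any special branches.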

Evaluating Tensorflow Port (YOLO-T)

I used the find IOU script to evaluate the accuracy of YOLO-T. The evaluation method used in PASCAL VOC 2012 was to treat an IOU > 0.5 as a correct prediction. The script loads in the validation images and computes the accuracy of YOLO-T for each one. The following metrics are recorded:

  • total number of predicted objects
  • total number of truth objects
  • number of correct predictions
  • number of incorrect predictions
  • number of truth objects not predicted
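The metrics listed above could be collected per image roughly as follows. This is a hedged sketch rather than the evaluation script used here: the greedy one-to-one matching of predictions to truth boxes, the requirement that labels match, and the names `evaluate_image` and `iou` are all assumptions.

```python
def iou(a, b):
    """IOU of two boxes given as (x_min, y_min, x_max, y_max)."""
    w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = w * h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def evaluate_image(predictions, truths, iou_threshold=0.5):
    """Count correct, incorrect and missed objects for one image.

    predictions and truths are lists of (label, box). A prediction is
    correct if it overlaps an unmatched truth box of the same label
    with IOU above the threshold (assumed greedy matching).
    """
    matched = set()
    correct = 0
    for p_label, p_box in predictions:
        best_iou, best_idx = 0.0, None
        for idx, (t_label, t_box) in enumerate(truths):
            if idx in matched or t_label != p_label:
                continue
            score = iou(p_box, t_box)
            if score > best_iou:
                best_iou, best_idx = score, idx
        if best_idx is not None and best_iou > iou_threshold:
            matched.add(best_idx)
            correct += 1
    return {
        "predicted": len(predictions),
        "truth": len(truths),
        "correct": correct,
        "incorrect": len(predictions) - correct,
        "missed": len(truths) - len(matched),
    }


truths = [("person", (0, 0, 10, 10)), ("dog", (20, 20, 30, 30))]
predictions = [
    ("person", (1, 1, 10, 10)),   # good match
    ("person", (0, 0, 10, 10)),   # duplicate of an already matched object
    ("cat", (50, 50, 60, 60)),    # no such truth object
]
print(evaluate_image(predictions, truths))
```

Under this matching rule a duplicate prediction of an already matched object counts as incorrect, which lowers the fraction of correct predictions without affecting the fraction of truth objects found.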


I computed the accuracy metrics for 300 images from the validation set and plotted a histogram of two metrics. In the first histogram (figure 1) the fraction of correct predictions (similar to precision) is taken. In the second histogram (figure 2) the fraction of truth objects correctly predicted (similar to recall) is taken.

Figure 1: Fraction of Correct Predictions

Figure 2: Fraction of Truth Objects Predicted


We can observe that YOLO-T predicts most of the truth objects, since the majority of the images fall in the last bin of the histogram in figure 2. On the downside, YOLO-T also makes incorrect predictions, since the majority of the images fall in the middle bin of the histogram in figure 1. These incorrect predictions might be caused by the low confidence threshold in the settings, due to which less confident predictions (mostly duplicate predictions, from what I observed) were also included in the output of YOLO-T. Setting a higher confidence threshold should decrease the number of incorrect / duplicate predictions.

Week 6: July 1 to July 7

Summary of This Week's Work

The major goal of this week was to integrate YOLO with the camera feed of JdeRobot. With this aim in mind, I completed the following steps:

  • Updated my installation of JdeRobot to the current development version available on GitHub. This led me to find a few bugs in the installation process.
  • Installed the 'cameraview_py' tool in JdeRobot. During my exploration of the code and installation process, I fixed a few bugs in the code and added a 'CMakeLists.txt' file for an easy installation process. I have issued a pull request to integrate this updated code with JdeRobot.
  • Wrote a Python script to integrate YOLO-T with the cameraview tool and cameraserver driver. Unfortunately YOLO-T requires Python 3 while JdeRobot requires Python 2, so the code could not be used, but I did learn a lot from exploring the JdeRobot code base.
  • Explored Python 2 wrappers for YOLO which can be integrated with JdeRobot. I installed and experimented with two suitable wrappers, namely (1) a recent Python wrapper written by YOLO's author Joseph Redmon and (2) a stable Python wrapper, PyYolo.
  • Performed the final integration of YOLO (PyYolo) with JdeRobot's camera feed (cameraview tool and cameraserver driver).

Installing the Development Version of JdeRobot

The CMakeLists.txt located at JdeRobot/Deps/ros/CMakeLists.txt in the source code of JdeRobot has recently been modified to include new dependencies such as ros-kinetic-kobuki-gazebo and ros-kinetic-turtlebot-gazebo. For new users following the installation manual, the installation results in an error. After exploring, I found that the error was due to these two newly required packages not being installed. The manual has been updated to include the following command:

sudo apt install ros-kinetic-kobuki-gazebo ros-kinetic-turtlebot-gazebo

Completing Script

During my exploration and installation of the recently added cameraview_py tool in JdeRobot I resolved a few bugs. I used the tool basic_component_py as a guide to perform my code updates. The following actions were done:

  • I added a file which contains the cmake install prefix as follows:
python @CMAKE_INSTALL_PREFIX@/share/jderobot/python/cameraview_py/ $*
  • Using the CMakeLists.txt of basic_component_py as a template, I wrote a CMakeLists.txt file for cameraview_py. After this, basic_component_py was successfully installed.
  • The IP address contained in the file cameraview_py.cfg was hard coded to:
Camera.Proxy = cameraA:default -h -p 9999

This was changed to work with localhost by using:

Camera.Proxy = cameraA:default -h localhost -p 9999
  • After successfully installing basic_component_py, I started cameraserver and ran cameraview_py. I noticed that the colour space used by cameraview_py was BGR instead of RGB. To resolve this I added the following line in the file:
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

Integrating Trieu's YOLO-T with JdeRobot

My first attempt was to integrate Trieu's YOLO-T port with JdeRobot. To do so I extended the existing script by adding the following code:

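import cv2
from darkflow.net.build import TFNet

# Font used for the labels (assumed; its definition is not part of the excerpt)
font = cv2.FONT_HERSHEY_SIMPLEX

options = {"model": "cfg/yolo.cfg", "load": "bin/yolo.weights", "threshold": 0.1}
tfnet = TFNet(options)
imgcv = image
result = tfnet.return_predict(imgcv)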
for item in result:
    tlx = item['topleft']['x']
    tly = item['topleft']['y']
    brx = item['bottomright']['x']
    bry = item['bottomright']['y']
    label = item['label']
    cv2.rectangle(imgcv, (tlx, tly), (brx, bry), (0,255,0), 3)
    cv2.putText(imgcv, label, (tlx, tly), font, 2, (255,255,255), 1, cv2.LINE_AA)
cv2.imshow("Image", imgcv)

The integration proved to be a real challenge. The first challenge was linking the JdeRobot libraries and dependencies with the libraries used by YOLO-T. I resolved this issue and created a Python virtual environment where all the required packages were successfully imported. Unfortunately YOLO-T requires Python 3 while JdeRobot requires Python 2, so I had to explore Python 2 compatible wrappers for YOLO.

Exploration of Python Wrappers for Darknet

In my search for Python wrappers for YOLO, I found two suitable candidates.

Python Wrapper written by YOLO's Author

There is a recent Python wrapper written by Joseph (pjreddie) for YOLO. Since the script is recent there are a few bugs and the installation process was painful. The installation finally completed when I modified the load_net line to be:

net = load_net(b"cfg/tiny-yolo.cfg", b"tiny-yolo.weights")

The predictions though were highly inaccurate. This seems to be an open issue. For example the output predictions for the following image were:

[('sheep', 0.5437681078910828), ('sofa', 0.5343875885009766), ('pottedplant', 0.5168187022209167), ('horse', 0.5168185234069824), 
('aeroplane', 0.5089177489280701), ('cow', 0.49469587206840515), ('train', 0.4943316876888275), ('boat', 0.47638556361198425), 
('dog', 0.47564685344696045), ('chair', 0.4743657410144806)]

Notice that the confidence scores are all around 0.5, which suggests that the predictions may be essentially random.

Python Wrapper PyYolo

The next suitable candidate was PyYolo. The installation process was again broken. After reading the comments on the issue, I experimented with turning OpenCV support off. This worked and PyYolo was successfully installed. I have commented on the issue to notify other developers.

To experiment on a sample input image, I modified the file to be:

darknet_path = './darknet'
datacfg = 'cfg/'
cfgfile = 'cfg/tiny-yolo.cfg'
weightfile = '../tiny-yolo.weights'
filename = darknet_path + '/data/giraffe.jpg'
thresh = 0.25
hier_thresh = 0.5
pyyolo.init(darknet_path, datacfg, cfgfile, weightfile)

outputs = pyyolo.test(filename, thresh, hier_thresh)
for output in outputs:
    # loop body assumed: print each detection
    print(output)

This worked and I'm using PyYolo as a Python2 wrapper for YOLO in integrating YOLO with JdeRobot's camera feed.

Phase 2

In phase 2, I plan to extend YOLO on Darknet to accept RGB and depth data as input. In this phase, I aim to demonstrate the effectiveness of including depth data in YOLO for an improvement in accuracy. To achieve this we will be using our own data set containing RGB and depth maps of images. Our data set will undergo a proper formatting, shuffling and splitting (80-20 for train-test). Our data set contains only one class namely the person class. YOLO will be trained to detect persons on RGB channels only, depth channels only and finally on a combination of RGB and depth channels. An accuracy comparison will be done using a Darknet script to demonstrate the anticipated increase in accuracy.

Why was Darknet chosen as the framework?

Although the YOLO ports which I tested performed accurately, they were still slightly below the original YOLO on Darknet in terms of accuracy and especially speed. Speed is important for real-time detection, which we need since we are designing the vision model of a robot.

The steps to perform in phase 2 are as follows:

S.No. | Step | Status
1 | Parse labels to format suitable for training of YOLO | DONE
2 | Split data set into 80-20 train and test split with proper naming convention | DONE
3 | Explore training YOLO on Darknet on custom data set | DONE
4 | Train YOLO on Darknet on RGB only | DONE
5 | Train YOLO on Darknet on Depth only | DONE
6 | Accuracy script in Darknet to calculate average precision and average recall | DONE
7 | Integrate RGB and depth maps into single image file | DONE
8 | Change Darknet code to load and process 16 bit images | DONE
9 | Train YOLO on Darknet on RGBD | DONE
10 | Comparison of accuracies of RGB, Depth and RGBD networks | DONE

Week 7: July 8 to July 14

Preparation of Data Set

My mentors Francisco and Alberto have compiled a depth data set in an indoor setting. The data set contains 321 images of people in an indoor environment. The depth map is encoded as an RGB image in the following manner:

  • Channel D1 contains the gray scale intensity between 0 and 255
  • Channel D2 contains the first byte of the depth data (B1)
  • Channel D3 contains the second byte of depth data (B2)
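As an illustration of this encoding, the two depth bytes of a pixel can be recombined into a single depth value. The sketch below assumes that B1 is the most significant byte and that the value is in millimetres; neither the byte order nor the helper name `decode_depth_mm` is confirmed by the data set description.

```python
def decode_depth_mm(b1, b2):
    """Combine two 8 bit values into one 16 bit depth value.

    Assumes b1 (channel D2) is the high byte and b2 (channel D3) the low byte.
    """
    return (b1 << 8) | b2


# 10,000 mm is 0x2710, i.e. high byte 0x27 and low byte 0x10
print(decode_depth_mm(0x27, 0x10))  # 10000
```

If the sensor stored the bytes in the opposite order, the two arguments would simply be swapped.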

The depth information is essentially encoded using two bytes (B1:B2). The range of the depth information is 0 to 10,000mm. I ran a Python script to load each depth map in OpenCV and find the average minimum depth and average maximum depth, which came out to be 0mm and 6928mm respectively. A few points on the depth data set:

  • How was the data set labeled? The labeling was a manual process. The output of YOLO was used as an initial indicator of bounding boxes. Each bounding box was then verified to be accurate or manually modified if not.
  • Why is the minimum depth 0mm since the lower bound for sensor measurement is generally around 300mm? This is because the depth sensor fails to record the depth of objects made up of certain materials. Also if direct sunlight falls on the object during measurement it results in an inaccurate measurement.
  • What should be done with a 0mm measurement at pixel (x, y)? I'm currently considering the following options:
    • Keep the measurement as 0mm
    • Substitute the measurement with a random measurement value
    • Substitute the measurement with the global mean value
    • Substitute the measurement with the local mean value where the local mean is the mean of the depth measurements of pixels contained within a square of side min(width, height)/10 centered at (x, y)
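The last option can be sketched in plain Python as follows. This is only one possible reading of the idea: the depth map is assumed to be a list of rows of millimetre values, zero (invalid) pixels inside the window are assumed to be excluded from the mean, and the function name `local_mean_depth` is hypothetical.

```python
def local_mean_depth(depth, x, y):
    """Mean of the valid depth values in a square window centred at (x, y).

    depth is a list of rows (height x width) of depth values in mm.
    The window side is min(width, height) / 10. Zero readings are
    treated as invalid and skipped (an assumption).
    Returns 0 if the window contains no valid measurement.
    """
    height, width = len(depth), len(depth[0])
    half = max(1, (min(width, height) // 10) // 2)
    values = [
        depth[j][i]
        for j in range(max(0, y - half), min(height, y + half + 1))
        for i in range(max(0, x - half), min(width, x + half + 1))
        if depth[j][i] > 0
    ]
    return sum(values) / float(len(values)) if values else 0


# 10 x 10 map at a constant 1000 mm with one failed reading in the middle
depth = [[1000] * 10 for _ in range(10)]
depth[5][5] = 0
print(local_mean_depth(depth, 5, 5))  # 1000.0
```

Compared with the global mean, the local mean keeps the filled value consistent with the surface surrounding the failed pixel.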

Parsing the Label Annotations

To retrain YOLO on the RGB channels of our data set, I first had to convert the labels to a format suitable for training. The following label format is required:

Each .jpg file should have a corresponding .txt file with the same file name and in the same directory, containing the annotations for the image. For each object present in the image, a new line in the file holds the object class and the object's coordinates in the image. The following format is used:

<object-class> <x> <y> <width> <height>


  • <object-class> is an integer from 0 to n-1 (with n being the number of classes)
  • <width> and <height> are float values denoting the width and height of the bounding box relative to the image dimensions
  • <x> and <y> are the co-ordinates of the centre of the bounding box, also specified relative to the image dimensions

I wrote a Python script to parse the annotations to the required format.
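A conversion of this kind can be sketched as below. It is not the script used in this project: the input annotation is assumed to be a pixel box `(x_min, y_min, x_max, y_max)`, and the function name `to_yolo_label` is hypothetical.

```python
def to_yolo_label(class_id, box, image_width, image_height):
    """Convert a pixel box (x_min, y_min, x_max, y_max) to a Darknet label line."""
    x_min, y_min, x_max, y_max = box
    # Centre and size of the box, relative to the image dimensions
    x = (x_min + x_max) / 2.0 / image_width
    y = (y_min + y_max) / 2.0 / image_height
    width = (x_max - x_min) / float(image_width)
    height = (y_max - y_min) / float(image_height)
    return "%d %.6f %.6f %.6f %.6f" % (class_id, x, y, width, height)


# A person (class 0) covering the central half of a 640 x 480 image
print(to_yolo_label(0, (160, 120, 480, 360), 640, 480))
# 0 0.500000 0.500000 0.500000 0.500000
```

Each such line would be written to the .txt file that shares its name with the image.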

Splitting the Data Set into 80-20 Train-Test Split

The naming convention of the original data set had to be changed. Images were organized into folders named after the locations in which they were captured. I wrote a Python script which traverses the directory tree and compiles all image files into two directories, namely train and test. Before compiling, I used the random.shuffle command to randomly shuffle the data set and then split it into an 80-20 train and test split. The train and test directories each contain an rgb directory and a depth directory to store images and their corresponding depth maps respectively. The rgb directory in train contains 256 images named from "00000_rgb.png" to "00255_rgb.png" and the rgb directory in test contains 65 images named from "00256_rgb.png" to "00320_rgb.png". The depth directory in train contains 256 images named from "00000_depth.png" to "00255_depth.png" and the depth directory in test contains 65 images named from "00256_depth.png" to "00320_depth.png".
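The shuffle, split and renaming can be sketched as follows. This is an illustrative outline rather than the script used here: it only computes the new names and does not copy files, and the function name `split_and_name` and the fixed random seed are assumptions.

```python
import random


def split_and_name(image_ids, train_fraction=0.8, seed=0):
    """Shuffle image ids and assign each a subset and new file names.

    Returns a dict: original id -> (subset, rgb name, depth name).
    """
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)  # seeded for repeatability (assumption)
    n_train = int(len(ids) * train_fraction)
    mapping = {}
    for new_index, original in enumerate(ids):
        subset = "train" if new_index < n_train else "test"
        mapping[original] = (
            subset,
            "%05d_rgb.png" % new_index,
            "%05d_depth.png" % new_index,
        )
    return mapping


mapping = split_and_name(range(321))
n_train = sum(1 for subset, _, _ in mapping.values() if subset == "train")
print(n_train, len(mapping) - n_train)  # 256 65
```

With 321 images, truncating 0.8 × 321 = 256.8 gives the 256 training and 65 test images described above.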

Week 8: July 15 to July 21

Training YOLO on Custom Data Set

Our data set contains a single person class, which reduces our problem from multi-class object recognition to pure person detection. To train Darknet to detect custom objects I performed the following steps:

  • Created a custom yolo configuration file named model_name.cfg which is exactly the same as yolo-voc.2.0.cfg except for the following changes:
    • changed number of classes to 1 since we have a single person class
    • changed the number of filters in the last convolutional layer to 30, following the equation filters = (classes + coords + 1) * num, where classes=1, coords=4 and num=5
  • Created a file model.names containing the names of the classes. In our case we have a single line with 'person'
  • Created a file containing the information required to train the model, including paths to the train and test data along with the number of classes. The file looks like this:
classes= 1
train  = data/train_paths.txt
valid  = data/test_paths.txt
names = data/model.names
backup = backup
  • All images of training and testing set were stored in a single directory data/dir1 along with the corresponding annotation files for each image. Each .png file had a corresponding .txt annotation file in the appropriate format.
  • Wrote a Python script to generate two files, train_paths.txt and test_paths.txt. These files contain the paths of the image files to be included in the respective set (train and test respectively). train_paths.txt is composed of 256 lines, each containing the path to a training image. test_paths.txt is composed of 65 lines, each containing the path to a testing image.
  • To speed up the training process, I'm using pre-trained weights [darknet19_448.conv.23] for initialisation which can be downloaded from here.
  • Finally we can start training with the following command:
./darknet detector train data/ cfg/model_name.cfg darknet19_448.conv.23
  • Early stopping is performed by observing the average loss value. Once the average loss stops decreasing, the training process is stopped.
  • Figure: sample screenshot from the training process.
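The filter-count equation from the configuration step above can be verified with a short Python sketch. The helper name `last_layer_filters` is hypothetical; the default values of coords and num are the ones quoted in the list.

```python
def last_layer_filters(classes, coords=4, num=5):
    """Number of filters in the last convolutional layer.

    Each of the num anchor boxes predicts coords box values,
    one objectness score and one score per class.
    """
    return (classes + coords + 1) * num


print(last_layer_filters(1))   # 30, the single person class
print(last_layer_filters(20))  # 125, the 20 PASCAL VOC classes
```

This makes explicit why only the classes and filters entries need to change when moving from the 20-class VOC configuration to a single class.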

Training on RGB Channels Only

From the section on training Yolo on a custom data set the following steps were performed:

  • custom configuration file model_name.cfg is yolo_custom_rgb.cfg
  • class names file model.names is custom_rgb.names
  • data file contains:
classes= 1
train  = data/train_custom_rgb.txt
valid  = data/test_custom_rgb.txt
names = data/custom_rgb.names
backup = backup/rgb
  • stored training and testing images along with annotations in directory data/custom/rgb/
  • created two files for train and test image paths called train_custom_rgb.txt and test_custom_rgb.txt respectively
  • Start training with the following command:
./darknet detector train data/ cfg/yolo_custom_rgb.cfg darknet19_448.conv.23

Week 9: July 22 to July 28

Training on Depth Only

From the section on training Yolo on a custom data set the following steps were performed:

  • custom configuration file model_name.cfg is yolo_custom_depth.cfg
  • class names file model.names is custom_depth.names
  • data file contains:
classes= 1
train  = data/train_custom_depth.txt
valid  = data/test_custom_depth.txt
names = data/custom_depth.names
backup = backup/depth
  • stored training and testing images along with annotations in directory data/custom/rgb/
  • created two files for train and test image paths called train_custom_depth.txt and test_custom_depth.txt respectively
  • Start training with the following command:
./darknet detector train data/ cfg/yolo_custom_depth.cfg darknet19_448.conv.23

Week 10: July 29 to August 4

Loading 16 bit PNG file in Darknet

In order to load 16 bit PNG images I changed the image.c file of Darknet in several places to ensure the conversion of the data matrix from the current unsigned char (8 bits) to unsigned short (16 bits). The normalization factor of 255 has to be changed to the new normalization factor of 65535. The main function which was modified was load_image_stb which looks like:

image load_image_stb(char *filename, int channels)
{
    int w, h, c;
    //changed here: read 16 bit samples instead of 8 bit samples
    unsigned short *data = stbi_load_16(filename, &w, &h, &c, channels);
    if (!data) {
        fprintf(stderr, "Cannot load image \"%s\"\nSTB Reason: %s\n", filename, stbi_failure_reason());
        exit(0);
    }
    if(channels) c = channels;
    int i,j,k;
    image im = make_image(w, h, c);
    //largest raw sample seen, printed below as a sanity check
    int maxx = data[0];
    for(k = 0; k < c; ++k){
        for(j = 0; j < h; ++j){
            for(i = 0; i < w; ++i){
                //destination is planar (channel by channel), source is interleaved (pixel by pixel)
                int dst_index = i + w*j + w*h*k;
                int src_index = k + c*i + c*w*j;
                //changed here X 2: track the maximum and normalize by 65535 instead of 255
                if (data[src_index] > maxx)
                    maxx = data[src_index];
                im.data[dst_index] = (float)data[src_index]/65535.;
            }
        }
    }
    fprintf(stderr, "Max %d\n", maxx);
    free(data);
    return im;
}
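
The index arithmetic in the loop above can be checked in isolation. The following is a minimal Python sketch, not part of Darknet: the function name interleaved_to_planar and the sample pixel values are hypothetical and chosen only to illustrate the conversion from interleaved 16 bit samples to planar floats in the range 0 to 1.

```python
def interleaved_to_planar(data, w, h, c, max_value=65535):
    """Convert interleaved samples (pixel by pixel) to planar floats (channel by channel)."""
    out = [0.0] * (w * h * c)
    for k in range(c):          # channel
        for j in range(h):      # row
            for i in range(w):  # column
                dst_index = i + w * j + w * h * k  # planar layout
                src_index = k + c * i + c * w * j  # interleaved layout
                out[dst_index] = data[src_index] / max_value
    return out


# A 2x1 image with 3 channels: pixel 0 = (0, 32768, 65535), pixel 1 = (65535, 0, 0)
pixels = [0, 32768, 65535, 65535, 0, 0]
planar = interleaved_to_planar(pixels, w=2, h=1, c=3)
print(planar)
```

The first two entries of the result hold channel 0 of both pixels, the next two hold channel 1, and the last two hold channel 2.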

Week 11: August 5 to August 11

Combining RGB and Depth Data

Currently we have the RGB input and the depth map as two separate PNG files. Both are 8 bit, 3 channel files. I combine them into a single 16 bit, 3 channel PNG file.

  • Data 1: R, G, B channels of 8 bits each
  • Data 2: D1, D2, D3 channels of 8 bits each
  • Required: C1, C2, C3 channels of 16 bits each

Naive Idea:

One naive approach is to use bitwise shifting as follows:
C1 = R << 8 | D1
C2 = G << 8 | D2
C3 = B << 8 | D3
(shift by 8 bits left and then OR)

Drawback of bitwise shifting: The major drawback of bitwise shifting is that the R, G and B channels which get shifted by 8 bits to the left occupy all the significant bit positions and therefore dominate over the depth channels. The final value is decided by the R, G and B channels with the depth channels adding a maximum of 255 to the existing value.
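
A short Python sketch makes this dominance concrete. The function name combine_shift and the sample values are hypothetical; the packing itself is the shift-and-OR described above.

```python
def combine_shift(colour_value, depth_value):
    """Pack two 8-bit values into 16 bits: colour in the high byte, depth in the low byte."""
    return (colour_value << 8) | depth_value


low = combine_shift(200, 0)      # depth at its minimum
high = combine_shift(200, 255)   # depth at its maximum
step = combine_shift(201, 0)     # colour increased by one, depth at its minimum

# Share of the full 16-bit range that the depth channel can influence
depth_share = (high - low) / 65535
print(low, high, step, depth_share)
```

Sweeping depth across its whole range moves the combined value by only 255, while a single step in the colour channel already moves it further.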

One Solution - Bit Sequence Interleaving - Morton Number:
We observe that we need to combine two independent 8 bit channels into a single 16 bit channel without bias towards either of them. To overcome the dominance issue in the naive approach, one solution is to use bit sequence interleaving (similar to computing the Morton number). This means the following: if channel R is b1, b2, ..., b8 and channel D1 is bb1, bb2, ..., bb8, then the combined 16 bit channel will have b1, bb1, b2, bb2, ..., b8, bb8 as its bit sequence. This ensures that the bits are placed pseudo-independently and that neither 8 bit sequence dominates the other.
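
The interleaving can be sketched in a few lines of Python. This is an illustration rather than the script used in the project: the function names are hypothetical, and it assumes that b1 and bb1 are the most significant bits, so the first channel takes the higher bit of each pair.

```python
def interleave_bits(a, b):
    """Interleave two 8-bit values into one 16-bit value.

    Bit i of `a` goes to position 2*i + 1 and bit i of `b` to position 2*i.
    """
    result = 0
    for bit in range(8):
        result |= ((a >> bit) & 1) << (2 * bit + 1)
        result |= ((b >> bit) & 1) << (2 * bit)
    return result


def deinterleave_bits(value):
    """Recover the two 8-bit values from an interleaved 16-bit value."""
    a = b = 0
    for bit in range(8):
        a |= ((value >> (2 * bit + 1)) & 1) << bit
        b |= ((value >> (2 * bit)) & 1) << bit
    return a, b


print(hex(interleave_bits(0xFF, 0x00)))  # only the first channel set
print(hex(interleave_bits(0x00, 0xFF)))  # only the second channel set
print(deinterleave_bits(interleave_bits(200, 37)))
```

Because the mapping is invertible, no information from either channel is lost, and the most significant bits of both channels end up next to each other at the top of the 16 bit value.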

To compare the two combination approaches above, I wrote two Python scripts which combine two 8 bit, 3 channel files into a single 16 bit, 3 channel file. One script combines using bitwise shifting and the other combines using bit interleaving.

A sample 16 bit PNG file combined using bit interleaving is displayed together with its two inputs:

[Images: Only RGB | Only Depth | RGB + Depth (16 bit interleaving)]

Week 12: August 12 to August 18

Creating Depth Colour Maps

One observation about the depth map is that it contains abrupt jumps in gradient, which should ideally be smoothed out. To do so, I linearly scaled the depth map from the 0-10,000 mm range to the 0-255 gray scale intensity range.

The function to linearly scale range [A, B] to range [C, D] is as follows:

f(x) = C*( 1 - (x-A)/(B-A) ) + D*( (x-A)/(B-A) )

In our case we have to scale the depth measurements of 0-10,000 mm linearly to a gray scale range of 0-255. Therefore the mapping is A = 0, B = 10000, C = 0, D = 255, which gives:

f(x) = 255*( x/10000 )
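
As a quick check of the formula, here is a minimal Python sketch. The function names are hypothetical, and clamping out-of-range readings to 0-10,000 mm before scaling is an assumption made for the example rather than something stated above.

```python
def scale_linear(x, a, b, c, d):
    """Map x linearly from the range [a, b] to the range [c, d]."""
    t = (x - a) / (b - a)
    return c * (1 - t) + d * t


def depth_to_gray(depth_mm, max_depth_mm=10000):
    """Scale a depth reading in millimetres to an 8-bit gray level.

    Readings outside 0..max_depth_mm are clamped first (an assumption).
    """
    clamped = min(max(depth_mm, 0), max_depth_mm)
    return int(round(scale_linear(clamped, 0, max_depth_mm, 0, 255)))


for depth in (0, 2000, 10000, 12000):
    print(depth, depth_to_gray(depth))
```

With A = 0 and C = 0 the general formula reduces to 255*( x/10000 ), so a reading of 2,000 mm maps to gray level 51.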

This single channel image was then converted to a jet colour map using OpenCV's applyColorMap function, which transforms it into a 3 channel colour image with 0 represented by blue and 255 by red. The linear scaling into a colour map helps smooth the edges of the depth map, providing appropriate gradient information to the neural network. These depth colour maps were then combined with the RGB images using the bit interleaving technique described previously.

[Images: RGB Only | Depth Only | Depth to Linear Gray Scale | Depth Colour Map | RGB + Depth Colour Map (16 bit interleaving)]

Week 13: August 19 to August 25

Writing an Accuracy Script in Darknet

To calculate the precision and recall of a trained model on the test set, I changed the validate_detector_recall function in Darknet to keep track of the number of region proposals (predicted boxes), the number of truth boxes and the number of correct predictions. Precision is calculated as correct predictions / number of region proposals and recall is calculated as correct predictions / number of truth boxes.

void validate_detector_recall(char *datacfg, char *cfgfile, char *weightfile)
{
    network net = parse_network_cfg(cfgfile);
    if(weightfile){
        load_weights(&net, weightfile);
    }
    set_batch_network(&net, 1);
    fprintf(stderr, "Learning Rate: %g, Momentum: %g, Decay: %g\n", net.learning_rate, net.momentum, net.decay);

    //changed here X 3: read the list of validation images from the data file
    list *options = read_data_cfg(datacfg);
    char *valid_images = option_find_str(options, "valid", "data/test_custom_depth.txt");
    list *plist = get_paths(valid_images);
    char **paths = (char **)list_to_array(plist);

    layer l = net.layers[net.n-1];
    int classes = l.classes;

    int j, k;
    box *boxes = calloc(l.w*l.h*l.n, sizeof(box));
    float **probs = calloc(l.w*l.h*l.n, sizeof(float *));
    for(j = 0; j < l.w*l.h*l.n; ++j) probs[j] = calloc(classes+1, sizeof(float));

    int m = plist->size;
    int i=0;

    float thresh = .1;
    float iou_thresh = .5;
    float nms = .4;

    int total = 0;      //number of truth boxes
    int correct = 0;    //truth boxes matched by a proposal with IOU above iou_thresh
    int proposals = 0;  //number of predicted boxes above thresh
    float avg_iou = 0;
    //changed here
    int proposals_current = 0;
    for(i = 0; i < m; ++i){
        //changed here
        proposals_current = 0;
        char *path = paths[i];
        image orig = load_image_color(path, 0, 0);
        image sized = resize_image(orig, net.w, net.h);
        char *id = basecfg(path);
        network_predict(net, sized.data);
        get_region_boxes(l, sized.w, sized.h, net.w, net.h, thresh, probs, boxes, 0, 1, 0, .5, 1);
        if (nms) do_nms(boxes, probs, l.w*l.h*l.n, 1, nms);

        char labelpath[4096];
        find_replace(path, "images", "labels", labelpath);
        find_replace(labelpath, "JPEGImages", "labels", labelpath);
        //changed here
        find_replace(labelpath, ".png", ".txt", labelpath);
        find_replace(labelpath, ".JPEG", ".txt", labelpath);

        int num_labels = 0;
        box_label *truth = read_boxes(labelpath, &num_labels);
        for(k = 0; k < l.w*l.h*l.n; ++k){
            if(probs[k][0] > thresh){
                //changed here (the per-image increment is assumed; the original line is missing)
                ++proposals;
                ++proposals_current;
            }
        }
        for (j = 0; j < num_labels; ++j) {
            ++total;
            box t = {truth[j].x, truth[j].y, truth[j].w, truth[j].h};
            float best_iou = 0;
            for(k = 0; k < l.w*l.h*l.n; ++k){
                float iou = box_iou(boxes[k], t);
                if(probs[k][0] > thresh && iou > best_iou){
                    best_iou = iou;
                }
            }
            avg_iou += best_iou;
            if(best_iou > iou_thresh){
                ++correct;
            }
        }
        //changed here: also report precision
        fprintf(stderr, "%5d %5d %5d\tRPs/Img: %.2f\tIOU: %.2f%%\tRecall: %.2f%%\tPrecision: %.2f%%\n", i, correct, total, (float)proposals/(i+1), avg_iou*100/total, 100.*correct/total, 100.*correct/proposals);
        free(id);
        free_image(orig);
        free_image(sized);
    }
}
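
The bookkeeping behind these two metrics can be sketched independently of Darknet. The Python below is an illustration only: the function names, the (centre x, centre y, width, height) box tuples and the sample boxes are assumptions made for the example, and the predicted boxes are taken to be already filtered by the detection threshold.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (centre_x, centre_y, width, height)."""
    def overlap(c1, s1, c2, s2):
        left = max(c1 - s1 / 2, c2 - s2 / 2)
        right = min(c1 + s1 / 2, c2 + s2 / 2)
        return max(0.0, right - left)

    inter = overlap(a[0], a[2], b[0], b[2]) * overlap(a[1], a[3], b[1], b[3])
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def precision_recall(images, iou_thresh=0.5):
    """images: list of (predicted_boxes, truth_boxes) pairs, one pair per image."""
    proposals = total = correct = 0
    for predicted, truth in images:
        proposals += len(predicted)
        for t in truth:
            total += 1
            best_iou = max((box_iou(p, t) for p in predicted), default=0.0)
            if best_iou > iou_thresh:
                correct += 1
    precision = correct / proposals if proposals else 0.0
    recall = correct / total if total else 0.0
    return precision, recall


images = [
    # one correct proposal and one spurious proposal for a single truth box
    ([(0.5, 0.5, 0.2, 0.2), (0.1, 0.1, 0.1, 0.1)], [(0.5, 0.5, 0.2, 0.2)]),
    # a truth box that the detector missed entirely
    ([], [(0.3, 0.3, 0.2, 0.2)]),
]
precision, recall = precision_recall(images)
print(precision, recall)
```

Here one of two proposals is correct and one of two truth boxes is found, so both precision and recall come out at one half.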

After successfully compiling the modified YOLO on Darknet, I changed the configuration files to train on the custom data set of 16 bit PNG images. YOLOD_1 is the network trained on 16 bit PNG images formed using bitwise shifting. YOLOD_2 is the network trained on 16 bit PNG images formed using bit interleaving. YOLOD_3 is the network trained on depth colour maps combined with RGB images using bit interleaving. For these accuracy computations, the threshold for region proposal generation was kept at 0.10. The accuracy comparison is shown in the table below:


S.No.  Model                                     Average Precision  Average Recall
1      RGB Only                                  80.25%             96.30%
2      Depth Only                                92.91%             87.41%
3      YOLOD_1 (Bitwise Shifting)                87.97%             86.67%
4      YOLOD_2 (Interleaving)                    86.67%             96.30%
5      YOLOD_3 (Depth Color Map + Interleaving)  97.76%             97.04%

Future Work

  • We plan to write a research paper on the improvement in accuracy using our RGB + depth combination as demonstrated on our data set. The data set will be increased from the current 321 images to a few thousand images.
  • We plan to integrate the YOLO-D model into the JdeRobot code base for RGB + depth based object detection in robots.