To handle image processing, our system uses the Nvidia Jetson TX2, which delivers real-time AI performance. Each module has 256 CUDA cores, all of which are used for deep learning, computer vision, and GPU computing, making it ideal for our needs. The Jetson runs Ubuntu Linux with the dedicated JetPack 3.2 SDK. Our software uses OpenCV 3.4 with GPU-accelerated vision processing, which performs very well with the new CUDA 9.0 Toolkit. We use cuDNN (v7.0.5) for even faster image classification. Classification is done with Keras, a high-level neural network API. Keras runs on top of the TensorFlow numerical computation library, which is accelerated further by Nvidia TensorRT, a high-performance deep learning inference optimizer and runtime. In short, we use TensorRT to deliver real-time processing and streaming to the user.
The whole system uses three types of neural-network algorithms. The microphone array uses AI to classify target noise. The camera image is processed and artificial intelligence classifies the objects in it; depending on the objects surrounding a target, one of several tracking algorithms is used. We use the KCF algorithm for general-purpose tracking, the TLD tracker under occlusion (when tracked objects are covered by other, crossing objects), and in some cases the GOTURN tracker. All AI processing in the vision sensor (camera) is performed on the NVIDIA Jetson TX2. This lets us leverage the supercomputer-class capabilities of the Jetson TX2 at relatively low cost and power consumption, both important factors in our commercial application.
As mentioned before, when any sensor triggers a detection, a signal is generated and sent to the camera with the approximate position of the object. We say object because it may not be a drone; to be certain about a detection, it is best to have more than one sensor confirming it. The coordinates generated by the radar are used to control the camera: knowing the approximate position of the object, we point the camera at it for an optimal view. For camera control we use the ONVIF protocol, which is supported by most cameras. Our system uses the AXIS Q8685-E PTZ camera, which provides thirty-fold zoom and a panoramic image and allows quick yet smooth control.
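The pointing step can be sketched as a small coordinate conversion: a radar-reported position is turned into pan/tilt angles for the PTZ unit. This is an illustrative helper under assumed conventions (camera-centred frame, x east, y north, z up); the real system would send the resulting angles to the AXIS camera over ONVIF, which is not shown here.

```python
import math

def radar_to_pan_tilt(x, y, z):
    """Convert a radar-reported position (metres, camera-centred frame:
    x east, y north, z up) into pan/tilt angles in degrees.

    Hypothetical helper for illustration -- in the deployed system the
    angles would be sent to the PTZ camera via an ONVIF move request.
    """
    pan = math.degrees(math.atan2(x, y))        # azimuth, measured from north
    ground_range = math.hypot(x, y)             # horizontal distance to target
    tilt = math.degrees(math.atan2(z, ground_range))  # elevation above horizon
    return pan, tilt

# A target 100 m east, 100 m north, 100 m up sits at pan 45 deg, tilt ~35.3 deg.
pan, tilt = radar_to_pan_tilt(100.0, 100.0, 100.0)
```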
The image is obtained with the GStreamer framework. One of the benefits of using GStreamer is its architecture: it allows any number of processing steps between acquiring the video and using it in our application. It is designed as a pipeline framework, which means that we build our processing line from elements that pass the image from one to the next as a stream.
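A pipeline of this kind is described as a string of elements joined by `!`. The sketch below assembles one such description; the element choice (RTSP source, H.264 depay/parse, the TX2 hardware decoder `omxh264dec`) and the camera URI are assumptions, not the deployed configuration. The string would typically be handed to `cv2.VideoCapture` with the `cv2.CAP_GSTREAMER` backend.

```python
def build_pipeline(uri, latency_ms=200):
    """Assemble a GStreamer pipeline description for an RTSP camera stream.

    Element choice is an illustrative assumption: H.264 over RTSP,
    decoded in hardware on the Jetson TX2 (omxh264dec), converted to
    BGR and delivered to the application through an appsink.
    """
    return (
        f"rtspsrc location={uri} latency={latency_ms} ! "
        "rtph264depay ! h264parse ! omxh264dec ! "
        "videoconvert ! video/x-raw,format=BGR ! appsink"
    )

# Typical use (requires OpenCV built with GStreamer support):
#   cap = cv2.VideoCapture(build_pipeline("rtsp://<camera-ip>/stream"),
#                          cv2.CAP_GSTREAMER)
```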
After the image is obtained, it is processed by OpenCV algorithms. The first step is to apply MOG background subtraction to the image; we then find the contours of all objects and filter them so that only contours of acceptable size remain. If an object is too small, there is very little chance it will be classified as a drone, so it is excluded from the search. This may reject some drones that are very far away and appear only as dots in the image, but such small objects have a very small chance of being classified anyway, and processing many of them requires a lot of computing power. Rejecting the smallest objects is therefore an optimization that minimizes the number of candidates with little chance of being classified.
This solution gives us information about all objects in the image at all times. To reduce the load, we do not always use the whole image for classification, but only one frame every few seconds. Rather than processing the entire frame, we find only the moving objects. We do this by finding the differences between two frames, using background subtraction. The resulting image regions are then sent on for processing.
The detection algorithm works in two states: detecting and tracking. After a confirmed detection, the algorithm switches to the tracking state, in which it has one or more targets to track. Depending on the number of moving objects and their paths, we use different tracking algorithms. If moving targets intersect, we switch to the TLD tracker, which quickly recovers after a track is lost. The second tracker is GOTURN, which uses regression networks. It is used whenever more than two targets are located very close together and their areas intersect and/or one covers another. A great advantage of this tracker is that it lets us reduce the number of tracked objects by merging image areas into one track; this also helps join the many detections triggered by cloud movement so that they can later be rejected. For general-purpose tracking we use KCF, which is faster and more reliable.
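The selection policy above can be written as a small dispatch function. The sketch below returns only the name of the chosen algorithm so the policy itself can be exercised; in the real pipeline each name would map to the corresponding OpenCV contrib factory (e.g. `cv2.TrackerKCF_create`). The exact state inputs are an assumed simplification of the scene analysis.

```python
def choose_tracker(n_targets, paths_intersect, overlapping):
    """Pick a tracking algorithm for the current scene state.

    Simplified sketch of the selection policy: inputs are the number of
    moving targets, whether their paths intersect, and whether their
    areas overlap (one target covering another).
    """
    if n_targets > 2 and overlapping:
        return "GOTURN"   # regression-network tracker; can merge close tracks
    if paths_intersect:
        return "TLD"      # recovers quickly after a track is lost
    return "KCF"          # fast, reliable general-purpose default

# A lone target gets KCF; crossing targets get TLD; a dense cluster GOTURN.
choices = [choose_tracker(1, False, False),
           choose_tracker(2, True, False),
           choose_tracker(4, True, True)]
```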
For classification we use a Keras neural network. The model consists of densely connected layers with the ReLU activation function, categorical cross-entropy as the loss function, and the SGD optimizer. The model is trained on the background of the site. Since the background differs from site to site, for each site we use a dedicated application to gather images, followed by an automated training process. In each training run we use as many drone images as possible to keep the model well balanced.
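A minimal sketch of such a model follows, using the Keras API shipped with TensorFlow. The layer sizes, input patch shape, learning rate, and two-class output (drone vs. background) are illustrative assumptions, not the deployed configuration; only the building blocks named above (Dense layers, ReLU, categorical cross-entropy, SGD) come from the text.

```python
from tensorflow import keras

NUM_CLASSES = 2  # e.g. drone vs. background (assumed)

# Densely connected classifier over small image patches (shape assumed).
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(32, 32, 3)),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])

# Loss and optimizer as described: categorical cross-entropy with SGD.
model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.01),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```

Training would then call `model.fit` on the per-site dataset produced by the automated gathering process.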