My dream has been to build a real "ASURADA", the car-embedded robot from the Japanese animation "新世紀GPX サイバーフォーミュラ (Future GPX Cyber Formula)", by myself. So I have wanted to implement a friendly AI product that people can rely on and feel friendship with, not only in the car but also in the house!
I call my project V! ("V!" looks like "Ai" turned upside down).
My goal is to make this hand-held device able to see, hear, and speak like a human assistant.
While driving, it will warn you about dangerous situations such as a car suddenly cutting in, the car in front braking, or a traffic light changing to red, yellow, or green. You can control the radio volume/channel with hand gestures, since it reads your hand motion. It will also constantly monitor your face to keep you from drowsy driving.
And it will listen to what you are saying and reply like a funny friend, or sometimes give you reliable local information.
In the house, you can take it into your living room; a good spot is in front of the TV.
There, it will monitor your posture and offer a health-care game, or play a warning sound if you are sleeping in a bad posture.
It will also tell you to step back when you are too close to the TV (nice for babies or children who like to watch TV from very close up).
To achieve this goal, I need to analyze video (object and keypoint detection) and audio (what people are saying).
-------------------------------------------------------PROJECT DETAIL------------------------------------------------------------
For those tasks, I studied many object, face, hand, and body keypoint detection methods, including YOLOv2, SSD, R-FCN, DAN, OpenPose, OpenFace, and so on, and I chose YOLOv2 as the base (face/hand/car/traffic-light) detector. I did not only use existing image sets like WIDER FACE or the iBUG DB; I also collected about 8,000 face/hand/traffic images myself from my car's black box (dash cam) and by recording myself. I used an open labelling tool to annotate all the images manually.
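For reference, here is a minimal sketch of the kind of conversion step I mean, turning Pascal-VOC-style XML from a labelling tool into darknet/YOLO text labels. The paths and class list are just placeholders, not my exact setup:

```python
# Minimal sketch: convert Pascal-VOC-style XML annotations (as exported by many
# open labelling tools) into the YOLO text format darknet expects.
# Paths and the class list are hypothetical placeholders.
import glob
import os
import xml.etree.ElementTree as ET

CLASSES = ["face", "hand", "car", "traffic_light"]  # assumed class list

def voc_to_yolo(xml_path, out_dir):
    root = ET.parse(xml_path).getroot()
    w = float(root.find("size/width").text)
    h = float(root.find("size/height").text)
    lines = []
    for obj in root.findall("object"):
        cls = obj.find("name").text
        if cls not in CLASSES:
            continue
        box = obj.find("bndbox")
        xmin, ymin = float(box.find("xmin").text), float(box.find("ymin").text)
        xmax, ymax = float(box.find("xmax").text), float(box.find("ymax").text)
        # YOLO format: class x_center y_center width height, all normalized to [0, 1]
        cx, cy = (xmin + xmax) / 2 / w, (ymin + ymax) / 2 / h
        bw, bh = (xmax - xmin) / w, (ymax - ymin) / h
        lines.append(f"{CLASSES.index(cls)} {cx:.6f} {cy:.6f} {bw:.6f} {bh:.6f}")
    name = os.path.splitext(os.path.basename(xml_path))[0]
    with open(os.path.join(out_dir, name + ".txt"), "w") as f:
        f.write("\n".join(lines))

for xml_file in glob.glob("annotations/*.xml"):
    voc_to_yolo(xml_file, "labels")
```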
For data augmentation, I used the open-source imgaug Python tool to expand this to roughly a million images.
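A rough sketch of how such an imgaug pipeline looks; the augmenters and parameters here are only illustrative, not the exact recipe that produced the million images, and the image/box values are dummies:

```python
# Illustrative imgaug pipeline that transforms an image and its bounding boxes together.
import imgaug as ia
import imgaug.augmenters as iaa
import imageio

seq = iaa.Sequential([
    iaa.Fliplr(0.5),                                 # mirror half of the images
    iaa.Affine(rotate=(-10, 10), scale=(0.8, 1.2)),  # small rotation and scaling
    iaa.AdditiveGaussianNoise(scale=(0, 0.05 * 255)),
    iaa.Multiply((0.7, 1.3)),                        # brightness jitter
])

image = imageio.imread("images/sample_0001.jpg")     # hypothetical file
bbs = ia.BoundingBoxesOnImage(
    [ia.BoundingBox(x1=120, y1=80, x2=260, y2=220, label="face")],
    shape=image.shape,
)

# Use a deterministic copy so the image and its boxes get the same transform.
seq_det = seq.to_deterministic()
image_aug = seq_det.augment_image(image)
bbs_aug = seq_det.augment_bounding_boxes([bbs])[0].remove_out_of_image().clip_out_of_image()
```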
For speech recognition, I used DeepSpeech (https://github.com/mozilla/DeepSpeech) and an open chatbot (https://github.com/AastaNV/ChatBot) provided by NVIDIA.
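For reference, a minimal DeepSpeech inference sketch. Note that the Python API changed across releases, so this follows the newer (0.9.x) signature rather than whatever version the demo ran; the model/scorer/WAV file names are only examples:

```python
# Minimal DeepSpeech (0.9.x-style API) transcription of a 16 kHz mono 16-bit WAV file.
import wave
import numpy as np
from deepspeech import Model

ds = Model("deepspeech-0.9.3-models.pbmm")          # acoustic model (example file name)
ds.enableExternalScorer("deepspeech-0.9.3-models.scorer")

with wave.open("command.wav", "rb") as wf:           # hypothetical recorded utterance
    audio = np.frombuffer(wf.readframes(wf.getnframes()), dtype=np.int16)

print(ds.stt(audio))                                  # transcribed text to feed the chatbot
```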
For DCNN optimization, I tried several methods/tools such as TensorFlow's transform_graph with 'quantize_weights', TensorRT, Caffe-Jacinto, and Caffe-Ristretto. I faced tons of issues and unfortunately got bad results: the optimized (quantized or pruned) models lost too much accuracy.
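The transform_graph attempt looked roughly like this from Python (TF 1.x only; the input/output node names and file names below are placeholders, not my actual graph):

```python
# Sketch of weight quantization via TensorFlow's graph transforms (TF 1.x).
import tensorflow as tf
from tensorflow.tools.graph_transforms import TransformGraph

with tf.gfile.GFile("yolo_v2_frozen.pb", "rb") as f:      # hypothetical frozen graph
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())

transformed = TransformGraph(
    graph_def,
    inputs=["input"],                 # assumed input node name
    outputs=["output"],               # assumed output node name
    transforms=["quantize_weights", "strip_unused_nodes"],
)

with tf.gfile.GFile("yolo_v2_quantized.pb", "wb") as f:
    f.write(transformed.SerializeToString())
```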
I also tried depthwise-separable convolution layers in YOLOv2's feature-extraction stage; that reduced the full YOLOv2 model from 200 MB to 130 MB, but the classification/bbox-regression accuracy was not good enough. And the YOLO network turned out to be too sensitive to quantization.
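To make the trade-off concrete, here is a quick parameter-count comparison between a standard 3x3 convolution and a depthwise-separable one (Keras layers used only for illustration; the real experiment modified the darknet cfg, and the channel counts are just an example):

```python
# Parameter-count comparison: standard 3x3 conv vs. depthwise-separable conv,
# for a 256-channel input mapped to 512 output channels.
import tensorflow as tf

inputs = tf.keras.Input(shape=(416, 416, 256))

standard = tf.keras.Model(inputs, tf.keras.layers.Conv2D(512, 3, padding="same")(inputs))
separable = tf.keras.Model(inputs, tf.keras.layers.SeparableConv2D(512, 3, padding="same")(inputs))

print("standard 3x3 conv params: ", standard.count_params())   # ~1.18M
print("separable 3x3 conv params:", separable.count_params())  # ~0.13M
```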
So I just trained the full YOLOv2 from the Darknet-19 pretrained model with my custom image set, and then trained Tiny YOLOv2 with the help of the full YOLOv2 model, since it failed to converge to a low loss when I trained it from scratch with only Darknet-19.
For hand keypoint detection, I used an open hand-gesture project (https://github.com/lmb-freiburg/hand3d) and replaced its existing hand detection stage (HandSegNet) with the hand crop from YOLOv2 for better speed.
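Roughly, the replacement works like this: take the YOLOv2 hand box, crop it with some margin, and resize the crop for the keypoint network. The box format, margin, and 256x256 crop size below are assumptions, and detect_hands()/hand_keypoint_net() are hypothetical wrappers, not the project's actual API:

```python
# Sketch of swapping HandSegNet for a YOLOv2 hand box: square crop with margin, then resize.
import cv2
import numpy as np

def crop_hand(frame, box, margin=0.25, crop_size=256):
    """box = (x, y, w, h) in pixels from the YOLOv2 hand detector."""
    x, y, w, h = box
    cx, cy = x + w / 2.0, y + h / 2.0
    side = max(w, h) * (1.0 + margin)            # square crop with a little margin
    x1 = int(max(cx - side / 2, 0))
    y1 = int(max(cy - side / 2, 0))
    x2 = int(min(cx + side / 2, frame.shape[1]))
    y2 = int(min(cy + side / 2, frame.shape[0]))
    return cv2.resize(frame[y1:y2, x1:x2], (crop_size, crop_size))

# Usage idea: for each frame, run the detector, then the keypoint net on the crop.
# for box in detect_hands(frame):                 # hypothetical YOLOv2 wrapper
#     keypoints = hand_keypoint_net(crop_hand(frame, box))
```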
For face landmark detection, I used OpenFace, since I had some minor installation issues when installing the DAN open source on the TX2. I check how long the eyes stay open/closed and the mouth open/close pattern to decide whether you are drowsy.
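One common way to turn facial landmarks into an eye open/closed signal is the eye aspect ratio (EAR); the indices and thresholds below are illustrative assumptions, not the tuned values in the demo:

```python
# EAR-based drowsiness check over a history of 68-point landmark frames.
import numpy as np

LEFT_EYE = list(range(36, 42))    # 68-point landmark convention
RIGHT_EYE = list(range(42, 48))

def eye_aspect_ratio(pts):
    """pts: 6x2 array of one eye's landmarks; a small EAR means the eye is closed."""
    vertical = np.linalg.norm(pts[1] - pts[5]) + np.linalg.norm(pts[2] - pts[4])
    horizontal = np.linalg.norm(pts[0] - pts[3])
    return vertical / (2.0 * horizontal)

def is_drowsy(landmark_history, ear_thresh=0.2, closed_frames=15):
    """Flag drowsiness if the eyes stay below the EAR threshold for N consecutive frames."""
    closed = 0
    for lm in landmark_history:                       # lm: 68x2 array per frame
        ear = (eye_aspect_ratio(lm[LEFT_EYE]) + eye_aspect_ratio(lm[RIGHT_EYE])) / 2.0
        closed = closed + 1 if ear < ear_thresh else 0
        if closed >= closed_frames:
            return True
    return False
```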
Currently, I use several open projects (all under GPL licenses except OpenPose/OpenFace, which only allow academic or non-profit use) to get the demo working before the submission deadline, but I will replace all of them with my own integrated DCNN/GAN networks.
I will update my blog soon to explain what I have done in detail.