My dream has been to build a real "ASURADA", the car-embedded robot from the Japanese animation "新世紀GPX サイバーフォーミュラ (Future GPX Cyber Formula)", by myself. So I have wanted to implement a friendly AI product that people can rely on and feel friendship with, not only in the car but also in the house!
I call my project V! ("V!" looks like "Ai" turned upside down).
My goal is to make this hand-held device able to see, hear, and speak like a human assistant.
While driving, it will warn you about dangerous situations such as a car suddenly cutting in, the car in front braking, or a traffic light changing to red, yellow, or green. You can control the radio volume/channel with hand gestures, since it reads your hand motion. It will also constantly monitor your face to keep you from drowsy driving.
And it will listen to what you are saying and reply like a funny friend, or sometimes give you reliable local information.
In the house, you can take it into your living room; a good spot is in front of the TV.
There, it will monitor your posture and offer a health-care game, or play a warning sound if you are sleeping in a bad posture.
It will also tell you to step back when you are too close to the TV (nice for babies or children who like to watch TV from very close up).
To achieve this goal, I need to analyze video (object and keypoint detection) and audio (what people are saying).
-------------------------------------------------------PROJECT DETAIL------------------------------------------------------------
For those tasks, I studied many object, face, hand, and body keypoint detection methods, including YOLOv2, SSD, R-FCN, DAN, OpenPose, OpenFace, and so on, and I chose YOLOv2 as the base (face/hand/car/traffic-light) detector. I did not only use existing image sets like WIDER FACE or the iBUG DB; I also collected about 8,000 face/hand/traffic images myself from my car's black box (dash cam) and by recording myself. I used an open labelling tool to annotate all the images manually.
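For reference, here is a minimal sketch of the kind of conversion step I mean, turning Pascal-VOC-style XML from a labelling tool into darknet/YOLO text labels. The paths and class list are just placeholders, not my exact setup:

```python
# Minimal sketch: convert Pascal-VOC-style XML annotations (as exported by many
# open labelling tools) into the YOLO text format darknet expects.
# Paths and the class list are hypothetical placeholders.
import glob
import os
import xml.etree.ElementTree as ET

CLASSES = ["face", "hand", "car", "traffic_light"]  # assumed class list

def voc_to_yolo(xml_path, out_dir):
    root = ET.parse(xml_path).getroot()
    w = float(root.find("size/width").text)
    h = float(root.find("size/height").text)
    lines = []
    for obj in root.findall("object"):
        cls = obj.find("name").text
        if cls not in CLASSES:
            continue
        box = obj.find("bndbox")
        xmin, ymin = float(box.find("xmin").text), float(box.find("ymin").text)
        xmax, ymax = float(box.find("xmax").text), float(box.find("ymax").text)
        # YOLO format: class x_center y_center width height, all normalized to [0, 1]
        cx, cy = (xmin + xmax) / 2 / w, (ymin + ymax) / 2 / h
        bw, bh = (xmax - xmin) / w, (ymax - ymin) / h
        lines.append(f"{CLASSES.index(cls)} {cx:.6f} {cy:.6f} {bw:.6f} {bh:.6f}")
    name = os.path.splitext(os.path.basename(xml_path))[0]
    with open(os.path.join(out_dir, name + ".txt"), "w") as f:
        f.write("\n".join(lines))

for xml_file in glob.glob("annotations/*.xml"):
    voc_to_yolo(xml_file, "labels")
```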
For data augmentation, I used the open-source imgaug Python tool to expand this to roughly a million images.
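A rough sketch of how such an imgaug pipeline looks; the augmenters and parameters here are only illustrative, not the exact recipe that produced the million images, and the image/box values are dummies:

```python
# Illustrative imgaug pipeline that transforms an image and its bounding boxes together.
import imgaug as ia
import imgaug.augmenters as iaa
import imageio

seq = iaa.Sequential([
    iaa.Fliplr(0.5),                                 # mirror half of the images
    iaa.Affine(rotate=(-10, 10), scale=(0.8, 1.2)),  # small rotation and scaling
    iaa.AdditiveGaussianNoise(scale=(0, 0.05 * 255)),
    iaa.Multiply((0.7, 1.3)),                        # brightness jitter
])

image = imageio.imread("images/sample_0001.jpg")     # hypothetical file
bbs = ia.BoundingBoxesOnImage(
    [ia.BoundingBox(x1=120, y1=80, x2=260, y2=220, label="face")],
    shape=image.shape,
)

# Use a deterministic copy so the image and its boxes get the same transform.
seq_det = seq.to_deterministic()
image_aug = seq_det.augment_image(image)
bbs_aug = seq_det.augment_bounding_boxes([bbs])[0].remove_out_of_image().clip_out_of_image()
```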
For speech recognition, I used DeepSpeech (https://github.com/mozilla/DeepSpeech) and an open chatbot (https://github.com/AastaNV/ChatBot) provided by NVIDIA.
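For reference, a minimal DeepSpeech inference sketch. Note that the Python API changed across releases, so this follows the newer (0.9.x) signature rather than whatever version the demo ran; the model/scorer/WAV file names are only examples:

```python
# Minimal DeepSpeech (0.9.x-style API) transcription of a 16 kHz mono 16-bit WAV file.
import wave
import numpy as np
from deepspeech import Model

ds = Model("deepspeech-0.9.3-models.pbmm")          # acoustic model (example file name)
ds.enableExternalScorer("deepspeech-0.9.3-models.scorer")

with wave.open("command.wav", "rb") as wf:           # hypothetical recorded utterance
    audio = np.frombuffer(wf.readframes(wf.getnframes()), dtype=np.int16)

print(ds.stt(audio))                                  # transcribed text to feed the chatbot
```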
For DCNN optimization, I tried several methods/tools such as TensorFlow's transform_graph with 'quantize_weights', TensorRT, Caffe-Jacinto, and Caffe-Ristretto. I faced tons of issues and unfortunately got bad results: the optimized (quantized or pruned) models lost too much accuracy.
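The transform_graph attempt looked roughly like this from Python (TF 1.x only; the input/output node names and file names below are placeholders, not my actual graph):

```python
# Sketch of weight quantization via TensorFlow's graph transforms (TF 1.x).
import tensorflow as tf
from tensorflow.tools.graph_transforms import TransformGraph

with tf.gfile.GFile("yolo_v2_frozen.pb", "rb") as f:      # hypothetical frozen graph
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())

transformed = TransformGraph(
    graph_def,
    inputs=["input"],                 # assumed input node name
    outputs=["output"],               # assumed output node name
    transforms=["quantize_weights", "strip_unused_nodes"],
)

with tf.gfile.GFile("yolo_v2_quantized.pb", "wb") as f:
    f.write(transformed.SerializeToString())
```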
I also tried depthwise-separable convolution layers in YOLOv2's feature-extraction stage; that reduced the full YOLOv2 model from 200 MB to 130 MB, but the classification/bbox-regression accuracy was not good enough. And the YOLO network turned out to be too sensitive to quantization.
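To make the trade-off concrete, here is a quick parameter-count comparison between a standard 3x3 convolution and a depthwise-separable one (Keras layers used only for illustration; the real experiment modified the darknet cfg, and the channel counts are just an example):

```python
# Parameter-count comparison: standard 3x3 conv vs. depthwise-separable conv,
# for a 256-channel input mapped to 512 output channels.
import tensorflow as tf

inputs = tf.keras.Input(shape=(416, 416, 256))

standard = tf.keras.Model(inputs, tf.keras.layers.Conv2D(512, 3, padding="same")(inputs))
separable = tf.keras.Model(inputs, tf.keras.layers.SeparableConv2D(512, 3, padding="same")(inputs))

print("standard 3x3 conv params: ", standard.count_params())   # ~1.18M
print("separable 3x3 conv params:", separable.count_params())  # ~0.13M
```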
So I just trained the full YOLOv2 from the Darknet-19 pretrained model with my custom image set, and then trained Tiny YOLOv2 with the help of the full YOLOv2 model, since it failed to converge to a low loss when I trained it from scratch with only Darknet-19.
For hand keypoint detection, I used an open hand-gesture project (https://github.com/lmb-freiburg/hand3d) and replaced its existing hand detection stage (HandSegNet) with the hand crop from YOLOv2 for better speed.
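Roughly, the replacement works like this: take the YOLOv2 hand box, crop it with some margin, and resize the crop for the keypoint network. The box format, margin, and 256x256 crop size below are assumptions, and detect_hands()/hand_keypoint_net() are hypothetical wrappers, not the project's actual API:

```python
# Sketch of swapping HandSegNet for a YOLOv2 hand box: square crop with margin, then resize.
import cv2
import numpy as np

def crop_hand(frame, box, margin=0.25, crop_size=256):
    """box = (x, y, w, h) in pixels from the YOLOv2 hand detector."""
    x, y, w, h = box
    cx, cy = x + w / 2.0, y + h / 2.0
    side = max(w, h) * (1.0 + margin)            # square crop with a little margin
    x1 = int(max(cx - side / 2, 0))
    y1 = int(max(cy - side / 2, 0))
    x2 = int(min(cx + side / 2, frame.shape[1]))
    y2 = int(min(cy + side / 2, frame.shape[0]))
    return cv2.resize(frame[y1:y2, x1:x2], (crop_size, crop_size))

# Usage idea: for each frame, run the detector, then the keypoint net on the crop.
# for box in detect_hands(frame):                 # hypothetical YOLOv2 wrapper
#     keypoints = hand_keypoint_net(crop_hand(frame, box))
```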
For face landmark detection, I used OpenFace, since I had some minor installation issues when installing the DAN open source on the TX2. I check how long the eyes stay open/closed and the mouth open/close pattern to decide whether you are drowsy.
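One common way to turn facial landmarks into an eye open/closed signal is the eye aspect ratio (EAR); the indices and thresholds below are illustrative assumptions, not the tuned values in the demo:

```python
# EAR-based drowsiness check over a history of 68-point landmark frames.
import numpy as np

LEFT_EYE = list(range(36, 42))    # 68-point landmark convention
RIGHT_EYE = list(range(42, 48))

def eye_aspect_ratio(pts):
    """pts: 6x2 array of one eye's landmarks; a small EAR means the eye is closed."""
    vertical = np.linalg.norm(pts[1] - pts[5]) + np.linalg.norm(pts[2] - pts[4])
    horizontal = np.linalg.norm(pts[0] - pts[3])
    return vertical / (2.0 * horizontal)

def is_drowsy(landmark_history, ear_thresh=0.2, closed_frames=15):
    """Flag drowsiness if the eyes stay below the EAR threshold for N consecutive frames."""
    closed = 0
    for lm in landmark_history:                       # lm: 68x2 array per frame
        ear = (eye_aspect_ratio(lm[LEFT_EYE]) + eye_aspect_ratio(lm[RIGHT_EYE])) / 2.0
        closed = closed + 1 if ear < ear_thresh else 0
        if closed >= closed_frames:
            return True
    return False
```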
Currently, I use several open projects (all under GPL licenses except OpenPose/OpenFace, which only allow academic or non-profit use) to get the demo working before the submission deadline, but I will replace all of them with my own integrated DCNN/GAN networks.
I will update my blog soon to explain what I have done in detail.