Thursday, October 6, 2016

Raspberry Pi Speed Sign Detector: Classifiers and Neural Networks

This post will detail a progress update on a speed limit sign detection project. It presents the results of an LBP classifier and neural networks for the purpose of detecting US speed limit signs. Readers are encouraged to read part 1 before continuing with this post.

All code for this project can be found here.

Last update, I left off by saying I'd let a Raspberry Pi mine through hand-selected videos of dash cam footage to build a training set. After a few weeks and over 50 hours of video, 3118 images of speed signs and 9298 negative images were autonomously mined. In fact, automation would prove to be a very valuable tool.

The flow of the sign detection algorithm as proposed in the last post can be summarized by the following chart:

Figure 1:
Speed sign detection algorithm flow chart
In this post the classifier and detection stages will be introduced (stage 2 and stage 3, respectively, of Figure 1). Two methods for the detection stage will be examined: the first uses optical character recognition and is referred to as OCR detection. The latter method uses a convolutional neural network and is referred to as CNN detection.

Cascade Classifier Training and Results:

With the results of the first mining session, an LBP cascade classifier was trained for use in stage 2. OpenCV has a built-in cascade trainer via the opencv_traincascade command. The classifier used in this update is a 15 stage classifier trained with 1000 positive and negative images. Below are some examples of the results:





The classifier is effective at finding signs. In an evaluation set of 60 positive images it identified 100% of speed limit signs. However, it's also very prone to false positives:






The classifier wasn't intended to be right 100% of the time, though. Its purpose is to filter the results of the FRST, vetting only the best candidates to send to the detection stage. Expensive calls to the detection algorithm are reduced by reserving the detection stage for only the most promising candidates.
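For reference, running a trained cascade looks roughly like the sketch below, which uses OpenCV's Python bindings (the project itself calls the equivalent C++ API; the file name and detection parameters here are assumptions, not the project's actual values):

    import cv2

    # Load the trained LBP cascade produced by opencv_traincascade
    # (the file name is an assumption).
    cascade = cv2.CascadeClassifier("lbp_speed_sign.xml")

    frame = cv2.imread("frame.png")
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # scaleFactor and minNeighbors trade detection rate against false positives.
    candidates = cascade.detectMultiScale(gray, scaleFactor=1.1,
                                          minNeighbors=3, minSize=(18, 24))
    for (x, y, w, h) in candidates:
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)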

In the first post, OCR was proposed for detection. Testing found that while OCR detection is very resilient to false positives, it isn't very accurate. The following table summarizes OCR detection based on results from an, albeit small, evaluation set consisting of 60 images containing a sign (positive images) and 40 images with no sign present (negative images):




                  Total Input   Identified as signs   Mislabeled   Performance
Positive Images   60            28 (46%)              0            28/60 (46%)
Negative Images   40            0 (0%)                -            0/40 (0%)


Note that OCR detection isn't completely immune to false positives or mislabeling. For example, exit signs from on-ramps were observed to be detected by the OCR.
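As a rough illustration of OCR detection, the sketch below uses pytesseract as a stand-in OCR engine; the project's actual OCR engine and acceptance rules may differ:

    import pytesseract  # stand-in OCR engine; an assumption, not the project's
    from PIL import Image

    def ocr_detect(candidate_path):
        # candidate_path: a cropped candidate produced by the classifier stage
        text = pytesseract.image_to_string(Image.open(candidate_path)).upper()
        if "SPEED" in text and "LIMIT" in text:
            # Accept the first plausible two-digit speed value
            digits = [t for t in text.split() if t.isdigit() and len(t) == 2]
            if digits:
                return int(digits[0])
        return None  # rejected: not read as a speed limit sign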

OCR detection's low accuracy led to consideration of other detection techniques. One technique that has become ubiquitous in image recognition is the neural network (NN). In fact, there are a number of papers examining the use of NNs for traffic sign detection [1][2][3].

The main drawback to using NNs is that they require a very large number of images to train properly. Video mining proved invaluable for this task, as it allowed a large data set to be built from scratch with little labor. For the purpose of mining I chose to omit stage 3 of Figure 1. The reasoning was twofold: firstly, it allows significantly more images to be found (over 6 times as many); secondly, it provides false positives to refine the classifier. OpenTLD is also no longer used, because the classifier is accurate enough to track signs. The only drawback is that the results need to be sorted manually.

The Raspberry Pi again mined through a list of YouTube videos of dash cam footage taken from various roads. 50 hours of footage was mined in about a week, netting over 16 thousand images of signs ranging from 25 to 75 mph. Over 25 thousand negative images were mined as well. This brought the total sign database to 19237 positive images and 33488 negative images. The interested reader can find the database here. Positive images are grayscale, containing only the speed limit sign, at resolutions mostly around 20x25. Users will have to resize all images to a standard size before use in training.
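Such a resizing pass can be a few lines of OpenCV; the 28 x 28 target below is an assumption, matching the CNN input size mentioned later:

    import cv2
    import glob

    # Resize every mined positive to one fixed training resolution.
    for path in glob.glob("positives/*.png"):
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        cv2.imwrite(path, cv2.resize(img, (28, 28)))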

Training the Neural Network:

Neural networking is supported by a wealth of software packages, mostly in Python. While there are C++ libraries for NNs, I found Python libraries to be by far the easiest to install, train and run. Because of this, Python is used for the detection stage by embedding it in C++, the project's language, via the Python.h API.
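To give an idea of the arrangement, below is a hypothetical Python-side module the C++ host could import and call through the CPython API (e.g. PyImport_ImportModule and PyObject_CallObject); the function names and data layout are illustrative only, not the project's actual interface:

    # detector.py -- hypothetical Python entry points for the C++ host
    import numpy as np

    model = None

    def init(weights_path):
        # Load the trained network once, at startup
        global model
        from keras.models import load_model
        model = load_model(weights_path)

    def classify(pixels, w, h):
        # pixels arrives from C++ as a flat list of grayscale values;
        # reshape and scale before prediction (channel ordering is an assumption)
        x = np.asarray(pixels, dtype="float32").reshape(1, h, w, 1) / 255.0
        return int(model.predict(x).argmax())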

NoLearn and Keras were the NN packages I used. First I trained a deep belief network using NoLearn. In this NN architecture, image pixels are lexicographically ordered into a single vector, which is used as input to the NN. The most successful architecture used a single 800-node fully connected hidden layer and an 11-node output layer (one node for each speed: 25, 30, 35, …, 75). 90% accuracy on an evaluation set would be reached after 20 or so epochs. However, 90% accuracy is not what one should expect in real performance, mainly because the evaluation set consisted of images taken at random from the training set. The database contains the same sign over multiple frames (as the sign approaches the camera it appears larger, hence is scaled up), so most evaluation images could simply be different scales of signs used in training. The 90% accuracy may indicate overfitting (memorizing the training images) rather than true accuracy. I ran another mining session to gather about 1000 fresh evaluation images; on that set the score was significantly lower, around 50%.
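The NoLearn side looked roughly like the sketch below, assuming X and y hold the flattened training images and integer labels; the layer sizes come from the description above, while the remaining hyperparameters are assumptions:

    from nolearn.dbn import DBN

    # X: flattened grayscale images (one row per image), y: integer labels
    # 0..10 for the eleven speeds 25, 30, ..., 75.
    net = DBN(
        [X.shape[1], 800, 11],  # input, one 800-node hidden layer, 11 outputs
        learn_rates=0.3,        # assumption; not stated in the post
        epochs=20,
        verbose=1)
    net.fit(X, y)
    predictions = net.predict(X_eval)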

One issue was that the NN would succeed on signs in the center of the screen, but fail on signs to the left, right, above or below center. One theory was that some fundamental features of a sign were lost when training images were converted into a 1D vector. Convolutional neural networks (CNN) are a variant of regular NNs: their input is a 2D image, and they have been noted to perform better on objects that are translated or rotated.

Keras was used for building and training the CNN. It again scored over 90% in identification after about 30 epochs; however, the CNN also scored better on the fresh evaluation images, reaching 72% accuracy. The following table shows the CNN's performance on the evaluation set used to construct the first table above (a sketch of the model definition follows the table):




                  Total Input   Identified as signs   Mislabeled   Performance
Positive Images   60            54 (90%)              9            45/60 (75%)
Negative Images   40            12 (30%)              -            12/40 (30%)
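The Keras model was assembled along the lines of the sketch below; the 28 x 28 input and 11-way softmax output match the text, but the intermediate layer stack and hyperparameters are assumptions:

    from keras.models import Sequential
    from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

    model = Sequential([
        Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
        MaxPooling2D((2, 2)),
        Conv2D(64, (3, 3), activation="relu"),
        MaxPooling2D((2, 2)),
        Flatten(),
        Dense(128, activation="relu"),
        Dropout(0.5),                     # assumption: guard against overfitting
        Dense(11, activation="softmax"),  # one output per speed: 25, 30, ..., 75
    ])
    model.compile(loss="categorical_crossentropy", optimizer="adam",
                  metrics=["accuracy"])
    # X_train: (n, 28, 28, 1) images; y_train: one-hot labels
    model.fit(X_train, y_train, epochs=30, validation_data=(X_eval, y_eval))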


The results from this evaluation set show that CNN detection is more accurate than OCR detection. However, it is also much more prone to false positives and mislabeling. The next section evaluates the frame rate performance of OCR detection versus CNN detection.

Results:

For this performance test, the first 10000 frames of the following video were used: IH35 North through Oklahoma City

The video is originally 1280 x 720 but is resized to 640 x 480. Next, only the right-hand side of the road is scanned, leaving a final resolution of 320 x 330. Frame rates of the classifier, FRST, OCR and the CNN were recorded; a sketch of this preprocessing is below.
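As a sketch of that preprocessing (the exact crop offsets are assumptions chosen to yield the 320 x 330 region):

    import cv2

    cap = cv2.VideoCapture("ih35_north.mp4")  # hypothetical local copy
    ok, frame = cap.read()
    frame = cv2.resize(frame, (640, 480))
    # Keep only the right-hand side of the road; offsets are assumptions
    # that produce the 320 x 330 scan region described above.
    roi = frame[75:405, 320:640]  # 330 rows x 320 columns

With that preprocessing in place, the first results show the impact the classifier has on frame rate for OCR detection: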

Figure 2:
Frame rate using OCR detection without classifier
Figure 3:
Frame rate using OCR detection with the classifier

The most notable difference is that the standard deviation in Figure 3 is smaller than that of Figure 2. This means the algorithm doesn't lag as much on things like billboards, exit signs or other objects with text or features in them.

Next is the performance of the sign detection algorithm using CNN detection for stage 3.

Figure 4:
Frame rate using CNN detection
Figure 4 shows that the frame rate does not experience any significant dips due to feature-heavy objects. Note the single dip around frame 100; it is caused by Keras initializing the CNN.

Figure 5:
Figure 3 and Figure 4 overlaid for comparison

Figure 5 is a direct comparison of stage 3 using an OCR versus a CNN. It appears to show that OCR detection gives a better frame rate than CNN detection. Filtering out the high-frequency noise illustrates this better:

Figure 6:
Figure 5 with high-frequency noise filtered out
One may conclude that, on average, OCR detection runs faster than CNN detection. Such a conclusion may be slightly misleading. Consider the frame rate of the detection stage alone (stage 3 only) with OCR detection and with CNN detection:

Figure 7:
CNN detection fps vs OCR detection fps
Figure 7 shows the speed at which the detection stage runs with OCR and CNN detection. OCR detection is often much faster than CNN detection; however, its speed will suddenly dip, causing noticeable lag. The reason is that OCR detection first looks for blocks of man-made text. There must be a minimum of 2 blocks present in an image, each having an area of 30% of the image; if they are not present, the image is rejected before the actual OCR is ever run. So images of the sky, pavement or grass are quickly rejected, since no man-made text blocks are present in nature. When there are enough blocks in the image to pass the preliminary check, the frame rate can tank: for example, man-made signs that don't contain "Speed" or "Limit" will still get read, then rejected. CNN detection performs the same task regardless of what features are present in its input. This results in it running slower overall, but far more consistently.
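A sketch of that kind of preliminary check is below, using thresholded contours as a stand-in for the actual text block detector, which is likely different:

    import cv2

    def passes_precheck(gray):
        # Treat large bright connected regions as candidate "text blocks";
        # a stand-in for the real block detector.
        _, thresh = cv2.threshold(gray, 0, 255,
                                  cv2.THRESH_BINARY + cv2.THRESH_OTSU)
        contours, _ = cv2.findContours(thresh, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        min_area = 0.3 * gray.shape[0] * gray.shape[1]
        blocks = [c for c in contours if cv2.contourArea(c) >= min_area]
        # Only if at least two sufficiently large blocks exist is the OCR run
        return len(blocks) >= 2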

The following videos are demonstrations of the algorithm in action. Two video styles were chosen for demonstration. The first is footage of the 10000 frames used to generate the above figures, shown without edits. It is intended to demonstrate overall performance: detection, false positives and mislabeling. The second video shows highlights of signs passing by in unfavorable conditions. Note that no changes were made to the algorithm from the recording of one video to the next. Red text and a red circle indicate OCR detection's output; green text and a green rectangle indicate CNN detection's output.





Moving forward:

The CNN is slower overall than the OCR, but holds a more consistent frame rate. Further, it should be apparent from the video demos that CNN detection is far more accurate. However, its main drawback is that it is currently very prone to false positives.

A few improvements are worth consideration. The first is creating a speed limit sign tracker using a Kalman filter, with the classifier as measurement input. Signs vetted by the CNN would be tracked for an interval of time, and each frame a vote would be cast for the detected sign speed. The speed that both receives the most votes and passes a minimum vote threshold would be chosen as the current speed. Ideally this will cut down on random false positives and some false identifications. A drawback is that signs which are only briefly on screen would no longer be detected.
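The voting part of that scheme might look like the sketch below; the vote threshold is an assumption, and the Kalman filter tracking that would feed it is omitted:

    from collections import Counter

    class SpeedVoter:
        def __init__(self, min_votes=5):  # threshold is an assumption
            self.votes = Counter()
            self.min_votes = min_votes

        def add(self, detected_speed):
            # Called once per frame for the sign currently being tracked
            self.votes[detected_speed] += 1

        def current_speed(self):
            # Report a speed only once one candidate clears the threshold
            if not self.votes:
                return None
            speed, count = self.votes.most_common(1)[0]
            return speed if count >= self.min_votes else None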

Another improvement would be to simply feed the CNN fewer false positives. This can be accomplished by making the classifier more accurate, as its output is fed directly into the CNN: the fewer false positives the classifier turns out, the fewer chances the CNN has to output a false positive itself. The classifier could be trained in stages, where the false positives of one stage are used as negatives for training the next stage. With automated data mining this shouldn't be difficult to implement.

A “not sign” output could also be added to the CNN. I was skeptical of doing this, as “sign or not sign” encompasses the entire universe, making the “not sign” class quite large. However, a principle similar to the one proposed for improving the classifier, where false positives are fed back as negatives, may be usable for the CNN as well. Also, the literature using CNNs for traffic sign detection boasts 95-98% accuracy, so the CNN has plenty of room to be further refined through better training.

Finally, there is still the issue of speed. The average frame rate of the CNN stage is 1000 fps and the classifier reaches 1500 fps, yet the FRST manages only 46 fps:
Figure 8:
Frame rate of the FRST
The FRST will of course be the slowest, as it must process a 320 x 330 pixel image, whereas the classifier typically processes an 18 x 24 image and the CNN processes a 28 x 28 image. And even though 46 fps is real-time speed, it was measured with the algorithm running on a laptop with 6 GB of RAM and an Intel i7 CPU. When run on the Raspberry Pi 3, the frame rate on the same video is around 5 fps: significantly slower. Speeding up the FRST would therefore yield the biggest fps boost on the Raspberry Pi.

Before I propose a speed increase, let me first briefly explain how the FRST works.
The FRST converts an image to its gradient magnitude and then zeros all pixels whose magnitudes fall below a set threshold. It then searches for the center of a rectangle of a given size by voting around each remaining pixel. Votes are cast along lines in a "+" shape moving outward from each pixel: up, down, left and right to a set distance, issuing a positive or negative vote at each point. Votes are finally tallied in a “voting matrix,” in which the location receiving the highest vote count is the most likely rectangle center.
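A simplified sketch of the voting step is below; the published FRST also weights votes by gradient direction and sign, which is omitted here for brevity:

    import cv2
    import numpy as np

    def frst_votes(gray, half_w, half_h, grad_thresh=40):
        # Gradient magnitude, with weak pixels zeroed out by the threshold
        gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)
        gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
        mag = cv2.magnitude(gx, gy)

        votes = np.zeros_like(mag)
        h, w = mag.shape
        ys, xs = np.nonzero(mag > grad_thresh)
        for y, x in zip(ys, xs):
            # Vote in a "+" shape: an edge pixel of a (2*half_w x 2*half_h)
            # rectangle lies half_w left/right or half_h above/below its center
            for cy, cx in ((y, x - half_w), (y, x + half_w),
                           (y - half_h, x), (y + half_h, x)):
                if 0 <= cy < h and 0 <= cx < w:
                    votes[cy, cx] += 1
        return votes  # the argmax is the most likely rectangle center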

Each pixel is scanned via a nested for loop; however, the votes cast for one pixel do not affect the voting result of later pixels. It might therefore be possible to speed the FRST up using threading or parallel processing. The image could be divided into N x M subsections, each carrying its own voting matrix the size of the original image. The regular voting routine would be run on each subsection, and the voting matrices would simply be added together to create the final overall voting result. This method would produce the same voting matrix as a normal FRST, so no accuracy loss is incurred. A sketch of the idea follows.
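Here is a sketch of that idea using Python's multiprocessing module and horizontal strips; the strip count and parameters are assumptions, and the project itself would do this in C++ with threads:

    import cv2
    import numpy as np
    from multiprocessing import Pool

    HALF_W, HALF_H, GRAD_THRESH = 9, 12, 40  # illustrative values only

    def vote_strip(args):
        # Vote for edge pixels whose rows fall in [y0, y1), but into a
        # full-size matrix, since votes can land outside the strip itself.
        mag, y0, y1 = args
        votes = np.zeros_like(mag)
        h, w = mag.shape
        ys, xs = np.nonzero(mag[y0:y1] > GRAD_THRESH)
        for y, x in zip(ys + y0, xs):
            for cy, cx in ((y, x - HALF_W), (y, x + HALF_W),
                           (y - HALF_H, x), (y + HALF_H, x)):
                if 0 <= cy < h and 0 <= cx < w:
                    votes[cy, cx] += 1
        return votes

    def parallel_frst(gray, n_workers=4):
        gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)
        gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
        mag = cv2.magnitude(gx, gy)
        h = mag.shape[0]
        strips = [(mag, i * h // n_workers, (i + 1) * h // n_workers)
                  for i in range(n_workers)]
        with Pool(n_workers) as pool:
            # Votes are independent and additive, so summing the per-strip
            # matrices reproduces the serial result exactly.
            return sum(pool.map(vote_strip, strips))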

I'll update when there is more. Until then, thanks for reading,
- Joseph

References:

[1] Ishak, Khairul Anuar, et al. "A speed limit sign recognition system using artificial neural network." 2006 4th Student Conference on Research and Development. IEEE, 2006.

[2] Kundu, Subrata Kumar, and Patrick Mackens. "Speed Limit Sign Recognition Using MSER and Artificial Neural Networks." 2015 IEEE 18th International Conference on Intelligent Transportation Systems. IEEE, 2015.

[3] Sermanet, Pierre, and Yann LeCun. "Traffic sign recognition with multi-scale convolutional networks." The 2011 International Joint Conference on Neural Networks (IJCNN). IEEE, 2011.
