Sunday, July 24, 2016

Raspberry Pi Speed Sign Detector: Overview

In my introduction post I mentioned that this blog will be focused on the process of designing a UAV. The majority of content on here will stick with that, but I thought it might be good to occasionally share some other projects I'm interested in. So in this post I'll introduce a speed sign detection algorithm using OpenCV on the Raspberry Pi 3. It will cover the current state of things and future plans. I'll upload the source code to a repository after I iron out some bugs.





One of my goals was to avoid creating any training sets for a computer to learn what a speed sign is. That means I tried to break down what a speed sign is into general traits and exploit those traits. The traits I chose to attack are shape, text and layout.


  • Shape - Always rectangular with a standard aspect ratio, the height being larger than the width. This trait is also invariant to color.
  • Text - A sign always contains "SPEED", "LIMIT" and a number divisible by 5 indicating the speed.
  • Layout - The number contained in the sign is always larger than the other text and sits at the bottom. In the US, a speed sign is always on the right-hand side of the road, so I can preemptively cut the search region in half (a one-line crop, sketched just after this list).

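That last layout trait is the cheapest one to exploit. A minimal sketch of the half-frame crop (the file name is just a placeholder):

```python
import cv2

frame = cv2.imread("dashcam_frame.png")       # placeholder input frame
# US speed signs sit to the right of the car, so drop the left half before any other processing.
right_half = frame[:, frame.shape[1] // 2:]
```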
The main algorithm searches for and exploits these features in four stages: Recognition, Classification, Detection and Tracking. Recognition scans an image for a rectangle with a size and aspect ratio appropriate to a sign. Once found, this candidate rectangle moves to the classification stage, where it is scanned for man-made blocks of text. The found blocks of text are cropped out and sent to an Optical Character Recognition (OCR) engine to be turned into strings. If those strings match those of a sign, the candidate rectangle is classified as a sign. Finally, the speed itself is detected using facts about the sign's layout. Afterwards the sign is cropped from the image to be used as an object for openTLD to track in subsequent images.
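To make the hand-off between stages concrete, here is a rough per-frame sketch of the control flow. The callables and the `state` dictionary are hypothetical stand-ins, not the actual program's API:

```python
def process_frame(frame, state, recognize, classify, detect, crop_sign, tracker):
    """One pass of the pipeline over a single camera frame.

    The recognize/classify/detect/crop_sign/tracker arguments stand in for the
    stage implementations described in this post."""
    if state.get("tracking"):
        # Tracking: follow the previously found sign until it leaves the frame.
        if not tracker.update(frame):
            state["tracking"] = False          # sign is gone; resume stages 1-3 next frame
        return state

    candidate = recognize(frame)               # Recognition: FRST proposes one rectangle
    text_boxes = classify(candidate)           # Classification: box in likely man-made text
    speed = detect(candidate, text_boxes)      # Detection: OCR + SPEED/LIMIT/number checks
    if speed is not None:
        tracker.start(crop_sign(frame, candidate, text_boxes))
        state["tracking"] = True
        state["speed_limit"] = speed
    return state
```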

The main challenge is that the OCR is very sensitive to noise surrounding legible text, so we need a good image of ONLY the sign's text to feed it. However, speed signs sit in an unstructured environment, so each step leading up to the OCR seeks to filter out that environment and provide a clean image of text to the OCR.

Now I'll go into more detail by tracing images through the steps. First, let's examine this image:

Raw Image


Recognition:
In this phase the image is scanned for rectangles characteristic of a speed sign. This is the first step in zeroing in on the sign's text: it significantly reduces the area to scan for man-made text, cutting the search size from 640x480 down to roughly 50x60 pixels. This will be important later for reducing calls to the expensive OCR. Note that this process will always return a candidate region, even if no sign is present.

A Fast Radial Symmetry Transform (FRST) is used here to find rectangles of similar size and aspect ratio to that of a speed sign. The FRST used in the code was adapted from:
“Traffic sign detection using computer vision” by Andreas Mogelmose.

The author gives his source code here (full download option).

I also found the following to be a good resource for learning about the FRST:

The only alterations I made to Andreas' implementation were to prep it for real-time use, which meant removing some redundant or unnecessary calls and optimizing where I could.
The FRST gives the center and size of the best-voted rectangle within the image, which allows the recognized region to be cropped out for further analysis. At this point, though, it is still unknown whether this region is an actual sign.
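Given that center and size, the crop itself is simple; something along these lines (with bounds clamping added so the crop never leaves the frame):

```python
def crop_candidate(image, center, size):
    """Crop the FRST's best-voted rectangle out of the frame.

    center and size come from the FRST; coordinates are clamped to the image."""
    cx, cy = center
    w, h = size
    x0 = max(int(cx - w / 2), 0)
    y0 = max(int(cy - h / 2), 0)
    x1 = min(int(cx + w / 2), image.shape[1])
    y1 = min(int(cy + h / 2), image.shape[0])
    return image[y0:y1, x0:x1]
```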

FRST result. Area inside green rectangle is suspected sign.

Classification:
This phase classifies regions as man-made text to be used later by the OCR. It again reduces the area of the image that the OCR needs to scan by boxing in likely man-made text.
The first step of this phase reduces noise in the image through a process known as the morphological gradient:

Cropped FRST region

Morphological gradient

The resulting white pixels are then binarized and expanded horizontally to connect neighboring pixels with one another:

Binarized pixels expanded horizontally
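In OpenCV terms, this preprocessing chain looks roughly like the following. The kernel sizes are illustrative guesses, not the values the program actually uses:

```python
import cv2

def preprocess_for_components(crop_gray):
    # Morphological gradient: keeps edges (character outlines) and suppresses flat regions.
    grad_kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))
    gradient = cv2.morphologyEx(crop_gray, cv2.MORPH_GRADIENT, grad_kernel)

    # Binarize the gradient image; Otsu picks the threshold automatically.
    _, binary = cv2.threshold(gradient, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)

    # Expand horizontally with a wide, short kernel so letters in the same word connect.
    connect_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (9, 1))
    return cv2.dilate(binary, connect_kernel)
```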

The image is then broken down into components. To illustrate what I mean by components, imagine coloring a collection of shapes with a pencil. Each region that can be colored in without lifting the pencil from the page is considered a component. Below is an example containing 5 components, each represented by a different color:

component region example
In this application we want to "color in" the text of "SPEED", "LIMIT" and the number so that each is considered its own component. Ideally we want something like this (ignoring the border as a component):

Idealized component regions
One might wonder how each character got its own color, since a pencil would have to be lifted to color them all in. This is why the pixels in the binary image were expanded horizontally: it bridges the blank spaces between characters, since we're after the whole word. The number of pixels to expand horizontally is determined experimentally.

Now if we drew minimal bounding boxes that capture each pixel of every color we'd get something like:

minimal bounding boxes enclosing components
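OpenCV can label these components and hand back their bounding boxes directly; a sketch of that step (again illustrative, not the program's exact code):

```python
import cv2

def component_boxes(connected_binary):
    # Label each connected blob of white pixels; stats holds x, y, width, height, area per label.
    num_labels, labels, stats, centroids = cv2.connectedComponentsWithStats(connected_binary)
    boxes = []
    for label in range(1, num_labels):   # label 0 is the background
        x, y, w, h, area = stats[label]
        boxes.append((int(x), int(y), int(w), int(h), int(area)))
    return boxes
```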

In the real world, images of signs have a bit of noise that shows up when this is done. The program's output, for instance, has some unwanted boxes drawn on it:


Boxed in component (noisy)

This noise can be reduced if bounding boxes below a certain threshold of height, width and area are discarded. The results should adequately capture all main text inside the sign, as seen by the final output of this phase:

Boxed in components after thresholding (denoised)
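The thresholding itself is just a filter over the boxes; the cutoffs below are placeholders, since (like the dilation width) they're tuned experimentally:

```python
# Hypothetical cutoffs; in practice these are tuned against the expected size of the sign's text.
MIN_WIDTH, MIN_HEIGHT, MIN_AREA = 8, 8, 80

def filter_boxes(boxes):
    # Keep only boxes big enough to plausibly contain a word or the speed number.
    return [(x, y, w, h, area) for (x, y, w, h, area) in boxes
            if w >= MIN_WIDTH and h >= MIN_HEIGHT and area >= MIN_AREA]
```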

Detection:
In this phase, legible text is extracted from the classification step and structured in a way that will finally detect a sign and its speed. Remember that a clean image of the text is needed for the OCR to work well. So after enclosing the components at the end of classification, each region is cropped and converted to a black-and-white image. We should then get a clear image containing only the sign's text, like the following:

Text ready for OCR
Note that "SPEED", "LIMIT" and "65" are all separate components, which are individually fed to the OCR. I just chose to illustrate them as they appear in the sign.
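The post doesn't pin down which OCR engine is in use, so purely as an illustration, here is how one of those component crops could be binarized and handed to Tesseract via pytesseract:

```python
import cv2
import pytesseract   # assuming Tesseract as the OCR engine, for illustration only

def ocr_component(crop_gray, box):
    x, y, w, h, _ = box
    region = crop_gray[y:y + h, x:x + w]
    # Otsu thresholding gives the clean black-and-white text image the OCR needs.
    _, bw = cv2.threshold(region, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
    # --psm 7 tells Tesseract to treat the region as a single line of text.
    return pytesseract.image_to_string(bw, config="--psm 7").strip()
```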

Each component is passed to the OCR, which returns a string. In the example case the OCR's output is "SPEED", "LIMIT" and "65," meaning the text was perfectly recognized. Results like these are not typical, though. For example, the image:



Gives the black and white regions:



However the OCR's output is "LIMIY" and "70."  Situations like these are handled using the Levenshtein edit distance algorithm. The algorithm gives a metric with which one can measure the "distance" between two words. Distance is determined by how many characters need to be added, removed or swapped in one word to match the other. For instance, "at" and "cat" have a distance of 1, since "c" needs to be added to "at." What makes the edit distance useful for this project is that I can assign a likeness score ranging from 0 to 1 to characters that appear similar (0 being the same character, 1 being unambiguously distinct).
In the case in question, "LIMIY" and "LIMIT" have an edit distance of less than 1 since "Y" and "T" look so similar, so the algorithm will treat "LIMIY" as "LIMIT." The likeness scores are stored in a symmetric matrix called the distance matrix.
The distance matrix is given here. It comes from research done by studying what letters a sample of children most confused when learning the alphabet. The published paper documenting this study is:
“An analysis of critical features of letters, tested by a confusion matrix.”
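A sketch of the weighted edit distance is below. The `char_cost` lookup plays the role of the distance matrix; the toy costs in the example are made up to show the idea, not values from the published matrix:

```python
def edit_distance(a, b, char_cost):
    """Levenshtein distance where substituting visually similar characters is cheap.

    char_cost(x, y) returns a value in [0, 1]; insertions and deletions cost a full 1."""
    n, m = len(a), len(b)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = float(i)
    for j in range(1, m + 1):
        d[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + 1,                                  # deletion
                          d[i][j - 1] + 1,                                  # insertion
                          d[i - 1][j - 1] + char_cost(a[i - 1], b[j - 1]))  # substitution
    return d[n][m]

# Toy cost table: "T" and "Y" look alike, so "LIMIY" lands very close to "LIMIT".
cost = lambda x, y: 0.0 if x == y else (0.2 if {x, y} == {"T", "Y"} else 1.0)
print(edit_distance("LIMIY", "LIMIT", cost))   # 0.2
```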

If "SPEED" or "LIMIT" is found in the image, the program searches the remaining components for a number divisible by 5 that lies below "SPEED" or "LIMIT" in the image. If such a number is in the sign, then the region is deemed a speed sign and assigned that value. A tracking flag is set and the last step is performed.
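Put together, the decision step amounts to something like this (the names and box format are hypothetical):

```python
def find_speed(ocr_results, matches):
    """ocr_results: list of (string, (x, y, w, h)) pairs from the OCR step.

    matches(s, word) uses the weighted edit distance to decide whether an OCR
    string is close enough to "SPEED" or "LIMIT"."""
    headers = [box for s, box in ocr_results
               if matches(s, "SPEED") or matches(s, "LIMIT")]
    if not headers:
        return None
    header_bottom = max(y + h for (x, y, w, h) in headers)
    for s, (x, y, w, h) in ocr_results:
        digits = s.strip(". ")
        # A multiple of 5 sitting below SPEED/LIMIT is taken to be the speed limit.
        if digits.isdigit() and int(digits) % 5 == 0 and y >= header_bottom:
            return int(digits)
    return None
```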

Tracking:
After a sign is correctly detected it will be tracked for the next few seconds. Tracking is mainly done for speed, in case numerous frames need to be analyzed, and to alert the program when the sign has left the screen so that it can resume steps 1-3. In order to track the sign, a good cropping of it needs to be known. In the first example image given above there is a nice (green) bounding box containing primarily the sign. However, this isn't always the case. For example, consider the cropped region the FRST gives for an image of a 70 mph sign taken from a video:

FRST cropped object


Although the region contains the sign, it might not track well in subsequent images, because the tracker could pick some other feature-rich object to focus on. To address this, we can exploit the standard layout of a sign's speed limit number to generalize a better cropping scheme. In this way the cropping will know to go out a certain number of pixels from the start of the number component based on its size:


Context cropping explanation

The above image demonstrates how a better image can be cropped based on the sign's number component. The rectangle starting at the green dot, point r, with its width W and height H, is known from the detection phase. From r, the algorithm moves left a*W pixels and then up b*H pixels to the point P. The point P defines the upper-left corner of a new rectangle to be drawn around the sign. This new rectangle's width and height are also some proportion of W and H.
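As a sketch, with a and b (and the output proportions c and d) left as tunable constants rather than the program's actual values:

```python
def context_crop(frame, number_box, a=1.2, b=2.0, c=3.0, d=3.5):
    """Grow a crop around the whole sign from the number component's box (x, y, w, h).

    a, b, c, d are illustrative proportions of the number's width W and height H."""
    x, y, w, h = number_box                  # the rectangle anchored at point r
    px = max(int(x - a * w), 0)              # move left a*W pixels from r ...
    py = max(int(y - b * h), 0)              # ... then up b*H pixels to point P
    x1 = min(int(px + c * w), frame.shape[1])
    y1 = min(int(py + d * h), frame.shape[0])
    return frame[py:y1, px:x1]
```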

When done to the FRST cropped image we get:

Context based cropping object

Let's compare the tracker's performance with the FRST cropping vs. the context-based cropping on the same video segment:

                             
FRST cropping results (left sequence)
Context based cropping results (right sequence)
Notice that the sequence on the left has the sign sliding off the screen, whereas the right sequence keeps it in frame as the car drives by.
One might ask why the number was used as the anchor. Why not "SPEED" or "LIMIT"? The answer is that perhaps "SPEED" or "LIMIT" was detected, but not both. In the demonstration image, then, the green dot might be in one of three different locations, which would require different constants to move left and up for each case. The number was chosen as the anchor point because, in the current program, it must be known to make it to the tracking phase. In future updates I will have different values of a and b for the cases anchored on "SPEED" or "LIMIT."

The new cropped image is then used as the object for openTLD to track. In future updates I can exploit the position of “SPEED” or “LIMIT” to quickly get more images of the sign to analyze in case the speed number itself could not be found (the more images, the better the odds of detection). When the tracked object is no longer on screen, the program simply clears the tracking flag, allowing the other three phases to run again.

These 3-4 steps are performed on each frame captured from a video camera. The idea is to mount the camera on the dashboard and point it at the middle of the road. GPS or OBD2 could then be used to alert the driver when they're speeding, though that would require more accuracy to keep false alerts from becoming an annoyance.

Moving Forward:
This project still needs a bit of work. It gets around an 80% detection rate on still images, which can be increased with some tweaking. For comparison, most published sign detectors boast a 90+% rate.

The main area for improvement is speed. On a Raspberry Pi 3 one can expect 5-10 fps, depending on how often the OCR gets hung up on things like billboards or other road signs. If the program had high accuracy this wouldn't be too much of an issue; however, it is still prone to errors and false positives. A significant increase in accuracy could be had if there were more frames in which to examine hypothesized sign regions, so increasing the frame rate could be an avenue to increase accuracy. But how to make it run faster?

The FRST runs at around 100 Hz, which is much faster than the camera itself, so no problems there. Detection is by far the slowest stage; it can tank to 2 Hz or even less. This is due to the OCR running on dozens of regions each frame, the vast majority of which are junk. So reserving the OCR for only the most promising regions would save a lot of time.

Computers can be trained to recognize objects through a process known as cascade training. One can pass a sufficiently large number of positive and negative sample images into an algorithm, have it build up a description of the desired object, and then quickly identify it in future scenes. The problem is that "sufficiently large" often means thousands upon thousands of positive and negative images. The prospect of searching Google for 4-6 thousand images of signs and preparing them by hand is too monotonous (and that's just the positive images). On top of that, I was not able to find publicly accessible databases containing stock speed sign images. This is why I tried to avoid machine training when I stated my goals at the start of this post.

I still have a decent speed sign detection program, though. So why not mine video data from streaming sites like YouTube? I wrote a program to do just that: it takes a list of desired URLs, downloads them one at a time, scans the entire video while saving positive and negative images, deletes the video, and repeats the process until completion (a rough sketch of this loop appears a little further below). Even with a moderately successful speed sign detector, one can find a huge number of sign images by going through hours of footage. Fortunately for me, there are a number of YouTube channels dedicated to sharing footage of trucks driving across America. For instance, this channel:


has over 6000 videos with durations ranging from 2 minutes to 8 hours, the majority being in the 20 minute range. I just have to be sure that the videos have a varied mix of signs and not just 60-70 mph signs on highways. A cursory scan of each video confirms this, as well as good driving conditions for better detection rates. Currently I have around 50 hours of footage queued up to mine, and it's expected to finish some time next week. After that I'll perform the training and post an update on how it goes. I figured I'd share this because I thought it was an interesting idea and perhaps someone down the line will find it useful as well.
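For the curious, the mining loop boils down to something like the sketch below. It assumes youtube-dl for the downloads and a `detect_sign(frame)` hook standing in for the detector, and it only shows the positive-sample path:

```python
import os
import subprocess
import cv2

def mine_videos(urls, detect_sign, out_dir="samples"):
    """Download each video, scan it for signs, save the crops, then delete the file."""
    os.makedirs(out_dir, exist_ok=True)
    saved = 0
    for url in urls:
        subprocess.run(["youtube-dl", "-o", "video.mp4", url], check=True)
        cap = cv2.VideoCapture("video.mp4")
        frame_idx = 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if frame_idx % 30 == 0:              # sample roughly one frame per second
                crop = detect_sign(frame)        # cropped sign image, or None
                if crop is not None:
                    cv2.imwrite(os.path.join(out_dir, "pos_%05d.png" % saved), crop)
                    saved += 1
            frame_idx += 1
        cap.release()
        os.remove("video.mp4")                   # free up space before the next download
```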

Thanks for reading,
- Joseph
