In my introduction post I mentioned that this blog will be focused on the process of designing a UAV. The majority of the content here will stick with that, but I thought it might be good to occasionally share some other projects I'm interested in. So in this post I'll introduce a speed sign detection algorithm using OpenCV on the Raspberry Pi 3. It covers the current state of things and future plans. I'll upload the source code to a repository after I iron out some bugs.
One of my goals was not to create any training sets for a computer to learn what a speed sign is, which means I tried to break a speed sign down into general traits and exploit those traits. The traits I chose to attack are shape, text and layout:
- Shape - Always rectangular, with a standard aspect ratio in which the height is larger than the width. This trait also holds regardless of color.
- Text - A sign always contains "SPEED", "LIMIT" and a number divisible by 5 indicating the speed.
- Layout - The number contained in the sign is always larger than the other text and sits at the bottom. In the US, a speed sign is always on the right-hand side of the road, so I can preemptively reduce the search region by half.
The main algorithm searches for and exploits these features in four stages: Recognition, Classification, Detection and Tracking. Recognition scans an image for a rectangle of the appropriate size and aspect ratio for a sign. Once found, this candidate rectangle moves to the classification stage, where it is scanned for man-made blocks of text. The found blocks of text are cropped out and sent to an Optical Character Recognition (OCR) engine to be turned into strings. If those strings match those of a sign, then the candidate rectangle is classified as a sign. Finally, the speed itself is detected using facts about the sign's layout. Afterwards the sign is cropped from the image to be used as the object for openTLD to track in subsequent images.
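Before walking through the stages in detail, here is a minimal structural sketch of that per-frame flow. The function bodies are trivial stubs standing in for the stages described in this post, not the project's actual code:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PipelineState:
    tracking: bool = False
    last_speed: Optional[int] = None

# Stage stubs: placeholders for the Recognition, Classification,
# Detection and Tracking steps described below.
def recognize(frame): return frame          # FRST -> candidate rectangle
def classify(candidate): return []          # -> boxes of likely man-made text
def detect(candidate, boxes): return None   # OCR + layout rules -> speed or None
def track(frame): return False              # openTLD-style tracking -> still visible?

def process_frame(frame, state: PipelineState) -> Optional[int]:
    if state.tracking:
        # While a sign is tracked, skip the expensive stages entirely
        state.tracking = track(frame)
        return state.last_speed
    candidate = recognize(frame)
    boxes = classify(candidate)
    speed = detect(candidate, boxes)
    if speed is not None:
        state.tracking = True        # the tracking flag mentioned above
        state.last_speed = speed
    return speed
```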
The main challenge is that the OCR is very sensitive to noise surrounding legible text, so we need a good image of ONLY the sign's text to feed it. However, speed signs sit in an unstructured environment, so each step leading up to the OCR seeks to filter through that environment and provide a clean image of text to the OCR.
Now I'll go into more detail by tracing images through the steps. First, let's examine this image:
[Image: Raw image]
Recognition:
In this phase the image is scanned for rectangles characteristic of a speed sign. This is the first step in zeroing in on the sign's text: it significantly reduces the area to scan for man-made text, shrinking the search size from 640x480 to roughly 50x60 pixels. This will be important later to reduce calls to the expensive OCR. Note that this process will always find a candidate region, even if a sign is not present.
A Fast Radial Symmetric Transform (FRST) is used here to find rectangles of similar size and aspect ratio to that of a speed sign. The FRST used in the code was adapted from:
“Traffic sign detection using computer vision” by Andreas Mogelmose.
The author gives his source code here (full download option). I also found the following to be a good resource for learning about the FRST:

The only alterations I made to Andreas' implementation were prepping it for real-time use, which had me removing some redundant or unnecessary calls and optimizing where I could.
The FRST gives the center and size of the best-voted rectangle within the image. This allows the candidate region to be cropped from the image for further analysis, although at this point it is still unknown whether this region is an actual sign.
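The hand-off to the next stage is then essentially a bounds-checked crop around that best-voted rectangle. A minimal sketch, with `cx, cy, w, h` standing in for the FRST's output (the margin is an illustrative value, not the project's):

```python
def crop_candidate(frame, cx, cy, w, h, margin=0.2):
    """Crop the best-voted rectangle (center cx, cy; size w x h) from the frame,
    with a small margin, clamped to the image bounds. frame is a NumPy image."""
    img_h, img_w = frame.shape[:2]
    mw, mh = int(w * margin), int(h * margin)
    x0 = max(0, int(cx - w / 2) - mw)
    y0 = max(0, int(cy - h / 2) - mh)
    x1 = min(img_w, int(cx + w / 2) + mw)
    y1 = min(img_h, int(cy + h / 2) + mh)
    return frame[y0:y1, x0:x1]
```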
[Image: FRST result; the area inside the green rectangle is the suspected sign]
Classification:
This phase classifies regions as man-made text to be used later by the OCR. It again reduces the area of the image that the OCR needs to scan by boxing in likely man-made text.
The first step of this phase works by reducing noise in the image through a process known as a morphological gradient:
[Image: Cropped FRST region]
[Image: Morphological gradient]
The resulting white pixels are then binarized and expanded horizontally to connect neighboring pixels with one another.
[Image: Binarized pixels expanded horizontally]
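A minimal OpenCV sketch of these two steps, assuming the cropped FRST region has already been converted to grayscale (the kernel sizes are illustrative, not the project's tuned values):

```python
import cv2

def text_mask(gray):
    """Morphological gradient to highlight text strokes, then binarize and
    dilate with a wide, short kernel so neighboring white pixels connect
    horizontally into word-sized blobs."""
    grad_kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))
    grad = cv2.morphologyEx(gray, cv2.MORPH_GRADIENT, grad_kernel)

    # Otsu thresholding turns the gradient image into a binary mask
    _, binary = cv2.threshold(grad, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)

    # Horizontal dilation joins characters belonging to the same word/number
    connect_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (9, 1))
    return cv2.dilate(binary, connect_kernel)
```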
The image is then broken down into components. To illustrate what I mean by components, imagine coloring a collection of shapes with a pencil. Each region that can be colored in without lifting the pencil from the page is considered a component. Below is an example containing 5 components, each represented by a different color:
[Image: Idealized component regions example]
Now if we drew minimal bounding boxes that capture each pixel of every color we'd get something like:
[Image: Minimal bounding boxes enclosing components]
In the real world, images of signs have a bit of noise that shows up when this is done. The program's output, for instance, has some unwanted boxes drawn on it:
[Image: Boxed-in components (noisy)]
This noise can be reduced if bounding boxes below a certain threshold of height, width and area are discarded. The results should adequately capture all of the main text inside the sign, as seen in the final output of this phase:
[Image: Boxed-in components after thresholding (denoised)]
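A sketch of the component and bounding-box steps using OpenCV's connected-components analysis; the thresholds here are illustrative placeholders:

```python
import cv2

def text_boxes(mask, min_w=10, min_h=10, min_area=80):
    """Label connected components in the binary mask and keep only the bounding
    boxes large enough to plausibly contain a word or a number."""
    n, _, stats, _ = cv2.connectedComponentsWithStats(mask, connectivity=8)
    boxes = []
    for i in range(1, n):  # label 0 is the background
        x, y, w, h, area = stats[i]
        if w >= min_w and h >= min_h and area >= min_area:
            boxes.append((int(x), int(y), int(w), int(h)))
    return boxes
```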
Detection:
In this phase, legible text is extracted from the classification step and structured in a way that finally detects a sign and its speed. Remember that a clean image of the text is needed for the OCR to work well. So after enclosing the components at the end of classification, each region is cropped and converted to a black-and-white image. We should then get a clear image containing only the sign's text, like the following:
[Image: Text ready for OCR]
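A sketch of that crop-binarize-OCR hand-off. Here pytesseract is my stand-in for whatever OCR engine the project actually wraps, and the `--psm` setting is an illustrative choice:

```python
import cv2
import pytesseract

def ocr_component(gray, box):
    """Crop one component box from the grayscale sign region, binarize it,
    and pass it to the OCR, returning the recognized string."""
    x, y, w, h = box
    crop = gray[y:y + h, x:x + w]
    _, bw = cv2.threshold(crop, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
    # --psm 7: treat the crop as a single line of text (illustrative setting)
    return pytesseract.image_to_string(bw, config="--psm 7").strip()
```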
Each component is passed to the OCR, which returns a string. In the example case the OCR's output is "SPEED", "LIMIT" and "65", meaning the text was perfectly recognized. These results are not typical, though. For example, the image:
Gives the black and white regions:
However, the OCR's output is "LIMIY" and "70." Situations like these are handled using the Levenshtein edit distance algorithm. The algorithm gives a metric with which one can measure the "distance" between two words. Distance is determined by how many characters need to be added, removed or swapped in one word to match the other. For instance, "at" and "cat" have a distance of 1, since a "c" needs to be added to "at." What makes the edit distance useful for this project is that I can assign a likeness score ranging from 0 to 1 to pairs of characters that appear similar (0 being the same character, 1 being unambiguously distinct).
In the case in question, "LIMIY" and "LIMIT" have an edit distance of less than 1, since "Y" and "T" look so similar, so the algorithm will treat "LIMIY" as "LIMIT." The likeness scores are stored in a symmetric matrix called the distance matrix.
The distance matrix is given here. It comes from research that studied which letters a sample of children most often confused when learning the alphabet. The published paper documenting this study is:
“An analysis of critical features of letters, tested by a confusion matrix.”
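A minimal sketch of an edit distance with per-character substitution costs. `sub_cost` stands in for a lookup into that distance matrix; it is assumed to return 0 for identical characters and up to 1 for unambiguously distinct ones:

```python
def weighted_edit_distance(a, b, sub_cost):
    """Levenshtein-style edit distance where substituting visually similar
    characters (e.g. 'Y' for 'T') costs less than a full edit."""
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = float(i)      # cost of deleting i characters
    for j in range(n + 1):
        d[0][j] = float(j)      # cost of inserting j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + 1,                                  # deletion
                d[i][j - 1] + 1,                                  # insertion
                d[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]),   # substitution
            )
    return d[m][n]

# With sub_cost("Y", "T") < 1, weighted_edit_distance("LIMIY", "LIMIT", sub_cost) < 1.
```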
If "SPEED" or "LIMIT" are found in the image, then it will search the remaining components for a number divisible by 5 that lies below "SPEED" or "LIMIT" in the image. If such a number is in the sign, then it's deemed a speed sign and given a value. A tracking flag is toggled and the last step is performed.
Tracking:
After a sign is correctly detected it is tracked for the next few seconds. Tracking is performed mainly to allow quicker performance when numerous frames need to be analyzed, and to alert the program when the sign is off the screen so that it can resume steps 1-3. In order to track the sign, a good cropping of it needs to be known. In the first example image given above, there is a nice (green) bounding box containing primarily the sign. However, this isn't always the case. For example, consider the cropped region given by the FRST of an image containing a 70 mph sign, taken from a video:
Although the region contains the sign, it might not track well in subsequent images, because the tracker could pick some other feature-rich object to focus on. To address this, we can exploit the standard layout of a sign's speed limit number to build a better cropping scheme. The crop extends out a certain number of pixels from the start of the number component, based on its size:
[Image: Context cropping explanation]
The above image demonstrates how a better crop can be taken around the sign's number component. The rectangle starting at the green dot, point r, and its width W and height H are known from the detection phase. From r, the algorithm goes left a*W pixels and then up b*H pixels to the point P. The point P defines the upper-left corner of a new rectangle to be drawn around the sign. This new rectangle's width and height are also some proportion of W and H.
When this is done to the FRST-cropped image we get:
[Image: Context-based cropping object]
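A sketch of that geometry; the proportions a, b and the output-size multipliers are placeholders for whatever constants the program actually tunes:

```python
def context_crop(image, number_box, a=1.2, b=1.5, w_scale=3.5, h_scale=3.0):
    """Given the number component's box (top-left corner r = (x, y), size W x H),
    step left a*W and up b*H to point P, then crop a rectangle whose size is
    proportional to W and H. The constants here are illustrative."""
    x, y, W, H = number_box
    px = max(0, int(x - a * W))            # point P, clamped to the image
    py = max(0, int(y - b * H))
    img_h, img_w = image.shape[:2]
    x1 = min(img_w, px + int(w_scale * W))
    y1 = min(img_h, py + int(h_scale * H))
    return image[py:y1, px:x1]
```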
Let's compare the tracker's performance with FRST cropping versus context-based cropping on the same video segment:
[Image: FRST cropping results]
[Image: Context-based cropping results]
Notice that the sequence on the left has the sign sliding off the screen, whereas the right sequence keeps it in frame as the car drives by.
One might ask why the number was used as the anchor. Why not "SPEED" or "LIMIT"? The answer is that "SPEED" or "LIMIT" might have been detected, but not both. So in the demonstration image, the green dot might be in one of three different locations, which would require different constants for moving left and up in each case. The number was chosen as the anchor point because, in the current program, it must be known to make it to the tracking phase. In future updates I will have different values for a and b in cases anchored on "SPEED" or "LIMIT."
The new cropped image is then used as the object for openTLD to track. In future updates I can exploit the position of "SPEED" or "LIMIT" to quickly get more images of the sign to analyze in case the speed number itself could not be found (the more images, the better the odds of detection). When the tracked object is no longer on screen, the program simply clears the tracking flag, allowing the other three phases to run again.
These 3-4 steps are performed on each frame captured from a video camera. The idea is to mount the camera on the dashboard and point it at the middle of the road. GPS or OBD2 could also be used to alert the driver that they're speeding. However, that might require more accuracy to reduce potential annoyances.
Moving Forward:
This project still needs a bit of work. It gets around an 80% detection rate on still images, which can be increased after some tweaking. For comparison, most published sign detectors boast a 90+% rate.
The main area for improvement is speed. On a Raspberry Pi 3 one can expect 5-10 fps, depending on things the OCR gets hung up on: billboards or other road signs. If accuracy were high this wouldn't be too much of an issue; however, it is still prone to errors and false positives. A significant increase in accuracy could be had if there were more frames in which to examine hypothesized sign regions. So increasing the frame rate could be an avenue to increase accuracy. But how to make it run faster?
The FRST runs at around 100 Hz, which is much faster than the camera itself, so no problems there. Detection is by far the slowest; it can tank to 2 Hz or even less. This is due to the OCR running on dozens of regions each frame. The vast majority of components processed by the OCR are junk, so reserving the OCR for only the most promising regions would save a lot of time.
Computers can be trained to recognize objects in a process known as cascade training. One can pass a sufficiently large number of positive and negative sample images into an algorithm, have it build up a description of the desired object, and then quickly identify it in future scenes. The problem: "sufficiently large" is often thousands upon thousands of positive and negative images. The prospect of searching Google for 4-6 thousand images of signs and preparing them by hand is too monotonous (and that's just the positive images). On top of that, I was not able to find publicly accessible databases containing stock speed sign images. This is why I tried to avoid machine training when I stated my goals at the start of this post.
I still have a decent speed sign detection program, though. So why not mine video data from streaming sites like YouTube? I wrote a program to do just that. It takes a list of desired URLs, downloads them one at a time, scans the entire video saving positive and negative images, deletes the video, and repeats the process until completion.
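A rough sketch of that mining loop. The yt-dlp call is my stand-in for whatever downloader is used, and detect_sign() is a stub where the detection pipeline above would plug in:

```python
import os
import subprocess
import cv2

def detect_sign(frame):
    return False   # stub: plug in the detection pipeline described above

def mine_videos(urls, out_dir="mined_frames", frame_step=15):
    """Download each video, scan it saving candidate positive/negative frames,
    then delete the file before moving on to the next URL."""
    os.makedirs(out_dir, exist_ok=True)
    for i, url in enumerate(urls):
        path = f"video_{i}.mp4"
        subprocess.run(["yt-dlp", "-f", "mp4", "-o", path, url], check=True)
        cap = cv2.VideoCapture(path)
        idx = 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if idx % frame_step == 0:
                label = "pos" if detect_sign(frame) else "neg"
                cv2.imwrite(os.path.join(out_dir, f"{label}_{i}_{idx}.png"), frame)
            idx += 1
        cap.release()
        os.remove(path)   # free disk space before the next download
```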
Even with a moderately successful speed sign detection program, one can find a huge number of sign images by going through hours of footage. Fortunately for me, there are a number of YouTube channels dedicated to sharing footage of trucks driving across America.
For instance, this channel has over 6000 videos with durations ranging from 2 minutes to 8 hours, with the majority being around 20 minutes. I just have to be sure that the videos have a varied mix of signs and not just 60-70 mph signs on highways. A cursory scan of each video confirms this, as well as good driving conditions for better detection rates. Currently I've queued around 50 hours of footage to mine, and it's expected to finish some time next week. After that I'll perform the training and post an update on how it goes. I figured I'd share this because I thought it was an interesting idea and perhaps someone down the line will find it useful as well.
Thanks for reading,
- Joseph