A video-based car detector for cyclists
I enjoy cycling quite a bit: the fresh air, the speed, the physical exertion, and the magnificent views. I have been riding throughout the San Francisco Bay Area for about two years now, steadily increasing my ability and frequency. I commuted by bike two or three times per week during the three months of the Metis bootcamp. From this I have learned that cycling is fun but also dangerous. Plenty of routes require cyclists to "share the road" with cars, and as we know, some are better at sharing than others. Cars also have a significant physical advantage in this jockeying for space. When a car unexpectedly passes a cyclist it can be quite alarming and feel very unsafe. And sometimes terrible things happen, like this. So my idea was formed: create a tool that alerts a rider when a car is approaching from behind. For those that don't know, "Car Back!" is what one cyclist shouts to another to warn of a car approaching from behind - so I took this name for my project.
My vision is to attach a camera to the back of my bike, near the seat, that captures video in real time and alerts the rider to any cars approaching from behind. The alert would be an audio cue played through one of the apps already running - Strava, Spotify, or Audible, for example.
A picture of me riding with a group on our way to Mt Zion, Utah
One of my goals was to make cycling part of my work, and I achieved it in this project. I strapped a GoPro to the back of my bike and set out on a number of routes to collect video data to train a model. I needed to be thorough in capturing a variety of weather, lighting, and traffic conditions. From these videos I extracted frames at 6 frames per second using ffmpeg and set about hand-labelling them for approaching cars. Using a tool called RectLabel, I drew rectangles around approaching and not-approaching cars and labelled them accordingly. This was certainly one of the most time-consuming parts of the project, as I had hundreds of frames to box and label. Luckily the fun of collecting the data through bike rides was not lost in this process.
Here the green rectangles represent the positive class and the purple represents the negative (not-approaching car) class.
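For reference, the frame-extraction step can be sketched as a small helper that builds the ffmpeg invocation - the file names here are hypothetical, and actually running it assumes ffmpeg is on the PATH:

```python
import subprocess

def ffmpeg_frames_cmd(video_path, out_dir, fps=6):
    """Build the ffmpeg invocation that dumps frames at a fixed rate."""
    return [
        "ffmpeg", "-i", video_path,
        "-vf", f"fps={fps}",            # sample `fps` frames per second
        f"{out_dir}/frame_%04d.jpg",    # frame_0001.jpg, frame_0002.jpg, ...
    ]

# To actually run it (requires ffmpeg installed):
# subprocess.run(ffmpeg_frames_cmd("ride.mp4", "frames"), check=True)
```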
Labelling images produces an annotation file for each image. These annotations define the location and size of the bounding boxes as well as the class of each box, and they are used later to train the model.
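To give a feel for what the labelling produces, here is a sketch of reading one annotation - the exact schema is an assumption for illustration, not RectLabel's real export format:

```python
import json

# A hypothetical annotation for one frame: one entry per bounding box.
annotation = json.loads("""
{
  "filename": "frame_0001.jpg",
  "objects": [
    {"label": "approaching", "x_min": 412, "y_min": 220, "x_max": 540, "y_max": 330},
    {"label": "not_approaching", "x_min": 60, "y_min": 240, "x_max": 150, "y_max": 310}
  ]
}
""")

def to_rows(ann):
    """Flatten one annotation into (filename, label, box) rows for training."""
    return [
        (ann["filename"], o["label"], (o["x_min"], o["y_min"], o["x_max"], o["y_max"]))
        for o in ann["objects"]
    ]
```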
For modeling I used a pre-trained object detector to which I could apply transfer learning. A long-term goal is for this detector to run in real time on a mobile device, so I searched for a mobile-friendly model. Despite my best searching efforts (and I believe myself to be an expert googler), I did not find a Keras-built object detector with bounding boxes that I could apply transfer learning to. So I found the next best thing: a MobileNet SSD model trained on the COCO dataset, found here. The MobileNet models are specifically built to be light and fast on mobile devices. MobileNet is built in TensorFlow, which is a bit messier to deal with than Keras, so I followed this tutorial for setting up the model and applying transfer learning. I repurposed this jupyter notebook to solve my specific use case.
Because the model is in TensorFlow, it requires TFRecords, which you can read about here. The process of creating TFRecords starts with the annotations from the labelling step. I first created csv files from the annotations using json_to_csv.py, then generated test and train groups from the dataset with split_labels.ipynb. From these groups I generated test and train TFRecords using generate_tfrecord.py. You can find all of these files in the github repo.
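The one subtlety in the split step is that every box from a given frame should land in the same group, so the split has to happen at the image level, not the row level. A sketch of that idea (the data and function name are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the CSV produced from the annotations; one row per box.
df = pd.DataFrame({
    "filename": ["f1.jpg", "f1.jpg", "f2.jpg", "f3.jpg", "f4.jpg", "f5.jpg"],
    "label": ["approaching", "not_approaching", "approaching",
              "approaching", "not_approaching", "approaching"],
})

def split_by_image(df, test_frac=0.2, seed=0):
    """Split at the image level so all boxes from one frame stay together."""
    rng = np.random.RandomState(seed)
    files = df["filename"].unique()
    n_test = max(1, int(len(files) * test_frac))
    test_files = set(rng.choice(files, size=n_test, replace=False))
    test = df[df["filename"].isin(test_files)]
    train = df[~df["filename"].isin(test_files)]
    return train, test
```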
Once the TFRecords were created I was ready to apply transfer learning. First I downloaded the model - I chose ssd_mobilenet_v1_coco_11_06_2017.tar.gz. The steps to complete training are listed in the jupyter notebook ApproachingCars.ipynb in the repo. They involve setting up the environment correctly and then executing the following command:
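With the legacy TensorFlow 1.x Object Detection API, that command took roughly this form - the config path and output directory are placeholders, not the exact paths from my repo:

```shell
python object_detection/train.py \
    --logtostderr \
    --pipeline_config_path=training/ssd_mobilenet_v1_coco.config \
    --train_dir=training/
```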
The training converged after about 1.5-2 hours using tensorflow-gpu on an NVIDIA GeForce GTX 1080 Ti. One nice thing about this training is that it periodically writes a checkpoint saving the model at that point in time, so you can pick whichever checkpoint performs best.
Once training was complete it was time to test the model. I used the following command to export the inference graph based on the best checkpoint:
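With the legacy TensorFlow 1.x Object Detection API, the export step looked roughly like this - the checkpoint number and paths are placeholders:

```shell
python object_detection/export_inference_graph.py \
    --input_type image_tensor \
    --pipeline_config_path training/ssd_mobilenet_v1_coco.config \
    --trained_checkpoint_prefix training/model.ckpt-10000 \
    --output_directory inference_graph/
```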
After the model was trained it was time to see it in action! This is the point at which I realized I needed a wider variety of lighting, weather, and traffic conditions, so I repeated the data collection, labelling, and training several times, with more images each round.
Some things I learned in this process:
To process videos I used MoviePy's VideoFileClip to break the input video into frames, apply the classifier to each frame, and reconstruct the video at the end. An example output video is shown below:
You may be wondering how I got that nice audio beep into the video, since I haven't described it. Well, I cheated on that one: I added it with iMovie after creating the video. I could easily play a sound in real time whenever an approaching car is detected, but constructing a video with synced audio is a much more complicated problem that, frankly, didn't make sense to tackle programmatically just for a demo. So I fudged it, and I think it was worthwhile - it made the demo more effective.
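Conceptually the MoviePy pipeline is just a pure function applied frame by frame. Here is a sketch with the detection step stubbed out - run_detector is a hypothetical stand-in for the exported model, and the file names are made up:

```python
import numpy as np

def annotate_frame(frame, detections=()):
    """Draw a red box outline for each detected approaching car.

    `frame` is an HxWx3 uint8 array; `detections` is a list of
    (x_min, y_min, x_max, y_max) boxes from the detector (stubbed here).
    """
    out = frame.copy()
    for x0, y0, x1, y1 in detections:
        out[y0:y1, x0:x0 + 2] = (255, 0, 0)   # left edge
        out[y0:y1, x1 - 2:x1] = (255, 0, 0)   # right edge
        out[y0:y0 + 2, x0:x1] = (255, 0, 0)   # top edge
        out[y1 - 2:y1, x0:x1] = (255, 0, 0)   # bottom edge
    return out

# Wiring into MoviePy (assumes moviepy is installed and a detector exists):
# from moviepy.editor import VideoFileClip
# clip = VideoFileClip("ride.mp4").fl_image(
#     lambda f: annotate_frame(f, detections=run_detector(f)))
# clip.write_videofile("ride_annotated.mp4")
```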
In the end the model had 97% recall, which is excellent. Of the 72 approaching cars in the test set it mis-identified only 2 as not approaching, and those were special-case vehicles (a trolley tour bus and a big rig). The model's precision was 17%, which feels low, but let me explain why this is okay. First, I'd rather have false positives that keep riders on their toes than miss some approaching cars by being conservative. Second, most of the false positives happened on dense, high-traffic streets. Cyclists are (with good reason) much more alert in these environments, so sounding the alarm there makes sense to keep them riding defensively. Overall the performance of the model is great and demonstrates the very real potential for this tool to improve the safety of cyclists.
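For the curious, the recall number follows directly from those counts. The false-positive count below is an illustrative figure consistent with the reported precision, not an exact number from my test set:

```python
tp, fn = 70, 2    # 72 approaching cars in the test set, 2 missed
fp = 342          # illustrative count, consistent with ~17% precision

recall = tp / (tp + fn)         # fraction of approaching cars caught
precision = tp / (tp + fp)      # fraction of alarms that were real
```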
Going forward I plan to add more data from a variety of locales and in more varied conditions. If a number of cyclists contributed to this project by recording video of their rides and submitting it for labelling, the model's performance could improve greatly. I would also like to prove out a full prototype by running it on a Raspberry Pi on my bike to see how it feels. Based on this blog I believe the model should be able to classify on a Raspberry Pi at about 4 frames per second, which could be fast enough to make my cycling safer. And just imagine if GoPro and Strava or Spotify teamed up with me to create a real device that cyclists could add to their bikes and save some lives. Wouldn't that be grand? Yes, yes it would.
The fourth project at Metis required the use of Natural Language Processing, so naturally I decided to work with programming languages. I built a tool that, given a new Stack Overflow question, recommends an existing Stack Overflow question, an answer, and a GitHub repository.
To acquire the data I used the Stack Overflow API, quickly gathering 100,000 Stack Overflow questions and 100,000 answers. Next I decided to use the READMEs from GitHub repositories to identify relevant repositories to suggest alongside Stack Overflow questions.
GitHub's API only allows 5,000 requests per user per hour, and it timed out regularly. With a tight deadline, this greatly reduced the number of READMEs I was able to acquire in the time I had. After 4 days of effort I ended up with 100,000 READMEs. Many of those were empty and a fair number were duplicates; after cleaning I had roughly 50,000 unique READMEs remaining. That's less than 1% of the repositories on GitHub. My time was limited and I needed to stay on track, so I moved on to modeling with this limited data.
The first thing I did was use CountVectorizer with LSI to see how easily I could build a model. It was immediately clear that a lot of HTML formatting needed to be removed in order to pick up the more relevant terms in the text, so I added many HTML keywords as stop words on top of sklearn's built-in English stop words. After experimenting with a number of parameters, I found that word n-grams of 1-3 captured many of the code patterns that indicate a match for recommended content.
I combined the Stack Overflow questions, answers, and READMEs into one dataset in which each document corresponded to a question, an answer, or a README. I stemmed the documents using LancasterStemmer and trained a tf-idf model on them. Next I created vectors with the tf-idf model and passed them to LSI to transform them into the semantic space and build an index that could be queried for the closest match to an incoming question.
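I used gensim for the tf-idf and LSI steps; here is an equivalent sketch with scikit-learn, where TruncatedSVD plays the role of LSI, on a toy corpus (the documents and the query are made up):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus standing in for questions, answers, and READMEs.
docs = [
    "how to merge two dicts in python",
    "python dict comprehension examples",
    "java string concatenation performance",
    "jquery select element by class",
]

# tf-idf, then a low-rank projection (LSI) into a small semantic space.
tfidf = TfidfVectorizer(stop_words="english", ngram_range=(1, 3))
X = tfidf.fit_transform(docs)
lsi = TruncatedSVD(n_components=2, random_state=0)
index = lsi.fit_transform(X)

def recommend(query):
    """Return the index of the document most similar to `query`."""
    q = lsi.transform(tfidf.transform([query]))
    return int(cosine_similarity(q, index).argmax())
```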
For the tool itself, I built a Flask app that accepts a string, applies the same pre-processing and vectorization I used to build the model, and then uses gensim's MatrixSimilarity index to recommend a similar question, answer, and GitHub repository.
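The app boils down to one POST route wrapping the recommender. In this sketch the recommender is a stub returning hypothetical ids, since the real one needs the trained tf-idf and LSI models loaded:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def recommend(text):
    """Stub for the real pipeline: preprocess, vectorize with the trained
    tf-idf + LSI models, then look up nearest neighbors in the similarity
    index. The ids below are placeholders for illustration."""
    return {"question_id": 0, "answer_id": 1, "repo": "example/repo"}

@app.route("/recommend", methods=["POST"])
def recommend_route():
    question = request.get_json()["question"]
    return jsonify(recommend(question))
```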
The model works well with certain patterns. It identifies java, jQuery, python, SQL and others, and performs very well on highly repetitive code syntax. An example is shown here with a python question:
I'm trying to learn python and came across some code that is nice and short but doesn't totally make sense
the context was:
I get what it's doing, but why does python do this - ie return the value rather than True/False?
Is this legit/reliable style, or are there any gotchas on this?
Above shows the question being put into the model.
Below is the returned question, answer, and suggested GitHub repository from the model.
In python, if I have a function
g(x) is defined before and takes time to calculate.
some thing like this
When I am calculating f(x), will python calculate g(x) two times when substitute g(x) in to the equation? If it does calculate two times, how people usually handle it in python?
Yes, I'm sure you're talking about subroutine args! In Python, you don't receive arguments in a single list '@_' as in Perl. instead, the simplest way to do this is:
This is roughly equivalent to:
Now if you must receive your args as a list or as a tuple (both are collections), this should do:
You can replace `args' with anything you like, staying within Python variable naming rules.
I recommend you read up on the book linked to in the previous answer and especially on functions, as Python has a plethora of possibilities when it comes to functions - positional and keyword arguments and the likes..
Created based on the need for an easy to use and understand GA library.
There are a few simple steps to usage that is explained in the
As you can see from the example, the first three matched relatively well: they all involve basic python functions and minor questions about correct syntax or usage. The README is a less successful match, and I will explain why below.
The problem with GitHub
GitHub READMEs are unreliable. They are often incomplete, and there are a lot of junk repositories and tons of clones. It would be a huge undertaking for someone working through the API to capture enough data about the repos to make the model work better. Still, I would love to include even just the READMEs from the top 100,000 repositories - the model would be greatly improved and return better content. I hope to achieve that in some manner in the future.
Where to go from here
GitHub and Stack Overflow host a wealth of technical talent, and it seems natural that they should work together in some harmonious way. A tool like this could suggest contributing developers from Stack Overflow for relevant GitHub projects. Similarly, it could help developers on Stack Overflow uncover projects relevant to their efforts - perhaps they'd discover code that helps their own project, or find a new project to contribute to.
From here, I hope to have more time to tune my model with more data, much-improved GitHub READMEs, and better n-grams through more detailed regular-expression matching, and to explore other improvements too.
The assignment for this project was to use machine learning for classification. I found the National Survey on Drug Use and Health (NSDUH), sponsored by the Substance Abuse and Mental Health Services Administration (SAMHSA), an agency within the Department of Health and Human Services. The survey asks a random sample of the population, ages 12 and older, across the 50 states a series of questions about drug use and health as well as some general demographics. To promote honesty, respondents answer the sensitive drug-use questions privately on a computer. The questionnaire takes about an hour to complete and the respondent receives $30 once it is done.
During initial data exploration I immediately became interested in the numbers representing the age of first use of a variety of drugs. My intuition led me to wonder: if children tried cigarettes, alcohol, or other drugs at an early age, would they be more likely to be dependent or heavy users later in life? As I explored the data, I found that cigarettes were not especially impactful, but marijuana and alcohol were much more so.
Since the survey provides a simple flag for drug dependence, I was able to look at the normalized distribution of age of first use of marijuana and alcohol for two groups - drug addicts and non-addicts. I subtracted the two distributions and the result is shown in the chart below:
Drug experimentation in youth
The chart shows the difference between the normalized distributions of the age at which addicts and non-addicts first used marijuana or alcohol
There is a clear change around age 14 between the addict and non-addict groups. The mean age of first marijuana use was 14.65 for the addict group versus 17.24 for the non-addict group; for alcohol it was 14.29 versus 15.60. In both cases the mean was shifted by at least a year, and much more dramatically (almost 3 years) in the case of marijuana. This shows just how important it is to keep younger children away from these substances until their later teens or even twenties.
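The subtraction behind the chart is straightforward. A sketch with made-up ages - density=True normalizes each histogram, so the two groups are comparable despite their very different sizes:

```python
import numpy as np

# Hypothetical ages of first use for the two groups.
addicts = np.array([12, 13, 13, 14, 14, 15, 16, 18])
non_addicts = np.array([15, 16, 17, 17, 18, 19, 20, 21])

bins = np.arange(10, 26)  # one-year bins from age 10 to 25
p_addict, _ = np.histogram(addicts, bins=bins, density=True)
p_non, _ = np.histogram(non_addicts, bins=bins, density=True)

# The quantity plotted: where this is positive, addicts started earlier.
diff = p_addict - p_non
```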
The survey results contained 3,157 columns to sift through - a huge amount of information to take in. To keep it simple I started with the age of first use of a variety of drugs, but these alone weren't enough to produce a very predictive model. I added some flags of my own, including one indicating whether the respondent started using in their early teens, hoping to emphasize the relationship I discovered above. I also pulled in general drug-usage flags - simple yes/no questions about ever having used drugs from a long list ranging from marijuana to painkillers to crack - represented numerically. Finally I included general demographics like age, sex, and household income.
My next challenge was imbalanced data. Only 2.5% of the data represented the 'addicted' group. Predicting on the full dataset produced high accuracy by never predicting the minority class, so I needed to do something to boost my under-represented data. I tried out a few tools for this purpose:
SMOTE (Synthetic Minority Over-sampling TEchnique) generates synthetic minority-class samples in a black-box manner based on the existing data. Using it produced fabulous results on my training set - my ROC curve was nearly a triangle. First came excitement, then suspicion. I resampled my data with a different random seed and got much poorer results: the original model was overfitting to the synthetic samples. SMOTE would not work for my purposes.
RUS (Random Under Sampler) removes a defined amount of the over-represented class to even out the sample sizes (or produce the ratio you request). Using it did not improve my model much, which was less exciting than even the SMOTE failure.
In the end the best results came from Logistic Regression using C=10 and passing the 'balanced' flag to the 'class_weight' parameter.
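A sketch of that final setup on synthetic imbalanced data - the dataset shape and numbers here are made up; only the classifier settings match what I used:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the survey data: ~2.5% positive class.
X, y = make_classification(n_samples=4000, n_features=10, n_informative=5,
                           weights=[0.975], class_sep=1.5, flip_y=0,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# 'balanced' reweights classes inversely to their frequency, so the rare
# positive class is not ignored by the fit.
clf = LogisticRegression(C=10, class_weight="balanced", max_iter=1000)
clf.fit(X_tr, y_tr)
recall = recall_score(y_te, clf.predict(X_te))
```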
In my model the top three indicators of drug addiction were whether the respondent had ever tried any illicit drug, followed by ever having used marijuana or psychotherapeutics specifically. The model achieved a recall of .89, which I am happy with. Precision was only .11, which I nonetheless find acceptable: the low number is driven by a high count of false positives, shown in the chart below:
False Positives. This chart shows the distribution of predicted probability (above 50%) of my model that a respondent will be dependent on illicit drugs.
I consider this graph to represent high-risk individuals: survey respondents whose answers indicate high-risk behaviors that may point to an undiagnosed drug addiction, or a path toward one. As you can see, a large number of Young Adults are predicted to be in this group. This makes intuitive sense, since in American culture Young Adults (defined here as 19-25 years old) tend to drink heavily and may also dabble in drug use; my model would pick up these behaviors as risks. The Adult group (26 and older) is smaller and could indicate that some people were not entirely truthful on the survey, or have not yet crossed over into full addiction.
Save our children
The most important group is the light blue bars in the graph above. These are the youths - children aged 12-18 who are participating in risky behaviors that could lead to drug addiction later in life. There is still time to educate, engage, or otherwise intervene with this group to redirect their path away from drug use and toward more meaningful ways of life.
I built a very preliminary regression model to look at the youth respondents in the data. I was curious what sort of factors affect a youth's ability to avoid the pitfalls of drug abuse, and I found several things - some expected, some not. Among the more impactful indicators: experience selling drugs, experience with guns, the respondent's sentiment toward other people using drugs, the level of parental interest in the child's schoolwork, and the child's involvement in after-school activities. The bottom line is that kids - even older teenagers - need adults to be involved and take an interest in their lives. Perhaps this is not surprising, but it is good to be reminded of where we can make an impact.
This week's project focused on web scraping, cleaning data, feature engineering, and understanding linear regression. I decided to scrape a housing rental website to gather data about rent, amenities, and other features in order to predict rental prices for various locations around the Bay Area.
I accumulated around 3,000 data points from Oakland, Marin, San Francisco, and the South Bay and dug in. There was a large amount of information, ranging from the basics - beds, baths, square footage - to balconies, parking, heating, and even ceiling fans. BeautifulSoup made it very easy to extract the data I needed from the web pages. I stored away the numerical data - bedrooms, bathrooms, sqft, and rent - but the rest was trickier: how to handle whether pets are allowed, or what sort of parking is available? In the end I turned all of the non-numerical data into dummy variables, leaving only the address behind for further tinkering.
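The dummy-variable step is essentially a one-liner with pandas; the listings below are made up for illustration:

```python
import pandas as pd

# A few toy scraped listings; the amenity columns are strings with a
# handful of repeated values, which makes them natural one-hot candidates.
listings = pd.DataFrame({
    "beds": [1, 2, 2],
    "rent": [2400, 3100, 2950],
    "pets": ["cats ok", "no pets", "cats ok"],
    "parking": ["street", "garage", "none"],
})

# One column per (feature, value) pair; numeric columns pass through.
model_df = pd.get_dummies(listings, columns=["pets", "parking"])
```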
Most people have an instinct that location affects rental pricing, so I wanted to treat that data carefully to make sure I picked up any signal it provided. Out of curiosity I tried fitting a variety of regression models with no location data, but my numbers weren't great and I knew I could do better. Using zip code, the best result was an R-squared of about .72 with Random Forest - roughly a 10% improvement over models without location - but I suspected some feature engineering could improve it further.
I explored the idea of creating a distance relative to city center or public transportation but didn't find a standard that would work well. As I was thinking through the problem, I took time to geocode the locations in order to obtain the latitudes and longitudes of each address using GeoPandas. In the end I decided to assign neighborhoods to each location based on their geo-location. In some cases neighborhood became city (for example, for cities in Marin). But in other cases I assigned more granular values for neighborhood names - SOMA, Nob Hill, etc. Then I calculated average rent per neighborhood and sorted those values in order to define a rank for locations.
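The rank feature boils down to a groupby. A sketch with made-up listings:

```python
import pandas as pd

# Toy listings standing in for the scraped data.
rentals = pd.DataFrame({
    "neighborhood": ["SOMA", "SOMA", "Nob Hill", "Nob Hill", "Oakland"],
    "rent": [3500, 3700, 3200, 3400, 2400],
})

# Average rent per neighborhood, then rank neighborhoods by that average.
avg = rentals.groupby("neighborhood")["rent"].mean()
rank = avg.rank(ascending=False).rename("rank")

# Attach the rank back to each listing as an ordinal location feature.
rentals = rentals.merge(rank, left_on="neighborhood", right_index=True)
```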
Heat maps of neighborhood rent (left) and rank (right) demonstrate how transforming the data picks up more of the signal.
The next step was modeling. I tried a variety of models to see if there were any unexpected fits, but as expected, the best fits came from tree-based models: both Random Forest and Gradient Boost reached an R-squared of .77. I settled on Gradient Boost because the features that stood out for Random Forest were suspect - it put ceiling fan in the top four influencers, which I found highly dubious. Gradient Boost's results were much more explainable, ranking the top influencers as sqft, rank, beds, and baths, followed by lesser but still important features: high-speed internet, washer/dryer, pets allowed, and parking.
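The feature ranking comes straight from the fitted model's feature_importances_. A sketch on synthetic data (the dataset is a made-up stand-in for the rental features):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in for the rental features (sqft, rank, beds, baths, ...).
X, y = make_regression(n_samples=500, n_features=8, n_informative=4,
                       random_state=0)

gbr = GradientBoostingRegressor(random_state=0).fit(X, y)

# Importances are normalized to sum to 1; sorting them is how the
# feature ranking above was read off.
importances = gbr.feature_importances_
```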
My model could improve with more, and more varied, data. Having current, active rental information combined from a variety of sources - including sites that are free to list on - would strengthen its predictive power.
This model can help landlords determine how much they could raise the rent by adding a feature like a washer and dryer or by allowing pets. They could also evaluate whether the cost of providing internet would be offset by the added rent they could charge. I am sure a lot of people could find value in that.
Well, I've done it. I've started a 12 week Data Science bootcamp at Metis. The most difficult transition in this first week has been the requirement of unmasking my weaknesses to the rest of the class and instructors. Prior to this week I had gotten quite comfortable sitting in my home office, banging my head against the wall and aggressively asking Google for answers. Google doesn't judge when I can't remember the correct syntax for selecting a column in pandas. But sitting in class, solving problems with others exposes exactly what I don't know to a group of people that I don't know, about a subject that I don't know enough about. So, at least we got that out of the way quickly.
How I got here
Many years ago I worked for a gaming company and was eventually in charge of the internal game production teams - 3 teams in total which each consisted of designers, developers, QA, and producers. The leadership of the company was looking for ways to expand the business and decided to investigate micro-transactions. At the time this was cutting edge - users were just starting to drop small amounts of money ($1 or less) on dressing up their avatar or gaining special tools for a game. Some companies were making loads of money this way and so company leadership thought this a great avenue to explore.
Since my role was to oversee how internal teams focused time and resources, I was tasked with making a business model for this new direction. I was invited into the office of the CEO and given direction for how the business model should look - and what sort of potential a successful game could provide our organization. So I went home and bought a book about writing a business plan. I researched various success scenarios and made up some potential revenue models. When I showed it to the CEO, he didn't think the model showed enough of the potential upside. My opinion was that the upside was extremely unlikely and the downside was incredibly costly. In the end I decided to show three models - great success, moderate success, and failure. The failure case was basically catastrophic for the company, and I had the sense that it was the most likely: this particular project had a longer-than-average timeline, a larger scope, and much higher production costs, and developing a new "hit" game is incredibly difficult - our company had many times predicted a game would be a hit only to watch it fail. Creative projects are fickle and subjective; appealing to a massive audience takes iterations, time, and investment, much of which our fast-paced work environment didn't allow for.
I left the company shortly after my business plan experience. I felt bullied and cornered into producing a document that supported a direction I didn't agree with. Other factors contributed to my leaving, but this was certainly a big part of it, and it stuck with me. For what it's worth, they didn't pursue the micro-transactions for that particular project and scaled down its scope quite a bit; I'm not sure how they made that decision. But I've always wanted to redo that business plan with correct predictions and real market research, and to have the gall to present it even though it wasn't what they wanted.
I realize this particular example is small relative to the expansive reach of data science overall but it was a turning point for me. I don't want to spend my time scratching the surface, I want to get my hands dirty, to dig deep. I want to find the best truth that I can with the tools and information available.
The next several years after leaving my job were a combination of mothering and exploring various career paths. I went back to work a couple of times in a part-time capacity and found it dissatisfying. In the end I found data science a good match for me because I've always wanted to dive deeper into coding, I've missed math, and I love giving a clean answer to a messy question. I liked Metis because I wanted quick progress and support with job placement.
Pursuit of creativity
A major goal for my time at Metis is to improve my ability to ask the right questions, to communicate more visually, and to reach into the unknown. I often claim that I am not creative. I don't think this is entirely true, however; I am simply not in touch with my creativity, and I squash creative expression before it makes it out of my subconscious and onto the floor. I believe this is self-preservation: I don't like failing. It is uncomfortable and vulnerable, a feeling to be avoided. The problem is that failure is a necessity if I want to grow - and I want to grow. So for week 2 at Metis I make a promise to try something risky, to allow myself to fail, and to ask for help.
I'll let you know how that goes.
Only 58 days left - not that I'm counting.