The assignment for this project was to use machine learning for classification. I found the National Survey on Drug Use and Health (NSDUH), which is sponsored by the Substance Abuse and Mental Health Services Administration (SAMHSA), an agency within the Department of Health and Human Services. The survey asks a random sample of the population ages 12 and older across the 50 states a series of questions about drug use and health, along with some general demographics. Respondents answer the sensitive drug-use questions on a private computer, which preserves anonymity and encourages honesty. The questionnaire takes about an hour to complete, and respondents receive $30 once it is done.
During initial data exploration I immediately became interested in the numbers representing the age of first use of a variety of drugs. My intuition led me to wonder: if children tried cigarettes, alcohol, or other drugs at an early age, would they be more likely to be dependent or heavy users later in life? As I explored the data, I found that age of first cigarette use had little impact, but marijuana and alcohol mattered much more.
Since the dataset provides a simple flag for drug dependence, I was able to look at the normalized distribution of age of first use for both marijuana and alcohol in two groups - drug addicts and non-addicts. I subtracted the two distributions, and the result is shown in the chart below:
Drug experimentation in youth
The chart shows the difference between the normalized distributions of the age at which addicts and non-addicts first used marijuana or alcohol.
There is a clear split at around age 14 between the addict and non-addict groups. The mean age of first marijuana use in the addict group was 14.65 years, whereas in the non-addict group it was 17.24. The mean age of first alcohol use in the addict group was about 14.29 years, versus 15.60 in the non-addict group. In both cases the mean was shifted by at least a year, and much more dramatically (more than two and a half years) in the case of marijuana. This shows just how important it is to keep younger children away from these substances until they are in their later teens or even twenties.
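The distribution comparison above can be sketched in a few lines of pandas and NumPy. This is a minimal illustration on toy data with hypothetical column names - the real NSDUH codebook uses its own variable codes:

```python
import numpy as np
import pandas as pd

def age_distribution_diff(df, age_col, dep_col, bins=range(5, 31)):
    """Difference between the normalized age-of-first-use histograms
    of the dependent and non-dependent groups."""
    addicts = df.loc[df[dep_col] == 1, age_col]
    others = df.loc[df[dep_col] == 0, age_col]
    # density=True normalizes each histogram so the groups are comparable
    h_add, edges = np.histogram(addicts, bins=bins, density=True)
    h_oth, _ = np.histogram(others, bins=bins, density=True)
    return pd.Series(h_add - h_oth, index=edges[:-1].astype(int))

# Toy data: dependents skew toward earlier first use
df = pd.DataFrame({
    "first_mj_age": [12, 13, 14, 15, 22, 18, 19, 20, 16, 25],
    "dependent":    [1,  1,  1,  1,  0,  0,  0,  0,  0,  0],
})
diff = age_distribution_diff(df, "first_mj_age", "dependent")
```

Positive values in `diff` mark ages where the dependent group is over-represented; the chart above is this series plotted for marijuana and alcohol.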
The survey results contained 3,157 columns to sift through, which was a huge undertaking. To keep it simple, I started with the age of first use of a variety of drugs. These alone weren't enough to produce a very predictive model, so I added some flags of my own, including one indicating whether the respondent started using in their early teens; I hoped this would emphasize the relationship I found above. I also pulled in general drug-usage flags - simple yes/no questions, represented numerically, about having used a long list of drugs from marijuana, to painkillers, to crack. Finally, I included general demographics like age, sex, and household income.
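The early-teen flag can be derived directly from the age-of-first-use columns. A minimal sketch, with hypothetical column names and an assumed sentinel value for "never used" (surveys like NSDUH encode non-use with special codes, so those must be excluded before thresholding):

```python
import pandas as pd

# Hypothetical column names; the real codebook uses its own codes
df = pd.DataFrame({
    "first_mj_age":  [12, 17, 14, 99, 21],   # 99 = never used (assumed sentinel)
    "first_alc_age": [13, 15, 13, 16, 18],
})

NEVER_USED = 99

# Flag respondents who first used a substance at 14 or younger
for col in ["first_mj_age", "first_alc_age"]:
    used = df[col] != NEVER_USED
    df[col + "_early"] = (used & (df[col] <= 14)).astype(int)
```

Leaving the sentinel in place would silently flag never-users as non-early users, which is why it is masked out first.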
My next challenge was imbalanced data. Only 2.5% of the data represented the 'addicted' group. Predicting on the full dataset produced high accuracy by never predicting the minority class, so I needed to do something to boost my under-represented class. I tried out a few tools for this purpose:
SMOTE (Synthetic Minority Over-sampling Technique) is a tool that generates synthetic minority-class data, in a somewhat black-box manner, based on the existing samples. Using it produced fabulous results on my training set - my ROC curve was nearly a triangle. First came excitement, then suspicion. I resampled my data with a different random seed and got much poorer results: the original model was overfitting. SMOTE would not work for my purposes.
RUS (Random Under Sampler) is a tool that removes a portion of the majority-class data to even out the class sizes (or to produce whatever ratio you request). Using it did not improve my model much, which was less exciting than the SMOTE failure.
In the end the best results came from logistic regression with C=10 and 'balanced' passed to the 'class_weight' parameter.
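A minimal sketch of that setup in scikit-learn, on synthetic data standing in for the survey features - `class_weight='balanced'` reweights each class inversely to its frequency, so the model is penalized for ignoring the rare addicted class:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the survey features, ~2.5% positive class
X, y = make_classification(n_samples=4000, n_features=20,
                           weights=[0.975], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# 'balanced' up-weights the minority class; C=10 is light regularization
model = LogisticRegression(C=10, class_weight="balanced", max_iter=1000)
model.fit(X_tr, y_tr)

pred = model.predict(X_te)
recall = recall_score(y_te, pred)
precision = precision_score(y_te, pred)
```

The exact scores depend on the data, but the characteristic trade-off is the one described below: high recall at the cost of many false positives, and therefore low precision.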
In my model the top three indicators of drug addiction were whether the respondent had ever tried any illicit drug, followed by having ever used marijuana or psychotherapeutics specifically. The model achieved a recall of .89, which I am happy with. Precision was only .11, which I still find acceptable: the low number is driven by a high count of false positives, and those false positives turn out to be interesting in their own right. They are shown in the chart below:
False positives. This chart shows the distribution of my model's predicted probability (above 50%) that a respondent will be dependent on illicit drugs.
I consider this graph to represent high-risk individuals: survey respondents whose answers indicate high-risk behaviors that may point to an undiagnosed drug addiction, or a path toward one. As you can see, a large number of Young Adults are predicted to be in this group. This makes intuitive sense, since in American culture Young Adults (defined here as 19-25 years old) tend to drink heavily and may also dabble in drug use, and my model would pick up on these behaviors as risks. The Adult group (26 and older) is smaller and could indicate that some respondents have not been entirely truthful on the survey, or have not yet crossed over into full addiction.
Save our children
The most important group is represented by the light blue bars in the graph above. These are the youths - children aged 12-18 who are participating in risky behaviors that could lead to drug addiction later in life. There is still time to educate, engage, or otherwise intervene with this group to redirect their path away from drug use and toward more meaningful ways of life.
I built a very preliminary, basic regression model to look at the youth respondents in the data. I was curious what sorts of factors affect a youth's ability to avoid the pitfalls of drug abuse. I found several things - some I expected and some I did not. Among the more impactful indicators were: experience selling drugs, experience with guns, the respondent's sentiment toward other people using drugs, the level of interest parents take in the child's schoolwork, and the child's level of involvement in after-school activities. The bottom line is that kids - even older teenagers - need adults to be involved and take an interest in their lives. Perhaps this is not surprising, but it is good to be reminded of where we can make an impact.
This week's project was focused on web scraping, cleaning data, feature engineering, and understanding linear regression. I decided to scrape a housing rental website to gather data about rent, amenities, and other features in order to predict rental prices for various locations around the Bay Area.
I accumulated around 3,000 data points from Oakland, Marin, San Francisco, and the South Bay and dug in. There was a large amount of information, ranging from the basics - beds, baths, square footage - to balconies, parking, heating, and even ceiling fans. BeautifulSoup made it very easy to extract the data I needed from all of the web pages. I stored the numerical data - bedrooms, bathrooms, square footage, and rent - directly. The rest was trickier: how to handle whether pets are allowed, or what sort of parking is available, was difficult to sort out. In the end I turned all of the non-numerical data into dummy variables, leaving only the address behind for further tinkering.
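The extract-then-dummify step can be sketched as follows. The HTML, class names, and amenity strings here are invented for illustration - the real site's markup would dictate the selectors:

```python
import pandas as pd
from bs4 import BeautifulSoup

# A toy listing page; real selectors depend on the site's markup
html = """
<div class="listing">
  <span class="beds">2</span>
  <span class="baths">1</span>
  <span class="sqft">850</span>
  <span class="rent">$2,400</span>
  <span class="amenity">Parking - Garage</span>
  <span class="amenity">Cats OK</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
row = {
    "beds": int(soup.select_one(".beds").text),
    "baths": int(soup.select_one(".baths").text),
    "sqft": int(soup.select_one(".sqft").text),
    "rent": int(soup.select_one(".rent").text.strip("$").replace(",", "")),
    "amenities": [a.text for a in soup.select(".amenity")],
}

# Turn each free-form amenity string into its own 0/1 dummy column
df = pd.DataFrame([row])
dummies = df["amenities"].explode().str.get_dummies().groupby(level=0).max()
df = df.drop(columns="amenities").join(dummies)
```

Each listing ends up as one row of numeric columns plus a 0/1 column per amenity, ready for a regression model.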
Most people have an instinct that location affects rental pricing, so I wanted to treat that data carefully to make sure I was picking up any signal it provided. Out of curiosity I tried fitting a variety of regression models with no location data, but the numbers weren't great and I knew I could do better. Using zip code, I got an R-squared of about .72 as a best result with a random forest. This was about a 10% improvement over the models without location, but I suspected I could improve it further with some feature engineering.
I explored the idea of using distance to a city center or to public transportation, but didn't find a standard that would work well. While thinking through the problem, I geocoded the locations with GeoPandas to obtain the latitude and longitude of each address. In the end I decided to assign a neighborhood to each location based on its geo-location. In some cases the neighborhood was simply the city (for example, cities in Marin), but in others I assigned more granular names - SOMA, Nob Hill, etc. I then calculated the average rent per neighborhood and sorted those values to define a rank for each location.
Heat maps of neighborhood rent (left) and rank (right) demonstrate how transforming the data picks up more of the signal.
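The rank transformation is a simple groupby once each listing is tagged with a neighborhood. A minimal sketch on toy listings (the neighborhoods and rents are invented):

```python
import pandas as pd

# Toy listings already tagged with a neighborhood (assigned from geocoding)
df = pd.DataFrame({
    "neighborhood": ["SOMA", "SOMA", "Nob Hill", "Nob Hill", "Oakland"],
    "rent":         [3400,   3600,   3000,       3200,       2200],
})

# Average rent per neighborhood, then rank neighborhoods by that average
avg_rent = df.groupby("neighborhood")["rent"].mean()
rank = avg_rent.rank(ascending=True).astype(int)   # 1 = cheapest

# Map the ordinal rank back onto each listing as a model feature
df["rank"] = df["neighborhood"].map(rank)
```

This replaces a high-cardinality categorical (neighborhood name) with a single ordinal feature that tree-based models can split on directly.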
The next step was modeling. I tried a variety of models to see if there were any unexpected fits, but as I had expected, the best came from tree-based models. Both random forest and gradient boosting fit at an R-squared of .77. I settled on gradient boosting because the features that stood out for random forest were suspect: it put ceiling fan in the top four influencers, which I found highly dubious. The gradient boosting results were much more explainable, ranking the top feature influencers as square footage, rank, beds, and baths, followed by lesser features that still carried real weight - high-speed internet, washer/dryer, pets allowed, and parking.
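Reading off the feature influencers in scikit-learn comes down to `feature_importances_`. A sketch on synthetic data where rent is built from square footage, rank, and beds (the coefficients are invented, chosen so square footage dominates as it did in the real model):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 500
# Synthetic stand-ins for the scraped features
sqft = rng.uniform(400, 2000, n)
rank = rng.integers(1, 20, n)
beds = rng.integers(0, 4, n)
rent = 1.5 * sqft + 80 * rank + 200 * beds + rng.normal(0, 100, n)

X = np.column_stack([sqft, rank, beds])
model = GradientBoostingRegressor(random_state=0).fit(X, rent)

# feature_importances_ sums to 1; larger = more influential in the splits
importances = dict(zip(["sqft", "rank", "beds"], model.feature_importances_))
```

Impurity-based importances like these can be biased toward high-cardinality features, which may be part of why the random forest's ceiling-fan ranking looked dubious; permutation importance is a common cross-check.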
My model could improve with more, and more varied, data. Combining current active rental information from a variety of sources - including some websites that are free to list on - would strengthen the model's predictive power.
This model can give property owners the ability to estimate how much they could raise the rent by adding a feature like a washer and dryer or by allowing pets. They can also evaluate whether the cost of providing internet would be offset by the additional rent its value supports. I am sure a lot of people can find value in that.