The assignment for this project was to use machine learning for classification. I found the National Survey on Drug Use and Health (NSDUH), which is sponsored by the Substance Abuse and Mental Health Services Administration (SAMHSA), an agency within the Department of Health and Human Services. The survey asks a random sample of the population aged 12 and older across the 50 states a series of questions about drug use and health, along with some general demographics. To promote honesty, the survey protects anonymity by letting respondents answer sensitive drug-use questions privately on a computer. The questionnaire takes about an hour to complete and the respondent receives $30 once it is done.
During initial data exploration I immediately became interested in the numbers representing the age of first use of a variety of drugs. My intuition led me to wonder: if children tried cigarettes, alcohol, or any drugs at an early age, would they be more likely to be dependent or heavy users later in life? As I explored the data, I found that cigarettes had little impact, but marijuana and alcohol had much more.
Since the survey provides a simple flag for drug dependence, I was able to look at the normalized distribution of age of first use for both marijuana and alcohol in two groups - drug addicts and non-addicts. I subtracted the non-addict distribution from the addict distribution, and the result is shown in the chart below:
Drug experimentation in youth
The chart shows the difference in normalized distribution of age that addicts first used marijuana or alcohol from that of non-addicts
There is a clear change at around age 14 between the addict and non-addict group. The mean age of first marijuana use of the addict group was 14.65 years old, whereas for the non-addict group it was 17.24. The mean age of first alcohol use for the addict group was about 14.29 years old, and 15.60 for the non-addict group. In both cases the mean was shifted by at least a year and much more dramatically (almost 3 years) in the case of marijuana. This shows just how important it is to keep younger children away from these substances until they are in their later teens or even twenties.
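The comparison above (normalize each group's age-of-first-use distribution, then subtract) can be sketched roughly as follows. This is only an illustration: the ages are made up, the function name and age range are my own assumptions, and the survey's actual column names are not used.

```python
import numpy as np

def normalized_age_distribution(ages, min_age, max_age):
    """Return the fraction of a group whose first use fell at each integer age."""
    ages = np.asarray(ages)
    # One bin per integer age from min_age to max_age inclusive.
    counts = np.bincount(ages - min_age, minlength=max_age - min_age + 1)
    return counts / counts.sum()

# Made-up ages of first use, for illustration only (not survey data).
dependent_ages = [12, 13, 13, 14, 14, 14, 15, 16]
non_dependent_ages = [14, 15, 16, 16, 17, 17, 18, 19, 20, 21]

dep = normalized_age_distribution(dependent_ages, min_age=10, max_age=25)
non = normalized_age_distribution(non_dependent_ages, min_age=10, max_age=25)

# Positive values mark ages where the dependent group is over-represented.
difference = dep - non
for age, d in zip(range(10, 26), difference):
    print(age, round(float(d), 3))
```

Normalizing first matters because the two groups are very different in size; subtracting raw counts would mostly reflect that size gap rather than a shift in age.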
The survey results contained 3157 columns to sift through, which was a huge amount of information to take in. To keep it simple I started by taking the data for age of first use of a variety of drugs from the survey. However, these features alone weren't enough to produce a very predictive model. I added some flags of my own, including one to indicate whether the respondent started using in their early teens. I hoped that this would help emphasize the relationship I discovered above. I also pulled in general drug usage flags - simple yes/no questions about having used a large list of drugs from marijuana, to pain killers, to crack - which were represented numerically. Next I included general demographics like age, sex, household income, etc.
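A derived flag like the early-use indicator described above could be built along these lines. This is a sketch under assumptions: the cutoff age, the variable names, and the use of NaN for respondents who never used are all mine, not the survey's coding.

```python
import numpy as np

# Hypothetical ages of first marijuana use; NaN means the respondent never used.
age_first_use = np.array([13.0, 17.0, np.nan, 15.0, 22.0, 12.0])

# Assumed cutoff: first use at age 15 or younger counts as "early".
EARLY_CUTOFF = 15

# Comparisons with NaN evaluate to False, so never-users get a flag of 0.
early_use_flag = (age_first_use <= EARLY_CUTOFF).astype(int)
print(early_use_flag)  # [1 0 0 1 0 1]
```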
My next challenge was imbalanced data. Only 2.5% of the data represented the 'addicted' group. A model trained on the full dataset achieved high accuracy simply by never predicting the minority class, so I needed to do something to boost my under-represented data. I tried out a few tools for this purpose:
SMOTE (Synthetic Minority Over-sampling TEchnique) is a tool that generates synthetic minority-class data in a black-box manner, based on the existing data in your dataset. Using this produced fabulous results on my training set. My ROC curve was nearly a triangle. First came excitement, and next came suspicion. I resampled my data with a different random seed and had much poorer results. The original model was overfitting. SMOTE would not work for my purposes.
RUS (Random Under Sampler) is a tool that removes a defined amount of the over-represented (majority class) data to even out the sample size (or produce the ratio that you request). Using this did not improve my model much and was less exciting than the SMOTE failure.
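To make the under-sampling idea concrete, here is a minimal NumPy sketch. The post does not say which implementation was used, so this is not that tool; the function name, the `ratio` parameter, and the toy data are assumptions for illustration.

```python
import numpy as np

def random_undersample(X, y, ratio=1.0, seed=0):
    """Keep every minority row (y == 1) and a random subset of majority rows.

    ratio is the desired minority/majority size ratio after resampling.
    """
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    minority_idx = np.flatnonzero(y == 1)
    majority_idx = np.flatnonzero(y == 0)
    # Number of majority rows needed to hit the requested ratio.
    n_keep = min(len(majority_idx), int(round(len(minority_idx) / ratio)))
    kept_majority = rng.choice(majority_idx, size=n_keep, replace=False)
    keep = np.sort(np.concatenate([minority_idx, kept_majority]))
    return X[keep], y[keep]

# Toy data at the 2.5% minority rate mentioned above: 25 of 1000 rows.
y = np.array([1] * 25 + [0] * 975)
X = np.arange(1000).reshape(-1, 1)

X_res, y_res = random_undersample(X, y, ratio=1.0)
print(len(y_res), int(y_res.sum()))  # 50 25
```

Whichever resampler is used, it should be applied only to the training split; evaluating on resampled data can give overly optimistic results of the kind described for SMOTE.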
In the end the best results came from Logistic Regression using C=10 and setting the 'class_weight' parameter to 'balanced', which weights each class inversely to its frequency instead of resampling the data.
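A setup like this might look as follows in scikit-learn. This is a sketch rather than the actual model: the data is synthetic, the 5% minority rate and single informative feature are assumptions, and only C=10 and class_weight='balanced' come from the text above. An unweighted model is fitted alongside for comparison.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (assumption): about 5% positives, one useful feature.
rng = np.random.default_rng(0)
n = 4000
y = (rng.random(n) < 0.05).astype(int)
X = rng.normal(size=(n, 3))
X[:, 0] += 2.0 * y  # shift the first feature for the positive class

# Stratify so both splits keep the same class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Settings from the text: C=10 and class_weight='balanced'.
balanced = LogisticRegression(C=10, class_weight="balanced", max_iter=1000)
balanced.fit(X_train, y_train)

# Same model without class weighting, for comparison.
plain = LogisticRegression(C=10, max_iter=1000)
plain.fit(X_train, y_train)

recall_balanced = recall_score(y_test, balanced.predict(X_test))
recall_plain = recall_score(y_test, plain.predict(X_test))
precision_balanced = precision_score(y_test, balanced.predict(X_test))

print("recall (balanced):", round(recall_balanced, 2))
print("recall (unweighted):", round(recall_plain, 2))
print("precision (balanced):", round(precision_balanced, 2))
```

On data like this, class weighting typically trades precision for recall: the weighted model flags many more of the true positives but also raises more false alarms.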
In my model the top three indicators of drug addiction were whether the respondent had ever tried any illicit drug, followed by whether they had ever used marijuana or psycho-therapeutics specifically. My model provided a Recall of .89, which I am happy with. Precision was only .11, which I find acceptable. On the surface this Precision number might seem small, but it reflects a high number of false positives: respondents the model flags as dependent who are not marked as dependent in the survey. The false positives are shown in the chart below:
False Positives. This chart shows the distribution of predicted probability (above 50%) of my model that a respondent will be dependent on illicit drugs.
I consider this graph to represent the high risk individuals. That is, survey respondents who have indicated through their answers that they have high risk behaviors which may indicate an undiagnosed drug addiction or who may be on a path toward drug addiction. As you can see, there are a large number of Young Adults predicted to be in this group. This makes intuitive sense since in American culture Young Adults (defined here as 19-25 years old) tend to drink heavily and may also dabble in drug use. My model would pick up on these types of behavior as risks. The Adult group (26 and older) is smaller, and could consist of people who have not been entirely truthful on the survey or who have not crossed over to full addiction yet.
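To see how a Recall of .89 and a Precision of .11 can coexist, and why accuracy alone was misleading earlier, here is a small worked example. The confusion-matrix counts are hypothetical, chosen only so that they reproduce the reported rates at a 2.5% prevalence; they are not the model's actual counts.

```python
def precision_recall_accuracy(tp, fp, fn, tn):
    """Compute precision, recall and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy

# Hypothetical 4,000 respondents at 2.5% prevalence: 100 dependent, 3,900 not.

# A model that never predicts dependence: high accuracy, zero recall.
p0, r0, a0 = precision_recall_accuracy(tp=0, fp=0, fn=100, tn=3900)
print(a0, r0)  # 0.975 0.0

# Counts chosen to match the reported recall (.89) and precision (.11).
p1, r1, a1 = precision_recall_accuracy(tp=89, fp=720, fn=11, tn=3180)
print(round(p1, 2), round(r1, 2), round(a1, 2))  # 0.11 0.89 0.82

# Roughly how many false positives accompany each true positive.
print(round(720 / 89, 1))  # 8.1
```

Under these assumed counts, the model flags about eight non-dependent respondents for every dependent one it catches, which is the large false positive group discussed above.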
Save our children
The most important group is the one shown by the light blue bars in the graph above. These indicate the youths -- children aged 12-18 who are participating in risky behaviors that could lead to drug addiction later in life. There is still time to educate, engage, or otherwise intervene with these groups to redirect their path away from drug use and toward more meaningful ways of life.
I built a very preliminary and basic regression model to look at the youth group respondents in the data. I was curious what sort of factors impact a youth's ability to avoid the pitfalls of drug abuse. I found several things -- some of which I expected and some I did not. Some of the more impactful indicators were: whether the respondent had experience selling drugs, having experience with guns, the respondent's sentiment toward other people using drugs, the level of interest parents take in the child's schoolwork, and the child's level of involvement in after school activities. The bottom line is that kids - even older teenagers - need adults to be involved and take an interest in their lives. Perhaps this is not surprising but it is good to be reminded of where we can make an impact.