The fourth project at Metis required using Natural Language Processing, so naturally I decided to work with programming languages. I built a tool that, given a new Stack Overflow question, recommends an existing Stack Overflow question, an answer, and a GitHub repository.
To acquire the data I used the Stack Overflow API, quickly gathering 100,000 Stack Overflow questions and 100,000 answers. Next, I decided to use the READMEs from GitHub repositories to identify relevant repositories to suggest alongside Stack Overflow questions.
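For reference, here is a minimal sketch of pulling questions through the public Stack Exchange API. The `/2.3/questions` endpoint and its parameters are the documented route, but the page count and filter shown are illustrative rather than the exact values I used:

```python
import requests

# Minimal sketch: page through the public Stack Exchange API.
# "withbody" is a documented built-in filter that includes the
# question body text in the response.
def fetch_questions(pages=10):
    questions = []
    for page in range(1, pages + 1):
        resp = requests.get(
            "https://api.stackexchange.com/2.3/questions",
            params={
                "site": "stackoverflow",
                "pagesize": 100,       # API maximum per request
                "page": page,
                "filter": "withbody",  # include the question body
            },
        )
        resp.raise_for_status()
        questions.extend(resp.json()["items"])
    return questions
```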
GitHub's API only allows 5,000 requests per user per hour, and the API timed out regularly. With a tight deadline, this greatly reduced the number of READMEs I was able to acquire in the time I had. After four days of effort I ended up with 100,000 READMEs. Many of those were empty, and there were a fair number of duplicates as well. After cleaning, I had roughly 50,000 unique READMEs remaining; that's less than 1% of the total repositories on GitHub. My time was limited, and in order to stay on track I needed to move forward, so I moved on to modeling with this limited data.
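Something along these lines captures the acquisition and cleanup. The `GET /repos/{owner}/{repo}/readme` endpoint is GitHub's documented route (it returns base64-encoded content); the token, the `texts` list, and the pandas cleanup are placeholders for how the pieces fit together:

```python
import base64
import pandas as pd
import requests

# Fetch one README via GitHub's documented readme endpoint.
# Returns "" on a missing README, rate-limit rejection, or timeout.
def fetch_readme(owner, repo, token):
    resp = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}/readme",
        headers={"Authorization": f"token {token}"},
    )
    if resp.status_code != 200:
        return ""
    return base64.b64decode(resp.json()["content"]).decode("utf-8", "ignore")

# Cleanup: drop empty READMEs, then drop exact duplicates (clones
# and forks often ship identical READMEs).
readmes = pd.DataFrame({"text": texts})  # texts gathered over many runs
readmes["text"] = readmes["text"].str.strip()
readmes = readmes[readmes["text"] != ""]
readmes = readmes.drop_duplicates("text")
```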
The first thing I did was use CountVectorizer with LSI to see how easily I could build a model. It was immediately clear that there was a lot of HTML formatting that needed to be removed in order to pick up the more relevant terms in the text, so I added many stop words to cut out HTML keywords, in addition to the built-in scikit-learn English stop words. After experimenting with a number of parameters, I found that word n-grams of 1-3 captured many of the patterns in the code that would indicate a match for recommended content.
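A sketch of that first-pass vectorizer follows; the HTML stop words listed are a small illustrative sample of the full list, and the vocabulary cap is likewise illustrative:

```python
from sklearn.feature_extraction.text import (CountVectorizer,
                                             ENGLISH_STOP_WORDS)

# A small sample of the HTML keywords added as extra stop words.
html_stop_words = ["div", "span", "href", "http", "www", "img", "src"]

vectorizer = CountVectorizer(
    stop_words=list(ENGLISH_STOP_WORDS) + html_stop_words,
    ngram_range=(1, 3),   # unigrams through trigrams caught code patterns
    max_features=50000,   # illustrative cap on vocabulary size
)
counts = vectorizer.fit_transform(documents)  # documents: list of raw strings
```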
I combined the Stack Overflow questions, answers, and READMEs into one dataset in which each document corresponded to either a question, an answer, or a README. I stemmed the documents using LancasterStemmer and then trained a tf-idf model on them. Next, I created vectors using my tf-idf model and passed them to LSI to transform them into the semantic space, creating an index that could be referenced to find the closest match for an incoming question.
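This follows gensim's standard tf-idf to LSI to similarity-index flow. A condensed sketch, where `documents` is the combined list of question, answer, and README strings and the number of topics is illustrative:

```python
from gensim import corpora, models, similarities
from nltk.stem import LancasterStemmer

# Stem and tokenize every document in the combined corpus.
stemmer = LancasterStemmer()
tokenized = [[stemmer.stem(w) for w in doc.split()] for doc in documents]

# Build the vocabulary and bag-of-words corpus.
dictionary = corpora.Dictionary(tokenized)
bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized]

tfidf = models.TfidfModel(bow_corpus)       # train tf-idf weights
lsi = models.LsiModel(tfidf[bow_corpus],
                      id2word=dictionary,
                      num_topics=300)       # project into the semantic space
index = similarities.MatrixSimilarity(lsi[tfidf[bow_corpus]])
```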
For the tool itself, I built a Flask app that accepts a string, applies pre-processing, and converts it to a vector using the same methods I used to create my model. I then query gensim's MatrixSimilarity index with this vector to recommend a similar question, answer, and GitHub repository.
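A stripped-down sketch of the endpoint, assuming the `stemmer`, `dictionary`, `tfidf`, `lsi`, `index`, and `documents` objects from the pipeline above are already loaded; the route name and response shape here are illustrative:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/recommend", methods=["POST"])
def recommend():
    text = request.get_json()["question"]
    # Apply the same stemming and transforms used at training time.
    tokens = [stemmer.stem(w) for w in text.split()]
    vec = lsi[tfidf[dictionary.doc2bow(tokens)]]
    # Rank every indexed document by cosine similarity in LSI space.
    sims = sorted(enumerate(index[vec]), key=lambda x: -x[1])
    best = [documents[i] for i, _ in sims[:3]]  # top matches across doc types
    return jsonify({"matches": best})
```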
The model works well with certain patterns. It identifies Java, jQuery, Python, SQL, and others, and it performs very well on highly repetitive code syntax. An example is shown here with a Python question:
I'm trying to learn python and came across some code that is nice and short but doesn't totally make sense
the context was:
I get what it's doing, but why does python do this - ie return the value rather than True/False?
Is this legit/reliable style, or are there any gotchas on this?
The text above shows the question being put into the model.
Below are the question, answer, and suggested GitHub repository returned by the model.
In python, if I have a function
g(x) is defined before and takes time to calculate.
some thing like this
When I am calculating f(x), will python calculate g(x) two times when substitute g(x) in to the equation? If it does calculate two times, how people usually handle it in python?
Yes, I'm sure you're talking about subroutine args! In Python, you don't receive arguments in a single list '@_' as in Perl. instead, the simplest way to do this is:
This is roughly equivalent to:
Now if you must receive your args as a list or as a tuple (both are collections), this should do:
You can replace `args` with anything you like, staying within Python variable naming rules.
I recommend you read up on the book linked to in the previous answer and especially on functions, as Python has a plethora of possibilities when it comes to functions - positional and keyword arguments and the likes..
Created based on the need for an easy to use and understand GA library.
There are a few simple steps to usage that is explained in the
As you can see from the example, the first three matched relatively well: they all involved basic Python functions and minor questions about correct syntax or usage. The README is a less successful match, and I will explain why below.
The problem with GitHub
GitHub READMEs are unreliable. They are often incomplete, and there are a lot of junk repositories and plenty of clones. It would be a huge undertaking for someone working through the API to capture enough data about the repos to make the model work better. Still, I would love to include even just the READMEs from the top 100,000 repositories; the model would be greatly improved and would return better content. I hope that this can be achieved in some manner in the future.
Where to go from here
GitHub and Stack Overflow host a wealth of technical talent, and it seems natural that they should work together in some harmonious way. A tool like this could suggest contributing developers from Stack Overflow for relevant GitHub projects. Similarly, it could help developers on Stack Overflow uncover projects relevant to their efforts; perhaps they would discover some new code that helps with their own project, or find a new project to contribute to.
From here, I hope to have more time to tune my model with more data, much-improved GitHub READMEs, and better n-grams through more detailed regular-expression matching, and to explore other improvements as well.