Be Cunning

Stack Overflow Immediate Response Engine

11/13/2017

The fourth project in Metis required the use of Natural Language Processing, so naturally I decided to work with programming languages. I built a tool that, given a new Stack Overflow question, recommends an existing Stack Overflow question, an answer, and a GitHub repository.

Data Acquisition

To acquire the data I used the Stack Overflow API, quickly gathering 100,000 Stack Overflow questions and 100,000 answers. Next I decided to use the READMEs from GitHub repositories to identify relevant repositories to suggest alongside Stack Overflow questions.
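As a rough sketch of how one page of that data can be requested, here is a URL builder for the public Stack Exchange API's /questions endpoint (the parameters come from its documentation; the sort order and `withbody` filter are illustrative choices, not necessarily the ones I used):

```python
import urllib.parse

API_BASE = "https://api.stackexchange.com/2.2"

def questions_url(page, pagesize=100):
    """Build a URL for one page of Stack Overflow questions.

    The /questions endpoint and parameters are from the Stack Exchange
    API docs; `filter=withbody` asks for question bodies in the response.
    """
    params = {
        "site": "stackoverflow",
        "order": "desc",
        "sort": "creation",
        "page": page,
        "pagesize": pagesize,  # 100 is the API's maximum page size
        "filter": "withbody",
    }
    return API_BASE + "/questions?" + urllib.parse.urlencode(params)

print(questions_url(1))
```

Looping `page` upward and fetching each URL is enough to accumulate questions quickly; answers come from the analogous /answers endpoint.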

GitHub's API only allows 5,000 requests per user per hour, and it timed out regularly. With a tight deadline, this greatly reduced the number of READMEs I was able to acquire in the time that I had. After four days of effort I ended up with 100,000 READMEs. Many of those were empty, and there were a fair number of duplicates as well. After cleaning, roughly 50,000 unique READMEs remained; that's less than 1% of the repositories on GitHub. My time was limited and I needed to stay on track, so I moved on to modeling with this limited data.
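For the README side, GitHub's REST API exposes a per-repository README endpoint. A minimal sketch of building one such request (the endpoint and raw media type are from GitHub's documentation; authenticating with a token is what grants the 5,000 requests/hour limit mentioned above):

```python
import urllib.request

def readme_request(owner, repo, token=None):
    """Build a request for a repository's README via the GitHub REST API.

    Uses the documented /repos/{owner}/{repo}/readme endpoint; the raw
    media type returns the README body directly instead of JSON metadata.
    """
    url = "https://api.github.com/repos/{}/{}/readme".format(owner, repo)
    headers = {"Accept": "application/vnd.github.v3.raw"}
    if token:
        # Authenticated requests get the 5,000 requests/hour rate limit.
        headers["Authorization"] = "token " + token
    return urllib.request.Request(url, headers=headers)
```

In practice one also has to watch the `X-RateLimit-Remaining` response header and back off when it nears zero, which is part of why collection was so slow.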

Modeling

The first thing I did was use CountVectorizer with LSI to see how easily I could build a model. It was immediately clear that a lot of HTML formatting code needed to be removed in order to pick up the more relevant terms in the text, so I added many stop words to cut out HTML keywords, in addition to sklearn's built-in English stop words. After experimenting with a number of parameters, I found that word n-grams of 1-3 captured many of the code patterns that indicate a match for recommended content.
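A sketch of that vectorizer setup with scikit-learn; the HTML stop words shown here are a small illustrative subset, not my actual list:

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Illustrative subset of HTML-ish tokens to drop, on top of sklearn's
# built-in English stop words; the real list was considerably longer.
html_stop_words = {"div", "span", "href", "http", "www", "pre", "blockquote", "img"}

vectorizer = CountVectorizer(
    stop_words=list(ENGLISH_STOP_WORDS | html_stop_words),
    ngram_range=(1, 3),  # unigrams, bigrams, and trigrams
)

docs = [
    "div span def fn args return len args",
    "href http www return len args and max args",
]
X = vectorizer.fit_transform(docs)
```

With `ngram_range=(1, 3)`, short code idioms like "return len args" survive as single vocabulary terms, which is what lets repetitive syntax act as a matching signal.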

I combined the Stack Overflow questions, answers, and READMEs into one dataset in which each document corresponded to either a question, an answer, or a README. I stemmed the documents using LancasterStemmer and then trained a tf-idf model on them. Next I created vectors using my tf-idf model and passed them to LSI to transform them into the semantic space, building an index that could be queried to find the closest match for an incoming question.

For the tool itself, I built a Flask app that accepts a string, applies the same pre-processing, and converts it to a vector using the same methods I used to create my model. I then pass this vector to gensim's MatrixSimilarity index to recommend a similar question, answer, and GitHub repository.
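A minimal sketch of such a Flask endpoint; `recommend` here is a hypothetical stand-in for the real pre-processing and MatrixSimilarity lookup:

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

def recommend(question_text):
    """Hypothetical stand-in: the real version stems and vectorizes the
    text, then queries the gensim MatrixSimilarity index built above."""
    return {
        "question": "closest matching Stack Overflow question",
        "answer": "closest matching answer",
        "repo": "closest matching GitHub README",
    }

@app.route("/recommend", methods=["POST"])
def recommend_endpoint():
    # Accept a raw question string and return the three recommendations.
    question = request.get_json()["question"]
    return jsonify(recommend(question))
```

The route does nothing clever itself; all the work lives in the shared pre-processing and similarity code, so the app and the offline modeling stay consistent.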

Results

The model works well with certain patterns. It identifies Java, jQuery, Python, SQL, and others, and it performs very well on highly repetitive code syntax. An example is shown here with a Python question:

QUESTION:

I'm trying to learn python and came across some code that is nice and short but doesn't totally make sense

the context was:

def fn(*args):
    return len(args) and max(args)-min(args)

I get what it's doing, but why does python do this - ie return the value rather than True/False?

10 and 7-2

returns 5.

Is this legit/reliable style, or are there any gotchas on this?

The above is the question given to the model.

Below is the returned question, answer, and suggested GitHub repository from the model.

RELATED QUESTION:


In python, if I have a function

f(x)=(g(x)+1)(g(x))

g(x) is defined before and takes time to calculate.

some thing like this

def g(x):
  return value
def f(x):
  return (g(x)+1)*(g(x))

When I am calculating f(x), will python calculate g(x) two times when substitute g(x) in to the equation? If it does calculate two times, how people usually handle it in python?

ANSWER:


Yes, I'm sure you're talking about subroutine args! In Python, you don't receive arguments in a single list '@_' as in Perl. Instead, the simplest way to do this is:

def func(arg1, arg2):
   min = arg1
   max = arg2
   #do your stuff

This is roughly equivalent to:

sub func{
   my($min, $max) = @_;
   #your stuff
}

Now if you must receive your args as a list or as a tuple (both are collections), this should do:

def func(args):
    min, max = args[0], args[1]
    #your stuff

You can replace `args` with anything you like, staying within Python variable naming rules.

I recommend you read up on the book linked to in the previous answer, especially on functions, as Python has a plethora of possibilities when it comes to functions - positional and keyword arguments and the like.

GITHUB REPO:


Github Repo

Genetix.py

Created based on the need for an easy to use and understand GA library.

Installation

$ pip install https://github.com/ryanleland/Genetix.py

Usage

There are a few simple steps to usage that are explained in the demo.py file in the root of the project. Please feel free to run it, and change values to see how it changes the outcome.

  1. Import and instantiate a Population class.

        from genetix.population import Population
        population = Population()

  2. Set the population size and a blueprint for the chromosome.

    • Note that each item in the dictionary represents a named Gene, which can have any possibility based on a provided range, or list.

        population.populate(10, {
            0: range(0, 100),
            1: range(0, 100),
            2: range(0, 100),
            3: range(0, 100),
            4: range(0, 100),
            5: range(0, 100),
            6: range(0, 100),
            7: range(0, 100),
            8: range(0, 100),
            9: range(0, 100)
        })

  3. Decorate a function to evaluate fitness on each chromosome. It simply has to return a numeric value that can be sorted on (higher is better).

        @population.fitness
        def max(chromosome):
            # Return a sum of all the gene values.
            return sum([g.value for g in chromosome.genes])

  4. Evolve the population.

        for g in population.evolve(100):
            print population.fittest()

As you can see from the example, the first three matched relatively well. They all involved basic Python functions and minor questions about correct syntax or usage. The README is a less successful match, and I will explain why here.

The problem with GitHub

GitHub READMEs are unreliable. They are often incomplete, and there are a lot of junk repositories and tons of clones. It would be a huge undertaking for someone working through the API to capture enough data about the repos to make the model work better. Still, I would love to include even just the READMEs from the top 100,000 repositories; the model would be greatly improved and return better content. I hope this can be achieved in some manner in the future.

Where to go from here

GitHub and Stack Overflow host a wealth of technical talent. It seems natural that they should work together in some harmonious way. A tool like this could suggest contributing developers from Stack Overflow for relevant GitHub projects. Similarly, it could help developers on Stack Overflow uncover projects relevant to their efforts. Perhaps they will discover some new code that helps with their own project, or possibly find a new project to contribute to.

From here, I hope to have more time to tune my model with more data, much-improved GitHub READMEs, better n-grams through more detailed regular-expression matching, and other improvements.

    Rebekah Cunningham

    In pursuit of a balanced life
