This week's project focused on web scraping, data cleaning, feature engineering, and understanding linear regression. I scraped a housing rental website to gather rent, amenities, and other features so I could predict rental prices for locations around the Bay Area.
I accumulated around 3,000 data points from Oakland, Marin, San Francisco, and the South Bay and dug in. The listings carried a wide range of information, from the basics - beds, baths, square footage - to balconies, parking, heating, and even ceiling fans. BeautifulSoup made it easy to extract the data I needed from each page. The numerical data - bedrooms, bathrooms, square footage, and rent - stored away cleanly, but the rest was trickier: how should I encode whether pets are allowed, or what sort of parking is available? In the end I converted all of the non-numerical data into dummy variables, leaving only the address behind for further tinkering.
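The extraction-plus-encoding step might look roughly like this. The HTML structure and class names here are invented stand-ins (the real site's markup will differ), but the pattern of pulling fields with BeautifulSoup and then calling `pd.get_dummies` on the categorical columns is the same:

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical listing markup; the real site's tags and classes will differ.
html = """
<div class="listing">
  <span class="rent">$2,500</span>
  <span class="beds">2</span>
  <span class="baths">1</span>
  <span class="sqft">850</span>
  <span class="pets">Cats OK</span>
  <span class="parking">Street</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for card in soup.select("div.listing"):
    rows.append({
        # Strip "$" and "," so rent parses as a number.
        "rent": float(card.select_one(".rent").text.strip("$").replace(",", "")),
        "beds": int(card.select_one(".beds").text),
        "baths": int(card.select_one(".baths").text),
        "sqft": int(card.select_one(".sqft").text),
        "pets": card.select_one(".pets").text,
        "parking": card.select_one(".parking").text,
    })

df = pd.DataFrame(rows)
# Turn the non-numerical columns into 0/1 dummy variables.
df = pd.get_dummies(df, columns=["pets", "parking"])
print(df.columns.tolist())
```

One dummy column appears per category value (e.g. `pets_Cats OK`), which is what lets tree-based models split on them later.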
Most people have an instinct that location affects rental pricing, so I wanted to treat that data carefully and capture whatever signal it provided. Out of curiosity I first fit a variety of regression models with no location data at all, but the numbers weren't great and I knew I could do better. Adding zip code gave me a best result of about .72 R-squared using Random Forest, roughly a 10% improvement over the location-free models, but I suspected some feature engineering could push it further.
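The zip-code baseline amounts to one-hot encoding the zip and scoring a Random Forest on held-out data. This sketch uses synthetic data (not the scraped dataset) purely to show the mechanics:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: rent depends on size, beds, and a per-zip premium.
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "sqft": rng.integers(400, 2500, n),
    "beds": rng.integers(0, 5, n),
    "baths": rng.integers(1, 4, n),
    "zip": rng.choice(["94110", "94607", "94901"], n),
})
zip_premium = {"94110": 800, "94607": 200, "94901": 500}
df["rent"] = (1.5 * df["sqft"] + 300 * df["beds"]
              + df["zip"].map(zip_premium) + rng.normal(0, 200, n))

# One-hot encode zip, then score on a held-out test split.
X = pd.get_dummies(df.drop(columns="rent"), columns=["zip"])
X_train, X_test, y_train, y_test = train_test_split(
    X, df["rent"], random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print(r2_score(y_test, model.predict(X_test)))
```

Dropping the `zip_*` columns from `X` and refitting gives the location-free comparison point.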
I explored the idea of a distance-to-city-center or distance-to-transit feature but didn't find a standard that would work well. While thinking through the problem, I geocoded the addresses with GeoPandas to obtain a latitude and longitude for each one. In the end I decided to assign a neighborhood to each location based on its geolocation. In some cases the neighborhood was simply the city (for example, the cities in Marin), but in others I assigned more granular names - SOMA, Nob Hill, and so on. Then I calculated the average rent per neighborhood and sorted those values to define a rank for each location.
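Once each listing has a neighborhood label, the rank feature is a small groupby. A sketch with toy data standing in for the geocoded listings:

```python
import pandas as pd

# Toy data standing in for the geocoded listings.
df = pd.DataFrame({
    "neighborhood": ["SOMA", "SOMA", "Nob Hill", "Oakland",
                     "Oakland", "Sausalito"],
    "rent": [3200, 3400, 3800, 2100, 2300, 3000],
})

# Average rent per neighborhood, then rank the neighborhoods by that
# average (1 = cheapest). The rank becomes an ordinal model feature.
avg_rent = df.groupby("neighborhood")["rent"].mean()
rank = avg_rent.rank(method="dense").astype(int)
df["rank"] = df["neighborhood"].map(rank)
print(df)
```

Replacing a high-cardinality categorical (neighborhood) with a single ordinal column keeps the price ordering while giving tree models one clean axis to split on.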
Heat maps of neighborhood rent (left) and rank (right) demonstrate how transforming the data picks up more of the signal.
The next step was modeling. I tried a variety of models in case anything fit unexpectedly well, but as expected, the best fits came from tree-based models: both Random Forest and Gradient Boosting reached an R-squared of .77. I settled on Gradient Boosting because the features Random Forest singled out were suspect - it put ceiling fan among the top four influencers, which I found highly dubious. The Gradient Boosting results were much more explainable, ranking the top influencers as square footage, neighborhood rank, beds, and baths, followed by the lesser but still meaningful features: high-speed internet, washer/dryer, pets allowed, and parking.
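The importance ranking comes straight from the fitted model's `feature_importances_` attribute. A minimal sketch, again on synthetic data with made-up coefficients rather than the real listings:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic features; square footage dominates rent by construction.
rng = np.random.default_rng(1)
n = 800
X = pd.DataFrame({
    "sqft": rng.integers(400, 2500, n),
    "rank": rng.integers(1, 20, n),
    "beds": rng.integers(0, 5, n),
    "baths": rng.integers(1, 4, n),
    "washer_dryer": rng.integers(0, 2, n),
})
y = (2.0 * X["sqft"] + 120 * X["rank"] + 250 * X["beds"]
     + 150 * X["baths"] + 100 * X["washer_dryer"]
     + rng.normal(0, 150, n))

model = GradientBoostingRegressor(random_state=0).fit(X, y)
# Importances sum to 1; higher means the feature drove more splits.
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```

If a near-noise feature like ceiling fan lands near the top of this ranking, that's a reasonable cue to distrust the model's fit, which is the sanity check applied above.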
My model could improve with more, and more varied, data. Combining current active rental listings from a variety of sources - including some websites that are free to list on - would strengthen the model's predictive power.
This model can help landlords estimate how much more rent they could charge after adding a feature like a washer and dryer, or after allowing pets. They can also evaluate whether the cost of providing internet would be offset by the extra rent its added value commands. I am sure a lot of people could find value in that.
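That what-if question is just two predictions on the same unit with one dummy toggled. A sketch under the same caveat as before - synthetic data with an assumed ~$100 washer/dryer premium, not the real model:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in data: rent rises ~$100 with an in-unit washer/dryer.
rng = np.random.default_rng(2)
n = 600
X = pd.DataFrame({
    "sqft": rng.integers(400, 2000, n),
    "washer_dryer": rng.integers(0, 2, n),
})
y = 2.0 * X["sqft"] + 100 * X["washer_dryer"] + rng.normal(0, 50, n)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

# Predict the same unit with and without the amenity; the difference
# is the model's estimate of the rent a washer/dryer adds.
unit = pd.DataFrame({"sqft": [900], "washer_dryer": [0]})
without = model.predict(unit)[0]
with_wd = model.predict(unit.assign(washer_dryer=1))[0]
print(round(with_wd - without))
```

Compare that delta against the amenity's installation cost to decide whether the upgrade pays for itself.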