Melbourne House Price Prediction using XGBoost

In the previous article we used XGBoost for a classification problem on the Titanic dataset. In this article we will do regression on the Melbourne Housing dataset using Extreme Gradient Boosting through the caret package in R. We will predict house prices and try to get the best accuracy while avoiding overfitting.
The dataset does not come with a description of its variables, but most of them are self-explanatory and the rest are easy to look up online.
Download and Load Dataset
The first step is to download the dataset from Kaggle or Github. Housing datasets come in a variety of forms, and you may find versions with more than 70 variables, but this one is fairly simple and the analysis is done for the sake of learning.
train <- read.csv('Aus_housing.csv', stringsAsFactors =FALSE)
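Before moving on, it is worth a quick look at what was loaded; a minimal check (not part of the original article) is:
# Quick look at the loaded data: size, column names and types
dim(train)
str(train)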
Install and Load Libraries
We will need to install many libraries and then load them into R.
#install and load required packages
install.packages(c("gridExtra","ggplot2","e1071", "caret", "doSNOW", "ipred", "xgboost", "dplyr"))
library(e1071)
library(caret)
library(doSNOW)
library(ipred)
library(xgboost)
library(dplyr)
library(ggplot2)
library(gridExtra)
Clean Dataset
Before we analyze the data we need to clean it. Cleaning takes a lot of time; we have to deal with duplicate records and records with missing values. Viewing the dataset shows that it contains many missing values, which we can remove.
The outcome variable Price contains a lot of missing values, so we remove those rows (we could also store them separately and predict them once our model is ready). Similarly, many rows have several attribute values missing at once; we can remove them with the following:
#removing missing records based on Price and Bedroom2
train <- train[!is.na(train$Price),]
train <- train[!is.na(train$Bedroom2),]
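The cleaning step above also mentions duplicate records; they are not removed anywhere in the code shown here, but a minimal sketch for dropping exact duplicates would be:
# Dropping exact duplicate rows (optional; not part of the original code)
train <- train[!duplicated(train), ]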
You will still see many missing values, for example in BuildingArea, but they are now confined to only a few columns, so we can impute them later. Also, BuildingArea and Landsize contain many zero values; these are actually missing values, so we replace them with NA.
# replacing 0 with NA
train$BuildingArea[train$BuildingArea == 0] <- NA
train$Landsize[train$Landsize == 0] <- NA
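To confirm which columns still contain missing values at this point, a quick check (my addition) is:
# Count of remaining missing values per column
colSums(is.na(train))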
Data Analysis
Now comes the data analysis step, where we explore the data, also known as EDA (Exploratory Data Analysis). Ideally we would analyze every variable individually and against the outcome variable, but here I will analyze only the important ones.
As a first step we plot a histogram of the outcome variable just to see how it is distributed.
# Plotting individual continuous variables
ggplot(aes(Price), data = train) + geom_histogram()
Similarly, we can look at the distribution of YearBuilt. This graph suggests that the majority of houses were built around the mid 20th century.
ggplot(aes(YearBuilt),data=train)+ geom_histogram()
Now we plot multiple bar charts to see the distributions of Rooms, Bedroom2, Bathroom and Car.
# Multiple Plots
a <- ggplot(aes(Rooms),data=train)+ geom_bar()
b <- ggplot(aes(Bedroom2),data=subset(train, Bedroom2 < 20))+
geom_bar()+ scale_x_discrete(breaks=seq(0,8,2), limits= c(0:8))
c <- ggplot(aes(Bathroom),data=train)+ geom_bar()
d <- ggplot(aes(Car),data=train)+ geom_bar()
grid.arrange(a, b, c, d, ncol=2, nrow=2)
Plotting Date vs Price: the plot does not give much information, as Price seems to be independent of Date. It would have been more useful if the date range had covered at least five years.
# Date plot
ggplot(aes(as.Date(Date, '%d/%m/%Y'), Price), data = train) + geom_point()
Next we plot Landsize vs Price and BuildingArea vs Price, coloured by Type and Method. This shows how Price changes along with the other variables.
# Multi plot
w <- ggplot(aes(Landsize,Price), data= train)+
geom_point(alpha=1/3, aes(color= Type))+
coord_cartesian(xlim = 0:10000)
x <- ggplot(aes(Landsize,Price), data= train)+
geom_point(alpha=1/3, aes(color= Method))+
coord_cartesian(xlim = 0:10000)
y <- ggplot(aes(BuildingArea,Price), data= train)+
geom_point(alpha=1/5, aes(color= Type))+
coord_cartesian(xlim = 0:1000, ylim = 0:5e+06)
z <- ggplot(aes(BuildingArea,Price), data= train)+
geom_point(alpha=1/5, aes(color= Method))+
coord_cartesian(xlim = 0:1000, ylim = 0:5e+06)
grid.arrange(w, x, y, z, ncol=2, nrow=2)
If we look at the first column, the second plot (`BuildingArea vs Price vs Type`) is far more meaningful than the first one. BuildingArea and Type both show how useful they could be in predicting Price. Next, plotting Lattitude vs Longitude coloured by CouncilArea shows how the council areas are spread out, in the form of a map.
ggplot(aes(Lattitude,Longitude), data= train)+ geom_point(alpha=1/5,aes(color= CouncilArea))
Similarly, you can create your own visualizations and interpret them in different ways. These visualizations will be helpful for feature selection and dimensionality reduction. I recommend digging deeper and doing more EDA.
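As one example of such an extra visualization (my addition, not from the original article), a boxplot of Price per property Type shows how prices differ across the three types:
# Price distribution per property Type
ggplot(aes(Type, Price), data = train) + geom_boxplot()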
Feature Selection
Now it is time to reduce the number of features, in other words the dimensionality. There are several irrelevant features that we can filter out directly, and since the number of variables is not that high we can then use a feature subset selection approach to find a good subset.
Variables like Suburb, Address and SellerG are categorical with a very large number of categories; including them would not improve the model and would only make training slower. Postcode is also not useful, because it behaves more like a categorical variable. (Visualize them for a better understanding.)
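To back up the point about these high-cardinality variables, a quick check (my addition, using the column names from the Kaggle snapshot) counts their distinct values:
# Number of distinct values in the high-cardinality variables
sapply(train[, c("Suburb", "Address", "SellerG", "Postcode")], function(x) length(unique(x)))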
I could not figure out the purpose of the Distance variable in the dataset, and there is no meaningful relationship between Distance and Price, so I leave it out. (Dig deeper!)
Landsize and Date might seem like natural features for predicting prices, but in this case they are not. Even after trying them out, the model started to overfit. (See the visualizations.)
Now, selecting the features: from the visualizations presented above and some common sense, we can come up with the following set:
# Feature selection
train <- subset(train, select = c(Rooms, Bedroom2, Bathroom, Car, Type,
Method, BuildingArea, YearBuilt, CouncilArea,
Lattitude, Longitude, Price))
Impute Missing Values
In order to impute missing values we need to create dummy variables. But before that, if we have a non-missing categorical variable with many categories, we can remove it and add it back once all missing values have been imputed.
In our case that variable is CouncilArea, so we store it in a temporary data frame and then remove it. This lets the imputation run more quickly.
# For back up of values
temp = data.frame(train$CouncilArea)
train <- subset(train, select = -c(CouncilArea))
Now we use the caret package to create dummy variables, impute the missing values, and finally put the imputed columns back into the original dataset.
# Transform all feature to dummy variables.
dummy.vars <- dummyVars(~ ., data = train[, -1])
train.dummy <- predict(dummy.vars, train[, -1])
View(train.dummy)
# Now, impute!
pre.process <- preProcess(train.dummy, method = "bagImpute")
imputed.data <- predict(pre.process, train.dummy)
View(imputed.data)
# Assigning imputed values to the original dataset
train$BuildingArea <- imputed.data[, 12]
train$YearBuilt <- imputed.data[, 13]
train$Lattitude <- imputed.data[, 14]
train$Longitude <- imputed.data[, 15]
#Adding back the CouncilArea
train$CouncilArea <- temp$train.CouncilArea
Split Data and Train Model
The next few steps are pretty easy: split the data into training and test sets, then set everything up for the model. The code is similar to what we did on the Titanic dataset.
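The split itself is not shown in the original snippet, so here is a minimal sketch of how housing.train and housing.test (used below) could be created with caret's createDataPartition; the 70/30 ratio and the seed are my assumptions:
# Assumed split: 70% training, 30% test (ratio and seed are not from the original article)
set.seed(54321)
indexes <- createDataPartition(train$Price, times = 1, p = 0.7, list = FALSE)
housing.train <- train[indexes, ]
housing.test <- train[-indexes, ]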
# Using 10-fold cross validation repeated 3 times
# Using a grid search for optimal model hyperparamter values.
train.control <- trainControl(method = "repeatedcv",
number = 10,
repeats = 3,
search = "grid")
# Leverage a grid search of hyperparameters for xgboost
tune.grid <- expand.grid(eta = c(0.05, 0.075, 0.1),
nrounds = c(50, 75, 100),
max_depth = 6:8,
min_child_weight = c(2.0, 2.25, 2.5),
colsample_bytree = c(0.3, 0.4, 0.5),
gamma = 0,
subsample = 1)
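# This grid has 3 x 3 x 3 x 3 x 3 = 243 hyperparameter combinations, and with
# 10-fold CV repeated 3 times each one is fit on 30 resamples, hence the
# parallel processing set up next.
nrow(tune.grid)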
# For Parallel Processing use doSNOW
# Tune the number of thread or core based on your PC
cl <- makeCluster(4, type = "SOCK")
# Register cluster so that caret will know to train in parallel.
registerDoSNOW(cl)
caret.cv <- train(Price ~ .,
data = housing.train,
method = "xgbTree",
tuneGrid = tune.grid,
trControl = train.control)
stopCluster(cl)
# Examine caret's processing results
caret.cv
The last line above prints how the model performed for the different combinations of hyperparameters and which combination was picked as the best one.
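If you only want the winning combination rather than the full results table, caret's train object stores it directly; a quick look (my addition):
# Best hyperparameter combination found by the grid search
caret.cv$bestTune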
Test and Evaluate Model
The model is ready, so we need to test it on the test dataset we obtained during the splitting step. For a continuous outcome variable, we can evaluate the model using Root Mean Squared Error (RMSE), R-squared, min-max accuracy and Mean Absolute Percentage Error (MAPE).
# Making Predictions
preds <- predict(caret.cv, housing.test)
#Evaluation using RMSE, R-Squared
postResample(preds,housing.test$Price)
# Evaluation using min max accuracy and MAPE
actuals_preds <- data.frame(cbind(actuals=housing.test$Price, predicteds=preds))
head(actuals_preds)
min_max_accuracy <- mean(apply(actuals_preds, 1, min) / apply(actuals_preds, 1, max))
mape <- mean(abs((actuals_preds$predicteds - actuals_preds$actuals))/actuals_preds$actuals)
Similarly, we can reuse the code above on the training dataset to see how well the model does there compared with the test dataset, and to check different feature settings for overfitting. If more variables are added to the feature list you will see an increase in training accuracy but a slight decrease in test accuracy, meaning that the model is starting to overfit.
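A minimal sketch of that reuse (my addition; it simply points the same evaluation code at housing.train):
# Same evaluation on the training data, for comparison with the test results
train.preds <- predict(caret.cv, housing.train)
postResample(train.preds, housing.train$Price)
train.ap <- data.frame(actuals = housing.train$Price, predicteds = train.preds)
mean(apply(train.ap, 1, min) / apply(train.ap, 1, max))              # min-max accuracy
mean(abs(train.ap$predicteds - train.ap$actuals) / train.ap$actuals) # MAPE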
Output

| Metric | Training Dataset | Testing Dataset | Ideal Value |
| --- | --- | --- | --- |
| RMSE | 1.523547e+05 | 2.730161e+05 | 0 |
| R-Squared | 9.474203e-01 | 8.274691e-01 | 1 |
| M_M_Accuracy | 0.9024803 | 0.8723883 | 1 |
| MAPE | 0.0977454 | 0.1545473 | 0 |
The M_M_Accuracy looks quite good in this scenario. The reason is that Price consists of large numbers: a 10,000 difference in a prediction causes only a small drop in this accuracy measure, and in reality such an error does not matter much when the house price is around 1,000,000.
`RMSE` in our case looks very high at first glance because of the scientific notation, but the reason is again the large numbers in Price. Set the value against typical house prices and it becomes quite reasonable and meaningful.
To improve the accuracy further, feature engineering is now required. I created a few features from the Address and YearBuilt variables but that did not help; I will update this article if I come up with better results.
You can find the full R code at the Github Repo.
For any question, feedback or suggestion, feel free to comment below.