Description

This report provides a detailed implementation of house price prediction using linear regression in R.

The dataset used in this report is the House Price Prediction data hosted on Kaggle: https://www.kaggle.com/shree1992/housedata .

Outline:
1. Data Extraction
2. Data Exploration
3. Data Preparation
4. Modeling
5. Evaluation
6. Recommendation

1. Data Extraction

Read the dataset from a CSV file and assign it to an R data frame.

house_df <- read.csv(file = "data/datahs.csv")

2. Data Exploration

Check the data dimensions. The dataset has 4600 rows (observations) and 18 columns (variables).

dim(house_df)
## [1] 4600   18
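
To also see the variable names and types, the structure of the data frame can be inspected (output omitted here for brevity):

str(house_df)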

2.1. Univariate Analysis

Analyze a single variable: “price”, the target variable.

library(ggplot2)
ggplot(data = house_df, aes(y = price)) +
  geom_boxplot()

Based on the boxplot above, the target variable has:
- Outliers
- Incorrect values (price == 0)
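
Both issues can be quantified directly (a quick check on the raw data frame loaded above):

sum(house_df$price == 0)                    # number of zero-price observations
length(boxplot.stats(house_df$price)$out)   # values flagged by the 1.5 * IQR boxplot rule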

2.2. Bivariate Analysis

Analyze two variables: the relationship between the number of bedrooms and price.

ggplot(data = house_df, aes(x = as.factor(bedrooms), 
                            y = price)) +
  geom_boxplot() +
  ylim(0, 1000000)

The outliers make the visualization hard to read. However, we can still see that price is positively correlated with the number of bedrooms.
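
The positive trend can also be checked numerically, for example with the median price per bedroom count (a minimal sketch):

# median price for each number of bedrooms
aggregate(price ~ bedrooms, data = house_df, FUN = median)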

2.3. Multivariate Analysis

Analyze multiple variables. Compute the Pearson correlation coefficient between each numerical variable and the target variable.

house_num <- house_df[, c("price", "bedrooms", "bathrooms",
                          "sqft_living", "sqft_lot", "floors",
                          "waterfront", "view", "condition", 
                          "sqft_above", "sqft_basement") ]

r <- cor(house_num)
r[, c("price")]
##         price      bedrooms     bathrooms   sqft_living      sqft_lot 
##    1.00000000    0.20033629    0.32710992    0.43041003    0.05045130 
##        floors    waterfront          view     condition    sqft_above 
##    0.15146080    0.13564832    0.22850417    0.03491454    0.36756960 
## sqft_basement 
##    0.21042657

Based on the Pearson’s correlation coefficient scores, the most influential variables are sqft_living, bathrooms, and sqft_above.

Visualize the scatterplot matrix with smoother lines for the most influential variables.

library(car)
scatterplotMatrix(house_num[, c("price", "sqft_living", 
                                "bathrooms", "sqft_above")], 
                  spread=FALSE, 
                  smoother.args = list(lty = 2))

The variables have a positive correlation with the target. We can see that most of the house prices are low (< US$1M). The outliers in price will significantly influence the model.

3. Data Preparation

3.1 Data Cleaning

Remove Incorrect Data

Remove observations with zero price

house_num <- house_num[house_num$price > 0,]

The new dataset has 4551 rows (49 rows are removed).
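
This can be verified directly (a quick check):

dim(house_num)   # 4600 - 49 = 4551 rows should remain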

Remove Outliers

Remove observations with outliers

# get outlier values
out_price <- boxplot.stats(house_num$price)$out

# get row indices of the outlier values
out_idx <- which(house_num$price %in% out_price)

# remove rows with outliers
house_num <- house_num[-out_idx, ]

dim(house_num)
## [1] 4311   11
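
For reference, boxplot.stats follows Tukey’s rule: values more than 1.5 times the interquartile range beyond the hinges are flagged as outliers. The fences can be computed explicitly (an illustrative sketch; the hinges are close to, but not always identical to, the quartiles):

prices <- house_df$price[house_df$price > 0]   # the 4551 screened prices
q <- quantile(prices, c(0.25, 0.75))           # approximate hinges
fence <- 1.5 * (q[2] - q[1])                   # 1.5 * IQR
c(lower = q[1] - fence, upper = q[2] + fence)  # Tukey fences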

3.2 Feature Extraction

One Hot Encoding

Create new features from categorical variable(s). Add a location feature to the dataset to improve prediction performance.

library(caret)
## Loading required package: lattice
rn <- row.names(house_num)

house_loc <- data.frame(house_df[rn, c("statezip")])
colnames(house_loc) <- "loc"

dmy <- dummyVars("~.", data = house_loc)
house_loc <- data.frame(predict(dmy, newdata = house_loc))

house_num <- cbind(house_num, house_loc)
dim(house_num)
## [1] 4311   88

The new dataset has 88 columns (77 new columns are added), since there are 77 unique statezip values in the dataset.
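
Both counts can be confirmed directly (a quick check):

length(unique(house_df[rn, "statezip"]))   # 77 unique zip codes among the kept rows
ncol(house_loc)                            # 77 dummy columns, one per level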

Principal Component Analysis (PCA)

Create new features by projecting the data onto new dimensions (principal components).
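
PCA is not applied in the model below, but a minimal sketch on the original numerical predictors would look like this (columns with zero variance are dropped so that scaling is well defined):

num_cols <- c("bedrooms", "bathrooms", "sqft_living", "sqft_lot",
              "floors", "waterfront", "view", "condition",
              "sqft_above", "sqft_basement")
num_part <- house_num[, num_cols]
num_part <- num_part[, apply(num_part, 2, sd) > 0]  # drop any constant column

pc <- prcomp(num_part, scale. = TRUE)  # centered and scaled PCA
summary(pc)                            # proportion of variance per component

The leading components could then be added to the feature set with cbind, in the same way as the one-hot encoded location.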

Split Training and Testing Data

Randomly divide the dataset into training and testing data (ratio = 70:30).

set.seed(2021)                       # for reproducible sampling
m <- nrow(house_num)                 # total number of observations
train_idx <- sample(m, 0.7 * m)      # random 70% of the row indices

train_df <- house_num[train_idx, ]   # 70% for training
test_df <- house_num[-train_idx, ]   # 30% for testing
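
A quick sanity check on the split sizes:

nrow(train_df) / m   # should be close to 0.70
nrow(test_df) / m    # should be close to 0.30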

4. Modeling

Multivariate Linear Regression

We use all variables in the dataset (except the target variable) as features.

fit.mlr <- lm(price ~ .,
              data = train_df)
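
To inspect the fitted coefficients and the overall fit, the standard model summary can be used (output omitted for brevity):

summary(fit.mlr)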

5. Evaluation

Create a function to compute the Root Mean Squared Error (RMSE) and Pearson’s correlation coefficient (R) between predicted and actual values.

performanceReg <- function(actual, predicted){
  e = actual - predicted # error
  se = e ^ 2             # squared error
  sse = sum(se)          # sum of squared error
  mse = mean(se)         # mean squared error
  rmse = sqrt(mse)       # root mean squared error
  
  r = cor(actual, predicted)
  
  result <- paste("RMSE = ", round(rmse,3), "\n",
                  "R = ", round(r,3), "\n")
  cat(result)
}
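
For example, on a small toy vector pair (an illustration of the function only, not part of the evaluation):

performanceReg(actual = c(100, 200, 300),
               predicted = c(110, 190, 310))   # prints RMSE = 10 and R = 0.993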

Compute predicted values using the learned model. Then, calculate the model’s performance based on RMSE and R.

pred.mlr <- predict(fit.mlr, test_df)
performanceReg(test_df$price, pred.mlr)
## RMSE =  97530.559 
##  R =  0.888

Visualize Actual vs Predicted Values.

actual <- test_df$price

p3 <- ggplot(data = test_df, aes(x = actual, y = pred.mlr)) +
  geom_point() +
  scale_x_continuous(breaks = c(250000, 500000, 750000),
                     labels = c("$250K", "$500K", "$750K"),
                     limits = c(0, 1000000)) +
  scale_y_continuous(breaks = c(250000, 500000, 750000),
                     labels = c("$250K", "$500K", "$750K"),
                     limits = c(0, 1000000)) + 
  labs(title = "Multivariate Linear Regression", 
       x = "Actual", y = "Predicted")

p3

Most of the points lie around the diagonal, which means the predicted values are close to the actual values.
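
A reference diagonal makes this easier to judge (a small addition to the plot above):

p3 + geom_abline(intercept = 0, slope = 1, linetype = "dashed")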

6. Recommendation

Based on the analysis and evaluation, there are several conclusions and recommendations:
1. Data cleaning (removing outliers and incorrect values) and feature extraction (one-hot encoding) significantly improve the prediction performance.
2. The RMSE is still relatively high. We believe that the model can be improved.
3. Using PCA or other data preparation methods is promising for improving the performance.