This report provides a detailed implementation of house price prediction using Linear Regression in R.
The dataset used in this report is the House Price Prediction data hosted on Kaggle: https://www.kaggle.com/shree1992/housedata.
Outline:
1. Data Extraction
2. Data Exploration
3. Data Preparation
4. Modeling
5. Evaluation
6. Recommendation
Read the dataset in CSV file format and assign it to an R data frame.
house_df <- read.csv(file = "data/datahs.csv")
Check the data dimensions. The dataset has 4600 rows (observations) and 18 columns (variables).
dim(house_df)
## [1] 4600 18
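Optionally, inspect the variable types before numeric columns are selected later in the analysis (a quick look; output omitted here):
str(house_df)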
Analyze a single variable: price, the target variable.
library(ggplot2)
ggplot(data = house_df, aes(y = price)) +
geom_boxplot()
Based on the boxplot above, the target variable has:
- Outliers
- Incorrect values (price == 0)
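A quick check (a minimal sketch) counts the zero-price observations directly; per the Data Preparation step below, this count is 49:
# count observations with an incorrect (zero) price
sum(house_df$price == 0)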
Analyze two variables: the relationship between the number of bedrooms and price.
ggplot(data = house_df, aes(x = as.factor(bedrooms),
y = price)) +
geom_boxplot()+
ylim(0, 1000000)
The outliers make the visualization less clear. However, we can still see that price is positively correlated with the number of bedrooms.
Multivariate analysis: compute the Pearson correlation coefficient between every numerical variable and the target variable.
house_num <- house_df[, c("price", "bedrooms", "bathrooms",
"sqft_living", "sqft_lot", "floors",
"waterfront", "view", "condition",
"sqft_above", "sqft_basement") ]
r <- cor(house_num)
r[, c("price")]
## price bedrooms bathrooms sqft_living sqft_lot
## 1.00000000 0.20033629 0.32710992 0.43041003 0.05045130
## floors waterfront view condition sqft_above
## 0.15146080 0.13564832 0.22850417 0.03491454 0.36756960
## sqft_basement
## 0.21042657
Based on the Pearson correlation coefficients, the most influential variables are sqft_living, bathrooms, and sqft_above.
Visualize the scatterplot matrix with smoother lines for the most influential variables.
library(car)
scatterplotMatrix(house_num[, c("price", "sqft_living",
"bathrooms", "sqft_above")],
spread=FALSE,
smoother.args = list(lty = 2))
The variables have a positive correlation with the target. We can see that most of the house prices are low (< US$1M). The outliers in price will significantly influence the model.
Remove Incorrect Data
Remove observations with a zero price.
house_num <- house_num[house_num$price > 0,]
The new dataset has 4551 rows (49 rows are removed).
Remove Outliers
Remove observations with outlier prices.
# get outlier value
out_price <- boxplot.stats(house_num$price)$out
# get row index of outliers values
out_idx <- which(house_num$price %in% c(out_price))
# remove rows with outliers
house_num <- house_num[-out_idx,]
dim(house_num)
## [1] 4311 11
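For reference, boxplot.stats() flags values lying more than 1.5 times the interquartile range beyond the hinges by default. The same rows could have been selected with an approximately equivalent manual filter (a sketch; hinges differ slightly from the quartiles used here):
# approximately equivalent 1.5 * IQR filter (shown for reference only)
q <- quantile(house_num$price, probs = c(0.25, 0.75))
iqr_price <- q[2] - q[1]
keep <- house_num$price >= q[1] - 1.5 * iqr_price &
        house_num$price <= q[2] + 1.5 * iqr_price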
One-Hot Encoding
Create new features from categorical variable(s). Add a location feature (statezip) to the dataset to improve prediction performance.
library(caret)
## Loading required package: lattice
rn <- row.names(house_num)
house_loc <- data.frame(house_df[rn, c("statezip")])
colnames(house_loc) <- "loc"
dmy <- dummyVars("~.", data = house_loc)
house_loc <- data.frame(predict(dmy, newdata = house_loc))
house_num <- cbind(house_num, house_loc)
dim(house_num)
## [1] 4311 88
The new dataset has 88 columns (77 new columns are added). There are 77 unique statezip values in the dataset.
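As a sanity check (a minimal sketch), the number of unique statezip values can be verified directly:
# number of distinct statezip values; should equal the 77 new columns
length(unique(house_df$statezip))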
Principal Component Analysis (PCA)
Create new features by projecting the data onto new dimensions (principal components).
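No PCA-derived features are used in the model below (PCA is listed under the recommendations), but a minimal sketch of how principal components could be derived from the predictors is shown here. Keeping the first 5 components is an illustrative assumption, not a tuned value:
# a minimal PCA sketch on the numeric predictors (excluding the target);
# scale. = TRUE assumes no constant columns remain after cleaning
pca <- prcomp(house_num[, setdiff(names(house_num), "price")],
              center = TRUE, scale. = TRUE)
summary(pca)   # proportion of variance explained per component
# keep the first 5 components as new features (illustrative choice)
house_pca <- data.frame(price = house_num$price, pca$x[, 1:5])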
Split Training and Testing Data
Randomly divide the dataset into training and testing sets (ratio = 70:30).
set.seed(2021)
m <- nrow(house_num)
train_idx <- sample(m, 0.7 * m)
train_df <- house_num[train_idx, ]
test_df <- house_num[-train_idx, ]
Multivariate Linear Regression
We use all variables in the dataset (except the target variable) as features.
fit.mlr <- lm(price ~ .,
data = train_df)
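To inspect the coefficients and their significance, the standard model summary can be printed (output omitted here for brevity):
summary(fit.mlr)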
Create a function to compute the Root Mean Squared Error (RMSE) and Pearson's correlation coefficient (R) between predicted and actual values.
performanceReg <- function(actual, predicted){
e = actual - predicted # error
se = e ^ 2 # squared error
sse = sum(se) # sum of squared error
mse = mean(se) # mean squared error
rmse = sqrt(mse) # root mean squared error
r = cor(actual, predicted)
result <- paste("RMSE = ", round(rmse,3), "\n",
"R = ", round(r,3), "\n")
cat(result)
}
Compute predictions using the learned model. Then, calculate the model's performance based on RMSE and R.
pred.mlr <- predict(fit.mlr, test_df)
performanceReg(test_df$price, pred.mlr)
## RMSE = 97530.559
## R = 0.888
Visualize Actual vs Predicted Values.
actual <- test_df$price
p3 <- ggplot(data = test_df, aes(x = actual, y = pred.mlr)) +
geom_point() +
scale_x_continuous(breaks = c(250000, 500000, 750000),
labels = c("$250KM", "$500K", "$750K"),
limits = c(0,1000000)) +
scale_y_continuous(breaks = c(250000, 500000, 750000),
labels = c("$250KM", "$500K", "$750K"),
limits = c(0,1000000))+
labs(title = "Multivariate Linear Regression",
x = "Actual", y = "Predicted")
p3
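Since the interpretation refers to the diagonal, a y = x reference line can be overlaid to mark perfect predictions (a small addition to the plot above):
# dashed y = x line: points on it are perfectly predicted
p3 + geom_abline(slope = 1, intercept = 0, linetype = "dashed")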
Most of the points lie around the diagonal, meaning the predicted values are close to the actual values.
Based on the analysis and evaluation, there are several conclusions and recommendations:
1. Data cleaning (removing outliers and incorrect values) and feature extraction (OHE) significantly improve prediction performance.
2. The RMSE is still relatively high. We believe the model can be improved.
3. Using PCA or other data preparation methods is a promising way to improve performance.