Description

This report provides a detailed implementation of house price prediction using linear regression in R.

The dataset used in this report is the house price prediction data hosted on Kaggle: https://www.kaggle.com/shreayan98c/boston-house-price-prediction

Report details…

Report Outline:
1. Data Extraction
2. Data Exploration
3. Data Preparation
4. Modelling
5. Evaluation

1. Data Extraction

Read the dataset from a CSV file and assign it to an R data frame.

house_df <- read.csv(file = "data/data.csv")

2. Data Exploration

Check the dimensions of the data. The dataset has 4600 rows (observations) and 18 columns (variables).

dim(house_df)
## [1] 4600   18

2.1 Univariate Analysis

Analyze a single variable: “price”, the target variable.

library(ggplot2)

ggplot(data = house_df, aes(y = price)) +
  geom_boxplot()

Based on the boxplot above, the target variable has:
- outliers
- incorrect values (price == 0)

2.2 Bivariate Analysis

Analyze two variables: the relationship between the number of bedrooms and the price.

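# Convert bedrooms to a factor so each bedroom count gets its own box.
# Note: ylim() drops prices above the limit before the boxes are computed.
ggplot(data = house_df, aes(x = as.factor(bedrooms),
                            y = price)) +
  geom_boxplot() +
  ylim(0, 1000000)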

The outliers make the visualization hard to read. In general, however, we can still see that price is positively correlated with the number of bedrooms.

2.3 Multivariate Analysis

Analyze multiple variables. Compute the correlation coefficients between all numerical variables.

house_num <- house_df[, c("price", "bedrooms", "bathrooms",
                 "sqft_living", "sqft_lot", "floors",
                 "waterfront", "view", "condition", 
                 "sqft_above", "sqft_basement") ]

r <- cor(house_num)
r[, c("price")]
##         price      bedrooms     bathrooms   sqft_living      sqft_lot 
##    1.00000000    0.20033629    0.32710992    0.43041003    0.05045130 
##        floors    waterfront          view     condition    sqft_above 
##    0.15146080    0.13564832    0.22850417    0.03491454    0.36756960 
## sqft_basement 
##    0.21042657

Based on Pearson’s correlation coefficients, the variables most strongly correlated with price are sqft_living (0.43), sqft_above (0.37), and bathrooms (0.33).
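
The ranking can be checked by sorting the correlations with price. This is a minimal sketch: the vector `r_price` is an illustrative name, and its values are copied from the printed output above so that the snippet runs on its own. In the report itself, `sort(r[, "price"], decreasing = TRUE)` should give the same ordering.

```r
# Correlations with price, copied from the output above
# (r_price is an illustrative name, not part of the original code)
r_price <- c(bedrooms = 0.20033629, bathrooms = 0.32710992,
             sqft_living = 0.43041003, sqft_lot = 0.05045130,
             floors = 0.15146080, waterfront = 0.13564832,
             view = 0.22850417, condition = 0.03491454,
             sqft_above = 0.36756960, sqft_basement = 0.21042657)

# Sort from strongest to weakest correlation with price
r_sorted <- sort(r_price, decreasing = TRUE)

# The three variables most strongly correlated with price
top3 <- names(head(r_sorted, 3))
print(top3)
```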

Visualize the scatterplots and smoother lines between price and the most influential variables.

library(car)
## Loading required package: carData
scatterplotMatrix(house_num[, c("price", "sqft_living", 
                                "bathrooms", "sqft_above")], 
                  spread=FALSE, 
                  smoother.args = list(lty = 2))

These variables are positively correlated with the target. We can also see that most house prices are low (< US$2.5M). The outliers in price will significantly influence the model, because least-squares fitting is sensitive to extreme values.

3. Data Preparation

3.1 Data Cleaning

Remove Incorrect Data

Remove observations with zero price
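
A minimal sketch of this step, assuming zero prices are the only incorrect values to drop. The toy data frame `toy_df` and the result name `house_clean` are illustrative and do not appear in the original report, where the same filter would be applied to `house_df`.

```r
# Toy stand-in for house_df (hypothetical values, for illustration only)
toy_df <- data.frame(price    = c(0, 313000, 2384000, 0, 550000),
                     bedrooms = c(3, 3, 5, 4, 4))

# Keep only observations with a strictly positive price
house_clean <- toy_df[toy_df$price > 0, ]

# Number of rows removed by the filter
n_removed <- nrow(toy_df) - nrow(house_clean)
print(n_removed)
```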

Remove Outliers

Remove observations that are outliers
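
The report does not state which outlier rule is used. The sketch below assumes Tukey's 1.5 * IQR rule applied to price, which is the same rule a boxplot uses to draw outlier points. The toy data frame and the names `lower`, `upper`, and `house_no_outliers` are illustrative.

```r
# Toy stand-in for house_df (hypothetical values, for illustration only)
toy_df <- data.frame(
  price       = c(250000, 300000, 320000, 350000,
                  400000, 450000, 500000, 26590000),
  sqft_living = c(900, 1200, 1300, 1500, 1800, 2100, 2400, 1180)
)

# Tukey's rule: flag prices more than 1.5 * IQR outside the quartiles
q     <- unname(quantile(toy_df$price, probs = c(0.25, 0.75)))
iqr   <- q[2] - q[1]
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

# Keep only observations whose price lies inside the fences
house_no_outliers <- toy_df[toy_df$price >= lower & toy_df$price <= upper, ]
print(nrow(house_no_outliers))
```

Other thresholds (for example a fixed price cap or a percentile cut) are equally plausible choices; the fences here are only one option.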

3.2 Features Extraction

One Hot Encoding (OHE)

Create new features from categorical variable(s)
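
One way to sketch one hot encoding in base R is with `model.matrix()`. The categorical column `city` and the toy values below are assumptions made for illustration; the report does not say which categorical variables are encoded.

```r
# Toy data with a hypothetical categorical column (illustration only)
toy_df <- data.frame(price = c(313000, 550000, 420000, 342000),
                     city  = c("Seattle", "Bellevue", "Seattle", "Kent"),
                     stringsAsFactors = TRUE)

# One indicator column per level; "- 1" removes the intercept
# so that no level is dropped as a reference category
city_ohe <- model.matrix(~ city - 1, data = toy_df)

# Combine the numeric column with the new indicator columns
toy_encoded <- cbind(toy_df["price"], city_ohe)
print(colnames(toy_encoded))
```

When the model is fitted with `lm()`, a factor column is encoded automatically, and keeping every indicator column alongside an intercept makes one of them redundant.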

Principal Component Analysis (PCA)

Create new features in new dimensions
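
A minimal PCA sketch using `prcomp()` from base R. The toy values and the choice of columns are assumptions for illustration; the report does not specify which variables enter the PCA or how many components are kept.

```r
# Toy numeric features (hypothetical values, for illustration only)
toy_num <- data.frame(
  sqft_living = c(1340, 3650, 1930, 2000, 1940, 880),
  sqft_above  = c(1340, 3370, 1930, 1000, 1140, 880),
  bathrooms   = c(1.5, 2.5, 2.0, 2.25, 2.5, 1.0)
)

# Center and scale so variables measured in different units
# contribute comparably to the components
pca <- prcomp(toy_num, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component
print(summary(pca))

# New features: each observation expressed in component space
pc_scores <- pca$x
```

The columns of `pc_scores` (PC1, PC2, ...) are uncorrelated with each other and could replace the original correlated predictors in the regression.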
