This report provides a detailed implementation of house price prediction using linear regression in R.
The dataset used in this report is the House Prediction data hosted on Kaggle: https://www.kaggle.com/shreayan98c/boston-house-price-prediction
Report details…
Report Outline:
1. Data Extraction
2. Data Exploration
3. Data Preparation
4. Modelling
5. Evaluation
Read the dataset from a CSV file and assign it to an R data frame.
house_df <- read.csv(file = "data/data.csv")
Check the dimensions of the data. The dataset has 4600 rows (observations) and 18 columns (variables).
dim(house_df)
## [1] 4600 18
Analyze a single variable: “price”, the target variable.
library(ggplot2)
ggplot(data = house_df, aes(y = price)) +
geom_boxplot()
Based on the boxplot above, the target variable has outliers and incorrect values (price == 0).
Analyze two variables: the relationship between the number of bedrooms and price.
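Both issues can also be checked numerically. The following is a minimal sketch, not part of the original analysis: it uses a small made-up price vector in place of house_df$price, counts the zero values, and lists the points that base R's boxplot.stats() treats as outliers (those more than 1.5 times the interquartile range beyond the hinges).

```r
# Toy prices standing in for house_df$price (values are made up for illustration)
price <- c(0, 250000, 310000, 420000, 480000, 550000, 26590000)

# Number of observations with an invalid price of zero
n_zero <- sum(price == 0)

# Points lying more than 1.5 * IQR beyond the hinges, as drawn by a boxplot
outliers <- boxplot.stats(price)$out

n_zero
outliers
```

On the real data, the same two calls would take house_df$price as input.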
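# Treat bedrooms as a category so each bedroom count gets its own box
ggplot(data = house_df, aes(x = as.factor(bedrooms),
y = price)) +
geom_boxplot() +
# Limit the y-axis to US$1M; ylim() drops the prices above that limit
ylim(0, 1000000)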
The outliers make the plot hard to read. In general, however, we can still see that price is positively correlated with the number of bedrooms.
Analyze multiple variables. Compute the correlation coefficients between all numerical variables.
house_num <- house_df[, c("price", "bedrooms", "bathrooms",
"sqft_living", "sqft_lot", "floors",
"waterfront", "view", "condition",
"sqft_above", "sqft_basement") ]
r <- cor(house_num)
r[, c("price")]
## price bedrooms bathrooms sqft_living sqft_lot
## 1.00000000 0.20033629 0.32710992 0.43041003 0.05045130
## floors waterfront view condition sqft_above
## 0.15146080 0.13564832 0.22850417 0.03491454 0.36756960
## sqft_basement
## 0.21042657
Based on Pearson’s correlation coefficients, the variables most strongly correlated with price are sqft_living, sqft_above, and bathrooms.
Visualize the scatterplots and smoother lines for the variables most strongly correlated with price.
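Sorting the coefficients makes the ranking explicit. The sketch below re-enters the values from the output above, rounded to three decimals, so that it runs on its own; in the report itself the equivalent call would be sort(r[, "price"], decreasing = TRUE). The names r_price and ranked are introduced here for illustration only.

```r
# Correlations with price, copied from the printed output (rounded to 3 decimals)
r_price <- c(bedrooms = 0.200, bathrooms = 0.327, sqft_living = 0.430,
             sqft_lot = 0.050, floors = 0.151, waterfront = 0.136,
             view = 0.229, condition = 0.035, sqft_above = 0.368,
             sqft_basement = 0.210)

# Rank the predictors from strongest to weakest linear association with price
ranked <- sort(r_price, decreasing = TRUE)

# Keep the three strongest predictors
top3 <- names(ranked)[1:3]
top3
```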
library(car)
## Loading required package: carData
scatterplotMatrix(house_num[, c("price", "sqft_living",
"bathrooms", "sqft_above")],
spread=FALSE,
smoother.args = list(lty = 2))
These variables are positively correlated with the target. We can see that most house prices are low (< US$2.5M). The outliers in price will significantly influence the model.
Remove observations with zero price
Remove outlier observations
Create New Features from categorical variable(s)
Create new features in new dimensions
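The first three preparation steps could look roughly like the sketch below. It is an illustration under assumptions, not the report's actual code: the toy data frame, its values, the city column, and the names step1, step2, dummies, and prepared are all hypothetical, and the 1.5 * IQR cutoff is one common choice of outlier rule rather than one stated in the report.

```r
# Toy data standing in for house_df; values and the city column are made up
toy_df <- data.frame(
  price = c(0, 250000, 310000, 420000, 480000, 550000, 26590000),
  city  = c("Seattle", "Kent", "Seattle", "Bellevue", "Kent", "Seattle", "Bellevue"),
  stringsAsFactors = FALSE
)

# Step 1: drop observations with a zero price
step1 <- toy_df[toy_df$price > 0, ]

# Step 2: drop observations farther than 1.5 * IQR from the quartiles (assumed rule)
q     <- quantile(step1$price, probs = c(0.25, 0.75))
fence <- 1.5 * (q[2] - q[1])
keep  <- step1$price >= q[1] - fence & step1$price <= q[2] + fence
step2 <- step1[keep, ]

# Step 3: one-hot encode the categorical column
# ("- 1" removes the intercept so that every level gets its own column)
dummies  <- model.matrix(~ city - 1, data = step2)
prepared <- cbind(step2["price"], dummies)

dim(prepared)
```

Computing the quartiles after the zero prices are removed keeps the invalid rows from distorting the cutoff.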