Description

This report details the implementation of house price prediction using linear regression in R.

The dataset used in this report is the House Price Prediction data hosted on Kaggle: https://www.kaggle.com/shree1992/housedata

Outline:
1. Data Extraction
2. Data Exploration
3. Data Preparation
4. Modeling
5. Evaluation

Data Extraction

Read the dataset from a CSV file and assign it to an R data frame.

house_df <- read.csv(file = "../data/house_data.csv")

Data Exploration

Check the dimensions of the data. The dataset has 4600 rows (observations) and 18 columns.

dim(house_df)
## [1] 4600   18
summary(house_df)
##      date               price             bedrooms       bathrooms    
##  Length:4600        Min.   :       0   Min.   :0.000   Min.   :0.000  
##  Class :character   1st Qu.:  322875   1st Qu.:3.000   1st Qu.:1.750  
##  Mode  :character   Median :  460943   Median :3.000   Median :2.250  
##                     Mean   :  551963   Mean   :3.401   Mean   :2.161  
##                     3rd Qu.:  654962   3rd Qu.:4.000   3rd Qu.:2.500  
##                     Max.   :26590000   Max.   :9.000   Max.   :8.000  
##   sqft_living       sqft_lot           floors        waterfront      
##  Min.   :  370   Min.   :    638   Min.   :1.000   Min.   :0.000000  
##  1st Qu.: 1460   1st Qu.:   5001   1st Qu.:1.000   1st Qu.:0.000000  
##  Median : 1980   Median :   7683   Median :1.500   Median :0.000000  
##  Mean   : 2139   Mean   :  14852   Mean   :1.512   Mean   :0.007174  
##  3rd Qu.: 2620   3rd Qu.:  11001   3rd Qu.:2.000   3rd Qu.:0.000000  
##  Max.   :13540   Max.   :1074218   Max.   :3.500   Max.   :1.000000  
##       view          condition       sqft_above   sqft_basement   
##  Min.   :0.0000   Min.   :1.000   Min.   : 370   Min.   :   0.0  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:1190   1st Qu.:   0.0  
##  Median :0.0000   Median :3.000   Median :1590   Median :   0.0  
##  Mean   :0.2407   Mean   :3.452   Mean   :1827   Mean   : 312.1  
##  3rd Qu.:0.0000   3rd Qu.:4.000   3rd Qu.:2300   3rd Qu.: 610.0  
##  Max.   :4.0000   Max.   :5.000   Max.   :9410   Max.   :4820.0  
##     yr_built     yr_renovated       street              city          
##  Min.   :1900   Min.   :   0.0   Length:4600        Length:4600       
##  1st Qu.:1951   1st Qu.:   0.0   Class :character   Class :character  
##  Median :1976   Median :   0.0   Mode  :character   Mode  :character  
##  Mean   :1971   Mean   : 808.6                                        
##  3rd Qu.:1997   3rd Qu.:1999.0                                        
##  Max.   :2014   Max.   :2014.0                                        
##    statezip           country         
##  Length:4600        Length:4600       
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
## 
str(house_df)
## 'data.frame':    4600 obs. of  18 variables:
##  $ date         : chr  "2014-05-02 00:00:00" "2014-05-02 00:00:00" "2014-05-02 00:00:00" "2014-05-02 00:00:00" ...
##  $ price        : num  313000 2384000 342000 420000 550000 ...
##  $ bedrooms     : num  3 5 3 3 4 2 2 4 3 4 ...
##  $ bathrooms    : num  1.5 2.5 2 2.25 2.5 1 2 2.5 2.5 2 ...
##  $ sqft_living  : int  1340 3650 1930 2000 1940 880 1350 2710 2430 1520 ...
##  $ sqft_lot     : int  7912 9050 11947 8030 10500 6380 2560 35868 88426 6200 ...
##  $ floors       : num  1.5 2 1 1 1 1 1 2 1 1.5 ...
##  $ waterfront   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ view         : int  0 4 0 0 0 0 0 0 0 0 ...
##  $ condition    : int  3 5 4 4 4 3 3 3 4 3 ...
##  $ sqft_above   : int  1340 3370 1930 1000 1140 880 1350 2710 1570 1520 ...
##  $ sqft_basement: int  0 280 0 1000 800 0 0 0 860 0 ...
##  $ yr_built     : int  1955 1921 1966 1963 1976 1938 1976 1989 1985 1945 ...
##  $ yr_renovated : int  2005 0 0 0 1992 1994 0 0 0 2010 ...
##  $ street       : chr  "18810 Densmore Ave N" "709 W Blaine St" "26206-26214 143rd Ave SE" "857 170th Pl NE" ...
##  $ city         : chr  "Shoreline" "Seattle" "Kent" "Bellevue" ...
##  $ statezip     : chr  "WA 98133" "WA 98119" "WA 98042" "WA 98008" ...
##  $ country      : chr  "USA" "USA" "USA" "USA" ...

2.1 Univariate Analysis

Analyze a single variable: “price”, the target variable.

library(ggplot2)
ggplot(data = house_df, aes(y = price)) +
  geom_boxplot()

Based on the boxplot above, the target variable has:
- Outliers
- Incorrect values (price == 0)
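Both issues can also be counted rather than read off the plot. The following is a minimal sketch, not part of the original analysis: `flag_price_issues` is a hypothetical helper, the 1.5 × IQR fences are the rule commonly used for boxplot whiskers, and the toy prices are made up for illustration.

```r
# Hypothetical helper: count zero prices and values outside the 1.5 * IQR fences.
flag_price_issues <- function(price) {
  q <- quantile(price, probs = c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  lower <- q[1] - 1.5 * iqr
  upper <- q[2] + 1.5 * iqr
  list(
    n_zero = sum(price == 0),                        # incorrect values
    n_outlier = sum(price < lower | price > upper),  # beyond either fence
    upper_fence = upper
  )
}

# Toy prices (made up, not taken from the dataset)
toy_price <- c(0, 300000, 320000, 350000, 400000, 450000, 500000, 5000000)
issues <- flag_price_issues(toy_price)
print(issues)
```

On the real data the same helper would presumably be called as `flag_price_issues(house_df$price)`.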

2.2 Bivariate Analysis

Analyze two variables: the relationship between the number of bedrooms and price.

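# Refer to columns by bare name inside aes(); the data argument supplies them
ggplot(data = house_df, aes(x = as.factor(bedrooms),
                            y = price)) +
  geom_boxplot() +
  # ylim() drops observations outside the range before the boxes are computed
  ylim(0, 1000000)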

The outliers make the visualization hard to read. However, we can still see that price is positively correlated with the number of bedrooms.
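Because the median is robust to outliers, a per-group median is one way to check this trend numerically. The sketch below uses a made-up toy data frame (`toy_df` and its values are illustrative assumptions, not rows from the dataset).

```r
# Toy data (made up for illustration)
toy_df <- data.frame(
  bedrooms = c(2, 2, 3, 3, 3, 4, 4),
  price = c(250000, 300000, 350000, 400000, 420000, 500000, 650000)
)

# Median price for each bedroom count
median_by_bedrooms <- tapply(toy_df$price, toy_df$bedrooms, median)
print(median_by_bedrooms)
```

With the report's data, the equivalent call would presumably be `tapply(house_df$price, house_df$bedrooms, median)`.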

2.3 Multivariate Analysis

Analyze multiple variables: compute the Pearson correlation coefficient between the numerical variables.

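# Keep the numerical columns used for the correlation analysis
house_num <- house_df[, c("price", "bedrooms", "bathrooms",
                          "sqft_living", "sqft_lot", "floors",
                          "waterfront", "view", "condition",
                          "sqft_above", "sqft_basement")]
r <- cor(house_num)
# Correlation of every variable with price
r[, "price"]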
##         price      bedrooms     bathrooms   sqft_living      sqft_lot 
##    1.00000000    0.20033629    0.32710992    0.43041003    0.05045130 
##        floors    waterfront          view     condition    sqft_above 
##    0.15146080    0.13564832    0.22850417    0.03491454    0.36756960 
## sqft_basement 
##    0.21042657

Based on the Pearson correlation coefficients, the variables most strongly correlated with price are sqft_living (0.43), sqft_above (0.37), and bathrooms (0.33).
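The ranking can be made explicit by sorting the coefficients. The sketch below copies the values from the output printed above into a named vector (the names `price_cor`, `ranked`, and `top3` are introduced here for illustration) and sorts them in decreasing order.

```r
# Correlations with price, copied from the printed output above
price_cor <- c(
  bedrooms = 0.20033629, bathrooms = 0.32710992, sqft_living = 0.43041003,
  sqft_lot = 0.05045130, floors = 0.15146080, waterfront = 0.13564832,
  view = 0.22850417, condition = 0.03491454, sqft_above = 0.36756960,
  sqft_basement = 0.21042657
)

# Rank from strongest to weakest positive correlation
ranked <- sort(price_cor, decreasing = TRUE)
top3 <- names(ranked)[1:3]
print(top3)
```

With the correlation matrix `r` in memory, something like `sort(r[-1, "price"], decreasing = TRUE)` should give the same ranking without retyping the numbers.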

Draw a scatterplot matrix with smoother lines for price and the three most strongly correlated variables.

library(car)
scatterplotMatrix(house_num[, c("price", "sqft_living", 
                                "bathrooms", "sqft_above")], 
                  spread=FALSE, 
                  smoother.args = list(lty = 2))

These variables are positively correlated with the target. Most house prices are low (below US$2.5M). The outliers in price can significantly influence the model.
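To illustrate how much a single extreme price can move a linear regression fit, here is a small sketch on made-up data (the `toy` data frame and the added outlier are assumptions for illustration, not observations from the dataset). It fits `lm()` with and without one extreme point and compares the slopes.

```r
# Toy data with an exact linear relationship: price = 200 * sqft_living
toy <- data.frame(
  sqft_living = c(1000, 1500, 2000, 2500, 3000),
  price = c(200000, 300000, 400000, 500000, 600000)
)
fit_clean <- lm(price ~ sqft_living, data = toy)

# Add one made-up outlier: a small house with a very high price
toy_out <- rbind(toy, data.frame(sqft_living = 1200, price = 5000000))
fit_out <- lm(price ~ sqft_living, data = toy_out)

slope_clean <- coef(fit_clean)[["sqft_living"]]
slope_out <- coef(fit_out)[["sqft_living"]]
print(c(clean = slope_clean, with_outlier = slope_out))
```

In this toy case the single outlier is enough to turn the fitted slope negative, which is why outliers deserve attention before modeling.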

Data Preparation

Modeling

Evaluation
