Library

library(corrplot)
library(ggplot2)

About Dataset

The aim of this report is to predict house sale prices in King County, Washington State, USA, using Multiple Linear Regression (MLR). The dataset consists of historical data on houses sold between May 2014 and May 2015.

The dataset contains house prices from King County, an area in the US state of Washington that includes Seattle. It was obtained from Kaggle. The file used here has 11 variables and 21597 observations.

data<-read.csv("C:/FIDYS/KULIAH/SMT 7/Bisnis Analitik/after ets/housesales.csv")
str(data)
## 'data.frame':    21597 obs. of  11 variables:
##  $ price        : num  221900 538000 180000 604000 510000 ...
##  $ bedrooms     : int  3 3 2 4 3 4 3 3 3 3 ...
##  $ bathrooms    : num  1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
##  $ floors       : num  1 2 1 1 1 1 2 1 1 2 ...
##  $ condition    : int  3 3 3 5 3 3 3 3 3 3 ...
##  $ grade        : int  7 7 6 7 8 11 7 7 7 7 ...
##  $ sqft_above   : int  1180 2170 770 1050 1680 3890 1715 1060 1050 1890 ...
##  $ sqft_basement: int  0 400 0 910 0 1530 0 0 730 0 ...
##  $ yr_built     : int  1955 1951 1933 1965 1987 2001 1995 1963 1960 2003 ...
##  $ sqft_living  : int  1340 1690 2720 1360 1800 4760 2238 1650 1780 2390 ...
##  $ sqft_lot     : int  5650 7639 8062 5000 7503 101930 6819 9711 8113 7570 ...

Data Summary

First, let’s take a look at the characteristics of the data.

summary(data)
##      price            bedrooms        bathrooms         floors     
##  Min.   :  78000   Min.   : 1.000   Min.   :0.500   Min.   :1.000  
##  1st Qu.: 322000   1st Qu.: 3.000   1st Qu.:1.750   1st Qu.:1.000  
##  Median : 450000   Median : 3.000   Median :2.250   Median :1.500  
##  Mean   : 540297   Mean   : 3.373   Mean   :2.116   Mean   :1.494  
##  3rd Qu.: 645000   3rd Qu.: 4.000   3rd Qu.:2.500   3rd Qu.:2.000  
##  Max.   :7700000   Max.   :33.000   Max.   :8.000   Max.   :3.500  
##    condition        grade          sqft_above   sqft_basement   
##  Min.   :1.00   Min.   : 3.000   Min.   : 370   Min.   :   0.0  
##  1st Qu.:3.00   1st Qu.: 7.000   1st Qu.:1190   1st Qu.:   0.0  
##  Median :3.00   Median : 7.000   Median :1560   Median :   0.0  
##  Mean   :3.41   Mean   : 7.658   Mean   :1789   Mean   : 291.7  
##  3rd Qu.:4.00   3rd Qu.: 8.000   3rd Qu.:2210   3rd Qu.: 560.0  
##  Max.   :5.00   Max.   :13.000   Max.   :9410   Max.   :4820.0  
##     yr_built     sqft_living      sqft_lot     
##  Min.   :1900   Min.   : 399   Min.   :   651  
##  1st Qu.:1951   1st Qu.:1490   1st Qu.:  5100  
##  Median :1975   Median :1840   Median :  7620  
##  Mean   :1971   Mean   :1987   Mean   : 12758  
##  3rd Qu.:1997   3rd Qu.:2360   3rd Qu.: 10083  
##  Max.   :2015   Max.   :6210   Max.   :871200
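# Histogram of sale prices; bar fill shades from white to coral as the bin count grows
ggplot(data, aes(price)) + 
  # after_stat(count) replaces the deprecated ..count.. notation; bins = 30 is the default, stated explicitly
  geom_histogram(col = "pink", bins = 30, aes(fill = after_stat(count))) +
  scale_fill_gradient("Count", low = "white", high = "coral") + 
  labs(title = "Price histogram", x = "Price", y = "Count")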

Exploratory Data Analysis

Boxplot

Scatter Plot

Numerical correlation

Next, let’s look at how the 10 numeric predictors correlate with the target, price, using a correlation matrix.

corrplot(cor(data), type="upper", order = "hclust", tl.col="black", tl.srt=45)

From the correlation plot above, price is highly correlated with several variables, such as the number of bathrooms, the square footage of living space, the grade, and the square footage of the house.
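
The ranking suggested by the plot can also be read off numerically. The following is a minimal sketch, assuming a small synthetic data frame (`toy`, with invented values and an invented price formula) in place of the real file; it uses only `cor()` and `sort()` from base R.

```r
# Synthetic stand-in for the housing data (all values are invented for illustration)
set.seed(1)
n <- 200
toy <- data.frame(
  grade       = sample(3:13, n, replace = TRUE),
  bathrooms   = sample(seq(0.5, 4, by = 0.25), n, replace = TRUE),
  sqft_living = round(runif(n, 400, 6000))
)

# Hypothetical price: driven by the three predictors plus random noise
toy$price <- 50000 * toy$grade + 30000 * toy$bathrooms +
  100 * toy$sqft_living + rnorm(n, sd = 50000)

# Correlation of every column with price, strongest first
cor_with_price <- sort(cor(toy)[, "price"], decreasing = TRUE)

# Drop the trivial self-correlation of price with itself
cor_with_price <- cor_with_price[names(cor_with_price) != "price"]

print(round(cor_with_price, 3))
```

On the real data, replacing `toy` with `data` should list the same correlations that the plot shows as colours, which makes it easier to state exactly which variables are most strongly related to price.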

Regression

model1<-lm(price~., data = data)
summary(model1)
## 
## Call:
## lm(formula = price ~ ., data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1272195  -115700   -12087    89874  4482102 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.943e+06  1.360e+05  51.066  < 2e-16 ***
## bedrooms      -4.931e+04  2.118e+03 -23.280  < 2e-16 ***
## bathrooms      5.027e+04  3.641e+03  13.807  < 2e-16 ***
## floors         3.314e+04  3.961e+03   8.366  < 2e-16 ***
## condition      1.939e+04  2.581e+03   7.512 6.07e-14 ***
## grade          1.258e+05  2.356e+03  53.390  < 2e-16 ***
## sqft_above     1.663e+02  4.082e+00  40.753  < 2e-16 ***
## sqft_basement  1.947e+02  4.824e+00  40.362  < 2e-16 ***
## yr_built      -3.978e+03  6.981e+01 -56.986  < 2e-16 ***
## sqft_living    3.560e+01  3.734e+00   9.536  < 2e-16 ***
## sqft_lot      -5.051e-01  5.857e-02  -8.624  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 226400 on 21586 degrees of freedom
## Multiple R-squared:  0.6204, Adjusted R-squared:  0.6203 
## F-statistic:  3529 on 10 and 21586 DF,  p-value: < 2.2e-16

Result

The overall F-test (F = 3529 on 10 and 21586 degrees of freedom, p < 2.2e-16) leads us to reject the null hypothesis that price is unrelated to the predictors, in favour of the alternative hypothesis that at least one predictor has a nonzero coefficient. Each individual coefficient is also significant at the 0.05 level. The model accounts for 62.04% of the variance in price (multiple R-squared 0.6204, adjusted R-squared 0.6203). The very small p-value shows that the predictors are jointly informative, but it does not by itself make the fit accurate: the residual standard error is 226400. We also predicted the prices with the model, and the predictions deviated from the actual values.
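
The quantities discussed in this section can be extracted from a fitted model programmatically rather than read from the printed summary. This is a minimal sketch on synthetic data: the `toy` data frame, its coefficients, and its noise level are all invented, and only functions from base R and the `stats` package (`lm()`, `summary()`, `pf()`, `predict()`) are used.

```r
# Synthetic stand-in for the housing data (all values are invented for illustration)
set.seed(42)
n <- 500
toy <- data.frame(
  grade      = sample(3:13, n, replace = TRUE),
  sqft_above = round(runif(n, 400, 5000))
)
toy$price <- 100000 * toy$grade + 150 * toy$sqft_above + rnorm(n, sd = 200000)

# Fit price on all other columns, as in the report
fit <- lm(price ~ ., data = toy)
s <- summary(fit)

# Share of variance explained, with and without the adjustment for model size
r2 <- s$r.squared
adj_r2 <- s$adj.r.squared

# Overall F-test: s$fstatistic is a named vector (value, numdf, dendf)
f <- s$fstatistic
p_overall <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))

# In-sample predictions and their root mean squared error
pred <- predict(fit, newdata = toy)
rmse <- sqrt(mean((toy$price - pred)^2))

cat("R-squared:          ", round(r2, 4), "\n")
cat("Adjusted R-squared: ", round(adj_r2, 4), "\n")
cat("Overall p-value:    ", format.pval(p_overall), "\n")
cat("Residual std. error:", round(s$sigma), "\n")
cat("In-sample RMSE:     ", round(rmse), "\n")
```

Reporting the RMSE or residual standard error next to R-squared keeps the size of the prediction errors visible in the units of price, which a p-value alone does not convey.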

Conclusion

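We investigated how house prices vary with the features of the house. In general, price increased as amenities increased: grade, bathrooms, floors, condition, and the square-footage variables for the house all have positive coefficients, while bedrooms, yr_built, and sqft_lot have negative coefficients once the other variables are held fixed. Houses in King County were priced higher when they offered extras such as a basement: each additional square foot of basement is associated with a price about 195 higher, holding the other variables fixed.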