library(corrplot)
library(ggplot2)
The aim of this report is to predict house sale prices in King County, Washington State, USA, using multiple linear regression (MLR). The dataset consists of historical data on houses sold between May 2014 and May 2015.
The dataset contains prices of houses sold in King County, an area in the US state of Washington that includes Seattle. It was obtained from Kaggle. The version used here consists of 11 variables and 21,597 observations.
data<-read.csv("C:/FIDYS/KULIAH/SMT 7/Bisnis Analitik/after ets/housesales.csv")
str(data)
## 'data.frame': 21597 obs. of 11 variables:
## $ price : num 221900 538000 180000 604000 510000 ...
## $ bedrooms : int 3 3 2 4 3 4 3 3 3 3 ...
## $ bathrooms : num 1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
## $ floors : num 1 2 1 1 1 1 2 1 1 2 ...
## $ condition : int 3 3 3 5 3 3 3 3 3 3 ...
## $ grade : int 7 7 6 7 8 11 7 7 7 7 ...
## $ sqft_above : int 1180 2170 770 1050 1680 3890 1715 1060 1050 1890 ...
## $ sqft_basement: int 0 400 0 910 0 1530 0 0 730 0 ...
## $ yr_built : int 1955 1951 1933 1965 1987 2001 1995 1963 1960 2003 ...
## $ sqft_living : int 1340 1690 2720 1360 1800 4760 2238 1650 1780 2390 ...
## $ sqft_lot : int 5650 7639 8062 5000 7503 101930 6819 9711 8113 7570 ...
First, let’s take a look at the characteristics of the data.
summary(data)
## price bedrooms bathrooms floors
## Min. : 78000 Min. : 1.000 Min. :0.500 Min. :1.000
## 1st Qu.: 322000 1st Qu.: 3.000 1st Qu.:1.750 1st Qu.:1.000
## Median : 450000 Median : 3.000 Median :2.250 Median :1.500
## Mean : 540297 Mean : 3.373 Mean :2.116 Mean :1.494
## 3rd Qu.: 645000 3rd Qu.: 4.000 3rd Qu.:2.500 3rd Qu.:2.000
## Max. :7700000 Max. :33.000 Max. :8.000 Max. :3.500
## condition grade sqft_above sqft_basement
## Min. :1.00 Min. : 3.000 Min. : 370 Min. : 0.0
## 1st Qu.:3.00 1st Qu.: 7.000 1st Qu.:1190 1st Qu.: 0.0
## Median :3.00 Median : 7.000 Median :1560 Median : 0.0
## Mean :3.41 Mean : 7.658 Mean :1789 Mean : 291.7
## 3rd Qu.:4.00 3rd Qu.: 8.000 3rd Qu.:2210 3rd Qu.: 560.0
## Max. :5.00 Max. :13.000 Max. :9410 Max. :4820.0
## yr_built sqft_living sqft_lot
## Min. :1900 Min. : 399 Min. : 651
## 1st Qu.:1951 1st Qu.:1490 1st Qu.: 5100
## Median :1975 Median :1840 Median : 7620
## Mean :1971 Mean :1987 Mean : 12758
## 3rd Qu.:1997 3rd Qu.:2360 3rd Qu.: 10083
## Max. :2015 Max. :6210 Max. :871200
# Histogram of sale prices; bar fill is mapped to the bin count
ggplot(data, aes(price)) +
  geom_histogram(bins = 30, col = "pink", aes(fill = after_stat(count))) +
  scale_fill_gradient("Count", low = "white", high = "coral") +
  labs(title = "Price histogram", x = "Price", y = "Count")
Next, let’s look at how the numeric variables correlate with the target, price, using a correlation matrix.
corrplot(cor(data), type="upper", order = "hclust", tl.col="black", tl.srt=45)
From the correlation plot above, we see that price is highly correlated with several variables, such as the number of bathrooms, square footage of living space, grade, and square footage of the house.
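To read the same information as numbers rather than colors, the price column of the correlation matrix can be sorted. The sketch below is only illustrative: the data frame toy holds the first five rows of a few columns, copied from the str(data) output above, so the block runs without the CSV file. Five rows are far too few to draw conclusions from; on the full data, toy would be replaced by data.

```r
# Toy stand-in for `data`: first five rows copied from the str() output above.
toy <- data.frame(
  price       = c(221900, 538000, 180000, 604000, 510000),
  bathrooms   = c(1, 2.25, 1, 3, 2),
  grade       = c(7, 7, 6, 7, 8),
  sqft_above  = c(1180, 2170, 770, 1050, 1680),
  sqft_living = c(1340, 1690, 2720, 1360, 1800)
)

# Take the "price" column of the correlation matrix and drop price itself.
price_cor <- cor(toy)[, "price"]
price_cor <- price_cor[names(price_cor) != "price"]

# Rank the remaining variables from most positively to most negatively correlated.
ranked <- sort(price_cor, decreasing = TRUE)
print(round(ranked, 2))
```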
model1<-lm(price~., data = data)
summary(model1)
##
## Call:
## lm(formula = price ~ ., data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1272195 -115700 -12087 89874 4482102
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.943e+06 1.360e+05 51.066 < 2e-16 ***
## bedrooms -4.931e+04 2.118e+03 -23.280 < 2e-16 ***
## bathrooms 5.027e+04 3.641e+03 13.807 < 2e-16 ***
## floors 3.314e+04 3.961e+03 8.366 < 2e-16 ***
## condition 1.939e+04 2.581e+03 7.512 6.07e-14 ***
## grade 1.258e+05 2.356e+03 53.390 < 2e-16 ***
## sqft_above 1.663e+02 4.082e+00 40.753 < 2e-16 ***
## sqft_basement 1.947e+02 4.824e+00 40.362 < 2e-16 ***
## yr_built -3.978e+03 6.981e+01 -56.986 < 2e-16 ***
## sqft_living 3.560e+01 3.734e+00 9.536 < 2e-16 ***
## sqft_lot -5.051e-01 5.857e-02 -8.624 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 226400 on 21586 degrees of freedom
## Multiple R-squared: 0.6204, Adjusted R-squared: 0.6203
## F-statistic: 3529 on 10 and 21586 DF, p-value: < 2.2e-16
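The multiple and adjusted R-squared values in the summary can be recomputed from the residuals, which makes clear what they measure. The following is a minimal sketch on simulated data, so that it runs without the CSV file; sim, fit, and the numbers used to generate prices are assumptions made for illustration, not values from this report.

```r
set.seed(1)
n <- 200

# Simulated stand-in for the housing data (hypothetical values).
sim <- data.frame(
  grade      = sample(3:13, n, replace = TRUE),
  sqft_above = round(runif(n, 370, 9410))
)
sim$price <- 50000 * sim$grade + 150 * sim$sqft_above + rnorm(n, sd = 100000)

fit <- lm(price ~ ., data = sim)

# R-squared = 1 - (residual sum of squares) / (total sum of squares).
rss <- sum(residuals(fit)^2)
tss <- sum((sim$price - mean(sim$price))^2)
r2  <- 1 - rss / tss

# Adjusted R-squared penalizes the number of predictors p.
p      <- length(coef(fit)) - 1
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Both values should agree with what summary() reports.
s <- summary(fit)
print(c(manual = r2, reported = s$r.squared))
print(c(manual = adj_r2, reported = s$adj.r.squared))
```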
The overall F-test gives p < 2.2e-16, well below 0.05, so we reject the null hypothesis that house price is independent of the predictors in favor of the alternative hypothesis that at least one predictor is related to price. Each coefficient is also individually significant at the 0.001 level. The multiple R-squared is higher than that of the former model, so this model is the better fit to the available data; it accounts for 62.04% of the variance in price. We also predicted prices with this model, and the predictions deviated from the actual values, as expected given the residual standard error of about 226,400.
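Comparing predicted prices with actual prices can be sketched as follows. The data are simulated so that the block runs on its own; sim, fit, and the numbers used to generate prices are assumptions, not values from this report. With the real data, predict(model1, newdata = data) would take the place of the simulated fit.

```r
set.seed(42)
n <- 300

# Simulated stand-in for the housing data (hypothetical values).
sim <- data.frame(
  grade      = sample(3:13, n, replace = TRUE),
  sqft_above = round(runif(n, 370, 9410))
)
sim$price <- 50000 * sim$grade + 150 * sim$sqft_above + rnorm(n, sd = 100000)

fit <- lm(price ~ ., data = sim)

# Predicted prices and their deviation from the actual prices.
pred   <- predict(fit, newdata = sim)
errors <- sim$price - pred

# Two common summaries of prediction error, in the same units as price.
rmse <- sqrt(mean(errors^2))
mae  <- mean(abs(errors))

print(head(data.frame(actual = sim$price, predicted = pred, error = errors)))
print(c(RMSE = rmse, MAE = mae))
```

These errors are measured on the same rows used to fit the model, so they are likely to understate the error on houses the model has not seen.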
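We investigated how house prices vary with the features of a house. Holding the other variables fixed, price increased with the number of bathrooms and floors, condition, grade, and living and basement square footage, and decreased with the number of bedrooms, year built, and lot size. Houses in King County were also priced higher when they offered amenities such as a basement: each additional square foot of basement was associated with an increase of about 195 in price.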
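The direction of each effect can be read off the signs of the coefficients. In the minimal sketch below, the estimates are copied by hand from the summary(model1) output above; the vector name coefs and the 500 square foot basement are illustrative assumptions.

```r
# Estimates copied from the summary(model1) output (intercept omitted).
coefs <- c(
  bedrooms      = -4.931e+04,
  bathrooms     =  5.027e+04,
  floors        =  3.314e+04,
  condition     =  1.939e+04,
  grade         =  1.258e+05,
  sqft_above    =  1.663e+02,
  sqft_basement =  1.947e+02,
  yr_built      = -3.978e+03,
  sqft_living   =  3.560e+01,
  sqft_lot      = -5.051e-01
)

# Group variable names by the sign of their coefficient.
direction <- split(names(coefs), ifelse(coefs > 0, "positive", "negative"))
print(direction)

# Estimated price difference for a hypothetical 500 sq ft basement,
# with all other variables held fixed.
basement_effect <- unname(coefs["sqft_basement"]) * 500
print(basement_effect)
```

Each coefficient is a partial effect, so a negative sign (as for bedrooms) describes the change in price when that variable increases while the others, including square footage, stay the same.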