MATH1324 - Introduction to Statistics - Assignment 3

Simple Linear Regression - Predict House Prices

Hari Krishnan S3797479

Last updated: 31 May, 2019

RPubs link information

RPubs link: http://rpubs.com/harisr/SimpleLinearRegressionPredictHousePrices

Introduction

Housing prices are subject to both external factors (interest rates, price of goods, manufacturing, import/export and government subsidies) and internal factors (number of bedrooms, number of bathrooms, square footage, etc.).
Our goal for this assignment is to use simple linear regression to predict the sale price of a house in King County, USA, using one internal factor: Living Area.

Introduction Cont.

We will use simple linear regression to examine the relationship between House Price and Living Area (sqft).
Linear regression assumes that the relationship between the predictor and the dependent variable can be described by a straight line.
We write a simple linear regression equation as:

y=α+βx+ϵ

where y is the dependent variable
α is the constant/intercept
β is the slope
x is the predictor and
ϵ is the random error/residuals.
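
As a minimal sketch of what fitting this equation means, the R code below simulates data from a known straight line and recovers the intercept and slope by ordinary least squares. The parameter values (alpha = 2, beta = 3), the sample size and the seed are made up for illustration and are not taken from the house data.

```r
# Simulate data from y = alpha + beta * x + error with known (made-up) parameters
set.seed(1)
alpha <- 2
beta  <- 3
x <- runif(200, min = 0, max = 10)
y <- alpha + beta * x + rnorm(200, mean = 0, sd = 1)

# Ordinary least squares fit; coef() returns the estimated intercept and slope
fit <- lm(y ~ x)
est <- coef(fit)
print(est)

# In simple linear regression the estimates also have a closed form
slope_by_hand     <- cov(x, y) / var(x)
intercept_by_hand <- mean(y) - slope_by_hand * mean(x)
```

The estimates from lm() should be close to the values used in the simulation and should match the closed-form calculation.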

Problem Statement

Can Living Area (Sqft) be used to predict the House Price?

Solution:

Use a scatter plot to visualise the relationship between House Price and Living Area.
If the relationship is non-linear, check whether skewness in the variables is the cause. If so, apply a transformation to correct the skewness.
Use ordinary least squares to fit a linear regression model to the data.
Assess the fit.
Test the main assumptions for linear regression.
Interpret and test the statistical significance of the regression intercept and slope.
Use the estimated linear regression model and a given Living Area (sqft) to predict House Price.
Compare the predicted House Price to the actual House Price.
Draw an overall conclusion.

Data

The dataset contains prices of Houses in King County, USA. It includes homes sold between May 2014 and May 2015.
The data set (kc_house_data.csv) was obtained from Kaggle and is released under a Creative Commons licence.
URL - https://www.kaggle.com/harlfoxem/housesalesprediction/metadata
The dataset has 21613 observations and 21 columns.

## Load the packages that provide read_csv() and the %>% pipe
library(readr)
library(dplyr)
## Import the data
House_Data <- read_csv("kc_house_data.csv")
head(House_Data, 10)

Data Cont.

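price (House Price) is the dependent variable (US dollars).
sqft_living (Living Area) is the predictor variable (square feet).
Outliers in price are removed. These are the values beyond the boxplot whiskers, that is, more than 1.5 × IQR from the quartiles.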

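# Draw the boxplot and keep the statistics it returns
x <- boxplot(House_Data$price)

# Prices beyond the whiskers are flagged as outliers
outliers <- x$out
# Remove rows containing the outliers (logical indexing is safe even if none are found)
House_Data <- House_Data[!(House_Data$price %in% outliers), ]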
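
To illustrate the boxplot rule on a small scale, here is a sketch with a made-up vector rather than the house data. boxplot.stats() applies the same rule as boxplot() without drawing a plot.

```r
# Toy vector with one obviously extreme value (illustrative only)
x <- c(1, 2, 3, 4, 5, 100)

# Values more than 1.5 * IQR beyond the hinges are returned in $out
stats   <- boxplot.stats(x)
flagged <- stats$out

# Logical indexing keeps everything that was not flagged
kept <- x[!(x %in% flagged)]

# Pitfall: negative indexing with which() returns nothing when no value is flagged
y <- c(1, 2, 3, 4, 5)
dropped_everything <- y[-which(y %in% numeric(0))]
```

Here `flagged` contains only 100. `dropped_everything` is empty, which is why logical indexing is the safer choice.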

Descriptive Statistics and Visualisation

Summary statistics for variable Price

House_Data %>%
  summarise(Min = min(price, na.rm = TRUE),
            Q1 = quantile(price, probs = .25, na.rm = TRUE),
            Median = median(price, na.rm = TRUE),
            Q3 = quantile(price, probs = .75, na.rm = TRUE),
            Max = max(price, na.rm = TRUE),
            Mean = mean(price, na.rm = TRUE),
            SD = sd(price, na.rm = TRUE),
            n = n(),
            Missing = sum(is.na(price))) -> table1
knitr::kable(table1)

Min	Q1	Median	Q3	Max	Mean	SD	n	Missing
75000	315000	437500	6e+05	1127500	476984.6	208371.3	20467	0

Summary statistics for variable sqft_living

House_Data %>%
  summarise(Min = min(sqft_living, na.rm = TRUE),
            Q1 = quantile(sqft_living, probs = .25, na.rm = TRUE),
            Median = median(sqft_living, na.rm = TRUE),
            Q3 = quantile(sqft_living, probs = .75, na.rm = TRUE),
            Max = max(sqft_living, na.rm = TRUE),
            Mean = mean(sqft_living, na.rm = TRUE),
            SD = sd(sqft_living, na.rm = TRUE),
            n = n(),
            Missing = sum(is.na(sqft_living))) -> table2
knitr::kable(table2)

Min	Q1	Median	Q3	Max	Mean	SD	n	Missing
290	1400	1860	2431	7480	1975.558	774.8335	20467	0

Descriptive Statistics Cont.

Use a scatter plot to visualise the relationship between House Price (dollars) and Living Area (sqft).
The relationship is not linear.

plot(price ~ sqft_living, data = House_Data)

Descriptive Statistics Cont.

A non-linear relationship can often be traced back to skewness in the variables. To correct the skewness, use a log transformation.
The left column shows that the histograms of the raw variables are skewed.
The right column shows the histograms after the log transformation.

par(mfrow=c(2,2))
House_Data$price %>% hist(main = "House Price")
log(House_Data$price) %>% hist(main = "log(House Price)")
House_Data$sqft_living %>% hist(main = "sqft living Area")
log(House_Data$sqft_living) %>% hist(main = "log(sqft living Area)")

par(mfrow=c(1,1))
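
As a numeric complement to the histograms, the sketch below measures skewness before and after a log transformation. It uses simulated right-skewed (log-normal) data, not the house data, and the `skewness` helper is a hypothetical hand-written function, not one from a package.

```r
# Simulated right-skewed values (illustrative stand-in for prices)
set.seed(42)
x <- rlnorm(5000, meanlog = 6, sdlog = 0.5)

# Moment-based sample skewness: third central moment / variance^(3/2)
skewness <- function(v) {
  m <- mean(v)
  mean((v - m)^3) / (mean((v - m)^2))^1.5
}

skew_raw <- skewness(x)       # clearly positive for right-skewed data
skew_log <- skewness(log(x))  # close to zero after the log transformation
print(c(raw = skew_raw, log = skew_log))
```

The same helper could be applied to House_Data$price and House_Data$sqft_living to check how much the log transformation reduces their skewness.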

Descriptive Statistics Cont.

After the log transformation, use a scatter plot to visualise the relationship again.
The relationship is positive and linear.

plot(log(price) ~ log(sqft_living), data = House_Data)

Hypothesis Testing - Testing the overall Linear Regression Model

H0: The data do not fit the linear regression model.
HA: The data fit the linear regression model.
We test the overall model using the F-test.

model1 <- lm(log(price) ~ log(sqft_living), data = House_Data)
model1 %>% summary()

## 
## Call:
## lm(formula = log(price) ~ log(sqft_living), data = House_Data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.17058 -0.27537  0.03114  0.26054  1.10186 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      7.876884   0.046980   167.7   <2e-16 ***
## log(sqft_living) 0.679216   0.006245   108.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3551 on 20465 degrees of freedom
## Multiple R-squared:  0.3663, Adjusted R-squared:  0.3662 
## F-statistic: 1.183e+04 on 1 and 20465 DF,  p-value: < 2.2e-16

model1 %>% confint()

##                      2.5 %   97.5 %
## (Intercept)      7.7847985 7.968969
## log(sqft_living) 0.6669743 0.691457

The p-value for the F-test was very small: F(1, 20465) = 1.183e+04, p < .001.
We reject the null hypothesis (H0), so the model is statistically significant.
The data fit the linear regression model.
log(Living Area) explained 36.63% of the variability in log(House Price).
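
The sketch below shows how R-squared and the F-statistic relate to each other and where they sit in the summary() object. It uses R's built-in `cars` data as a stand-in so that it runs without the house data. The variable names are illustrative.

```r
# Fit a simple linear regression on a built-in data set (stand-in for the house data)
fit <- lm(dist ~ speed, data = cars)
s   <- summary(fit)

# R-squared by hand: 1 - residual sum of squares / total sum of squares
sst <- sum((cars$dist - mean(cars$dist))^2)
ssr <- sum(residuals(fit)^2)
r2_by_hand <- 1 - ssr / sst

# summary() stores the F-statistic as a named vector: value, numdf, dendf
f <- s$fstatistic
f_by_hand <- (r2_by_hand / f[["numdf"]]) / ((1 - r2_by_hand) / f[["dendf"]])

# p-value of the overall F-test (upper tail of the F distribution)
p_value <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)
```

Applying the same identity to the reported values gives 0.3663 / (1 - 0.3663) × 20465 ≈ 11830, which agrees with the F-statistic of 1.183e+04 in the model output.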

Hypothesis Testing Cont. Testing the Linear Regression Model Parameters

Hypotheses for linear regression model parameters:

Intercept:

H0 : α = 0 HA : α ≠ 0

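The intercept, 7.88, is the estimated mean of log(price) when log(sqft_living) = 0, that is, when Living Area = 1 sqft. This is far below the smallest observed Living Area (290 sqft), so the intercept has no practical interpretation on its own.
The intercept of the regression was statistically significant, a = 7.88, p < .001, 95% CI (7.78, 7.97).
H0: α = 0 is clearly not captured by this interval, so it was rejected.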

Slope:

H0 : β = 0 HA : β ≠ 0

Because both variables are log-transformed, the slope is interpreted in percentage terms: for every 1% increase in Living Area (sqft), House Price was estimated to increase on average by about 0.68%.
The slope of the regression for Living Area was statistically significant, b = 0.68, p < .001, 95% CI (0.67, 0.69).
H0: β = 0 is clearly not captured by this interval, so it was rejected.

The line of best fit: log(price) = 7.88 + 0.68 * log(sqft_living)
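
In a log-log model, multiplying Living Area by a factor k multiplies the predicted price by k^b. The sketch below turns that into percentage changes using the rounded slope b = 0.68. The helper name `pct_change_price` is hypothetical.

```r
# Rounded slope from the fitted log-log model
b <- 0.68

# Predicted % change in price for a given % change in living area
pct_change_price <- function(pct_area, slope = b) {
  ((1 + pct_area / 100)^slope - 1) * 100
}

change_1   <- pct_change_price(1)    # roughly the slope itself, about 0.68%
change_10  <- pct_change_price(10)   # about 6.7%
change_100 <- pct_change_price(100)  # doubling the area: about 60%
print(c(change_1, change_10, change_100))
```

The "1% gives about b%" reading is an approximation that works well for small changes. For larger changes the exact k^b form is more accurate.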

Hypothesis Testing Cont. Check Assumptions

Independence: Independence was assumed as each House Price and Living Area pair came from a different house.
Linearity: The scatter plot of the log-transformed variables suggested a linear relationship. There were no non-linear trends in the Residuals vs. Fitted plot.

model1 %>% plot(which = 1)

Hypothesis Testing Cont. Check Assumptions

Normality of residuals: Normal Q-Q plot didn’t show any obvious departures from normality.

model1 %>% plot(which = 2)

Hypothesis Testing Cont. Check Assumptions

Homoscedasticity: Homoscedasticity looked OK according to the scale-location plot. The variance in residuals appeared constant across predicted values.

model1 %>% plot(which = 3)

Hypothesis Testing Cont. Check Assumptions

Influential cases: There appeared to be no influential cases.

model1 %>% plot(which = 5)
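
The Residuals vs. Leverage plot can be complemented with a numeric check. The sketch below computes Cook's distance and flags observations above 4/n, which is one common rule of thumb and not a formal test. It uses the built-in `cars` data as a stand-in, so the flagged rows say nothing about the house data.

```r
# Stand-in model on a built-in data set
fit <- lm(dist ~ speed, data = cars)

# Cook's distance measures how much each observation shifts the fitted values
cd <- cooks.distance(fit)

# Rule-of-thumb cut-off (an assumption; other conventions exist, e.g. cd > 1)
threshold <- 4 / nrow(cars)
flagged   <- which(cd > threshold)
print(flagged)
```

With the house model, replacing `fit` by `model1` and `nrow(cars)` by `nrow(House_Data)` would give the equivalent check.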

Correlation

A Pearson's correlation was calculated to measure the strength of the linear relationship between log(House Price) and log(Living Area).
The positive correlation was statistically significant, r = 0.61, p < .001, 95% CI [0.60, 0.61].

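library(psychometric)
# Correlation between the log-transformed variables
r <- cor(log(House_Data$price), log(House_Data$sqft_living))
# 95% confidence interval for r, using the current number of rows
CIr(r = r, n = nrow(House_Data), level = .95)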

## [1] 0.5964446 0.6138103
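
A confidence interval for a correlation is commonly built with the Fisher z-transformation. The base-R sketch below works under that assumption. `fisher_ci` is a hypothetical helper and is not claimed to be the code inside `CIr`. The input r = 0.6052 is an approximation taken as the square root of the reported R-squared (0.3663).

```r
# Confidence interval for a correlation via the Fisher z-transformation
fisher_ci <- function(r, n, level = 0.95) {
  z    <- atanh(r)                      # transform r to the z scale
  se   <- 1 / sqrt(n - 3)               # standard error on the z scale
  crit <- qnorm(1 - (1 - level) / 2)    # two-sided normal critical value
  tanh(z + c(-1, 1) * crit * se)        # back-transform the interval limits
}

ci <- fisher_ci(r = 0.6052, n = 20467)
print(ci)
```

The limits should land close to the interval reported above. Small differences in later decimal places come from the rounded r.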

Linear Regression - Interpretation

Simple linear regression summary:

Linearity was assumed, normality of residuals OK, homoscedasticity OK, no influential cases.
r = 0.61, R² = 0.37
Model ANOVA, F(1, 20465) = 1.183e+04, p < .001
a = 7.88, p < .001, 95% CI (7.78, 7.97)
b = 0.68, p < .001, 95% CI (0.67, 0.69)

Decision:

Overall model: Reject H0.
Intercept: Reject H0.
Slope: Reject H0.
log(price)= 7.88 + 0.68 * log(sqft_living)

What do we conclude?:

There was a statistically significant positive linear relationship between log(Living Area) (sqft) and log(House Price).

Discussion

Major Findings

There was a statistically significant positive linear relationship between log(Living Area) (sqft) and log(House Price).
log(price) = 7.88 + 0.68 * log(sqft_living)
Example: what is the predicted average price of a house with a living area of 1000 sqft?
log(1000) = 6.91
log(price) = 7.88 + 0.68 * 6.91 = 12.58
price = exp(12.58) = 290686.32
Predicted average price = 290686.32
Actual average price = 318933.40
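
The hand calculation can be wrapped in a small function. This is a sketch using the rounded coefficients from the report. `predict_price` is a hypothetical helper, and the commented line shows how the fitted `model1` object could be used instead.

```r
# Back-transformed prediction from the log-log model, using rounded coefficients
predict_price <- function(sqft, a = 7.88, b = 0.68) {
  exp(a + b * log(sqft))
}

p1000 <- predict_price(1000)
print(p1000)

# With the fitted model available, the unrounded coefficients can be used instead:
# exp(predict(model1, newdata = data.frame(sqft_living = 1000)))
```

The function does not round log(1000) or the intermediate sum, so it returns roughly 289,900, slightly below the hand-calculated figure. Exponentiating a predicted log value estimates the geometric mean of price, not the arithmetic mean. The geometric mean is typically lower, which may account for part of the gap to the actual average price.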

Strengths

Applied the techniques learnt in lectures.
Identified Living Area as the predictor variable because it had a higher correlation with price than the other variables.
The log transformation worked best of the transformations tried.

Limitations

No dataset could be found for more recent years.
Only one year of data was available.
The accuracy of the data could not be verified.
Other factors also affect house prices, so prediction errors are unavoidable with a single predictor.

Directions for future investigations

The analysis suggests that as the square footage of Living Area increases, the price of the house increases.
In future I would like to consider other variables that may also affect house prices.

References

Dataset obtained from Kaggle: https://www.kaggle.com/harlfoxem/housesalesprediction
Image in Introduction obtained from https://www.kaggle.com/c/house-prices-advanced-regression-techniques and Google Maps.
Lecture notes
Lecture class worksheet
Lecture module exercises
Lecture slides
R function documentation found through Google search.