MATH1324 Assignment 3

House sales price prediction using Linear Regression analysis

VENKATESH RAMSHETTY VENKATARAMANA (s3779142), JEEVAN HEMMANNU THARANATHA (s3755598)

Last updated: 02 June, 2019

Introduction

In this investigation we examine the relationship between two continuous variables so that one of them can be estimated from the other.

The following procedures are carried out to achieve this goal:

• Interpret simple scatter plots visualising bivariate data.

• Fit a simple linear regression and perform hypothesis tests of the various model components.

• Test the assumptions behind linear regression analysis and detect when they are violated.

• Interpret the outcome of simple linear regression analysis.

Introduction Cont.

Linear Regression methods assume that a predictor variable, x, provides information about some dependent variable, y. We write a simple linear regression equation as:

\[y = \alpha + \beta x+ \epsilon\]

where y is the dependent variable, α is the constant/intercept, β is the slope, x is the predictor and ϵ is the random error/residuals.
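To make the roles of α, β and ϵ concrete, the following minimal sketch simulates data from this equation and recovers the coefficients with lm(). The true values (alpha = 2, beta = 0.5), the sample size and the seed are arbitrary assumptions chosen for illustration; they are not taken from our dataset.

```r
# Simulate y = alpha + beta * x + epsilon with assumed (illustrative) parameters
set.seed(1)
n     <- 1000
alpha <- 2      # assumed intercept
beta  <- 0.5    # assumed slope
x     <- runif(n, min = 0, max = 10)
eps   <- rnorm(n, mean = 0, sd = 1)   # random error term
y     <- alpha + beta * x + eps

# Fit the simple linear regression and extract the estimated coefficients
sim_fit <- lm(y ~ x)
est     <- coef(sim_fit)
print(est)
```

The estimates should land near the assumed alpha and beta, with the gap shrinking as n grows.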

Problem Statement

The problem statement for our dataset is:

• Can the living area (in square feet) of a house in King County, USA, be used to predict its sale price (in dollars)?

We follow the step-by-step procedure below to address this problem:

• First, we check whether the relationship between the x and y variables (sqft_living and price in our case) is linear. If it is not, we transform the variables with a mathematical function such as log or square root.

• Then, we fit a linear regression and perform a hypothesis test for the overall model.

• If the model is statistically significant, we test the model parameters, check all assumptions and conclude our hypothesis tests.

Data

The dataset is open data from Kaggle.com. It contains house sale prices for King County, USA, for homes sold between May 2014 and May 2015. Details of the dataset are given below.

• Dataset link: https://www.kaggle.com/harlfoxem/housesalesprediction

• Number of Observations: 21613

• Number of attributes: 21

• Number of Missing values: 0

• Number of Duplicate entries: 0

• Attributes from our dataset used in this statistical analysis are price and sqft_living
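The counts listed above can be checked with a few base R calls. The sketch below wraps them in a hypothetical helper, describe_data(), and runs it on a small toy data frame rather than on the real file, so the printed numbers are for the toy data only.

```r
# Hypothetical helper: basic structure checks for any data frame
describe_data <- function(df) {
  list(
    observations = nrow(df),
    attributes   = ncol(df),
    missing      = sum(is.na(df)),        # total number of NA cells
    duplicates   = sum(duplicated(df))    # rows identical to an earlier row
  )
}

# Toy data frame (not the real dataset) with one NA and one duplicated row
toy <- data.frame(price       = c(100, 200, 200, NA),
                  sqft_living = c(1000, 1500, 1500, 1200))
toy_info <- describe_data(toy)
str(toy_info)
```

Calling describe_data(house) after the data are loaded would give the corresponding figures for the full dataset.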

Data Cont.

• price is the target variable to be predicted (in dollars) and sqft_living is the living area of the home (in square feet)

• The box plots below show the outliers in price and sqft_living; values with sqft_living > 7000 and price > 4000000 stand out.

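library(readr)  # read_csv()
library(dplyr)  # %>%, filter(), summarise()

house <- read_csv("C:/Users/jeeva/Desktop/kc_house_data.csv")

# Side-by-side box plots of the two variables
par(mfrow = c(1, 2))
boxplot(house$price, ylab = "Price in $")
boxplot(house$sqft_living, ylab = "Size of living area in sqft")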
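The points that boxplot() draws beyond the whiskers can also be listed numerically. The sketch below uses boxplot.stats(), which by default flags values more than 1.5 times the hinge spread beyond the hinges. The helper name flag_boxplot_outliers() and the toy vector are assumptions for illustration only.

```r
# boxplot.stats() returns the points drawn beyond the whiskers,
# using the default coefficient of 1.5 times the hinge spread
flag_boxplot_outliers <- function(x) {
  boxplot.stats(x)$out
}

toy_price <- c(1, 2, 3, 4, 5, 100)   # toy values, not the real prices
outliers  <- flag_boxplot_outliers(toy_price)
print(outliers)
```

Note that the cut-offs used later in this analysis (7000 sqft and $4,000,000) were chosen by eye from the plots and are stricter than this default rule would be.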

Descriptive Statistics and Visualisation

• Draw a scatter plot with sqft_living on the x-axis and price on the y-axis.

plot(price ~ sqft_living, data = house)

Descriptive Statistics Cont.

• Filter out extreme values of price and sqft_living: observations with sqft_living of 7000 or more, sqft_living of 600 or less, or price of 4000000 or more are treated as outliers. Then draw the scatter plot again.

house1 <- house %>% filter(sqft_living < 7000 & sqft_living > 600 & price < 4000000)
plot(price ~ sqft_living, data = house1)
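It can be useful to count how many rows such a filter removes. The sketch below mirrors the three conditions in base R on a toy data frame. filter_houses() is a hypothetical helper and the toy rows are invented for illustration.

```r
# Hypothetical helper mirroring the filter above in base R
filter_houses <- function(df) {
  keep <- df$sqft_living < 7000 & df$sqft_living > 600 & df$price < 4000000
  df[keep, , drop = FALSE]
}

# Toy rows (not the real data): rows 2, 3 and 4 each violate one condition
toy <- data.frame(sqft_living = c(2000, 500, 8000, 3000),
                  price       = c(450000, 200000, 900000, 5000000))
toy1    <- filter_houses(toy)
dropped <- nrow(toy) - nrow(toy1)
print(dropped)
```

For the real data, the summary tables report n = 21525 against 21613 original observations, so the filter removes 88 rows.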

Descriptive Statistics Cont.

• The linear trend is now more concentrated and clearer.

• However, the relationship between sqft_living and price does not appear to be perfectly linear.

• We apply a log transformation to both variables and check whether it improves linearity.

plot(log(price) ~ log(sqft_living), data = house1)

• The transformation helps: the relationship between log(sqft_living) and log(price) appears much closer to linear.

• The plot shows a positive linear relationship between the two: as sqft_living increases, the price of a house tends to increase.

• Following the simple linear regression equation, we estimate the following model from the sample:

\[\log(\text{price}) = \alpha + \beta \log(\text{sqft\_living}) + \epsilon\]
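Exponentiating both sides shows what this log-log model implies on the original scale. The short derivation below is a standard identity, not a result specific to this dataset.

```latex
\[\text{price} = e^{\alpha} \cdot \text{sqft\_living}^{\beta} \cdot e^{\epsilon}\]

\[\frac{d \log(\text{price})}{d \log(\text{sqft\_living})} = \beta\]
```

In other words, β acts as an elasticity: a 1% increase in sqft_living is associated with approximately a β% change in price.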

Descriptive Statistics Cont.

• The log() transformation also reduces the right skew in both variables, as shown in the right-hand panels of the figure below.

par(mfrow=c(2,2))
house1$sqft_living %>% hist(main = "sqft_living")
log(house1$sqft_living) %>% hist(main = "log(sqft_living)")
house1$price %>% hist(main = "price")
log(house1$price) %>% hist(main = "log(price)")
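The effect of the log transformation on skew can also be expressed as a number. The sketch below computes a moment-based sample skewness before and after taking logs. sample_skewness() is a hypothetical helper, and the data are simulated log-normal values standing in for price, not the real prices.

```r
# Moment-based sample skewness (hypothetical helper, base R only)
sample_skewness <- function(x) {
  centred <- x - mean(x)
  mean(centred^3) / mean(centred^2)^1.5
}

# Simulated right-skewed values standing in for price (assumption)
set.seed(42)
sim_price <- rlnorm(5000, meanlog = 13, sdlog = 0.5)

skew_raw <- sample_skewness(sim_price)
skew_log <- sample_skewness(log(sim_price))
print(c(raw = skew_raw, log = skew_log))
```

A value near zero indicates a roughly symmetric distribution; the same helper could be applied to house1$price and log(house1$price).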

Summarise variables

• Price Summary

table1 <- house1 %>%
  summarise(
    Min     = min(price, na.rm = TRUE),
    Q1      = quantile(price, probs = .25, na.rm = TRUE),
    Median  = median(price, na.rm = TRUE),
    Q3      = quantile(price, probs = .75, na.rm = TRUE),
    Max     = max(price, na.rm = TRUE),
    Mean    = mean(price, na.rm = TRUE),
    SD      = sd(price, na.rm = TRUE),
    n       = n(),
    Missing = sum(is.na(price))
  )
knitr::kable(table1)
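|   Min |     Q1 | Median |     Q3 |     Max |     Mean |       SD |     n | Missing |
|------:|-------:|-------:|-------:|--------:|---------:|---------:|------:|--------:|
| 75000 | 323000 | 450000 | 645000 | 3850000 | 536650.7 | 343063.6 | 21525 |       0 |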

• Living area (square feet) Summary

table2 <- house1 %>%
  summarise(
    Min     = min(sqft_living, na.rm = TRUE),
    Q1      = quantile(sqft_living, probs = .25, na.rm = TRUE),
    Median  = median(sqft_living, na.rm = TRUE),
    Q3      = quantile(sqft_living, probs = .75, na.rm = TRUE),
    Max     = max(sqft_living, na.rm = TRUE),
    Mean    = mean(sqft_living, na.rm = TRUE),
    SD      = sd(sqft_living, na.rm = TRUE),
    n       = n(),
    Missing = sum(is.na(sqft_living))
  )
knitr::kable(table2)
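| Min |   Q1 | Median |   Q3 |  Max |     Mean |       SD |     n | Missing |
|----:|-----:|-------:|-----:|-----:|---------:|---------:|------:|--------:|
| 610 | 1430 |   1910 | 2550 | 6980 | 2075.384 | 886.1564 | 21525 |       0 |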

Hypothesis Testing

Linear regression – Overall Model

• H0: The data do not fit the linear regression model (β = 0)

• HA: The data fit the linear regression model (β ≠ 0)

We test the overall model using the F-test, fitting the model with the linear regression function lm().

model <- lm(log(price) ~ log(sqft_living), data = house1)
model %>% summary()
## 
## Call:
## lm(formula = log(price) ~ log(sqft_living), data = house1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1045 -0.2917  0.0134  0.2568  1.3197 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      6.742470   0.047880   140.8   <2e-16 ***
## log(sqft_living) 0.834880   0.006331   131.9   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.387 on 21523 degrees of freedom
## Multiple R-squared:  0.4469, Adjusted R-squared:  0.4469 
## F-statistic: 1.739e+04 on 1 and 21523 DF,  p-value: < 2.2e-16
model %>% confint()
##                      2.5 %    97.5 %
## (Intercept)      6.6486228 6.8363177
## log(sqft_living) 0.8224716 0.8472884

• From the output above, the p-value of the F-test is Pr(F(1, n−2) > F) < 2.2e-16, which is very small. Hence p < .001.

• Therefore, we reject H0.

• The model is statistically significant: there is evidence that the data fit the linear regression model.
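For a model with a single predictor, the F statistic can be reproduced from R2 and the residual degrees of freedom. The sketch below does this arithmetic with the rounded values printed in the summary() output, so it should match the reported F only approximately.

```r
# Values copied from the summary() output above
r_squared <- 0.4469
df_resid  <- 21523

# For one predictor: F = (R^2 / 1) / ((1 - R^2) / df_resid)
f_stat  <- r_squared / ((1 - r_squared) / df_resid)
p_value <- pf(f_stat, df1 = 1, df2 = df_resid, lower.tail = FALSE)

print(f_stat)    # close to the reported 1.739e+04
print(p_value)
```

In simple linear regression the F statistic also equals the square of the slope's t-value, which is why the overall test and the slope test lead to the same decision.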

Hypothesis Testing Cont.

Linear regression – Model Parameters

INTERCEPT

\[H_0 : \alpha = 0\]

\[H_A : \alpha \neq 0\]

• From the summary table above, the t-value of the intercept is 140.8 and its p-value is < 2e-16, so p < .001. We therefore reject H0; the intercept is statistically significantly different from zero (HA: α ≠ 0).

• The intercept is the average value of y when x = 0. Here that is the average log(price) when log(sqft_living) = 0.

SLOPE

\[H_0 : \beta = 0\]

\[H_A : \beta \neq 0\]

• From the summary table above, the t-value of the slope is 131.9 and its p-value is < 2e-16, so p < .001. We therefore reject H0; the slope is statistically significantly different from zero (HA: β ≠ 0).

• The slope represents the average change in y for a one-unit change in x.

• As log(sqft_living) increases by 1 unit, log(price) increases by 0.835 units on average. On the original scale, a 1% increase in sqft_living is associated with roughly a 0.835% increase in price.

• The line of best fit: log(price) = 6.742 + 0.835 × log(sqft_living)
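Because the model is fitted on the log scale, a fitted value has to be exponentiated to be read in dollars. The sketch below uses the coefficients printed in the summary() output. predict_price() is a hypothetical helper and 2000 sqft is an arbitrary example size. Exponentiating a fitted log value gives a typical (median-like) price rather than a mean, so the result should be treated as indicative only.

```r
# Coefficients copied from the summary() output above
a <- 6.742470
b <- 0.834880

# Hypothetical helper: back-transform a fitted log(price) to dollars
predict_price <- function(sqft) exp(a + b * log(sqft))

price_2000 <- predict_price(2000)   # 2000 sqft is an arbitrary example size
ratio      <- predict_price(4000) / predict_price(2000)  # effect of doubling

print(price_2000)
print(ratio)   # equals 2^b
```

The ratio shows that, under this model, doubling the living area multiplies the fitted price by 2^b, whatever the starting size.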

Hypothesis testing Cont. - Check assumptions

Before we report the final regression model, we must validate all the following assumptions for linear regression.

• Independence

• Linearity

• Normality of residuals

• Homoscedasticity

Independence

• We assume that no two observations in our dataset come from the same house in the King County region of the United States.

Linearity

model %>% plot(which = 1)

• In the graph above, the trend line of the residuals against the fitted values is almost flat, which is a good indication that the relationship we are modelling is linear.

Normality of residuals

model %>% plot(which = 2)

• The plot above suggests there are no major deviations from normality. It would be safe to assume the residuals are approximately normally distributed.

Homoscedasticity

model %>% plot(which = 3)

• The red line is close to flat and the spread of the square root of the standardised residuals is consistent across the fitted values. It is safe to assume homogeneity of variance for the regression model.

Influential Cases

model %>% plot(which = 5)

• In the graph above, no value falls in the upper or lower right-hand corners of the plot beyond the red bands. In fact, the bands are not even visible. We conclude that there are no influential outliers.
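The bands in this plot are contours of Cook's distance, drawn by default at 0.5 and 1, so the same check can be made numerically with cooks.distance(). The sketch below runs on simulated data with assumed coefficients, not on the house data, and only illustrates the mechanics.

```r
# Simulated well-behaved data (assumption; not the house data)
set.seed(7)
x <- runif(500, min = 6, max = 9)
y <- 6.7 + 0.83 * x + rnorm(500, sd = 0.4)
sim_fit <- lm(y ~ x)

# Cook's distance per observation; plot(which = 5) draws its contours
# at 0.5 and 1 by default
cooks     <- cooks.distance(sim_fit)
n_flagged <- sum(cooks > 0.5)
print(max(cooks))
print(n_flagged)
```

Applying cooks.distance(model) to the fitted house model and counting values above 0.5 would confirm the visual impression.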

Linear Regression – R2

• R2 is the proportion of variability in y that can be explained by a linear relationship with x.

• In our regression model R2 = 0.4469: log(sqft_living) explains 44.69% of the variability in log(price).

• The positive slope indicates a positive association between sqft_living and price.

• Even though the model explains less than half of the variability, it is still better than no model at all.
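The definition of R2 as a proportion of variability can be checked by hand from the residual and total sums of squares. The sketch below does so on simulated data with assumed coefficients and compares the result with the value reported by summary().

```r
# Simulated data with assumed coefficients (not the house data)
set.seed(3)
x <- runif(200, min = 6, max = 9)
y <- 6.7 + 0.83 * x + rnorm(200, sd = 0.4)
sim_fit <- lm(y ~ x)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_resid   <- sum(residuals(sim_fit)^2)
ss_total   <- sum((y - mean(y))^2)
r2_manual  <- 1 - ss_resid / ss_total
r2_summary <- summary(sim_fit)$r.squared

print(c(manual = r2_manual, summary = r2_summary))
```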

Correlation Coefficient : r

• Measures the strength and direction of a linear relationship between two variables. Ranges from -1 (perfect negative correlation) to +1 (perfect positive correlation) and 0 means no correlation.

• The correlation coefficient between price and sqft_living, computed on the original (unfiltered, untransformed) data, is 0.702, which indicates a strong positive correlation.

r <- cor(house$price, house$sqft_living)
r
## [1] 0.7020351
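The code above computes r on the original house data, whereas the regression was fitted to the filtered, log-transformed data. In simple linear regression r squared equals R2 when both are computed on the same variables, so the correlation on the log-log scale can be recovered from the reported R2. The sketch below shows that arithmetic and then checks the identity on simulated data with assumed coefficients.

```r
# Correlation implied by the reported R^2 of the log-log model
r_squared <- 0.4469            # from the summary() output
r_loglog  <- sqrt(r_squared)   # positive root because the slope is positive
print(round(r_loglog, 4))

# Check of the identity r^2 = R^2 on simulated data (assumed coefficients)
set.seed(11)
lx <- runif(300, min = 6, max = 9)
ly <- 6.7 + 0.83 * lx + rnorm(300, sd = 0.4)
r_sim  <- cor(lx, ly)
r2_sim <- summary(lm(ly ~ lx))$r.squared
print(c(r_squared = r_sim^2, R2 = r2_sim))
```

Running cor(log(house1$price), log(house1$sqft_living)) would give the corresponding value directly from the data.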

Linear Regression - Interpretation

Summary:

• Independence was assumed; linearity, normality of residuals and homoscedasticity were checked and found acceptable; there were no influential cases.

• r = 0.702 (original data) and R2 = 0.4469 (log-log model).

• Model ANOVA, F(1, 21523) = 1.739e+04, p < .001

• α = 6.742, p < .001, 95% CI (6.649, 6.836)

• β = 0.835, p < .001, 95% CI (0.822, 0.847)
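The confidence intervals reported by confint() follow from each estimate, its standard error and a t critical value. The sketch below reproduces them from the rounded numbers in the summary() output, so tiny differences in the last decimals are expected.

```r
# Estimates and standard errors copied from the summary() output
est <- c(intercept = 6.742470, slope = 0.834880)
se  <- c(intercept = 0.047880, slope = 0.006331)
df_resid <- 21523

t_crit <- qt(0.975, df = df_resid)   # two-sided 95% critical value
ci <- cbind(lower = est - t_crit * se,
            upper = est + t_crit * se)
print(round(ci, 3))
```

Neither interval contains zero, which agrees with rejecting H0 for both the intercept and the slope.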

Decision:

• Overall model: Reject H0.

• Intercept: Reject H0.

• Slope: Reject H0.

• log(price) = 6.742 + 0.835 × log(sqft_living)

Conclusion:

• There was a statistically significant positive linear relationship between the log of house sale price and the log of living area (in square feet) in King County, USA.

Discussion

• There is a statistically significant positive linear relationship between house sale price and living area (both on the log scale) in King County, USA. Living area was estimated to explain about 45% of the variability in house sale price.

• Strengths

  1. In-depth, reasoned explanation and model analysis.

  2. Appropriate choice of the two variables.

  3. Extreme outliers were filtered out so that they do not distort the regression model.

• Limitations

  1. This analysis is limited to one county in the USA.

  2. The data are open data published by an individual and may contain data-entry errors, which would slightly reduce accuracy and credibility.

• Directions for future investigations

  1. Are there differences between suburbs? Can we build a regression model for each suburb in King County?

  2. Do attributes other than the size of the living area affect the sale price of a house?

References

• House Sales in King County, USA (2016). Kaggle. Retrieved Oct. 9, 2017, from https://www.kaggle.com/harlfoxem/housesalesprediction/data