VENKATESH RAMSHETTY VENKATARAMANA (s3779142), JEEVAN HEMMANNU THARANATHA (s3755598)
Last updated: 02 June, 2019
In this investigation we try to understand the relationship between two continuous variables so that we can make precise estimates of one variable given the other.
The following procedures are carried out to achieve this goal:
• Inspect simple scatter plots to visualise the bivariate data.
• Fit a simple linear regression and perform hypothesis tests of the various model components.
• Test the assumptions behind linear regression analysis and detect when they are violated.
• Interpret the outcome of simple linear regression analysis.
Linear Regression methods assume that a predictor variable, x, provides information about some dependent variable, y. We write a simple linear regression equation as:
\[y = \alpha + \beta x+ \epsilon\]
where y is the dependent variable, α is the constant/intercept, β is the slope, x is the predictor and ϵ is the random error/residuals.
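To make the model form concrete, below is a minimal sketch (with hypothetical values for α, β and the error standard deviation) that simulates data from this equation and shows that lm() recovers the parameters.
# Minimal sketch with hypothetical parameter values: simulate data from
# y = alpha + beta * x + epsilon and check that lm() recovers alpha and beta.
set.seed(123)
alpha <- 2                                      # assumed intercept
beta  <- 0.5                                    # assumed slope
x     <- runif(200, min = 0, max = 10)          # predictor
y     <- alpha + beta * x + rnorm(200, sd = 1)  # epsilon ~ N(0, 1)

sim_model <- lm(y ~ x)
coef(sim_model)  # estimates should be close to 2 and 0.5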
Below is the problem statement for our dataset:
• Can the size of the living area (in square feet) of a house in King County, USA be used to predict its sale price (in dollars)?
We follow the step-by-step procedure below to address this problem statement:
• First, we check whether the relationship between the x and y variables (sqft_living and price in our case) is linear. If it is not, we try a transformation, applying a mathematical function such as log or square root to our variables.
• Then, we fit a linear regression and perform a hypothesis test for the overall model.
• If the model is statistically significant, we test the model parameters, check all assumptions and conclude our hypotheses.
The dataset is open data from Kaggle.com. It contains house sale prices for King County, USA, covering homes sold between May 2014 and May 2015. Below is some information about this dataset.
• Dataset link: https://www.kaggle.com/harlfoxem/housesalesprediction
• Number of Observations: 21613
• Number of attributes: 21
• Number of Missing values: 0
• Number of Duplicate entries: 0
• The attributes from our dataset used in this statistical analysis are price and sqft_living.
• price is the prediction target (in dollars) and sqft_living is the square footage of the home's living area (in square feet).
• The box plots below depict the outliers in price and sqft_living; observations with sqft_living > 7000 or price > 4000000 can be removed.
library(readr)   # for read_csv()
library(dplyr)   # for %>%, filter() and summarise() used below

house <- read_csv("C:/Users/jeeva/Desktop/kc_house_data.csv")
par(mfrow = c(1, 2))
boxplot(house$price, ylab = "Price in $")
boxplot(house$sqft_living, ylab = "Size of living area in sqft")
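Before moving on, the dataset summary quoted earlier (observations, attributes, missing values, duplicates) can be double-checked directly on the loaded data; a minimal sketch using base R:
# Hedged sketch: verifying the dataset summary quoted above.
nrow(house)             # 21613 observations
ncol(house)             # 21 attributes
sum(is.na(house))       # 0 missing values overall
sum(duplicated(house))  # 0 duplicate rows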
• Plot a scatter plot of sqft_living (x) against price (y).
plot(price ~ sqft_living, data = house)
• Filter out extreme values of price and sqft_living: observations with sqft_living > 7000, sqft_living < 600 or price > 4000000 are treated as outliers, then plot the scatter plot again.
house1 <- house %>% filter(sqft_living < 7000 & sqft_living > 600 & price < 4000000)
plot(price ~ sqft_living, data = house1)
• The linear trend is now more concentrated and clearer.
• However, the relationship between sqft_living and price still does not appear to be perfectly linear.
• We try a transformation, applying the log function to both variables, and check whether it improves the linearity.
plot(log(price) ~ log(sqft_living), data = house1)
• We can see that the transformation has worked: the relationship between log(sqft_living) and log(price) now appears linear.
• This clearly shows a positive linear relationship between sqft_living and price: as sqft_living increases, the house price increases.
• Relating this to the simple linear regression equation, we estimate the following model from the sample:
\[\log(\text{price}) = \alpha + \beta \log(\text{sqft\_living}) + \epsilon\]
• We have also corrected the right skew of both variables using the log() transformation, as shown in the right-hand column of the figure below.
par(mfrow=c(2,2))
house1$sqft_living %>% hist(main = "sqft_living")
log(house1$sqft_living) %>% hist(main = "log(sqft_living)")
house1$price %>% hist(main = "price")
log(house1$price) %>% hist(main = "log(price)")
• Price Summary
house1 %>% summarise(Min = min(price,na.rm = TRUE),
Q1 = quantile(price,probs = .25,na.rm = TRUE),
Median = median(price, na.rm = TRUE),
Q3 = quantile(price,probs = .75,na.rm = TRUE),
Max = max(price,na.rm = TRUE),
Mean = mean(price, na.rm = TRUE),
SD = sd(price, na.rm = TRUE),
n = n(),
Missing = sum(is.na(price))) -> table1
knitr::kable(table1)

| Min | Q1 | Median | Q3 | Max | Mean | SD | n | Missing |
|---|---|---|---|---|---|---|---|---|
| 75000 | 323000 | 450000 | 645000 | 3850000 | 536650.7 | 343063.6 | 21525 | 0 |
• Square foot of living area Summary
house1%>% summarise(Min = min(sqft_living,na.rm = TRUE),
Q1 = quantile(sqft_living,probs = .25,na.rm = TRUE),
Median = median(sqft_living, na.rm = TRUE),
Q3 = quantile(sqft_living,probs = .75,na.rm = TRUE),
Max = max(sqft_living,na.rm = TRUE),
Mean = mean(sqft_living, na.rm = TRUE),
SD = sd(sqft_living, na.rm = TRUE),
n = n(),
Missing = sum(is.na(sqft_living))) -> table1
knitr::kable(table1)

| Min | Q1 | Median | Q3 | Max | Mean | SD | n | Missing |
|---|---|---|---|---|---|---|---|---|
| 610 | 1430 | 1910 | 2550 | 6980 | 2075.384 | 886.1564 | 21525 | 0 |
Linear regression – Overall Model
• H0: The data do not fit the linear regression model
• HA: The data fit the linear regression model
We test the overall model using the F-test with the help of linear regression model function lm().
model <- lm(log(price) ~ log(sqft_living), data = house1)
model %>% summary()

##
## Call:
## lm(formula = log(price) ~ log(sqft_living), data = house1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.1045 -0.2917 0.0134 0.2568 1.3197
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.742470 0.047880 140.8 <2e-16 ***
## log(sqft_living) 0.834880 0.006331 131.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.387 on 21523 degrees of freedom
## Multiple R-squared: 0.4469, Adjusted R-squared: 0.4469
## F-statistic: 1.739e+04 on 1 and 21523 DF, p-value: < 2.2e-16
model %>% confint()

## 2.5 % 97.5 %
## (Intercept) 6.6486228 6.8363177
## log(sqft_living) 0.8224716 0.8472884
• From the F-test above, we can see that Pr(F(1, n−2) > F) < 2.2e-16, which is very small; hence p < .001.
• Therefore, we reject H0.
• The model is statistically significant: the data fit the linear regression model.
Linear regression – Model Parameters
INTERCEPT
\[H_0 : \alpha = 0\]
\[H_A : \alpha \neq 0\]
• Referring to the summary table above, the t-value of the intercept is 140.8 and its p-value is < 2e-16, so p < .001. We reject H0; the test is statistically significant (HA: α ≠ 0).
• The intercept is the average value of y when x = 0.
SLOPE
\[H_0 : \beta = 0\]
\[H_A : \beta \neq 0\]
• Referring to the summary table above, the t-value of the slope is 131.9 and its p-value is < 2e-16, so p < .001. We reject H0; the test is statistically significant (HA: β ≠ 0).
• The slope represents the average change in y for a one-unit change in x.
• As log(sqft_living) increases by 1 unit, log(price) increases by 0.834 units; equivalently, a 1% increase in sqft_living is associated with roughly a 0.83% increase in price.
• The best fitted line: log(price) = 6.742 + 0.834 × log(sqft_living)
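As a quick worked example (using a hypothetical living area of 2,000 sq ft, not a value from the report), the fitted line can be used to estimate a sale price on the original dollar scale:
# Hedged sketch: predicted sale price for a hypothetical 2,000 sq ft house.
# predict() returns a value on the log(price) scale, so exp() back-transforms
# it to dollars (a rough point estimate that ignores retransformation bias).
new_house <- data.frame(sqft_living = 2000)
exp(predict(model, newdata = new_house))
# Roughly equivalent, using the rounded coefficients reported above:
# exp(6.742 + 0.834 * log(2000))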
Before we report the final regression model, we must validate all the following assumptions for linear regression.
• Independence
• Linearity
• Normality of residuals
• Homoscedasticity
Independence
• In our dataset we assume that no two observations are taken from the same house in the King County region of the United States.
Linearity
model %>% plot(which = 1)

• From the graph above we can see that the trend between the fitted values and the residuals is almost flat, which is a good indication that the relationship we are modelling is linear.
Normality of residuals
model %>% plot(which = 2)

• The Q-Q plot above suggests there are no major deviations from normality, so it is safe to assume the residuals are approximately normally distributed.
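As a supplementary, informal check (a minimal sketch, not part of the original output), the residuals can also be plotted as a histogram:
# Hedged sketch: histogram of the model residuals as an additional normality check.
hist(resid(model), breaks = 50,
     main = "Histogram of residuals",
     xlab = "Residuals of log(price) ~ log(sqft_living)")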
Homoscedasticity
model %>% plot(which = 3)

• The red line is close to flat and the variance of the square root of the standardised residuals is consistent across the fitted values, so it is safe to assume homogeneity of variance (homoscedasticity).
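A formal alternative to the visual check (a sketch assuming the lmtest package is available; not part of the original analysis) is the studentised Breusch–Pagan test:
# Hedged sketch: Breusch-Pagan test for heteroscedasticity (requires lmtest).
library(lmtest)
bptest(model)  # a very small p-value would point to heteroscedasticity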
Influential Cases
model %>% plot(which = 5)

• In the graph above, no value falls near the upper or lower right-hand corners of the plot beyond the red Cook's distance bands; in fact, the bands are not even visible. We can therefore say there are no influential outliers.
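For a numerical complement to the plot (a minimal sketch; the cutoffs used are common rules of thumb, not hard rules):
# Hedged sketch: count observations with large Cook's distance.
cooks_d <- cooks.distance(model)
sum(cooks_d > 1)                  # cases usually regarded as clearly influential
sum(cooks_d > 4 / nrow(house1))   # cases above the more conservative 4/n cutoff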
• R2 is the proportion of variability in y that can be explained by a linear relationship with x.
• In our regression model R2 = 0.4469: square footage of living area (sqft_living) explains 44.69% of the variability in house sale prices.
• A positive linear relationship between sqft_living and price is clearly supported.
• Even though the model fit is only moderate, it is still better than no model at all.
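To make the definition of R2 concrete, it can be recomputed from the model's residuals (a minimal sketch):
# Hedged sketch: recompute R^2 as 1 - SS_res / SS_tot from the fitted model,
# which should match the Multiple R-squared reported by summary(model).
ss_res <- sum(resid(model)^2)
ss_tot <- sum((log(house1$price) - mean(log(house1$price)))^2)
1 - ss_res / ss_tot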
Correlation Coefficient : r
• Measures the strength and direction of a linear relationship between two variables. Ranges from -1 (perfect negative correlation) to +1 (perfect positive correlation) and 0 means no correlation.
• We can see that the correlation coefficient between our variables is 0.702, which indicates a strong positive relationship.
r <- cor(house$price, house$sqft_living)
r

## [1] 0.7020351
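A formal test of this correlation (a minimal sketch using the same raw variables as the chunk above) can be obtained with cor.test():
# Hedged sketch: significance test and 95% confidence interval for the
# correlation between the raw price and sqft_living values.
cor.test(house$price, house$sqft_living)
# Note: for the fitted log-log model on the filtered data, the corresponding
# correlation is cor(log(house1$price), log(house1$sqft_living)) = sqrt(R^2).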
Summary:
• Independence was assumed; linearity, normality of residuals and homoscedasticity were satisfied; no influential cases were found.
• r = 0.702 and R2 = 0.4469.
• Model ANOVA, F(1, 21523) = 1.739e+04, p< .001
• α = 6.742, p < .001, 95% CI (6.649, 6.836)
• β = 0.834, p < .001, 95% CI (0.822, 0.847)
Decision:
• Overall model: Reject H0.
• Intercept: Reject H0.
• Slope: Reject H0.
• log(price) = 6.742 + 0.834 × log(sqft_living)
Conclusion:
• There is a statistically significant positive linear relationship between a house's sale price and its square footage of living area in King County, USA. The square footage of living area was estimated to explain about 45% of the variability in house sale prices.
• Strengths
In-depth and rational explanation and model analysis.
A well-motivated choice of the two variables.
Extreme outliers have been filtered out so that these values do not affect our regression model.
• Limitations
This analysis is limited to a single county in the USA.
The data are open data published by an individual, so they may contain data entry errors that slightly reduce their accuracy and credibility.
• Proposed directions for future investigation:
Is there any difference among suburbs? Can we create regression models for each suburb in King County?
Is there any attribute other than the size of the living area that affects the sale price of a house?
• House Sales in King County, USA (2016). Kaggle. Retrieved Oct. 9, 2017, from https://www.kaggle.com/harlfoxem/housesalesprediction/data