For this homework, we will investigate the Nashville_housing.csv dataset located on D2L. The dataset contains information on houses sold in the Nashville area between 2013 and 2016. Be sure to download the file to your computer and import using “Import Dataset”, then “From Text File…”
We will be considering the same model you found in the last homework.
SP = housing$Sale.Price
FA = housing$Finished.Area
mod1<-lm(SP~FA,data=housing)
1) Obtain a residual plot. Be sure to comment on the plot including your conclusions.
#Enter Code Here
plot(resid(mod1))
plot(rstandard(mod1))
We use this plot to look for problems such as heteroscedasticity, a non-linear association, and outliers. Quite a few of the residuals lie far from zero, which means that from this point forward we need to investigate the model further.
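A residual plot is usually drawn against the fitted values rather than the observation index, since that makes a funnel shape (heteroscedasticity) or curvature easier to spot. The sketch below uses simulated data, not the Nashville dataset; the names `sim_area`, `sim_price`, and `sim_fit` are made up for illustration.

```r
# Simulated (hypothetical) data whose error spread grows with the predictor
set.seed(1)
sim_area  <- runif(200, 500, 5000)
sim_price <- 50000 + 120 * sim_area + rnorm(200, sd = 20 * sim_area)
sim_fit   <- lm(sim_price ~ sim_area)

# Residuals against fitted values, with a dashed reference line at zero
plot(fitted(sim_fit), resid(sim_fit),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs. fitted")
abline(h = 0, lty = 2)
```

With the homework model, the same two plotting lines would be applied to `mod1` in place of `sim_fit`.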
2) Obtain a quantile-quantile plot. Be sure to comment on the plot including your conclusions.
#Enter Code Here
qqnorm(resid(mod1));qqline(resid(mod1))
The QQ plot helps us determine whether the residuals are normally distributed. For the most part the points follow the reference line drawn by qqline. However, the points deviate from the line at both ends, which indicates that the distribution might have fat tails. Looking at other plots would help to confirm this.
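To see what fat tails look like on a QQ plot, it can help to compare a sample that is normal by construction with one that is heavy-tailed by construction. This is only an illustrative sketch on simulated data; the t distribution with 3 degrees of freedom is an arbitrary choice of heavy-tailed distribution.

```r
# Two simulated samples: one normal, one heavy-tailed (t with 3 df)
set.seed(2)
normal_sample <- rnorm(1000)
heavy_sample  <- rt(1000, df = 3)

# Side-by-side QQ plots against the normal distribution
op <- par(mfrow = c(1, 2))
qqnorm(normal_sample, main = "Normal sample")
qqline(normal_sample)
qqnorm(heavy_sample, main = "Heavy-tailed sample")
qqline(heavy_sample)
par(op)  # restore the previous plotting layout
```

In the heavy-tailed panel the points typically bend away from the line at both ends, below it on the left and above it on the right.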
3) Use the Box-Cox procedure to find an appropriate power transformation of \(Y\). What transformation of \(Y\) is suggested?
#Enter Code Here
library(MASS)      # provides boxcox()
library(forecast)  # provides BoxCox.lambda()
boxcox(mod1)
lambda = BoxCox.lambda(SP)
lambda
## [1] 0.02356233
# Lambda is approx. 0.02, which is close to 0, indicating that we want to log transform the variable y, i.e. Y' = log(Y).
#Lambda = -0.03987188
#lambda=c(-2,-1,-.5,0,.5,1,2)
#SSE=vector()
#for(l in lambda){
# ytemp=Nashville_housing$Sale.Price^l
# modtemp=lm(ytemp~Nashville_housing$Finished.Area)
# SSEtemp=sum((Nashville_housing$Sale.Price-modtemp$fitted.values^(1/l))^2)
# SSE=c(SSE,SSEtemp)
#}
#rbind(lambda,SSE)
We can see that lambda is approximately 0 (about 0.02), which means that a log transformation is appropriate.
4) Use the transformation found in part (3) and obtain the estimated linear regression function for the transformed data.
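The lambda with the highest profile log-likelihood can also be read off numerically from the object that `MASS::boxcox()` returns, instead of from the plot. The sketch below runs on simulated data that is log-linear by construction, so the estimate should land near 0; the data frame `sim` and its columns are hypothetical, and the grid of lambda values is an arbitrary choice.

```r
library(MASS)

# Simulated (hypothetical) data: log(price) is linear in area by construction
set.seed(3)
sim <- data.frame(area = runif(300, 500, 5000))
sim$price <- exp(11 + 0.0005 * sim$area + rnorm(300, sd = 0.4))

# y = TRUE and qr = TRUE store the pieces boxcox() needs inside the fit
sim_fit <- lm(price ~ area, data = sim, y = TRUE, qr = TRUE)

# Profile log-likelihood over a grid of lambda values, without plotting
bc <- boxcox(sim_fit, lambda = seq(-2, 2, by = 0.01), plotit = FALSE)

# Grid value of lambda that maximizes the log-likelihood
lambda_hat <- bc$x[which.max(bc$y)]
lambda_hat
```

A value of `lambda_hat` near 0 points to the log transformation, near 0.5 to a square root, and near 1 to no transformation.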
#Enter Code Here
SPTrans = log(SP)
mod1.1 = lm(SPTrans~FA)
summary(mod1.1)
##
## Call:
## lm(formula = SPTrans ~ FA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.4562 -0.3136 0.0179 0.4149 3.5722
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.124e+01 8.967e-03 1253.1 <2e-16 ***
## FA 5.098e-04 4.451e-06 114.5 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5879 on 23539 degrees of freedom
## Multiple R-squared: 0.3578, Adjusted R-squared: 0.3578
## F-statistic: 1.312e+04 on 1 and 23539 DF, p-value: < 2.2e-16
The fitted line is Y’ = 11.24 + 0.00051X, where Y’ = log(Y), and you can see that the slope is still significant at the 0.01 level.
5) Express the estimated regression function in the original units.
#Enter Code Here
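Back-transforming a log model amounts to exponentiating the fitted linear function. The sketch below uses the intercept and slope reported in the summary of the log model; the helper `predict_price()` and the finished area of 2000 are hypothetical choices for illustration.

```r
# Coefficients of the log model, taken from the summary output
b0 <- 11.24       # intercept
b1 <- 5.098e-04   # slope for Finished.Area

# Hypothetical helper: predicted sale price on the original scale
predict_price <- function(finished_area) {
  exp(b0 + b1 * finished_area)
}

# Multiplicative change in predicted price per one extra unit of area
exp(b1)

# Ratio of predictions one unit of area apart equals exp(b1)
predict_price(2001) / predict_price(2000)
```

Because the slope is small, `exp(b1)` is very close to `1 + b1`, which is why the slope of a log model is often read as an approximate proportional change.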
Since Y’ = log(Y), the fitted model Y’ = 11.24 + 0.00051X becomes Y = exp(11.24 + 0.00051X) in the original units, where X is Finished Area. A one-unit increase in Finished Area multiplies the predicted response by e^B1 = e^0.00051, an increase of about 0.05%.
6) Repeat parts 1 and 2 for the transformed model.
#Enter Code Here
plot(resid(mod1.1))
qqnorm(resid(mod1.1));qqline(resid(mod1.1))
Part 1: Even after the Y transformation, some residuals still lie far from zero. This indicates that we need further analysis of the data and possibly other transformations.
Part 2: After the transformation, the QQ plot looks similar to the one before the transformation. The points deviate from the line at both ends, so the residuals appear to have fat tails, and further investigation is needed into whether they are normally distributed.
7) Now fit a linear regression for \(Sale.Price\) using \(Finished.Area\), \(Bedrooms\) and \(Grade\) as explanatory variables. State the model and interpret each of the coefficients in the context of the problem.
#Enter Code Here
Mod2 = lm(SP ~ FA + housing$Bedrooms + housing$Grade)
summary(Mod2)
##
## Call:
## lm(formula = SP ~ FA + housing$Bedrooms + housing$Grade)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1221694 -72388 -20087 57198 9932057
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.150e+05 1.100e+04 28.636 < 2e-16 ***
## FA 1.251e+02 2.132e+00 58.667 < 2e-16 ***
## housing$Bedrooms -1.129e+04 1.748e+03 -6.458 1.08e-10 ***
## housing$GradeB -1.758e+05 8.333e+03 -21.098 < 2e-16 ***
## housing$GradeC -2.805e+05 8.909e+03 -31.491 < 2e-16 ***
## housing$GradeD -2.891e+05 1.012e+04 -28.568 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 174300 on 23530 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.4164, Adjusted R-squared: 0.4162
## F-statistic: 3357 on 5 and 23530 DF, p-value: < 2.2e-16
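Y = 315,000 + 125.1(FA) - 11,290(Bedrooms) - 175,800(GradeB) - 280,500(GradeC) - 289,100(GradeD)
Holding the other variables fixed, each additional unit of Finished Area is associated with an increase of about 125.1 in predicted SP, and each additional bedroom with a decrease of about 11,290. Grade A is the baseline level, so each Grade coefficient is a difference from Grade A.
Grade A: Y = 315,000 + 125.1(FA) - 11,290(Bedrooms)
Grade B: Y = 315,000 + 125.1(FA) - 11,290(Bedrooms) - 175,800. If it is Grade B, then SP goes down by 175,800 in comparison to A.
Grade C: Y = 315,000 + 125.1(FA) - 11,290(Bedrooms) - 280,500. If it is Grade C, then SP goes down by 280,500 in comparison to A.
Grade D: Y = 315,000 + 125.1(FA) - 11,290(Bedrooms) - 289,100. If it is Grade D, then SP goes down by 289,100 in comparison to A.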
8) Repeat parts 1 to 6 for the new model.
#Enter Code Here
#1
plot(resid(Mod2))
#2
qqnorm(resid(Mod2));qqline(resid(Mod2))
#3
boxcox(Mod2)
lambda = BoxCox.lambda(SP)
lambda
## [1] 0.02356233
#4
SPTrans = log(SP)
Mod2.1 = lm(SPTrans~FA + housing$Bedrooms + housing$Grade)
summary(Mod2.1)
##
## Call:
## lm(formula = SPTrans ~ FA + housing$Bedrooms + housing$Grade)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.4733 -0.2896 0.0232 0.3797 3.8279
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.172e+01 3.620e-02 323.634 < 2e-16 ***
## FA 4.213e-04 7.015e-06 60.061 < 2e-16 ***
## housing$Bedrooms -2.660e-02 5.750e-03 -4.625 3.76e-06 ***
## housing$GradeB 4.097e-02 2.742e-02 1.494 0.135
## housing$GradeC -2.665e-01 2.931e-02 -9.091 < 2e-16 ***
## housing$GradeD -5.711e-01 3.330e-02 -17.147 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5734 on 23530 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.3886, Adjusted R-squared: 0.3884
## F-statistic: 2991 on 5 and 23530 DF, p-value: < 2.2e-16
# 5 State the original Equation
# Y = exp(11.72 + 0.00042(FA) - 0.027(Bedrooms) + 0.041(GradeB) - 0.267(GradeC) - 0.571(GradeD))
# 6
plot(resid(Mod2.1))
qqnorm(resid(Mod2.1));qqline(resid(Mod2.1))
From this plot it appears that there is a much better fit than in the last model. However, it can be observed that there are outliers.
Here we are looking to see whether the residuals are normally distributed. Usually the points would follow along the reference line. In this plot the points deviate from the line on both ends, indicating that the distribution might have fat tails and that further investigation into normality is needed.
We can see that lambda is about 0.02, meaning we can take the log to transform Y.
The fitted regression line is Sale.Price’ = 1.172e+01 + 4.213e-04(Finished.Area) - 2.660e-02(Bedrooms) + 4.097e-02(GradeB) - 2.665e-01(GradeC) - 5.711e-01(GradeD), where Sale.Price’ = log(Sale.Price).
Grade A: Y’ = 1.172e+01 + 4.213e-04(Finished.Area) - 2.660e-02(Bedrooms)
Grade B: Y’ = 1.172e+01 + 4.213e-04(Finished.Area) - 2.660e-02(Bedrooms) + 4.097e-02. If it is Grade B, then log(SP) goes up by 4.097e-02 in comparison to A, although this difference is not significant (p = 0.135).
Grade C: Y’ = 1.172e+01 + 4.213e-04(Finished.Area) - 2.660e-02(Bedrooms) - 2.665e-01. If it is Grade C, then log(SP) goes down by 2.665e-01 in comparison to A.
Grade D: Y’ = 1.172e+01 + 4.213e-04(Finished.Area) - 2.660e-02(Bedrooms) - 5.711e-01. If it is Grade D, then log(SP) goes down by 5.711e-01 in comparison to A.
#6 From this residual plot we can still observe that there are outliers. The model needs further investigation to determine potential further transformations.
We can see that, similar to before, the residuals seem to have fat tails at both ends, and further investigation is needed to assess normality.
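Because the transformed model is on the log scale, the Grade comparisons to Grade A above are easier to read as percentage differences in Sale Price after exponentiating. The sketch below takes the three Grade coefficients from the summary of the transformed model; the variable names are made up for illustration.

```r
# Grade coefficients from the log-scale model summary (baseline: Grade A)
grade_coefs <- c(GradeB = 4.097e-02, GradeC = -2.665e-01, GradeD = -5.711e-01)

# Ratio of predicted sale price to a Grade A house, other variables held fixed
price_ratio <- exp(grade_coefs)

# The same comparison expressed as a percentage difference from Grade A
pct_change <- 100 * (price_ratio - 1)
round(pct_change, 1)
```

The Grade B value should be read with caution, since its coefficient is not statistically distinguishable from zero in the summary.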
9) Start thinking about the project! On Monday you will need to have your group (4-5) and a topic. Search for datasets where you can do original linear regression analysis. The cleaner the data the better. You will want a variety of explanatory variables (both categorical and numerical).