For this homework, we will investigate the Nashville_housing.csv dataset located on D2L. The dataset contains information on houses sold in the Nashville area between 2013 and 2016. Be sure to download the file to your computer and import using “Import Dataset”, then “From Text File…”
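As an alternative to the "Import Dataset" menu, the file can also be read with `read.csv`. The sketch below is only an illustration: it writes a tiny made-up CSV to a temporary file so that it runs anywhere, and the object name `housing_demo` and the three rows of values are hypothetical. With the real file, the path would point to wherever Nashville_housing.csv was saved.

```r
# Hypothetical stand-in for Nashville_housing.csv (made-up values, 3 rows)
csv_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(Sale.Price    = c(150000, 220000, 310000),
             Finished.Area = c(1100, 1600, 2400)),
  csv_path, row.names = FALSE
)

# Read the file into a data frame, as "Import Dataset" would do
housing_demo <- read.csv(csv_path)

# Inspect the column names and types before modelling
str(housing_demo)
```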
1) Fit a linear regression for predicting \(logSP\) from \(Finished.Area\). Conduct a t-test to determine if \(Finished.Area\) is a significant predictor for \(logSP\), using level of significance \(.05\). Be sure to state the alternatives, decision rule, p-value, and conclusion.
#Enter Code Here
Model1 = lm(housing$logSP~housing$Finished.Area)
summary1 = summary(Model1)
summary1
##
## Call:
## lm(formula = housing$logSP ~ housing$Finished.Area)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.4565 -0.3140 0.0177 0.4146 3.5721
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.124e+01 8.960e-03 1254.0 <2e-16 ***
## housing$Finished.Area 5.101e-04 4.448e-06 114.7 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5873 on 23534 degrees of freedom
## Multiple R-squared: 0.3585, Adjusted R-squared: 0.3584
## F-statistic: 1.315e+04 on 1 and 23534 DF, p-value: < 2.2e-16
#t*
tstar=summary1$coefficients[2,1]/summary1$coefficients[2,2]
tstar
## [1] 114.6751
#Calculate the t cutoff
qt(1-.05/2,nrow(housing)-2)
## [1] 1.960065
#P-Value
2*(1-pt(tstar,nrow(housing)-2))
## [1] 0
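# Welch two-sample t-test: compares the means of the two variables (this is not the test of the slope)
t.test(housing$Finished.Area,housing$logSP)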
##
## Welch Two Sample t-test
##
## data: housing$Finished.Area and housing$logSP
## t = 322.46, df = 23535, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1797.958 1819.949
## sample estimates:
## mean of x mean of y
## 1821.11782 12.16426
H0: \(\beta_1 = 0\) H1: \(\beta_1 \neq 0\)
Fitted model: logSP = 11.24 + 0.0005101 Finished.Area
Decision Rule: If |t*| <= 1.96, conclude H0, else conclude H1.
Since t* = 114.68 > 1.96, we reject H0 and conclude that \(\beta_1\) is significantly different from 0 at the .05 level. The p-value (p < 2.2e-16) is far below .05 and leads to the same conclusion. There is a linear association between logSP and Finished.Area.
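The steps of the slope t-test can be sketched on simulated data. Everything in this block (the variables `x` and `y`, the sample size, the coefficients used to generate the data) is a made-up stand-in, not the Nashville data. Using `lower.tail = FALSE` in `pt` avoids the subtraction `1 - pt(...)`, which rounds to exactly 0 for very large statistics.

```r
# Simulated stand-in data (hypothetical; not the Nashville data)
set.seed(1)
n <- 200
x <- runif(n, 500, 4000)
y <- 11 + 0.0005 * x + rnorm(n, sd = 0.5)
fit <- lm(y ~ x)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- coef(summary(fit))

# Test statistic for H0: beta1 = 0
tstar <- coefs["x", "Estimate"] / coefs["x", "Std. Error"]

# Two-sided cutoff at level .05 with n - 2 degrees of freedom
tcrit <- qt(1 - 0.05 / 2, df.residual(fit))

# Two-sided p-value, computed from the upper tail
pval <- 2 * pt(abs(tstar), df.residual(fit), lower.tail = FALSE)

# Decision: reject H0 when |t*| exceeds the cutoff
reject <- abs(tstar) > tcrit
reject
```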
2) Using the same model described in problem 1, conduct an F-test to determine if \(Finished.Area\) is a significant predictor for \(logSP\), using level of significance \(.05\). Be sure to state the alternatives, decision rule, p-value, and conclusion.
#Enter Code Here
myAnova = anova(Model1)
myAnova
## Analysis of Variance Table
##
## Response: housing$logSP
## Df Sum Sq Mean Sq F value Pr(>F)
## housing$Finished.Area 1 4535.9 4535.9 13150 < 2.2e-16 ***
## Residuals 23534 8117.5 0.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Fstar=myAnova$`Mean Sq`[1]/myAnova$`Mean Sq`[2]
Fstar
## [1] 13150.39
#Calculate the F cutoff (the F-test is upper-tailed, so use 1-.05, not 1-.05/2)
qf(1-.05,1,nrow(housing)-2)
## [1] 3.841854
#P-Value
1-pf(Fstar,1,nrow(housing)-2)
## [1] 0
# var.test compares the variances of the two variables (this is not the regression F-test)
var.test(housing$Finished.Area, housing$logSP)
##
## F test to compare two variances
##
## data: housing$Finished.Area and housing$logSP
## F = 1377600, num df = 23535, denom df = 23535, p-value < 2.2e-16
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 1342866 1413276
## sample estimates:
## ratio of variances
## 1377621
H0: \(\beta_1 = 0\) H1: \(\beta_1 \neq 0\)
Decision Rule: If F* <= 3.84, conclude H0, else conclude H1.
Since F* = 13150.39 > 3.84, we reject H0 and conclude that \(\beta_1\) is significantly different from zero at the .05 level (p < 2.2e-16). There is a linear association between logSP and Finished.Area.
3) How do your test statistics from problem 1 and problem 2 relate? How about the p-values? Explain.
#Enter Code Here
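In simple linear regression the two tests are equivalent: both test H0: \(\beta_1 = 0\) against H1: \(\beta_1 \neq 0\). The F statistic is the square of the t statistic: t* = 114.6751 and (t*)^2 ≈ 13150.4, which matches F* = 13150.39 up to rounding. The cutoffs are related in the same way: the square of the t cutoff, 1.960065^2 ≈ 3.84, is the .95 quantile of the F distribution with 1 and 23534 degrees of freedom. Because the tests are equivalent, the p-values are identical (both < 2.2e-16).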
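The relation between the two tests can be checked numerically. The sketch below uses simulated data (the variables `x` and `y` and all numbers are hypothetical stand-ins, not the Nashville data) and compares the statistics, cutoffs, and p-values of the slope t-test and the ANOVA F-test in a one-predictor model.

```r
# Simulated stand-in data (hypothetical; not the Nashville data)
set.seed(2)
n <- 100
x <- rnorm(n)
y <- 1 + 0.3 * x + rnorm(n)
fit <- lm(y ~ x)

# Test statistics: t* from the coefficient table, F* from the ANOVA table
tstar <- coef(summary(fit))["x", "t value"]
Fstar <- anova(fit)[["F value"]][1]
all.equal(tstar^2, Fstar)

# Cutoffs: two-sided t cutoff squared equals the upper-tail F cutoff
tcrit <- qt(1 - 0.05 / 2, n - 2)
Fcrit <- qf(1 - 0.05, 1, n - 2)
all.equal(tcrit^2, Fcrit)

# P-values from the two tests
p_t <- coef(summary(fit))["x", "Pr(>|t|)"]
p_F <- anova(fit)[["Pr(>F)"]][1]
all.equal(p_t, p_F)
```

Each `all.equal` call is expected to return `TRUE`, since the identity holds for any simple linear regression, not only for this simulated sample.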
4) Fit a linear regression for predicting \(logSP\) from \(Finished.Area\), \(Bedrooms\), and \(Grade\). Construct the ANOVA table (similar to Table 6.1 on page 225). You can make a nice table in R Markdown by using kable in the knitr package.
#Enter Code Here
library(knitr)
Mlin =lm(housing$logSP~housing$Finished.Area + housing$Bedrooms + housing$Grade)
Anova2 = anova(Mlin)
kable(Anova2, digits = 4)
| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| housing$Finished.Area | 1 | 4535.9360 | 4535.9360 | 13795.1630 | 0 |
| housing$Bedrooms | 1 | 12.3392 | 12.3392 | 37.5272 | 0 |
| housing$Grade | 3 | 368.3843 | 122.7948 | 373.4563 | 0 |
| Residuals | 23530 | 7736.8114 | 0.3288 | NA | NA |
5) Using the same model described in problem 4, test whether there is a regression relation, using level of significance \(.05\). State the alternatives, decision rule, and conclusion. What does your test result imply about \(\beta _1,\beta_2,\beta_3,\beta_4\) and \(\beta _5\)? What is the \(P\)-value of the test?
#Enter Code Here
myANOVA1<-anova(Mlin)
MSE=myANOVA1$`Mean Sq`[4]
MSR=(myANOVA1$`Sum Sq`[1]+myANOVA1$`Sum Sq`[2]+myANOVA1$`Sum Sq`[3])/5
Fstar1=MSR/MSE
Fstar1
## [1] 2990.612
Fcrit=qf(1-.05,5,nrow(housing)-6)
Fcrit
## [1] 2.214479
pval=1-pf(Fstar1,5,nrow(housing)-6)
pval
## [1] 0
nullmod<-lm(logSP~1,data=housing)
df2 <- anova(nullmod,Mlin)
## Warning in anova.lmlist(object, ...): models with response '"housing
## $logSP"' removed because response differs from model 1
df2
## Analysis of Variance Table
##
## Response: logSP
## Df Sum Sq Mean Sq F value Pr(>F)
## Residuals 23535 12654 0.53764
kable(df2,method="markdown", row.names=TRUE, caption="ANOVA Analysis of Variance Table")
| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| Residuals | 23535 | 12653.47 | 0.5376448 | NA | NA |
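H0: \(\beta_1 = \beta_2 = \beta_3 = \beta_4 = \beta_5 = 0\) H1: not all \(\beta_k\) (k = 1, ..., 5) equal 0
Decision Rule: If F* <= 2.214479, conclude H0, else conclude H1.
F* = 2990.612 and Fcrit = 2.214479, with p-value < 2.2e-16.
Since F* > Fcrit, we reject H0 and conclude that there is a regression relation: at least one of \(\beta_1, \beta_2, \beta_3, \beta_4, \beta_5\) is not zero. The test does not tell us which ones.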
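The warning printed above arises because the two models name the response differently (`logSP` with `data=housing` versus `housing$logSP`), so `anova` drops the second model. A minimal sketch of the nested-model comparison, assuming both models are fit with the same formula style and `data=` argument, is given below. The data frame `demo` and its columns `area`, `beds`, and `grade` are simulated, hypothetical stand-ins, not the Nashville data.

```r
# Simulated stand-in data (hypothetical; not the Nashville data)
set.seed(3)
n <- 300
demo <- data.frame(
  area  = runif(n, 500, 4000),
  beds  = sample(1:5, n, replace = TRUE),
  grade = factor(sample(c("A", "B", "C", "D"), n, replace = TRUE))
)
demo$logSP <- 11.5 + 0.0004 * demo$area - 0.03 * demo$beds -
  0.2 * (demo$grade == "D") + rnorm(n, sd = 0.5)

# Both models use the same response name and the same data argument
full <- lm(logSP ~ area + beds + grade, data = demo)
null <- lm(logSP ~ 1, data = demo)

# Row 2 of the comparison holds the overall F-test (5 numerator df here)
cmp   <- anova(null, full)
Fstar <- cmp$F[2]
Fcrit <- qf(1 - 0.05, cmp$Df[2], cmp$Res.Df[2])
pval  <- pf(Fstar, cmp$Df[2], cmp$Res.Df[2], lower.tail = FALSE)

c(Fstar = Fstar, Fcrit = Fcrit, pval = pval)
```

The F statistic from this comparison is the same one reported on the last line of `summary(full)`.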
6) Using the same model described in problem 4, conduct individual t-tests for testing the significance of each coefficient. Be sure to state the alternatives, decision rule, and conclusion for each test.
#Enter Code Here
logSP=log(housing$Sale.Price)
mod1=lm(logSP~housing$Finished.Area+housing$Bedrooms+housing$Grade)
summary1=summary(mod1)
summary1
##
## Call:
## lm(formula = logSP ~ housing$Finished.Area + housing$Bedrooms +
## housing$Grade)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.4733 -0.2896 0.0232 0.3797 3.8279
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.172e+01 3.620e-02 323.634 < 2e-16 ***
## housing$Finished.Area 4.213e-04 7.015e-06 60.061 < 2e-16 ***
## housing$Bedrooms -2.660e-02 5.750e-03 -4.625 3.76e-06 ***
## housing$GradeB 4.097e-02 2.742e-02 1.494 0.135
## housing$GradeC -2.665e-01 2.931e-02 -9.091 < 2e-16 ***
## housing$GradeD -5.711e-01 3.330e-02 -17.147 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5734 on 23530 degrees of freedom
## Multiple R-squared: 0.3886, Adjusted R-squared: 0.3884
## F-statistic: 2991 on 5 and 23530 DF, p-value: < 2.2e-16
#t*
summary1$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.7155546786 3.620004e-02 323.633713 0.000000e+00
## housing$Finished.Area 0.0004213336 7.015069e-06 60.061225 0.000000e+00
## housing$Bedrooms -0.0265962846 5.750336e-03 -4.625171 3.762810e-06
## housing$GradeB 0.0409717846 2.741827e-02 1.494324 1.351043e-01
## housing$GradeC -0.2664956372 2.931421e-02 -9.091004 1.056810e-19
## housing$GradeD -0.5710756137 3.330379e-02 -17.147466 1.642834e-65
summary1$coefficients[2,1]
## [1] 0.0004213336
tstar_beta1=(summary1$coefficients[2,1])/summary1$coefficients[2,2]
tstar_beta1
## [1] 60.06122
p.value = summary1$coefficients[2,4]
p.value
## [1] 0
tstar_beta2=(summary1$coefficients[3,1])/summary1$coefficients[3,2]
tstar_beta2
## [1] -4.625171
p.value = summary1$coefficients[3,4]
p.value
## [1] 3.76281e-06
tstar_beta3=(summary1$coefficients[4,1])/summary1$coefficients[4,2]
tstar_beta3
## [1] 1.494324
p.value = summary1$coefficients[4,4]
p.value
## [1] 0.1351043
tstar_beta4=(summary1$coefficients[5,1])/summary1$coefficients[5,2]
tstar_beta4
## [1] -9.091004
p.value = summary1$coefficients[5,4]
p.value
## [1] 1.05681e-19
tstar_beta5=(summary1$coefficients[6,1])/summary1$coefficients[6,2]
tstar_beta5
## [1] -17.14747
p.value = summary1$coefficients[6,4]
p.value
## [1] 1.642834e-65
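For each coefficient \(\beta_k\), k = 1, ..., 5: H0: \(\beta_k = 0\) H1: \(\beta_k \neq 0\)
Decision Rule: If |t*| <= 1.96 (the .975 quantile of the t distribution with 23530 degrees of freedom), conclude H0, else conclude H1.
\(\beta_1\) (Finished.Area): t* = 60.06, p < 2e-16. Reject H0.
\(\beta_2\) (Bedrooms): t* = -4.63, p = 3.76e-06. Reject H0.
\(\beta_3\) (GradeB): t* = 1.49, p = 0.135. Fail to reject H0; Grade B is not significantly different from Grade A.
\(\beta_4\) (GradeC): t* = -9.09, p = 1.06e-19. Reject H0.
\(\beta_5\) (GradeD): t* = -17.15, p = 1.64e-65. Reject H0.
The fitted regression function is logSP = 1.172e+01 + 4.213e-04(Finished.Area) - 2.660e-02(Bedrooms) + 4.097e-02(GradeB) - 2.665e-01(GradeC) - 5.711e-01(GradeD), with Grade A as the baseline.
Grade A: Y = 1.172e+01 + 4.213e-04(Finished.Area) - 2.660e-02(Bedrooms)
Grade B: Y = 1.172e+01 + 4.213e-04(Finished.Area) - 2.660e-02(Bedrooms) + 4.097e-02(GradeB). If it is Grade B, predicted logSP goes up 4.097e-02 in comparison to A, but this difference is not significant.
Grade C: Y = 1.172e+01 + 4.213e-04(Finished.Area) - 2.660e-02(Bedrooms) - 2.665e-01(GradeC). If it is Grade C, predicted logSP goes down 2.665e-01 in comparison to A.
Grade D: Y = 1.172e+01 + 4.213e-04(Finished.Area) - 2.660e-02(Bedrooms) - 5.711e-01(GradeD). If it is Grade D, predicted logSP goes down 5.711e-01 in comparison to A.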
7) Explain why problem 5 and problem 6 are fundamentally different tests. Be sure to talk about the hypotheses.
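The F-test in problem 5 tests the model as a whole: H0: \(\beta_1 = \beta_2 = \beta_3 = \beta_4 = \beta_5 = 0\) against H1: at least one \(\beta_k \neq 0\). Each t-test in problem 6 tests a single coefficient: H0: \(\beta_k = 0\) against H1: \(\beta_k \neq 0\), given that the other predictors are in the model. Rejecting H0 in the F-test says only that at least one predictor is related to logSP, not which one. The t-tests identify individual coefficients, so the two can disagree: here the F-test rejects H0, yet the t-test for GradeB does not.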
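The difference between the two kinds of test shows most clearly when predictors are strongly correlated. The sketch below uses simulated data with two nearly identical predictors; the variables `x1`, `x2`, and `y` and all numbers are hypothetical and unrelated to the Nashville data. The overall F-test detects a regression relation, while the individual t-tests may fail to single out either predictor, because each one adds little once the other is already in the model.

```r
# Simulated stand-in data (hypothetical; not the Nashville data)
set.seed(4)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)   # nearly a copy of x1
y  <- 1 + x1 + x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

# Overall F-test: H0 is that both slopes are zero
fs <- summary(fit)$fstatistic
overall_p <- unname(pf(fs["value"], fs["numdf"], fs["dendf"],
                       lower.tail = FALSE))

# Individual t-tests: H0 is that one slope is zero, given the other predictor
indiv_p <- coef(summary(fit))[-1, "Pr(>|t|)"]

overall_p
indiv_p
```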