Day 5 Homework

For this homework, we will investigate the Nashville_housing.csv dataset located on D2L. The dataset contains information on houses sold in the Nashville area between 2013 and 2016. Be sure to download the file to your computer and import using “Import Dataset”, then “From Text File…”

1) Fit a linear regression for predicting $logSP$ from $Finished.Area$. Conduct a t-test to determine if $Finished.Area$ is a significant predictor for $logSP$, using level of significance $.05$. Be sure to state the alternatives, decision rule, p-value, and conclusion.

#Enter Code Here
Model1 = lm(housing$logSP~housing$Finished.Area)
summary1 = summary(Model1)
summary1

## 
## Call:
## lm(formula = housing$logSP ~ housing$Finished.Area)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.4565 -0.3140  0.0177  0.4146  3.5721 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.124e+01  8.960e-03  1254.0   <2e-16 ***
## housing$Finished.Area 5.101e-04  4.448e-06   114.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5873 on 23534 degrees of freedom
## Multiple R-squared:  0.3585, Adjusted R-squared:  0.3584 
## F-statistic: 1.315e+04 on 1 and 23534 DF,  p-value: < 2.2e-16

#t*
tstar=summary1$coefficients[2,1]/summary1$coefficients[2,2]
tstar

## [1] 114.6751

#Calculate the t cutoff
qt(1-.05/2,nrow(housing)-2)

## [1] 1.960065

#P-Value
2*(1-pt(tstar,nrow(housing)-2))

## [1] 0

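#Note: t.test() on the two columns is a Welch two-sample comparison of their means;
#it is not the regression t-test of the slope b1 carried out above.
t.test(housing$Finished.Area,housing$logSP)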

## 
##  Welch Two Sample t-test
## 
## data:  housing$Finished.Area and housing$logSP
## t = 322.46, df = 23535, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1797.958 1819.949
## sample estimates:
##  mean of x  mean of y 
## 1821.11782   12.16426

H0: b1 = 0, H1: b1 != 0

Fitted model: logSP = 11.24 + 0.0005101(Finished.Area)

Decision Rule: If |t*| <= 1.96, conclude H0, else conclude H1.

Since t* = 114.6751 is greater than 1.96, we reject H0 and conclude that b1 is significantly different from 0. The p-value (< 2.2e-16) is also below the .05 level of significance. There is a linear relationship between logSP and Finished.Area.
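The steps of the slope t-test can be sketched on simulated data. This is only an illustration: the data frame `sim`, its columns, and the coefficients used to generate it are made-up stand-ins, not the Nashville file.

```r
# Simulated stand-in data (hypothetical; not Nashville_housing.csv)
set.seed(1)
n <- 200
sim <- data.frame(area = runif(n, 800, 4000))
sim$logSP <- 11 + 0.0005 * sim$area + rnorm(n, sd = 0.5)

fit <- lm(logSP ~ area, data = sim)
coefs <- summary(fit)$coefficients

# Test statistic: slope estimate divided by its standard error
tstar <- coefs["area", "Estimate"] / coefs["area", "Std. Error"]

# Two-sided cutoff and p-value on n - 2 degrees of freedom
tcrit <- qt(1 - 0.05 / 2, df = n - 2)
pval  <- 2 * pt(abs(tstar), df = n - 2, lower.tail = FALSE)

# Decision rule: reject H0: b1 = 0 when |t*| > tcrit
reject_H0 <- abs(tstar) > tcrit
c(tstar = tstar, tcrit = tcrit, pval = pval)
```

Using `lower.tail = FALSE` instead of `1 - pt(...)` avoids the p-value rounding to exactly 0 when the statistic is very large.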

2) Using the same model described in problem 1, conduct an F-test to determine if $Finished.Area$ is a significant predictor for $logSP$, using level of significance $.05$. Be sure to state the alternatives, decision rule, p-value, and conclusion.

#Enter Code Here
myAnova = anova(Model1)
myAnova

## Analysis of Variance Table
## 
## Response: housing$logSP
##                          Df Sum Sq Mean Sq F value    Pr(>F)    
## housing$Finished.Area     1 4535.9  4535.9   13150 < 2.2e-16 ***
## Residuals             23534 8117.5     0.3                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Fstar=myAnova$`Mean Sq`[1]/myAnova$`Mean Sq`[2]
Fstar

## [1] 13150.39

# F cutoff
qf(1-.05/2,1,nrow(housing)-2)

## [1] 5.024529

1-pf(Fstar,1,nrow(housing)-2)

## [1] 0

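#Note: var.test() compares the variances of the two columns;
#it is not the regression F-test of the slope b1 carried out above.
var.test(housing$Finished.Area, housing$logSP)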

## 
##  F test to compare two variances
## 
## data:  housing$Finished.Area and housing$logSP
## F = 1377600, num df = 23535, denom df = 23535, p-value < 2.2e-16
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  1342866 1413276
## sample estimates:
## ratio of variances 
##            1377621

H0: b1 = 0, H1: b1 != 0

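The F-test is upper-tailed, so the level .05 cutoff is qf(1-.05, 1, n-2), which is about 3.84 (the square of the t cutoff 1.96), rather than the 5.024529 obtained above with 1-.05/2.

Decision Rule: If F* <= 3.84, conclude H0, else conclude H1.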

Since F* = 13150.39 is greater than the cutoff, we reject H0 and conclude that b1 is significantly different from zero at the .05 level (p < 2.2e-16). That is, there is a linear association between logSP and Finished.Area.
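A minimal sketch of the same F-test on simulated data follows. The data frame `sim` and its generating values are hypothetical and only stand in for the housing data.

```r
# Simulated stand-in data (hypothetical; not Nashville_housing.csv)
set.seed(2)
n <- 200
sim <- data.frame(area = runif(n, 800, 4000))
sim$logSP <- 11 + 0.0005 * sim$area + rnorm(n, sd = 0.5)

fit <- lm(logSP ~ area, data = sim)
tab <- anova(fit)

# F* = MSR / MSE, taken from the two rows of the ANOVA table
Fstar <- tab$`Mean Sq`[1] / tab$`Mean Sq`[2]

# The F-test is upper-tailed: all of alpha goes in the right tail
Fcrit <- qf(1 - 0.05, df1 = 1, df2 = n - 2)
pval  <- pf(Fstar, df1 = 1, df2 = n - 2, lower.tail = FALSE)

# Decision rule: reject H0: b1 = 0 when F* > Fcrit
reject_H0 <- Fstar > Fcrit
c(Fstar = Fstar, Fcrit = Fcrit, pval = pval)
```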

3) How do your test statistics from problem 1 and problem 2 relate? How about the p-values? Explain.

#Enter Code Here

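Both tests address the same hypotheses, H0: b1 = 0 versus H1: b1 != 0, for the slope of the regression in problem 1. In simple linear regression the F statistic is the square of the t statistic: t* = 114.6751 and (114.6751)^2 = 13150.38, which matches F* = 13150.39 up to rounding. Because a squared t variable with n-2 degrees of freedom follows an F distribution with 1 and n-2 degrees of freedom, the two p-values are identical (both < 2.2e-16) and the two tests always lead to the same conclusion.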
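The relationship between the two statistics can be checked numerically. The sketch below uses simulated data (the variables `x` and `y` are hypothetical) and compares the squared t statistic with the F statistic, the two p-values, and the two cutoffs.

```r
# Simulated stand-in data (hypothetical)
set.seed(3)
n <- 150
x <- rnorm(n)
y <- 1 + 0.3 * x + rnorm(n)
fit <- lm(y ~ x)

t_row <- summary(fit)$coefficients["x", ]  # slope row of the coefficient table
f_row <- anova(fit)["x", ]                 # regression row of the ANOVA table

tstar <- unname(t_row["t value"])
Fstar <- f_row$`F value`

# In simple linear regression these pairs agree
c(t_squared = tstar^2, F = Fstar)
c(p_t = unname(t_row["Pr(>|t|)"]), p_F = f_row$`Pr(>F)`)
c(qt_squared = qt(0.975, n - 2)^2, qf = qf(0.95, 1, n - 2))
```

The last line also shows why the F cutoff uses 1 - alpha while the t cutoff uses 1 - alpha/2.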

4) Fit a linear regression for predicting $logSP$ from $Finished.Area$, $Bedrooms$, and $Grade$. Construct the ANOVA table (similar to Table 6.1 on page 225). You can make a nice table in R Markdown by using kable in the knitr package.

#Enter Code Here


library(knitr)

Mlin =lm(housing$logSP~housing$Finished.Area + housing$Bedrooms + housing$Grade)
Anova2 = anova(Mlin)
kable(Anova2, digits = 4)

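|                       |    Df |    Sum Sq |   Mean Sq |    F value | Pr(>F) |
|:----------------------|------:|----------:|----------:|-----------:|-------:|
| housing$Finished.Area |     1 | 4535.9360 | 4535.9360 | 13795.1630 |      0 |
| housing$Bedrooms      |     1 |   12.3392 |   12.3392 |    37.5272 |      0 |
| housing$Grade         |     3 |  368.3843 |  122.7948 |   373.4563 |      0 |
| Residuals             | 23530 | 7736.8114 |    0.3288 |         NA |     NA |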
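The sequential table above can be collapsed into a Regression / Error / Total layout by summing the predictor rows. The sketch below does this on simulated data; the data frame `sim`, its column names, and the layout chosen for `anova_tab` are assumptions for illustration and may not match the textbook table exactly.

```r
# Simulated stand-in data (hypothetical; not Nashville_housing.csv)
set.seed(4)
n <- 300
sim <- data.frame(
  area  = runif(n, 800, 4000),
  beds  = sample(1:5, n, replace = TRUE),
  grade = factor(sample(c("A", "B", "C", "D"), n, replace = TRUE))
)
sim$logSP <- 11.5 + 0.0004 * sim$area - 0.03 * sim$beds -
  0.2 * (sim$grade == "C") - 0.5 * (sim$grade == "D") + rnorm(n, sd = 0.5)

fit <- lm(logSP ~ area + beds + grade, data = sim)
seq_tab <- anova(fit)
k <- nrow(seq_tab)  # the last row is Residuals

# Regression SS and df are the sums over the predictor rows
SSR <- sum(seq_tab$`Sum Sq`[-k]); dfR <- sum(seq_tab$Df[-k])
SSE <- seq_tab$`Sum Sq`[k];       dfE <- seq_tab$Df[k]

anova_tab <- data.frame(
  Source = c("Regression", "Error", "Total"),
  SS     = c(SSR, SSE, SSR + SSE),
  df     = c(dfR, dfE, dfR + dfE),
  MS     = c(SSR / dfR, SSE / dfE, NA)
)
anova_tab
```

In an R Markdown document, `anova_tab` could then be passed to `kable()` in the same way as the table above.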

5) Using the same model described in problem 4, test whether there is a regression relation, using level of significance $.05$. State the alternatives, decision rule, and conclusion. What does your test result imply about $\beta _1,\beta_2,\beta_3,\beta_4$ and $\beta _5$? What is the $P$-value of the test?

#Enter Code Here
myANOVA1<-anova(Mlin)
MSE=myANOVA1$`Mean Sq`[4]
#Regression df = 1 + 1 + 3 = 5 (Finished.Area, Bedrooms, three Grade indicators)
MSR=sum(myANOVA1$`Sum Sq`[1:3])/sum(myANOVA1$Df[1:3])
Fstar1=MSR/MSE
Fstar1

## [1] 2990.612

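#Numerator df is 5, matching the regression df used for MSR
Fcrit=qf(1-.05,5,nrow(housing)-6)
Fcrit

## [1] 2.214479

#P-value uses Fstar1 from this problem with 5 and n-6 degrees of freedom
pval=1-pf(Fstar1,5,nrow(housing)-6)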
pval

## [1] 0

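#Note: Mlin was fit with housing$logSP as its response, so anova() cannot match it to
#nullmod (response logSP) and drops it; only the null model's row is shown below.
nullmod<-lm(logSP~1,data=housing)
df2 <- anova(nullmod,Mlin)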

## Warning in anova.lmlist(object, ...): models with response '"housing
## $logSP"' removed because response differs from model 1

df2

## Analysis of Variance Table
## 
## Response: logSP
##              Df Sum Sq Mean Sq F value Pr(>F)
## Residuals 23535  12654 0.53764

kable(df2,method="markdown", row.names=TRUE, caption="ANOVA Analysis of Variance Table")

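ANOVA Analysis of Variance Table

|           |    Df |   Sum Sq |   Mean Sq | F value | Pr(>F) |
|:----------|------:|---------:|----------:|--------:|-------:|
| Residuals | 23535 | 12653.47 | 0.5376448 |      NA |     NA |

H0: b1 = b2 = b3 = b4 = b5 = 0, H1: not all bk (k = 1, ..., 5) equal 0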

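Decision Rule: If F* <= 2.214479, conclude H0, else conclude H1.

F* = 2990.612, Fcrit = 2.214479, p-value = 0 (too small to distinguish from 0).

Since F* = 2990.612 > 2.214479, we reject H0 and conclude that not all of β1, ..., β5 are zero. At least one predictor is linearly related to logSP, but the test does not say which coefficients are nonzero.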
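The overall F-test can also be run as a comparison of the intercept-only model against the full model. For `anova()` to compare them, both models need the same response, which is easiest when both are fit with a formula and `data =`. The sketch below shows this on simulated data; `sim` and its columns are hypothetical stand-ins for the housing data.

```r
# Simulated stand-in data (hypothetical; not Nashville_housing.csv)
set.seed(5)
n <- 300
sim <- data.frame(
  area  = runif(n, 800, 4000),
  beds  = sample(1:5, n, replace = TRUE),
  grade = factor(sample(c("A", "B", "C", "D"), n, replace = TRUE))
)
sim$logSP <- 11.5 + 0.0004 * sim$area - 0.03 * sim$beds -
  0.2 * (sim$grade == "C") - 0.5 * (sim$grade == "D") + rnorm(n, sd = 0.5)

# Same response and same data argument in both fits
full <- lm(logSP ~ area + beds + grade, data = sim)
null <- lm(logSP ~ 1, data = sim)

cmp   <- anova(null, full)   # row 2 holds the test of the full model
Fstar <- cmp$F[2]
pval  <- cmp$`Pr(>F)`[2]

# Number of slope coefficients (5) gives the numerator df
p     <- full$rank - 1
Fcrit <- qf(1 - 0.05, df1 = p, df2 = n - p - 1)

c(Fstar = Fstar, Fcrit = Fcrit, pval = pval)
```

The resulting F* is the same value reported on the last line of `summary(full)`.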

6) Using the same model described in problem 4, conduct individual t-tests for testing the significance of each coefficient. Be sure to state the alternatives, decision rule, and conclusion for each test.

#Enter Code Here

logSP=log(housing$Sale.Price)
mod1=lm(logSP~housing$Finished.Area+housing$Bedrooms+housing$Grade)
summary1=summary(mod1)
summary1

## 
## Call:
## lm(formula = logSP ~ housing$Finished.Area + housing$Bedrooms + 
##     housing$Grade)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.4733 -0.2896  0.0232  0.3797  3.8279 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            1.172e+01  3.620e-02 323.634  < 2e-16 ***
## housing$Finished.Area  4.213e-04  7.015e-06  60.061  < 2e-16 ***
## housing$Bedrooms      -2.660e-02  5.750e-03  -4.625 3.76e-06 ***
## housing$GradeB         4.097e-02  2.742e-02   1.494    0.135    
## housing$GradeC        -2.665e-01  2.931e-02  -9.091  < 2e-16 ***
## housing$GradeD        -5.711e-01  3.330e-02 -17.147  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5734 on 23530 degrees of freedom
## Multiple R-squared:  0.3886, Adjusted R-squared:  0.3884 
## F-statistic:  2991 on 5 and 23530 DF,  p-value: < 2.2e-16

#t*
summary1$coefficients

##                            Estimate   Std. Error    t value     Pr(>|t|)
## (Intercept)           11.7155546786 3.620004e-02 323.633713 0.000000e+00
## housing$Finished.Area  0.0004213336 7.015069e-06  60.061225 0.000000e+00
## housing$Bedrooms      -0.0265962846 5.750336e-03  -4.625171 3.762810e-06
## housing$GradeB         0.0409717846 2.741827e-02   1.494324 1.351043e-01
## housing$GradeC        -0.2664956372 2.931421e-02  -9.091004 1.056810e-19
## housing$GradeD        -0.5710756137 3.330379e-02 -17.147466 1.642834e-65

summary1$coefficients[2,1]

## [1] 0.0004213336

tstar_beta1=(summary1$coefficients[2,1])/summary1$coefficients[2,2]
tstar_beta1

## [1] 60.06122

#Row 2 is Finished.Area, matching tstar_beta1
p.value = summary1$coefficients[2,4]
p.value

## [1] 0

tstar_beta2=(summary1$coefficients[3,1])/summary1$coefficients[3,2]
tstar_beta2

## [1] -4.625171

#Row 3 is Bedrooms, matching tstar_beta2
p.value = summary1$coefficients[3,4]
p.value

## [1] 3.76281e-06

tstar_beta3=(summary1$coefficients[4,1])/summary1$coefficients[4,2]
tstar_beta3

## [1] 1.494324

#Row 4 is GradeB, matching tstar_beta3
p.value = summary1$coefficients[4,4]
p.value

## [1] 0.1351043

tstar_beta4=(summary1$coefficients[5,1])/summary1$coefficients[5,2]
tstar_beta4

## [1] -9.091004

#Row 5 is GradeC, matching tstar_beta4
p.value = summary1$coefficients[5,4]
p.value

## [1] 1.05681e-19

tstar_beta5=(summary1$coefficients[6,1])/summary1$coefficients[6,2]
tstar_beta5

## [1] -17.14747

#Row 6 is GradeD, matching tstar_beta5
p.value = summary1$coefficients[6,4]
p.value

## [1] 1.642834e-65

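For each coefficient bk (k = 1, ..., 5): H0: bk = 0, H1: bk != 0.

Decision Rule: If |t*| <= 1.96 (the .975 quantile of the t distribution with 23530 degrees of freedom), conclude H0, else conclude H1.

Conclusions: Finished.Area (t* = 60.06), Bedrooms (t* = -4.63), GradeC (t* = -9.09), and GradeD (t* = -17.15) all have |t*| > 1.96, so we reject H0 for each and conclude that the coefficient differs from 0. GradeB (t* = 1.49, p = 0.135) has |t*| <= 1.96, so we fail to reject H0: Grade B homes do not differ significantly from Grade A homes in logSP, holding the other predictors fixed.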

The fitted regression line is logSP’ = 1.172e+01 + 4.213e-04(Finished.Area) - 2.660e-02(Bedrooms) + 4.097e-02(GradeB) - 2.665e-01(GradeC) - 5.711e-01(GradeD)

Grade A (baseline): Y = 1.172e+01 + 4.213e-04(Finished.Area) - 2.660e-02(Bedrooms)

Grade B: Y = 1.172e+01 + 4.213e-04(Finished.Area) - 2.660e-02(Bedrooms) + 4.097e-02. If the home is Grade B, predicted logSP goes up by 4.097e-02 in comparison to Grade A, but this difference is not significant.

Grade C: Y = 1.172e+01 + 4.213e-04(Finished.Area) - 2.660e-02(Bedrooms) - 2.665e-01. If the home is Grade C, predicted logSP goes down by 2.665e-01 in comparison to Grade A.

Grade D: Y = 1.172e+01 + 4.213e-04(Finished.Area) - 2.660e-02(Bedrooms) - 5.711e-01. If the home is Grade D, predicted logSP goes down by 5.711e-01 in comparison to Grade A.
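The individual tests can be collected in one table so that each statistic sits beside its own p-value and decision. The sketch below uses simulated data; `sim`, its columns, and the `decisions` table are hypothetical names introduced for illustration.

```r
# Simulated stand-in data (hypothetical; not Nashville_housing.csv)
set.seed(6)
n <- 300
sim <- data.frame(
  area  = runif(n, 800, 4000),
  beds  = sample(1:5, n, replace = TRUE),
  grade = factor(sample(c("A", "B", "C", "D"), n, replace = TRUE))
)
sim$logSP <- 11.5 + 0.0004 * sim$area - 0.03 * sim$beds -
  0.2 * (sim$grade == "C") - 0.5 * (sim$grade == "D") + rnorm(n, sd = 0.5)

fit   <- lm(logSP ~ area + beds + grade, data = sim)
coefs <- summary(fit)$coefficients

# Two-sided cutoff on the residual degrees of freedom (n - 6 here)
tcrit <- qt(1 - 0.05 / 2, df = df.residual(fit))

# One row per slope coefficient; [-1] drops the intercept
decisions <- data.frame(
  term      = rownames(coefs)[-1],
  tstar     = coefs[-1, "t value"],
  pval      = coefs[-1, "Pr(>|t|)"],
  reject_H0 = abs(coefs[-1, "t value"]) > tcrit,
  row.names = NULL
)
decisions
```

Indexing by column name rather than by row number keeps each p-value aligned with the coefficient it belongs to.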

7) Explain why problem 5 and problem 6 are fundamentally different tests. Be sure to talk about the hypotheses.

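The F test assesses the model as a whole, while each t test assesses a single coefficient. In problem 5 the hypotheses are H0: b1 = b2 = b3 = b4 = b5 = 0 versus H1: at least one bk != 0, so rejecting H0 only says that some predictor is useful, not which one. In problem 6 each test has H0: bk = 0 versus H1: bk != 0 for one coefficient, given that the other predictors are already in the model.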

Due: May 16, 2017