Problem 1

The stopping data in the alr4 package provides data on the relationship between the speed of cars (in miles per hour) and the stopping distance (in feet) that cars require at that speed. Fit a simple linear regression with Distance as the response and interpret the model. The model you fit must meet the assumptions of simple linear regression before you do all interpretations. Use diagnostic tools to check the assumptions and any tools such as transformations to make the model fit. What transformations did you choose and why?

library(alr4)
## Loading required package: car
## Loading required package: carData
## Loading required package: effects
## lattice theme set by effectsTheme()
## See ?effectsTheme for details.
head(stopping)
##   Speed Distance
## 1     4        4
## 2     5        2
## 3     5        4
## 4     5        8
## 5     5        8
## 6     7        7
summary(stopping)
##      Speed          Distance     
##  Min.   : 4.00   Min.   :  2.00  
##  1st Qu.:10.00   1st Qu.: 13.25  
##  Median :17.50   Median : 29.50  
##  Mean   :18.92   Mean   : 39.31  
##  3rd Qu.:26.75   3rd Qu.: 56.75  
##  Max.   :40.00   Max.   :138.00
g1<-lm(sqrt(Distance)~Speed,data=stopping)
summary(g1)
## 
## Call:
## lm(formula = sqrt(Distance) ~ Speed, data = stopping)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.49948 -0.54761  0.00469  0.53153  1.54350 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.932396   0.197909   4.711  1.5e-05 ***
## Speed       0.252466   0.009274  27.223  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7209 on 60 degrees of freedom
## Multiple R-squared:  0.9251, Adjusted R-squared:  0.9239 
## F-statistic: 741.1 on 1 and 60 DF,  p-value: < 2.2e-16
plot(g1)

residualPlots(g1)

##            Test stat Pr(>|Test stat|)
## Speed         0.1477           0.8831
## Tukey test    0.1477           0.8826
ncvTest(g1)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 2.331351, Df = 1, p = 0.12679

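The residual plot initially showed a curve. A curve indicates that the mean function is not linear, which is a separate problem from nonconstant variance, so I needed a transformation and then a check of both assumptions. I first tried the log of Distance, but the residuals still showed a curve. With the square root of Distance the residuals were scattered randomly around zero, which told me that was the right transformation. The tests agree: the curvature test from residualPlots is not significant (p = 0.88), and ncvTest gives no evidence of nonconstant variance (p = 0.127). In the fitted model, each additional mile per hour of Speed is associated with an increase of about 0.25 in the square root of Distance, and Speed explains about 92.5% of the variation in sqrt(Distance).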
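
Because the model is fit on the square-root scale, a fitted value has to be squared before it can be read in feet. The lines below are a minimal sketch of that step, using the coefficients copied from the summary(g1) output above. The three speeds are hypothetical values picked for illustration, and squaring the fitted value gives only a rough back-transformed prediction, not an exact estimate of the mean stopping distance.

```r
# Coefficients copied from the summary(g1) output above
b0 <- 0.932396   # intercept on the sqrt(Distance) scale
b1 <- 0.252466   # slope: change in sqrt(Distance) per 1 mph of Speed

# Hypothetical speeds for illustration (inside the observed 4 to 40 mph range)
speed <- c(10, 20, 30)

sqrt_dist <- b0 + b1 * speed   # fitted values on the transformed scale
dist_ft   <- sqrt_dist^2       # square to get back to feet

data.frame(speed, sqrt_dist, dist_ft)
```

With alr4 loaded and g1 fit as above, predict(g1, newdata = data.frame(Speed = 20))^2 should give the same value as the second row.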

Problem 2

The MinnLand data in the alr4 package includes every farm sale in Minnesota from 2002-2011 enrolled in the federal Conservation Reserve Program. Fit a linear regression model to understand the impact of the acre price of the farm (variable called acrePrice HINT: THIS IS THE RESPONSE) as it relates to acres and the percentage of tillable land on the farm (variable is called tillable) (HINT: USE BOTH AS PREDICTORS in the same model, multiple linear regression). Check the assumptions of multiple linear regression using the diagnostic tools discussed in lecture. Using the tools discussed adjust the model to meet assumptions, and provide a discussion on the meaning of the model including but not limited to the interpretation of the slope coefficients. HINT: RULES OF THUMB COULD BE HELPFUL!

library(alr4)
head(MinnLand)
##   acrePrice    region improvements year acres tillable      financing crpPct
## 1       766 Northwest            0 2002    82       94 title_transfer      0
## 2       733 Northwest            0 2003    30       63 title_transfer      0
## 3       850 Northwest            4 2002   150       47 title_transfer      0
## 4       975 Northwest            0 2003   160       86 title_transfer      0
## 5       886 Northwest           62 2002    90       NA title_transfer      0
## 6       992 Northwest           30 2003   120       83 title_transfer      0
##   productivity
## 1           NA
## 2           NA
## 3           NA
## 4           NA
## 5           NA
## 6           NA
summary(MinnLand)
##    acrePrice               region      improvements          year     
##  Min.   :  108   Northwest    :3799   Min.   :  0.000   Min.   :2002  
##  1st Qu.: 1425   West Central :3297   1st Qu.:  0.000   1st Qu.:2004  
##  Median : 2442   Central      :4198   Median :  0.000   Median :2006  
##  Mean   : 2787   South West   :2583   Mean   :  4.493   Mean   :2006  
##  3rd Qu.: 3702   South Central:2832   3rd Qu.:  0.000   3rd Qu.:2008  
##  Max.   :15000   South East   :1991   Max.   :100.000   Max.   :2011  
##                                       NA's   :50                      
##      acres           tillable                financing         crpPct       
##  Min.   :   1.0   Min.   :  0.00   title_transfer :16601   Min.   :  0.000  
##  1st Qu.:  47.0   1st Qu.: 72.00   seller_financed: 2099   1st Qu.:  0.000  
##  Median :  80.0   Median : 92.00                           Median :  0.000  
##  Mean   : 112.7   Mean   : 80.67                           Mean   :  4.163  
##  3rd Qu.: 153.0   3rd Qu.: 97.00                           3rd Qu.:  0.000  
##  Max.   :6970.0   Max.   :100.00                           Max.   :100.000  
##                   NA's   :1212                                              
##   productivity  
##  Min.   : 1.00  
##  1st Qu.:59.00  
##  Median :68.00  
##  Mean   :66.63  
##  3rd Qu.:76.00  
##  Max.   :99.00  
##  NA's   :9717
g2<-lm(log(acrePrice)~log(acres)+sqrt(tillable), data=MinnLand)
plot(g2)

ncvTest(g2)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 37.40519, Df = 1, p = 9.5966e-10
summary(g2)
## 
## Call:
## lm(formula = log(acrePrice) ~ log(acres) + sqrt(tillable), data = MinnLand)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3.14351 -0.39930  0.08592  0.47220  2.48029 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     8.052680   0.040540  198.64   <2e-16 ***
## log(acres)     -0.250169   0.006883  -36.35   <2e-16 ***
## sqrt(tillable)  0.086580   0.003377   25.64   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6818 on 17485 degrees of freedom
##   (1212 observations deleted due to missingness)
## Multiple R-squared:  0.09238,    Adjusted R-squared:  0.09227 
## F-statistic: 889.8 on 2 and 17485 DF,  p-value: < 2.2e-16

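On their original scales, acrePrice, acres, and tillable show some trends, so I knew I would need transformations to get closer to constant variance. The log of acrePrice, the log of acres, and the square root of tillable came closest. The ncvTest still rejects constant variance (p = 9.6e-10), but with 17,488 observations the test is sensitive to even small departures, so the residual plots are the more useful guide here. The p-values for log(acres) and sqrt(tillable) are both below 0.05 (each < 2e-16), so both predictors are related to acrePrice. The intercept is 8.05 on the log scale, which corresponds to an acrePrice of exp(8.05), about 3,140, when acres = 1 and tillable = 0. Holding tillable fixed, a 1% increase in acres is associated with about a 0.25% decrease in acrePrice. Holding acres fixed, a one-unit increase in the square root of tillable is associated with an increase of 0.087 in log(acrePrice), which is about a 9% increase in acrePrice. The model explains only about 9% of the variation in log(acrePrice) (R-squared = 0.092), so most of the variation in price comes from factors outside this model.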
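
Because the response is on the log scale, the slopes describe multiplicative changes in acrePrice, not additive ones. The lines below are a rough sketch of that arithmetic, using the coefficients copied from the summary(g2) output above. The comparisons (doubling acres, moving tillable from 64% to 81%) are hypothetical examples, not values taken from the assignment.

```r
# Coefficients copied from the summary(g2) output above
b_acres    <- -0.250169   # coefficient on log(acres)
b_tillable <-  0.086580   # coefficient on sqrt(tillable)

# Ratio of acrePrice when acres doubles, tillable held fixed:
# log(2 * acres) - log(acres) = log(2), so the ratio is exp(b * log(2)) = 2^b
ratio_double_acres <- 2^b_acres

# Ratio of acrePrice when sqrt(tillable) rises by one unit, acres held fixed
ratio_sqrt_plus1 <- exp(b_tillable)

# Hypothetical comparison: tillable from 64% to 81% (sqrt goes from 8 to 9)
ratio_64_to_81 <- exp(b_tillable * (sqrt(81) - sqrt(64)))

round(c(double_acres      = ratio_double_acres,
        sqrt_plus1        = ratio_sqrt_plus1,
        tillable_64_to_81 = ratio_64_to_81), 3)
```

A ratio below 1 means a lower price per acre and a ratio above 1 a higher one; subtracting 1 and multiplying by 100 turns each ratio into a percent change.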