STAT 155-01,-02 Topic 12 In-class Activity

Load some potentially useful packages:

library(ggplot2)

library(dplyr)

Question 1

We can use Adjusted-R\(^2\) to compare nested or non-nested models. F-tests can only be used to compare nested models. In this problem, we will use the USCrime data.

USCrime = read.csv( 'https://raw.githubusercontent.com/vittorioaddona/data/main/USCrime.csv' )

Consider the following two models based on the USCrime data:

\[CrimeRate \sim Expen60 + Poverty ~~~~\textbf{vs.}\]

\[CrimeRate \sim Expen60 + Poverty + Unem + PopSize\]

(a) Are these two models nested? If so, we can use either Adjusted-R\(^2\) or an F-test to decide between them.

Yes, they are. The first model is the second model with the coefficients on Unem and PopSize set to zero.

(b) Which of these models would be selected if we use Adjusted-R\(^2\) as our criterion? Note: I realize that it is very close, but just follow the “bigger is better” rule.

model1=lm(formula=CrimeRate~Expen60+Poverty,data=USCrime)
model2=lm(formula=CrimeRate~Expen60+Poverty+Unem+PopSize,data=USCrime)
summary(model1)

## 
## Call:
## lm(formula = CrimeRate ~ Expen60 + Poverty, data = USCrime)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -70.387 -12.181  -1.456  15.455  50.645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -94.4662    34.3947  -2.747  0.00869 ** 
## Expen60       1.2415     0.1637   7.582 1.62e-09 ***
## Poverty       0.4095     0.1220   3.357  0.00163 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.62 on 44 degrees of freedom
## Multiple R-squared:  0.5803, Adjusted R-squared:  0.5612 
## F-statistic: 30.42 on 2 and 44 DF,  p-value: 5.061e-09

summary(model2)

## 
## Call:
## lm(formula = CrimeRate ~ Expen60 + Poverty + Unem + PopSize, 
##     data = USCrime)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -58.190 -12.049  -0.687  16.291  50.036 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -114.77119   38.21229  -3.004 0.004484 ** 
## Expen60        1.40179    0.20239   6.926 1.85e-08 ***
## Poverty        0.46328    0.12923   3.585 0.000871 ***
## Unem           0.07991    0.46835   0.171 0.865348    
## PopSize       -0.17653    0.12444  -1.419 0.163389    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.61 on 42 degrees of freedom
## Multiple R-squared:  0.5995, Adjusted R-squared:  0.5614 
## F-statistic: 15.72 on 4 and 42 DF,  p-value: 6.122e-08

Model 2 (CrimeRate ~ Expen60 + Poverty + Unem + PopSize) is selected, because its Adjusted R^2 of 0.5614 is slightly larger than the 0.5612 of model 1.

(c) Perform an F-test to decide between these two models. Which model is taken as the null hypothesis, and which one is the alternative hypothesis? State the test statistic, p-value, and your conclusion. In this case, does your conclusion from (b) agree with your conclusion from this F-test?

anova(model1,model2)

## Analysis of Variance Table
## 
## Model 1: CrimeRate ~ Expen60 + Poverty
## Model 2: CrimeRate ~ Expen60 + Poverty + Unem + PopSize
##   Res.Df   RSS Df Sum of Sq      F Pr(>F)
## 1     44 28878                           
## 2     42 27555  2    1322.9 1.0082 0.3735

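The null hypothesis is the smaller model (model 1): the coefficients on Unem and PopSize are both zero. The alternative hypothesis is the larger model (model 2): at least one of these coefficients is not zero. The test statistic is F = 1.0082 and the p-value is 0.3735. Because the p-value is larger than 0.05, an improvement in fit as large as the one seen in model 2 is quite plausible under the null hypothesis, so we fail to reject it and keep model 1. PopSize and Unem do not add anything significant to the model. This conclusion does not agree with (b), where Adjusted R^2 selected model 2.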
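As a check on the anova output, the F statistic and p-value can be recomputed by hand from the numbers in the table. This is only a sketch: the values are copied from the rounded output above, and the variable names are made up for illustration.

```r
# Values copied from the anova(model1, model2) table above (rounded as printed)
rss_full <- 27555    # RSS of the larger model (model2)
df_full  <- 42       # residual degrees of freedom of the larger model
sum_sq   <- 1322.9   # drop in RSS from adding Unem and PopSize
n_added  <- 2        # number of added coefficients

# F = (drop in RSS per added coefficient) / (residual variance of the larger model)
f_stat <- (sum_sq / n_added) / (rss_full / df_full)

# Upper-tail probability of the F distribution with (n_added, df_full) df
p_val <- pf(f_stat, df1 = n_added, df2 = df_full, lower.tail = FALSE)

round(c(F = f_stat, p = p_val), 4)
```

Up to rounding, the result should match the F and Pr(>F) columns reported by anova().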

(d) Your answer to (c) indicates a more general result about which of F-tests or Adjusted-R\(^2\) is more stringent about the inclusion of additional variables in a model. Which comparison method is more stringent about selecting a bigger model?

The F-test is more stringent about adding variables than Adjusted R^2. Here Adjusted R^2 selected the bigger model, while the F-test kept the smaller one.
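One way to see why: Adjusted R^2 goes up whenever the F statistic for the added variables is above 1, while the F-test at the 5% level needs a much larger F. The sketch below recomputes the two Adjusted R^2 values from the rounded R^2 values printed above and looks up the 5% cutoff. The helper name adj_r2 is made up for this illustration.

```r
# Adjusted R^2 from R^2, sample size n, and number of explanatory variables p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n <- 47  # 44 residual df + 3 coefficients in model1

adj_small <- adj_r2(0.5803, n, p = 2)  # model1
adj_big   <- adj_r2(0.5995, n, p = 4)  # model2
round(c(model1 = adj_small, model2 = adj_big), 4)

# F value needed to reject at the 5% level with 2 and 42 degrees of freedom
f_cutoff <- qf(0.95, df1 = 2, df2 = 42)
f_cutoff
```

The observed F of 1.0082 is just above 1, which is enough for Adjusted R^2 to prefer the bigger model but far below the cutoff the F-test requires.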

Question 2

`MacGrades.csv` contains a sub-sample (to help preserve anonymity) of every grade assigned to a former Macalester graduating class. For each of the 6146 rows of data, the following information is provided (with a few missing values):

sessionID: A section ID number

sid: A student ID number

grade: The grade obtained, as a numerical value (i.e. an \(A\) is a 4, an \(A-\) is a 3.67, etc.)

dept: A department identifier (these have been made ambiguous to maintain anonymity)

level: The course level (e.g. 100-, 200-, 300-, and 600-)

sem: A semester identifier

enroll: The section enrollment

iid: An instructor identifier (these have been made ambiguous to maintain anonymity)

Read in the data:

MacGrades = read.csv( 'https://raw.githubusercontent.com/vittorioaddona/data/main/MacGrades.csv' )

(a) Make a model for `grade` that uses `level`. Although `level` is recorded as a number, it is really a categorical variable. To treat it as such in your model, type:

m = lm( grade ~ factor(level) , data=MacGrades )
summary(m)

## 
## Call:
## lm(formula = grade ~ factor(level), data = MacGrades)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4776 -0.3492  0.2089  0.5224  0.6508 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       3.34924    0.01208 277.166  < 2e-16 ***
## factor(level)200  0.11183    0.01995   5.606 2.17e-08 ***
## factor(level)300  0.09078    0.01949   4.659 3.25e-06 ***
## factor(level)400  0.12835    0.03168   4.052 5.15e-05 ***
## factor(level)600  0.63339    0.13624   4.649 3.41e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5915 on 5704 degrees of freedom
##   (437 observations deleted due to missingness)
## Multiple R-squared:  0.01085,    Adjusted R-squared:  0.01016 
## F-statistic: 15.65 on 4 and 5704 DF,  p-value: 9.713e-13

m2=lm(grade~1,data=MacGrades)
anova(m2,m)

## Analysis of Variance Table
## 
## Model 1: grade ~ 1
## Model 2: grade ~ factor(level)
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   5708 2017.5                                  
## 2   5704 1995.6  4    21.898 15.648 9.713e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Is level significantly associated with grade? What are the hypotheses, test statistic, and p-value?

Also, by looking at the coefficients, comment on the nature of this relationship.

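The null hypothesis is that level is not associated with grade (all of the level coefficients are zero). The alternative hypothesis is that level is associated with grade (at least one level coefficient is not zero). The F statistic is 15.648 and the p-value is 9.713e-13. Because the p-value is much less than 0.05, we conclude that level is associated with grade. All four level coefficients are positive, so average grades are estimated to be higher at every level than at the baseline level (100-level courses): by about 0.11 at the 200 level, 0.09 at the 300 level, 0.13 at the 400 level, and 0.63 at the 600 level. The R^2 is only about 0.011, so level explains very little of the variation in grades.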

(b) Is `enroll` associated with `grade`? What are the hypotheses, test statistic, and p-value?

EG=lm(grade~enroll,data=MacGrades)
anova(m2,EG)

## Analysis of Variance Table
## 
## Model 1: grade ~ 1
## Model 2: grade ~ enroll
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   5708 2017.5                                  
## 2   5707 2009.8  1    7.7436 21.989 2.806e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The null hypothesis is that enroll is not associated with grade (its coefficient is zero); the alternative hypothesis is that enroll is associated with grade. The F statistic is 21.989 and the p-value is 2.806*10^-6. Because the p-value is so small, we reject the null hypothesis and conclude that enroll is associated with grade.

(c) Is `enroll` associated with `grade` after we control for `level`? What are the hypotheses, test statistic, and p-value?

EGL=lm(grade~enroll+level,data=MacGrades)
anova(m2,EGL)

## Analysis of Variance Table
## 
## Model 1: grade ~ 1
## Model 2: grade ~ enroll + level
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   5708 2017.5                                  
## 2   5706 1999.7  2    17.867 25.492 9.512e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

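The null hypothesis is that, controlling for level, enroll is not associated with grade (its coefficient is zero). The alternative hypothesis is that, controlling for level, enroll is associated with grade. The comparison above does not test this directly: anova(m2,EGL) uses the intercept-only model as the null, so its F statistic of 25.492 (p-value 9.512*10^-12) tests enroll and level jointly, and level enters as a number rather than as factor(level). To isolate enroll, the null model should be grade ~ factor(level) and the alternative should be grade ~ factor(level) + enroll.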

(d) Is `level` associated with `grade` after controlling for `enroll`?

EG=lm(grade~level+enroll,data=MacGrades)
anova(m2,EG)

## Analysis of Variance Table
## 
## Model 1: grade ~ 1
## Model 2: grade ~ level + enroll
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   5708 2017.5                                  
## 2   5706 1999.7  2    17.867 25.492 9.512e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

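The output above is identical to the one in (c), because reordering the terms does not change the fitted model, and with the intercept-only model as the null it again tests level and enroll jointly (F = 25.492, p-value 9.512*10^-12). It shows that the two variables together are associated with grade. To test level after controlling for enroll, the null model should be grade ~ enroll and the alternative should be grade ~ enroll + factor(level).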
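The comparisons that isolate one variable after controlling for the other can be sketched as follows. To keep the example self-contained, it runs on a small simulated data frame (sim) with the same column names as MacGrades. The simulated values and the object names are made up for illustration and say nothing about the real grades.

```r
# Simulated stand-in for MacGrades (hypothetical values, NOT the real data)
set.seed(155)
n <- 300
sim <- data.frame(
  level  = sample(c(100, 200, 300, 400, 600), n, replace = TRUE),
  enroll = sample(5:40, n, replace = TRUE)
)
sim$grade <- 3.3 + 0.1 * (sim$level > 100) - 0.005 * sim$enroll +
  rnorm(n, sd = 0.5)

# Full model contains both variables; each null model drops exactly one of them
full        <- lm(grade ~ factor(level) + enroll, data = sim)
level_only  <- lm(grade ~ factor(level), data = sim)
enroll_only <- lm(grade ~ enroll, data = sim)

# (c) enroll after controlling for level: 1 added coefficient
test_enroll <- anova(level_only, full)
print(test_enroll)

# (d) level after controlling for enroll: 4 added coefficients (5 levels)
test_level <- anova(enroll_only, full)
print(test_level)
```

With the real data, replacing sim by MacGrades in the three lm() calls gives the tests that (c) and (d) ask for. The Df column of each table shows how many coefficients are being tested.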

(e) Make a model for `grade` that uses `dept`. Is `dept` significantly associated with `grade`? How can you tell?

DG=lm(grade~dept,data=MacGrades)
summary(DG)

## 
## Call:
## lm(formula = grade ~ dept, data = MacGrades)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5268 -0.2749  0.1475  0.4515  0.9039 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.5000000  0.4088676   8.560   <2e-16 ***
## deptb       -0.2536066  0.4155163  -0.610    0.542    
## deptB       -0.2614286  0.4278948  -0.611    0.541    
## deptC        0.0268398  0.4106338   0.065    0.948    
## deptd       -0.1825995  0.4098240  -0.446    0.656    
## deptD        0.0617623  0.4105399   0.150    0.880    
## depte       -0.0162500  0.4122607  -0.039    0.969    
## deptE        0.1391667  0.4416275   0.315    0.753    
## deptF       -0.2251373  0.4104679  -0.548    0.583    
## deptg        0.0956522  0.4176615   0.229    0.819    
## deptG       -0.3303306  0.4105537  -0.805    0.421    
## deptH       -0.0307812  0.4120495  -0.075    0.940    
## depti       -0.1474011  0.4111711  -0.358    0.720    
## deptI        0.0011538  0.4243020   0.003    0.998    
## deptj       -0.0625248  0.4108867  -0.152    0.879    
## deptJ       -0.2815172  0.4116777  -0.684    0.494    
## deptk        0.0124521  0.4104311   0.030    0.976    
## deptK       -0.2431624  0.4123474  -0.590    0.555    
## deptL        0.0468000  0.4169648   0.112    0.911    
## deptm        0.0224798  0.4099682   0.055    0.956    
## deptM       -0.4039440  0.4099067  -0.985    0.324    
## deptn        0.1487363  0.4111080   0.362    0.718    
## deptN        0.1670000  0.4139469   0.403    0.687    
## depto       -0.3616667  0.4255629  -0.850    0.395    
## deptO        0.0066856  0.4100242   0.016    0.987    
## deptp       -0.0139370  0.4120744  -0.034    0.973    
## deptP       -0.1140000  0.4249077  -0.268    0.788    
## deptq        0.0001132  0.4104076   0.000    1.000    
## deptQ       -0.0644094  0.4120744  -0.156    0.876    
## deptR       -0.0381579  0.4110139  -0.093    0.926    
## depts       -0.0154839  0.4218507  -0.037    0.971    
## deptS       -0.0247500  0.4139469  -0.060    0.952    
## deptt       -0.0126923  0.4166562  -0.030    0.976    
## deptT       -0.2733962  0.4165106  -0.656    0.512    
## deptU       -0.1136842  0.4298486  -0.264    0.791    
## deptV       -0.0242500  0.4189646  -0.058    0.954    
## deptW       -0.0507164  0.4100863  -0.124    0.902    
## deptX       -0.1189865  0.4116209  -0.289    0.773    
## deptY        0.0814894  0.4174763   0.195    0.845    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5782 on 5670 degrees of freedom
##   (437 observations deleted due to missingness)
## Multiple R-squared:  0.06037,    Adjusted R-squared:  0.05407 
## F-statistic: 9.586 on 38 and 5670 DF,  p-value: < 2.2e-16

anova(m2,DG)

## Analysis of Variance Table
## 
## Model 1: grade ~ 1
## Model 2: grade ~ dept
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   5708 2017.5                                  
## 2   5670 1895.7 38    121.79 9.5862 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Yes, dept is associated with grade. The null hypothesis is that dept is not associated with grade (all department coefficients are zero). The F statistic is 9.5862 on 38 and 5670 degrees of freedom, and the p-value is less than 2.2*10^-16, so we reject the null hypothesis and conclude that department is associated with grade.

(f) Are any of the individual department p-values significant? What do these p-values tell us, and why is this not contradictory with your answer to (e)?

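None of the individual department p-values are significant; the smallest is 0.324 (deptM). Each of these p-values tests whether the mean grade in that department differs from the mean grade in the reference department (the one absorbed into the intercept). The standard errors are all about 0.41, which suggests the reference department has very few grades, so every comparison against it is imprecise. This does not contradict (e), because the F-test asks a different question: whether there are any differences in mean grade among the departments at all, not whether each department differs from one particular reference department.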
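The dependence of the individual p-values on the reference category can be illustrated with a small sketch. It uses R's built-in PlantGrowth data as a stand-in (an assumption made for convenience, since it has a three-level factor), not the MacGrades data.

```r
# Fit with the default reference level ("ctrl")
fit_ctrl <- lm(weight ~ group, data = PlantGrowth)

# Refit with a different reference level ("trt1")
pg <- PlantGrowth
pg$group <- relevel(pg$group, ref = "trt1")
fit_trt1 <- lm(weight ~ group, data = pg)

# Individual p-values: each compares one group with the reference group,
# so they change when the reference changes
coef(summary(fit_ctrl))[, "Pr(>|t|)"]
coef(summary(fit_trt1))[, "Pr(>|t|)"]

# Overall F statistic: tests for any difference among groups,
# so it does not depend on the reference
f_ctrl <- summary(fit_ctrl)$fstatistic
f_trt1 <- summary(fit_trt1)$fstatistic
all.equal(f_ctrl, f_trt1)
```

The same idea applies to dept: choosing a different reference department with relevel() would change the individual p-values but leave the overall F-test from (e) unchanged.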

Question 3

`FUELCON.csv` has the following variables for all 50 states and DC:

FUEL: Per capita fuel consumption in gallons

DRIVERS: Ratio of licensed drivers to vehicles registered

HWYMILES: Number of miles of federally funded highways

GASTAX: Tax per gallon of gasoline in cents

INCOME: Average household income in dollars

Read in the data:

FUELCON = read.csv( 'https://raw.githubusercontent.com/vittorioaddona/data/main/FUELCON.csv' )

(a) Make a model for the `FUEL` response variable that uses the other 4 explanatory variables simultaneously. Test whether this model fits better than the constant only model. State the hypotheses, test statistic, and p-value.

A=lm(formula=FUEL~1,data=FUELCON)
f=lm(formula=FUEL~DRIVERS+HWYMILES+GASTAX+INCOME,data=FUELCON)
summary(A)

## 
## Call:
## lm(formula = FUEL ~ 1, data = FUELCON)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -196.466  -38.736    6.514   45.999  229.094 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   486.46      10.14   47.99   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 72.4 on 50 degrees of freedom

anova(A,f)

## Analysis of Variance Table
## 
## Model 1: FUEL ~ 1
## Model 2: FUEL ~ DRIVERS + HWYMILES + GASTAX + INCOME
##   Res.Df    RSS Df Sum of Sq     F    Pr(>F)    
## 1     50 262054                                 
## 2     46 145705  4    116349 9.183 1.537e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

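The null hypothesis is that the coefficients on DRIVERS, HWYMILES, GASTAX, and INCOME are all zero, so the constant-only model is adequate. The alternative hypothesis is that at least one of these coefficients is not zero. The test statistic is F = 9.183 and the p-value is 1.537*10^-5, so we reject the null hypothesis: the four-variable model fits better than the constant-only model.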

(b) Does adding the variables `GASTAX` and `INCOME` improve a model for `FUEL` which already contains `DRIVERS`? Answer this question by performing a formal test of hypotheses: state the hypotheses, test statistic, and p-value.

B=lm(formula=FUEL~DRIVERS,data=FUELCON)
C=lm(formula=FUEL~DRIVERS+GASTAX+INCOME,data=FUELCON)
summary(B)

## 
## Call:
## lm(formula = FUEL ~ DRIVERS, data = FUELCON)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -130.910  -47.129   -1.325   38.225  177.391 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   726.51      55.90  12.997  < 2e-16 ***
## DRIVERS      -281.12      64.66  -4.347 6.94e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 62.12 on 49 degrees of freedom
## Multiple R-squared:  0.2784, Adjusted R-squared:  0.2636 
## F-statistic:  18.9 on 1 and 49 DF,  p-value: 6.941e-05

summary(C)

## 
## Call:
## lm(formula = FUEL ~ DRIVERS + GASTAX + INCOME, data = FUELCON)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -153.90  -36.35    0.58   36.53  165.78 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.992e+02  6.963e+01  12.914  < 2e-16 ***
## DRIVERS     -2.129e+02  6.149e+01  -3.463  0.00115 ** 
## GASTAX      -3.531e+00  1.753e+00  -2.015  0.04968 *  
## INCOME      -5.463e-03  1.759e-03  -3.106  0.00321 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 56.04 on 47 degrees of freedom
## Multiple R-squared:  0.4368, Adjusted R-squared:  0.4008 
## F-statistic: 12.15 on 3 and 47 DF,  p-value: 5.208e-06

anova(B,C)

## Analysis of Variance Table
## 
## Model 1: FUEL ~ DRIVERS
## Model 2: FUEL ~ DRIVERS + GASTAX + INCOME
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
## 1     49 189111                                
## 2     47 147590  2     41521 6.6112 0.002951 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The null hypothesis is that, in a model that already contains DRIVERS, the coefficients on GASTAX and INCOME are both zero. The alternative hypothesis is that at least one of them is not zero. The test statistic is F = 6.6112 and the p-value is 0.002951. Because the p-value is below 0.05, we reject the null hypothesis and conclude that adding GASTAX and INCOME improves the model for FUEL.

(c) Which of the following two models would you choose?

\[FUEL ~\sim~ DRIVERS ~~~~ \text{vs.}~~~~ FUEL ~\sim~ GASTAX + INCOME + HWYMILES\]

FD=lm(FUEL~DRIVERS,data=FUELCON)
FGIH=lm(FUEL~GASTAX+INCOME+HWYMILES,data=FUELCON)
summary(FD)

## 
## Call:
## lm(formula = FUEL ~ DRIVERS, data = FUELCON)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -130.910  -47.129   -1.325   38.225  177.391 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   726.51      55.90  12.997  < 2e-16 ***
## DRIVERS      -281.12      64.66  -4.347 6.94e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 62.12 on 49 degrees of freedom
## Multiple R-squared:  0.2784, Adjusted R-squared:  0.2636 
## F-statistic:  18.9 on 1 and 49 DF,  p-value: 6.941e-05

summary(FGIH)

## 
## Call:
## lm(formula = FUEL ~ GASTAX + INCOME + HWYMILES, data = FUELCON)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -175.869  -32.628    2.982   30.946  198.741 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.907e+02  7.128e+01  11.093 1.01e-14 ***
## GASTAX      -4.220e+00  1.967e+00  -2.145  0.03714 *  
## INCOME      -7.353e-03  1.878e-03  -3.916  0.00029 ***
## HWYMILES    -3.866e-04  1.113e-03  -0.347  0.72997    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 62.7 on 47 degrees of freedom
## Multiple R-squared:  0.2949, Adjusted R-squared:  0.2499 
## F-statistic: 6.553 on 3 and 47 DF,  p-value: 0.000857

anova(FGIH,FD)

## Analysis of Variance Table
## 
## Model 1: FUEL ~ GASTAX + INCOME + HWYMILES
## Model 2: FUEL ~ DRIVERS
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     47 184766                           
## 2     49 189111 -2   -4345.1 0.5526 0.5791

Briefly justify your choice.

I would choose the FD model (FUEL ~ DRIVERS) because its Adjusted R^2 (0.2636) is larger than that of the FGIH model (0.2499), and it uses fewer variables. These two models are not nested, so the F-test does not apply and the anova output above should not be used to choose between them.
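Before calling anova() on two models, it can help to check whether one is nested in the other. The helper below, is_nested, is a hypothetical function written for this illustration and is only a rough check: it compares term labels and assumes both models have the same response and were fit to the same data. It is demonstrated on the built-in mtcars data rather than on FUELCON.

```r
# Rough nestedness check: every term of the smaller model must appear
# in the bigger model (assumes same response and same data)
is_nested <- function(small, big) {
  small_terms <- attr(terms(small), "term.labels")
  big_terms   <- attr(terms(big), "term.labels")
  all(small_terms %in% big_terms)
}

# Stand-in models on built-in data
m_a <- lm(mpg ~ wt, data = mtcars)
m_b <- lm(mpg ~ wt + hp, data = mtcars)
m_c <- lm(mpg ~ hp + qsec, data = mtcars)

is_nested(m_a, m_b)  # wt is among the terms of m_b
is_nested(m_a, m_c)  # wt is not among the terms of m_c
```

Applied to FD and FGIH, a check like this would flag that DRIVERS is not among the terms of the second model, so the pair should be compared with Adjusted R^2 instead of an F-test.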

Question 4

A real estate appraiser wants to explore the relationship between sale price of an apartment building and other characteristics of the property. The data available is contained in the file MNSALES.csv. The variables of interest are:

Price: Sale price of the building (in dollars)

NumApts: Number of apartments in the building

Age: Age of the building in years

LotSize: Size of the lot on which the building is built (in square feet)

Parking: Number of parking spots

Condition: Condition of the building (F=Fair, G=Good, E=Excellent)

Read in the data:

MNSALES = read.csv('https://raw.githubusercontent.com/vittorioaddona/data/main/MNSALES.csv')

(a) Recall in the Topic 5 ICA, you fit the following linear regression models and then ranked them according to their R-squared values:

Model1: `Price` ~ `Age`

Model2: `Price` ~ `NumApts`

Model3: `Price` ~ `LotSize`

Model4: `Price` ~ `Parking`

Model5: `Price` ~ `Condition`

Fit a multiple linear regression model with `Price` as the response variable, and the two MOST predictive variables (on their own). Report the R-squared value and the Adjusted R-squared value from this multiple linear regression model.

Model1=lm(Price~Age,data=MNSALES)
Model2=lm(Price~NumApts,data=MNSALES)
Model3=lm(Price~LotSize,data=MNSALES) 
Model4=lm(Price~Parking,data=MNSALES)
Model5=lm(Price~Condition,data=MNSALES)
summary(Model1)

## 
## Call:
## lm(formula = Price ~ Age, data = MNSALES)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -184075 -147228  -42019   56090  676336 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 340068.9    99308.9   3.424  0.00232 **
## Age           -935.3     1692.2  -0.553  0.58579   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 214700 on 23 degrees of freedom
## Multiple R-squared:  0.01311,    Adjusted R-squared:  -0.0298 
## F-statistic: 0.3055 on 1 and 23 DF,  p-value: 0.5858

summary(Model2)

## 
## Call:
## lm(formula = Price ~ NumApts, data = MNSALES)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -114353  -53887  -21738   42961  254060 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   101786      23291    4.37 0.000224 ***
## NumApts        15525       1345   11.54 4.78e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 82910 on 23 degrees of freedom
## Multiple R-squared:  0.8528, Adjusted R-squared:  0.8464 
## F-statistic: 133.2 on 1 and 23 DF,  p-value: 4.782e-11

summary(Model3)

## 
## Call:
## lm(formula = Price ~ LotSize, data = MNSALES)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -251992  -89484  -12529   20816  415674 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -29070.745  66858.941  -0.435    0.668    
## LotSize         37.367      7.044   5.305  2.2e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 144900 on 23 degrees of freedom
## Multiple R-squared:  0.5503, Adjusted R-squared:  0.5307 
## F-statistic: 28.14 on 1 and 23 DF,  p-value: 2.195e-05

summary(Model4)

## 
## Call:
## lm(formula = Price ~ Parking, data = MNSALES)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -186977 -150926  -79277   33723  654799 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   266277      47487   5.607 1.05e-05 ***
## Parking         9642       8711   1.107     0.28    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 210500 on 23 degrees of freedom
## Multiple R-squared:  0.05057,    Adjusted R-squared:  0.009295 
## F-statistic: 1.225 on 1 and 23 DF,  p-value: 0.2798

summary(Model5)

## 
## Call:
## lm(formula = Price ~ Condition, data = MNSALES)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -215850 -140581  -78250   91060  644419 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   350250      86374   4.055 0.000527 ***
## ConditionF   -173310     128114  -1.353 0.189865    
## ConditionG    -44669     103237  -0.433 0.669458    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 211600 on 22 degrees of freedom
## Multiple R-squared:  0.08296,    Adjusted R-squared:  -0.0004118 
## F-statistic: 0.9951 on 2 and 22 DF,  p-value: 0.3857

Model6=lm(Price~LotSize+NumApts,data=MNSALES)
summary(Model6)

## 
## Call:
## lm(formula = Price ~ LotSize + NumApts, data = MNSALES)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -113953  -53014  -21042   40841  253055 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 9.930e+04  4.352e+04   2.282   0.0325 *  
## LotSize     4.681e-01  6.862e+00   0.068   0.9462    
## NumApts     1.540e+04  2.290e+03   6.724 9.29e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 84760 on 22 degrees of freedom
## Multiple R-squared:  0.8528, Adjusted R-squared:  0.8394 
## F-statistic: 63.73 on 2 and 22 DF,  p-value: 7.024e-10

Model7=lm(Price~NumApts+Parking,data=MNSALES)
summary(Model7)

## 
## Call:
## lm(formula = Price ~ NumApts + Parking, data = MNSALES)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -111199  -52429  -20383   44846  256229 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 100612.4    24352.2   4.132 0.000438 ***
## NumApts      15454.2     1409.6  10.964 2.21e-10 ***
## Parking        808.7     3594.5   0.225 0.824061    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 84670 on 22 degrees of freedom
## Multiple R-squared:  0.8531, Adjusted R-squared:  0.8398 
## F-statistic: 63.89 on 2 and 22 DF,  p-value: 6.865e-10

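The two most predictive variables on their own are NumApts (R^2 = 0.8528) and LotSize (R^2 = 0.5503). The R^2 value of the model Price ~ LotSize + NumApts is 0.8528 and the Adjusted R^2 value is 0.8394.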

(b) Based on the models you have fit thus far, the real estate appraiser wants to know which variable(s) to include in a regression model to predict apartment price. Suggest a model, and support your suggestion with numeric evidence.

I would suggest the model that uses only NumApts, because it has the highest Adjusted R^2 of the models fit so far (0.8464). Adding LotSize or Parking lowers the Adjusted R^2 (to 0.8394 and 0.8398, respectively), and neither added variable is significant.
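When several candidate models are being compared, the Adjusted R^2 values can be pulled out and ranked in one step instead of reading each summary. This is a minimal sketch on the built-in mtcars data. The list name candidates and the models in it are placeholders for illustration, not the MNSALES models.

```r
# Named list of candidate models (placeholders fit to built-in data)
candidates <- list(
  wt         = lm(mpg ~ wt, data = mtcars),
  wt_hp      = lm(mpg ~ wt + hp, data = mtcars),
  wt_hp_qsec = lm(mpg ~ wt + hp + qsec, data = mtcars)
)

# Extract Adjusted R^2 from each model's summary
adj <- sapply(candidates, function(m) summary(m)$adj.r.squared)

# Rank from highest to lowest
sort(adj, decreasing = TRUE)
```

The same pattern works for a list such as list(Model2 = Model2, Model6 = Model6, Model7 = Model7) built from the models fit above.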

(c) Challenge Question: Considering `NumApts`, `Age`, `LotSize`, `Parking`, and `Condition`, try fitting at least 3 different models with combinations of these explanatory variables (and potentially interactions between them), and report the model you found with the highest adjusted R-squared. Do you think this model is better or worse than a model with only number of apartments as a predictor? Explain why or why not (there are multiple, reasonable justifications).

Model8=lm(Price~NumApts+Age+LotSize+Parking+Condition,data=MNSALES)
Model9=lm(Price~NumApts+Age+LotSize+Parking+Condition+NumApts:Age,data=MNSALES)
Model10=lm(Price~NumApts+Age+LotSize+Parking+Condition+NumApts:LotSize,data=MNSALES)
Model11=lm(Price~NumApts+Age+LotSize+Parking+Condition+NumApts:Parking,data=MNSALES)
Model12=lm(Price~NumApts+Age+LotSize+Parking+Condition+NumApts:Condition,data=MNSALES)
Model13=lm(Price~NumApts+Age+LotSize+Parking+Condition+Age:LotSize,data=MNSALES)
Model14=lm(Price~NumApts+Age+LotSize+Parking+Condition+Age:Parking,data=MNSALES)
Model15=lm(Price~NumApts+Age+LotSize+Parking+Condition+Age:Condition,data=MNSALES)
Model16=lm(Price~NumApts+Age+LotSize+Parking+Condition+LotSize:Parking,data=MNSALES)
Model17=lm(Price~NumApts+Age+LotSize+Parking+Condition+LotSize:Condition,data=MNSALES)
Model18=lm(Price~NumApts+Age+LotSize+Parking+Condition+Parking:Condition,data=MNSALES)
summary(Model8)

## 
## Call:
## lm(formula = Price ~ NumApts + Age + LotSize + Parking + Condition, 
##     data = MNSALES)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -107880  -22277   -2703   20448  115524 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.278e+05  5.930e+04   3.841  0.00120 ** 
## NumApts      1.499e+04  1.890e+03   7.929 2.78e-07 ***
## Age         -8.631e+02  6.652e+02  -1.297  0.21085    
## LotSize      2.564e+00  5.622e+00   0.456  0.65381    
## Parking      1.091e+03  3.080e+03   0.354  0.72727    
## ConditionF  -1.455e+05  4.154e+04  -3.503  0.00254 ** 
## ConditionG  -1.239e+05  3.352e+04  -3.695  0.00166 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 63900 on 18 degrees of freedom
## Multiple R-squared:  0.9316, Adjusted R-squared:  0.9087 
## F-statistic: 40.83 on 6 and 18 DF,  p-value: 1.596e-09

summary(Model9)

## 
## Call:
## lm(formula = Price ~ NumApts + Age + LotSize + Parking + Condition + 
##     NumApts:Age, data = MNSALES)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -106139  -23010   -4229   18441  114299 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  2.039e+05  8.339e+04   2.445  0.02570 * 
## NumApts      1.835e+04  8.262e+03   2.221  0.04023 * 
## Age         -4.159e+02  1.267e+03  -0.328  0.74675   
## LotSize      7.399e-01  7.220e+00   0.102  0.91957   
## Parking      6.046e+02  3.361e+03   0.180  0.85936   
## ConditionF  -1.440e+05  4.267e+04  -3.376  0.00359 **
## ConditionG  -1.183e+05  3.683e+04  -3.211  0.00512 **
## NumApts:Age -4.257e+01  1.017e+02  -0.418  0.68085   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 65420 on 17 degrees of freedom
## Multiple R-squared:  0.9323, Adjusted R-squared:  0.9044 
## F-statistic: 33.42 on 7 and 17 DF,  p-value: 9.945e-09

summary(Model10)

## 
## Call:
## lm(formula = Price ~ NumApts + Age + LotSize + Parking + Condition + 
##     NumApts:LotSize, data = MNSALES)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -75029 -19297   4257  21513 113047 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      1.082e+05  5.456e+04   1.983 0.063741 .  
## NumApts          2.542e+04  3.076e+03   8.263 2.34e-07 ***
## Age             -5.189e+02  5.097e+02  -1.018 0.322952    
## LotSize          8.563e+00  4.521e+00   1.894 0.075367 .  
## Parking         -1.445e+03  2.416e+03  -0.598 0.557626    
## ConditionF      -1.341e+05  3.147e+04  -4.262 0.000527 ***
## ConditionG      -9.785e+04  2.618e+04  -3.737 0.001640 ** 
## NumApts:LotSize -6.033e-01  1.577e-01  -3.826 0.001351 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 48200 on 17 degrees of freedom
## Multiple R-squared:  0.9632, Adjusted R-squared:  0.9481 
## F-statistic: 63.61 on 7 and 17 DF,  p-value: 5.938e-11

summary(Model11)

## 
## Call:
## lm(formula = Price ~ NumApts + Age + LotSize + Parking + Condition + 
##     NumApts:Parking, data = MNSALES)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -107526  -32073   -1170   22191  118206 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      2.060e+05  6.455e+04   3.192  0.00534 ** 
## NumApts          1.611e+04  2.284e+03   7.052 1.95e-06 ***
## Age             -6.291e+02  7.197e+02  -0.874  0.39424    
## LotSize          2.662e+00  5.658e+00   0.471  0.64397    
## Parking          9.477e+03  9.976e+03   0.950  0.35541    
## ConditionF      -1.561e+05  4.346e+04  -3.591  0.00225 ** 
## ConditionG      -1.280e+05  3.405e+04  -3.759  0.00156 ** 
## NumApts:Parking -4.946e+02  5.593e+02  -0.884  0.38881    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 64290 on 17 degrees of freedom
## Multiple R-squared:  0.9346, Adjusted R-squared:  0.9076 
## F-statistic: 34.69 on 7 and 17 DF,  p-value: 7.443e-09

summary(Model12)

## 
## Call:
## lm(formula = Price ~ NumApts + Age + LotSize + Parking + Condition + 
##     NumApts:Condition, data = MNSALES)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -65789 -30754  -6469  17773 107164 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         86210.108  63268.778   1.363  0.19188    
## NumApts             23009.549   2797.008   8.226 3.85e-07 ***
## Age                  -419.812    558.993  -0.751  0.46355    
## LotSize                 7.159      4.723   1.516  0.14905    
## Parking               736.850   2491.671   0.296  0.77124    
## ConditionF         -62842.327  53660.761  -1.171  0.25870    
## ConditionG         -10844.423  42752.531  -0.254  0.80299    
## NumApts:ConditionF  -8792.554   4501.568  -1.953  0.06851 .  
## NumApts:ConditionG -10386.213   3030.625  -3.427  0.00346 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 51460 on 16 degrees of freedom
## Multiple R-squared:  0.9605, Adjusted R-squared:  0.9408 
## F-statistic: 48.68 on 8 and 16 DF,  p-value: 8.719e-10

summary(Model13)

## 
## Call:
## lm(formula = Price ~ NumApts + Age + LotSize + Parking + Condition + 
##     Age:LotSize, data = MNSALES)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -107779  -22421   -2830   20208  115640 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.259e+05  9.709e+04   2.327  0.03260 *  
## NumApts      1.503e+04  2.517e+03   5.970 1.52e-05 ***
## Age         -8.243e+02  1.696e+03  -0.486  0.63320    
## LotSize      2.681e+00  7.443e+00   0.360  0.72313    
## Parking      1.083e+03  3.186e+03   0.340  0.73810    
## ConditionF  -1.453e+05  4.320e+04  -3.365  0.00368 ** 
## ConditionG  -1.235e+05  3.774e+04  -3.272  0.00449 ** 
## Age:LotSize -4.338e-03  1.736e-01  -0.025  0.98035    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 65750 on 17 degrees of freedom
## Multiple R-squared:  0.9316, Adjusted R-squared:  0.9034 
## F-statistic: 33.06 on 7 and 17 DF,  p-value: 1.083e-08

summary(Model14)

## 
## Call:
## lm(formula = Price ~ NumApts + Age + LotSize + Parking + Condition + 
##     Age:Parking, data = MNSALES)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -106548  -21296   -1512   18055  117524 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.243e+05  6.072e+04   3.694  0.00180 ** 
## NumApts      1.449e+04  2.110e+03   6.865 2.74e-06 ***
## Age         -8.673e+02  6.778e+02  -1.279  0.21791    
## LotSize      3.328e+00  5.877e+00   0.566  0.57866    
## Parking     -1.588e+03  5.577e+03  -0.285  0.77931    
## ConditionF  -1.514e+05  4.353e+04  -3.479  0.00287 ** 
## ConditionG  -1.192e+05  3.508e+04  -3.398  0.00342 ** 
## Age:Parking  9.677e+01  1.665e+02   0.581  0.56880    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 65110 on 17 degrees of freedom
## Multiple R-squared:  0.9329, Adjusted R-squared:  0.9053 
## F-statistic: 33.76 on 7 and 17 DF,  p-value: 9.193e-09

summary(Model15)

## 
## Call:
## lm(formula = Price ~ NumApts + Age + LotSize + Parking + Condition + 
##     Age:Condition, data = MNSALES)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -103695  -17046   -2736   26859   92940 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     348457.79   76903.21   4.531 0.000341 ***
## NumApts          14534.66    1771.77   8.203    4e-07 ***
## Age              -2764.97    1050.86  -2.631 0.018154 *  
## LotSize              1.97       5.21   0.378 0.710338    
## Parking           3098.63    3108.04   0.997 0.333617    
## ConditionF     -294535.21  202265.76  -1.456 0.164686    
## ConditionG     -281001.92   76722.34  -3.663 0.002102 ** 
## Age:ConditionF    2326.21    2767.09   0.841 0.412921    
## Age:ConditionG    2893.13    1290.33   2.242 0.039473 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 59120 on 16 degrees of freedom
## Multiple R-squared:  0.9479, Adjusted R-squared:  0.9219 
## F-statistic:  36.4 on 8 and 16 DF,  p-value: 7.746e-09

summary(Model16)

## 
## Call:
## lm(formula = Price ~ NumApts + Age + LotSize + Parking + Condition + 
##     LotSize:Parking, data = MNSALES)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -104944  -33690   -2469   25325  119420 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      2.014e+05  6.726e+04   2.995  0.00814 ** 
## NumApts          1.494e+04  1.906e+03   7.838 4.81e-07 ***
## Age             -6.849e+02  7.020e+02  -0.976  0.34293    
## LotSize          4.476e+00  6.092e+00   0.735  0.47251    
## Parking          1.152e+04  1.260e+04   0.914  0.37355    
## ConditionF      -1.526e+05  4.267e+04  -3.576  0.00233 ** 
## ConditionG      -1.218e+05  3.386e+04  -3.597  0.00222 ** 
## LotSize:Parking -9.971e-01  1.168e+00  -0.854  0.40519    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 64390 on 17 degrees of freedom
## Multiple R-squared:  0.9344, Adjusted R-squared:  0.9073 
## F-statistic: 34.58 on 7 and 17 DF,  p-value: 7.633e-09

summary(Model17)

## 
## Call:
## lm(formula = Price ~ NumApts + Age + LotSize + Parking + Condition + 
##     LotSize:Condition, data = MNSALES)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -66028 -18671  -4288   6733  83667 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -126542.53   93132.36  -1.359 0.193077    
## NumApts              14689.91    1370.38  10.720 1.04e-08 ***
## Age                    130.08     538.67   0.241 0.812251    
## LotSize                 44.41      10.57   4.203 0.000674 ***
## Parking               2947.22    2269.23   1.299 0.212428    
## ConditionF          123797.73   99749.29   1.241 0.232462    
## ConditionG          209209.14   81549.05   2.565 0.020745 *  
## LotSize:ConditionF     -40.54      13.28  -3.053 0.007597 ** 
## LotSize:ConditionG     -44.23      10.32  -4.286 0.000567 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 46210 on 16 degrees of freedom
## Multiple R-squared:  0.9682, Adjusted R-squared:  0.9523 
## F-statistic: 60.85 on 8 and 16 DF,  p-value: 1.591e-10

summary(Model18)

## 
## Call:
## lm(formula = Price ~ NumApts + Age + LotSize + Parking + Condition + 
##     Parking:Condition, data = MNSALES)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -91067 -16649    333  18335 128353 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         1.980e+05  5.995e+04   3.303  0.00449 ** 
## NumApts             1.365e+04  1.984e+03   6.879  3.7e-06 ***
## Age                -8.674e+02  6.455e+02  -1.344  0.19780    
## LotSize             5.756e+00  5.748e+00   1.001  0.33154    
## Parking             2.237e+04  1.256e+04   1.781  0.09388 .  
## ConditionF         -1.301e+05  4.451e+04  -2.924  0.00994 ** 
## ConditionG         -1.015e+05  3.496e+04  -2.904  0.01035 *  
## Parking:ConditionF -2.037e+04  1.338e+04  -1.522  0.14752    
## Parking:ConditionG -2.262e+04  1.282e+04  -1.765  0.09660 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 62000 on 16 degrees of freedom
## Multiple R-squared:  0.9427, Adjusted R-squared:  0.9141 
## F-statistic: 32.93 on 8 and 16 DF,  p-value: 1.631e-08

anova(Model2,Model17)

## Analysis of Variance Table
## 
## Model 1: Price ~ NumApts
## Model 2: Price ~ NumApts + Age + LotSize + Parking + Condition + LotSize:Condition
##   Res.Df        RSS Df  Sum of Sq      F    Pr(>F)    
## 1     23 1.5809e+11                                   
## 2     16 3.4170e+10  7 1.2392e+11 8.2896 0.0002484 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The model with the highest adjusted R^2 is Model 17 (Price~NumApts+Age+LotSize+Parking+Condition+LotSize:Condition). I would say that this is a better predictor than using NumApts alone, because the adjusted R^2 is a lot bigger: about 0.95 (0.9523) for Model 17 versus about 0.83 for Model 2. This means that, even after the penalty for its extra terms, Model 17 accounts for roughly 12 percentage points more of the variation in Price than Model 2 does. To confirm this conclusion, I also ran an F-test with Model 2 as the null model and Model 17 as the alternative. The test returned F = 8.2896 with a p-value of 0.0002484, which is much less than 0.05, so we reject the null hypothesis that the seven additional coefficients are all zero.
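As a rough check on the F-test, the statistic can be rebuilt by hand from the numbers printed in the anova table above. This is only a sketch: the inputs are the rounded values from that table, so the result matches the reported F only to about three decimal places, and the variable names are made up for illustration.

```r
# Values copied from the anova(Model2, Model17) table above (rounded as printed)
rss_null <- 1.5809e11   # residual sum of squares, Price ~ NumApts
rss_alt  <- 3.4170e10   # residual sum of squares, Model 17
df_null  <- 23          # residual degrees of freedom, null model
df_alt   <- 16          # residual degrees of freedom, alternative model

# Partial F statistic: drop in RSS per added coefficient,
# relative to the residual variance of the larger model
df_diff <- df_null - df_alt
f_stat  <- ((rss_null - rss_alt) / df_diff) / (rss_alt / df_alt)

# Upper-tail probability under an F distribution with (df_diff, df_alt) df
p_value <- pf(f_stat, df1 = df_diff, df2 = df_alt, lower.tail = FALSE)

print(c(F = f_stat, p = p_value))
```

The numerator is the improvement in fit per extra coefficient and the denominator is the leftover noise in the bigger model, so a large ratio means the extra terms are doing real work.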

Question 5

The `high_peaks` data includes information on hiking trails in the 46 “high peaks” in the Adirondack mountains of northern New York state. Our goal will be to understand the variability in the time in hours that it takes to complete each hike. In doing so, we’ll separately consider five possible predictors for time:

elevation: highest elevation (feet)

difficulty: difficulty rating of the hike (scale from 1 to 10)

ascent: a hike’s vertical ascent (feet)

length: length (miles)

rating: difficulty rating, categorical (easy / moderate / difficult)

**time (response variable):** average time (hours) it takes to complete the hike

Read in the data:

peaks <- read.csv('https://raw.githubusercontent.com/vittorioaddona/data/main/high_peaks.csv')

(a) Consider the five predictor variables. If we were to include all five variables in a regression model, would you have any concerns about multi-collinearity? Explain why or why not, and support your claim with visualization(s) or numerical summaries.

Yes, I would, because variables such as difficulty and rating could very well be proxies for each other: both describe how hard the hike is.

Model20=lm(time~elevation+difficulty+ascent+length+rating,data=peaks)
summary(Model20)

## 
## Call:
## lm(formula = time ~ elevation + difficulty + ascent + length + 
##     rating, data = peaks)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5537 -0.7240 -0.1461  0.6671  2.1528 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    10.4594588  3.4389638   3.041 0.004196 ** 
## elevation      -0.0017805  0.0004783  -3.723 0.000621 ***
## difficulty      0.5282507  0.3509133   1.505 0.140288    
## ascent          0.0005698  0.0003002   1.898 0.065127 .  
## length          0.4041126  0.0739211   5.467 2.85e-06 ***
## ratingeasy     -2.0116202  1.2184442  -1.651 0.106775    
## ratingmoderate -1.9457713  0.6725960  -2.893 0.006215 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.052 on 39 degrees of freedom
## Multiple R-squared:  0.8772, Adjusted R-squared:  0.8583 
## F-statistic: 46.43 on 6 and 39 DF,  p-value: 2.984e-16

Model21=lm(time~elevation+ascent+length+rating,data=peaks)
summary(Model21)

## 
## Call:
## lm(formula = time ~ elevation + ascent + length + rating, data = peaks)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5202 -0.6913 -0.1204  0.7829  2.1705 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    14.2662182  2.3671816   6.027 4.34e-07 ***
## elevation      -0.0019257  0.0004758  -4.047 0.000231 ***
## ascent          0.0005245  0.0003034   1.729 0.091605 .  
## length          0.4456704  0.0696495   6.399 1.30e-07 ***
## ratingeasy     -3.4825497  0.7393221  -4.710 2.96e-05 ***
## ratingmoderate -2.5996948  0.5215660  -4.984 1.24e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.068 on 40 degrees of freedom
## Multiple R-squared:  0.8701, Adjusted R-squared:  0.8538 
## F-statistic: 53.57 on 5 and 40 DF,  p-value: < 2.2e-16

Model22=lm(time~elevation+difficulty+ascent+length,data=peaks)
summary(Model22)

## 
## Call:
## lm(formula = time ~ elevation + difficulty + ascent + length, 
##     data = peaks)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.77942 -0.81216 -0.08647  0.68962  3.06736 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.9567864  2.2307630   2.670  0.01082 *  
## elevation   -0.0016703  0.0005183  -3.223  0.00249 ** 
## difficulty   0.8654527  0.2285275   3.787  0.00049 ***
## ascent       0.0006011  0.0003310   1.816  0.07669 .  
## length       0.4440084  0.0812523   5.465 2.49e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.171 on 41 degrees of freedom
## Multiple R-squared:  0.8401, Adjusted R-squared:  0.8245 
## F-statistic: 53.84 on 4 and 41 DF,  p-value: 8.738e-16

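When both rating and difficulty are in the same model (Model20), difficulty is not significant (p = 0.140) and neither is ratingeasy (p = 0.107); only ratingmoderate stays below 0.05 (p = 0.006). When only one of the two variables is included, it becomes highly significant: difficulty has p = 0.00049 in Model22, and both rating coefficients have p < 0.0001 in Model21. The standard errors also grow when both are present (for difficulty, 0.351 in Model20 versus 0.229 in Model22). As such, adding either rating or difficulty to a model that already has the other is largely redundant, because they are proxies for each other.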
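One way to put a number on this redundancy is to regress one predictor on the others and look at the resulting R^2, or equivalently the variance inflation factor 1 / (1 - R^2). The helper below is a hypothetical sketch, not part of the activity; `vif_for` and its argument names are invented for illustration.

```r
# Hypothetical helper: how much of one predictor is explained by the others?
# A high R^2 (equivalently a large VIF = 1 / (1 - R^2)) signals collinearity.
vif_for <- function(data, target, others) {
  # Regress the target predictor on the remaining predictors
  fit <- lm(reformulate(others, response = target), data = data)
  r2  <- summary(fit)$r.squared
  c(r.squared = r2, vif = 1 / (1 - r2))
}

# Intended usage with the data read in above (not run here):
# vif_for(peaks, "difficulty", c("rating", "elevation", "ascent", "length"))
# boxplot(difficulty ~ rating, data = peaks)
```

A boxplot of difficulty by rating, as in the commented line, would give the visual version of the same check.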

(b) Fit a model predicting `time` using `rating`. Interpret each of the coefficients from this model in the context of the problem, and make a graph comparing the two variables to go alongside your model.

lm(time~rating,data=peaks)

## 
## Call:
## lm(formula = time ~ rating, data = peaks)
## 
## Coefficients:
##    (Intercept)      ratingeasy  ratingmoderate  
##         15.000          -7.000          -4.556

ggplot(data=peaks,aes(x=rating,y=time))+geom_point()

The intercept tells us that the average time to complete a difficult hike (the reference group) is about 15 hours. The ratingeasy coefficient tells us that the average easy hike takes about 7 hours less than the average difficult hike (about 8 hours). The ratingmoderate coefficient tells us that the average moderate hike takes about 4.56 hours less than the average difficult hike (about 10.44 hours).
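Because rating is the only predictor, each coefficient is a difference in group means from the reference group, so the three group averages can be recovered by simple addition. The sketch below does that arithmetic with the coefficients printed above; the vector names are made up for illustration.

```r
# Coefficients copied from the lm(time ~ rating) output above
coefs <- c("(Intercept)" = 15.000, ratingeasy = -7.000, ratingmoderate = -4.556)

# Reference group mean is the intercept; other groups add their offset
group_means <- c(
  difficult = coefs[["(Intercept)"]],
  easy      = coefs[["(Intercept)"]] + coefs[["ratingeasy"]],
  moderate  = coefs[["(Intercept)"]] + coefs[["ratingmoderate"]]
)

print(group_means)
```

These should agree, up to rounding, with the averages of time computed separately within each rating group.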

(c) Report and interpret the R-squared value from your model in (b).

Model25=lm(time~rating,data=peaks)
summary(Model25)

## 
## Call:
## lm(formula = time ~ rating, data = peaks)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0000 -1.0000 -0.2222  0.8889  4.0000 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     15.0000     0.5947  25.222  < 2e-16 ***
## ratingeasy      -7.0000     0.7816  -8.956 2.20e-11 ***
## ratingmoderate  -4.5556     0.6771  -6.728 3.19e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.682 on 43 degrees of freedom
## Multiple R-squared:  0.6538, Adjusted R-squared:  0.6377 
## F-statistic:  40.6 on 2 and 43 DF,  p-value: 1.246e-10

The R^2 value from this model is 0.6538. This means that about 65% of the variability in hike completion time is explained by the hike's rating.
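To see where an R^2 value comes from, it can be computed directly as 1 minus the ratio of unexplained to total variation. The sketch below uses R's built-in `mtcars` data purely as a stand-in so that it runs without a download; `r_squared_by_hand` is a hypothetical helper name.

```r
# Hypothetical helper: R^2 computed directly from its definition
r_squared_by_hand <- function(fit) {
  y   <- model.response(model.frame(fit))  # observed responses
  sse <- sum(residuals(fit)^2)             # variation left unexplained
  sst <- sum((y - mean(y))^2)              # total variation in the response
  1 - sse / sst
}

# Stand-in example: built-in data with a categorical predictor
fit_demo <- lm(mpg ~ factor(cyl), data = mtcars)
r2_demo  <- r_squared_by_hand(fit_demo)

# Should agree with the value that summary() reports
all.equal(r2_demo, summary(fit_demo)$r.squared)
```

Calling `r_squared_by_hand(Model25)` on the model above should give the same 0.6538 that `summary(Model25)` prints.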

(d) Now fit a multiple linear regression model predicting `time` using `rating`, `ascent`, and `length`. Interpret the coefficients corresponding to `rating` from this model in the context of the problem. Report and interpret the R-squared value from your model.

Model26=lm(time~rating+ascent+length,data=peaks)
summary(Model26)

## 
## Call:
## lm(formula = time ~ rating + ascent + length, data = peaks)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3516 -0.6341 -0.0605  0.6308  2.5716 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     6.5106514  1.6298374   3.995 0.000263 ***
## ratingeasy     -3.1685224  0.8621911  -3.675 0.000683 ***
## ratingmoderate -2.4767827  0.6105856  -4.056 0.000218 ***
## ascent          0.0001875  0.0003422   0.548 0.586697    
## length          0.4590819  0.0815831   5.627 1.47e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.253 on 41 degrees of freedom
## Multiple R-squared:  0.8168, Adjusted R-squared:  0.799 
## F-statistic: 45.71 on 4 and 41 DF,  p-value: 1.37e-14

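Holding ascent and length fixed, an easy hike is predicted to take about 3.17 hours less than a difficult hike, and a moderate hike about 2.48 hours less. The R^2 value for this model is 0.8168, which means that rating, ascent, and length together explain about 82% of the variability in completion time.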
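The "holding other variables fixed" reading of the rating coefficients can be illustrated by predicting two hikes that differ only in rating. In the sketch below the coefficients come from the Model26 output above, while the hike itself (3000 feet of ascent, 10 miles) is a made-up example and `predict_time` is a hypothetical helper.

```r
# Coefficients copied from the summary(Model26) output above
b <- c(intercept = 6.5106514, easy = -3.1685224, moderate = -2.4767827,
       ascent = 0.0001875, length = 0.4590819)

# Hypothetical helper: fitted time for a hike with the given characteristics
predict_time <- function(rating, ascent_ft, length_mi) {
  # "difficult" is the reference level, so it adds nothing
  rating_shift <- switch(rating,
                         difficult = 0,
                         easy      = b[["easy"]],
                         moderate  = b[["moderate"]])
  b[["intercept"]] + rating_shift +
    b[["ascent"]] * ascent_ft + b[["length"]] * length_mi
}

# Two made-up hikes, identical except for rating
moderate_hike  <- predict_time("moderate",  ascent_ft = 3000, length_mi = 10)
difficult_hike <- predict_time("difficult", ascent_ft = 3000, length_mi = 10)

print(c(moderate = moderate_hike, difficult = difficult_hike,
        gap = moderate_hike - difficult_hike))
```

Whatever ascent and length are plugged in, the gap between the two predictions stays equal to the ratingmoderate coefficient, which is exactly what the interpretation says.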

(e) Compare your results from parts (c) and (d). How do your estimates change?

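When we added more variables, the R^2 value went up, from 0.6538 to 0.8168. The rating coefficients also shrank toward zero: ratingeasy went from -7.00 to -3.17 and ratingmoderate from -4.56 to -2.48. Part of the gap between rating groups seen in (c) is therefore accounted for by the added variables, mostly length, since ascent is not significant (p = 0.587).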

(f) Based on your answers to the above questions, which model best answers our scientific question? Support your claim with numeric evidence, problem context, etc.

I would say that the best model is Model 26 (time~rating+ascent+length). Compared with Model 25, it has the higher adjusted R^2 (0.799 versus 0.6377) and the smaller overall p-value (1.37e-14 versus 1.246e-10). Since our question is how the time to complete a hike varies, we need to take into account the vertical ascent of the trail, the trail's overall length, and the difficulty of the trail.

(g) If your adjusted R-squared value was lower for the multiple linear regression model than the simple linear regression model, would your answer to part (f) change? Why or why not?

Yes. If the adjusted R^2 value were lower for the multiple linear regression model (Model 26), I would change my answer to the simple linear regression model (Model 25). Adjusted R^2 penalizes each extra predictor, so a lower value would mean that ascent and length do not explain enough additional variation to justify making the model more complex.
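The penalty can be made concrete with the usual adjusted R^2 formula, 1 - (1 - R^2)(n - 1)/(n - p - 1), where p counts the non-intercept coefficients. The sketch below assumes n = 46 hikes (residual df of 43 plus 3 estimated coefficients in Model25) and uses the rounded R^2 values printed above; `adj_r2` is a hypothetical helper.

```r
# Hypothetical helper: adjusted R^2 from R^2, sample size n,
# and number of non-intercept coefficients p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n <- 46  # assumed: residual df 43 + 3 estimated coefficients in Model25

adj_25 <- adj_r2(0.6538, n, p = 2)  # time ~ rating
adj_26 <- adj_r2(0.8168, n, p = 4)  # time ~ rating + ascent + length

# Smallest R^2 at which a 4-coefficient model would tie Model25
# on adjusted R^2 (solve the formula above for r2)
r2_needed <- 1 - (1 - adj_25) * (n - 4 - 1) / (n - 1)

print(round(c(adj_25 = adj_25, adj_26 = adj_26, r2_needed = r2_needed), 4))
```

Under these assumptions, adding ascent and length would have had to lift R^2 only from 0.6538 to about 0.67 to break even; the actual jump to 0.8168 clears that bar easily, which is why the scenario in this question is hypothetical.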
