library(ggplot2)
library(dplyr)
USCrime = read.csv( 'https://raw.githubusercontent.com/vittorioaddona/data/main/USCrime.csv' )
Yes, they are.
model1=lm(formula=CrimeRate~Expen60+Poverty,data=USCrime)
model2=lm(formula=CrimeRate~Expen60+Poverty+Unem+PopSize,data=USCrime)
summary(model1)
##
## Call:
## lm(formula = CrimeRate ~ Expen60 + Poverty, data = USCrime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -70.387 -12.181 -1.456 15.455 50.645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -94.4662 34.3947 -2.747 0.00869 **
## Expen60 1.2415 0.1637 7.582 1.62e-09 ***
## Poverty 0.4095 0.1220 3.357 0.00163 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25.62 on 44 degrees of freedom
## Multiple R-squared: 0.5803, Adjusted R-squared: 0.5612
## F-statistic: 30.42 on 2 and 44 DF, p-value: 5.061e-09
summary(model2)
##
## Call:
## lm(formula = CrimeRate ~ Expen60 + Poverty + Unem + PopSize,
## data = USCrime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -58.190 -12.049 -0.687 16.291 50.036
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -114.77119 38.21229 -3.004 0.004484 **
## Expen60 1.40179 0.20239 6.926 1.85e-08 ***
## Poverty 0.46328 0.12923 3.585 0.000871 ***
## Unem 0.07991 0.46835 0.171 0.865348
## PopSize -0.17653 0.12444 -1.419 0.163389
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25.61 on 42 degrees of freedom
## Multiple R-squared: 0.5995, Adjusted R-squared: 0.5614
## F-statistic: 15.72 on 4 and 42 DF, p-value: 6.122e-08
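The better model by adjusted R-squared is model2, CrimeRate ~ Expen60 + Poverty + Unem + PopSize, though only barely: its adjusted R-squared is 0.5614 versus 0.5612 for model1.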
anova(model1,model2)
## Analysis of Variance Table
##
## Model 1: CrimeRate ~ Expen60 + Poverty
## Model 2: CrimeRate ~ Expen60 + Poverty + Unem + PopSize
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 44 28878
## 2 42 27555 2 1322.9 1.0082 0.3735
In this case, the null hypothesis is that the coefficients on Unem and
PopSize are both zero, so that model1 is adequate; the alternative is that
at least one of them is nonzero. The test statistic is F = 1.0082 on 2 and
42 degrees of freedom, and the p-value is 0.3735. Because the p-value is
larger than 0.05, we fail to reject the null hypothesis: an improvement in
fit this small is quite plausible when Unem and PopSize contribute nothing.
There is no evidence that they add anything to a model that already
contains Expen60 and Poverty.
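As a check on the anova table, the F statistic can be recomputed from the two residual sums of squares. This is a minimal sketch: the helper name `nested_f` is an illustrative assumption rather than part of the original analysis, and the inputs are the rounded values printed by `anova()`, so the result may differ from the table in the last digit.

```r
# Hypothetical helper: F test for two nested linear models, computed from
# their residual sums of squares (RSS) and residual degrees of freedom.
nested_f <- function(rss_small, df_small, rss_big, df_big) {
  q <- df_small - df_big                        # number of added coefficients
  f <- ((rss_small - rss_big) / q) / (rss_big / df_big)
  p <- pf(f, q, df_big, lower.tail = FALSE)     # upper-tail p-value
  c(F = f, p_value = p)
}

# Rounded values copied from the anova(model1, model2) table
res <- nested_f(rss_small = 28878, df_small = 44, rss_big = 27555, df_big = 42)
print(round(res, 4))
```

The same helper applies to any pair of nested models for which the RSS and residual degrees of freedom are known.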
The F-test is much stricter about adding variables than adjusted R^2: adjusted R^2 favors model2 (by a hair), but the F-test finds no significant improvement.
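The difference in strictness can be made concrete. Adjusted R^2 goes up whenever the F statistic for the added terms exceeds 1, whereas the F-test at the 0.05 level demands a much larger value. The sketch below recomputes adjusted R^2 from the printed R^2 values; the function name `adj_r2` is an assumption made for illustration, and n = 47 is inferred from the 44 residual degrees of freedom of model1.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# R-squared values printed in the two summaries; n = 47 is inferred from
# model1 (44 residual df + 2 predictors + 1 intercept)
a1 <- adj_r2(0.5803, n = 47, p = 2)
a2 <- adj_r2(0.5995, n = 47, p = 4)
print(round(c(model1 = a1, model2 = a2), 4))

# Critical F value the nested test would need at the 0.05 level (2 and 42 df)
print(qf(0.95, df1 = 2, df2 = 42))
```

Here the observed F of 1.0082 is just above 1, which is why adjusted R^2 rises only in the fourth decimal place while the F-test is nowhere near significance.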
MacGrades.csv contains a sub-sample (to help preserve
anonymity) of every grade assigned to a former Macalester graduating
class. For each of the 6146 rows of data, the following information is
provided (with a few missing values):
MacGrades = read.csv( 'https://raw.githubusercontent.com/vittorioaddona/data/main/MacGrades.csv' )
grade that uses
level. Although level is recorded as a number,
it is really a categorical variable. To treat it as such in your model,
type:
m = lm( grade ~ factor(level) , data=MacGrades )
summary(m)
##
## Call:
## lm(formula = grade ~ factor(level), data = MacGrades)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4776 -0.3492 0.2089 0.5224 0.6508
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.34924 0.01208 277.166 < 2e-16 ***
## factor(level)200 0.11183 0.01995 5.606 2.17e-08 ***
## factor(level)300 0.09078 0.01949 4.659 3.25e-06 ***
## factor(level)400 0.12835 0.03168 4.052 5.15e-05 ***
## factor(level)600 0.63339 0.13624 4.649 3.41e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5915 on 5704 degrees of freedom
## (437 observations deleted due to missingness)
## Multiple R-squared: 0.01085, Adjusted R-squared: 0.01016
## F-statistic: 15.65 on 4 and 5704 DF, p-value: 9.713e-13
m2=lm(grade~1,data=MacGrades)
anova(m2,m)
## Analysis of Variance Table
##
## Model 1: grade ~ 1
## Model 2: grade ~ factor(level)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 5708 2017.5
## 2 5704 1995.6 4 21.898 15.648 9.713e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The null hypothesis is that level has no effect on grade, meaning the mean grade is the same at every level; the alternative hypothesis is that at least one level has a different mean grade. The F test statistic is 15.648 and the p-value is 9.713e-13. Because the p-value is much less than 0.05, we reject the null hypothesis and conclude that level is associated with grade.
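To see what treating level as categorical changes, the sketch below fits both versions on simulated stand-in data (not the MacGrades data; the data frame `toy` and its values are made up for illustration). With `factor(level)` the model estimates one offset per non-reference level, which is why the test above has 4 numerator degrees of freedom.

```r
# Simulated stand-in data (NOT MacGrades): 20 grades at each course level
set.seed(1)
toy <- data.frame(level = rep(c(100, 200, 300, 400, 600), each = 20))
toy$grade <- 3.3 + rnorm(nrow(toy), sd = 0.5)

m_numeric <- lm(grade ~ level, data = toy)          # a single slope for level
m_factor  <- lm(grade ~ factor(level), data = toy)  # one offset per non-reference level

length(coef(m_numeric))  # intercept + 1 slope
length(coef(m_factor))   # intercept + 4 level offsets
```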
enroll associated with grade? What
are the hypotheses, test statistic, and p-value?
EG=lm(grade~enroll,data=MacGrades)
anova(m2,EG)
## Analysis of Variance Table
##
## Model 1: grade ~ 1
## Model 2: grade ~ enroll
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 5708 2017.5
## 2 5707 2009.8 1 7.7436 21.989 2.806e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The null hypothesis is that enroll has no effect on grade; the alternative hypothesis is that enroll does have an effect on grade. The F statistic is 21.989 and the p-value is 2.806e-06. Because the p-value is so small, we reject the null hypothesis and conclude that enroll is associated with grade.
enroll associated with grade after
we control for level? What are the hypotheses, test
statistic, and p-value?
EGL=lm(grade~enroll+level,data=MacGrades)
anova(m2,EGL)
## Analysis of Variance Table
##
## Model 1: grade ~ 1
## Model 2: grade ~ enroll + level
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 5708 2017.5
## 2 5706 1999.7 2 17.867 25.492 9.512e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
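The null hypothesis is that, when controlling for level, enroll has no effect on grade. The alternative hypothesis is that, when controlling for level, enroll does have an effect on grade. The table above, however, compares grade ~ enroll + level with the intercept-only model, so its F statistic of 25.492 (p-value 9.512e-12) tests enroll and level jointly. Isolating enroll requires comparing grade ~ level with grade ~ enroll + level. Note also that level enters this model as a number rather than as factor(level).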
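A test of one variable after controlling for another compares two nested models that differ only in the variable of interest. The sketch below shows the pattern on simulated stand-in data (the data frame `toy` and its coefficients are invented for illustration and are not the MacGrades data).

```r
# Simulated stand-in data (NOT MacGrades)
set.seed(2)
n <- 200
toy <- data.frame(
  level  = sample(c(100, 200, 300, 400), n, replace = TRUE),
  enroll = sample(5:40, n, replace = TRUE)
)
toy$grade <- 3.3 + 0.0003 * toy$level - 0.005 * toy$enroll + rnorm(n, sd = 0.5)

reduced <- lm(grade ~ level, data = toy)           # control variable only
full    <- lm(grade ~ level + enroll, data = toy)  # control + variable of interest

# One numerator degree of freedom: the test isolates enroll
tab <- anova(reduced, full)
print(tab)
```

Because only one coefficient is added, this F statistic equals the square of the t value for enroll in `summary(full)`.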
level associated with grade after
controlling for enroll?
EG=lm(grade~level+enroll,data=MacGrades)
anova(m2,EG)
## Analysis of Variance Table
##
## Model 1: grade ~ 1
## Model 2: grade ~ level + enroll
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 5708 2017.5
## 2 5706 1999.7 2 17.867 25.492 9.512e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
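Yes. The table above again tests level and enroll jointly against the intercept-only model, but the earlier output allows the nested comparison: grade ~ enroll has RSS 2009.8 on 5707 degrees of freedom and grade ~ level + enroll has RSS 1999.7 on 5706, so F is about (2009.8 - 1999.7) / (1999.7 / 5706), or roughly 29 on 1 and 5706 degrees of freedom. That is far beyond the 0.05 cutoff, so level (entered as a number) is associated with grade after controlling for enroll.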
grade that uses dept.
Is dept significantly associated with grade?
How can you tell?
DG=lm(grade~dept,data=MacGrades)
summary(DG)
##
## Call:
## lm(formula = grade ~ dept, data = MacGrades)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5268 -0.2749 0.1475 0.4515 0.9039
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.5000000 0.4088676 8.560 <2e-16 ***
## deptb -0.2536066 0.4155163 -0.610 0.542
## deptB -0.2614286 0.4278948 -0.611 0.541
## deptC 0.0268398 0.4106338 0.065 0.948
## deptd -0.1825995 0.4098240 -0.446 0.656
## deptD 0.0617623 0.4105399 0.150 0.880
## depte -0.0162500 0.4122607 -0.039 0.969
## deptE 0.1391667 0.4416275 0.315 0.753
## deptF -0.2251373 0.4104679 -0.548 0.583
## deptg 0.0956522 0.4176615 0.229 0.819
## deptG -0.3303306 0.4105537 -0.805 0.421
## deptH -0.0307812 0.4120495 -0.075 0.940
## depti -0.1474011 0.4111711 -0.358 0.720
## deptI 0.0011538 0.4243020 0.003 0.998
## deptj -0.0625248 0.4108867 -0.152 0.879
## deptJ -0.2815172 0.4116777 -0.684 0.494
## deptk 0.0124521 0.4104311 0.030 0.976
## deptK -0.2431624 0.4123474 -0.590 0.555
## deptL 0.0468000 0.4169648 0.112 0.911
## deptm 0.0224798 0.4099682 0.055 0.956
## deptM -0.4039440 0.4099067 -0.985 0.324
## deptn 0.1487363 0.4111080 0.362 0.718
## deptN 0.1670000 0.4139469 0.403 0.687
## depto -0.3616667 0.4255629 -0.850 0.395
## deptO 0.0066856 0.4100242 0.016 0.987
## deptp -0.0139370 0.4120744 -0.034 0.973
## deptP -0.1140000 0.4249077 -0.268 0.788
## deptq 0.0001132 0.4104076 0.000 1.000
## deptQ -0.0644094 0.4120744 -0.156 0.876
## deptR -0.0381579 0.4110139 -0.093 0.926
## depts -0.0154839 0.4218507 -0.037 0.971
## deptS -0.0247500 0.4139469 -0.060 0.952
## deptt -0.0126923 0.4166562 -0.030 0.976
## deptT -0.2733962 0.4165106 -0.656 0.512
## deptU -0.1136842 0.4298486 -0.264 0.791
## deptV -0.0242500 0.4189646 -0.058 0.954
## deptW -0.0507164 0.4100863 -0.124 0.902
## deptX -0.1189865 0.4116209 -0.289 0.773
## deptY 0.0814894 0.4174763 0.195 0.845
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5782 on 5670 degrees of freedom
## (437 observations deleted due to missingness)
## Multiple R-squared: 0.06037, Adjusted R-squared: 0.05407
## F-statistic: 9.586 on 38 and 5670 DF, p-value: < 2.2e-16
anova(m2,DG)
## Analysis of Variance Table
##
## Model 1: grade ~ 1
## Model 2: grade ~ dept
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 5708 2017.5
## 2 5670 1895.7 38 121.79 9.5862 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Yes, dept is associated with grade. The null hypothesis is that dept has no effect on grade, meaning the mean grade is the same in every department. The F statistic is 9.5862 and the p-value is less than 2.2e-16, so we reject the null hypothesis and conclude that department is associated with grade.
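None of the individual department coefficients is significant. Each of those p-values tests whether one department's mean grade differs from that of the reference department (the intercept), whose own mean is estimated with a large standard error (about 0.41). This does not contradict the F-test, because the F-test asks a different question: whether there is any difference in mean grade among the departments as a whole. The individual p-values, which depend on the choice of reference department, do not answer that question.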
FUELCON.csv has the following variables for all 50
states and DC:
FUELCON = read.csv( 'https://raw.githubusercontent.com/vittorioaddona/data/main/FUELCON.csv' )
FUEL response variable that
uses the other 4 explanatory variables simultaneously. Test whether this
model fits better than the constant only model. State the hypotheses,
test statistic, and p-value.
A=lm(formula=FUEL~1,data=FUELCON)
f=lm(formula=FUEL~DRIVERS+HWYMILES+GASTAX+INCOME,data=FUELCON)
summary(A)
##
## Call:
## lm(formula = FUEL ~ 1, data = FUELCON)
##
## Residuals:
## Min 1Q Median 3Q Max
## -196.466 -38.736 6.514 45.999 229.094
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 486.46 10.14 47.99 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 72.4 on 50 degrees of freedom
anova(A,f)
## Analysis of Variance Table
##
## Model 1: FUEL ~ 1
## Model 2: FUEL ~ DRIVERS + HWYMILES + GASTAX + INCOME
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 50 262054
## 2 46 145705 4 116349 9.183 1.537e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
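The null hypothesis is that the coefficients on all four explanatory
variables are zero, so none of them is associated with FUEL. The alternative
hypothesis is that at least one coefficient is nonzero. The test statistic
is F = 9.183 on 4 and 46 degrees of freedom, and the p-value is 1.537e-05,
so we reject the null hypothesis: the full model fits better than the
constant-only model.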
GASTAX and
INCOME improve a model for FUEL which already
contains DRIVERS? Answer this question by performing a
formal test of hypotheses: state the hypotheses, test statistic, and
p-value.
B=lm(formula=FUEL~DRIVERS,data=FUELCON)
C=lm(formula=FUEL~DRIVERS+GASTAX+INCOME,data=FUELCON)
summary(B)
##
## Call:
## lm(formula = FUEL ~ DRIVERS, data = FUELCON)
##
## Residuals:
## Min 1Q Median 3Q Max
## -130.910 -47.129 -1.325 38.225 177.391
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 726.51 55.90 12.997 < 2e-16 ***
## DRIVERS -281.12 64.66 -4.347 6.94e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 62.12 on 49 degrees of freedom
## Multiple R-squared: 0.2784, Adjusted R-squared: 0.2636
## F-statistic: 18.9 on 1 and 49 DF, p-value: 6.941e-05
summary(C)
##
## Call:
## lm(formula = FUEL ~ DRIVERS + GASTAX + INCOME, data = FUELCON)
##
## Residuals:
## Min 1Q Median 3Q Max
## -153.90 -36.35 0.58 36.53 165.78
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.992e+02 6.963e+01 12.914 < 2e-16 ***
## DRIVERS -2.129e+02 6.149e+01 -3.463 0.00115 **
## GASTAX -3.531e+00 1.753e+00 -2.015 0.04968 *
## INCOME -5.463e-03 1.759e-03 -3.106 0.00321 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 56.04 on 47 degrees of freedom
## Multiple R-squared: 0.4368, Adjusted R-squared: 0.4008
## F-statistic: 12.15 on 3 and 47 DF, p-value: 5.208e-06
anova(B,C)
## Analysis of Variance Table
##
## Model 1: FUEL ~ DRIVERS
## Model 2: FUEL ~ DRIVERS + GASTAX + INCOME
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 49 189111
## 2 47 147590 2 41521 6.6112 0.002951 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The null hypothesis is that the coefficients on GASTAX and INCOME are both zero in a model for FUEL that already contains DRIVERS. The alternative hypothesis is that at least one of them is nonzero. The test statistic is F = 6.6112 and the p-value is 0.002951. Given these results, we reject the null hypothesis and conclude that GASTAX and INCOME improve the FUEL ~ DRIVERS model.
\[FUEL ~\sim~ DRIVERS ~~~~ \text{vs.}~~~~ FUEL ~\sim~ GASTAX + INCOME + HWYMILES\]
FD=lm(FUEL~DRIVERS,data=FUELCON)
FGIH=lm(FUEL~GASTAX+INCOME+HWYMILES,data=FUELCON)
summary(FD)
##
## Call:
## lm(formula = FUEL ~ DRIVERS, data = FUELCON)
##
## Residuals:
## Min 1Q Median 3Q Max
## -130.910 -47.129 -1.325 38.225 177.391
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 726.51 55.90 12.997 < 2e-16 ***
## DRIVERS -281.12 64.66 -4.347 6.94e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 62.12 on 49 degrees of freedom
## Multiple R-squared: 0.2784, Adjusted R-squared: 0.2636
## F-statistic: 18.9 on 1 and 49 DF, p-value: 6.941e-05
summary(FGIH)
##
## Call:
## lm(formula = FUEL ~ GASTAX + INCOME + HWYMILES, data = FUELCON)
##
## Residuals:
## Min 1Q Median 3Q Max
## -175.869 -32.628 2.982 30.946 198.741
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.907e+02 7.128e+01 11.093 1.01e-14 ***
## GASTAX -4.220e+00 1.967e+00 -2.145 0.03714 *
## INCOME -7.353e-03 1.878e-03 -3.916 0.00029 ***
## HWYMILES -3.866e-04 1.113e-03 -0.347 0.72997
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 62.7 on 47 degrees of freedom
## Multiple R-squared: 0.2949, Adjusted R-squared: 0.2499
## F-statistic: 6.553 on 3 and 47 DF, p-value: 0.000857
anova(FGIH,FD)
## Analysis of Variance Table
##
## Model 1: FUEL ~ GASTAX + INCOME + HWYMILES
## Model 2: FUEL ~ DRIVERS
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 47 184766
## 2 49 189111 -2 -4345.1 0.5526 0.5791
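I would choose the FD model (FUEL ~ DRIVERS) because its adjusted R^2
value is larger (0.2636 versus 0.2499) even though it uses fewer variables.
The F-test above should not be used to support this choice: the two models
are not nested, so the anova comparison is not a valid test of one model
against the other.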
MNSALES = read.csv('https://raw.githubusercontent.com/vittorioaddona/data/main/MNSALES.csv')
Price ~ Age
Price ~ NumApts
Price ~ LotSize
Price ~ Parking
Price ~ Condition
Price as
the response variable, and the two MOST predictive variables (on their
own). Report the R-squared value and the Adjusted R-squared value from
this multiple linear regression model.
Model1=lm(Price~Age,data=MNSALES)
Model2=lm(Price~NumApts,data=MNSALES)
Model3=lm(Price~LotSize,data=MNSALES)
Model4=lm(Price~Parking,data=MNSALES)
Model5=lm(Price~Condition,data=MNSALES)
summary(Model1)
##
## Call:
## lm(formula = Price ~ Age, data = MNSALES)
##
## Residuals:
## Min 1Q Median 3Q Max
## -184075 -147228 -42019 56090 676336
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 340068.9 99308.9 3.424 0.00232 **
## Age -935.3 1692.2 -0.553 0.58579
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 214700 on 23 degrees of freedom
## Multiple R-squared: 0.01311, Adjusted R-squared: -0.0298
## F-statistic: 0.3055 on 1 and 23 DF, p-value: 0.5858
summary(Model2)
##
## Call:
## lm(formula = Price ~ NumApts, data = MNSALES)
##
## Residuals:
## Min 1Q Median 3Q Max
## -114353 -53887 -21738 42961 254060
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 101786 23291 4.37 0.000224 ***
## NumApts 15525 1345 11.54 4.78e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 82910 on 23 degrees of freedom
## Multiple R-squared: 0.8528, Adjusted R-squared: 0.8464
## F-statistic: 133.2 on 1 and 23 DF, p-value: 4.782e-11
summary(Model3)
##
## Call:
## lm(formula = Price ~ LotSize, data = MNSALES)
##
## Residuals:
## Min 1Q Median 3Q Max
## -251992 -89484 -12529 20816 415674
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -29070.745 66858.941 -0.435 0.668
## LotSize 37.367 7.044 5.305 2.2e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 144900 on 23 degrees of freedom
## Multiple R-squared: 0.5503, Adjusted R-squared: 0.5307
## F-statistic: 28.14 on 1 and 23 DF, p-value: 2.195e-05
summary(Model4)
##
## Call:
## lm(formula = Price ~ Parking, data = MNSALES)
##
## Residuals:
## Min 1Q Median 3Q Max
## -186977 -150926 -79277 33723 654799
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 266277 47487 5.607 1.05e-05 ***
## Parking 9642 8711 1.107 0.28
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 210500 on 23 degrees of freedom
## Multiple R-squared: 0.05057, Adjusted R-squared: 0.009295
## F-statistic: 1.225 on 1 and 23 DF, p-value: 0.2798
summary(Model5)
##
## Call:
## lm(formula = Price ~ Condition, data = MNSALES)
##
## Residuals:
## Min 1Q Median 3Q Max
## -215850 -140581 -78250 91060 644419
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 350250 86374 4.055 0.000527 ***
## ConditionF -173310 128114 -1.353 0.189865
## ConditionG -44669 103237 -0.433 0.669458
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 211600 on 22 degrees of freedom
## Multiple R-squared: 0.08296, Adjusted R-squared: -0.0004118
## F-statistic: 0.9951 on 2 and 22 DF, p-value: 0.3857
Model6=lm(Price~LotSize+NumApts,data=MNSALES)
summary(Model6)
##
## Call:
## lm(formula = Price ~ LotSize + NumApts, data = MNSALES)
##
## Residuals:
## Min 1Q Median 3Q Max
## -113953 -53014 -21042 40841 253055
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.930e+04 4.352e+04 2.282 0.0325 *
## LotSize 4.681e-01 6.862e+00 0.068 0.9462
## NumApts 1.540e+04 2.290e+03 6.724 9.29e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 84760 on 22 degrees of freedom
## Multiple R-squared: 0.8528, Adjusted R-squared: 0.8394
## F-statistic: 63.73 on 2 and 22 DF, p-value: 7.024e-10
Model7=lm(Price~NumApts+Parking,data=MNSALES)
summary(Model7)
##
## Call:
## lm(formula = Price ~ NumApts + Parking, data = MNSALES)
##
## Residuals:
## Min 1Q Median 3Q Max
## -111199 -52429 -20383 44846 256229
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 100612.4 24352.2 4.132 0.000438 ***
## NumApts 15454.2 1409.6 10.964 2.21e-10 ***
## Parking 808.7 3594.5 0.225 0.824061
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 84670 on 22 degrees of freedom
## Multiple R-squared: 0.8531, Adjusted R-squared: 0.8398
## F-statistic: 63.89 on 2 and 22 DF, p-value: 6.865e-10
The R^2 value of the model Price ~ LotSize + NumApts is 0.8528, and the adjusted R^2 value is 0.8394.
I would say that the best model uses only NumApts because it has the highest adjusted R^2 value (0.8464). Adding LotSize or Parking lowers the adjusted R^2 (to 0.8394 and 0.8398, respectively), so neither improves the model.
NumApts, Age, LotSize,
Parking, and Condition, try fitting at least 3
different models with combinations of these explanatory variables (and
potentially interactions between them), and report the model you found
with the highest adjusted R-squared. Do you think this model is better
or worse than a model with only number of apartments as a predictor?
Explain why or why not (there are multiple, reasonable
justifications).
Model8=lm(Price~NumApts+Age+LotSize+Parking+Condition,data=MNSALES)
Model9=lm(Price~NumApts+Age+LotSize+Parking+Condition+NumApts:Age,data=MNSALES)
Model10=lm(Price~NumApts+Age+LotSize+Parking+Condition+NumApts:LotSize,data=MNSALES)
Model11=lm(Price~NumApts+Age+LotSize+Parking+Condition+NumApts:Parking,data=MNSALES)
Model12=lm(Price~NumApts+Age+LotSize+Parking+Condition+NumApts:Condition,data=MNSALES)
Model13=lm(Price~NumApts+Age+LotSize+Parking+Condition+Age:LotSize,data=MNSALES)
Model14=lm(Price~NumApts+Age+LotSize+Parking+Condition+Age:Parking,data=MNSALES)
Model15=lm(Price~NumApts+Age+LotSize+Parking+Condition+Age:Condition,data=MNSALES)
Model16=lm(Price~NumApts+Age+LotSize+Parking+Condition+LotSize:Parking,data=MNSALES)
Model17=lm(Price~NumApts+Age+LotSize+Parking+Condition+LotSize:Condition,data=MNSALES)
Model18=lm(Price~NumApts+Age+LotSize+Parking+Condition+Parking:Condition,data=MNSALES)
summary(Model8)
##
## Call:
## lm(formula = Price ~ NumApts + Age + LotSize + Parking + Condition,
## data = MNSALES)
##
## Residuals:
## Min 1Q Median 3Q Max
## -107880 -22277 -2703 20448 115524
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.278e+05 5.930e+04 3.841 0.00120 **
## NumApts 1.499e+04 1.890e+03 7.929 2.78e-07 ***
## Age -8.631e+02 6.652e+02 -1.297 0.21085
## LotSize 2.564e+00 5.622e+00 0.456 0.65381
## Parking 1.091e+03 3.080e+03 0.354 0.72727
## ConditionF -1.455e+05 4.154e+04 -3.503 0.00254 **
## ConditionG -1.239e+05 3.352e+04 -3.695 0.00166 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 63900 on 18 degrees of freedom
## Multiple R-squared: 0.9316, Adjusted R-squared: 0.9087
## F-statistic: 40.83 on 6 and 18 DF, p-value: 1.596e-09
summary(Model9)
##
## Call:
## lm(formula = Price ~ NumApts + Age + LotSize + Parking + Condition +
## NumApts:Age, data = MNSALES)
##
## Residuals:
## Min 1Q Median 3Q Max
## -106139 -23010 -4229 18441 114299
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.039e+05 8.339e+04 2.445 0.02570 *
## NumApts 1.835e+04 8.262e+03 2.221 0.04023 *
## Age -4.159e+02 1.267e+03 -0.328 0.74675
## LotSize 7.399e-01 7.220e+00 0.102 0.91957
## Parking 6.046e+02 3.361e+03 0.180 0.85936
## ConditionF -1.440e+05 4.267e+04 -3.376 0.00359 **
## ConditionG -1.183e+05 3.683e+04 -3.211 0.00512 **
## NumApts:Age -4.257e+01 1.017e+02 -0.418 0.68085
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 65420 on 17 degrees of freedom
## Multiple R-squared: 0.9323, Adjusted R-squared: 0.9044
## F-statistic: 33.42 on 7 and 17 DF, p-value: 9.945e-09
summary(Model10)
##
## Call:
## lm(formula = Price ~ NumApts + Age + LotSize + Parking + Condition +
## NumApts:LotSize, data = MNSALES)
##
## Residuals:
## Min 1Q Median 3Q Max
## -75029 -19297 4257 21513 113047
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.082e+05 5.456e+04 1.983 0.063741 .
## NumApts 2.542e+04 3.076e+03 8.263 2.34e-07 ***
## Age -5.189e+02 5.097e+02 -1.018 0.322952
## LotSize 8.563e+00 4.521e+00 1.894 0.075367 .
## Parking -1.445e+03 2.416e+03 -0.598 0.557626
## ConditionF -1.341e+05 3.147e+04 -4.262 0.000527 ***
## ConditionG -9.785e+04 2.618e+04 -3.737 0.001640 **
## NumApts:LotSize -6.033e-01 1.577e-01 -3.826 0.001351 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 48200 on 17 degrees of freedom
## Multiple R-squared: 0.9632, Adjusted R-squared: 0.9481
## F-statistic: 63.61 on 7 and 17 DF, p-value: 5.938e-11
summary(Model11)
##
## Call:
## lm(formula = Price ~ NumApts + Age + LotSize + Parking + Condition +
## NumApts:Parking, data = MNSALES)
##
## Residuals:
## Min 1Q Median 3Q Max
## -107526 -32073 -1170 22191 118206
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.060e+05 6.455e+04 3.192 0.00534 **
## NumApts 1.611e+04 2.284e+03 7.052 1.95e-06 ***
## Age -6.291e+02 7.197e+02 -0.874 0.39424
## LotSize 2.662e+00 5.658e+00 0.471 0.64397
## Parking 9.477e+03 9.976e+03 0.950 0.35541
## ConditionF -1.561e+05 4.346e+04 -3.591 0.00225 **
## ConditionG -1.280e+05 3.405e+04 -3.759 0.00156 **
## NumApts:Parking -4.946e+02 5.593e+02 -0.884 0.38881
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 64290 on 17 degrees of freedom
## Multiple R-squared: 0.9346, Adjusted R-squared: 0.9076
## F-statistic: 34.69 on 7 and 17 DF, p-value: 7.443e-09
summary(Model12)
##
## Call:
## lm(formula = Price ~ NumApts + Age + LotSize + Parking + Condition +
## NumApts:Condition, data = MNSALES)
##
## Residuals:
## Min 1Q Median 3Q Max
## -65789 -30754 -6469 17773 107164
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 86210.108 63268.778 1.363 0.19188
## NumApts 23009.549 2797.008 8.226 3.85e-07 ***
## Age -419.812 558.993 -0.751 0.46355
## LotSize 7.159 4.723 1.516 0.14905
## Parking 736.850 2491.671 0.296 0.77124
## ConditionF -62842.327 53660.761 -1.171 0.25870
## ConditionG -10844.423 42752.531 -0.254 0.80299
## NumApts:ConditionF -8792.554 4501.568 -1.953 0.06851 .
## NumApts:ConditionG -10386.213 3030.625 -3.427 0.00346 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 51460 on 16 degrees of freedom
## Multiple R-squared: 0.9605, Adjusted R-squared: 0.9408
## F-statistic: 48.68 on 8 and 16 DF, p-value: 8.719e-10
summary(Model13)
##
## Call:
## lm(formula = Price ~ NumApts + Age + LotSize + Parking + Condition +
## Age:LotSize, data = MNSALES)
##
## Residuals:
## Min 1Q Median 3Q Max
## -107779 -22421 -2830 20208 115640
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.259e+05 9.709e+04 2.327 0.03260 *
## NumApts 1.503e+04 2.517e+03 5.970 1.52e-05 ***
## Age -8.243e+02 1.696e+03 -0.486 0.63320
## LotSize 2.681e+00 7.443e+00 0.360 0.72313
## Parking 1.083e+03 3.186e+03 0.340 0.73810
## ConditionF -1.453e+05 4.320e+04 -3.365 0.00368 **
## ConditionG -1.235e+05 3.774e+04 -3.272 0.00449 **
## Age:LotSize -4.338e-03 1.736e-01 -0.025 0.98035
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 65750 on 17 degrees of freedom
## Multiple R-squared: 0.9316, Adjusted R-squared: 0.9034
## F-statistic: 33.06 on 7 and 17 DF, p-value: 1.083e-08
summary(Model14)
##
## Call:
## lm(formula = Price ~ NumApts + Age + LotSize + Parking + Condition +
## Age:Parking, data = MNSALES)
##
## Residuals:
## Min 1Q Median 3Q Max
## -106548 -21296 -1512 18055 117524
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.243e+05 6.072e+04 3.694 0.00180 **
## NumApts 1.449e+04 2.110e+03 6.865 2.74e-06 ***
## Age -8.673e+02 6.778e+02 -1.279 0.21791
## LotSize 3.328e+00 5.877e+00 0.566 0.57866
## Parking -1.588e+03 5.577e+03 -0.285 0.77931
## ConditionF -1.514e+05 4.353e+04 -3.479 0.00287 **
## ConditionG -1.192e+05 3.508e+04 -3.398 0.00342 **
## Age:Parking 9.677e+01 1.665e+02 0.581 0.56880
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 65110 on 17 degrees of freedom
## Multiple R-squared: 0.9329, Adjusted R-squared: 0.9053
## F-statistic: 33.76 on 7 and 17 DF, p-value: 9.193e-09
summary(Model15)
##
## Call:
## lm(formula = Price ~ NumApts + Age + LotSize + Parking + Condition +
## Age:Condition, data = MNSALES)
##
## Residuals:
## Min 1Q Median 3Q Max
## -103695 -17046 -2736 26859 92940
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 348457.79 76903.21 4.531 0.000341 ***
## NumApts 14534.66 1771.77 8.203 4e-07 ***
## Age -2764.97 1050.86 -2.631 0.018154 *
## LotSize 1.97 5.21 0.378 0.710338
## Parking 3098.63 3108.04 0.997 0.333617
## ConditionF -294535.21 202265.76 -1.456 0.164686
## ConditionG -281001.92 76722.34 -3.663 0.002102 **
## Age:ConditionF 2326.21 2767.09 0.841 0.412921
## Age:ConditionG 2893.13 1290.33 2.242 0.039473 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 59120 on 16 degrees of freedom
## Multiple R-squared: 0.9479, Adjusted R-squared: 0.9219
## F-statistic: 36.4 on 8 and 16 DF, p-value: 7.746e-09
summary(Model16)
##
## Call:
## lm(formula = Price ~ NumApts + Age + LotSize + Parking + Condition +
## LotSize:Parking, data = MNSALES)
##
## Residuals:
## Min 1Q Median 3Q Max
## -104944 -33690 -2469 25325 119420
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.014e+05 6.726e+04 2.995 0.00814 **
## NumApts 1.494e+04 1.906e+03 7.838 4.81e-07 ***
## Age -6.849e+02 7.020e+02 -0.976 0.34293
## LotSize 4.476e+00 6.092e+00 0.735 0.47251
## Parking 1.152e+04 1.260e+04 0.914 0.37355
## ConditionF -1.526e+05 4.267e+04 -3.576 0.00233 **
## ConditionG -1.218e+05 3.386e+04 -3.597 0.00222 **
## LotSize:Parking -9.971e-01 1.168e+00 -0.854 0.40519
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 64390 on 17 degrees of freedom
## Multiple R-squared: 0.9344, Adjusted R-squared: 0.9073
## F-statistic: 34.58 on 7 and 17 DF, p-value: 7.633e-09
summary(Model17)
##
## Call:
## lm(formula = Price ~ NumApts + Age + LotSize + Parking + Condition +
## LotSize:Condition, data = MNSALES)
##
## Residuals:
## Min 1Q Median 3Q Max
## -66028 -18671 -4288 6733 83667
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -126542.53 93132.36 -1.359 0.193077
## NumApts 14689.91 1370.38 10.720 1.04e-08 ***
## Age 130.08 538.67 0.241 0.812251
## LotSize 44.41 10.57 4.203 0.000674 ***
## Parking 2947.22 2269.23 1.299 0.212428
## ConditionF 123797.73 99749.29 1.241 0.232462
## ConditionG 209209.14 81549.05 2.565 0.020745 *
## LotSize:ConditionF -40.54 13.28 -3.053 0.007597 **
## LotSize:ConditionG -44.23 10.32 -4.286 0.000567 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 46210 on 16 degrees of freedom
## Multiple R-squared: 0.9682, Adjusted R-squared: 0.9523
## F-statistic: 60.85 on 8 and 16 DF, p-value: 1.591e-10
summary(Model18)
##
## Call:
## lm(formula = Price ~ NumApts + Age + LotSize + Parking + Condition +
## Parking:Condition, data = MNSALES)
##
## Residuals:
## Min 1Q Median 3Q Max
## -91067 -16649 333 18335 128353
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.980e+05 5.995e+04 3.303 0.00449 **
## NumApts 1.365e+04 1.984e+03 6.879 3.7e-06 ***
## Age -8.674e+02 6.455e+02 -1.344 0.19780
## LotSize 5.756e+00 5.748e+00 1.001 0.33154
## Parking 2.237e+04 1.256e+04 1.781 0.09388 .
## ConditionF -1.301e+05 4.451e+04 -2.924 0.00994 **
## ConditionG -1.015e+05 3.496e+04 -2.904 0.01035 *
## Parking:ConditionF -2.037e+04 1.338e+04 -1.522 0.14752
## Parking:ConditionG -2.262e+04 1.282e+04 -1.765 0.09660 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 62000 on 16 degrees of freedom
## Multiple R-squared: 0.9427, Adjusted R-squared: 0.9141
## F-statistic: 32.93 on 8 and 16 DF, p-value: 1.631e-08
anova(Model2,Model17)
## Analysis of Variance Table
##
## Model 1: Price ~ NumApts
## Model 2: Price ~ NumApts + Age + LotSize + Parking + Condition + LotSize:Condition
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 23 1.5809e+11
## 2 16 3.4170e+10 7 1.2392e+11 8.2896 0.0002484 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The model with the highest adjusted R^2 is Model 17 (Price ~ NumApts + Age + LotSize + Parking + Condition + LotSize:Condition). I would say that this is a better predictor than NumApts alone because its adjusted R^2 is noticeably larger: 0.9523 versus 0.8464, a gain of about 10 percentage points even after the penalty for the extra terms. To confirm this conclusion, I also ran an F-test with Model 2 as the null and Model 17 as the alternative. It returned F = 8.2896 and a p-value of 0.0002484, which is much less than 0.05, so we reject the null hypothesis that the extra terms add nothing.
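Comparing many candidate models by reading each summary in turn is error-prone; collecting the adjusted R^2 values in one vector makes the ranking explicit. The sketch below uses the built-in `mtcars` data as a stand-in (not MNSALES), and the three formulas are arbitrary examples chosen for illustration.

```r
# Stand-in example on the built-in mtcars data (NOT the MNSALES data)
formulas <- list(
  wt_only   = mpg ~ wt,
  wt_hp     = mpg ~ wt + hp,
  wt_hp_int = mpg ~ wt + hp + wt:hp
)

# Fit every candidate, then pull out each model's adjusted R-squared
fits <- lapply(formulas, lm, data = mtcars)
adj  <- sapply(fits, function(m) summary(m)$adj.r.squared)

# Highest adjusted R-squared first
print(sort(adj, decreasing = TRUE))
```

The same pattern would apply to the eleven Price models above by listing their formulas and passing `data = MNSALES`.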
The high_peaks data includes information on hiking
trails in the 46 “high peaks” in the Adirondack mountains of northern
New York state. Our goal will be to understand the variability in the
time in hours that it takes to complete each hike. In doing so, we’ll
separately consider five possible predictors for time:
peaks <- read.csv('https://raw.githubusercontent.com/vittorioaddona/data/main/high_peaks.csv')
Yes, I would, because variables such as difficulty and rating could very well be proxies for each other.
Model20=lm(time~elevation+difficulty+ascent+length+rating,data=peaks)
summary(Model20)
##
## Call:
## lm(formula = time ~ elevation + difficulty + ascent + length +
## rating, data = peaks)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5537 -0.7240 -0.1461 0.6671 2.1528
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.4594588 3.4389638 3.041 0.004196 **
## elevation -0.0017805 0.0004783 -3.723 0.000621 ***
## difficulty 0.5282507 0.3509133 1.505 0.140288
## ascent 0.0005698 0.0003002 1.898 0.065127 .
## length 0.4041126 0.0739211 5.467 2.85e-06 ***
## ratingeasy -2.0116202 1.2184442 -1.651 0.106775
## ratingmoderate -1.9457713 0.6725960 -2.893 0.006215 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.052 on 39 degrees of freedom
## Multiple R-squared: 0.8772, Adjusted R-squared: 0.8583
## F-statistic: 46.43 on 6 and 39 DF, p-value: 2.984e-16
Model21=lm(time~elevation+ascent+length+rating,data=peaks)
summary(Model21)
##
## Call:
## lm(formula = time ~ elevation + ascent + length + rating, data = peaks)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5202 -0.6913 -0.1204 0.7829 2.1705
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.2662182 2.3671816 6.027 4.34e-07 ***
## elevation -0.0019257 0.0004758 -4.047 0.000231 ***
## ascent 0.0005245 0.0003034 1.729 0.091605 .
## length 0.4456704 0.0696495 6.399 1.30e-07 ***
## ratingeasy -3.4825497 0.7393221 -4.710 2.96e-05 ***
## ratingmoderate -2.5996948 0.5215660 -4.984 1.24e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.068 on 40 degrees of freedom
## Multiple R-squared: 0.8701, Adjusted R-squared: 0.8538
## F-statistic: 53.57 on 5 and 40 DF, p-value: < 2.2e-16
Model22=lm(time~elevation+difficulty+ascent+length,data=peaks)
summary(Model22)
##
## Call:
## lm(formula = time ~ elevation + difficulty + ascent + length,
## data = peaks)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.77942 -0.81216 -0.08647 0.68962 3.06736
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.9567864 2.2307630 2.670 0.01082 *
## elevation -0.0016703 0.0005183 -3.223 0.00249 **
## difficulty 0.8654527 0.2285275 3.787 0.00049 ***
## ascent 0.0006011 0.0003310 1.816 0.07669 .
## length 0.4440084 0.0812523 5.465 2.49e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.171 on 41 degrees of freedom
## Multiple R-squared: 0.8401, Adjusted R-squared: 0.8245
## F-statistic: 53.84 on 4 and 41 DF, p-value: 8.738e-16
When both rating and difficulty are included in the same model (Model 20), difficulty is no longer significant at the 0.05 level (p = 0.14), and neither is ratingeasy (p = 0.11), although ratingmoderate remains significant (p = 0.006). However, when only one of the two variables is used (Models 21 and 22), every coefficient for that variable is significant at the 0.001 level. As such, adding either rating or difficulty to a model that already has the other is largely redundant, because the two variables act as proxies for each other.
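One way to probe this redundancy more formally is a partial F-test of each reduced model against Model 20. The sketch below reconstructs those tests from the residual standard errors and degrees of freedom printed in the three summaries. Because those inputs are rounded, the results are approximate, and the helper name `partial_f` is hypothetical.

```r
# Residual standard errors and residual df copied from the summaries above
rse    <- c(Model20 = 1.052, Model21 = 1.068, Model22 = 1.171)
res_df <- c(Model20 = 39,    Model21 = 40,    Model22 = 41)

# RSS = (residual standard error)^2 * residual df
rss <- rse^2 * res_df

# Partial F-test of a reduced model against the full Model 20
partial_f <- function(reduced) {
  df_diff <- res_df[[reduced]] - res_df[["Model20"]]
  f <- ((rss[[reduced]] - rss[["Model20"]]) / df_diff) /
    (rss[["Model20"]] / res_df[["Model20"]])
  c(F = f, p = pf(f, df_diff, res_df[["Model20"]], lower.tail = FALSE))
}

drop_difficulty <- partial_f("Model21")  # Model 21 omits difficulty
drop_rating     <- partial_f("Model22")  # Model 22 omits rating
print(rbind(drop_difficulty, drop_rating))
```

Since rating enters the model as two dummy coefficients, a joint F-test is a more natural check than reading the two t-tests separately. With the fitted objects in memory, anova(Model21, Model20) and anova(Model22, Model20) would give the exact versions of these tests.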
time using
rating. Interpret each of the coefficients from this model
in the context of the problem, and make a graph comparing the two
variables to go alongside your model.
lm(time~rating,data=peaks)
##
## Call:
## lm(formula = time ~ rating, data = peaks)
##
## Coefficients:
## (Intercept) ratingeasy ratingmoderate
## 15.000 -7.000 -4.556
ggplot(data=peaks,aes(x=rating,y=time))+geom_point()
The intercept tells us that the average time it takes to
complete a difficult hike (the reference group) is about 15 hours. The
ratingeasy coefficient tells us that the average easy-rated hike takes
about 7 fewer hours to complete than a difficult hike. The
ratingmoderate coefficient tells us that the average moderate hike takes about
4.6 fewer hours to complete than a difficult hike.
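The link between these coefficients and the group averages can be sketched on a small example. The data frame below is hypothetical toy data, not the high_peaks values. It only illustrates that, with a single categorical predictor, the intercept is the mean of the reference level and each other coefficient is a difference from that mean.

```r
# Hypothetical toy data (NOT the high_peaks values): two hikes per rating
toy <- data.frame(
  rating = c("difficult", "difficult", "easy", "easy", "moderate", "moderate"),
  time   = c(14, 16, 7, 9, 10, 11)
)

fit <- lm(time ~ rating, data = toy)

# Plain group averages, computed without any model
group_means <- sapply(split(toy$time, toy$rating), mean)

# Intercept = mean of the reference level ("difficult", first alphabetically);
# each other coefficient = that group's mean minus the reference mean
rebuilt <- c(
  difficult = coef(fit)[["(Intercept)"]],
  easy      = coef(fit)[["(Intercept)"]] + coef(fit)[["ratingeasy"]],
  moderate  = coef(fit)[["(Intercept)"]] + coef(fit)[["ratingmoderate"]]
)
print(rbind(group_means, rebuilt))
```

Applied to the coefficients reported above, the same arithmetic gives an average of about 15 - 7 = 8 hours for easy hikes and about 15 - 4.556 = 10.444 hours for moderate hikes.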
Model25=lm(time~rating,data=peaks)
summary(Model25)
##
## Call:
## lm(formula = time ~ rating, data = peaks)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0000 -1.0000 -0.2222 0.8889 4.0000
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.0000 0.5947 25.222 < 2e-16 ***
## ratingeasy -7.0000 0.7816 -8.956 2.20e-11 ***
## ratingmoderate -4.5556 0.6771 -6.728 3.19e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.682 on 43 degrees of freedom
## Multiple R-squared: 0.6538, Adjusted R-squared: 0.6377
## F-statistic: 40.6 on 2 and 43 DF, p-value: 1.246e-10
The R^2 value from this model is 0.6538, which means that rating
explains about 65% of the variability in hike completion times.
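To make the meaning of R^2 concrete, it can be computed directly as one minus the ratio of unexplained variability to total variability. The sketch below uses hypothetical toy data rather than the high_peaks values, and the names `rss`, `tss`, and `r2_manual` are chosen only for illustration.

```r
# Hypothetical toy data (NOT the high_peaks values)
toy <- data.frame(
  rating = c("difficult", "difficult", "easy", "easy", "moderate", "moderate"),
  time   = c(14, 16, 7, 9, 10, 11)
)
fit <- lm(time ~ rating, data = toy)

rss <- sum(residuals(fit)^2)               # variability left unexplained
tss <- sum((toy$time - mean(toy$time))^2)  # total variability in time
r2_manual <- 1 - rss / tss

# The manual value should agree with the one reported by summary()
print(c(manual = r2_manual, from_summary = summary(fit)$r.squared))
```

R^2 is therefore a share of variability, not a count of how often the model's predictions are right.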
time using rating, ascent, and
length. Interpret the coefficients corresponding to
rating from this model in the context of the problem.
Report and interpret the R-squared value from your model.
Model26=lm(time~rating+ascent+length,data=peaks)
summary(Model26)
##
## Call:
## lm(formula = time ~ rating + ascent + length, data = peaks)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3516 -0.6341 -0.0605 0.6308 2.5716
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.5106514 1.6298374 3.995 0.000263 ***
## ratingeasy -3.1685224 0.8621911 -3.675 0.000683 ***
## ratingmoderate -2.4767827 0.6105856 -4.056 0.000218 ***
## ascent 0.0001875 0.0003422 0.548 0.586697
## length 0.4590819 0.0815831 5.627 1.47e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.253 on 41 degrees of freedom
## Multiple R-squared: 0.8168, Adjusted R-squared: 0.799
## F-statistic: 45.71 on 4 and 41 DF, p-value: 1.37e-14
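Holding ascent and length fixed, an easy hike takes about 3.17 fewer hours
on average than a difficult hike, and a moderate hike takes about 2.48
fewer hours. The R^2 value for this model is 0.8168, which means that
rating, ascent, and length together explain about 82% of the variability
in hike completion times.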
When we added more variables, the R^2 value went up, from 0.6538 to 0.8168.
I would say that the best model is Model 26 (time~rating+ascent+length). It has a higher adjusted R^2 than Model 25 (0.799 versus 0.6377) and a smaller overall F-test p-value (1.37e-14 versus 1.246e-10). Since our question is how completion times differ across hikes, we need to take into account the vertical ascent of the trail, the trail's overall length, and the rating of the trail.
Yes. If the adjusted R^2 value were lower for the multiple linear regression model (Model 26), I would change my answer to the simple linear regression model (Model 25). Adjusted R^2 penalizes each additional predictor, so a lower value would mean that the extra variables in Model 26 do not explain enough additional variability to justify making the model more complex.
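The penalty behind adjusted R^2 can be sketched with the usual formula, 1 - (1 - R^2)(n - 1)/(n - p - 1), where p is the number of estimated slopes. The helper `adj_r2` below is a hypothetical name. The inputs are the rounded R^2 values printed in the two summaries, so the results should agree with the reported adjusted values only up to rounding.

```r
# Adjusted R^2 from R^2, sample size n, and number of estimated slopes p
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n <- 46  # 46 hikes: residual df of 43 plus 3 coefficients in Model 25

model25 <- adj_r2(0.6538, n, p = 2)  # ratingeasy, ratingmoderate
model26 <- adj_r2(0.8168, n, p = 4)  # two rating dummies, ascent, length
print(c(Model25 = model25, Model26 = model26))
```

Because the factor (n - 1)/(n - p - 1) grows with p, an added predictor raises adjusted R^2 only if it lowers 1 - R^2 by enough to offset the larger multiplier.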