This week, I worked through Chapter 3 of the Stat2 book, on multiple regression. This chapter introduces the topic that gives this project its name, and it lays the foundation for everything I will do with multiple regression from here on.
While working through the chapters, I like to follow along with the examples in the book, which means learning how to code each technique in R. I expected constructing a multiple regression (MR) model in R to be very difficult, when in reality it consists of choosing a response variable and then adding whatever predictor variables you want to regress on. I learned how to create different types of plots, such as QQ plots, and how to correctly subset data for use in an MR model. I also found out how to compute a correlation matrix, which compiles variables from a dataset and shows the pairwise correlations between them; this is helpful when choosing predictors.
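For my own reference, the basic pattern looks something like the sketch below (the file name and the columns y, x1, and x2 are hypothetical):
# fit an MR model: response on the left, predictors joined with +
df = read.csv("~/downloads/Example.csv")
mod = lm(y ~ x1 + x2, data = df)
summary(mod)
# correlation matrix for a chosen set of columns
cor(df[, c("y", "x1", "x2")])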
From prior experience I was well acquainted with scatterplots, but the book introduced a variant called the scatterplot matrix, where R displays a grid of scatterplots, one for each pair of variables. When I first thought about multiple regression as a topic, I was curious how statisticians actually choose the variables that go into a model, and this was a great introduction to that process. A scatterplot matrix makes it easy to see what kind of correlation (positive, negative, or none) is present in the data, and a correlation matrix supplies the numbers to go alongside the visual.
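The base pairs() function is one way to get such a matrix; continuing the hypothetical example above:
# scatterplot matrix: one panel per pair of variables
pairs(df[, c("y", "x1", "x2")])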
I remember an example in the book that used both correlation and scatterplot matrices to choose variables for a model, but after some exploration had to redo the process. When I was tinkering with datasets outside the book, attempting to create MR models, I was intimidated by the sheer amount of data and afraid of picking the wrong variable and creating a bad model. Observing the methodical way the book went back and built a new model gave me peace of mind, and it helped me realize that every model reflects the choices of its creator, and that a truly “perfect” model will never be created.
I was also interested in how the conditions of an MR model are checked, curious to see if they would differ from those of a simple linear model. Residual plots are still used to check conditions such as linearity and constant variance, while other conditions require other graphs: normality of the residuals, for instance, is assessed with a quantile-quantile (QQ) plot, whose purpose is to show whether data come from a normal distribution. Multiple regression also introduces a new concern, multicollinearity among the predictors, which the correlation matrix helps to diagnose.
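A minimal sketch of these checks, again using the hypothetical model above:
# residuals vs. fitted values: checks linearity and constant variance
plot(fitted(mod), resid(mod))
abline(h = 0)
# QQ plot: checks normality of the residuals
qqnorm(resid(mod))
qqline(resid(mod))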
Doing the homework, I was able to incorporate the ANOVA I learned in the previous chapter alongside the creation of MR models. As I said earlier, this chapter put my R skills to the test. For example, problem 3.23 had me create a scatterplot of brain pH against Age, which seems simple enough, except that the problem asked me to make Sex a grouping variable. After browsing the internet for a good while, I figured out how to group the data within the scatterplot in the ggplot package by mapping the color aesthetic to Sex, the grouping variable. Learning this proved valuable: looking at the scatterplot as plain black data points leads the viewer to think there is a positive correlation between the variables, but plotting separate regression lines by sex shows a positive line for women and a negative line for men.
This week, I feel I have established a solid foundation in my understanding of MR, and I look forward to strengthening it in the next chapter, where I will learn how to apply other tools, such as ANOVA, to an MR model.
# import libraries
library(ggplot2)
library(stats)
rails = read.csv("~/downloads/Rails.csv")
# a
railslm = lm(adj2007 ~ distance, data = rails)
plot(rails$distance, rails$adj2007)
summary(railslm)
##
## Call:
## lm(formula = adj2007 ~ distance, data = rails)
##
## Residuals:
## Min 1Q Median 3Q Max
## -190.55 -58.19 -17.48 25.22 444.41
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 388.204 14.052 27.626 < 2e-16 ***
## distance -54.427 9.659 -5.635 1.56e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 92.13 on 102 degrees of freedom
## Multiple R-squared: 0.2374, Adjusted R-squared: 0.2299
## F-statistic: 31.75 on 1 and 102 DF, p-value: 1.562e-07
# b
railsmr = lm(adj2007 ~ distance + squarefeet, data = rails)
summary(railsmr)
##
## Call:
## lm(formula = adj2007 ~ distance + squarefeet, data = rails)
##
## Residuals:
## Min 1Q Median 3Q Max
## -138.835 -32.621 -1.903 27.369 145.504
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 109.742 20.057 5.472 3.25e-07 ***
## distance -16.486 5.942 -2.775 0.00659 **
## squarefeet 150.780 9.998 15.080 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 51.34 on 101 degrees of freedom
## Multiple R-squared: 0.7655, Adjusted R-squared: 0.7608
## F-statistic: 164.8 on 2 and 101 DF, p-value: < 2.2e-16
# c
confint(railslm)
## 2.5 % 97.5 %
## (Intercept) 360.3317 416.07588
## distance -73.5859 -35.26851
confint(railsmr)
## 2.5 % 97.5 %
## (Intercept) 69.95460 149.530197
## distance -28.27307 -4.698861
## squarefeet 130.94601 170.614247
adj2007(hat) = 388.204 - 54.427(distance), R^2 = 23.74%
adj2007(hat) = 109.742 - 16.486(distance) + 150.78(squarefeet), R^2 = 76.55%
Adding squarefeet changed the estimates considerably: the R^2 value increased dramatically with the addition of the squarefeet variable. The coefficient for squarefeet is positive, large, and highly significant (p < 2e-16). Overall, this means that the combination of distance and squarefeet is far more effective at explaining variability in price than distance is alone.
For the simple linear model, the 95% confidence interval for the distance coefficient is (-73.59, -35.27), while in the MR model it is (-28.27, -4.70). The narrower interval in the MR model means it pins down the effect of distance more precisely once squarefeet is accounted for.
Using the second MR equation with a distance of 0.5 miles and 1,500 square feet (squarefeet = 1.5, since the variable is measured in thousands), we get 109.742 - 16.486(0.5) + 150.78(1.5) = 327.669, an estimated adjusted 2007 price of about $327,669.
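The same estimate can be computed with predict(), which avoids arithmetic slips (this assumes, as above, that squarefeet is recorded in thousands of square feet):
# estimated adj2007 price at distance = 0.5 miles, 1500 square feet
# should return roughly 327.67 (thousands of dollars)
predict(railsmr, newdata = data.frame(distance = 0.5, squarefeet = 1.5))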
math = read.csv("~/downloads/MthEnr.csv")
# a
mathlm = lm(Spring ~ Fall + AYear, data = math)
summary(mathlm)
##
## Call:
## lm(formula = Spring ~ Fall + AYear, data = math)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.613 -23.022 5.416 7.541 55.357
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8743.3901 6141.5341 -1.424 0.192
## Fall -0.2021 0.3589 -0.563 0.589
## AYear 4.5159 3.0492 1.481 0.177
##
## Residual standard error: 31.09 on 8 degrees of freedom
## Multiple R-squared: 0.2773, Adjusted R-squared: 0.09663
## F-statistic: 1.535 on 2 and 8 DF, p-value: 0.2728
# c
anova(mathlm)
## Analysis of Variance Table
##
## Response: Spring
## Df Sum Sq Mean Sq F value Pr(>F)
## Fall 1 847.0 847.00 0.8763 0.3766
## AYear 1 2120.1 2120.06 2.1934 0.1769
## Residuals 8 7732.6 966.57
The R^2 value of 0.2773 means that 27.73% of the variation in Spring enrollment can be explained by the model using Fall enrollment and academic year.
The residual standard error for this model is 31.09.
The overall F-statistic for the model is 1.535 with a p-value of 0.2728. Since the p-values are quite high, we fail to reject the null hypothesis that the predictors explain no variation in Spring enrollment.
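As a check, the overall F-statistic from the summary can be reproduced by comparing the fitted model against an intercept-only model:
# nested model comparison: reproduces F = 1.535, p = 0.2728
nullmath = lm(Spring ~ 1, data = math)
anova(nullmath, mathlm)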
# a
brain = read.csv("~/downloads/BrainpH.csv")
ggplot(brain, aes(x = Age, y = pH, color = Sex)) +
geom_point() +
geom_smooth(method = 'lm', se = FALSE,
fullrange = TRUE)
## `geom_smooth()` using formula 'y ~ x'
# b
brainlm = lm(pH ~ Age, data = brain)
summary(brainlm)
##
## Call:
## lm(formula = pH ~ Age, data = brain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.56976 -0.21781 0.02032 0.16801 0.38649
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.8881113 0.1321194 52.13 <2e-16 ***
## Age -0.0003905 0.0022944 -0.17 0.866
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.235 on 52 degrees of freedom
## Multiple R-squared: 0.0005566, Adjusted R-squared: -0.01866
## F-statistic: 0.02896 on 1 and 52 DF, p-value: 0.8655
# c
brainmale = subset(brain, Sex == "M")
brainmlm = lm(pH ~ Age, data = brainmale)
summary(brainmlm)
##
## Call:
## lm(formula = pH ~ Age, data = brainmale)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.45463 -0.18439 0.02408 0.16533 0.40449
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.989109 0.129547 53.950 <2e-16 ***
## Age -0.002279 0.002289 -0.996 0.325
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2175 on 42 degrees of freedom
## Multiple R-squared: 0.02306, Adjusted R-squared: -0.000199
## F-statistic: 0.9914 on 1 and 42 DF, p-value: 0.3251
brainfemale = subset(brain, Sex == "F")
brainflm = lm(pH ~ Age, data = brainfemale)
summary(brainflm)
##
## Call:
## lm(formula = pH ~ Age, data = brainfemale)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.42393 -0.09830 0.03404 0.16394 0.39209
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.03963 0.50386 11.987 2.16e-06 ***
## Age 0.01374 0.00816 1.684 0.131
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2781 on 8 degrees of freedom
## Multiple R-squared: 0.2617, Adjusted R-squared: 0.1694
## F-statistic: 2.835 on 1 and 8 DF, p-value: 0.1307
This plot shows that the correlation with Age is positive for women and negative for men. This should be interpreted cautiously, as there are far more data points for men (44) than for women (10).
The R^2 value of 0.0005566 means that only about 0.06% of the variability in pH can be explained by the overall model, so there is essentially no linear association between the variables when the sexes are combined.
Female: pH(hat) = 6.04 + 0.0137(Age); Male: pH(hat) = 6.99 - 0.0023(Age)
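As an alternative to fitting separate models for each sex, a single model with an interaction term would test whether the two slopes differ; a sketch:
# Age * Sex expands to Age + Sex + Age:Sex
brainint = lm(pH ~ Age * Sex, data = brain)
summary(brainint)
# the Age:SexM coefficient (with F as the reference level) estimates how much
# the male slope differs from the female slope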
# a
speed = read.csv("~/downloads/Speed.csv")
speedlm = lm(FatalityRate ~ Year, data = speed)
summary(speedlm)
##
## Call:
## lm(formula = FatalityRate ~ Year, data = speed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.18959 -0.07550 -0.02576 0.09346 0.24606
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 91.320887 8.374227 10.9 1.28e-09 ***
## Year -0.044870 0.004193 -10.7 1.75e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1164 on 19 degrees of freedom
## Multiple R-squared: 0.8577, Adjusted R-squared: 0.8502
## F-statistic: 114.5 on 1 and 19 DF, p-value: 1.75e-09
# b
resSpeed = resid(speedlm)
plot(fitted(speedlm), resSpeed, pch = 19)
abline(0,0)
barplot(resid(speedlm))
plot(speedlm)
# c
speedmr = lm(FatalityRate ~ Year + StateControl, data = speed)
summary(speedmr)
##
## Call:
## lm(formula = FatalityRate ~ Year + StateControl, data = speed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.20225 -0.06443 -0.02077 0.08021 0.24859
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 85.278326 15.809290 5.394 3.99e-05 ***
## Year -0.041830 0.007942 -5.267 5.23e-05 ***
## StateControl -0.045012 0.099035 -0.455 0.655
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1189 on 18 degrees of freedom
## Multiple R-squared: 0.8593, Adjusted R-squared: 0.8437
## F-statistic: 54.96 on 2 and 18 DF, p-value: 2.163e-08
plot(speedmr)
# subset by year
speedbefore = subset(speed, Year <= 1994)
speedafter = subset(speed, Year >= 1995)
speedlmbefore = lm(FatalityRate ~ Year + StateControl, data = speedbefore)
summary(speedlmbefore)
##
## Call:
## lm(formula = FatalityRate ~ Year + StateControl, data = speedbefore)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.103571 -0.017619 0.007619 0.022738 0.091667
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 216.23071 19.24719 11.23 2.97e-05 ***
## Year -0.10762 0.00967 -11.13 3.14e-05 ***
## StateControl NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06267 on 6 degrees of freedom
## Multiple R-squared: 0.9538, Adjusted R-squared: 0.9461
## F-statistic: 123.9 on 1 and 6 DF, p-value: 3.136e-05
speedlmafter = lm(FatalityRate ~ Year + StateControl, data = speedafter)
summary(speedlmafter)
##
## Call:
## lm(formula = FatalityRate ~ Year + StateControl, data = speedafter)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.034066 -0.020769 0.002527 0.022473 0.035824
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 54.854121 3.754418 14.61 1.50e-08 ***
## Year -0.026648 0.001876 -14.20 2.02e-08 ***
## StateControl NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02531 on 11 degrees of freedom
## Multiple R-squared: 0.9483, Adjusted R-squared: 0.9436
## F-statistic: 201.7 on 1 and 11 DF, p-value: 2.022e-08
plot(speedlmafter)
FatalityRate(hat) = 91.32 - 0.0449(Year)
The residual plot exhibits a V-shape, suggesting the relationship is nonlinear; the points also do not have equal spread, so the constant variance condition is violated.
From 1995 onward, the QQ plot of the residuals looks much closer to normal.
The regression equation for years before 1995 is FatalityRate(hat) = 216.23 - 0.1076(Year) with R^2 = 95.38%, while for 1995 and later it is FatalityRate(hat) = 54.85 - 0.0266(Year) with R^2 = 94.83%. Both the intercept and the slope differ substantially between the two periods. (StateControl does not vary within each period, which is why R drops it from both fits with the singularity note.)
## 3.37
diam = read.csv("~/downloads/Diamond.csv")
# a
plot(diam$Carat, diam$TotalPrice)
# b
casq = diam$Carat^2 # squared carat term, used in the quadratic model below
quadlm = lm(TotalPrice ~ Carat, data = diam) # linear model for comparison
summary(quadlm)
##
## Call:
## lm(formula = TotalPrice ~ Carat, data = diam)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8701.3 -1481.7 -136.1 1093.7 16310.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7181.3 347.5 -20.66 <2e-16 ***
## Carat 14638.4 311.8 46.95 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2881 on 349 degrees of freedom
## Multiple R-squared: 0.8633, Adjusted R-squared: 0.8629
## F-statistic: 2204 on 1 and 349 DF, p-value: < 2.2e-16
quaddiam = lm(TotalPrice ~ Carat + casq, data = diam)
summary(quaddiam)
##
## Call:
## lm(formula = TotalPrice ~ Carat + casq, data = diam)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10207.4 -711.6 -167.9 355.0 12147.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -522.7 466.3 -1.121 0.26307
## Carat 2386.0 752.5 3.171 0.00166 **
## casq 4498.2 263.0 17.101 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2127 on 348 degrees of freedom
## Multiple R-squared: 0.9257, Adjusted R-squared: 0.9253
## F-statistic: 2168 on 2 and 348 DF, p-value: < 2.2e-16
plot(quaddiam)
This plot shows that as the carat of the diamond increases, there is more variability in price, meaning that the larger the diamond, the harder it is for the model to predict its total price.
TotalPrice = -522.70 + 2386.0(Carat) + 4498.2(Carat^2), R^2 = 0.9257, Adjusted R^2 = 0.9253
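A side note to myself: the squared term does not need a separate vector like casq; it can be written inline with I(), which should fit the same model:
quaddiam2 = lm(TotalPrice ~ Carat + I(Carat^2), data = diam)
summary(quaddiam2)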
# a
depthsq = diam$Depth^2
casq = diam$Carat^2
diamquad = lm(TotalPrice ~ Depth + depthsq, data = diam)
summary(diamquad)
##
## Call:
## lm(formula = TotalPrice ~ Depth + depthsq, data = diam)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9323 -4251 -2676 2134 45513
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -28406.783 112211.790 -0.253 0.800
## Depth 766.369 3353.222 0.229 0.819
## depthsq -3.233 24.869 -0.130 0.897
##
## Residual standard error: 7616 on 348 degrees of freedom
## Multiple R-squared: 0.04748, Adjusted R-squared: 0.042
## F-statistic: 8.673 on 2 and 348 DF, p-value: 0.0002111
anova(diamquad)
## Analysis of Variance Table
##
## Response: TotalPrice
## Df Sum Sq Mean Sq F value Pr(>F)
## Depth 1 1.0050e+09 1005028980 17.3283 3.968e-05 ***
## depthsq 1 9.8029e+05 980292 0.0169 0.8966
## Residuals 348 2.0184e+10 57999426
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# b
twopredic = lm(TotalPrice ~ Carat + Depth, data = diam)
summary(twopredic)
##
## Call:
## lm(formula = TotalPrice ~ Carat + Depth, data = diam)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9234.7 -1223.7 -274.3 1161.0 16368.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1059.24 1918.36 0.552 0.581
## Carat 15087.01 320.96 47.006 < 2e-16 ***
## Depth -134.94 30.92 -4.364 1.68e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2809 on 348 degrees of freedom
## Multiple R-squared: 0.8704, Adjusted R-squared: 0.8696
## F-statistic: 1168 on 2 and 348 DF, p-value: < 2.2e-16
# c
threepredic = lm(TotalPrice ~ Carat + Depth + Carat * Depth, data = diam)
summary(threepredic)
##
## Call:
## lm(formula = TotalPrice ~ Carat + Depth + Carat * Depth, data = diam)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8254.4 -1311.5 -157.2 1131.8 14513.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31171.41 4219.58 7.387 1.13e-12 ***
## Carat -11827.73 3436.47 -3.442 0.000648 ***
## Depth -598.18 65.47 -9.137 < 2e-16 ***
## Carat:Depth 408.45 51.96 7.861 4.84e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2592 on 347 degrees of freedom
## Multiple R-squared: 0.89, Adjusted R-squared: 0.889
## F-statistic: 935.7 on 3 and 347 DF, p-value: < 2.2e-16
# d
second = lm(TotalPrice ~ Carat + Depth + depthsq + casq + Carat * Depth, data = diam)
summary(second)
##
## Call:
## lm(formula = TotalPrice ~ Carat + Depth + depthsq + casq + Carat *
## Depth, data = diam)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12196.1 -652.7 -38.5 485.7 10582.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24338.820 30297.912 0.803 0.4223
## Carat 7573.620 3040.787 2.491 0.0132 *
## Depth -728.700 904.439 -0.806 0.4210
## depthsq 5.276 6.727 0.784 0.4333
## casq 4761.592 330.246 14.418 <2e-16 ***
## Carat:Depth -83.891 53.530 -1.567 0.1180
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2053 on 345 degrees of freedom
## Multiple R-squared: 0.9313, Adjusted R-squared: 0.9304
## F-statistic: 936.1 on 5 and 345 DF, p-value: < 2.2e-16
TotalPrice = -28406.783 + 766.369(Depth) - 3.233(Depth^2)
R^2 = 0.04748
Adjusted R^2 = 0.042
P-value = 0.0002111
TotalPrice = 1059.24 + 15087.01(Carat) - 134.94(Depth)
R^2 = 0.8704
Adjusted R^2 = 0.8696
Carat p-value < 2e-16
Depth p-value = 1.68e-05
TotalPrice = 31171.41 - 11827.73(Carat) - 598.18(Depth) + 408.45(Carat * Depth)
R^2 = 0.89
Adjusted R^2 = 0.889
Carat p-value = 0.000648
Depth p-value = < 2e-16
Carat * Depth p-value = 4.84e-14
TotalPrice = 24338.82 + 7573.62(Carat) - 728.70(Depth) + 5.276(Depth^2) + 4761.59(Carat^2) - 83.89(Carat * Depth)
R^2 = 0.9313
Adjusted R^2 = 0.9304
Carat p-value = 0.0132
Depth p-value = 0.4210
Depth^2 p-value = 0.4333
Carat^2 p-value = <2e-16
Carat * Depth p-value = 0.1180
Among these models, the three-predictor model (Carat, Depth, and their interaction) should be used to predict the TotalPrice of diamonds. It accounts for a high amount of the variability (R^2 = 89%), and every term in it has a low p-value, meaning each has a strong relationship with the response, TotalPrice. The full second-order model has a slightly higher R^2, but several of its terms (Depth, Depth^2, and the interaction) are not significant.
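Since the two-predictor model is nested inside the three-predictor model, a nested F-test is one way to confirm that the interaction term earns its keep:
# tests whether adding Carat:Depth significantly improves the fit
anova(twopredic, threepredic)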
plot(threepredic)
# log model
logthree = lm(log(TotalPrice) ~ Carat + Depth + Carat * Depth, data = diam)
summary(logthree)
##
## Call:
## lm(formula = log(TotalPrice) ~ Carat + Depth + Carat * Depth,
## data = diam)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.36271 -0.14008 0.03185 0.18673 0.93288
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.846674 0.489146 7.864 4.75e-14 ***
## Carat 4.869049 0.398366 12.223 < 2e-16 ***
## Depth 0.046610 0.007589 6.142 2.24e-09 ***
## Carat:Depth -0.048814 0.006023 -8.105 9.15e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3005 on 347 degrees of freedom
## Multiple R-squared: 0.8808, Adjusted R-squared: 0.8798
## F-statistic: 854.8 on 3 and 347 DF, p-value: < 2.2e-16
plot(logthree)
As the depth of the diamond increases, the model for TotalPrice becomes less accurate: the constant variance condition is violated, and the residuals also depart from normality.
After taking the log of TotalPrice, the constant variance condition is met much better than before.
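One caveat to remember with the log model: its predictions are on the log scale, so they need to be exponentiated to get back to dollars. A sketch with made-up input values:
# rough back-transformed price estimate for a hypothetical 1-carat, depth-60 diamond
exp(predict(logthree, newdata = data.frame(Carat = 1, Depth = 60)))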
poll = read.csv("~/downloads/Pollster08.csv")
dayssq = poll$Days^2
quadpoll = lm(Margin ~ Days + dayssq, data = poll)
summary(quadpoll)
# find the sum of squares; note this formula gives the model sum of squares,
# sum((fitted - mean)^2); the SSE itself would be sum(resid(quadpoll)^2)
ssepoll = sum((fitted(quadpoll) - mean(poll$Margin))^2)
ssepoll
# subset polls from before and after Sep 15 using the Charlie indicator
before = subset(poll, Charlie <= 0)
after = subset(poll, Charlie >= 1)
# fit a separate model to each period (a single lm mixing the two subsets would error)
beforepoll = lm(Margin ~ Days, data = before)
afterpoll = lm(Margin ~ Days, data = after)
The SSE (sum of squared errors) is equal to 483.0802, and the R^2 value is 0.3495.