Summary

This week, I worked through Chapter 3 of the Stat2 book, on multiple regression. This chapter introduces the topic that gives this project its name, and it lays the foundation for everything I will do with multiple regression for years to come.

While working through the chapters, I like to follow along with the examples in the book, which means learning how to code each technique in R. I thought constructing a multiple regression (MR) model in R would be very difficult, when in reality it consists of choosing a response variable and then adding the predictor variables you want to regress it on. I learned how to create different types of plots, such as QQ plots, and how to correctly subset the data to be used in an MR model. I also found out how to compute a correlation matrix, which compiles variables from a dataset and shows the correlations between every pair of them. This is helpful when choosing predictors.
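
To make this concrete, here is a minimal sketch of both steps (the data frame df and the columns y, x1, and x2 are hypothetical placeholders, not from the book):

# fit an MR model: the response on the left, predictors joined by +
fit = lm(y ~ x1 + x2, data = df)
# correlation matrix of the response and candidate predictors
cor(df[, c("y", "x1", "x2")])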

From my prior experience I was well acquainted with scatterplots, but the book introduced a different type, the scatterplot matrix, in which R displays a scatterplot for every pair of variables. When I first thought about multiple regression as a topic, I was curious how statisticians actually choose the variables they are going to use in a model, and this was a great introduction to that process. Scatterplot matrices are a great way to see what kind of correlation (positive, negative, or none) is present in the data, and correlation matrices put a number alongside each visual.
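
In base R, a scatterplot matrix takes a single call; a sketch using the same hypothetical data frame as above:

# one scatterplot for every pair of variables
pairs(df[, c("y", "x1", "x2")])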

I remember an example in the book that used both correlation and scatterplot matrices to choose variables for a model, but after some exploration had to redo the process. When I was tinkering with datasets outside the book, attempting to create MR models, I was intimidated by the sheer amount of data and by the fear of putting in the wrong variable and creating a bad model. Observing the analytical way the book went back and built a new model gave me peace of mind, and also helped me realize that every model is specific to its creator, and that a truly “perfect” model will never be created.

I was also interested in how the conditions of an MR model are checked, curious to see whether they would differ from those of a simple linear model. I learned that residual plots are still used to check conditions such as linearity and equal variance, that MR introduces new concerns such as multicollinearity among the predictors, and that checking normality calls for another graph, the quantile-quantile (QQ) plot. The purpose of a QQ plot is to show whether data plausibly come from a normal distribution.
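
For a fitted lm object, the built-in diagnostic plots cover these checks; a minimal sketch with the hypothetical model from above:

plot(fit, which = 1)  # residuals vs. fitted: checks linearity and equal variance
plot(fit, which = 2)  # normal QQ plot: checks normality of the residuals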

Doing the homework, I was able to incorporate the ANOVA I had learned in the previous chapter alongside the creation of MR models. Like I said earlier, this chapter put my R skills to the test. For example, problem 3.23 had me create a scatterplot of brain pH against Age, which seems simple enough, but the problem asked me to make Sex a grouping variable. After searching the internet for a while, I figured out how to group the data within a ggplot scatterplot by mapping color to Sex, the grouping variable. Learning this proved valuable: viewed as plain black points, the scatterplot suggests a positive correlation between the variables, but plotting separate regression lines by sex shows a positive slope for women and a negative slope for men.

This week, I feel I have established a solid foundation in my understanding of MR, and I look forward to strengthening it in the next chapter, where I will learn how to apply other tools, such as ANOVA, to an MR model.

# import libraries
library(ggplot2)
library(stats)

3.18 Real Estate near Rails-to-Trails

rails = read.csv("~/downloads/Rails.csv")
# a
railslm = lm(adj2007 ~ distance, data = rails)
plot(rails$distance, rails$adj2007)

summary(railslm)
## 
## Call:
## lm(formula = adj2007 ~ distance, data = rails)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -190.55  -58.19  -17.48   25.22  444.41 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  388.204     14.052  27.626  < 2e-16 ***
## distance     -54.427      9.659  -5.635 1.56e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 92.13 on 102 degrees of freedom
## Multiple R-squared:  0.2374, Adjusted R-squared:  0.2299 
## F-statistic: 31.75 on 1 and 102 DF,  p-value: 1.562e-07
# b
railsmr = lm(adj2007 ~ distance + squarefeet, data = rails)
summary(railsmr)
## 
## Call:
## lm(formula = adj2007 ~ distance + squarefeet, data = rails)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -138.835  -32.621   -1.903   27.369  145.504 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  109.742     20.057   5.472 3.25e-07 ***
## distance     -16.486      5.942  -2.775  0.00659 ** 
## squarefeet   150.780      9.998  15.080  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 51.34 on 101 degrees of freedom
## Multiple R-squared:  0.7655, Adjusted R-squared:  0.7608 
## F-statistic: 164.8 on 2 and 101 DF,  p-value: < 2.2e-16
# c
confint(railslm)
##                2.5 %    97.5 %
## (Intercept) 360.3317 416.07588
## distance    -73.5859 -35.26851
confint(railsmr)
##                 2.5 %     97.5 %
## (Intercept)  69.95460 149.530197
## distance    -28.27307  -4.698861
## squarefeet  130.94601 170.614247
a.

adj2007(hat) = 388.204 - 54.427(distance)

r^2 = 23.74%

b.

adj2007(hat) = 109.742 - 16.486(distance) + 150.78(squarefeet)

R^2 = 76.55%

Adding squarefeet did change the estimate: the distance coefficient moved from -54.43 to -16.49, and the R^2 value dramatically increased. The coefficient for squarefeet is positive and highly significant (p < 2e-16). Overall, this means that the combination of distance and squarefeet explains far more of the variability in price than distance does alone.

c.

For the simple linear model, the 95% confidence interval for the distance coefficient is (-73.59, -35.27), while in the MR model it is (-28.27, -4.70). The narrower interval in the MR model means that, once squarefeet is accounted for, the effect of distance is estimated more precisely (and is smaller in magnitude).

d.

Since squarefeet is recorded in thousands of square feet and adj2007 in thousands of dollars, using the MR equation with distance = 0.5 miles and squarefeet = 1.5 gives 109.742 - 16.486(0.5) + 150.78(1.5) = 327.669, an estimated adjusted 2007 price of about $327,700.
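
Rather than plugging into the equation by hand, predict() gives the same estimate from the fitted model above (again assuming squarefeet is recorded in thousands):

# predicted adj2007 price (in $1000s) at 0.5 miles and 1,500 square feet
predict(railsmr, newdata = data.frame(distance = 0.5, squarefeet = 1.5))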

3.21 Enrollments in mathematics courses

math = read.csv("~/downloads/MthEnr.csv")

# a
mathlm = lm(Spring ~ Fall + AYear, data = math)
summary(mathlm)
## 
## Call:
## lm(formula = Spring ~ Fall + AYear, data = math)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -30.613 -23.022   5.416   7.541  55.357 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8743.3901  6141.5341  -1.424    0.192
## Fall           -0.2021     0.3589  -0.563    0.589
## AYear           4.5159     3.0492   1.481    0.177
## 
## Residual standard error: 31.09 on 8 degrees of freedom
## Multiple R-squared:  0.2773, Adjusted R-squared:  0.09663 
## F-statistic: 1.535 on 2 and 8 DF,  p-value: 0.2728
# c
anova(mathlm)
## Analysis of Variance Table
## 
## Response: Spring
##           Df Sum Sq Mean Sq F value Pr(>F)
## Fall       1  847.0  847.00  0.8763 0.3766
## AYear      1 2120.1 2120.06  2.1934 0.1769
## Residuals  8 7732.6  966.57
a.

The R^2 value of 0.2773 means that 27.73% of the variation in Spring enrollment can be explained by Fall enrollment and academic year together.

b.

The residual standard error for this model is 31.09.

c.

The overall F-statistic is 1.535 (which, because each predictor has one degree of freedom, equals the average of the two F-values in the ANOVA table). The p-values are all quite high, so we fail to reject the null hypothesis that neither predictor is related to Spring enrollment.
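
As a check, the overall F-statistic can be recomputed from the ANOVA table, since F = (model SS / model df) / MSE:

# (SSModel / 2) / MSE from the ANOVA table above
((847.0 + 2120.1) / 2) / 966.57  # roughly 1.535, matching the summary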

3.23 Brain pH

# a
brain = read.csv("~/downloads/BrainpH.csv")

ggplot(brain, aes(x = Age, y = pH, color = Sex)) +  # color groups the points by Sex
  geom_point() +
  geom_smooth(method = 'lm', se = FALSE,   # separate regression line for each group
              fullrange = TRUE)
## `geom_smooth()` using formula 'y ~ x'

# b
brainlm = lm(pH ~ Age, data = brain)
summary(brainlm)
## 
## Call:
## lm(formula = pH ~ Age, data = brain)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.56976 -0.21781  0.02032  0.16801  0.38649 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.8881113  0.1321194   52.13   <2e-16 ***
## Age         -0.0003905  0.0022944   -0.17    0.866    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.235 on 52 degrees of freedom
## Multiple R-squared:  0.0005566,  Adjusted R-squared:  -0.01866 
## F-statistic: 0.02896 on 1 and 52 DF,  p-value: 0.8655
# c
brainmale = subset(brain, Sex == "M")
brainmlm = lm(pH ~ Age, data = brainmale)
summary(brainmlm)
## 
## Call:
## lm(formula = pH ~ Age, data = brainmale)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.45463 -0.18439  0.02408  0.16533  0.40449 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.989109   0.129547  53.950   <2e-16 ***
## Age         -0.002279   0.002289  -0.996    0.325    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2175 on 42 degrees of freedom
## Multiple R-squared:  0.02306,    Adjusted R-squared:  -0.000199 
## F-statistic: 0.9914 on 1 and 42 DF,  p-value: 0.3251
brainfemale = subset(brain, Sex == "F")
brainflm = lm(pH ~ Age, data = brainfemale)
summary(brainflm)
## 
## Call:
## lm(formula = pH ~ Age, data = brainfemale)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.42393 -0.09830  0.03404  0.16394  0.39209 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.03963    0.50386  11.987 2.16e-06 ***
## Age          0.01374    0.00816   1.684    0.131    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2781 on 8 degrees of freedom
## Multiple R-squared:  0.2617, Adjusted R-squared:  0.1694 
## F-statistic: 2.835 on 1 and 8 DF,  p-value: 0.1307
a.

This plot shows that the correlation between pH and Age is positive for women while it is negative for men. This should be interpreted cautiously, as there are far more data points for men than for women.

b.

The R^2 value of 0.0005566 means that only about 0.06% of the variability in pH can be explained by the model, so there is essentially no overall linear association between the variables.

c.

Female: pH(hat) = 6.04 + 0.0137(Age)

Male: pH(hat) = 6.99 - 0.0023(Age)
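
The two separate fits can also be combined into a single model with an interaction term; a sketch (brainint is a new name, not from the book):

brainint = lm(pH ~ Age * Sex, data = brain)
summary(brainint)  # the Age:Sex coefficient estimates the difference in slopes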

3.32 Driving fatalities and speed limits

# a
speed = read.csv("~/downloads/Speed.csv")
speedlm = lm(FatalityRate ~ Year, data = speed)
summary(speedlm)
## 
## Call:
## lm(formula = FatalityRate ~ Year, data = speed)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.18959 -0.07550 -0.02576  0.09346  0.24606 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 91.320887   8.374227    10.9 1.28e-09 ***
## Year        -0.044870   0.004193   -10.7 1.75e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1164 on 19 degrees of freedom
## Multiple R-squared:  0.8577, Adjusted R-squared:  0.8502 
## F-statistic: 114.5 on 1 and 19 DF,  p-value: 1.75e-09
# b
resSpeed = resid(speedlm)
plot(fitted(speedlm), resSpeed, pch = 19)  # residuals vs. fitted values
abline(h = 0)  # horizontal reference line at zero

barplot(resid(speedlm))  # residuals in the order of the data (by year)

plot(speedlm)  # built-in diagnostic plots, including the normal QQ plot

# c
speedmr = lm(FatalityRate ~ Year + StateControl, data = speed)
summary(speedmr)
## 
## Call:
## lm(formula = FatalityRate ~ Year + StateControl, data = speed)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.20225 -0.06443 -0.02077  0.08021  0.24859 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  85.278326  15.809290   5.394 3.99e-05 ***
## Year         -0.041830   0.007942  -5.267 5.23e-05 ***
## StateControl -0.045012   0.099035  -0.455    0.655    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1189 on 18 degrees of freedom
## Multiple R-squared:  0.8593, Adjusted R-squared:  0.8437 
## F-statistic: 54.96 on 2 and 18 DF,  p-value: 2.163e-08
plot(speedmr)

# subset by year
speedbefore = subset(speed, Year <= 1994)
speedafter = subset(speed, Year >= 1995)

speedlmbefore = lm(FatalityRate ~ Year + StateControl, data = speedbefore)
summary(speedlmbefore)
## 
## Call:
## lm(formula = FatalityRate ~ Year + StateControl, data = speedbefore)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.103571 -0.017619  0.007619  0.022738  0.091667 
## 
## Coefficients: (1 not defined because of singularities)
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  216.23071   19.24719   11.23 2.97e-05 ***
## Year          -0.10762    0.00967  -11.13 3.14e-05 ***
## StateControl        NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06267 on 6 degrees of freedom
## Multiple R-squared:  0.9538, Adjusted R-squared:  0.9461 
## F-statistic: 123.9 on 1 and 6 DF,  p-value: 3.136e-05
speedlmafter = lm(FatalityRate ~ Year + StateControl, data = speedafter)
summary(speedlmafter)
## 
## Call:
## lm(formula = FatalityRate ~ Year + StateControl, data = speedafter)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.034066 -0.020769  0.002527  0.022473  0.035824 
## 
## Coefficients: (1 not defined because of singularities)
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  54.854121   3.754418   14.61 1.50e-08 ***
## Year         -0.026648   0.001876  -14.20 2.02e-08 ***
## StateControl        NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02531 on 11 degrees of freedom
## Multiple R-squared:  0.9483, Adjusted R-squared:  0.9436 
## F-statistic: 201.7 on 1 and 11 DF,  p-value: 2.022e-08
plot(speedlmafter)

a.

FatalityRate(hat) = 91.32 - 0.0449(Year)

b.

The residual plot exhibits a V-shape, indicating that the relationship is not linear. The points also do not have equal spread, so the constant-variance condition is violated as well.

c.

For 1995 and later, the normal QQ plot of the residuals is much closer to a straight line, so the normality condition is better satisfied.

d.

The regression equation for years prior to 1995 is FatalityRate(hat) = 216.23 - 0.1076(Year) with R^2 = 95.38%, while for 1995 and later it is FatalityRate(hat) = 54.85 - 0.0266(Year) with R^2 = 94.83%. Both the intercept and the slope differ substantially between the two periods. (StateControl is constant within each subset, which is why R reports NA for its coefficient in both fits.)
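
Because StateControl is constant within each period, an alternative to subsetting is a single interaction model that lets both the intercept and the slope differ between the two periods; a sketch (speedint is a new name):

speedint = lm(FatalityRate ~ Year * StateControl, data = speed)
summary(speedint)  # Year:StateControl estimates the change in slope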

3.37 Diamond prices: carat only

diam = read.csv("~/downloads/Diamond.csv")

# a
plot(diam$Carat, diam$TotalPrice)

# b
casq = diam$Carat^2  # squared carat term, used in the quadratic model below
lindiam = lm(TotalPrice ~ Carat, data = diam)  # first-order model for comparison
summary(lindiam)
## 
## Call:
## lm(formula = TotalPrice ~ Carat, data = diam)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8701.3 -1481.7  -136.1  1093.7 16310.9 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -7181.3      347.5  -20.66   <2e-16 ***
## Carat        14638.4      311.8   46.95   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2881 on 349 degrees of freedom
## Multiple R-squared:  0.8633, Adjusted R-squared:  0.8629 
## F-statistic:  2204 on 1 and 349 DF,  p-value: < 2.2e-16
quaddiam = lm(TotalPrice ~ Carat + casq, data = diam)
summary(quaddiam)
## 
## Call:
## lm(formula = TotalPrice ~ Carat + casq, data = diam)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10207.4   -711.6   -167.9    355.0  12147.3 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -522.7      466.3  -1.121  0.26307    
## Carat         2386.0      752.5   3.171  0.00166 ** 
## casq          4498.2      263.0  17.101  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2127 on 348 degrees of freedom
## Multiple R-squared:  0.9257, Adjusted R-squared:  0.9253 
## F-statistic:  2168 on 2 and 348 DF,  p-value: < 2.2e-16
plot(quaddiam)

a.

This plot shows that as the carat of the diamond increases there is more variability in TotalPrice, meaning that as carat size increases, it becomes harder to predict the total price from the model.

b.

TotalPrice = -522.70 + 2386.0(Carat) + 4498.2(Carat^2)

R^2 = 0.9257

Adjusted R^2 = 0.9253
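
As a side note, the squared term can be specified inline with I() instead of building a separate casq vector; a sketch of the equivalent fit:

# same quadratic model, with the squared term written in the formula
quaddiam2 = lm(TotalPrice ~ Carat + I(Carat^2), data = diam)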

3.38 Diamond prices: carat and depth

# a
depthsq = diam$Depth^2
casq = diam$Carat^2
diamquad = lm(TotalPrice ~ Depth + depthsq, data = diam)
summary(diamquad)
## 
## Call:
## lm(formula = TotalPrice ~ Depth + depthsq, data = diam)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -9323  -4251  -2676   2134  45513 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -28406.783 112211.790  -0.253    0.800
## Depth          766.369   3353.222   0.229    0.819
## depthsq         -3.233     24.869  -0.130    0.897
## 
## Residual standard error: 7616 on 348 degrees of freedom
## Multiple R-squared:  0.04748,    Adjusted R-squared:  0.042 
## F-statistic: 8.673 on 2 and 348 DF,  p-value: 0.0002111
anova(diamquad)
## Analysis of Variance Table
## 
## Response: TotalPrice
##            Df     Sum Sq    Mean Sq F value    Pr(>F)    
## Depth       1 1.0050e+09 1005028980 17.3283 3.968e-05 ***
## depthsq     1 9.8029e+05     980292  0.0169    0.8966    
## Residuals 348 2.0184e+10   57999426                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# b
twopredic = lm(TotalPrice ~ Carat + Depth, data = diam)
summary(twopredic)
## 
## Call:
## lm(formula = TotalPrice ~ Carat + Depth, data = diam)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9234.7 -1223.7  -274.3  1161.0 16368.6 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1059.24    1918.36   0.552    0.581    
## Carat       15087.01     320.96  47.006  < 2e-16 ***
## Depth        -134.94      30.92  -4.364 1.68e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2809 on 348 degrees of freedom
## Multiple R-squared:  0.8704, Adjusted R-squared:  0.8696 
## F-statistic:  1168 on 2 and 348 DF,  p-value: < 2.2e-16
# c
threepredic = lm(TotalPrice ~ Carat + Depth + Carat * Depth, data = diam)
summary(threepredic)
## 
## Call:
## lm(formula = TotalPrice ~ Carat + Depth + Carat * Depth, data = diam)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8254.4 -1311.5  -157.2  1131.8 14513.9 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  31171.41    4219.58   7.387 1.13e-12 ***
## Carat       -11827.73    3436.47  -3.442 0.000648 ***
## Depth         -598.18      65.47  -9.137  < 2e-16 ***
## Carat:Depth    408.45      51.96   7.861 4.84e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2592 on 347 degrees of freedom
## Multiple R-squared:   0.89,  Adjusted R-squared:  0.889 
## F-statistic: 935.7 on 3 and 347 DF,  p-value: < 2.2e-16
# d
second = lm(TotalPrice ~ Carat + Depth + depthsq + casq + Carat * Depth, data = diam)
summary(second)
## 
## Call:
## lm(formula = TotalPrice ~ Carat + Depth + depthsq + casq + Carat * 
##     Depth, data = diam)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12196.1   -652.7    -38.5    485.7  10582.2 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 24338.820  30297.912   0.803   0.4223    
## Carat        7573.620   3040.787   2.491   0.0132 *  
## Depth        -728.700    904.439  -0.806   0.4210    
## depthsq         5.276      6.727   0.784   0.4333    
## casq         4761.592    330.246  14.418   <2e-16 ***
## Carat:Depth   -83.891     53.530  -1.567   0.1180    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2053 on 345 degrees of freedom
## Multiple R-squared:  0.9313, Adjusted R-squared:  0.9304 
## F-statistic: 936.1 on 5 and 345 DF,  p-value: < 2.2e-16
a.

TotalPrice = -28406.783 + 766.369(Depth) - 3.233(Depth^2)

R^2 = 0.04748

Adjusted R^2 = 0.042

P-value = 0.0002111

b.

TotalPrice = 1059.24 + 15087.01(Carat) - 134.94(Depth)

R^2 = 0.8704

Adjusted R^2 = 0.8696

Carat p-value < 2e-16

Depth p-value = 1.68e-05

c.

TotalPrice = 31171.41 - 11827.73(Carat) - 598.18(Depth) + 408.45(Carat * Depth)

R^2 = 0.89

Adjusted R^2 = 0.889

Carat p-value = 0.000648

Depth p-value < 2e-16

Carat * Depth p-value = 4.84e-14

d.

TotalPrice = 24338.82 + 7573.62(Carat) - 728.70(Depth) + 5.276(Depth^2) + 4761.59(Carat^2) - 83.89(Carat * Depth)

R^2 = 0.9313

Adjusted R^2 = 0.9304

Carat p-value = 0.0132

Depth p-value = 0.4210

Depth^2 p-value = 0.4333

Carat^2 p-value < 2e-16

Carat * Depth p-value = 0.1180

e.

Among these models, the three-predictor model from part c should be used to predict the TotalPrice of diamonds. It explains a high amount of variability, with an R^2 of 89%, and every term in the model has a very small p-value, meaning each has a strong relationship with TotalPrice, the response variable. The larger second-order model in part d has a slightly higher R^2 (0.9313), but several of its terms are not significant.
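
Since the three-predictor model is nested inside the full second-order model from part d, a nested F-test can formalize this comparison; a sketch using the two models already fit:

# does adding depthsq and casq significantly improve on the interaction model?
anova(threepredic, second)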

3.39 Diamond prices: transformation

plot(threepredic)

# log model
logthree = lm(log(TotalPrice) ~ Carat + Depth + Carat * Depth, data = diam)
summary(logthree)
## 
## Call:
## lm(formula = log(TotalPrice) ~ Carat + Depth + Carat * Depth, 
##     data = diam)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.36271 -0.14008  0.03185  0.18673  0.93288 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.846674   0.489146   7.864 4.75e-14 ***
## Carat        4.869049   0.398366  12.223  < 2e-16 ***
## Depth        0.046610   0.007589   6.142 2.24e-09 ***
## Carat:Depth -0.048814   0.006023  -8.105 9.15e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3005 on 347 degrees of freedom
## Multiple R-squared:  0.8808, Adjusted R-squared:  0.8798 
## F-statistic: 854.8 on 3 and 347 DF,  p-value: < 2.2e-16
plot(logthree)

a.

As the predicted TotalPrice increases, the residuals spread out more, so the model becomes less accurate for larger diamonds. This means the constant-variance condition is not met, and the normal QQ plot shows that the residuals are not normal either.

b.

After taking the log of TotalPrice, the residual spread is much more uniform, so the constant-variance condition is satisfied far better than before.

3.43 2008 U.S. Presidential polls

poll = read.csv("~/downloads/Pollster08.csv")
dayssq = poll$Days^2
quadpoll = lm(Margin ~ Days + dayssq, data = poll)
summary(quadpoll)

# find the SSE (sum of squared residuals)
ssepoll = sum(resid(quadpoll)^2)
ssepoll

# subset dates before and after Sep 15
before = subset(poll, Charlie <= 0)
after = subset(poll, Charlie >= 1)

# fit the trend separately to each period
beforelm = lm(Margin ~ Days, data = before)
afterlm = lm(Margin ~ Days, data = after)

a.

The SSE (sum of squared errors) is equal to 483.0802, and the R^2 value is 0.3495.