HW11_605

jbrnbrg

November 8, 2017

library(tidyverse)
data(cars)

Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis.)


Cars:

The data give the speed of cars and the distances taken to stop. Note that the data were recorded in the 1920s.

Plot:

The response variable is the breaking distance and the explanatory variable is the vehicle speed.

ggplot(cars, aes(x = speed, y = dist)) + 
  geom_point(size = 2, alpha = .4) +
  geom_smooth(method = "lm", se = FALSE, alpha = .2) +
  labs(title = "Speed vs Stopping Distance", 
       x = "Speed (mph)", 
       y = "Stopping distance (ft)") 

Linear Model:

speed_v_stop <- lm(speed~dist, data = cars)

Residuals:

Here I’m looking for constant variability: The below plot depicts the residuals of my model and they appear to be kind of regularly dispersed about zero - I’m not absolutely convinced this model is displaying constant variability.

ggplot(data = cars, aes(x=speed, y=speed_v_stop$residuals)) + 
  geom_point(size = 2, alpha = .4) + 
  geom_abline(intercept = 0, slope = 0, color = "red") +
  theme(panel.grid.major = element_line(color = "gray")) +
  labs(title = "Speed vs Linear Model Residuals", 
       x = "Speed (mph)", 
       y = "Model Residuals") 

To investigate further, I’ll need to check for Nearly Normal Residuals: below, the qqnorm normal probability plot of the linear model’s residuals supports a normal determination although there are tails.

qqnorm(speed_v_stop$residuals)
qqline(speed_v_stop$residuals)

Lastly, I’ll review the histogram and see if it appears normal in nature:

hist(speed_v_stop$residuals)

I’m cautiously optimistic that there’s a relationship here given the mildly normal nature of the above histogram. To test this, I’ll need to use inference.

Inference:

\(H_0:\) Speed and stopping distance are unrelated (i.e. \(\beta_1 = 0\))
\(H_A:\) Speed and stopping distance are positively correlated (i.e. \(\beta_1 > 0\))

Conclusion:

summary(speed_v_stop)
## 
## Call:
## lm(formula = speed ~ dist, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.5293 -2.1550  0.3615  2.4377  6.4179 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.28391    0.87438   9.474 1.44e-12 ***
## dist         0.16557    0.01749   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.156 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

The summary of the linear model gives me the equation for the model:

\[\text{Break Distance} = 8.28391 + 0.16557\times \text{Speed}\]

With a resulting p-value close to zero, I’m happy to reject \(H_0\) in favor of \(H_A\) and am confident this model describes the relationship well.


Processor Performance Results:

Note: Much of the text, guidance, and data comes from Linear Regression Using R: An Introduction to Data Modeling:

The data frame int92.dat contains the data from the CPU DB database for all of the processors for which performance results were available for the SPEC Integer 1992 (Int1992) benchmark program. Similarly, fp92.dat contains the data for the processors that executed the Floating-Point 1992 (Fp1992) benchmarks, and so on.

# read in the data as instructed on pg11 of text
head(int00.dat)
##      nperf      perf clock threads cores TDP transistors dieSize voltage
## 1 11.07218  425.6607  1200       1     1  NA          NA      NA      NA
## 2 11.48725  438.0000  1300       1     1  NA          NA      NA      NA
## 3 12.96734  482.0000  1333       1     1  NA          NA      NA      NA
## 4 13.40464  495.0000  1400       1     1  56          NA     120     1.8
## 5 39.33987 1266.0000  2000       1     1  NA          NA      NA      NA
## 6 39.44078 1269.0000  2600       1     1  NA          NA      NA      NA
##   featureSize channel FO4delay L1icache L1dcache L2cache L3cache
## 1        0.18    0.10     36.0       64       64     256      NA
## 2        0.18    0.10     36.0       64       64     256      NA
## 3        0.18    0.10     36.0       64       64     256      NA
## 4        0.18    0.10     36.0       64       64     256      NA
## 5        0.13    0.07     25.2       64       64    1024      NA
## 6        0.13    0.07     25.2       64       64    1024      NA

Plot:

Review the plot of Performance v Clock:

ggplot(data = int00.dat, aes(x=clock, y=perf)) + 
  geom_point(size = 2, alpha = .4) +
  geom_smooth(method = "lm", se = FALSE, alpha = .2) +
  theme(panel.grid.major = element_line(color = "gray")) +
  labs(title = "Clock vs. Performance",
       subtitle = "Int2000 benchmark versus the clock frequency",
       x = "Clock", 
       y = "Performance") 

#plot(fitted(int00.lm),resid(int00.lm))
#int00.lm$fitted.values

The Linear Model & Residuals:

clk_v_perf <- lm(clock~perf, data = int00.dat)

ggplot(data = int00.dat, aes(x=clock, y=clk_v_perf$residuals)) + 
  geom_point(size = 2, alpha = .4) + 
  geom_abline(intercept = 0, slope = 0, color = "red") +
  theme(panel.grid.major = element_line(color = "gray")) +
  labs(title = "Speed vs Linear Model Residuals", 
       x = "Speed (mph)", 
       y = "Model Residuals") 

And the associated qq plots

qqnorm(clk_v_perf$residuals)
qqline(clk_v_perf$residuals)

hist(clk_v_perf$residuals)

There is some skewness out the higher end of the sample quartiles which suggests that the distribution’s tails are “heavier” than what I would expect from a normal distribution. The clock won’t be enough to explain the data.

Multi-factor Regression:

Looking vertical down the perf column, we can see some possible candidates for regression - or rather, other candidates for explanatory variable of the performance perf. Note that nperf is just a linear re-scaling of perf and will be used going forward:

pairs(int00.dat, gap = 0.5)

Below, I select multiple possible predictors noting:

pg29 of an Introduction to Data Modeling:

However, a good regression model explains the relationship between a system’s inputs and output as simply as possible. Thus, we should use the smallest number of predictors necessary to provide good predictions. Furthermore, using too many or redundant predictors builds the random noise in the data into the model. In this situation, we obtain an over-fitted model that is very good at predicting the outputs from the specific input data set used to train the model. It does not accurately model the overall system’s response, though, and it will not appropriately predict the system output for a broader range of inputs than those on which it was trained. Redundant or unnecessary predictors also can lead to numerical instabilities when computing the coefficients.

We use \(R_{\text{adjusted}} = 1 - \frac{n-1}{n-m}(1- R^2)\) where \(n\) is the number of observations and \(m\) is the number of predictors in the model.

Too many predictors will train our model to follow the data’s random variations (noise) too closely. Too few predictors will produce a model that may not be as accurate at predicting future values as a model with more predictors.

I’ll use backwards elimination per the text to help decide which predictors to keep in our model and which to exclude. We’ll start with all possible predictors and use lm() to compute the model with intention that my threshold for meaningfulness will be: \(p=0.05\).

my_multi <- lm(nperf ~ clock + threads + cores + transistors + 
                 dieSize + voltage + featureSize + channel + 
                 FO4delay + L1icache + sqrt(L1icache) + L1dcache + 
                 sqrt(L1dcache) + L2cache + sqrt(L2cache), data=int00.dat)
summary(my_multi)
## 
## Call:
## lm(formula = nperf ~ clock + threads + cores + transistors + 
##     dieSize + voltage + featureSize + channel + FO4delay + L1icache + 
##     sqrt(L1icache) + L1dcache + sqrt(L1dcache) + L2cache + sqrt(L2cache), 
##     data = int00.dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.804  -2.702   0.000   2.285   9.809 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -2.108e+01  7.852e+01  -0.268  0.78927    
## clock           2.605e-02  1.671e-03  15.594  < 2e-16 ***
## threads        -2.346e+00  2.089e+00  -1.123  0.26596    
## cores           2.246e+00  1.782e+00   1.260  0.21235    
## transistors    -5.580e-03  1.388e-02  -0.402  0.68897    
## dieSize         1.021e-02  1.746e-02   0.585  0.56084    
## voltage        -2.623e+01  7.698e+00  -3.408  0.00117 ** 
## featureSize     3.101e+01  1.122e+02   0.276  0.78324    
## channel         9.496e+01  5.945e+02   0.160  0.87361    
## FO4delay       -1.765e-02  1.600e+00  -0.011  0.99123    
## L1icache        1.102e+02  4.206e+01   2.619  0.01111 *  
## sqrt(L1icache) -7.390e+02  2.980e+02  -2.480  0.01593 *  
## L1dcache       -1.114e+02  4.019e+01  -2.771  0.00739 ** 
## sqrt(L1dcache)  7.492e+02  2.739e+02   2.735  0.00815 ** 
## L2cache        -9.684e-03  1.745e-03  -5.550 6.57e-07 ***
## sqrt(L2cache)   1.221e+00  2.425e-01   5.034 4.54e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.632 on 61 degrees of freedom
##   (179 observations deleted due to missingness)
## Multiple R-squared:  0.9652, Adjusted R-squared:  0.9566 
## F-statistic: 112.8 on 15 and 61 DF,  p-value: < 2.2e-16

I note that \(R^2\) and \(R_{\text{adjusted}}\) are close to 1 indicating that the model explains nperf values well but I’m not sure if it’s just good at modelling the data’s noise. I apply the backward elimination procedure by identifying the predictor with the largest p-value that exceeds our predetermined threshold of \(p = 0.05\). I note that \(\texttt{FO4delay0}'s\text{ }\text{p-val}=0.99123\) so I’ll run the model again w/out it:

my_multi <- lm(nperf ~ clock + threads + cores + transistors + 
                 dieSize + voltage + featureSize + channel + 
                 L1icache + sqrt(L1icache) + L1dcache + 
                 sqrt(L1dcache) + L2cache + sqrt(L2cache), data=int00.dat)
summary(my_multi)
## 
## Call:
## lm(formula = nperf ~ clock + threads + cores + transistors + 
##     dieSize + voltage + featureSize + channel + L1icache + sqrt(L1icache) + 
##     L1dcache + sqrt(L1dcache) + L2cache + sqrt(L2cache), data = int00.dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.795  -2.714   0.000   2.283   9.809 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -2.088e+01  7.584e+01  -0.275 0.783983    
## clock           2.604e-02  1.563e-03  16.662  < 2e-16 ***
## threads        -2.345e+00  2.070e+00  -1.133 0.261641    
## cores           2.248e+00  1.759e+00   1.278 0.206080    
## transistors    -5.556e-03  1.359e-02  -0.409 0.684020    
## dieSize         1.013e-02  1.571e-02   0.645 0.521488    
## voltage        -2.626e+01  7.302e+00  -3.596 0.000642 ***
## featureSize     3.104e+01  1.113e+02   0.279 0.781232    
## channel         8.855e+01  1.218e+02   0.727 0.469815    
## L1icache        1.103e+02  4.041e+01   2.729 0.008257 ** 
## sqrt(L1icache) -7.398e+02  2.866e+02  -2.581 0.012230 *  
## L1dcache       -1.115e+02  3.859e+01  -2.889 0.005311 ** 
## sqrt(L1dcache)  7.500e+02  2.632e+02   2.849 0.005937 ** 
## L2cache        -9.693e-03  1.494e-03  -6.488 1.64e-08 ***
## sqrt(L2cache)   1.222e+00  1.975e-01   6.189 5.33e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.594 on 62 degrees of freedom
##   (179 observations deleted due to missingness)
## Multiple R-squared:  0.9652, Adjusted R-squared:  0.9573 
## F-statistic: 122.8 on 14 and 62 DF,  p-value: < 2.2e-16

Next, I’ll remove featureSize from the predictors as it’s the next one that exceeds my predetermined threshold of \(p=0.05\). And continue down this route: removing predictors that exceed the threshold and recalculating. After featureSize I removed:

- `transistors`  
- `threads`  
- `dieSize`  

And the resulting summary is below. Note that each of the \(p\)-values are less than the predetermined threshold of \(p=0.05\) and therefore we end our backward elimination process.

my_multi <- lm(nperf ~ clock + cores + voltage + channel + L1icache + 
                 sqrt(L1icache) + L1dcache + sqrt(L1dcache) + 
                 L2cache + sqrt(L2cache), data=int00.dat)
summary(my_multi)
## 
## Call:
## lm(formula = nperf ~ clock + cores + voltage + channel + L1icache + 
##     sqrt(L1icache) + L1dcache + sqrt(L1dcache) + L2cache + sqrt(L2cache), 
##     data = int00.dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.0240  -3.5195   0.3577   2.5486  12.0545 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -5.822e+01  3.840e+01  -1.516 0.133913    
## clock           2.482e-02  1.246e-03  19.922  < 2e-16 ***
## cores           2.397e+00  1.004e+00   2.389 0.019561 *  
## voltage        -2.358e+01  5.495e+00  -4.291 5.52e-05 ***
## channel         1.399e+02  3.960e+01   3.533 0.000726 ***
## L1icache        8.703e+01  1.972e+01   4.412 3.57e-05 ***
## sqrt(L1icache) -5.768e+02  1.391e+02  -4.146 9.24e-05 ***
## L1dcache       -8.903e+01  1.888e+01  -4.716 1.17e-05 ***
## sqrt(L1dcache)  5.980e+02  1.282e+02   4.665 1.41e-05 ***
## L2cache        -8.621e-03  1.273e-03  -6.772 3.07e-09 ***
## sqrt(L2cache)   1.085e+00  1.645e-01   6.598 6.36e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.683 on 71 degrees of freedom
##   (174 observations deleted due to missingness)
## Multiple R-squared:  0.9641, Adjusted R-squared:  0.959 
## F-statistic: 190.7 on 10 and 71 DF,  p-value: < 2.2e-16

Even though 10 is large number of predictors, the criteria has been met and we stick with it for now resulting in our model:

\[ \begin{equation} \begin{split} \text{nperf} =-58.22 + 0.02482\beta_{\text{clock}}+2.397\beta_{\text{cores}} \\ −23.58\beta_{\text{voltage}}+ 139.9\beta_{\text{channel}} + 87.03\beta_{\text{L1icache}} \\ - 576.8\beta_{\text{L1icache}} − 89.03\beta_{\sqrt{L1icache}} + 598\beta_{\sqrt{L1dcache}} \\ − 0.008621\beta_{L2cache} + 1.085\beta_{\sqrt{L2cache}} \end{split} \end{equation} \]

I’ll now look at the residuals to check for normality:

resids <- data.frame(clock = my_multi$fitted.values, 
                     multi_resids = my_multi$residuals)
#sum(resids$multi_resids)
ggplot(data = resids, aes(x=clock, y=multi_resids)) + 
  geom_point(size = 2, alpha = .4) + 
  geom_abline(intercept = 0, slope = 0, color = "red") +
  theme(panel.grid.major = element_line(color = "gray")) +
  labs(title = "Speed vs Mutli-Regression Residuals", 
       x = "Clock", 
       y = "Model Residuals") 

Above, I’m pretty satisfied that the points are normally sprinkled above and below the \(0\) value as expected. I’ll review the qq plot and histogram too:

qqnorm(my_multi$residuals)
qqline(my_multi$residuals)

hist(my_multi$residuals)

Above, the QQ plot shows significant improvement and the histogram appears more normal in nature. I am happy to proceed with this model with the understanding that not all are perfect.