library(tidyverse)
data(cars)
Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis.)
Cars:
The data give the speed of cars and the distances taken to stop. Note that the data were recorded in the 1920s.
Plot:
The response variable is the breaking distance and the explanatory variable is the vehicle speed.
ggplot(cars, aes(x = speed, y = dist)) +
geom_point(size = 2, alpha = .4) +
geom_smooth(method = "lm", se = FALSE, alpha = .2) +
labs(title = "Speed vs Stopping Distance",
x = "Speed (mph)",
y = "Stopping distance (ft)")
Linear Model:
speed_v_stop <- lm(speed~dist, data = cars)
Residuals:
Here I’m looking for constant variability: The below plot depicts the residuals of my model and they appear to be kind of regularly dispersed about zero - I’m not absolutely convinced this model is displaying constant variability.
ggplot(data = cars, aes(x=speed, y=speed_v_stop$residuals)) +
geom_point(size = 2, alpha = .4) +
geom_abline(intercept = 0, slope = 0, color = "red") +
theme(panel.grid.major = element_line(color = "gray")) +
labs(title = "Speed vs Linear Model Residuals",
x = "Speed (mph)",
y = "Model Residuals")
To investigate further, I’ll need to check for Nearly Normal Residuals: below, the qqnorm normal probability plot of the linear model’s residuals supports a normal determination although there are tails.
qqnorm(speed_v_stop$residuals)
qqline(speed_v_stop$residuals)
Lastly, I’ll review the histogram and see if it appears normal in nature:
hist(speed_v_stop$residuals)
I’m cautiously optimistic that there’s a relationship here given the mildly normal nature of the above histogram. To test this, I’ll need to use inference.
Inference:
\(H_0:\) Speed and stopping distance are unrelated (i.e. \(\beta_1 = 0\))
\(H_A:\) Speed and stopping distance are positively correlated (i.e. \(\beta_1 > 0\))
Conclusion:
summary(speed_v_stop)
##
## Call:
## lm(formula = speed ~ dist, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.5293 -2.1550 0.3615 2.4377 6.4179
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.28391 0.87438 9.474 1.44e-12 ***
## dist 0.16557 0.01749 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.156 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
The summary of the linear model gives me the equation for the model:
\[\text{Break Distance} = 8.28391 + 0.16557\times \text{Speed}\]
With a resulting p-value close to zero, I’m happy to reject \(H_0\) in favor of \(H_A\) and am confident this model describes the relationship well.
Processor Performance Results:
Note: Much of the text, guidance, and data comes from Linear Regression Using R: An Introduction to Data Modeling:
The data frame
int92.dat
contains the data from the CPU DB database for all of the processors for which performance results were available for the SPEC Integer 1992 (Int1992) benchmark program. Similarly,fp92.dat
contains the data for the processors that executed the Floating-Point 1992 (Fp1992) benchmarks, and so on.
# read in the data as instructed on pg11 of text
head(int00.dat)
## nperf perf clock threads cores TDP transistors dieSize voltage
## 1 11.07218 425.6607 1200 1 1 NA NA NA NA
## 2 11.48725 438.0000 1300 1 1 NA NA NA NA
## 3 12.96734 482.0000 1333 1 1 NA NA NA NA
## 4 13.40464 495.0000 1400 1 1 56 NA 120 1.8
## 5 39.33987 1266.0000 2000 1 1 NA NA NA NA
## 6 39.44078 1269.0000 2600 1 1 NA NA NA NA
## featureSize channel FO4delay L1icache L1dcache L2cache L3cache
## 1 0.18 0.10 36.0 64 64 256 NA
## 2 0.18 0.10 36.0 64 64 256 NA
## 3 0.18 0.10 36.0 64 64 256 NA
## 4 0.18 0.10 36.0 64 64 256 NA
## 5 0.13 0.07 25.2 64 64 1024 NA
## 6 0.13 0.07 25.2 64 64 1024 NA
Plot:
Review the plot of Performance v Clock:
ggplot(data = int00.dat, aes(x=clock, y=perf)) +
geom_point(size = 2, alpha = .4) +
geom_smooth(method = "lm", se = FALSE, alpha = .2) +
theme(panel.grid.major = element_line(color = "gray")) +
labs(title = "Clock vs. Performance",
subtitle = "Int2000 benchmark versus the clock frequency",
x = "Clock",
y = "Performance")
#plot(fitted(int00.lm),resid(int00.lm))
#int00.lm$fitted.values
The Linear Model & Residuals:
clk_v_perf <- lm(clock~perf, data = int00.dat)
ggplot(data = int00.dat, aes(x=clock, y=clk_v_perf$residuals)) +
geom_point(size = 2, alpha = .4) +
geom_abline(intercept = 0, slope = 0, color = "red") +
theme(panel.grid.major = element_line(color = "gray")) +
labs(title = "Speed vs Linear Model Residuals",
x = "Speed (mph)",
y = "Model Residuals")
And the associated qq plots
qqnorm(clk_v_perf$residuals)
qqline(clk_v_perf$residuals)
hist(clk_v_perf$residuals)
There is some skewness out the higher end of the sample quartiles which suggests that the distribution’s tails are “heavier” than what I would expect from a normal distribution. The clock won’t be enough to explain the data.
Multi-factor Regression:
Looking vertical down the perf column, we can see some possible candidates for regression - or rather, other candidates for explanatory variable of the performance perf
. Note that nperf
is just a linear re-scaling of perf
and will be used going forward:
pairs(int00.dat, gap = 0.5)
Below, I select multiple possible predictors noting:
pg29 of an Introduction to Data Modeling:
However, a good regression model explains the relationship between a system’s inputs and output as simply as possible. Thus, we should use the smallest number of predictors necessary to provide good predictions. Furthermore, using too many or redundant predictors builds the random noise in the data into the model. In this situation, we obtain an over-fitted model that is very good at predicting the outputs from the specific input data set used to train the model. It does not accurately model the overall system’s response, though, and it will not appropriately predict the system output for a broader range of inputs than those on which it was trained. Redundant or unnecessary predictors also can lead to numerical instabilities when computing the coefficients.
We use \(R_{\text{adjusted}} = 1 - \frac{n-1}{n-m}(1- R^2)\) where \(n\) is the number of observations and \(m\) is the number of predictors in the model.
Too many predictors will train our model to follow the data’s random variations (noise) too closely. Too few predictors will produce a model that may not be as accurate at predicting future values as a model with more predictors.
I’ll use backwards elimination per the text to help decide which predictors to keep in our model and which to exclude. We’ll start with all possible predictors and use lm()
to compute the model with intention that my threshold for meaningfulness will be: \(p=0.05\).
my_multi <- lm(nperf ~ clock + threads + cores + transistors +
dieSize + voltage + featureSize + channel +
FO4delay + L1icache + sqrt(L1icache) + L1dcache +
sqrt(L1dcache) + L2cache + sqrt(L2cache), data=int00.dat)
summary(my_multi)
##
## Call:
## lm(formula = nperf ~ clock + threads + cores + transistors +
## dieSize + voltage + featureSize + channel + FO4delay + L1icache +
## sqrt(L1icache) + L1dcache + sqrt(L1dcache) + L2cache + sqrt(L2cache),
## data = int00.dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.804 -2.702 0.000 2.285 9.809
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.108e+01 7.852e+01 -0.268 0.78927
## clock 2.605e-02 1.671e-03 15.594 < 2e-16 ***
## threads -2.346e+00 2.089e+00 -1.123 0.26596
## cores 2.246e+00 1.782e+00 1.260 0.21235
## transistors -5.580e-03 1.388e-02 -0.402 0.68897
## dieSize 1.021e-02 1.746e-02 0.585 0.56084
## voltage -2.623e+01 7.698e+00 -3.408 0.00117 **
## featureSize 3.101e+01 1.122e+02 0.276 0.78324
## channel 9.496e+01 5.945e+02 0.160 0.87361
## FO4delay -1.765e-02 1.600e+00 -0.011 0.99123
## L1icache 1.102e+02 4.206e+01 2.619 0.01111 *
## sqrt(L1icache) -7.390e+02 2.980e+02 -2.480 0.01593 *
## L1dcache -1.114e+02 4.019e+01 -2.771 0.00739 **
## sqrt(L1dcache) 7.492e+02 2.739e+02 2.735 0.00815 **
## L2cache -9.684e-03 1.745e-03 -5.550 6.57e-07 ***
## sqrt(L2cache) 1.221e+00 2.425e-01 5.034 4.54e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.632 on 61 degrees of freedom
## (179 observations deleted due to missingness)
## Multiple R-squared: 0.9652, Adjusted R-squared: 0.9566
## F-statistic: 112.8 on 15 and 61 DF, p-value: < 2.2e-16
I note that \(R^2\) and \(R_{\text{adjusted}}\) are close to 1 indicating that the model explains nperf
values well but I’m not sure if it’s just good at modelling the data’s noise. I apply the backward elimination procedure by identifying the predictor with the largest p-value that exceeds our predetermined threshold of \(p = 0.05\). I note that \(\texttt{FO4delay0}'s\text{ }\text{p-val}=0.99123\) so I’ll run the model again w/out it:
my_multi <- lm(nperf ~ clock + threads + cores + transistors +
dieSize + voltage + featureSize + channel +
L1icache + sqrt(L1icache) + L1dcache +
sqrt(L1dcache) + L2cache + sqrt(L2cache), data=int00.dat)
summary(my_multi)
##
## Call:
## lm(formula = nperf ~ clock + threads + cores + transistors +
## dieSize + voltage + featureSize + channel + L1icache + sqrt(L1icache) +
## L1dcache + sqrt(L1dcache) + L2cache + sqrt(L2cache), data = int00.dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.795 -2.714 0.000 2.283 9.809
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.088e+01 7.584e+01 -0.275 0.783983
## clock 2.604e-02 1.563e-03 16.662 < 2e-16 ***
## threads -2.345e+00 2.070e+00 -1.133 0.261641
## cores 2.248e+00 1.759e+00 1.278 0.206080
## transistors -5.556e-03 1.359e-02 -0.409 0.684020
## dieSize 1.013e-02 1.571e-02 0.645 0.521488
## voltage -2.626e+01 7.302e+00 -3.596 0.000642 ***
## featureSize 3.104e+01 1.113e+02 0.279 0.781232
## channel 8.855e+01 1.218e+02 0.727 0.469815
## L1icache 1.103e+02 4.041e+01 2.729 0.008257 **
## sqrt(L1icache) -7.398e+02 2.866e+02 -2.581 0.012230 *
## L1dcache -1.115e+02 3.859e+01 -2.889 0.005311 **
## sqrt(L1dcache) 7.500e+02 2.632e+02 2.849 0.005937 **
## L2cache -9.693e-03 1.494e-03 -6.488 1.64e-08 ***
## sqrt(L2cache) 1.222e+00 1.975e-01 6.189 5.33e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.594 on 62 degrees of freedom
## (179 observations deleted due to missingness)
## Multiple R-squared: 0.9652, Adjusted R-squared: 0.9573
## F-statistic: 122.8 on 14 and 62 DF, p-value: < 2.2e-16
Next, I’ll remove featureSize
from the predictors as it’s the next one that exceeds my predetermined threshold of \(p=0.05\). And continue down this route: removing predictors that exceed the threshold and recalculating. After featureSize
I removed:
- `transistors`
- `threads`
- `dieSize`
And the resulting summary is below. Note that each of the \(p\)-values are less than the predetermined threshold of \(p=0.05\) and therefore we end our backward elimination process.
my_multi <- lm(nperf ~ clock + cores + voltage + channel + L1icache +
sqrt(L1icache) + L1dcache + sqrt(L1dcache) +
L2cache + sqrt(L2cache), data=int00.dat)
summary(my_multi)
##
## Call:
## lm(formula = nperf ~ clock + cores + voltage + channel + L1icache +
## sqrt(L1icache) + L1dcache + sqrt(L1dcache) + L2cache + sqrt(L2cache),
## data = int00.dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.0240 -3.5195 0.3577 2.5486 12.0545
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.822e+01 3.840e+01 -1.516 0.133913
## clock 2.482e-02 1.246e-03 19.922 < 2e-16 ***
## cores 2.397e+00 1.004e+00 2.389 0.019561 *
## voltage -2.358e+01 5.495e+00 -4.291 5.52e-05 ***
## channel 1.399e+02 3.960e+01 3.533 0.000726 ***
## L1icache 8.703e+01 1.972e+01 4.412 3.57e-05 ***
## sqrt(L1icache) -5.768e+02 1.391e+02 -4.146 9.24e-05 ***
## L1dcache -8.903e+01 1.888e+01 -4.716 1.17e-05 ***
## sqrt(L1dcache) 5.980e+02 1.282e+02 4.665 1.41e-05 ***
## L2cache -8.621e-03 1.273e-03 -6.772 3.07e-09 ***
## sqrt(L2cache) 1.085e+00 1.645e-01 6.598 6.36e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.683 on 71 degrees of freedom
## (174 observations deleted due to missingness)
## Multiple R-squared: 0.9641, Adjusted R-squared: 0.959
## F-statistic: 190.7 on 10 and 71 DF, p-value: < 2.2e-16
Even though 10 is large number of predictors, the criteria has been met and we stick with it for now resulting in our model:
\[ \begin{equation} \begin{split} \text{nperf} =-58.22 + 0.02482\beta_{\text{clock}}+2.397\beta_{\text{cores}} \\ −23.58\beta_{\text{voltage}}+ 139.9\beta_{\text{channel}} + 87.03\beta_{\text{L1icache}} \\ - 576.8\beta_{\text{L1icache}} − 89.03\beta_{\sqrt{L1icache}} + 598\beta_{\sqrt{L1dcache}} \\ − 0.008621\beta_{L2cache} + 1.085\beta_{\sqrt{L2cache}} \end{split} \end{equation} \]
I’ll now look at the residuals to check for normality:
resids <- data.frame(clock = my_multi$fitted.values,
multi_resids = my_multi$residuals)
#sum(resids$multi_resids)
ggplot(data = resids, aes(x=clock, y=multi_resids)) +
geom_point(size = 2, alpha = .4) +
geom_abline(intercept = 0, slope = 0, color = "red") +
theme(panel.grid.major = element_line(color = "gray")) +
labs(title = "Speed vs Mutli-Regression Residuals",
x = "Clock",
y = "Model Residuals")
Above, I’m pretty satisfied that the points are normally sprinkled above and below the \(0\) value as expected. I’ll review the qq plot and histogram too:
qqnorm(my_multi$residuals)
qqline(my_multi$residuals)
hist(my_multi$residuals)
Above, the QQ plot shows significant improvement and the histogram appears more normal in nature. I am happy to proceed with this model with the understanding that not all are perfect.