library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.1 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library()
Question 1: Demonstration of the Central Limit Theorem — If you have a sample of 25 numbers drawn from a population with a mean of 60, and a standard deviation of 20. What is the expected sample mean and the expected sample standard deviation? What is the expected standard error of the mean? Do about 1000 simulations of sampling N=25 from a population distributed as N(μ = 60,σ2 = 202) and confirm that the mean of those samples is close to the expected mean, that the standard deviation of those samples is close to the expected standard deviation, and the standard error of the mean is close to the expected standard error of the mean. How would you describe the difference between the expected standard deviation of the sample and the expected standard error of the mean?
n <- 25
pop_mean <- 60
pop_sd <- 20
n_sim <- 1000
n_means <- rep(0, n_sim)
n_sd <- rep(0, n_sim)
n_se <- rep(0, n_sim)
for (i in 1:n_sim) {
sample <- rnorm(n, mean = pop_mean, sd = pop_sd)
n_means[i] <- mean(sample)
n_sd[i] <- sd(sample)
n_se[i] <- pop_sd/sqrt(n)
}
mean(n_means)
## [1] 59.86853
mean(n_sd)
## [1] 19.78673
mean(n_se)
## [1] 4
The expected mean and standard deviation of the sample is expected to be similar to that of the population. The standard error is the expected deviation of the sampling distributions’ mean from the population mean. The sample means are 59.97 which is very close to population mean 60 and the sample standard deviations is 19.79 which is very close to the population sd of 20. The expected sd of the sample is representative of the variability within the individual samples. The expected se of the mean is representative of the variability of the samples as a whole around the population’s mean.
Question 2:
crash_data <- read.csv("crash.csv")
Part A: Fit at least two models each with at least two inputs to understand the relationships between the variables. Explain your choices of predictor and outcome variables as well as any transformations you performed before fitting the models. Consider at least one model with interactions.
One predictor models
lm_no_rest_used <- lm(Alc_10 ~ KillUnre, data = crash_data)
summary(lm_no_rest_used)
##
## Call:
## lm(formula = Alc_10 ~ KillUnre, data = crash_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.6287 -5.0559 -0.6566 3.7442 11.5288
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31.16397 5.29762 5.883 3.55e-07 ***
## KillUnre 0.02812 0.09176 0.306 0.761
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.032 on 49 degrees of freedom
## Multiple R-squared: 0.001913, Adjusted R-squared: -0.01846
## F-statistic: 0.09391 on 1 and 49 DF, p-value: 0.7606
In the above model, I am looking at how well the percentage of those killed while not using restraint predicts the percent of fatalities that involved blood alcohol levels above .10g/dl. When the percentage of those killed while not using restraint is 0, or held constant, the percent of fatalities involving a BAC > .10g/dl is 31.16. For every one unit increase in the percent of those killed while using restraint there is a .03 increase in the fatalities involving a BAC > .10g/dl. This model accounts for 0% of variance (or -0.01846 according to the summary function output).
lm_rest_used <- lm(Alc_10 ~ KillRest, data = crash_data)
summary(lm_rest_used)
##
## Call:
## lm(formula = Alc_10 ~ KillRest, data = crash_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.7179 -4.8326 -0.5475 3.6748 12.2196
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.26163 3.71304 9.497 1.09e-12 ***
## KillRest -0.07324 0.10616 -0.690 0.494
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.009 on 49 degrees of freedom
## Multiple R-squared: 0.00962, Adjusted R-squared: -0.01059
## F-statistic: 0.476 on 1 and 49 DF, p-value: 0.4935
In the above model, I am looking at how well the percentage of those killed while using restraint predicts the percent of fatalities that involved blood alcohol levels above .10g/dl. When the percentage of those killed while using restraint is 0 the percent of fatalities involving a BAC > .10g/dl is 35.26. For every one unit increase in the percent of those killed while using restraint there is a .07 decrease in the fatalities involving a BAC > .10g/dl. This model accounts for 0% of variance.
Two predictor models:
lm_restraint <- lm(Alc_10 ~ KillRest + BeltUse, data = crash_data)
summary(lm_restraint)
##
## Call:
## lm(formula = Alc_10 ~ KillRest + BeltUse, data = crash_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.6020 -4.5548 -0.1099 3.5751 12.6953
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.5324 4.6347 6.372 7.34e-08 ***
## KillRest -0.2586 0.1475 -1.753 0.0862 .
## BeltUse 0.1883 0.0994 1.894 0.0643 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.888 on 47 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.07742, Adjusted R-squared: 0.03816
## F-statistic: 1.972 on 2 and 47 DF, p-value: 0.1505
In the above model, I am looking at how the state percentage of those killed while using restraint and the state estimated seat belt use rate predicts the percent of fatalities that involved blood alcohol levels above .10g/dl. When both predictors or independent variables are 0 the percent of fatalities is 29.53. For every one unit increase in those killed while using restraint there is a .26 decrease in fatalities involving a BAC > .10g/dl. For every one unit increase in state compliance with seat belt use there is a .19 unit increase in fatalities involving a BAC > .10g/dl. This model accounts for 4% of the total variance.
lm_no_restraint <- lm(Alc_10 ~ KillUnre + BeltUse, data = crash_data)
summary(lm_no_restraint)
##
## Call:
## lm(formula = Alc_10 ~ KillUnre + BeltUse, data = crash_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.8215 -5.1300 -0.7285 2.8652 13.4944
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.4565 11.6932 1.578 0.121
## KillUnre 0.1138 0.1218 0.934 0.355
## BeltUse 0.1219 0.0939 1.298 0.200
##
## Residual standard error: 6.022 on 47 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.03503, Adjusted R-squared: -0.006034
## F-statistic: 0.853 on 2 and 47 DF, p-value: 0.4326
In the above model, I am looking at how the state percentage of those killed while not using restraint and the state compliance with seat belt use predicts the percent of fatalities that involved blood alcohol levels above .10g/dl. When both predictors or independent variables are 0 the percent of fatalities involving a blood alcohol level above .10g/dl is 18.46. For every one unit increase in those killed while not using restraint there is a .11 increase in fatalities involving a BAC > .10g/dl. For every one unit increase in state compliance with seat belt use there is a .12 unit increase in fatalities involving a BAC > .10g/dl. This model accounts for 0% of the total variance.
model with an interaction
lm_restraint_interaction <- lm(Alc_10 ~ KillRest * BeltUse, data = crash_data)
summary(lm_restraint_interaction)
##
## Call:
## lm(formula = Alc_10 ~ KillRest * BeltUse, data = crash_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.2672 -4.2410 -0.7067 3.8862 12.5998
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41.408465 18.275503 2.266 0.0282 *
## KillRest -0.637154 0.582472 -1.094 0.2797
## BeltUse 0.002414 0.294116 0.008 0.9935
## KillRest:BeltUse 0.005755 0.008563 0.672 0.5049
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.923 on 46 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.08639, Adjusted R-squared: 0.02681
## F-statistic: 1.45 on 3 and 46 DF, p-value: 0.2406
In the above model, I am looking to see if there is an interaction between state percentage of those killed while using restraint and the state estimated seat belt use rate, in which the relationship between the impact of the state percentage of those killed while using restraint and the percent of fatalities that involved blood alcohol levels above .10g/dl is dependent on the state’s estimated seat belt use rate. When both predictors or independent variables are 0 the percent of fatalities is 41.41. For every one unit increase in those killed while using restraint there is a .64 decrease in fatalities involving a BAC > .10g/dl. For every one unit increase in state compliance with seat belt use there is no unit change in fatalities involving a BAC > .10g/dl. Depending on the state percentage of those killed while using restraint for every increase in state seat belt use compliance there is a .01 increase in fatalities involving a BAC > .10g/dl. This model accounts for 2% of the total variance.
Part B: Display the fitted model graphically, and explain the meaning of each of the coefficients, along with the residual standard deviation. Plot the residuals versus fitted values. Comment on any relationships in the data that surprised you initially and whether or not you were able to make sense of them.
par(mfrow=c(2,2))
plot(lm_no_rest_used)
par(mfrow=c(2,2))
plot(lm_rest_used)
par(mfrow=c(2,2))
plot(lm_restraint)
par(mfrow=c(2,2))
plot(lm_no_restraint)
Looking at the last two graphs for the models with two predictors
(lm_restraint and lm_no_restraint) I am see that there is slight
clustering.
plotting the models with one predictor
plot(crash_data$KillUnre, crash_data$Alc_10, xlab = "Restraint Not Used", ylab = "Fatlities BAC > .10")
abline(lm_no_rest_used)
plot(crash_data$KillRest, crash_data$Alc_10, xlab = "Restraint Used", ylab = "Fatlities BAC > .10")
abline(lm_rest_used)
plotting the models with two predictors
plot(fitted(lm_no_restraint), resid(lm_no_restraint),
xlab = "Fitted Values", ylab = "Residuals",
main = "Residuals vs. Fitted Values")
plot(fitted(lm_restraint), resid(lm_restraint),
xlab = "Fitted Values", ylab = "Residuals",
main = "Residuals vs. Fitted Values")
In the first plot for lm_no_restraint there is slight clustering in the
middle and in the second plot for lm_restraint there is slight
clustering towards the left side.
Residuals are the difference between the observed and predicted values. The standard deviation of the residuals is measuring the variability of the residuals. The dispersion is pretty even so we can assume the assumption of homoscedasticity is not violated.
Question 3:
Part A: For two inputs in your models above (X1 and X2), compute their mean and standard deviation, X1 and sX1 , respectively, and vary them so that you have all 9 combinations of X1 − sX1 , X1, X1 +sX1 and X2 −sX2, X2, X2 +sX2.
(KillUnre <- mean(crash_data$KillUnre, na.rm=TRUE) +
c(-1, 0, 1) * sd(crash_data$KillUnre, na.rm=TRUE))
## [1] 47.69781 56.99412 66.29043
(BeltUse <- mean(crash_data$BeltUse, na.rm=TRUE) +
c(-1, 0, 1) * sd(crash_data$BeltUse, na.rm=TRUE))
## [1] 51.85677 63.72000 75.58323
(combs <- expand_grid(KillUnre, BeltUse))
## # A tibble: 9 × 2
## KillUnre BeltUse
## <dbl> <dbl>
## 1 47.7 51.9
## 2 47.7 63.7
## 3 47.7 75.6
## 4 57.0 51.9
## 5 57.0 63.7
## 6 57.0 75.6
## 7 66.3 51.9
## 8 66.3 63.7
## 9 66.3 75.6
Part B: Now use the predict() function with one of your fits from lm() for question 2 to generate 9 predictions. Visualize those values and explain how they correspond to your understanding about the relationships you examined previously.
predictions <- predict(lm_no_restraint, newdata = combs)
predictions
## 1 2 3 4 5 6 7 8
## 30.20733 31.65375 33.10018 31.26527 32.71170 34.15813 32.32322 33.76964
## 9
## 35.21607
plot(combs$KillUnre, predictions, type = "b",
xlab = "Predicted KillUnre", ylab = "Predicted Alc_10",
main = "Predicted Alc_10 vs. Predicted KillUnre")
plot(combs$BeltUse, predictions, type = "b",
xlab = "Predicted BeltUse", ylab = "Predicted Alc_10",
main = "Predicted Alc_10 vs. Predicted BeltUse")
They correspond in the sense that an increase in the percentage of those
killed while not using restraint is associated with an increase in the
the percent of fatalities that involved blood alcohol levels above
.10g/dl.