Submit your homework to Canvas by the due date and time. Email your lecturer if you have extenuating circumstances and need to request an extension.
If an exercise asks you to use R, include a copy of all relevant code and output in your submitted homework file. You can copy/paste your code, take screenshots, or compile your work in an Rmarkdown document.
If a problem does not specify how to compute the answer, you many use any appropriate method. I may ask you to use R or use manual calculations on your exams, so practice accordingly.
You must include an explanation and/or intermediate calculations for an exercise to be complete.
10 points are awarded for completion: exercise prompts are attempted, explanations (and code/output if relevant) are provided, and the submission is clear and easy to read.
S1. Suppose we are interested in exploring the relationship between city air particulate and rates of childhood asthma. We sample 15 cities for particulate (X) measured in parts-per-million (ppm) of large particulate matter and for the rate of childhood asthma (Y) measured in percents.
| variable | size | mean | variance |
|---|---|---|---|
| X | 15 | 11.42 | 13.05 |
| Y | 15 | 14.513 | 2.636 |
- Plot the data as you see fit and summarize the pattern’s shape, direction, and strength in the context of the problem.
From plotting the particulate and asthma data on the X and Y axis respectively, it seems there is a strong, positive, linear pattern for this data. So the level of particulate and the percent of childhood asthma have this pattern between them.
particulate <- c(11.6, 15.9, 15.7, 7.9, 6.3, 13.7, 13.1,
10.8, 6.0, 7.6, 14.8, 7.4, 16.2, 13.1, 11.2)
asthma <- c(14.5, 16.6, 16.5, 12.6, 12.0, 15.8, 15.1,
14.2, 12.2, 13.1, 16.0, 12.9, 16.4, 15.4, 14.4)
plot(particulate, asthma)
- Calculate the correlation coefficient (with R) and explain how the value corresponds to what you observed in the graph in part (a).
cor(particulate, asthma) = 0.9931873
This value makes sense when referring to our plot, since the data points showed a positive linear pattern and were quite close together, which corresponds to having a correlation coefficient close to 1.
- Build a linear regression model with least squares estimators for slope and y intercept for the data.
- First, build a regression model by hand using the correlation computed in (b) and summary statistics given above.
estimated slope = \(\hat{\beta}_1\) = r * \(s_y / s_x\) = 0.9931873 * (sqrt(2.636)/sqrt(13.05)) = 0.4463737
intercept = \(\bar{y} - \hat{\beta}_1 \bar{x}\) = 14.513-0.4463737*(11.42) = 9.415412
model = \(y = 0.4463737 * x + 9.415412\)
- Check your computations using
lmin R.
asthma_mod <- lm(asthma ~ particulate)
asthma_mod
##
## Call:
## lm(formula = asthma ~ particulate)
##
## Coefficients:
## (Intercept) particulate
## 9.4163 0.4463
- Interpret the estimated intercept and slope in the context of the question.
The estimated intercept of 9.415412 is referring to the estimated average rate of asthma in cities that have large particulate matter measurement of 0, which is not within the range of our data.
The estimated slope is 0.4463737 which means that as the measurement of large particulate matter increases, the rate of childhood asthma on average increases by 0.4463737 percentage points.
- Construct a residual plot of fitted y values on the x axis and residuals on the y. Also make a qq-plot of the residuals. Graphically assess whether the correct model and constant variance assumptions are reasonably met.
I don’t see any obvious deviations from constant variance or any sort of quadratic pattern on this plot which shows that our decision to use a linear model was a good choice and the constant variances assumption is met. The qqplot shows the points in a relatively straight line following the qqline well, which means the residuals are normal.
plot(fitted(asthma_mod), resid(asthma_mod))
qqnorm(resid(asthma_mod))
qqline(resid(asthma_mod))
summary(asthma_mod)
##
## Call:
## lm(formula = asthma ~ particulate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.34226 -0.12842 -0.01514 0.12130 0.29164
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.41626 0.17344 54.29 < 2e-16 ***
## particulate 0.44633 0.01452 30.73 1.6e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1963 on 13 degrees of freedom
## Multiple R-squared: 0.9864, Adjusted R-squared: 0.9854
## F-statistic: 944.4 on 1 and 13 DF, p-value: 1.595e-13
- Identify what (particulate, asthma) point results in the residual with the largest magnitude. Is that point above or below the fitted regression line? Show how the residual is calculated.
The list of residuals calculated with R provides that -0.34225706 (the 4th point) is the furthest from 0 (greatest magnitude). This point is negative so it is below the regression line.
Observed y = 12.6 Observed x = 7.9 (4th data points from particulate and asthma rate vectors)
predicted value = 9.415412 + (0.4463737 * 7.9) = 12.94176
asthma_mod$residuals
## 1 2 3 4 5 6
## -0.09367246 0.08711504 0.07638074 -0.34225706 -0.22813148 0.26903771
## 7 8 9 10 11 12
## -0.16316519 -0.03660967 0.10576707 0.29164149 -0.02192362 0.18090719
## 13 14 15
## -0.24678350 0.13683481 -0.01514107
S2. Continue working with the linear model for the relationship between city air particulate and rates of childhood asthma.
- Suppose we sample a new city whose particulate is 13 ppm. If reasonable, create a 95% interval for the predicted rate of childhood asthma in this city. If not reasonable, explain why.
point estimate = \(\hat{y}_i\) = 9.415412 + 13*(0.4463737) = 15.21827
critical value = qt(0.975, df=15-2) = 2.160369
\(SS_E\) = \({\sum{Residuals^2}}\) = (-0.09367246)^2 + 0.08711504^2 + 0.07638074^2 + (-0.34225706)^2 + (-0.22813148)^2 + 0.26903771^2 + (-0.16316519)^2 + (-0.03660967)^2 + 0.10576707^2 + 0.29164149^2 + (-0.02192362)^2 + 0.18090719^2 + (-0.24678350)^2 + 0.13683481^2 + (-0.01514107)^2 = 0.5010305
\(MS_E\) = \(\frac{\sum{Residuals^2}}{n-2}\) = 0.5010305 / (15-2) = 0.03854081
s = sqrt(0.03854081) = 0.1963181 (residual standard error)
\(\sum^{15}_{i=1} (x_i-\bar{x})^2\) = (n-1)\(s^2_x\) = (15-1) 13.05 = 182.7
mean of particulate = 11.42 \((x^* - \bar{x}^2)\) = (13 - 11.42)^2 = 2.4964
\(\hat{se}(\hat{y}|x^* = 13)\) = 0.1963181 * sqrt((1) + (1/15) + ((2.4964)/(182.7))) = 0.204051
Prediction Interval = 15.21827 \(\pm\) 2.160369 * 0.204051 = (14.77744, 15.6591)
- Create a 95% confidence interval for the average rate of childhood asthma among all cities with 10 ppm of large particulate. Is this confidence interval wider or narrower than a 95% prediction interval for the rate of childhood asthma in the next city with 10 ppm of large particulate? Explain why you know this without having to build the PI.
point estimate = \(\hat{y}_i\) = 9.415412 + 10*(0.4463737) = 13.87915
critical value = qt(0.975, df=15-2) = 2.160369
\(SS_E\) = \({\sum{Residuals^2}}\) = (-0.09367246)^2 + 0.08711504^2 + 0.07638074^2 + (-0.34225706)^2 + (-0.22813148)^2 + 0.26903771^2 + (-0.16316519)^2 + (-0.03660967)^2 + 0.10576707^2 + 0.29164149^2 + (-0.02192362)^2 + 0.18090719^2 + (-0.24678350)^2 + 0.13683481^2 + (-0.01514107)^2 = 0.5010305
\(MS_E\) = \(\frac{\sum{Residuals^2}}{n-2}\) = 0.5010305 / (15-2) = 0.03854081
s = sqrt(0.03854081) = 0.1963181
\(\hat{se}(E(\hat{y}|x^* = 13))\) = 0.1963181 * sqrt((1/15) + ((2.4964)/(182.7))) = 0.05564176
Prediction Interval = 13.87915 \(\pm\) 2.160369 * 0.05564176 = (13.75894, 13.99936)
This confidence interval is narrower than the Prediction Interval from part A, I would know this without even building the PI because part A was a Type 2 Prediction (predicting a single NEW data point) whereas this CI is predicting the AVERAGE rate of childhood asthma) which is more reliable and has less uncertainty.
- If reasonable, create a 95% interval for the predicted rate of childhood asthma in the next city sampled that has 3 ppm of large particulate. If this is not reasonable, explain why.
This is not reasonable because 3ppm of large particulate is WAY outside of the range of particulate of the observed data, thus this would be extrapolation as we have no reason to believe that the model is correct outside of our data range.
- Will a 95% confidence interval for the average rate of childhood asthma constructed at \(x=12\) be wider/narrower/same width as one constructed at \(x=7\)? Explain how you can tell without having to build the new interval.
A 95% CI for the average rate of childhood asthma constructed at \(x=12\) be narrower than one constructed at \(x=7\) because \(x=12\) is closer to the mean of variable X (11.42), so the numerator of \(((x^* - \bar{x})^2)/(\sum(x_i - \bar{x})^2))\) will be smaller leading to a smaller standard error and thus a narrower interval.
S3. In the paper “Artificial Trees as a Cavity Substrate for Woodpeckers”, scientists provided polystyrene cylinders as an alternative roost. The paper related values of x = ambient temperature (C) and y = cavity depth (cm). A scatterplot in the paper showed a strong linear relationship between x and y. The summary values for x and y are given below:
| Variable | Size | Mean | Variance |
|---|---|---|---|
| Temp (x) | 12 | 10.92 | 137.17 |
| Depth (y) | 12 | 16.36 | 21.28 |
A least-squares linear model for (Depth ~ Temp) was fit, and the intercept was estimated to be 20.12506 with standard error 0.94023. The slope was estimated to be -0.34504 with standard error 0.06008. The MSE for the model is \(2.334^2\).
- Determine the sample correlation (r) from the summaries given.
\(\hat{\beta}_1\) = \(r\frac{s_y}{s_x}\)
r = \(\hat{\beta}_1\frac{s_y}{s_x}\) = -0.34504 * (sqrt(137.17)/sqrt(21.28)) = 0.8760183
Determine test statistics and p values for the tests in parts (b)-(e).
- \(H_0: \beta_1=0\) vs \(H_A: \beta_1 \ne 0\)
Test Statistic = (-0.34504-0)/(0.06008) = -5.743009
P-value = 2*pt(-5.743009, df=10) = 0.0001869473
- \(H_0: \beta_1 \ge 0\) vs \(H_A: \beta_1 < 0\)
Test Statistic = (-0.34504-0)/(0.06008) = -5.743009
P-value = pt(-5.743009, df=10) = 9.347365e-05
- \(H_0: \beta_1 \le 0\) vs \(H_A: \beta_1 > 0\)
Test Statistic = (-0.34504-0)/(0.06008) = -5.743009
P-value = pt(-5.743009, df=10, lower.tail=F) = 0.9999065
- \(H_0: \beta_1=-0.5\) vs \(H_A: \beta_1 \ne -0.5\)
Test Statistic = (-0.34504-(-0.5))/(0.06008) = 2.579228
P-value = 2*pt(2.579228, df=10, lower.tail=F) = 0.02745264
- Compute and interpret a 98% confidence interval for the slope of the regression line \(\beta_1\).
CV = qt(0.99, df=10) = 2.763769 \(-0.34504 \pm t_{0.01, 10} * 0.06008 = -0.34504 \pm 2.764 * 0.06008\) = (−0.511, −0.179)
This interval means that we are 98% confident that the true slope for this scenario, which relates temperature and cavity depth is between -0.511 and -0.179. Or in other words, the temperature increasing by 1 degree is related to cavity depth decreasing by between 0.511cm and 0.179cm.
S4. The paper “Linkage Studies of the Tomato” reported the following data on phenotypes resulting from crossing tall cut-leaf tomatoes with dwarf potato-leaf tomatoes. We wish to investigate whether the following frequencies are consistent with genetic laws, which state that the phenotypes should occur in the ratio 9:3:3:1.
| Phenotype: | Tall cut | Tall Potato | Dwarf Cut | Dwarf Potato |
|---|---|---|---|---|
| Frequency: | 926 | 288 | 293 | 104 |
- Explain why the null hypothesis given below is appropriate to test whether the data support the proposed genetic model. Also give an appropriate alternative hypothesis.
\[H_0:\quad \pi_{TC}=9/16,\quad \pi_{TP}=3/16,\quad \pi_{DC}=3/16,\quad \pi_{DP}=1/16\]
This null hypothesis is appropriate because we are trying to test if the observed frequencies of each phenotype match the expected phenotypes from genetic laws, that being a ratio of 9:3:3:1. The expected ratios of each phenotype would thus be 9/16 (Tall cut), 3/16 (tall potato), 3/16 (dwarf cut), 1/16 (dwarf potato). So the null hypothesis setting the proportions of each phenotype equal to their respective expected proportions is appropriate.
Alternative Hypothesis = At least one of the observed phenotypes does not match its expected ratio.
- Carry out the appropriate test for the hypotheses in (a), using \(\alpha = 0.1\), being sure to check assumptions are reasonably met. You can use the table below to organize the expected counts.
| Phenotype: | Tall cut | Tall Potato | Dwarf Cut | Dwarf Potato | Total |
|---|---|---|---|---|---|
| Obs Frequency: | 926 | 288 | 293 | 104 | 1611 |
| Exp Frequency: | 1611*(9/16)=906.188 | 1611*(3/16)= 302.0625 | 1611*(3/16)= 302.0625 | 1611*(1/16)= 100.6875 | 1611 |
Assumptions: Sample Size is large (1611 observations total) and the expected amounts for each phenotype is at least 5.
Chi-square statistic = ((926-906.188)^2/906.188) + ((288-302.0625)^2/302.0625) + ((293-302.0625)^2/302.0625) + ((104-100.6875)^2/100.6875) = 1.4687
df = 4-1=3
pchisq(1.4687, df=3, lower.tail=F) = 0.689513
We fail to reject the null because a p-value of 0.689513 is greater than alpha = 0.1. Therefore the observed data is not statistically significantly different from the expected data of the ratios of the phenotypes.
C1. Compared to a fitted regression line, the point \((x_1, y_1) = (3, 11)\) has residual 1 and the point \((x_2, y_2) = (5, 12)\) has residual -2.
- Which point lies above the fitted regression line and which point lies below the line? Explain how you know.
Positive Residual = above fitted regression line Negative Residual = below fitted regression line
The first point, \((x_1, y_1) = (3, 11)\) has a residual of 1 (positive) so this point lies above the fitted regression line. The second point, \((x_2, y_2) = (5, 12)\), has a residual of -2 (negative) so this point lies below the fitted regression line.
- Recover the slope and intercept of the fitted regression line.
Using the points and their residuals, the predicted value at x=3 is 11-1, so \(\hat{y}_1\) = 11-1=10 and at x=5 \(\hat{y}_2\) = 12+2=14. Thus two points on the fitted regression line are (3,10) and (5,14) which give a slope of ((14-10)/(5-3))=2. We can solve for the intercept using the first point. 10=2*3+intercept, intercept = 4.
- The mean of the \(X\) data is \(\bar{x} = 4.5\). Find \(\bar{y}\).
\(\bar{y}=\hat{y}*(\bar{x})\) SO \(\bar{y}=\) 2*(4.5) + 4 = 13
- The point (5.5, 15) is added to the data. How do the slope and intercept of the fitted regression line change?
\(\hat{y} = 2 * (5.5) + 4\) = 15. This point (5.5, 15) falls directly on the fitted regression line we already have, so it does not change the slope or intercept of the fitted regression line at all.
C2. Using the definitions from lecture, show mathematically that \[\hat{\beta}_1 \;=\; \frac{\sum_{i=1}^n(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n(x_i - \bar{x})^2}\] Sum of squared residuals SSR = $_{i=1}^n e^2_i = $ = \(\sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2\)
set \(\sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2\) = 0
\(n\hat{\beta}_0 = \sum_{i=1}^n y_i - \hat{\beta}_1 \sum_{i=1}^nx_i\)
solve for \(\hat{\beta}_0\) = \(\frac{1}{n}\sum_{i=1}^n y_i - \hat{\beta}_1 \frac{1}{n}\sum_{i=1}^nx_i\)
\(\hat{\beta}_0 = \bar{y}-\hat{\beta}_1\bar{x}\)
\(\sum_{i=1}^n(y_i - \hat{\beta}_0 - \hat{\beta}_1x_i)x_i=0\)
substitute \(\hat{\beta}_0 = \bar{y}-\hat{\beta}_1\bar{x}\) :
\(\sum_{i=1}^ny_ix_i-\bar{y}\bar{x}+\hat{\beta}_1n\bar{x}^2 - \hat{\beta}_1\sum_{i=1}^nx^2_i = 0\)
solve for \(\hat{\beta}_1\) recognizing that \(\sum_{i=1}^n(x_i-\bar{x})^2 = \sum_{i=1}^nx_i^2-n\bar{x}^2\) therefore : \(\hat{\beta}_1 = \frac{\sum_{i=1}^n(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n(x_i - \bar{x})^2}\)
C3. Your R installation comes with a built-in dataset
mtcars that measures several characteristics of different
automobiles. You can access the entire dataset or a specific column as
follows:
# Full dataset
mtcars
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
# Get the column for miles per gallon
mtcars$mpg
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4
- Using R, fit a linear model for miles per gallon (y) in terms of weight (x), in 1000s of lbs. Interpret the estimated slope and intercept.
- Perform a residual analysis to validate the fit of your linear model.
- Data analysts use different quantitative metrics to evaluate the fit of a model and compare it to other models. Using R, calculate the value of the following metrics for the linear model in (a).
- Bias is the sum of the residuals, which tells us whether a model has the tendency to overestimate or underestimate.
\[Bias \;=\; \sum_{i=1}^n (y_i - \hat{y}_i)\]
- Adjusted R-squared has a similar interpretation to the coefficient of determination \(R^2\), but it penalizes over-complicated models.
\[R^2_{adj} \;=\; 1 - \Big(\frac{(1 - R^2)(n - 1)}{n - 2}\Big)\]
- Mean absolute error is the typical magnitude of the errors in the linear model.
\[MAE \;=\; \frac{1}{n}\sum_{i=1}^n|y_i - \hat{y}_i|\]
- Mean absolute percentage error is the typical difference between an actual and predicted value, expressed as a percentage.
\[MAPE \;=\; 100\times \frac{1}{N}\sum_{i=1}^n \Big|\frac{y_i - \hat{y}_i}{y_i}\Big|\]