Note on Assignments

Assignments are not stand alone and are designed to be answered in conjunction with lecture notes and case studies. You need to follow the R-code taught in the course when completing the assignment. Alternative R code (and interpretation) not taught in the course and extraneous R output (and interpretation) included in your answers can lead to deductions in marks. Note: AI use frequently will generate alternative code and interpretation that does not follow the course material.

1 Question 1 [13 Marks]

A study was conducted in schools from Michigan. The researchers were interested in whether the students from the Detroit area consumed more vegetables than students from the other area in Michigan. Random samples were collected, and the servings of vegetables consumed per day were estimated from a standard food frequency questionnaire.

The data was stored in veg.csv and includes variables:

Variable Description
veg The number of vegetable (serving) consumed per day by the student,
area The area of Michigan the student is from (Detroit or Other).

Instructions:

1.1 Question of interest/goal of the study

We are interested in comparing the vegetable consumption from students in and out of Detroit area.

1.2 Read the data.

veg.df=read.csv(file='veg.csv', header = TRUE, stringsAsFactors = TRUE)

1.3 Inspect the data:

boxplot(veg~area,horizontal=TRUE,data=veg.df)

summaryStats(veg~area,data=veg.df)
##         Sample Size     Mean Median  Std Dev Midspread
## Detroit        1107 1.447683  0.985 1.480083    1.4250
## Other          1107 1.686287  1.150 1.693791    1.6195
boxplot(log(veg)~area,horizontal=TRUE,data=veg.df)

summaryStats(log(veg)~area,data=veg.df)
##         Sample Size        Mean      Median  Std Dev Midspread
## Detroit        1107 -0.08806390 -0.01511364 1.022407  1.370696
## Other          1107  0.06449011  0.13976194 1.030286  1.326023

1.4 Comment on plots of the data

The box plot of vegetable servings (veg) by area shows that both groups are right-skewed, with a large number of outliers extending to the far right. The median servings for Detroit (0.985) and Other areas (1.150) are both quite low, but they have large standard deviations (1.48 and 1.69) and the midspreads differ (1.425 - 1.619), this suggests an unequal variance between groups. The higher centre suggests a higher spread which dosen’t agree with the equality of variance assumptions. The boxplot of log(veg) by area shows a clear improvement in symmetry. On the log scale, the standard deviations are very similiar (Detroit: 1.022 and Other: 1.030), and the midspreads are also similair (1.371 - 1.326). The medians on the log scale are also close (-0.015 for Detroit and 0.140 for Other), suggesting that the Other area students consume slightly more vegetables than Detroit Students.

1.5 Comment why it is more appropriate to log the number of vegetable servings to analyse this data.

The raw vegetable serving data is very right-skewed with many outliers, and the spread increases within the centres, the Other area group has a larger standard deviation (1.694) and midspread (1.620) than the Detroit group (SD: 1.480 and MS: 1.425). This is a non-constant variance which violates the equality of variance assumption that is required for a linear model. After log-transforming, the distribution becomes more symmetrical and the spreads between the two groups become very similair. The standard deviations are now more equal (Detroit: 1.022, Other: 1.030) and the midspreads are also more similair (1.371 - 1.326). Log-transforming now satisfies the equality of variance assumption much better. Therefore, it is more appropriate to analyse log(veg) rather than veg in our linear model.

1.6 Fit an appropriate linear model, including model checks. (DO NOT use the Welch test here.) Generate confidence intervals for the model.

veg.fit = lm(log(veg) ~ area, data = veg.df)
plot(veg.fit, which = 1)

normcheck(veg.fit)

cooks20x(veg.fit)

summary(veg.fit)
## 
## Call:
## lm(formula = log(veg) ~ area, data = veg.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4169 -0.6169  0.0753  0.7322  2.6122 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.08806    0.03085  -2.855  0.00435 ** 
## areaOther    0.15255    0.04363   3.497  0.00048 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.026 on 2212 degrees of freedom
## Multiple R-squared:  0.005498,   Adjusted R-squared:  0.005048 
## F-statistic: 12.23 on 1 and 2212 DF,  p-value: 0.00048
cbind(exp(coef(veg.fit)), exp(confint(veg.fit)))
##                           2.5 %    97.5 %
## (Intercept) 0.9157024 0.8619505 0.9728063
## areaOther   1.1648054 1.0692989 1.2688422

1.7 Method and Assumption Checks

The boxplots of vegetable servings by area showed that the data was heavily right-skewed with non-constant variance. Hence, we logged the vegetable servings. The boxplots of log(veg) by area showed clear improvement in symmetry and similair spreads between the two groups. So, we fitted a linear model with log(veg) being explained by the area. The residuals vs the fitted plot shows constant spread around zero with no trend, this satisifies the equality of variance and no-trend assumptions. The Q-Q plot shows the points following the diagonal line reasonably well, and the histogram of residuals is unimodal and approximately symmetric so the normality is satisfied. The CLT is also helpful here given the large sample size where n = 1107 per group. There aren’t really any influential points with all of the Cook’s Distances well below 0.4. Our final model is: \[\log(\text{Veg}_i) = \beta_0 + \beta_1 \times \text{AreaOther}_i + \epsilon_i\]

where \(\epsilon_i \sim \text{iid } N(0, \sigma^2)\). If the student is from outside Detroit, otherwise it is zero. Our model explained less than 1% of the variability in the logged vegetable servings (Multiple R-Squared = 0.005), suggesting that area alone is not a strong predictor of vegetable consumption.

1.8 Executive Summary

We wanted to see whether students from the Detroit area consumed different amounts of vegetables compared to students from other areas of Michigan. There was clear evidence that students from other areas of Michigan consumed more vegetables than students from Detroit (P-value = 0.00048). We estimate that the median vegetable consumption for Detroit students was between 0.86 and 0.97 servings per day. We estimate that students from other areas of Michigan consume between 6.9% and 26.9% more vegetables per day than students from Detroit.


2 Question 2 [7 Marks]

A displacement hull boat is one that pushes through the water rather than skimming over the top of the water. The laws of hydrodynamics say that the maximum speed (km per hour) of a displacement hull boat with a waterline length of wl metres is given by: \(Max speed = 4.50 \times \sqrt{wl}\)

That is, theory says that maximum speed is proportional to the square root of waterline length. Logging both variables and cancelling out we get: \(log(Max speed) = log(4.50) + 0.5 \times log(wl)\)

A researcher was interested in testing this theory. Data were collected on the maximum speeds and waterline lengths of a random sample of 40 popular designs of sailboat. The data can be found in a data set called “BoatSpeed.csv” which contains the variables:

Speed the maximum hull speed of the boat (km/h) WL the waterline length of the boat (metres)

Variable Description
MaxSpeed The maximum hull speed of the boat (km/h),
WL The waterline length of the boat (metres).

Instructions:

2.1 Question of interest/goal of the study

We are interested in whether there is a power-law relationship between the maximum speed of a displacement hull boat and its waterline length.

2.2 Read in and inspect the data:

MaxSpeed.df = read.csv("BoatSpeed.csv")
plot(MaxSpeed ~ WL, main = "Maximum Speed vs Waterline length", data = MaxSpeed.df)

plot(log(MaxSpeed) ~ log(WL), main = "log-speed vs log-length", data = MaxSpeed.df)

2.3 Comment on plots (Done for you - do not edit)

The plot of maximum speed versus waterline length shows a modest amount of concave curvature in the trend. Variation is reasonably constant, though there is a hint of less variability for smaller lengths. The plot of the logged values looks to closely follow a linear trend, with constant variability.

2.4 Fit Model and Check Assumptions (Done for you - do not edit)

MaxSpeed.plm = lm(log(MaxSpeed) ~ log(WL), data = MaxSpeed.df)
modelcheck(MaxSpeed.plm)

summary(MaxSpeed.plm)
## 
## Call:
## lm(formula = log(MaxSpeed) ~ log(WL), data = MaxSpeed.df)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.039610 -0.012881  0.002012  0.011322  0.056251 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.506974   0.020717   72.74   <2e-16 ***
## log(WL)     0.499378   0.008329   59.96   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02227 on 38 degrees of freedom
## Multiple R-squared:  0.9895, Adjusted R-squared:  0.9893 
## F-statistic:  3595 on 1 and 38 DF,  p-value: < 2.2e-16
confint(MaxSpeed.plm)
##                 2.5 %    97.5 %
## (Intercept) 1.4650337 1.5489140
## log(WL)     0.4825175 0.5162384
exp(confint(MaxSpeed.plm))
##                2.5 %   97.5 %
## (Intercept) 4.327689 4.706356
## log(WL)     1.620148 1.675712

2.5 Add confidence interval output to answer final question

1.1^(confint(MaxSpeed.plm)[2,1])
## [1] 1.047063

2.6 Plot the data with your model superimposed over it. (Done for you - do not edit)

plot(MaxSpeed ~ WL, main = "Maximum Speed vs Waterline length", data = MaxSpeed.df)
newWL = 4:26
WLpred = data.frame(WL = newWL)
points(newWL, exp(predict(MaxSpeed.plm, newdata = WLpred)), type = "l")

2.7 Are the data consistent with hydrodynamic theory? Fully justify your answer. (There are three things you need to consider for your justification.)

Yes, the fitted model is valid. The Residuals vs fitted plot shows no trend and constand speed around zero. The Q-Q plot follows the diagonal line well and the histogram of residuals is unimodal and approximately symmetrical. There aren’t really any influential points with all of Cook’s distances being below 0.4 therefore, all model assumptions are satisfied.

The hypothesised intercept is log(4.50) = 1.5041. The 95% confidence interval for the intercept from the fitted model is (1.4650, 15489). Because 1.5041 lies within this interval, the fitted intercept is consistent with the hydrodynamic theory.

The hypothesised slope is 0.5. The 95% confidence interval for the slope from the fitted model is (0.4825, 0.5162). Since 0.5 falls within this interval, the fitted slope is consistent with the hydrodynamic theory.

Overall, the data is consistent with hydrodynamic theory on all three counts.

2.8 Effect of a 10% increase in waterline length. (Write your answer in a sentence as if for an Executive Summary.)

We estimate that a 10% increase in waterline length results in an increase in median maximum speed of between 4.7% and 5.0%, for boats of the same design.


3 Question 3 [22 Marks]

The actual speed of vehicles on our roads is, of course, dependent on the posted speed limit. It may also depend on the police tolerance for speeding, which is reduced during holiday periods.

A researcher wanted to investigate the relationship of posted speed limit and holiday on vehicle speeds. Also, if any effect of posted speed limit on vehicle speeds depend on whether or not it was a holiday?

Data was collected on hourly median speeds at different locations around New Zealand (with varying speed limits) at different times (most during holiday period, with some outside the holiday period for comparison). The resulting data is in the file speed.csv, which contains the variables:

Variable Description
Speed The hourly median speed at the location, (km/h),
Posted The posted speed limit at the location (km/h),
Holiday Whether or not it was a holiday when the speed was recorded: (Coded as N for No or Y for Yes).

The researcher wanted several questions answered in particular:

Instructions:

3.1 Question of interest/goal of the study

We investigated the effect on the actual speed of vehicles of the posted speed limit as well as the reduced police tolerance for speeding that applies on holidays. Also, we wished to see if the effect on the speed of vehicles of the posted speed limit depended on whether or not it was a holiday.

3.2 Read in and inspect the data:

Speed.df=read.csv("speed.csv", header=T,stringsAsFactors=TRUE)

plot(Speed~Posted,main="Actual Speed by Speed Limit and if Holiday",
     col=ifelse(Holiday=="N","red","blue"),pch=ifelse(Holiday=="N",1,2),data=Speed.df)
legend('topleft',c("Regular Days", "Holidays"),col=c("red","blue"),pch=c(1,2))

# Same plot, with offset on posted speeds so values don't overlap for the two groups.
Speed.df$PostedOffset = Speed.df$Posted+ifelse(Speed.df$Holiday=="Y",-0.5,0.5)
plot(Speed~PostedOffset,main="Actual Speed by Speed Limit and if Holiday",
     col=ifelse(Holiday=="N","red","blue"),pch=ifelse(Holiday=="N",1,2),data=Speed.df)
legend('topleft',c("Regular Days", "Holidays"),col=c("red","blue"),pch=c(1,2))

3.3 Comment on plots

Both plots show the same data, with the second plot using a slightly more horizontal offset to separate the two groups at each speed limit for clarity. There is a clear positive linear relationship between the posted speed limit and the actual speed of the vehicles. As the posted speed limit increases, the actual speed increases for both holidays and regular days. The variability in actual speed also appears to increase slightly with the posted speed limit. On regular days (red circles), vehicles tend to travel at higher speeds compared to holidays (blue triangles) at the same posted speed limit, this suggests a holiday effect on vehicle speeds. This difference appears to be relatively consistent across all posted speed limits, suggesting the two groups may have similair slopes but different intercepts. Both groups however, show a reasonably linear trend within each level of posted speed limit, supporting the use of a linear model with posted speed limit as a predictor and holidays as a factor variable.

3.4 Fit an appropriate model to the data. Check the model assumptions. Generate summary confidence intervals for the model. (You may need to fit an additional adjusted model to get all appropriate confidence intervals needed for this analysis.)

Speed.fit = lm(Speed ~ Posted * Holiday, data = Speed.df)
plot(Speed.fit, which = 1)

normcheck(Speed.fit)

cooks20x(Speed.fit)

summary(Speed.fit)
## 
## Call:
## lm(formula = Speed ~ Posted * Holiday, data = Speed.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.2012  -1.9569   0.1367   2.0431   6.8431 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      3.69601    1.17274   3.152  0.00173 ** 
## Posted           0.98810    0.01492  66.226  < 2e-16 ***
## HolidayY        -2.86212    1.29428  -2.211  0.02752 *  
## Posted:HolidayY -0.09964    0.01680  -5.930 6.11e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.034 on 441 degrees of freedom
## Multiple R-squared:  0.9783, Adjusted R-squared:  0.9781 
## F-statistic:  6626 on 3 and 441 DF,  p-value: < 2.2e-16
confint(Speed.fit)
##                      2.5 %      97.5 %
## (Intercept)      1.3911540  6.00086851
## Posted           0.9587812  1.01742798
## HolidayY        -5.4058379 -0.31840235
## Posted:HolidayY -0.1326651 -0.06662174

3.5 Plot the data with your appropriate model superimposed over it.

plot(Speed~Posted,main="Actual Speed by Speed Limit and if Holiday",
     col=ifelse(Holiday=="N","red","blue"),pch=ifelse(Holiday=="N",1,2),data=Speed.df)
legend('topleft',c("Regular Days", "Holidays"),col=c("red","blue"),pch=c(1,2))

b = coef(Speed.fit)
abline(b[1], b[2], col = "red", lty = 2)
abline(b[1] + b[3], b[2], col = "blue", lty = 2)

3.6 Use the predict function to get EXPECTED speeds when the posted speed limit is 50 kph for both holiday and non-holiday times.

pred.df = data.frame(Posted = c(50, 50), Holiday = c("N", "Y"))
predict(Speed.fit, pred.df, interval = "confidence")
##        fit      lwr      upr
## 1 53.10124 52.11707 54.08540
## 2 45.25695 44.84211 45.67179

3.7 Method and Assumption Checks

As we have two explantory variables, Posted speed limit (numeric) and Holiday (factor), we first fitted a linear model with an interaction term to allow for different intercepts and slopes for each group. The interaction term was highly significant (P-value = 6.11e-09) and could not be dropped, therefore the effect of the posted speed limit on actual vehicle speed differs between holiday and non-holiday periods. The residuals vs fitted plot shows no trend and constant speed around zero, satisfying the equality of variance and no-trend assumptions. The Q-Q plot follows the diagonal line closely and the histogram of the residuals is unimodal and symmetircal, which satisfies the normality assumption. There are no significant or influential points, with all of Cook’s distances below 0.4 with the largest being 158 at approximately 0.08).Our final model is:

\[\text{Speed}_i = \beta_0 + \beta_1 \times \text{Posted}_i + \beta_2 \times \text{HolidayY}_i + \beta_3 \times \text{Posted}_i \times \text{HolidayY}_i + \epsilon_i\]

where \(\epsilon_i \sim \text{iid } N(0, \sigma^2)\) and \(\text{HolidayY}_i = 1\) if it was a holiday, otherwise zero. Our model explained 97.8% of the variability in vehicle speeds.

3.8 Executive Summary (Remember to answer ALL the questions asked.)

We investigated the relationship between posted speed limits and actual vehicle speeds and whether this relationship differed between holiday and non-holiday periods. There was clear evidence that the effect of posted speed limit on actual vehicle speed differed between holiday and non-holiday periods (P-value = 6.11e-09).

The Expected speed at a posted limit of 50km/h: We estimate that on regular days, the expected actual speed when the posted limit is 50km/h is between 52.1 and 54.1 km/h. On holidays, the expected actual speed is lower, between 44.8 and 45.7 km/h.

The Expected increase in speed for a 10km/h increase in posted limit: On regular days, we estimate that a 10km/h increase in the posted speed limit corresponds to an increase in expected actual speed of between 9.6 and 10.2km/h. On holidays, the increase is smaller, between 8.7km/h and 9.3km/h.

Was speeding reduced during holiday periods: Yes, there is strong evidence that suggests vehicle speeds are lower during holiday periods compared to regular days at the same posted speed limit (P-value = 0.028), which is consist with lower police tolerance for speeding during holidays.