Note on Assignments

Assignments are not stand alone and are designed to be answered in conjunction with lecture notes and case studies. You need to follow the R-code taught in the course when completing the assignment. Alternative R code (and interpretation) not taught in the course and extraneous R output (and interpretation) included in your answers can lead to deductions in marks. Note: AI use frequently will generate alternative code and interpretation that does not follow the course material.

1 Question 1 [12 Marks]

A researcher is doing a study on New Zealand drivers abilities to read road signs depending on their age. They are particularly interested in which factors make the signs easier to read. They were especially concerned whether the signs could be read easily by older drivers. They recruited a sample of people to take part in the study. People in the study were then shown a series of different road signs starting at a long distance and then brought slowly closer until they could be understood, to establish the maximum distance the signs could be read.

A student was looking at the data and decided to see how representative the ages of the people recruited were of the general population.

The data is in the file Age.csv, which contains the variable:

Variable Description
age The age of the study participant (in years).

Instructions:

1.1 Question of interest/goal of the study

The student is interested if the researchers had collected a representative sample of people of varying ages.

1.2 Read in and inspect the data:

Age.df = read.csv("Age.csv", header = TRUE)
age=Age.df$age

stripchart(age, method = "stack", pch = 1, main = "Ages of volunteers")

hist(age)

summary(age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   33.00   53.00   50.51   66.00   82.00

1.3 Comment on the plots/exploratory data analysis

The Ages are centered just above 50 years old. The distribution is roughly symmetrical.

1.4 Use R-code to manually calculate the t-statistic and 95% confidence interval for the default t-test with mu=0.

Formulas: \(T = \frac{\bar{y}-\mu_0}{se(\bar{y})}\) and 95% confidence interval \(\bar{y} \pm t_{df, 0.975} \times se(\bar{y})\)

NOTES: The R code mean(y) calculates \(\bar{y}\). The standard error is \(se(\bar{y}) = \frac{s}{\sqrt{n}}\) where \(s\) is the standard deviation of \(y\) and is calculated by sd(y), and \(n\) is the number of data-points calculated by length(y). The degrees of freedom is \(df = n-1\). The \(t_{df,0.975}\) multiplier is given by the R code qt(0.975, df).

( mn_age = mean(Age.df$age) )  # Sample Mean
## [1] 50.50562
( sd_age = sd(Age.df$age) )  # Sample standard deviation
## [1] 18.54302
( n_age = length(Age.df$age) ) # Sample size
## [1] 89
(( tmult_age = qt(1 - 0.05/2, df = n_age - 1) ) ) # t-multiplier
## [1] 1.98729
( CI_age = mn_age + tmult_age * c(-1, 1) * sd_age/sqrt(n_age) ) # Confidence Interval
## [1] 46.59949 54.41175
( se_age = sd_age/sqrt(n_age) ) # Standard error
## [1] 1.965557
# t-statistic for H0: mu=0:

(t_stat_age = (mn_age - 0)/(se_age) ) # t-stat
## [1] 25.69533
# 95% confidence interval for the mean:

( CI_age = mn_age + tmult_age * c(-1, 1) * sd_age/sqrt(n_age) ) # Confidence Interval
## [1] 46.59949 54.41175

1.5 Repeat the same calculation using the t.test function (done for you):

t.test(age,data=Age.df)
## 
##  One Sample t-test
## 
## data:  age
## t = 25.695, df = 88, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  46.59949 54.41175
## sample estimates:
## mean of x 
##  50.50562

Note: You should get exactly the same results from the manual calculations and using the \(t.test\) function. Doing this was to give you practice using some R code. The \(t.test\) function also delivers the p-value that we did not calculate above.

1.6 Fit and check the null model (done for you):

Age.lm=lm(age~1,data=Age.df)
normcheck(Age.lm)

cooks20x(Age.lm)

summary(Age.lm)
## 
## Call:
## lm(formula = age ~ 1, data = Age.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.506 -17.506   2.494  15.494  31.494 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   50.506      1.966    25.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.54 on 88 degrees of freedom
confint(Age.lm)
##                2.5 %   97.5 %
## (Intercept) 46.59949 54.41175

1.7 Why is the default hypothesis test using mu = 0 not of any interest or use in this case?

It is not of any interest because the actual mean age must be greater than zero. It must be greater than zero because we are using adult participants who will all be older than 0 years old. so it’s an uninteresting test to interpret because it’s substantively meaningless.

1.8 The student found that the actual mean age of people in New Zealand is 39.4 years. Adjust and repeat a t-test to get a P-value to test the hypothesis that mu = 39.4 instead.

t.test (Age.df$age, mu = 39.4)
## 
##  One Sample t-test
## 
## data:  Age.df$age
## t = 5.6501, df = 88, p-value = 1.945e-07
## alternative hypothesis: true mean is not equal to 39.4
## 95 percent confidence interval:
##  46.59949 54.41175
## sample estimates:
## mean of x 
##  50.50562

1.9 What do you conclude from testing the hypothesis that mu = 39.4? Suggest why this hypothese would also not be sensible in this case.

We can reject the hypthothesis that mu = 39.4 because 39.4 is outside of the bounds of the 95 percent confidence interval. And this hypothesis is not sensible because there is a mismatch between the sample and the greater nz population. mu =39.4 makes sense if the sample was representative of the new zealand population but since it’s not it is not a sensible hypothesis.

1.10 Method and Assumption Checks (Done for you)

These data come from a random sample of 89 drivers. As it is a single quantitative variable, we have fitted a null linear model - a one sample t-test and confidence interval. The age distribution is not normally distributed as it is flat with very short tails. As our data is symmetric about the mean value and as we have a reasonable sample size of \(n=89\) we may use the CLT to talk make inference about the average age of drivers. We have independence from taking a sample and there are no unduly influential points.

Our preferred model is: \(Age_i = \mu + \epsilon_i\) where \(\epsilon_i \sim iid ~ N(0,\sigma^2)\)

\(\mu\) is the average age of the drivers.

1.11 Write a sentence, as if for a Executive Summary estimating the average age of the underlying population that this sample was drawn from.

We estimate that the average age of the volunteers from the population that sample was drawn from is between 47 years old and 54 years old.

1.12 Look at the website https://www.populationpyramid.net/new-zealand/2024/#google_vignette (or similar) which shows the current population demographics by age for the New Zealand population. Write two or three sentences of plain English comparing the NZ population distribution to the distribution of the ages of the volunteers used in the study and suggesting why the researcher recruited a sample that was different.

31 out of 89 or 34.8 % (1d.p.) of the volunteers used in this study were over 60 years old. This is more than the new zealand population which has 21.4% of people older than 60 years. The researcher may have recruited a sample with a greater proportion of older volunteers because the researcher was especially concerned about if older new zealanders could read road signs.


2 Question 2 [12 Marks]

A researcher at the United Kingdom’s Office of National Statistics was interested in how happiness was related to age. In particular, we want to predict the percentage of 25 year olds who are happy. A random sample of people were selected from a large survey where various data was collected from UK citizens. This included age group and their current level of happiness. For the purpose of this analysis, we will treat the age group data as the mid-point value of the group and use it as an explanatory variable. The response is the percentage of people in the group who said they were either Happy or Very Happy.

The data is in the file carpets.csv, which contains the variables:

Variable Description
Age The midpoint age of the age group (in years),
Happy The percentage of people in the group that said they were either Happy or Very Happy.

Instructions:

2.1 Question of interest/goal of the study

We are interested in how happiness was related to age. In particular, we want to predict the percentage of 25 year olds who are happy or very happy.

2.2 Read in and inspect the data:

Happy.df=read.csv("happy.csv",header=T)
plot(Happy~Age, main="Percentage of Happy People by Age", data=Happy.df)

2.3 Comment on the plot

Happiness tends to decrease with age until it reaches a minimum just after 52 years old. Then it bounces back and continues to increase through the 60s and 70s. Looks like a curved relationship.

# Let's make this graph less depressing
plot(Happy~Age, ylim=c(0,100),main="Happier Perspective",data=Happy.df)

2.4 Comment on the plot

changing the scale of the y-axis makes the dip in the late 40s look less drastic. It still looks like a curved relationship.

## Fit an appropriate linear model, including model checks.

HappyAge.fit = lm(Happy ~ Age, data = Happy.df)
plot(HappyAge.fit, which = 1) # We can use this code as an alternative to eovcheck

HappyAge.fit2 = lm(Happy ~ Age + I(Age^2), data = Happy.df)
plot(HappyAge.fit2, which = 1)

normcheck(HappyAge.fit2)

cooks20x(HappyAge.fit2)

summary(HappyAge.fit2)
## 
## Call:
## lm(formula = Happy ~ Age + I(Age^2), data = Happy.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.3398 -0.3527  0.1790  0.4677  1.1011 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 95.062692   2.318504  41.002 1.38e-10 ***
## Age         -0.854527   0.104786  -8.155 3.80e-05 ***
## I(Age^2)     0.008662   0.001091   7.937 4.62e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7991 on 8 degrees of freedom
## Multiple R-squared:  0.8938, Adjusted R-squared:  0.8672 
## F-statistic: 33.66 on 2 and 8 DF,  p-value: 0.0001273
confint(HappyAge.fit2)
##                    2.5 %       97.5 %
## (Intercept) 89.716211426 100.40917319
## Age         -1.096164317  -0.61288930
## I(Age^2)     0.006145493   0.01117852

2.5 Plot the data with your appropriate model superimposed over it.

plot(Happy~Age, main="Percentage of Happy People by Age", data=Happy.df)
x = 22.5:72.5
HappyAge.fit = lm(Happy ~ Age, data = Happy.df)
##abline(HappyAge.fit, lty = 2, col ="blue")
lines(x, predict(HappyAge.fit, data.frame(Age = x)), col = "blue")
lines(x, predict(HappyAge.fit2, data.frame(Age = x)), col = "red")

2.6 Write appropriate Methods and Assumption Checks.

The scatter plot of Happiness vs Age suggests curvature in the relationship. We began with a linear model to describe this relationship but the residual plot showed strong curvature. And a visual comparison of the fitted lines shows that a quadratic term is appropriate. So, we added a quadratic term to the linear model.

Our final model is:

\(\beta_0 + \beta_1 \times Age_i + \beta_2 \times Age_i^2 + \epsilon_i\)

The residuals for this model are approximately normal and there are no outliers.

Our model explained 89% of the variability in happiness for the UK citizens from the sample.

2.7 Use your model to predict the percentage of happy people who are 25 years of age.

predAge.df = data.frame(Age = c(25))
predict(HappyAge.fit2, predAge.df, interval = "prediction")
##        fit      lwr      upr
## 1 79.11328 76.93481 81.29174

2.8 Write an sentence, as if for an Executive Summary, interpreting the requested prediction interval.

We estimate that for 25 year olds, the expected happiness percentage will range from 77% to 81%.