Note on Assignments
Assignments are not stand alone and are designed to be answered in conjunction with lecture notes and case studies. You need to follow the R-code taught in the course when completing the assignment. Alternative R code (and interpretation) not taught in the course and extraneous R output (and interpretation) included in your answers can lead to deductions in marks. Note: AI use frequently will generate alternative code and interpretation that does not follow the course material.
A researcher is doing a study on New Zealand drivers abilities to read road signs depending on their age. They are particularly interested in which factors make the signs easier to read. They were especially concerned whether the signs could be read easily by older drivers. They recruited a sample of people to take part in the study. People in the study were then shown a series of different road signs starting at a long distance and then brought slowly closer until they could be understood, to establish the maximum distance the signs could be read.
A student was looking at the data and decided to see how representative the ages of the people recruited were of the general population.
The data is in the file Age.csv, which contains the variable:
| Variable | Description |
|---|---|
| age | The age of the study participant (in years). |
Instructions:
The student is interested if the researchers had collected a representative sample of people of varying ages.
Age.df = read.csv("Age.csv", header = TRUE)
age=Age.df$age
stripchart(age, method = "stack", pch = 1, main = "Ages of volunteers")
hist(age)
summary(age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.00 33.00 53.00 50.51 66.00 82.00
Formulas: \(T = \frac{\bar{y}-\mu_0}{se(\bar{y})}\) and 95% confidence interval \(\bar{y} \pm t_{df, 0.975} \times se(\bar{y})\)
NOTES: The R code mean(y) calculates \(\bar{y}\). The standard error is \(se(\bar{y}) = \frac{s}{\sqrt{n}}\) where
\(s\) is the standard deviation of
\(y\) and is calculated by
sd(y), and \(n\) is the
number of data-points calculated by length(y). The degrees
of freedom is \(df = n-1\). The \(t_{df,0.975}\) multiplier is given by the R
code qt(0.975, df).
( mn_age = mean(Age.df$age) ) # Sample Mean
## [1] 50.50562
( sd_age = sd(Age.df$age) ) # Sample standard deviation
## [1] 18.54302
( n_age = length(Age.df$age) ) # Sample size
## [1] 89
(( tmult_age = qt(1 - 0.05/2, df = n_age - 1) ) ) # t-multiplier
## [1] 1.98729
( CI_age = mn_age + tmult_age * c(-1, 1) * sd_age/sqrt(n_age) ) # Confidence Interval
## [1] 46.59949 54.41175
( se_age = sd_age/sqrt(n_age) ) # Standard error
## [1] 1.965557
# t-statistic for H0: mu=0:
(t_stat_age = (mn_age - 0)/(se_age) ) # t-stat
## [1] 25.69533
# 95% confidence interval for the mean:
( CI_age = mn_age + tmult_age * c(-1, 1) * sd_age/sqrt(n_age) ) # Confidence Interval
## [1] 46.59949 54.41175
t.test(age,data=Age.df)
##
## One Sample t-test
##
## data: age
## t = 25.695, df = 88, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 46.59949 54.41175
## sample estimates:
## mean of x
## 50.50562
Note: You should get exactly the same results from the manual calculations and using the \(t.test\) function. Doing this was to give you practice using some R code. The \(t.test\) function also delivers the p-value that we did not calculate above.
Age.lm=lm(age~1,data=Age.df)
normcheck(Age.lm)
cooks20x(Age.lm)
summary(Age.lm)
##
## Call:
## lm(formula = age ~ 1, data = Age.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32.506 -17.506 2.494 15.494 31.494
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 50.506 1.966 25.7 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.54 on 88 degrees of freedom
confint(Age.lm)
## 2.5 % 97.5 %
## (Intercept) 46.59949 54.41175
It is not of any interest because the actual mean age must be greater than zero. It must be greater than zero because we are using adult participants who will all be older than 0 years old. so it’s an uninteresting test to interpret because it’s substantively meaningless.
t.test (Age.df$age, mu = 39.4)
##
## One Sample t-test
##
## data: Age.df$age
## t = 5.6501, df = 88, p-value = 1.945e-07
## alternative hypothesis: true mean is not equal to 39.4
## 95 percent confidence interval:
## 46.59949 54.41175
## sample estimates:
## mean of x
## 50.50562
We can reject the hypthothesis that mu = 39.4 because 39.4 is outside of the bounds of the 95 percent confidence interval. And this hypothesis is not sensible because there is a mismatch between the sample and the greater nz population. mu =39.4 makes sense if the sample was representative of the new zealand population but since it’s not it is not a sensible hypothesis.
These data come from a random sample of 89 drivers. As it is a single quantitative variable, we have fitted a null linear model - a one sample t-test and confidence interval. The age distribution is not normally distributed as it is flat with very short tails. As our data is symmetric about the mean value and as we have a reasonable sample size of \(n=89\) we may use the CLT to talk make inference about the average age of drivers. We have independence from taking a sample and there are no unduly influential points.
Our preferred model is: \(Age_i = \mu + \epsilon_i\) where \(\epsilon_i \sim iid ~ N(0,\sigma^2)\)
\(\mu\) is the average age of the drivers.
We estimate that the average age of the volunteers from the population that sample was drawn from is between 47 years old and 54 years old.
31 out of 89 or 34.8 % (1d.p.) of the volunteers used in this study were over 60 years old. This is more than the new zealand population which has 21.4% of people older than 60 years. The researcher may have recruited a sample with a greater proportion of older volunteers because the researcher was especially concerned about if older new zealanders could read road signs.
A researcher at the United Kingdom’s Office of National Statistics was interested in how happiness was related to age. In particular, we want to predict the percentage of 25 year olds who are happy. A random sample of people were selected from a large survey where various data was collected from UK citizens. This included age group and their current level of happiness. For the purpose of this analysis, we will treat the age group data as the mid-point value of the group and use it as an explanatory variable. The response is the percentage of people in the group who said they were either Happy or Very Happy.
The data is in the file carpets.csv, which contains the variables:
| Variable | Description |
|---|---|
| Age | The midpoint age of the age group (in years), |
| Happy | The percentage of people in the group that said they were either Happy or Very Happy. |
Instructions:
We are interested in how happiness was related to age. In particular, we want to predict the percentage of 25 year olds who are happy or very happy.
Happy.df=read.csv("happy.csv",header=T)
plot(Happy~Age, main="Percentage of Happy People by Age", data=Happy.df)
Happiness tends to decrease with age until it reaches a minimum just after 52 years old. Then it bounces back and continues to increase through the 60s and 70s. Looks like a curved relationship.
# Let's make this graph less depressing
plot(Happy~Age, ylim=c(0,100),main="Happier Perspective",data=Happy.df)
changing the scale of the y-axis makes the dip in the late 40s look less drastic. It still looks like a curved relationship.
## Fit an appropriate linear model, including model checks.
HappyAge.fit = lm(Happy ~ Age, data = Happy.df)
plot(HappyAge.fit, which = 1) # We can use this code as an alternative to eovcheck
HappyAge.fit2 = lm(Happy ~ Age + I(Age^2), data = Happy.df)
plot(HappyAge.fit2, which = 1)
normcheck(HappyAge.fit2)
cooks20x(HappyAge.fit2)
summary(HappyAge.fit2)
##
## Call:
## lm(formula = Happy ~ Age + I(Age^2), data = Happy.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.3398 -0.3527 0.1790 0.4677 1.1011
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 95.062692 2.318504 41.002 1.38e-10 ***
## Age -0.854527 0.104786 -8.155 3.80e-05 ***
## I(Age^2) 0.008662 0.001091 7.937 4.62e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7991 on 8 degrees of freedom
## Multiple R-squared: 0.8938, Adjusted R-squared: 0.8672
## F-statistic: 33.66 on 2 and 8 DF, p-value: 0.0001273
confint(HappyAge.fit2)
## 2.5 % 97.5 %
## (Intercept) 89.716211426 100.40917319
## Age -1.096164317 -0.61288930
## I(Age^2) 0.006145493 0.01117852
plot(Happy~Age, main="Percentage of Happy People by Age", data=Happy.df)
x = 22.5:72.5
HappyAge.fit = lm(Happy ~ Age, data = Happy.df)
##abline(HappyAge.fit, lty = 2, col ="blue")
lines(x, predict(HappyAge.fit, data.frame(Age = x)), col = "blue")
lines(x, predict(HappyAge.fit2, data.frame(Age = x)), col = "red")
The scatter plot of Happiness vs Age suggests curvature in the relationship. We began with a linear model to describe this relationship but the residual plot showed strong curvature. And a visual comparison of the fitted lines shows that a quadratic term is appropriate. So, we added a quadratic term to the linear model.
Our final model is:
\(\beta_0 + \beta_1 \times Age_i + \beta_2 \times Age_i^2 + \epsilon_i\)
The residuals for this model are approximately normal and there are no outliers.
Our model explained 89% of the variability in happiness for the UK citizens from the sample.
predAge.df = data.frame(Age = c(25))
predict(HappyAge.fit2, predAge.df, interval = "prediction")
## fit lwr upr
## 1 79.11328 76.93481 81.29174
We estimate that for 25 year olds, the expected happiness percentage will range from 77% to 81%.
1.3 Comment on the plots/exploratory data analysis
The Ages are centered just above 50 years old. The distribution is roughly symmetrical.