1 Question 1 [12 Marks]

Eirwyn enjoys going to the races most weekends. Each time he takes along $50, makes ten bets of $5 each (on ten different races), and meticulously records how much money he returns home with each time.

The resulting data is in the file Races.csv, which contains the variable:

Variable  Description
Return    The amount of money Eirwyn returned home with each week (in dollars) over a period of three years. For example, if Return = 0 for a particular week, Eirwyn lost all $50, but if Return = 75, he made a profit of $25.

Eirwyn has posed a number of questions.

Instructions:

1.1 Question of interest/goal of the study

We are interested in how good Eirwyn is at gambling on horses. What is his usual return on $50 worth of bets he usually makes on a weekend? Does he tend to make a profit? Does he tend to do better than the average gambler?

Eirwyn’s average return from $50 in weekend bets is about $45.58, an average loss of about $4.42 per week (95% CI for the mean return: $41.75 to $49.41). A t-test against the break-even point of $50 confirms this loss is statistically significant, so he does not tend to make a profit. However, compared with the bookmaker’s benchmark of $40 (the average gambler’s return), his mean return is significantly higher, showing he performs better than the average gambler. Since his loss is well below his $10/week limit and he enjoys attending the races, his results meet his own criteria for continuing.

1.2 Read in and inspect the data:

races.df = read.csv("Races.csv", header = TRUE)  # file name as given in the question (matters on case-sensitive systems)
Return=races.df$Return
stripchart(Return, method = "stack", pch = 1, main = "Weekly return")

summary(Return)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   29.70   46.00   45.58   61.60  134.20

1.3 Comment on the plot/exploratory data analysis

The distribution of the amount of money Eirwyn returns home with is right skewed. It is centred around $45 to $46 (mean $45.58, median $46.00), so a small loss on a typical week. There appear to be four weeks where he lost all his money at the races. The best week was a return of $134.20, so a profit of $84.20.

1.4 Use R-code to manually calculate the t-statistic and 95% confidence interval for the default t-test with mu=0.

Formulas: \(T = \frac{\bar{y}-\mu_0}{se(\bar{y})}\) and 95% confidence interval \(\bar{y} \pm t_{df, 0.975} \times se(\bar{y})\)

NOTES: The R code mean(y) calculates \(\bar{y}\). The standard error is \(se(\bar{y}) = \frac{s}{\sqrt{n}}\) where \(s\) is the standard deviation of \(y\) and is calculated by sd(y), and \(n\) is the number of data-points calculated by length(y). The degrees of freedom is \(df = n-1\). The \(t_{df,0.975}\) multiplier is given by the R code qt(0.975, df).

# t-statistic for H0: mu=0:
ybar <- mean(Return)
s <- sd(Return)
n <- length(Return)
se <- s / sqrt(n)
t_stat <- (ybar - 0) / se
t_stat
## [1] 23.54
# 95% confidence interval for the mean:
df <- n - 1
t_mult <- qt(0.975, df)
lower_ci <- ybar - t_mult * se
upper_ci <- ybar + t_mult * se
c(lower_ci, upper_ci)
## [1] 41.75070 49.40674

1.5 Repeat the same calculation using the t.test function (done for you):

t.test(Return)
## 
##  One Sample t-test
## 
## data:  Return
## t = 23.54, df = 140, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  41.75070 49.40674
## sample estimates:
## mean of x 
##  45.57872

Note: You should get exactly the same results from the manual calculations and using the t.test function. Doing this was to give you practice using some R code. The t.test function also delivers the p-value that we did not calculate above.
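
The agreement described in this note, including the p-value, can be checked programmatically. The sketch below is illustrative only: it uses simulated right-skewed data as a stand-in for Return (the vector y and its gamma parameters are assumptions, not the Races.csv data), so the numbers it prints will differ from those above.

```r
# Simulated stand-in for the Return column (hypothetical data, not Races.csv)
set.seed(1)
y <- rgamma(141, shape = 4, scale = 11.5)

# Manual calculations, following the formulas in 1.4
n  <- length(y)
se <- sd(y) / sqrt(n)
t_manual  <- (mean(y) - 0) / se
p_manual  <- 2 * pt(-abs(t_manual), df = n - 1)          # two-sided p-value
ci_manual <- mean(y) + c(-1, 1) * qt(0.975, n - 1) * se  # 95% CI

# Built-in test with the default mu = 0
tt <- t.test(y)

# TRUE for each quantity that matches up to floating-point tolerance
agree <- c(
  t  = isTRUE(all.equal(unname(tt$statistic), t_manual)),
  p  = isTRUE(all.equal(tt$p.value, p_manual)),
  ci = isTRUE(all.equal(as.numeric(tt$conf.int), ci_manual))
)
print(agree)
```

Replacing y with the real Return vector should give the same three TRUE values.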

1.6 Fit and check the null model (done for you):

library(s20x)  # provides normcheck() and cooks20x()
Return.lm=lm(Return~1)
normcheck(Return.lm)

cooks20x(Return.lm)

summary(Return.lm)
## 
## Call:
## lm(formula = Return ~ 1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -45.579 -15.879   0.421  16.021  88.621 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   45.579      1.936   23.54   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.99 on 140 degrees of freedom
confint(Return.lm)
##               2.5 %   97.5 %
## (Intercept) 41.7507 49.40674
50-confint(Return.lm)
##                2.5 %    97.5 %
## (Intercept) 8.249296 0.5932572
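
Note that subtracting the interval from 50 flips its ends, so the column labelled 2.5 % holds the larger loss. A minimal sketch of one way to report the loss interval in ascending order follows; it uses simulated data (Return_sim is a hypothetical stand-in, not the Races.csv values).

```r
# Simulated stand-in for Return (hypothetical data)
set.seed(2)
Return_sim <- rgamma(141, shape = 4, scale = 11.5)

fit <- lm(Return_sim ~ 1)          # null model: intercept = mean return
ci_return <- confint(fit)          # 1 x 2 matrix: lower, upper

# Loss = 50 - return, so the bounds swap; rev() restores lower < upper
ci_loss <- rev(50 - as.numeric(ci_return))
names(ci_loss) <- c("lower", "upper")
ci_loss
```

Applied to the output above, the same step would report the mean loss interval as 0.59 to 8.25 dollars.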

1.7 Why is the default hypothesis test using mu = 0 not of any interest or use in this case?

Testing μ = 0 isn’t useful here because $0 would mean Eirwyn lost all his $50 every single week, which isn’t the question we’re trying to answer. What matters is whether he breaks even (μ = 50) or does better than the average gambler (μ = 40). Comparing his results to $0 doesn’t tell us anything about either of those.

1.8 Adjust and repeat the t-tests twice to get P-values for the first two of Eirwyn’s questions (i.e. use different values of mu for the t-test that are more relevant for the questions.) Do not use one-sided tests:

1.9 Two relevant t-tests (two-sided by default)

# data summaries
ybar <- mean(Return); s <- sd(Return); n <- length(Return)
se <- s/sqrt(n); df <- n - 1

# helper to print manual t and p
test_mu <- function(mu0, label){
  t_stat <- (ybar - mu0) / se
  p_val  <- 2 * pt(-abs(t_stat), df)   # two-sided
  cat("\n", label, "\n",
      "mu0 =", mu0,
      "\n  t =", round(t_stat, 3),
      " df =", df,
      "  two-sided p =", signif(p_val, 4), "\n", sep = "")
}

# Q1: Break-even? (mu = 50)
test_mu(50, "Test vs break-even (mu=50)")
## 
## Test vs break-even (mu=50)
## mu0 =50
##   t =-2.283 df =140  two-sided p =0.02391
print(t.test(Return, mu = 50))  # check with built-in
## 
##  One Sample t-test
## 
## data:  Return
## t = -2.2835, df = 140, p-value = 0.02391
## alternative hypothesis: true mean is not equal to 50
## 95 percent confidence interval:
##  41.75070 49.40674
## sample estimates:
## mean of x 
##  45.57872
# Q2: Better than average gambler? (mu = 40)
test_mu(40, "Test vs average gambler (mu=40)")
## 
## Test vs average gambler (mu=40)
## mu0 =40
##   t =2.881 df =140  two-sided p =0.004585
print(t.test(Return, mu = 40))  # check with built-in
## 
##  One Sample t-test
## 
## data:  Return
## t = 2.8812, df = 140, p-value = 0.004585
## alternative hypothesis: true mean is not equal to 40
## 95 percent confidence interval:
##  41.75070 49.40674
## sample estimates:
## mean of x 
##  45.57872
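
As a rough hand check of the two t-statistics, the rounded estimate and standard error from the summary(Return.lm) output (45.579 and 1.936) can be substituted into the formula from 1.4. Because the inputs are rounded, the results agree with the R output only to about two decimal places.

```latex
\[
T_{50} = \frac{\bar{y} - 50}{se(\bar{y})} = \frac{45.579 - 50}{1.936} = \frac{-4.421}{1.936} \approx -2.28
\]
\[
T_{40} = \frac{\bar{y} - 40}{se(\bar{y})} = \frac{45.579 - 40}{1.936} = \frac{5.579}{1.936} \approx 2.88
\]
```

Only the numerator changes with the hypothesised value; the standard error and the degrees of freedom (140) are the same for every choice of \(\mu_0\), which is why the confidence interval is identical in both outputs.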

1.10 Method and Assumption Checks

These data come from a sample of weekend returns from bets. As there is a single quantitative variable, we have fitted a null linear model, which is equivalent to a one-sample t-test and confidence interval. The returns are not normally distributed, as they are somewhat right skewed. However, we have a large sample size (141), so we can rely on the Central Limit Theorem to make inferences about the average return. We have assumed independence, though we don’t know if there are any patterns to the gambling from week to week. There are no unduly influential points.
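
The independence assumption could be probed informally if the rows of the file are in week order, which is an assumption here since the data description does not say so. The sketch below uses simulated data (Return_sim is hypothetical) and base R only: it plots returns in sequence and computes the lag-1 autocorrelation.

```r
# Simulated stand-in for Return, assumed to be in week order (hypothetical)
set.seed(3)
Return_sim <- rgamma(141, shape = 4, scale = 11.5)
week <- seq_along(Return_sim)

# Returns in time order: look for trends, runs or changes in spread
plot(week, Return_sim, type = "b", pch = 1,
     xlab = "Week (row order)", ylab = "Return ($)",
     main = "Weekly return in sequence")
abline(h = mean(Return_sim), lty = 2)

# Lag-1 autocorrelation: values near 0 are consistent with independence
lag1 <- acf(Return_sim, plot = FALSE)$acf[2]
lag1
```

A clear trend in the plot or a lag-1 value well away from zero would suggest that successive weeks are related and that the standard error from the null model may be misleading.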

Our preferred model is: \(return_i = \mu + \epsilon_i\) where \(\epsilon_i \sim iid ~ N(0,\sigma^2)\)

\(\mu\) is the average weekly return.

1.11 Write an appropriate Executive Summary. Note: carefully read the question of interest.

Analysis of 141 weekends of data in Races.csv shows Eirwyn’s average return from a $50 stake is $45.58, meaning he loses about $4.42 per weekend on average (95% CI for the mean loss: $0.59 to $8.25). This is significantly below the break-even point of $50 (t = -2.28, df = 140, p = 0.024), so he does not make a profit. However, his mean return is significantly higher than $40, the bookmaker’s benchmark for the average gambler (t = 2.88, df = 140, p = 0.0046), indicating he performs better than average. Since he enjoys attending the races, loses well under his $10 per week limit, and outperforms the typical gambler, his results suggest it is reasonable for him to continue with his current betting strategy.


2 Question 2 [14 Marks]

A sample of old carpets from the Hungarian Museum of Industrial Arts and the Hungarian National Museum was analysed. The age of these carpets was well documented. We wish to find the relationship between the level of tyrosine (an amino acid) in a carpet’s wool fibres and the carpet’s age, so it can be used to help predict the age of other carpets.

The data are in the file carpet.csv, which contains the variables:

Variable  Description
age       The age of the carpet (in years).
tyr       The level of tyrosine (measured in grams of tyrosine per 1000 grams of fibre).

Instructions:

2.1 Question of interest/goal of the study

We are interested in building a model to predict the age of old carpets from their tyrosine content, particularly to predict the ages of carpets with tyrosine levels 9.90 and 14.16.

2.2 Read in and inspect the data:

carpet.df = read.csv("carpet.csv", header = TRUE)
plot(age ~ tyr, data = carpet.df)

2.3 Comment on the plot

The plot shows a clear non-linear relationship between tyrosine level and carpet age. Carpets with low tyrosine levels tend to be much older, while those with higher tyrosine levels are younger. The decline in age with increasing tyrosine is steep at first and then flattens out, suggesting a curved (quadratic) relationship rather than a straight line. There are no obvious extreme outliers, but the spread of ages is larger for low tyrosine levels.

2.4 Fit an appropriate linear model, including model checks.

# Fit a simple linear model
carpet.lm <- lm(age ~ tyr, data = carpet.df)

# Summary of the model
summary(carpet.lm)
## 
## Call:
## lm(formula = age ~ tyr, data = carpet.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -221.219  -58.928   -1.503   46.552  270.132 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1559.951     32.885   47.44   <2e-16 ***
## tyr          -83.230      3.366  -24.73   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 117.5 on 21 degrees of freedom
## Multiple R-squared:  0.9668, Adjusted R-squared:  0.9652 
## F-statistic: 611.3 on 1 and 21 DF,  p-value: < 2.2e-16
# Diagnostic checks
par(mfrow = c(2, 2))
plot(carpet.lm)

par(mfrow = c(1, 1))

# Normality check
library(s20x)
normcheck(carpet.lm)

# Check for influential points
cooks20x(carpet.lm)

2.5 Comment why a quadratic model was more appropriate for this data.

A quadratic model is more appropriate because the scatterplot shows a clear curved relationship between tyrosine level and carpet age — the decline in age is steep for low tyrosine values and then levels off as tyrosine increases. The residual plots from the simple linear model also display a curved pattern, indicating the linear model systematically underestimates age at both low and high tyrosine levels and overestimates age in the middle range. Including a squared term for tyrosine accounts for this curvature, leading to a better fit and more accurate predictions.
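
One way to back up this argument numerically is to test whether the squared term improves on the straight line. The sketch below is illustrative only: sim.df is simulated curved data with invented coefficients, not the carpet data, so its output says nothing about the real carpets.

```r
# Simulated curved data standing in for carpet.df (hypothetical values)
set.seed(4)
tyr <- runif(23, 2, 18)
age <- 2000 - 200 * tyr + 6 * tyr^2 + rnorm(23, sd = 60)
sim.df <- data.frame(tyr, age)

lin.fit  <- lm(age ~ tyr, data = sim.df)
quad.fit <- lm(age ~ tyr + I(tyr^2), data = sim.df)

# Residuals from the straight line: a U shape signals missed curvature
plot(fitted(lin.fit), residuals(lin.fit),
     xlab = "Fitted age", ylab = "Residual",
     main = "Linear model residuals")
abline(h = 0, lty = 2)

# F-test comparing the nested models, and the p-value of the squared term
print(anova(lin.fit, quad.fit))
p_quad <- summary(quad.fit)$coefficients["I(tyr^2)", "Pr(>|t|)"]
p_quad
```

With the real data, a small p-value for I(tyr^2) would support keeping the quadratic term.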

2.6 Plot the data with your appropriate model superimposed over it.

# Fit quadratic model
carpet.quad <- lm(age ~ tyr + I(tyr^2), data = carpet.df)

# Scatterplot of data
plot(age ~ tyr, data = carpet.df,
     xlab = "Tyrosine level (g per 1000g fibre)",
     ylab = "Carpet age (years)",
     main = "Carpet age vs Tyrosine level")

# Add quadratic curve
tyr.seq <- seq(min(carpet.df$tyr), max(carpet.df$tyr), length.out = 200)
pred.age <- predict(carpet.quad, newdata = data.frame(tyr = tyr.seq))
lines(tyr.seq, pred.age, col = "blue", lwd = 2)

2.7 Write appropriate Methods and Assumption Checks.
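
The checks this section relies on should be made on the quadratic model rather than the straight line. A minimal sketch of those checks in base R follows, as an alternative to the s20x helpers; it uses simulated data (sim.df and its coefficients are hypothetical), so the plots illustrate the procedure, not the carpet results.

```r
# Simulated curved data standing in for carpet.df (hypothetical values)
set.seed(6)
tyr <- runif(23, 2, 18)
age <- 2000 - 200 * tyr + 6 * tyr^2 + rnorm(23, sd = 60)
sim.df <- data.frame(tyr, age)
quad.fit <- lm(age ~ tyr + I(tyr^2), data = sim.df)

par(mfrow = c(1, 3))

# 1. Residuals vs fitted: want no remaining curve and roughly constant spread
plot(fitted(quad.fit), residuals(quad.fit),
     xlab = "Fitted age", ylab = "Residual", main = "Residuals vs fitted")
abline(h = 0, lty = 2)

# 2. Normal Q-Q plot of residuals: points should follow the line
qqnorm(residuals(quad.fit))
qqline(residuals(quad.fit))

# 3. Cook's distance: look for points that stand well above the rest
cd <- cooks.distance(quad.fit)
plot(cd, type = "h", xlab = "Observation", ylab = "Cook's distance",
     main = "Influence")

par(mfrow = c(1, 1))
max(cd)
```

Running the same three plots on carpet.quad would give the evidence needed to write up this section.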

2.8 Use your model to predict the age of a carpet with tyrosine level 13.25.

# Predict age for tyr = 13.25 with a 95% prediction interval
predict(carpet.quad,
        newdata = data.frame(tyr = 13.25),
        interval = "prediction",
        level = 0.95)
##        fit      lwr      upr
## 1 336.9969 212.2285 461.7653

2.9 Write a sentence, as if for an Executive Summary, interpreting the requested prediction interval.

For a carpet with a tyrosine level of 13.25 g per 1000 g fibre, the model predicts an age of about 337 years, with a 95% prediction interval from 212 to 462 years, meaning we are 95% confident that the true age of a similar carpet will fall within this range.
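
The wording above refers to an individual carpet, which is why a prediction interval is used instead of a confidence interval for the mean age. The sketch below contrasts the two on simulated data (sim.df and its coefficients are hypothetical, not the carpet data).

```r
# Simulated curved data standing in for carpet.df (hypothetical values)
set.seed(5)
tyr <- runif(23, 2, 18)
age <- 2000 - 200 * tyr + 6 * tyr^2 + rnorm(23, sd = 60)
sim.df <- data.frame(tyr, age)
quad.fit <- lm(age ~ tyr + I(tyr^2), data = sim.df)

new.df <- data.frame(tyr = 13.25)

# Interval for the MEAN age of all carpets with this tyrosine level
conf.int <- predict(quad.fit, newdata = new.df, interval = "confidence")

# Interval for the age of ONE new carpet: adds the residual scatter
pred.int <- predict(quad.fit, newdata = new.df, interval = "prediction")

rbind(confidence = conf.int[1, ], prediction = pred.int[1, ])
```

Both intervals share the same fitted value, but the prediction interval is always the wider of the two.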

2.10 Some additional useful information

summary(carpet.df$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     120     350    1400    1018    1550    1750

2.11 Comment on how useful the model is for prediction.

The quadratic model explains most of the variation in carpet age from tyrosine level and the fitted curve follows the overall trend in the data well, so it is useful for predicting ages within the observed tyrosine range. However, the prediction intervals are wide: at a tyrosine level of 13.25 the 95% interval runs from 212 to 462 years, so an individual prediction is only accurate to within roughly ±125 years, for carpets whose ages range from 120 to 1750 years. Predictions are also less reliable at the extremes, where data are sparse, so the model should be used with caution for tyrosine values far from the central range of the data.
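
The caution about values outside the data can be built into the prediction step. The helper below, predict_with_range_check, is a hypothetical convenience function written for this illustration (it is not part of base R or s20x), and it is shown on simulated data, not the carpet data.

```r
# Simulated curved data standing in for carpet.df (hypothetical values)
set.seed(7)
tyr <- runif(23, 2, 18)
age <- 2000 - 200 * tyr + 6 * tyr^2 + rnorm(23, sd = 60)
sim.df <- data.frame(tyr, age)
quad.fit <- lm(age ~ tyr + I(tyr^2), data = sim.df)

# Hypothetical helper: warn when asked to extrapolate, then predict as usual
predict_with_range_check <- function(fit, observed_tyr, new_tyr) {
  rng <- range(observed_tyr)
  if (any(new_tyr < rng[1] | new_tyr > rng[2])) {
    warning("tyr outside the observed range [", round(rng[1], 2), ", ",
            round(rng[2], 2), "]: prediction is an extrapolation")
  }
  predict(fit, newdata = data.frame(tyr = new_tyr), interval = "prediction")
}

# A value inside the simulated range predicts without a warning
inside <- predict_with_range_check(quad.fit, sim.df$tyr, 10)
inside
```

Calling the helper with a tyrosine level beyond the largest observed value would still return a prediction, but with a warning that the quadratic curve is being extended past the data.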