Eirwyn enjoys going to the races most weekends. Each time he takes along $50, makes ten bets of $5 each (on ten different races), and meticulously records how much money he returns home with each time.
The resulting data is in the file Races.csv, which contains the variable:
| Variable | Description |
|---|---|
| Return | The amount of money Eirwyn returned home with each week (in dollars) over a period of three years. For example, if Return = 0 for a particular week, Eirwyn lost all $50, but if Return = 75, he made a profit of $25. |
Eirwyn has posed a number of questions.
Instructions:
We are interested in how good Eirwyn is at gambling on horses. What is his typical return on the $50 worth of bets he makes each weekend? Does he tend to make a profit? Does he tend to do better than the average gambler?
Eirwyn’s average return from $50 in weekend bets is about $45.58, an average loss of about $4.42 per week (95% CI for the mean return: $41.75 to $49.41). A t-test against the break-even point of $50 shows this loss is statistically significant (p = 0.024), so he does not tend to make a profit. However, his mean return is significantly higher than the bookmaker’s benchmark of $40 for the average gambler (p = 0.0046), so he performs better than the average gambler. Since his average loss is below his $10/week limit and he enjoys attending the races, his results meet his own criteria for continuing.
races.df = read.csv("races.csv", header = TRUE)
Return=races.df$Return
stripchart(Return, method = "stack", pch = 1, main = "Weekly return")
summary(Return)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 29.70 46.00 45.58 61.60 134.20
Formulas: \(T = \frac{\bar{y}-\mu_0}{se(\bar{y})}\) and 95% confidence interval \(\bar{y} \pm t_{df, 0.975} \times se(\bar{y})\)
NOTES: The R code mean(y) calculates \(\bar{y}\). The standard error is \(se(\bar{y}) = \frac{s}{\sqrt{n}}\), where \(s\) is the standard deviation of \(y\), calculated by sd(y), and \(n\) is the number of data points, calculated by length(y). The degrees of freedom are \(df = n-1\). The \(t_{df,0.975}\) multiplier is given by the R code qt(0.975, df).
# t-statistic for H0: mu=0:
ybar <- mean(Return)
s <- sd(Return)
n <- length(Return)
se <- s / sqrt(n)
t_stat <- (ybar - 0) / se
t_stat
## [1] 23.54
# 95% confidence interval for the mean:
df <- n - 1
t_mult <- qt(0.975, df)
lower_ci <- ybar - t_mult * se
upper_ci <- ybar + t_mult * se
c(lower_ci, upper_ci)
## [1] 41.75070 49.40674
t.test(Return)
##
## One Sample t-test
##
## data: Return
## t = 23.54, df = 140, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 41.75070 49.40674
## sample estimates:
## mean of x
## 45.57872
Note: You should get exactly the same results from the manual calculations and from the t.test function. The manual version was to give you practice using some R code. The t.test function also reports the p-value, which we did not calculate above.
library(s20x) # provides normcheck() and cooks20x()
Return.lm=lm(Return~1)
normcheck(Return.lm)
cooks20x(Return.lm)
summary(Return.lm)
##
## Call:
## lm(formula = Return ~ 1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.579 -15.879 0.421 16.021 88.621
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.579 1.936 23.54 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.99 on 140 degrees of freedom
confint(Return.lm)
## 2.5 % 97.5 %
## (Intercept) 41.7507 49.40674
50-confint(Return.lm)
## 2.5 % 97.5 %
## (Intercept) 8.249296 0.5932572
Testing μ = 0 isn’t useful here because $0 would mean Eirwyn lost all his $50 every single week, which isn’t the question we’re trying to answer. What matters is whether he breaks even (μ = 50) or does better than the average gambler (μ = 40). Comparing his results to $0 doesn’t tell us anything about either of those.
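As a worked check, the two test statistics of interest can be computed by hand from the formula \(T = \frac{\bar{y}-\mu_0}{se(\bar{y})}\), using the sample mean 45.579 and standard error 1.936 reported in the linear model summary. The inputs are rounded, so the last digit may differ slightly from R's output:

\[ T_{50} = \frac{45.579 - 50}{1.936} \approx -2.28, \qquad T_{40} = \frac{45.579 - 40}{1.936} \approx 2.88 \]

Both are compared against a t-distribution with \(df = 141 - 1 = 140\).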
# data summaries
ybar <- mean(Return); s <- sd(Return); n <- length(Return)
se <- s/sqrt(n); df <- n - 1
# helper to print manual t and p
test_mu <- function(mu0, label){
  t_stat <- (ybar - mu0) / se
  p_val <- 2 * pt(-abs(t_stat), df) # two-sided
  cat("\n", label, "\n",
      "mu0 =", mu0,
      "\n t =", round(t_stat, 3),
      " df =", df,
      " two-sided p =", signif(p_val, 4), "\n", sep = "")
}
# Q1: Break-even? (mu = 50)
test_mu(50, "Test vs break-even (mu=50)")
##
## Test vs break-even (mu=50)
## mu0 =50
## t =-2.283 df =140 two-sided p =0.02391
print(t.test(Return, mu = 50)) # check with built-in
##
## One Sample t-test
##
## data: Return
## t = -2.2835, df = 140, p-value = 0.02391
## alternative hypothesis: true mean is not equal to 50
## 95 percent confidence interval:
## 41.75070 49.40674
## sample estimates:
## mean of x
## 45.57872
# Q2: Better than average gambler? (mu = 40)
test_mu(40, "Test vs average gambler (mu=40)")
##
## Test vs average gambler (mu=40)
## mu0 =40
## t =2.881 df =140 two-sided p =0.004585
print(t.test(Return, mu = 40)) # check with built-in
##
## One Sample t-test
##
## data: Return
## t = 2.8812, df = 140, p-value = 0.004585
## alternative hypothesis: true mean is not equal to 40
## 95 percent confidence interval:
## 41.75070 49.40674
## sample estimates:
## mean of x
## 45.57872
These data come from a sample of weekend returns from bets. As there is a single quantitative variable, we have fitted a null linear model, equivalent to a one-sample t-test and confidence interval. The returns are not normally distributed, as they are somewhat right skewed. However, the sample size is large (141), so we can rely on the Central Limit Theorem to make inferences about the average return. We have assumed independence, though we don’t know whether there are any patterns in the gambling from week to week. There are no unduly influential points.
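The appeal to the Central Limit Theorem can be illustrated with a small simulation. This is only a sketch: the gamma population, its parameters, and the helper name skewness are assumptions chosen to mimic a right-skewed variable with a mean near 45, and none of it uses Eirwyn's actual returns. It compares the skewness of individual values with the skewness of means of samples of size 141.

```r
# Simulated illustration only: these are NOT Eirwyn's returns.
set.seed(1)
n <- 141       # same sample size as the races data
n_rep <- 5000  # number of simulated samples

# Sample skewness (third standardised moment); hypothetical helper
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

# Right-skewed stand-in population: gamma with mean 4 * 11.25 = 45
draws <- matrix(rgamma(n * n_rep, shape = 4, scale = 11.25), nrow = n_rep)
sample_means <- rowMeans(draws)  # one mean per simulated sample of size n

skew_raw <- skewness(as.vector(draws))  # skewness of individual values
skew_means <- skewness(sample_means)    # skewness of the sample means
c(raw = skew_raw, means = skew_means)
```

If the sample means are close to symmetric even though the individual values are clearly skewed, that supports using t-based inference for the mean at this sample size. It does not address the independence assumption.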
Our preferred model is: \(return_i = \mu + \epsilon_i\) where \(\epsilon_i \sim iid ~ N(0,\sigma^2)\)
\(\mu\) is the average weekly return.
Analysis of 141 weekends of data in Races.csv shows Eirwyn’s average return from a $50 stake is $45.58, meaning he loses about $4.42 per weekend on average. This is significantly below the break-even point of $50 (t = -2.28, df = 140, p = 0.024), so on average he does not make a profit. However, his mean return is significantly higher than $40, the bookmaker’s benchmark for the average gambler (t = 2.88, df = 140, p = 0.0046), indicating he performs better than average. Since he enjoys attending the races, loses less than his $10 per week limit on average, and outperforms the typical gambler, his results suggest it is reasonable for him to continue with his current betting strategy.
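The confidence interval for the mean return converts directly into one for the mean weekly loss by subtracting each bound from the $50 stake, which is what the 50-confint(Return.lm) output shows:

\[ 50 - 49.41 = 0.59, \qquad 50 - 41.75 = 8.25 \]

So, with 95% confidence, Eirwyn's average loss is somewhere between about $0.59 and $8.25 per weekend, an interval that lies entirely below the $10 limit.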
A sample of old carpets from the Hungarian Museum of Industrial Arts and the Hungarian National Museum was analysed. The ages of these carpets were well documented. We wish to find the relationship between the level of tyrosine (an amino acid) in a carpet’s wool fibres and the carpet’s age, so that it can be used to help predict the age of other carpets.
The data is in the file carpets.csv, which contains the variables:
| Variable | Description |
|---|---|
| age | The age of the carpet (in years). |
| tyr | The level of tyrosine (measured in grams of tyrosine per 1000 grams of fibre). |
Instructions:
We are interested in building a model to predict the age of old carpets from their tyrosine content, particularly to predict the ages of carpets with tyrosine levels 9.90 and 14.16.
carpet.df = read.csv("carpet.csv", header = TRUE)
plot(age ~ tyr, data = carpet.df)
The plot shows a clear non-linear relationship between tyrosine level and carpet age. Carpets with low tyrosine levels tend to be much older, while those with higher tyrosine levels are younger. The decline in age with increasing tyrosine is steep at first and then flattens out, suggesting a curved (quadratic) relationship rather than a straight line. There are no obvious extreme outliers, but the spread of ages is larger for low tyrosine levels.
# Fit a simple linear model
carpet.lm <- lm(age ~ tyr, data = carpet.df)
# Summary of the model
summary(carpet.lm)
##
## Call:
## lm(formula = age ~ tyr, data = carpet.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -221.219 -58.928 -1.503 46.552 270.132
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1559.951 32.885 47.44 <2e-16 ***
## tyr -83.230 3.366 -24.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 117.5 on 21 degrees of freedom
## Multiple R-squared: 0.9668, Adjusted R-squared: 0.9652
## F-statistic: 611.3 on 1 and 21 DF, p-value: < 2.2e-16
# Diagnostic checks
par(mfrow = c(2, 2))
plot(carpet.lm)
par(mfrow = c(1, 1))
# Normality check
library(s20x)
normcheck(carpet.lm)
# Check for influential points
cooks20x(carpet.lm)
A quadratic model is more appropriate because the scatterplot shows a clear curved relationship between tyrosine level and carpet age: the decline in age is steep for low tyrosine values and then levels off as tyrosine increases. The residual plots from the simple linear model also display a curved pattern, indicating that the linear model systematically underestimates age at both low and high tyrosine levels and overestimates age in the middle range. Including a squared term for tyrosine accounts for this curvature, leading to a better fit and more accurate predictions.
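The visual argument for curvature can be backed by a formal check of the squared term. The sketch below uses simulated stand-in data, because the carpet measurements are not reproduced here. The data frame sim.df, its coefficients, and the noise level are all assumptions chosen only to give a declining, flattening curve.

```r
# Simulated stand-in data: NOT the museum carpet measurements.
set.seed(42)
sim.df <- data.frame(tyr = runif(30, 2, 18))
sim.df$age <- 1800 - 180 * sim.df$tyr + 5 * sim.df$tyr^2 + rnorm(30, sd = 60)

# Straight-line and quadratic fits to the same data
sim.lin  <- lm(age ~ tyr, data = sim.df)
sim.quad <- lm(age ~ tyr + I(tyr^2), data = sim.df)

# t-test on the squared term: a small p-value is evidence of curvature
quad_p <- summary(sim.quad)$coefficients["I(tyr^2)", "Pr(>|t|)"]
quad_p

# Equivalent nested-model F-test (linear vs quadratic)
anova(sim.lin, sim.quad)

# R-squared can only stay the same or increase when a term is added
c(linear = summary(sim.lin)$r.squared,
  quadratic = summary(sim.quad)$r.squared)
```

With the real data, the same comparison would use carpet.df in place of sim.df. A significant squared term, together with residual plots that no longer show a curve, would justify preferring the quadratic model.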
plot(age ~ tyr, data = carpet.df)
# Fit quadratic model
carpet.quad <- lm(age ~ tyr + I(tyr^2), data = carpet.df)
# Scatterplot of data
plot(age ~ tyr, data = carpet.df,
     xlab = "Tyrosine level (g per 1000g fibre)",
     ylab = "Carpet age (years)",
     main = "Carpet age vs Tyrosine level")
# Add quadratic curve
tyr.seq <- seq(min(carpet.df$tyr), max(carpet.df$tyr), length.out = 200)
pred.age <- predict(carpet.quad, newdata = data.frame(tyr = tyr.seq))
lines(tyr.seq, pred.age, col = "blue", lwd = 2)
# Predict age for tyr = 13.25 with a 95% prediction interval
predict(carpet.quad,
        newdata = data.frame(tyr = 13.25),
        interval = "prediction",
        level = 0.95)
## fit lwr upr
## 1 336.9969 212.2285 461.7653
For a carpet with a tyrosine level of 13.25 g per 1000 g fibre, the model predicts an age of about 337 years, with a 95% prediction interval from about 212 to 462 years. That is, we are 95% confident that the age of an individual carpet with this tyrosine level falls within this range.
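The same predict call accepts several new tyrosine values at once, for example the levels 9.90 and 14.16 named in the instructions. The sketch below shows the mechanics and contrasts prediction intervals (for an individual carpet) with confidence intervals (for the mean age at that tyrosine level). It runs on simulated stand-in data, so sim.df, sim.quad, and the resulting numbers are assumptions for illustration, not results for the real carpets.

```r
# Simulated stand-in data: NOT the museum carpet measurements.
set.seed(7)
sim.df <- data.frame(tyr = runif(30, 2, 18))
sim.df$age <- 1800 - 180 * sim.df$tyr + 5 * sim.df$tyr^2 + rnorm(30, sd = 60)
sim.quad <- lm(age ~ tyr + I(tyr^2), data = sim.df)

# New tyrosine levels to predict for, one row per carpet
new.df <- data.frame(tyr = c(9.90, 14.16))

# Interval for the mean age at each tyrosine level
conf_int <- predict(sim.quad, newdata = new.df,
                    interval = "confidence", level = 0.95)
# Interval for the age of one individual carpet at each level
pred_int <- predict(sim.quad, newdata = new.df,
                    interval = "prediction", level = 0.95)

conf_width <- conf_int[, "upr"] - conf_int[, "lwr"]
pred_width <- pred_int[, "upr"] - pred_int[, "lwr"]
cbind(tyr = new.df$tyr, conf_width, pred_width)
```

With the real data, the same calls would use carpet.quad in place of sim.quad. The prediction interval is always the wider of the two because it includes the scatter of individual carpets around the fitted curve, not only the uncertainty in the curve itself.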
summary(carpet.df$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 120 350 1400 1018 1550 1750
The quadratic model explains most of the variation in carpet age from tyrosine level and the fitted curve follows the overall trend in the data well, making it useful for predicting ages within the observed tyrosine range. However, predictions are less reliable at the extremes where data are sparse, and the prediction intervals can be wide, reflecting uncertainty for individual carpets. The model should therefore be used with caution for tyrosine values far outside the central range of the data.
1.3 Comment on the plot/exploratory data analysis
The distribution of the amount of money Eirwyn returns home with is right skewed. It is centred around $45, so a typical week is a small loss. There appear to be four weeks where he lost all his money at the races. The best week was a return of $134.20, a profit of $84.20.