1 Question 1 [21 Marks]

A promotions manager for a council on the West Coast was writing an article promoting how much cheaper it is to live on the West Coast of the South Island than in Auckland. For part of the article, they decided to compare rents. They took a large random sample of rents charged on the West Coast and then took a random sample of the same size from rents charged in Auckland.

The data was stored in rent_2regions.csv and includes variables:

Variable	Description
Rent	The weekly rent for a dwelling ($)
Region	The region of New Zealand (either ‘Auckland’ or ‘WestCoast’)

Instructions:

Make sure you change your name and UPI/ID number at the top of the assignment.
Comment on the plots and summary statistics of the data.
We will fit models to the log-transformed data and to the untransformed data. Why aren’t we concerned about the lack of Normality for using either an untransformed or log model?
Fit a linear model on the untransformed data, including model checks. Generate confidence intervals for the model. (DO NOT use the Welch test here.)
Interpret BOTH confidence intervals from above model (as if for an Executive Summary).
Fit a linear model on the logged data, including model checks. Generate confidence intervals for the model. (DO NOT use the Welch test here.)
Interpret BOTH confidence intervals from above model (as if for an Executive Summary).
Write the equation for each of the two models fitted (as if for Methods and Assumption Checks).
Select which model you would use to report in the story. Justify your choice.

Note: For written answers, you may need to type \$ instead of $ in RStudio (This prevents R Markdown interpreting your written answers as equations).

1.1 Question of interest/goal of the study

We were interested in quantifying the difference in rent between the Auckland and West Coast regions of New Zealand.

1.2 Read the data.

library(s20x)  # provides summaryStats(), used in the next section
rent.df=read.csv(file='rent_2regions.csv', header = TRUE, stringsAsFactors = TRUE)

1.3 Inspect the data:

stripchart(Rent~Region,main="Rent by Region",method="stack",pch=1,data=rent.df)

summaryStats(Rent~Region, data=rent.df)

##           Sample Size     Mean Median   Std Dev Midspread
## Auckland          341 379.6716    364 114.37160       191
## WestCoast         341 189.4751    192  63.13396       125

stripchart(log(Rent)~Region,main="log(Rent) by Region",method="stack",pch=1,data=rent.df)

summaryStats(log(Rent)~Region, data=rent.df)

##           Sample Size     Mean   Median   Std Dev Midspread
## Auckland          341 5.893910 5.897154 0.3029815 0.5274145
## WestCoast         341 5.185237 5.257495 0.3493017 0.7053673

1.4 Comment on plots and summary statistics

The ‘Rent by Region’ plot and summary table show that rents in Auckland are roughly double those on the West Coast (mean weekly rent of about $380 versus $189). Auckland rents are also more variable, spreading from around $200 to over $600 (standard deviation of about $114), while West Coast rents sit in a tighter band of around $100 to $300 (standard deviation of about $63). The ‘log(Rent) by Region’ plot and table tell the same story, but on the log scale the two regions have much more similar spreads (standard deviations of 0.30 and 0.35), which suits a model that assumes constant variance. Overall, both plots give consistent evidence that Auckland rents are higher.
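As a quick check of the "roughly double" reading, the sketch below recomputes three ratios from the summary tables in section 1.3. The numbers are copied by hand from that output, and the variable names are illustrative only.

```r
# Summary statistics copied from the summaryStats() output in section 1.3
mean_akl    <- 379.6716;  mean_wc    <- 189.4751
median_akl  <- 364;       median_wc  <- 192
logmean_akl <- 5.893910;  logmean_wc <- 5.185237

ratio_means   <- mean_akl / mean_wc                # ratio of mean rents
ratio_medians <- median_akl / median_wc            # ratio of median rents
ratio_geo     <- exp(logmean_akl - logmean_wc)     # ratio implied by the log-scale means

round(c(means = ratio_means, medians = ratio_medians, log_scale = ratio_geo), 2)
```

All three ratios come out close to 2 (about 2.00, 1.90 and 2.03), so the "roughly double" description holds whichever summary is used.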

1.5 We will fit models to the log-transformed data and to the untransformed data. Why aren’t we concerned about the lack of Normality for using either an untransformed or log model?

We are not concerned about a lack of Normality because both samples are large and balanced (n = 341 per region). By the Central Limit Theorem, the sampling distribution of each group mean is approximately Normal at this sample size, so t-based inference from the linear model is reliable even if the rents themselves are skewed.
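A small simulation can illustrate this argument. The sketch below does not use the rent data: it draws hypothetical right-skewed "rents" from a log-Normal distribution (the `meanlog` and `sdlog` values are assumptions chosen only to give skewed data) and compares the skewness of one raw sample with the skewness of many sample means at n = 341.

```r
set.seed(201)   # arbitrary seed for reproducibility
n <- 341        # per-region sample size from the study

# Simple moment-based skewness (0 for a symmetric distribution)
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

# Hypothetical skewed "rents": log-Normal draws, NOT the real data
draw_sample <- function() rlnorm(n, meanlog = 5.9, sdlog = 0.3)

# Sampling distribution of the mean, approximated by 5000 repeated samples
sample_means <- replicate(5000, mean(draw_sample()))

raw_skew  <- skew(draw_sample())   # skewness of a single raw sample
mean_skew <- skew(sample_means)    # skewness of the sample means
c(raw = raw_skew, means = mean_skew)
```

The sample means should show skewness much closer to zero than the raw draws, which is the behaviour the Central Limit Theorem argument relies on.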

1.6 Fit linear model on the untransformed data, including model checks. Generate confidence intervals for the model.

# Fit linear model (untransformed data)
rent.lm <- lm(Rent ~ Region, data = rent.df)

# Model summary
summary(rent.lm)

## 
## Call:
## lm(formula = Rent ~ Region, data = rent.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -185.672  -71.475   -6.975   62.525  233.328 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      379.672      5.002   75.90   <2e-16 ***
## RegionWestCoast -190.196      7.075  -26.89   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 92.38 on 680 degrees of freedom
## Multiple R-squared:  0.5152, Adjusted R-squared:  0.5145 
## F-statistic: 722.8 on 1 and 680 DF,  p-value: < 2.2e-16

# Model checks
plot(rent.lm)

# Confidence intervals
confint(rent.lm)

##                     2.5 %    97.5 %
## (Intercept)      369.8494  389.4937
## RegionWestCoast -204.0871 -176.3059

1.7 Interpret BOTH confidence intervals from above model (as if for an Executive Summary)

We estimate, with 95% confidence, that the mean weekly rent in Auckland is between $370 and $389, and that the mean weekly rent on the West Coast is between $176 and $204 lower than in Auckland. This is strong evidence of a substantial regional difference, with Auckland rents much higher on average.

1.8 Fit linear model on the logged data, including model checks. Generate confidence intervals for the model.

1.9 Interpret BOTH confidence intervals from above model (as if for an Executive Summary).
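The log-scale fit asked for in the instructions does not appear in this write-up. A minimal sketch of how it could be produced is given here. Because rent_2regions.csv is not available to this sketch, it runs on simulated stand-in rents whose log-scale means and spreads are only loosely based on the summary table in section 1.3, so none of its numbers are results of the study. The names `sim.df` and `rentlog.lm` are hypothetical.

```r
set.seed(1)  # arbitrary seed; the data below are SIMULATED, not the real rents
sim.df <- data.frame(
  Region = factor(rep(c("Auckland", "WestCoast"), each = 341)),
  Rent   = c(rlnorm(341, meanlog = 5.9, sdlog = 0.30),
             rlnorm(341, meanlog = 5.2, sdlog = 0.35))
)

# Fit the model on the log scale (Auckland is the baseline level)
rentlog.lm <- lm(log(Rent) ~ Region, data = sim.df)

# Model checks, laid out in a 2 x 2 grid
par(mfrow = c(2, 2))
plot(rentlog.lm)
par(mfrow = c(1, 1))

# Confidence intervals on the log scale, then back-transformed with exp()
log_ci <- confint(rentlog.lm)
exp_ci <- exp(log_ci)
exp_ci

# Percentage by which the West Coast median rent is below the Auckland median
pct_lower <- 100 * (1 - exp_ci["RegionWestCoast", ])
pct_lower
```

With the real data, replacing `sim.df` by `rent.df` would give the intervals to interpret: the back-transformed intercept row as the median Auckland rent, and the `RegionWestCoast` row as the West Coast median expressed as a fraction of the Auckland median.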

1.10 Write the equation for each of the two models fitted (as if for Methods and Assumption Checks).

For the untransformed data we fit an additive model:

\[ Rent_i = \beta_0 + \beta_1 I(Region_i = WestCoast) + \varepsilon_i \]

where $\beta_0$ is the mean rent in Auckland, $\beta_1$ is the additive difference in mean rent between the West Coast and Auckland (West Coast minus Auckland), and $\varepsilon_i \sim iid \; N(0, \sigma^2)$.

For the log-transformed data we fit a multiplicative model:

\[ \log(Rent_i) = \beta_0 + \beta_1 I(Region_i = WestCoast) + \varepsilon_i \]

which back-transforms to

\[ Median \; Rent_i = e^{\beta_0}\times (e^{\beta_1})^{I(Region_i = WestCoast)} . \]

Here $e^{\beta_0}$ is the median rent in Auckland, and $e^{\beta_1}$ is the multiplicative factor comparing West Coast rents to Auckland rents.
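The step from the log-scale model to medians rests on two facts, sketched here under the model's own assumption that the errors are Normal (and hence symmetric) on the log scale. First, a symmetric distribution has its median equal to its mean. Second, medians, unlike means, pass straight through an increasing transformation such as the exponential.

\[ \text{Median}(\log(Rent_i)) = E[\log(Rent_i)] = \beta_0 + \beta_1 I(Region_i = WestCoast) \]

\[ \text{Median}(Rent_i) = \exp\{\text{Median}(\log(Rent_i))\} = e^{\beta_0}\times (e^{\beta_1})^{I(Region_i = WestCoast)} \]

When $\beta_1$ is negative, $100(1 - e^{\beta_1})$ is the percentage by which the West Coast median rent is below the Auckland median.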

1.11 Select which model you would use to report in the story. Justify your choice.

I would report the log-transformed model. On the raw scale the two regions have very different spreads (standard deviations of about $114 in Auckland versus $63 on the West Coast), which conflicts with the constant-variance assumption of the linear model. On the log scale the spreads are much closer (0.30 versus 0.35), so inference from the log model is more trustworthy. The log model also reports results in terms of medians rather than means, and for money data such as rents the median gives a better sense of the “typical” rent. Finally, back-transforming the coefficients gives multiplicative differences, so we can report West Coast rents as a percentage of Auckland rents, which is clearer and more meaningful for readers of the article.

2 Question 2 [20 Marks]

The initial height of froth in a poured glass of beer and the rate at which the froth disperses give information about the properties of the beer and can even be used to tell different beers apart. It is known that the froth disperses at an exponentially decreasing rate. The initial amount of froth and the rate at which the froth disperses can be used as identifiers for brands of beer.

In the experiment, a cylindrical beer mug with diameter 7.2 cm was filled with beer immediately after opening the bottle. The temperature of the bottle was 19 degrees C. The beer froth appears while the mug is filled and reaches a maximum height after a few seconds. Once it has reached its maximum height, the amount of froth diminishes over the next few minutes. The height of the froth is measured when it reaches its maximum height (time 0), then at 15 second intervals for 2 minutes, then at 30 second intervals for a further 2 minutes. This was done four times for each of two brands of beer.

(Note: for the purposes of this question, we will treat the results of each measurement as independent. In reality, there are linked measurements for each bottle of beer, but independence between bottles of beer.)

The resulting data is in the file beer.csv, which contains the variables:

Variable	Description
Time	Time from the froth reaching its maximum height (seconds),
Brand	Beer brand: E for Erdinger Weissbier or B for Budweiser Budvar,
Height	Height of the beer froth (cm).

For this question we are particularly interested in:

Seeing if there is evidence that the two brands of beer differ with respect to initial height of froth and/or rate of decline.
Estimating the initial height of the beer froth (at time 0) for the two brands of beer.
Estimating the rate at which the beer froth declines every minute for the two brands of beer.

Instructions:

Comment on the initial plots of the data.
Comment why we should be using log(Height) instead of Height for this analysis.
Fit an appropriate model to the data. Check the model assumptions.
(Optional - no marks allocated) Plot the data (on the log scale) with your appropriate model superimposed over it.
Write appropriate Methods and Assumption Checks.
Write an appropriate Executive Summary. (Remember to address ALL of the questions the researcher is interested in.)

2.1 Question of interest/goal of the study

We are interested in quantifying the initial height and rate of decline for the froth on top of two brands of beer.

2.2 Read in and inspect the data:

beer.df=read.csv(file='beer.csv', header = TRUE, stringsAsFactors = TRUE)

plot(Height~Time,data=beer.df,ylab='Height(cm)',xlab='Time (seconds)',
     main = "Height of the beer froth by Time",
     pch  = ifelse(Brand == 'B', 'B', 'E'),col = ifelse(Brand == 'E', 'blue', 'red'))
legend('topright',c('Erdinger Weissbier','Budweiser Budvar'),col=c("blue","red"),pch=c('E', 'B'))

plot(log(Height)~Time,data=beer.df,ylab='log(Height)',xlab='Time (seconds)',
     main = "log(Height) of the beer froth by Time",
     pch  = ifelse(Brand == 'B', 'B', 'E'),col = ifelse(Brand == 'E', 'blue', 'red'))
legend('topright',c('Erdinger Weissbier','Budweiser Budvar'),col=c("blue","red"),pch=c('E', 'B'))

2.3 Comment on plots

The plots show clear differences between the two brands. In the Height vs Time plot, Erdinger Weissbier starts with more froth than Budweiser Budvar and keeps more froth throughout the experiment. Both brands show a rapid initial decrease followed by a slower decline, consistent with exponential decay. This is supported by the log(Height) vs Time plot, where the data for each brand fall close to a straight line. The two lines do not appear to be parallel: the gap between the brands widens over time, which suggests that Budvar froth declines faster than Erdinger froth as well as starting lower.

2.4 Comment why we should be using log(Height) instead of Height for this analysis.

We use log(Height) because the froth declines exponentially. Logging the response makes the relationship with time linear, stabilises the variance, and produces residuals closer to Normal. This allows valid model assumptions and interpretable results in terms of multiplicative effects (percentage rates of decline).
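The reasoning can be written out explicitly. Assume pure exponential decay with initial height $H_0$ and decay constant $k$ per second (both symbols are introduced here for illustration and are not variables in the data file):

\[ Height(t) = H_0 e^{-kt} \quad \Longrightarrow \quad \log(Height(t)) = \log(H_0) - kt \]

So log(Height) is a straight line in Time with intercept $\log(H_0)$ and slope $-k$. Over one minute (60 seconds) the height is multiplied by $e^{-60k}$, which is a decline of $100(1 - e^{-60k})$ percent per minute.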

2.5 Fit an appropriate linear Model and Check Assumptions

# Fit log-linear model: log(Height) against Time and Brand
beer.lm <- lm(log(Height) ~ Time * Brand, data = beer.df)

# Model summary
summary(beer.lm)

## 
## Call:
## lm(formula = log(Height) ~ Time * Brand, data = beer.df)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.115115 -0.034889  0.004341  0.037486  0.143747 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.631e+00  1.247e-02  210.98   <2e-16 ***
## Time        -6.430e-03  9.991e-05  -64.36   <2e-16 ***
## BrandE       1.827e-01  1.763e-02   10.36   <2e-16 ***
## Time:BrandE  2.974e-03  1.413e-04   21.05   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05229 on 100 degrees of freedom
## Multiple R-squared:  0.987,  Adjusted R-squared:  0.9866 
## F-statistic:  2524 on 3 and 100 DF,  p-value: < 2.2e-16

# Diagnostic plots to check assumptions
par(mfrow = c(2, 2))
plot(beer.lm)

par(mfrow = c(1, 1))

# Confidence intervals for coefficients
confint(beer.lm)

##                    2.5 %       97.5 %
## (Intercept)  2.606077482  2.655555690
## Time        -0.006628234 -0.006231798
## BrandE       0.147677262  0.217650015
## Time:BrandE  0.002693183  0.003253829
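The coefficients above are on the log scale and per second, so they need back-transforming before they answer the research questions. The sketch below does that arithmetic using values copied by hand from the `summary()` and `confint()` output. The variable names are illustrative. Intervals for the Erdinger-specific quantities are not computed, because they would need a refit with Erdinger as the baseline brand, which is not part of this write-up.

```r
# Estimates copied from the summary() output above
b0 <- 2.631;  b_time <- -6.430e-03;  b_brand <- 1.827e-01;  b_int <- 2.974e-03

# Initial froth height (cm) at Time = 0
init_B <- exp(b0)             # Budvar is the baseline brand
init_E <- exp(b0 + b_brand)   # Erdinger

# Percentage decline in froth height per minute (60 seconds)
rate_B <- 100 * (1 - exp(60 * b_time))
rate_E <- 100 * (1 - exp(60 * (b_time + b_int)))

# 95% interval for Budvar's per-minute decline, from the Time row of confint()
rate_B_ci <- 100 * (1 - exp(60 * c(-0.006231798, -0.006628234)))

# 95% interval for the Erdinger/Budvar ratio of initial heights (BrandE row)
ratio_init_ci <- exp(c(0.147677262, 0.217650015))

round(c(init_B = init_B, init_E = init_E, rate_B = rate_B, rate_E = rate_E), 1)
round(rate_B_ci, 1)
round(ratio_init_ci, 2)
```

On these figures Budvar starts at about 13.9 cm and loses about 32% of its froth height per minute (95% interval roughly 31.2% to 32.8%), while Erdinger starts at about 16.7 cm, between 16% and 24% higher, and loses about 19% per minute.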

2.6 (Optional - no marks allocated): Plot the data (on the log scale) with your appropriate model superimposed over it.
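No plot is given for this optional part. One way it could be drawn is sketched below, with one fitted line per brand from the interaction model. The helper `plot_log_fit()` is hypothetical, and so that the sketch runs without beer.csv it is demonstrated on simulated stand-in data (same time points and replication as the experiment, invented heights), so the picture it produces is not the study's.

```r
# Hypothetical helper: plots log(Height) against Time with one fitted line per brand
plot_log_fit <- function(df) {
  fit  <- lm(log(Height) ~ Time * Brand, data = df)
  cols <- c(B = "red", E = "blue")   # same colours as the plots in section 2.2
  plot(df$Time, log(df$Height), xlab = "Time (seconds)", ylab = "log(Height)",
       main = "log(Height) of the beer froth by Time, with fitted lines",
       pch = as.character(df$Brand), col = cols[as.character(df$Brand)])
  times <- seq(min(df$Time), max(df$Time), length.out = 50)
  for (b in levels(df$Brand)) {
    new.df <- data.frame(Time = times,
                         Brand = factor(b, levels = levels(df$Brand)))
    lines(times, predict(fit, newdata = new.df), col = cols[b])
  }
  invisible(fit)   # return the fitted model without printing it
}

# SIMULATED stand-in data (not the real measurements)
set.seed(2)
times  <- c(seq(0, 120, by = 15), seq(150, 240, by = 30))
sim.df <- expand.grid(Time = times, Rep = 1:4, Brand = c("B", "E"))
sim.df$Brand  <- factor(sim.df$Brand)
is_E          <- sim.df$Brand == "E"
sim.df$Height <- exp(2.6 - 0.006 * sim.df$Time +
                     is_E * (0.2 + 0.003 * sim.df$Time) +
                     rnorm(nrow(sim.df), sd = 0.05))

fit <- plot_log_fit(sim.df)
```

With the real data loaded as in section 2.2, the call would be `plot_log_fit(beer.df)`.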

2.7 Method and Assumption Checks

We fitted a linear regression with log(Height) as the response and Time, Brand, and their interaction as predictors. This allows both the initial froth height (intercepts) and the rate of decline (slopes) to differ between brands. The interaction term was highly significant (P-value < 2e-16), so it was kept in the model. The model assumes independent errors that are Normally distributed with mean zero and constant variance.

Diagnostic plots supported these assumptions. On the log scale, Height vs Time was approximately linear, the residuals showed roughly constant spread (equal variance), and the Q–Q plot indicated the residuals were close to Normal. Individual measurements are treated as independent, as specified in the question. The model explained 99% of the variation in log(Height).
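In the notation used for the rent models in section 1.10, the fitted model can be written as follows (the coefficient labels $\beta_2$ and $\beta_3$ are introduced here for illustration):

\[ \log(Height_i) = \beta_0 + \beta_1 Time_i + \beta_2 I(Brand_i = E) + \beta_3 Time_i \times I(Brand_i = E) + \varepsilon_i \]

where $\varepsilon_i \sim iid \; N(0, \sigma^2)$. Budvar (B) is the baseline brand, so its line has intercept $\beta_0$ and slope $\beta_1$, while the Erdinger (E) line has intercept $\beta_0 + \beta_2$ and slope $\beta_1 + \beta_3$.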

2.8 Executive Summary (Remember to answer ALL the questions asked.)

We compared froth decline in Erdinger Weissbier and Budweiser Budvar by modelling the log of froth height against time. There is very strong evidence that the two brands differ in both initial froth height and rate of decline.

We estimate, with 95% confidence, that the initial froth height of Budweiser Budvar is between 13.5 and 14.2 cm, and that the initial froth height of Erdinger Weissbier is between 16% and 24% higher than this (an estimated 16.7 cm).

We estimate that the froth height of Budweiser Budvar declines by between 31.2% and 32.8% per minute. Erdinger Weissbier froth declines more slowly, by an estimated 18.7% per minute.

In conclusion, Erdinger produces more froth initially and also loses it more slowly than Budvar, so the two brands can be told apart on both characteristics.

STATS 201/8 Assignment 3

SKY RASMUSSEN 581634184

Due Date: 3pm, Thursday 28th August