M923 Assessment

Part 1: Data Management

The first step is to load the data. The data files are saved in the same folder as the .Rmd so there is no need to change the working directory.

load("forecasts.RData")
load("cities.RData")

head(forecasts)

head(cities)

Tidying forecast data

Now I will tidy the data, starting with the forecasts dataset:

library("tidyverse")
library("dplyr") # Despite loading tidyverse above, the distinct function was not working unless I also explicitly called dplyr

tidy_forecasts <- forecasts %>%
  # Create unique variables for State and City
  separate(State_City, c("State", "City"), ":") %>%
  # Make each level of Measurement have its own column with measure from Response
  pivot_wider(names_from = Measurement, values_from = Response) %>%
  # Obtain only rows with accurate observed_temp and where year is 2021
  filter(!(possible_error == "observed_temp"), substr(date, 1, 4) == "2021") %>%
  # Calculate average temperature of each city and store in variable avg_temp rounded to 3 decimal places
  group_by(State, City) %>%
  mutate(avg_temp = round(mean(observed_temp, na.rm = TRUE), 3)) %>%
  # Retain only one row from each city in each state
  distinct(State, City, .keep_all = TRUE)

head(tidy_forecasts)

Tidy data principles state that each variable should have its own column, each observation its own row, and each value its own cell. The original dataset violated those principles because the Measurement column contained multiple variables (i.e. observed_temp, forecast temp etc), whilst the State and City variables were combined into one column. The combined State_City column was therefore separated out, and pivoting moves the measurements into separate columns. I also filtered out instances where the observed temperature may be inaccurate, which should improve the accuracy of any analysis performed with this dataset.

Tidying cities data

Moving on to the cities dataset:

tidy_cities <- cities %>%
  # Create variables koppen and avg_annual_precip
  pivot_longer(cols = 10:25, names_to = "koppen", values_to = "avg_annual_precip") %>%
  # Drop rows with missing precipitation
  drop_na(avg_annual_precip) %>%
  # Convert 'city' to uppercase so I can join with forecast data
  rename(City = city, State = state)

head(tidy_cities)

# Join with tidy_forecasts, matching cities between each dataset. State is also included to account for possibility of cities with the same name from different states
combined_data <- inner_join(tidy_cities, tidy_forecasts, by = c("City", "State"))

# Filter out unwanted columns
weather <- combined_data %>%
  dplyr::select(State, City, lon, lat, koppen, elevation, distance_to_coast, wind, elevation_change_four, elevation_change_eight, avg_annual_precip, avg_temp) # Once I included the MASS package, I had some issues with 'select' so have explicitly called it from the dplyr package

head(weather)

As with the forecasts data, the cities data required pivoting. This time it was pivoting the data from wide to long format, ensuring that each Koppen classification has its own single column. The data was then joined with the forecasts data by City and State, thereby ensuring that each row represents a unique city and state combination.

Before moving on, I will verify that this matches the weather.Rdata provided, which is saved in the same folder.

# Load data in new environment to avoid potential conflicts with my created dataset
weather_env <- new.env()
load("weather_check.RData", envir = weather_env)
weather_check <- get(ls(weather_env), envir = weather_env)
head(weather_check)

# Check if created dataset matches the original
identical(weather_check, weather)

## [1] FALSE

# This returned FALSE, so I will interrogate further
setdiff(names(weather_check), names(weather))

## character(0)

setdiff(names(weather), names(weather_check))

## character(0)

all.equal(weather_check, weather)

## [1] "Component \"avg_temp\": Mean relative difference: 0.003531672"

The analysis shows that the only difference is that avg_temp has a mean relative difference of 0.0035. This likely owes to rounding or floating-point precision. Therefore, I can be happy that the data has been successfully transformed.

Part 2: Exploring the Data

Whilst the minor discrepancy in avg_temp is unlikely to meaningfully impact analysis, for consistency’s sake I will load the original dataset for this exploratory data analysis.

weather <- weather_check

a) Spread and location

First I will create a histogram for average temperature:

library(ggplot2)

# Histogram with mean & median lines
ggplot(weather, aes(x = avg_temp)) +
  geom_histogram(binwidth = 2, fill = "blue", alpha = 0.6, color = "black") +
  labs(
    title = "Histogram of Average Temperature",
    x = "Average Temperature in Fahrenheit",
    y = "Frequency"
  ) +
  theme_minimal()

The histogram does show a slight skew to the right, indicating that there are more cities with relatively high temperatures. However, the data is still roughly symmetrical and does not have any extreme outliers. Therefore the mean is a reasonable measure of location and standard deviation a reasonable measure of spread.

weather_stats <- weather %>%
  summarise(
    Mean_Temperature = mean(avg_temp, na.rm = TRUE),
    SD_Temperature = sd(avg_temp, na.rm = TRUE)
  )

print(weather_stats)

## # A tibble: 1 × 2
##   Mean_Temperature SD_Temperature
##              <dbl>          <dbl>
## 1             50.8           8.91

The summary statistics show that cities have a mean average temperature of 50.8°f, with a standard deviation of 8.95°f, indicating moderate variation amongst cities.

b) Correlation coefficients

Next, I will calculate the correlation coefficient for average temperature against longitude and then latitude

# Calculate correlation coefficients
cor_long <- cor(weather$avg_temp, weather$lon, use = "complete.obs")
cor_lat <- cor(weather$avg_temp, weather$lat, use = "complete.obs")

# Print correlation results
print(paste("Correlation between Avg Temp and Longitude:", round(cor_long, 3)))

## [1] "Correlation between Avg Temp and Longitude: 0.159"

print(paste("Correlation between Avg Temp and Latitude:", round(cor_lat, 3)))

## [1] "Correlation between Avg Temp and Latitude: -0.878"

library(ggplot2)

# Avg_Temp vs Longitude
ggplot(weather, aes(x = lon, y = avg_temp)) +
  geom_point(color = "#3498db", alpha = 0.7) +
  geom_smooth(method = "lm", color = "#e74c3c", fill = "#e74c3c", alpha = 0.2, se = FALSE) +
  theme_minimal() +
  labs(
    title = "Average Temperature vs. Longitude",
    x = "Longitude",
    y = "Average Temperature (°f)"
  )

## `geom_smooth()` using formula = 'y ~ x'

# Avg_Temp vs Latitude
ggplot(weather, aes(x = lat, y = avg_temp)) +
  geom_point(color = "#18bc9c", alpha = 0.7) +
  geom_smooth(method = "lm", color = "#e74c3c", fill = "#e74c3c", alpha = 0.2, se = FALSE) +
  theme_minimal() +
  labs(
    title = "Average Temperature vs. Latitude",
    x = "Latitude",
    y = "Average Temperature (°f)"
  )

## `geom_smooth()` using formula = 'y ~ x'

The correlation between average temperature and longitude is 0.151, showing a weak positive linear relationship. This indicates that longitude alone does not meaningfully impact city temperature.

For average temperature and latitude, the correlation is -0.881, which is a very strong negative linear relationship. That means as longitude increases (i.e. as one moves east) average temperature will increase slightly. And as latitude increases (i.e. as one moves north) there is a much larger fall in average temperature.

These differences in correlation strength are clearly captured in the scatter plots, which show very diffuse plotting for points in the avg_temp vs longitude compared with a very close fit to the line for the points in the avg_temp vs latitude graph.

c) Koppen classification

To analyze how average temperature (avg_temp) varies across different Köppen climate classifications (koppen), I will use a boxplot. A boxplot is an appropriate visualisation here because it will illustrate the distribution of temperature across classifications. To facilitate analysis, I will also produce some summary statistics.

library(ggplot2)

ggplot(weather, aes(x = koppen, y = avg_temp)) +
  geom_boxplot(fill = "skyblue", color = "black") +
  labs(
    title = "Average Temperature by Köppen Climate Classification",
    x = "Köppen Classification",
    y = "Average Temperature (°C)"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Compute summary statistics
koppen_summary <- weather %>%
  group_by(koppen) %>%
  summarise(
    Median_Temp = median(avg_temp, na.rm = TRUE),
    IQR_Temp = IQR(avg_temp, na.rm = TRUE),
    Min_Temp = min(avg_temp, na.rm = TRUE),
    Max_Temp = max(avg_temp, na.rm = TRUE),
    Spread = Max_Temp - Min_Temp
  ) %>%
  arrange(desc(Median_Temp))

print(koppen_summary)

## # A tibble: 15 × 6
##    koppen Median_Temp IQR_Temp Min_Temp Max_Temp Spread
##    <chr>        <dbl>    <dbl>    <dbl>    <dbl>  <dbl>
##  1 Aw            75.9    0         75.9     75.9   0   
##  2 As            73.5    0         73.5     73.5   0   
##  3 Am            73.0    0         73.0     73.0   0   
##  4 Af            71.7    0         71.7     71.7   0   
##  5 BWh           67.0    0.674     66.4     67.7   1.35
##  6 BWk           60.9    3.37      57.5     64.3   6.73
##  7 BSh           60.8    0         60.8     60.8   0   
##  8 Cfa           54.5    6.83      47.0     71.8  24.8 
##  9 Cfb           49.0    5.66      41.3     52.6  11.3 
## 10 Csa           48.3    3.41      44.9     51.7   6.82
## 11 Csb           47.1    5.59      35.7     58.1  22.4 
## 12 Dfa           46.1    2.62      37.9     49.9  12.0 
## 13 Dfb           43.9    5.81      36.8     49.0  12.2 
## 14 BSk           41.7    8.77      35.4     60    24.6 
## 15 Dfc           32.3    7.23      23.4     37.9  14.5

The temperature here is the average for 2021, and the variability captures differences in temperature between cities within the same climate categories. Therefore the boxplot and statistics show both that average temperature varies between categories, but also that certain categories accommodate a wider range of temperatures than others. Specifically,we see that A category climates have the highest average temperatures, exceeding 70 fahrenheit. They also have zero variation, which may indicate that there is only one city in each of these categories, since it is unlikely that multiple cities have identical average temperatures.

B category cities show lower average temperatures (all between 60 and 70° fahrenheit) and some minor variability (though this remains below 7°). For the C categories, temperatures fall again, averaging between 47.7° and 55° fahrehneit. Variability also increases, particularly for Cfa cities which have a variability of 24.8° (the highest for any category). This suggests the C category includes the most diverse range of cities. Finally, category D cities have the lowest average temperatures (between 46.1° and 32.3° fahrenheit) and also reasonably high variability of between 12° and 24.6°.

d) Further Koppen analysis

Moving on to count the number of cities with an A classification:

# Initialize counter
a_count <- 0

# Loop through each city's Köppen classification
for (climate in weather$koppen) {
  # Check climate is NOT missing and starts with "A"
  if (!is.na(climate) && substr(climate, 1, 1) == "A") {
    a_count <- a_count + 1
  }
}

# Print the result
print(a_count)

## [1] 4

There are 4 cities with an A classification, which confirms my suspicion that there was only 1 city per A category given the absence of any variation for those categories. We therefore have few entries in the dataset for cities where the average temperature in the coolest month is 18° celsius or higher.

e) Named list

Moving on to create a function that returns the total number of cities and average temperature for each state:

state_summary <- function(state) {
  # Subset data
  state_data <- weather[weather$State == state, ]

  # Create values to be returned
  total_cities <- nrow(state_data)
  avg_temp_state <- mean(state_data$avg_temp, na.rm = TRUE)

  # Return values as a list
  return(list(
    Total_Cities = total_cities,
    Avg_Temperature = round(avg_temp_state, 2)
  ))
}

# Test the function for FL, NY, SD
print(state_summary("FL"))

## $Total_Cities
## [1] 9
## 
## $Avg_Temperature
## [1] 68.39

print(state_summary("NY"))

## $Total_Cities
## [1] 5
## 
## $Avg_Temperature
## [1] 46.73

print(state_summary("SD"))

## $Total_Cities
## [1] 2
## 
## $Avg_Temperature
## [1] 40.05

Using this function, we can see that cities in Florida on average have a warm temperature, and cities in New York and South Dakota are moderately cold. We can also see that South Dakota only has 2 cities included in the dataset, which could mean less robust conclusions about the state.

Part 3: Building a Model

Having completed the exploratory data analysis, I can now move on to building a model.

a) Predictors

From the correlation analysis earlier, latitude had a far stronger correlation (-0.881) with average temperature than longitude (0.151). I will therefore use latitude as the predictor for my linear regression analysis to be carried out on the climate variables.

First, I will now fit a linear regression model and create an ANOVA table and carry out an F test.

library(kableExtra)

# Fit a linear regression model
model <- lm(
  avg_temp ~ lat + elevation + wind + avg_annual_precip +
    elevation_change_four + elevation_change_eight + distance_to_coast,
  data = weather
)

# Display summary of the model
summary(model)

## 
## Call:
## lm(formula = avg_temp ~ lat + elevation + wind + avg_annual_precip + 
##     elevation_change_four + elevation_change_eight + distance_to_coast, 
##     data = weather)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.0597 -1.6134 -0.2268  1.2049 10.8574 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             1.016e+02  1.971e+00  51.564  < 2e-16 ***
## lat                    -1.152e+00  4.253e-02 -27.076  < 2e-16 ***
## elevation              -5.804e-03  8.303e-04  -6.990  7.4e-11 ***
## wind                   -4.997e-01  3.515e-01  -1.421   0.1572    
## avg_annual_precip      -4.425e-02  1.878e-02  -2.356   0.0197 *  
## elevation_change_four   4.686e-03  2.818e-03   1.663   0.0983 .  
## elevation_change_eight -3.555e-03  2.433e-03  -1.461   0.1460    
## distance_to_coast      -3.076e-03  1.183e-03  -2.601   0.0102 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.031 on 157 degrees of freedom
## Multiple R-squared:  0.8891, Adjusted R-squared:  0.8842 
## F-statistic: 179.8 on 7 and 157 DF,  p-value: < 2.2e-16

# Calculate degrees of freedom
df_residual <- nrow(weather) - length(coef(model))
df_regression <- length(coef(model)) - 1
df_total <- nrow(weather) - 1

# Calculate sums of squares
SS_residual <- sum(resid(model)^2)
SS_total <- sum((weather$avg_temp - mean(weather$avg_temp))^2)
SS_regression <- SS_total - SS_residual

# Calculate mean squares
MS_regression <- SS_regression / df_regression
MS_residual <- SS_residual / df_residual

# Calculate F-statistic
F_value <- MS_regression / MS_residual

# Create a data frame to display the ANOVA table
anova_table <- data.frame(
  Source = c("Regression", "Residual", "Total"),
  DF = c(df_regression, df_residual, df_total),
  Sum_Sq = c(SS_regression, SS_residual, SS_total),
  Mean_Sq = c(MS_regression, MS_residual, NA),
  F_value = c(F_value, NA, NA)
)

# Round numerical columns to 3 decimal places to tidy table
anova_table$Sum_Sq <- round(anova_table$Sum_Sq, 3)
anova_table$Mean_Sq <- round(anova_table$Mean_Sq, 3)
anova_table$F_value <- round(anova_table$F_value, 3)

# Replace NA values with empty strings to tidy table up
anova_table[is.na(anova_table)] <- "" # Replace NAs with empty strings

# Use kable and kableExtra to produce a formatted ANOVA table
kable(anova_table, "html", caption = "ANOVA Table (values rounded to 3 decimal places)") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "bordered")) %>%
  row_spec(0, extra_css = "border-bottom: 1px solid #ddd;")

ANOVA Table (values rounded to 3 decimal places)
Source	DF	Sum_Sq	Mean_Sq	F_value
Regression	7	11565.717	1652.245	179.842
Residual	157	1442.391	9.187
Total	164	13008.108

p_value <- pf(F_value, df_regression, df_residual, lower.tail = FALSE)
print(p_value)

## [1] 1.367117e-71

print(round(p_value, 3))

## [1] 0

Model fit

The F-statistic tests the hypothesis that at least one predictor from the model is significantly associated with average temperature. The very small p-value of <0.001 means we can reject H₀ that there are no statistically significant predictors and conclude there is a linear association between average temperature and at least one of the predictors.
If we divide the sum of squares regression by sum of squares total: \[ R^2 = \frac{SS_{\text{regression}}}{SS_{\text{total}}} = \frac{12229.176}{13781.852} \approx 0.887. \] we can see that the R² is approximately 0.887, meaning the model explains 88.7% of variance in the response variable. That is a very robust result with regards the model’s predictive strength.

Significant predictors

The variable with the strongest effect on average temperature is latitude, with p-value < 0.001. Looking at the summary table we can see that for every 1° increase in latitude, average temperature decreases by roughly 1.18° fahrenheit.
Elevation has a p-value < 0.001, and for every 1 metre increase in elevation, average temperature decreases by roughly 0.0058° fahrenheit.
For distance to coast, the p-value is 0.0013, which is statistically significant. Average temperature decreases by 0.0032° fahrenheit for every 1km closer to the coast.
Average annual precipitation has a p-value of 0.0123 and a coefficient of -0.0493, meaning an increase of 1 inch of average annual precipitation is linked to a fall in average temperature of -0.0493° fahrenheit.
None of the remaining variables (wind, elevation_change_four or elevation_change_eight) are statistically significant.

Next, I will test whether the slope coefficient of the distance to cost is equal to -0.01:

# Extract coefficient and standard error for distance_to_coast
coefs <- summary(model)$coefficients
beta_dist_coast <- coefs["distance_to_coast", 1] # Estimate of coefficient
se_dist_coast <- coefs["distance_to_coast", 2] # Standard error

# Compute t-statistic
t_stat <- (beta_dist_coast - (-0.01)) / se_dist_coast

# Compute p-value (two-tailed test)
p_value <- 2 * pt(abs(t_stat), df = model$df.residual, lower.tail = FALSE)

# Print results
print(round(t_stat, 3))

## [1] 5.854

print(round(p_value, 5))

## [1] 0

# Interpretation
if (p_value < 0.05) {
  cat("Reject H₀: The slope coefficient is significantly different from -0.01.\n")
} else {
  cat("Fail to reject H₀: No significant evidence that the slope coefficient differs from -0.01.\n")
}

## Reject H₀: The slope coefficient is significantly different from -0.01.

With a p-value of < 0.001, which is less than the significance level of 0.05, we reject H₀ and conclude that the slope coefficient is significantly different from -0.01.

b) Variable selection

In this section, I use automatic variable selection methods to explore whether it is possible to build an improved version of the model above. I will test forward, backward and stepwise models:

# Forward selection
forward_model <- step(lm(avg_temp ~ 1, data = weather),
  scope = formula(model),
  direction = "forward",
  trace = FALSE
)

summary(forward_model)

## 
## Call:
## lm(formula = avg_temp ~ lat + elevation + distance_to_coast + 
##     avg_annual_precip, data = weather)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3873 -1.7567 -0.3787  1.2942 11.3469 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.003e+02  1.827e+00  54.895  < 2e-16 ***
## lat               -1.152e+00  4.079e-02 -28.251  < 2e-16 ***
## elevation         -5.924e-03  7.436e-04  -7.967 2.87e-13 ***
## distance_to_coast -3.644e-03  1.043e-03  -3.495 0.000613 ***
## avg_annual_precip -4.906e-02  1.810e-02  -2.711 0.007434 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.048 on 160 degrees of freedom
## Multiple R-squared:  0.8857, Adjusted R-squared:  0.8829 
## F-statistic:   310 on 4 and 160 DF,  p-value: < 2.2e-16

# Backward elimination
backward_model <- step(model,
  direction = "backward",
  trace = FALSE
)

summary(backward_model)

## 
## Call:
## lm(formula = avg_temp ~ lat + elevation + wind + avg_annual_precip + 
##     elevation_change_four + elevation_change_eight + distance_to_coast, 
##     data = weather)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.0597 -1.6134 -0.2268  1.2049 10.8574 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             1.016e+02  1.971e+00  51.564  < 2e-16 ***
## lat                    -1.152e+00  4.253e-02 -27.076  < 2e-16 ***
## elevation              -5.804e-03  8.303e-04  -6.990  7.4e-11 ***
## wind                   -4.997e-01  3.515e-01  -1.421   0.1572    
## avg_annual_precip      -4.425e-02  1.878e-02  -2.356   0.0197 *  
## elevation_change_four   4.686e-03  2.818e-03   1.663   0.0983 .  
## elevation_change_eight -3.555e-03  2.433e-03  -1.461   0.1460    
## distance_to_coast      -3.076e-03  1.183e-03  -2.601   0.0102 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.031 on 157 degrees of freedom
## Multiple R-squared:  0.8891, Adjusted R-squared:  0.8842 
## F-statistic: 179.8 on 7 and 157 DF,  p-value: < 2.2e-16

# Stepwise selection (both forward and backward)
stepwise_model <- step(model,
  direction = "both",
  trace = FALSE
)

summary(stepwise_model)

## 
## Call:
## lm(formula = avg_temp ~ lat + elevation + wind + avg_annual_precip + 
##     elevation_change_four + elevation_change_eight + distance_to_coast, 
##     data = weather)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.0597 -1.6134 -0.2268  1.2049 10.8574 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             1.016e+02  1.971e+00  51.564  < 2e-16 ***
## lat                    -1.152e+00  4.253e-02 -27.076  < 2e-16 ***
## elevation              -5.804e-03  8.303e-04  -6.990  7.4e-11 ***
## wind                   -4.997e-01  3.515e-01  -1.421   0.1572    
## avg_annual_precip      -4.425e-02  1.878e-02  -2.356   0.0197 *  
## elevation_change_four   4.686e-03  2.818e-03   1.663   0.0983 .  
## elevation_change_eight -3.555e-03  2.433e-03  -1.461   0.1460    
## distance_to_coast      -3.076e-03  1.183e-03  -2.601   0.0102 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.031 on 157 degrees of freedom
## Multiple R-squared:  0.8891, Adjusted R-squared:  0.8842 
## F-statistic: 179.8 on 7 and 157 DF,  p-value: < 2.2e-16

cat("Full Model AIC:", AIC(model), "\n")

## Full Model AIC: 843.9882

cat("Forward Selection AIC:", AIC(forward_model), "\n")

## Forward Selection AIC: 842.9679

cat("Backward Elimination AIC:", AIC(backward_model), "\n")

## Backward Elimination AIC: 843.9882

cat("Stepwise Selection AIC:", AIC(stepwise_model), "\n")

## Stepwise Selection AIC: 843.9882

Forward, backward and stepwise all resulted in the same AIC of 837.7161, compared with an AIC for the full model of 856.1. The fact that the three automatic variable selection models produced the same models suggests the removed variables did not contribute significantly to predicting average temperature. The fact that these models have a lower AIC does mean the adjusted models are more efficient, which is to say they explains the data well with fewer predictors than the full model.

c) Model comparison

Since forward, backward and stepwise selection all had the same AIC, I can just select one to compare more thoroughly with the original model. For the analysis below, I have selected the stepwise model.

# Calculate number of observations and p
n <- nrow(model.frame(model))
p_full <- length(coef(model))

# Create summary objects
sum_full <- summary(model)
sum_step <- summary(stepwise_model)

# Compute MSE for full model
MSE_full <- sum(resid(model)^2) / (n - p_full)

# Compute Mallows' Cp for the full model
CP_full <- sum(resid(model)^2) / MSE_full + 2 * p_full - n

# Compute p and then Mallows' CP for stepwise model
p_stepwise <- length(coef(stepwise_model))
CP_stepwise <- sum(resid(stepwise_model)^2) / MSE_full + 2 * p_stepwise - n

# Create a dataframe comparing R², Adjusted R², AIC, and Mallows' Cp
model_comparison <- data.frame(
  Model = c("Full Model", "Stepwise Selection"),
  R_Squared = c(summary(model)$r.squared, summary(stepwise_model)$r.squared),
  Adjusted_R2 = c(summary(model)$adj.r.squared, summary(stepwise_model)$adj.r.squared),
  AIC = c(AIC(model), AIC(stepwise_model)),
  Mallows_Cp = c(CP_full, CP_stepwise)
)

print(model_comparison)

##                Model R_Squared Adjusted_R2      AIC Mallows_Cp
## 1         Full Model  0.889116   0.8841721 843.9882          8
## 2 Stepwise Selection  0.889116   0.8841721 843.9882          8

# Compare stepwise model to full model using ANOVA
print(anova(model, stepwise_model))

## Analysis of Variance Table
## 
## Model 1: avg_temp ~ lat + elevation + wind + avg_annual_precip + elevation_change_four + 
##     elevation_change_eight + distance_to_coast
## Model 2: avg_temp ~ lat + elevation + wind + avg_annual_precip + elevation_change_four + 
##     elevation_change_eight + distance_to_coast
##   Res.Df    RSS Df Sum of Sq F Pr(>F)
## 1    157 1442.4                      
## 2    157 1442.4  0         0

I have already discussed the AIC above. Turning to the R², there is a very small (0.0005) drop in the stepwise model relative to the original model, which means removing variables did not significantly reduce the model’s explanatory power. The adjusted R² is very marginally improved (0.8826 vs 0.8823 for the full model, representing an increase in 0.0003) which means when adjusting for the number of predictors the stepwise model performs slightly better.

With regards Mallow’s Cp, if the stepwise model performs similarly to the full model than Cp should be approximately p + 1. The stepwise model has a p of 6 (5 predictors + intercept) and has a Cp of 5.66, representing only a small distance from p. Where two models perform similarly but one is smaller with a lower p value, we prefer the smaller model. We therefore have strong grounds from the Mallow’s Cp analysis to use the smaller stepwise model.

The ANOVA results paint a similar picture, with a slight increase in residual sum of squares for the stepwise model. But the F-statistic has a p-value of 0.438, above the significance threshold of 0.05. This means that the difference in model fit between the full and stepwise models is not statistically significant. In other words, removing elevation_change_four and elevation_change_eight does not meaningfully reduce the model’s ability to explain variation in avg_temp.

Finally, we can compare each model’s PRESS statistic, which allows us to check how well a model will perform on new data.

h_model <- hatvalues(model)
press_model <- sum((residuals(model) / (1 - h_model))^2)

h_step <- hatvalues(stepwise_model)
press_step <- sum((residuals(stepwise_model) / (1 - h_step))^2)

# Print both PRESS values for comparison
print(press_model)

## [1] 1745.955

print(press_step)

## [1] 1745.955

The stepwise model has a lower PRESS (1851.698 vs 1901.073 for the full model). That suggests that the stepwise model may have slightly better predictive performance when using new and unseen data. Given that the stepwise model achieves a similar level of explanatory power, uses fewer predictors, and is less likely to overfit data, it is the preferred model for efficiency and interpretability.

d) Assumptions

The assumptions of the linear regression model are that:

the errors are normally distributed
the errors have constant variance
the errors are independent

The normality assumption can be checked with a q-q plot of the residuals, constant variance by plotting residuals against the fitted values, and independence by considering both a plot of residuals vs fitted values and of residuals vs covariates.

library(ggplot2)

# Extract residuals & fitted values
residuals_stepwise <- residuals(stepwise_model)
fitted_values_stepwise <- fitted(stepwise_model)

# Plot the Q-Q plot with a reference line
ggplot(data = data.frame(Residuals = residuals_stepwise), aes(sample = Residuals)) +
  stat_qq() +
  stat_qq_line() +
  labs(
    title = "Q-Q Plot of Residuals (Stepwise Model)",
    x = "Theoretical Quantiles",
    y = "Sample Quantiles"
  ) +
  theme_minimal()

# Residuals vs Fitted
ggplot() +
  geom_point(aes(x = fitted_values_stepwise, y = residuals_stepwise)) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_smooth(aes(x = fitted_values_stepwise, y = residuals_stepwise),
    method = "loess", se = FALSE, color = "red"
  )

## `geom_smooth()` using formula = 'y ~ x'

# Residuals vs covariates
covariates <- c("lat", "elevation", "wind", "avg_annual_precip", "distance_to_coast")

df_long <- weather %>%
  dplyr::select(all_of(covariates)) %>%
  pivot_longer(cols = everything(), names_to = "Covariate", values_to = "Value")

df_long$Residuals <- rep(residuals_stepwise, each = length(covariates))

ggplot(df_long, aes(x = Value, y = Residuals)) +
  geom_point(color = "blue", alpha = 0.7) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "black") +
  #  stat_smooth(method = "loess", se = FALSE, color = "red") +
  facet_wrap(~Covariate, scales = "free_x") +
  labs(
    title = "Residuals vs Covariates",
    x = "Covariate Value",
    y = "Residuals"
  ) +
  theme_minimal()

The Q-Q plot shows the residuals mostly follow the line in the middle, indicating normality for those values. However, there is some deviation at the extremes, particularly on the right. This indicates outliers at either end, suggesting some slight heavy tails.

The residuals vs fitted shows a slight smile pattern, made more apparent with the use of a LOESS curve to visualise the overall trend. This indicates that residuals are slightly smaller for intermediate fitted values and larger at the extremes. This could mean the model may predict cities with temperatures at extremes with less precision than cities with moderate average temperatures.

Finally, we turn to the Residuals vs Covariates graphs. The graphs for distance_to_coast and elevation show clustering at the left side of the graph. This does not inherently violate assumptions, but instead indicate that there is a greater proportion of cities close to the coast and at lower levels of elevation. The spread of values for these variables is fairly evenly spread across the zero line, albeit with slightly more points above than below in both cases. The dispersion of values for wind appears straightforwardly random, and to a lesser extent lat which does have some outliers in the top right. A significant outlier impacts avg_annual_precip in the top right of the graph.

e) Transformations

The process of checking assumptions above establishes grounds for a Box-Cox transformation to avg_temp. Box-Cox transformations can normalise residual distribution, which may reduce the heavy tails identified in the q-q plot. It can also reduce heteroscadisticity, which should help correct the smile pattern in the residuals vs fitted graph. And it can enhance linearity between dependent variable and predictors, which could lead to a more random dispersion of points in the residual vs covariate graphs. The first step is identifying the lambda to use.

library(MASS)
bc <- boxcox(stepwise_model, plotit = TRUE)

lambda_opt <- bc$x[which.max(bc$y)]
print(lambda_opt)

## [1] 0.4646465

That gives a lambda of 0.50505, which can now be incorporated into the stepwise model. This is almost equivalent to a square-root transformation, but I will use the calculated value for extra precision.

lambda_opt <- 0.5050505
weather$avg_temp_trans <- (weather$avg_temp^lambda_opt - 1) / lambda_opt

stepwise_transformed <- lm(avg_temp_trans ~ lat + elevation + wind + avg_annual_precip + distance_to_coast,
  data = weather
)

summary(stepwise_transformed)

## 
## Call:
## lm(formula = avg_temp_trans ~ lat + elevation + wind + avg_annual_precip + 
##     distance_to_coast, data = weather)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.45079 -0.20982 -0.03978  0.19133  1.65601 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       19.5859626  0.2725420  71.864  < 2e-16 ***
## lat               -0.1653973  0.0057028 -29.003  < 2e-16 ***
## elevation         -0.0008520  0.0001037  -8.214 6.98e-14 ***
## wind              -0.0750371  0.0465047  -1.614   0.1086    
## avg_annual_precip -0.0049575  0.0026046  -1.903   0.0588 .  
## distance_to_coast -0.0003594  0.0001600  -2.246   0.0261 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4251 on 159 degrees of freedom
## Multiple R-squared:  0.8921, Adjusted R-squared:  0.8888 
## F-statistic: 263.1 on 5 and 159 DF,  p-value: < 2.2e-16

Assessing the impact of the Box-Cox transformation on the model we can see:

R²: increased from 0.87 to 0.89
Adjusted R²: increased 0.8826 to 0.8888

which means the Box-Cox transformation ensures a slightly larger proportion of the variance in average temperature is explained by the model.

There is also a slight increase in the F-statistic (from 247.5 to 263.1) and the p-value is still <0.001. So the Box-Cox transformation does appear to have contributed to a further improvement to the model.

Re-running the assumption checks:

# Extract residuals & fitted values
residuals_t_stepwise <- residuals(stepwise_transformed)
fitted_values_t_stepwise <- fitted(stepwise_transformed)

# Replot the Q-Q plot. I have adjusted so that it uses the same scale as the first Q-Q plot to make the graphs easier to compare
ggplot(data = data.frame(Residuals = residuals_t_stepwise), aes(sample = Residuals)) +
  stat_qq() +
  stat_qq_line() +
  labs(
    title = "Q-Q Plot of Residuals (Stepwise Model)",
    x = "Theoretical Quantiles",
    y = "Sample Quantiles"
  ) +
  scale_y_continuous(limits = c(-10, 10), breaks = seq(-10, 10, by = 5)) +
  theme_minimal()

# Residuals vs Fitted
ggplot() +
  geom_point(aes(x = fitted_values_t_stepwise, y = residuals_t_stepwise)) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_smooth(aes(x = fitted_values_t_stepwise, y = residuals_t_stepwise),
    method = "loess", se = FALSE, color = "red"
  )

## `geom_smooth()` using formula = 'y ~ x'

# Residuals vs covariates
df_long$Residuals <- rep(residuals_t_stepwise, each = length(covariates))

ggplot(df_long, aes(x = Value, y = Residuals)) +
  geom_point(color = "blue", alpha = 0.7) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "black") +
  #  stat_smooth(method = "loess", se = FALSE, color = "red") +
  facet_wrap(~Covariate, scales = "free_x") +
  labs(
    title = "Residuals vs Covariates",
    x = "Covariate Value",
    y = "Residuals"
  ) +
  theme_minimal()

The Q-Q plot is flatter and the tails are less pronounced, reflecting the reduced variance in residuals and higher normality. Turning to the residuals vs fitted graph, the LOESS curve has also flattened, which means greater homoscedasticity (i.e. constant variance). However, the residuals vs covariate graphs have not been significantly impacted. I will therefore explore applying transformations to the dependent variables to assess whether that will strengthen the model.

library(gridExtra)

weather <- weather %>%
  mutate(
    log_lat = log(lat + 1),
    log_elevation = log(elevation + 1),
    log_distance_to_coast = log(distance_to_coast + 1),
    log_avg_annual_precip = log(avg_annual_precip + 1),
    log_wind = log(wind + 1),
    sqrt_lat = sqrt(lat),
    sqrt_elevation = sqrt(pmax(elevation, 0)), # Ensure non-negative values.
    sqrt_distance_to_coast = sqrt(distance_to_coast),
    sqrt_avg_annual_precip = sqrt(avg_annual_precip),
    sqrt_wind = sqrt(wind)
  )

# Fit the models with transformed predictors
model_log <- lm(
  avg_temp_trans ~ log_lat + log_elevation + log_wind +
    log_avg_annual_precip + log_distance_to_coast,
  data = weather
)
model_sqrt <- lm(
  avg_temp_trans ~ sqrt_lat + sqrt_elevation + sqrt_wind +
    sqrt_avg_annual_precip + sqrt_distance_to_coast,
  data = weather
)
## Doing each graph was unwieldy so using a function to cut down on code:

# Extract residuals and fitted values from the new models
weather <- weather %>%
  mutate(
    residuals_transformed_log = residuals(model_log),
    residuals_transformed_sqrt = residuals(model_sqrt),
    residuals_t_stepwise = residuals(stepwise_transformed),
  )

# Define variables and transformations
variables <- c("lat", "elevation", "distance_to_coast", "avg_annual_precip", "wind")
transformations <- c("original", "log", "sqrt")
residuals_vars <- c("residuals_t_stepwise", "residuals_transformed_log", "residuals_transformed_sqrt")
colors <- c("blue", "darkgreen", "purple")
transformation_labels <- c("", "log( + 1)", "sqrt()")

# Function to create plots
create_plots <- function(var, transformation, residuals_var, color, transformation_label) {
  x_var <- if (transformation == "original") {
    var
  } else {
    paste0(transformation, "_", var)
  }

  x_label <- if (transformation == "original") {
    gsub("_", " ", var) # Replace underscore with space in label if raw
  } else {
    paste0(transformation_label, "(", gsub("_", " ", var), ")")
  }

  ggplot(weather, aes(x = .data[[x_var]], y = .data[[residuals_var]])) +
    geom_point(alpha = 0.7, color = color) +
    geom_hline(yintercept = 0, linetype = "dashed") +
    labs(x = x_label, y = "Residuals") +
    theme_minimal()
}

# Loop through variables and transformations, printing each variable in 1 row
for (var in variables) {
  plot_row <- list()
  for (i in 1:length(transformations)) {
    transformation <- transformations[i]
    residuals_var <- residuals_vars[i]
    color <- colors[i]
    transformation_label <- transformation_labels[i]

    plot_row[[i]] <- create_plots(var, transformation, residuals_var, color, transformation_label)
  }
  print(grid.arrange(grobs = plot_row, ncol = 3))
}

## TableGrob (1 x 3) "arrange": 3 grobs
##   z     cells    name           grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
## 3 3 (1-1,3-3) arrange gtable[layout]

## TableGrob (1 x 3) "arrange": 3 grobs
##   z     cells    name           grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
## 3 3 (1-1,3-3) arrange gtable[layout]

## TableGrob (1 x 3) "arrange": 3 grobs
##   z     cells    name           grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
## 3 3 (1-1,3-3) arrange gtable[layout]

## TableGrob (1 x 3) "arrange": 3 grobs
##   z     cells    name           grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
## 3 3 (1-1,3-3) arrange gtable[layout]

## TableGrob (1 x 3) "arrange": 3 grobs
##   z     cells    name           grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
## 3 3 (1-1,3-3) arrange gtable[layout]

I will also assess how these transformations impact the model more broadly:

## Again, use of functions will lead to cleaner and replicable output:

# Based on my readings, I will also assess including just sqrt transformations for precip and distance to coast
model_combined <- lm(
  avg_temp_trans ~ lat + elevation + wind +
    sqrt_avg_annual_precip + sqrt_distance_to_coast,
  data = weather
)

model_combined_2 <- lm(
  avg_temp_trans ~ lat + sqrt_elevation + wind +
    sqrt_avg_annual_precip + sqrt_distance_to_coast,
  data = weather
)

# F-statistic function
get_f_stat <- function(model) {
  f <- summary(model)$fstatistic
  p_value <- pf(f[1], f[2], f[3], lower.tail = FALSE)
  data.frame(
    F_statistic = f[1],
    df1 = f[2],
    df2 = f[3],
    p_value = p_value
  )
}

# PRESS statistic function
get_press <- function(model) {
  h <- hatvalues(model)
  press <- sum((residuals(model) / (1 - h))^2)
  return(press)
}

# Calculate F-statistics
f_original <- get_f_stat(stepwise_transformed)
f_log <- get_f_stat(model_log)
f_sqrt <- get_f_stat(model_sqrt)
f_combined <- get_f_stat(model_combined)
f_combined_2 <- get_f_stat(model_combined_2)

# Calculate PRESS statistics
press_original <- get_press(stepwise_transformed)
press_log <- get_press(model_log)
press_sqrt <- get_press(model_sqrt)
press_combined <- get_press(model_combined)
press_combined_2 <- get_press(model_combined_2)
# Combine results for easy comparison
comparison <- rbind(
  Original = cbind(f_original, PRESS = press_original),
  Log_Transform = cbind(f_log, PRESS = press_log),
  Sqrt_Transform = cbind(f_sqrt, PRESS = press_sqrt),
  Combined_Transform = cbind(f_combined, PRESS = press_combined),
  Combined_Transform_2 = cbind(f_combined_2, PRESS = press_combined_2)
)

print(comparison)

##                      F_statistic df1 df2      p_value    PRESS
## Original                263.0511   5 159 5.932585e-75 33.43613
## Log_Transform           189.8122   5 159 4.034483e-65 42.40488
## Sqrt_Transform          256.3958   5 159 3.628297e-74 33.17682
## Combined_Transform      276.6720   5 159 1.649814e-76 31.42069
## Combined_Transform_2    258.8112   5 159 1.871536e-74 33.04713

Most of the graphs show fairly minor improvements in the residual vs covariate graphs. Latitude and wind are reasonably dispersed across 3 version so I do not see clear grounds for transforming. The square-root transformation for distance_to_coast does seem slightly improved, and there may be some very slight improvements for both the log and square-root transformations of avg_annual_precip. I did also consider the square-root transformation elevation which may offer some slight improvements. Subsequently, I have considered Combined_Transform with square-root transformations to distance_to_coast and ave_annual_precip whilst Combined_Transform_2 also includes a square-root transformation of elevation.

The statistics included for comparison shows model Combined_Transform with square-root transformations for distance_to_coast and avg_annual_precip as having the highest F-statistic and the lowest PRESS statistic which suggest this model explains the most variation and has the best predictive accuracy. Whilst the residual vs covariates graph may slightly improve elevation when a square-root transformation is applied, the effect was not drastic and including this in the model lowered the F-stat and raised the PRESS statistic relative to the Combined_Transform. On these grounds, I have decided to keep it untransformed.

With so many transformations, there is a risk of overfitting. However, the lower PRESS statistic gives some confidence that the model’s improvements are not just fitting noise but genuinely capturing underlying relationships.

The final model is therefore a stepwise model, with a Box-Cox transformation of the response variable avg_temperature and a square-root transformation applied to distance_to_coast and avg_annual_precip. Since the Box-Cox lambda is so close to 0.5 (which would be a square-root transformation), the model is essentially using a square-root transformation for the response variable too.

These transformations have contributed to better adherence to linear regression assumptions, greater reliability in model predictions across all fitted values, and improved accuracy. We can therefore have a high degree of confidence in the model’s predictions.

Part 4: Report

Now that the model is complete, I can begin using it for my report for the travel agency. The agency wants the model tested against a city in Florida. I will also test Springfield, Ohio with the model to see if it meets the agency’s temperature criteria.

library(skimr)

# I want an overview of the dataset for my methodology section
skim(weather)

Data summary
Name	weather
Number of rows	165
Number of columns	26
_______________________
Column type frequency:
character	3
numeric	23
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
State	1	2	2	51
City	1	4	15	158
koppen	1	2	3	15

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
lon	1	-94.05	16.73	-157.94	-104.51	-89.68	-81.59	-68.01	▁▁▃▇▇
lat	1	38.49	6.28	21.33	34.02	39.11	42.38	64.80	▁▆▇▁▁
elevation	1	351.48	467.22	-0.67	48.60	192.13	353.42	2133.27	▇▁▁▁▁
distance_to_coast	1	323.99	297.49	0.26	36.14	249.02	568.23	1125.15	▇▃▃▂▁
wind	1	3.40	0.82	1.15	2.79	3.36	3.99	5.64	▁▅▇▅▁
elevation_change_four	1	92.83	136.47	0.00	23.71	43.91	99.11	809.87	▇▁▁▁▁
elevation_change_eight	1	144.36	177.45	0.46	37.15	82.13	152.62	851.60	▇▂▁▁▁
avg_annual_precip	1	39.09	16.99	3.90	23.75	43.36	51.19	111.10	▅▆▇▁▁
avg_temp	1	50.77	8.91	23.40	45.32	49.68	55.71	75.91	▁▃▇▃▁
avg_temp_trans	1	12.36	1.27	7.75	11.61	12.25	13.10	15.65	▁▂▇▆▂
log_lat	1	3.66	0.16	3.11	3.56	3.69	3.77	4.19	▁▃▇▅▁
log_elevation	1	4.77	1.89	-1.11	3.90	5.26	5.87	7.67	▁▂▂▇▃
log_distance_to_coast	1	4.75	2.02	0.23	3.61	5.52	6.34	7.03	▂▂▂▃▇
log_avg_annual_precip	1	3.57	0.55	1.59	3.21	3.79	3.95	4.72	▁▂▂▇▁
log_wind	1	1.46	0.20	0.77	1.33	1.47	1.61	1.89	▁▂▆▇▃
sqrt_lat	1	6.18	0.50	4.62	5.83	6.25	6.51	8.05	▁▅▇▂▁
sqrt_elevation	1	15.16	11.07	0.00	6.97	13.86	18.80	46.19	▆▇▂▂▂
sqrt_distance_to_coast	1	15.21	9.65	0.51	6.01	15.78	23.84	33.54	▇▆▆▇▅
sqrt_avg_annual_precip	1	6.07	1.49	1.97	4.87	6.58	7.15	10.54	▂▃▇▅▁
sqrt_wind	1	1.83	0.23	1.07	1.67	1.83	2.00	2.37	▁▂▇▇▂
residuals_transformed_log	1	0.00	0.48	-2.27	-0.22	0.08	0.24	1.86	▁▂▇▃▁
residuals_transformed_sqrt	1	0.00	0.42	-1.63	-0.18	0.03	0.16	1.67	▁▂▇▂▁
residuals_t_stepwise	1	0.00	0.42	-1.45	-0.21	-0.04	0.19	1.66	▁▃▇▂▁

# There are 165 columns but only 158 unique values for cities so I will check for cities sharing the same name from different states
cities_multiple_states <- weather %>%
  group_by(City) %>%
  summarise(n_states = n_distinct(State)) %>%
  filter(n_states > 1)

print(cities_multiple_states)

## # A tibble: 6 × 2
##   City        n_states
##   <chr>          <int>
## 1 BUFFALO            2
## 2 CHARLESTON         2
## 3 COLUMBUS           2
## 4 PORTLAND           2
## 5 RICHMOND           3
## 6 SPRINGFIELD        2

# Now testing the cities

# We are only provided values for lat, elevation, distance_to_coast and avg_annual_precip for both of them
model_subset <- update(model_combined, . ~ lat + elevation + sqrt_distance_to_coast + sqrt_avg_annual_precip)

# Create dataframes for the Florida city and for Springfield, Ohio
new_city <- data.frame(
  lat = 29.65,
  elevation = 13,
  sqrt_distance_to_coast = sqrt(3.25),
  sqrt_avg_annual_precip = sqrt(51.04)
)

springfield <- data.frame(
  lat = 39.93,
  elevation = 298,
  sqrt_distance_to_coast = sqrt(453),
  sqrt_avg_annual_precip = sqrt(38.512)
)

# Calculate the predicted temperatures
predicted_temperature_new_city <- predict(model_subset, newdata = new_city, interval = "confidence", level = 0.95)
predicted_temp_springfield <- predict(model_subset, newdata = springfield, interval = "confidence", level = 0.95)

# Because the dependent variable has a Box-Cox transformation applied, the inverse function must now be used on the computed variables
lambda <- 0.5050505
final_temp_new_city <- (lambda * predicted_temperature_new_city + 1)^(1 / lambda)
final_temp_springfield <- (lambda * predicted_temp_springfield + 1)^(1 / lambda)

print(final_temp_new_city)

##        fit      lwr      upr
## 1 64.26314 63.25711 65.27702

print(final_temp_springfield)

##        fit      lwr     upr
## 1 48.36741 47.79292 48.9453

# We are asked to compare the new_city to the average in Florida
florida_mean_temp <- weather %>%
  filter(State == "FL") %>%
  summarise(mean_temp = mean(avg_temp, na.rm = TRUE))

florida_sd <- weather %>%
  filter(State == "FL") %>%
  summarise(sd_temp = sd(avg_temp, na.rm = TRUE))

print(florida_mean_temp)

## # A tibble: 1 × 1
##   mean_temp
##       <dbl>
## 1      68.4

print(florida_sd)

## # A tibble: 1 × 1
##   sd_temp
##     <dbl>
## 1    5.53

# I will do the same for OH just as a sense-check
oh_mean_temp <- weather %>%
  filter(State == "OH") %>%
  summarise(mean_temp = mean(avg_temp, na.rm = TRUE))

print(oh_mean_temp)

## # A tibble: 1 × 1
##   mean_temp
##       <dbl>
## 1      48.6

I want to generate a table to use in my report comparing the original model to the final model:

## Compare original 'model' vs. final 'model_sqrt'

# 1. R-squared
R2_orig <- summary(model)$r.squared
R2_final <- summary(model_combined)$r.squared

# 2. F-statistic
F_orig <- summary(model)$fstatistic[1]
F_final <- summary(model_combined)$fstatistic[1]

# 3. PRESS
PRESS_orig <- sum((resid(model) / (1 - hatvalues(model)))^2)
PRESS_final <- sum((resid(model_combined) / (1 - hatvalues(model_sqrt)))^2)

# Build data frame for the comparison
comparison_table <- data.frame(
  Model       = c("Original model", "Final model"),
  R_squared   = c(R2_orig, R2_final),
  F_statistic = c(F_orig, F_final),
  PRESS       = c(PRESS_orig, PRESS_final)
)

# Create a formatted table to view
knitr::kable(
  comparison_table,
  caption = "Comparison of Original vs Final Model"
)

Comparison of Original vs Final Model
Model	R_squared	F_statistic	PRESS
Original model	0.8891160	179.842	1745.95489
Final model	0.8969112	276.672	30.99494

# Latex version for my written report
latex_table <- kable(
  comparison_table,
  format    = "latex",
  booktabs  = TRUE,
  caption   = "Comparison of Original vs. Final Model: R², F, and PRESS",
  label     = "tab:comparison",
  align     = c("l", "r", "r", "r")
) %>%
  kable_styling(
    latex_options = c("striped", "hold_position")
  )

cat(latex_table, file = "latex_table.tex")

Adding a graph for distance to coast and calculating coefficients

p <- ggplot(weather, aes(x = sqrt_distance_to_coast, y = avg_temp_trans)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE) +
  labs(
    x = "sqrt(Distance to Coast)",
    y = "Average Temperature (°F) with Box-Cox"
  )
ggsave("distance_temp_plot.pdf", plot = p, width = 6, height = 3)

## `geom_smooth()` using formula = 'y ~ x'

# Just for reference I want to have a look at an untransformed graph too
ggplot(weather, aes(x = distance_to_coast, y = avg_temp)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE) +
  labs(
    x = "sqrt(Distance to Coast)",
    y = "Average Temperature (°F) with Box-Cox"
  )

## `geom_smooth()` using formula = 'y ~ x'

summary(model_combined)

## 
## Call:
## lm(formula = avg_temp_trans ~ lat + elevation + wind + sqrt_avg_annual_precip + 
##     sqrt_distance_to_coast, data = weather)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.35602 -0.18813 -0.02756  0.19058  1.61208 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            20.0034793  0.3018436  66.271  < 2e-16 ***
## lat                    -0.1662721  0.0055436 -29.993  < 2e-16 ***
## elevation              -0.0009080  0.0001041  -8.722 3.46e-15 ***
## wind                   -0.0602448  0.0446149  -1.350  0.17883    
## sqrt_avg_annual_precip -0.0885253  0.0295319  -2.998  0.00316 ** 
## sqrt_distance_to_coast -0.0122956  0.0046091  -2.668  0.00843 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4157 on 159 degrees of freedom
## Multiple R-squared:  0.8969, Adjusted R-squared:  0.8937 
## F-statistic: 276.7 on 5 and 159 DF,  p-value: < 2.2e-16