The first step is to load the data. The data files are saved in the same folder as the .Rmd so there is no need to change the working directory.
Now I will tidy the data, starting with the forecasts dataset:
library("tidyverse")
library("dplyr") # Despite loading tidyverse above, the distinct function was not working unless I also explicitly called dplyr
tidy_forecasts <- forecasts %>%
# Create unique variables for State and City
separate(State_City, c("State", "City"), ":") %>%
# Make each level of Measurement have its own column with measure from Response
pivot_wider(names_from = Measurement, values_from = Response) %>%
# Obtain only rows with accurate observed_temp and where year is 2021
filter(!(possible_error == "observed_temp"), substr(date, 1, 4) == "2021") %>%
# Calculate average temperature of each city and store in variable avg_temp rounded to 3 decimal places
group_by(State, City) %>%
mutate(avg_temp = round(mean(observed_temp, na.rm = TRUE), 3)) %>%
# Retain only one row from each city in each state
distinct(State, City, .keep_all = TRUE)
head(tidy_forecasts)Tidy data principles state that each variable should have its own
column, each observation its own row, and each value its own cell. The
original dataset violated those principles because the
Measurement column contained multiple variables
(i.e. observed_temp, forecast temp etc),
whilst the State and City variables were
combined into one column. The combined State_City column
was therefore separated out, and pivoting moves the measurements into
separate columns. I also filtered out instances where the observed
temperature may be inaccurate, which should improve the accuracy of any
analysis performed with this dataset.
Moving on to the cities dataset:
tidy_cities <- cities %>%
# Create variables koppen and avg_annual_precip
pivot_longer(cols = 10:25, names_to = "koppen", values_to = "avg_annual_precip") %>%
# Drop rows with missing precipitation
drop_na(avg_annual_precip) %>%
# Convert 'city' to uppercase so I can join with forecast data
rename(City = city, State = state)
head(tidy_cities)# Join with tidy_forecasts, matching cities between each dataset. State is also included to account for possibility of cities with the same name from different states
combined_data <- inner_join(tidy_cities, tidy_forecasts, by = c("City", "State"))
# Filter out unwanted columns
weather <- combined_data %>%
dplyr::select(State, City, lon, lat, koppen, elevation, distance_to_coast, wind, elevation_change_four, elevation_change_eight, avg_annual_precip, avg_temp) # Once I included the MASS package, I had some issues with 'select' so have explicitly called it from the dplyr package
head(weather)As with the forecasts data, the cities data required pivoting. This
time it was pivoting the data from wide to long format, ensuring that
each Koppen classification has its own single column. The data was then
joined with the forecasts data by City and
State, thereby ensuring that each row represents a unique
city and state combination.
Before moving on, I will verify that this matches the
weather.Rdata provided, which is saved in the same
folder.
# Load data in new environment to avoid potential conflicts with my created dataset
weather_env <- new.env()
load("weather_check.RData", envir = weather_env)
weather_check <- get(ls(weather_env), envir = weather_env)
head(weather_check)## [1] FALSE
## character(0)
## character(0)
## [1] "Component \"avg_temp\": Mean relative difference: 0.003531672"
The analysis shows that the only difference is that
avg_temp has a mean relative difference of 0.0035. This
likely owes to rounding or floating-point precision. Therefore, I can be
happy that the data has been successfully transformed.
Whilst the minor discrepancy in avg_temp is unlikely to
meaningfully impact analysis, for consistency’s sake I will load the
original dataset for this exploratory data analysis.
First I will create a histogram for average temperature:
library(ggplot2)
# Histogram with mean & median lines
ggplot(weather, aes(x = avg_temp)) +
geom_histogram(binwidth = 2, fill = "blue", alpha = 0.6, color = "black") +
labs(
title = "Histogram of Average Temperature",
x = "Average Temperature in Fahrenheit",
y = "Frequency"
) +
theme_minimal()The histogram does show a slight skew to the right, indicating that there are more cities with relatively high temperatures. However, the data is still roughly symmetrical and does not have any extreme outliers. Therefore the mean is a reasonable measure of location and standard deviation a reasonable measure of spread.
weather_stats <- weather %>%
summarise(
Mean_Temperature = mean(avg_temp, na.rm = TRUE),
SD_Temperature = sd(avg_temp, na.rm = TRUE)
)
print(weather_stats)## # A tibble: 1 × 2
## Mean_Temperature SD_Temperature
## <dbl> <dbl>
## 1 50.8 8.91
The summary statistics show that cities have a mean average temperature of 50.8°f, with a standard deviation of 8.95°f, indicating moderate variation amongst cities.
Next, I will calculate the correlation coefficient for average temperature against longitude and then latitude
# Calculate correlation coefficients
cor_long <- cor(weather$avg_temp, weather$lon, use = "complete.obs")
cor_lat <- cor(weather$avg_temp, weather$lat, use = "complete.obs")
# Print correlation results
print(paste("Correlation between Avg Temp and Longitude:", round(cor_long, 3)))## [1] "Correlation between Avg Temp and Longitude: 0.159"
## [1] "Correlation between Avg Temp and Latitude: -0.878"
library(ggplot2)
# Avg_Temp vs Longitude
ggplot(weather, aes(x = lon, y = avg_temp)) +
geom_point(color = "#3498db", alpha = 0.7) +
geom_smooth(method = "lm", color = "#e74c3c", fill = "#e74c3c", alpha = 0.2, se = FALSE) +
theme_minimal() +
labs(
title = "Average Temperature vs. Longitude",
x = "Longitude",
y = "Average Temperature (°f)"
)## `geom_smooth()` using formula = 'y ~ x'
# Avg_Temp vs Latitude
ggplot(weather, aes(x = lat, y = avg_temp)) +
geom_point(color = "#18bc9c", alpha = 0.7) +
geom_smooth(method = "lm", color = "#e74c3c", fill = "#e74c3c", alpha = 0.2, se = FALSE) +
theme_minimal() +
labs(
title = "Average Temperature vs. Latitude",
x = "Latitude",
y = "Average Temperature (°f)"
)## `geom_smooth()` using formula = 'y ~ x'
The correlation between average temperature and longitude is 0.151, showing a weak positive linear relationship. This indicates that longitude alone does not meaningfully impact city temperature.
For average temperature and latitude, the correlation is -0.881, which is a very strong negative linear relationship. That means as longitude increases (i.e. as one moves east) average temperature will increase slightly. And as latitude increases (i.e. as one moves north) there is a much larger fall in average temperature.
These differences in correlation strength are clearly captured in the scatter plots, which show very diffuse plotting for points in the avg_temp vs longitude compared with a very close fit to the line for the points in the avg_temp vs latitude graph.
To analyze how average temperature (avg_temp) varies
across different Köppen climate classifications (koppen), I
will use a boxplot. A boxplot is an appropriate visualisation here
because it will illustrate the distribution of temperature across
classifications. To facilitate analysis, I will also produce some
summary statistics.
library(ggplot2)
ggplot(weather, aes(x = koppen, y = avg_temp)) +
geom_boxplot(fill = "skyblue", color = "black") +
labs(
title = "Average Temperature by Köppen Climate Classification",
x = "Köppen Classification",
y = "Average Temperature (°C)"
) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))# Compute summary statistics
koppen_summary <- weather %>%
group_by(koppen) %>%
summarise(
Median_Temp = median(avg_temp, na.rm = TRUE),
IQR_Temp = IQR(avg_temp, na.rm = TRUE),
Min_Temp = min(avg_temp, na.rm = TRUE),
Max_Temp = max(avg_temp, na.rm = TRUE),
Spread = Max_Temp - Min_Temp
) %>%
arrange(desc(Median_Temp))
print(koppen_summary)## # A tibble: 15 × 6
## koppen Median_Temp IQR_Temp Min_Temp Max_Temp Spread
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Aw 75.9 0 75.9 75.9 0
## 2 As 73.5 0 73.5 73.5 0
## 3 Am 73.0 0 73.0 73.0 0
## 4 Af 71.7 0 71.7 71.7 0
## 5 BWh 67.0 0.674 66.4 67.7 1.35
## 6 BWk 60.9 3.37 57.5 64.3 6.73
## 7 BSh 60.8 0 60.8 60.8 0
## 8 Cfa 54.5 6.83 47.0 71.8 24.8
## 9 Cfb 49.0 5.66 41.3 52.6 11.3
## 10 Csa 48.3 3.41 44.9 51.7 6.82
## 11 Csb 47.1 5.59 35.7 58.1 22.4
## 12 Dfa 46.1 2.62 37.9 49.9 12.0
## 13 Dfb 43.9 5.81 36.8 49.0 12.2
## 14 BSk 41.7 8.77 35.4 60 24.6
## 15 Dfc 32.3 7.23 23.4 37.9 14.5
The temperature here is the average for 2021, and the variability captures differences in temperature between cities within the same climate categories. Therefore the boxplot and statistics show both that average temperature varies between categories, but also that certain categories accommodate a wider range of temperatures than others. Specifically,we see that A category climates have the highest average temperatures, exceeding 70 fahrenheit. They also have zero variation, which may indicate that there is only one city in each of these categories, since it is unlikely that multiple cities have identical average temperatures.
B category cities show lower average temperatures (all between 60 and 70° fahrenheit) and some minor variability (though this remains below 7°). For the C categories, temperatures fall again, averaging between 47.7° and 55° fahrehneit. Variability also increases, particularly for Cfa cities which have a variability of 24.8° (the highest for any category). This suggests the C category includes the most diverse range of cities. Finally, category D cities have the lowest average temperatures (between 46.1° and 32.3° fahrenheit) and also reasonably high variability of between 12° and 24.6°.
Moving on to count the number of cities with an A classification:
# Initialize counter
a_count <- 0
# Loop through each city's Köppen classification
for (climate in weather$koppen) {
# Check climate is NOT missing and starts with "A"
if (!is.na(climate) && substr(climate, 1, 1) == "A") {
a_count <- a_count + 1
}
}
# Print the result
print(a_count)## [1] 4
There are 4 cities with an A classification, which confirms my suspicion that there was only 1 city per A category given the absence of any variation for those categories. We therefore have few entries in the dataset for cities where the average temperature in the coolest month is 18° celsius or higher.
Moving on to create a function that returns the total number of cities and average temperature for each state:
state_summary <- function(state) {
# Subset data
state_data <- weather[weather$State == state, ]
# Create values to be returned
total_cities <- nrow(state_data)
avg_temp_state <- mean(state_data$avg_temp, na.rm = TRUE)
# Return values as a list
return(list(
Total_Cities = total_cities,
Avg_Temperature = round(avg_temp_state, 2)
))
}
# Test the function for FL, NY, SD
print(state_summary("FL"))## $Total_Cities
## [1] 9
##
## $Avg_Temperature
## [1] 68.39
## $Total_Cities
## [1] 5
##
## $Avg_Temperature
## [1] 46.73
## $Total_Cities
## [1] 2
##
## $Avg_Temperature
## [1] 40.05
Using this function, we can see that cities in Florida on average have a warm temperature, and cities in New York and South Dakota are moderately cold. We can also see that South Dakota only has 2 cities included in the dataset, which could mean less robust conclusions about the state.
Having completed the exploratory data analysis, I can now move on to building a model.
From the correlation analysis earlier, latitude had a far stronger correlation (-0.881) with average temperature than longitude (0.151). I will therefore use latitude as the predictor for my linear regression analysis to be carried out on the climate variables.
First, I will now fit a linear regression model and create an ANOVA table and carry out an F test.
library(kableExtra)
# Fit a linear regression model
model <- lm(
avg_temp ~ lat + elevation + wind + avg_annual_precip +
elevation_change_four + elevation_change_eight + distance_to_coast,
data = weather
)
# Display summary of the model
summary(model)##
## Call:
## lm(formula = avg_temp ~ lat + elevation + wind + avg_annual_precip +
## elevation_change_four + elevation_change_eight + distance_to_coast,
## data = weather)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.0597 -1.6134 -0.2268 1.2049 10.8574
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.016e+02 1.971e+00 51.564 < 2e-16 ***
## lat -1.152e+00 4.253e-02 -27.076 < 2e-16 ***
## elevation -5.804e-03 8.303e-04 -6.990 7.4e-11 ***
## wind -4.997e-01 3.515e-01 -1.421 0.1572
## avg_annual_precip -4.425e-02 1.878e-02 -2.356 0.0197 *
## elevation_change_four 4.686e-03 2.818e-03 1.663 0.0983 .
## elevation_change_eight -3.555e-03 2.433e-03 -1.461 0.1460
## distance_to_coast -3.076e-03 1.183e-03 -2.601 0.0102 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.031 on 157 degrees of freedom
## Multiple R-squared: 0.8891, Adjusted R-squared: 0.8842
## F-statistic: 179.8 on 7 and 157 DF, p-value: < 2.2e-16
# Calculate degrees of freedom
df_residual <- nrow(weather) - length(coef(model))
df_regression <- length(coef(model)) - 1
df_total <- nrow(weather) - 1
# Calculate sums of squares
SS_residual <- sum(resid(model)^2)
SS_total <- sum((weather$avg_temp - mean(weather$avg_temp))^2)
SS_regression <- SS_total - SS_residual
# Calculate mean squares
MS_regression <- SS_regression / df_regression
MS_residual <- SS_residual / df_residual
# Calculate F-statistic
F_value <- MS_regression / MS_residual
# Create a data frame to display the ANOVA table
anova_table <- data.frame(
Source = c("Regression", "Residual", "Total"),
DF = c(df_regression, df_residual, df_total),
Sum_Sq = c(SS_regression, SS_residual, SS_total),
Mean_Sq = c(MS_regression, MS_residual, NA),
F_value = c(F_value, NA, NA)
)
# Round numerical columns to 3 decimal places to tidy table
anova_table$Sum_Sq <- round(anova_table$Sum_Sq, 3)
anova_table$Mean_Sq <- round(anova_table$Mean_Sq, 3)
anova_table$F_value <- round(anova_table$F_value, 3)
# Replace NA values with empty strings to tidy table up
anova_table[is.na(anova_table)] <- "" # Replace NAs with empty strings
# Use kable and kableExtra to produce a formatted ANOVA table
kable(anova_table, "html", caption = "ANOVA Table (values rounded to 3 decimal places)") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "bordered")) %>%
row_spec(0, extra_css = "border-bottom: 1px solid #ddd;")| Source | DF | Sum_Sq | Mean_Sq | F_value |
|---|---|---|---|---|
| Regression | 7 | 11565.717 | 1652.245 | 179.842 |
| Residual | 157 | 1442.391 | 9.187 | |
| Total | 164 | 13008.108 |
## [1] 1.367117e-71
## [1] 0
Model fit
The F-statistic tests the hypothesis that at least one predictor from the model is significantly associated with average temperature. The very small p-value of <0.001 means we can reject H0 that there are no statistically significant predictors and conclude there is a linear association between average temperature and at least one of the predictors.
If we divide the sum of squares regression by sum of squares total: \[ R^2 = \frac{SS_{\text{regression}}}{SS_{\text{total}}} = \frac{12229.176}{13781.852} \approx 0.887. \] we can see that the R2 is approximately 0.887, meaning the model explains 88.7% of variance in the response variable. That is a very robust result with regards the model’s predictive strength.
Significant predictors
The variable with the strongest effect on average temperature is latitude, with p-value < 0.001. Looking at the summary table we can see that for every 1° increase in latitude, average temperature decreases by roughly 1.18° fahrenheit.
Elevation has a p-value < 0.001, and for every 1 metre increase in elevation, average temperature decreases by roughly 0.0058° fahrenheit.
For distance to coast, the p-value is 0.0013, which is statistically significant. Average temperature decreases by 0.0032° fahrenheit for every 1km closer to the coast.
Average annual precipitation has a p-value of 0.0123 and a coefficient of -0.0493, meaning an increase of 1 inch of average annual precipitation is linked to a fall in average temperature of -0.0493° fahrenheit.
None of the remaining variables (wind,
elevation_change_four or
elevation_change_eight) are statistically
significant.
Next, I will test whether the slope coefficient of the distance to cost is equal to -0.01:
# Extract coefficient and standard error for distance_to_coast
coefs <- summary(model)$coefficients
beta_dist_coast <- coefs["distance_to_coast", 1] # Estimate of coefficient
se_dist_coast <- coefs["distance_to_coast", 2] # Standard error
# Compute t-statistic
t_stat <- (beta_dist_coast - (-0.01)) / se_dist_coast
# Compute p-value (two-tailed test)
p_value <- 2 * pt(abs(t_stat), df = model$df.residual, lower.tail = FALSE)
# Print results
print(round(t_stat, 3))## [1] 5.854
## [1] 0
# Interpretation
if (p_value < 0.05) {
cat("Reject H₀: The slope coefficient is significantly different from -0.01.\n")
} else {
cat("Fail to reject H₀: No significant evidence that the slope coefficient differs from -0.01.\n")
}## Reject H₀: The slope coefficient is significantly different from -0.01.
With a p-value of < 0.001, which is less than the significance level of 0.05, we reject H0 and conclude that the slope coefficient is significantly different from -0.01.
In this section, I use automatic variable selection methods to explore whether it is possible to build an improved version of the model above. I will test forward, backward and stepwise models:
# Forward selection
forward_model <- step(lm(avg_temp ~ 1, data = weather),
scope = formula(model),
direction = "forward",
trace = FALSE
)
summary(forward_model)##
## Call:
## lm(formula = avg_temp ~ lat + elevation + distance_to_coast +
## avg_annual_precip, data = weather)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3873 -1.7567 -0.3787 1.2942 11.3469
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.003e+02 1.827e+00 54.895 < 2e-16 ***
## lat -1.152e+00 4.079e-02 -28.251 < 2e-16 ***
## elevation -5.924e-03 7.436e-04 -7.967 2.87e-13 ***
## distance_to_coast -3.644e-03 1.043e-03 -3.495 0.000613 ***
## avg_annual_precip -4.906e-02 1.810e-02 -2.711 0.007434 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.048 on 160 degrees of freedom
## Multiple R-squared: 0.8857, Adjusted R-squared: 0.8829
## F-statistic: 310 on 4 and 160 DF, p-value: < 2.2e-16
# Backward elimination
backward_model <- step(model,
direction = "backward",
trace = FALSE
)
summary(backward_model)##
## Call:
## lm(formula = avg_temp ~ lat + elevation + wind + avg_annual_precip +
## elevation_change_four + elevation_change_eight + distance_to_coast,
## data = weather)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.0597 -1.6134 -0.2268 1.2049 10.8574
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.016e+02 1.971e+00 51.564 < 2e-16 ***
## lat -1.152e+00 4.253e-02 -27.076 < 2e-16 ***
## elevation -5.804e-03 8.303e-04 -6.990 7.4e-11 ***
## wind -4.997e-01 3.515e-01 -1.421 0.1572
## avg_annual_precip -4.425e-02 1.878e-02 -2.356 0.0197 *
## elevation_change_four 4.686e-03 2.818e-03 1.663 0.0983 .
## elevation_change_eight -3.555e-03 2.433e-03 -1.461 0.1460
## distance_to_coast -3.076e-03 1.183e-03 -2.601 0.0102 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.031 on 157 degrees of freedom
## Multiple R-squared: 0.8891, Adjusted R-squared: 0.8842
## F-statistic: 179.8 on 7 and 157 DF, p-value: < 2.2e-16
# Stepwise selection (both forward and backward)
stepwise_model <- step(model,
direction = "both",
trace = FALSE
)
summary(stepwise_model)##
## Call:
## lm(formula = avg_temp ~ lat + elevation + wind + avg_annual_precip +
## elevation_change_four + elevation_change_eight + distance_to_coast,
## data = weather)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.0597 -1.6134 -0.2268 1.2049 10.8574
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.016e+02 1.971e+00 51.564 < 2e-16 ***
## lat -1.152e+00 4.253e-02 -27.076 < 2e-16 ***
## elevation -5.804e-03 8.303e-04 -6.990 7.4e-11 ***
## wind -4.997e-01 3.515e-01 -1.421 0.1572
## avg_annual_precip -4.425e-02 1.878e-02 -2.356 0.0197 *
## elevation_change_four 4.686e-03 2.818e-03 1.663 0.0983 .
## elevation_change_eight -3.555e-03 2.433e-03 -1.461 0.1460
## distance_to_coast -3.076e-03 1.183e-03 -2.601 0.0102 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.031 on 157 degrees of freedom
## Multiple R-squared: 0.8891, Adjusted R-squared: 0.8842
## F-statistic: 179.8 on 7 and 157 DF, p-value: < 2.2e-16
## Full Model AIC: 843.9882
## Forward Selection AIC: 842.9679
## Backward Elimination AIC: 843.9882
## Stepwise Selection AIC: 843.9882
Forward, backward and stepwise all resulted in the same AIC of 837.7161, compared with an AIC for the full model of 856.1. The fact that the three automatic variable selection models produced the same models suggests the removed variables did not contribute significantly to predicting average temperature. The fact that these models have a lower AIC does mean the adjusted models are more efficient, which is to say they explains the data well with fewer predictors than the full model.
Since forward, backward and stepwise selection all had the same AIC, I can just select one to compare more thoroughly with the original model. For the analysis below, I have selected the stepwise model.
# Calculate number of observations and p
n <- nrow(model.frame(model))
p_full <- length(coef(model))
# Create summary objects
sum_full <- summary(model)
sum_step <- summary(stepwise_model)
# Compute MSE for full model
MSE_full <- sum(resid(model)^2) / (n - p_full)
# Compute Mallows' Cp for the full model
CP_full <- sum(resid(model)^2) / MSE_full + 2 * p_full - n
# Compute p and then Mallows' CP for stepwise model
p_stepwise <- length(coef(stepwise_model))
CP_stepwise <- sum(resid(stepwise_model)^2) / MSE_full + 2 * p_stepwise - n
# Create a dataframe comparing R², Adjusted R², AIC, and Mallows' Cp
model_comparison <- data.frame(
Model = c("Full Model", "Stepwise Selection"),
R_Squared = c(summary(model)$r.squared, summary(stepwise_model)$r.squared),
Adjusted_R2 = c(summary(model)$adj.r.squared, summary(stepwise_model)$adj.r.squared),
AIC = c(AIC(model), AIC(stepwise_model)),
Mallows_Cp = c(CP_full, CP_stepwise)
)
print(model_comparison)## Model R_Squared Adjusted_R2 AIC Mallows_Cp
## 1 Full Model 0.889116 0.8841721 843.9882 8
## 2 Stepwise Selection 0.889116 0.8841721 843.9882 8
## Analysis of Variance Table
##
## Model 1: avg_temp ~ lat + elevation + wind + avg_annual_precip + elevation_change_four +
## elevation_change_eight + distance_to_coast
## Model 2: avg_temp ~ lat + elevation + wind + avg_annual_precip + elevation_change_four +
## elevation_change_eight + distance_to_coast
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 157 1442.4
## 2 157 1442.4 0 0
I have already discussed the AIC above. Turning to the R2, there is a very small (0.0005) drop in the stepwise model relative to the original model, which means removing variables did not significantly reduce the model’s explanatory power. The adjusted R2 is very marginally improved (0.8826 vs 0.8823 for the full model, representing an increase in 0.0003) which means when adjusting for the number of predictors the stepwise model performs slightly better.
With regards Mallow’s Cp, if the stepwise model performs similarly to the full model than Cp should be approximately p + 1. The stepwise model has a p of 6 (5 predictors + intercept) and has a Cp of 5.66, representing only a small distance from p. Where two models perform similarly but one is smaller with a lower p value, we prefer the smaller model. We therefore have strong grounds from the Mallow’s Cp analysis to use the smaller stepwise model.
The ANOVA results paint a similar picture, with a slight increase in
residual sum of squares for the stepwise model. But the F-statistic has
a p-value of 0.438, above the significance threshold of 0.05. This means
that the difference in model fit between the full and stepwise models is
not statistically significant. In other words, removing
elevation_change_four and
elevation_change_eight does not meaningfully reduce the
model’s ability to explain variation in avg_temp.
Finally, we can compare each model’s PRESS statistic, which allows us to check how well a model will perform on new data.
h_model <- hatvalues(model)
press_model <- sum((residuals(model) / (1 - h_model))^2)
h_step <- hatvalues(stepwise_model)
press_step <- sum((residuals(stepwise_model) / (1 - h_step))^2)
# Print both PRESS values for comparison
print(press_model)## [1] 1745.955
## [1] 1745.955
The stepwise model has a lower PRESS (1851.698 vs 1901.073 for the full model). That suggests that the stepwise model may have slightly better predictive performance when using new and unseen data. Given that the stepwise model achieves a similar level of explanatory power, uses fewer predictors, and is less likely to overfit data, it is the preferred model for efficiency and interpretability.
The assumptions of the linear regression model are that:
The normality assumption can be checked with a q-q plot of the residuals, constant variance by plotting residuals against the fitted values, and independence by considering both a plot of residuals vs fitted values and of residuals vs covariates.
library(ggplot2)
# Extract residuals & fitted values
residuals_stepwise <- residuals(stepwise_model)
fitted_values_stepwise <- fitted(stepwise_model)
# Plot the Q-Q plot with a reference line
ggplot(data = data.frame(Residuals = residuals_stepwise), aes(sample = Residuals)) +
stat_qq() +
stat_qq_line() +
labs(
title = "Q-Q Plot of Residuals (Stepwise Model)",
x = "Theoretical Quantiles",
y = "Sample Quantiles"
) +
theme_minimal()# Residuals vs Fitted
ggplot() +
geom_point(aes(x = fitted_values_stepwise, y = residuals_stepwise)) +
geom_hline(yintercept = 0, linetype = "dashed") +
geom_smooth(aes(x = fitted_values_stepwise, y = residuals_stepwise),
method = "loess", se = FALSE, color = "red"
)## `geom_smooth()` using formula = 'y ~ x'
# Residuals vs covariates
covariates <- c("lat", "elevation", "wind", "avg_annual_precip", "distance_to_coast")
df_long <- weather %>%
dplyr::select(all_of(covariates)) %>%
pivot_longer(cols = everything(), names_to = "Covariate", values_to = "Value")
df_long$Residuals <- rep(residuals_stepwise, each = length(covariates))
ggplot(df_long, aes(x = Value, y = Residuals)) +
geom_point(color = "blue", alpha = 0.7) +
geom_hline(yintercept = 0, linetype = "dashed", color = "black") +
# stat_smooth(method = "loess", se = FALSE, color = "red") +
facet_wrap(~Covariate, scales = "free_x") +
labs(
title = "Residuals vs Covariates",
x = "Covariate Value",
y = "Residuals"
) +
theme_minimal()The Q-Q plot shows the residuals mostly follow the line in the middle, indicating normality for those values. However, there is some deviation at the extremes, particularly on the right. This indicates outliers at either end, suggesting some slight heavy tails.
The residuals vs fitted shows a slight smile pattern, made more apparent with the use of a LOESS curve to visualise the overall trend. This indicates that residuals are slightly smaller for intermediate fitted values and larger at the extremes. This could mean the model may predict cities with temperatures at extremes with less precision than cities with moderate average temperatures.
Finally, we turn to the Residuals vs Covariates graphs. The graphs
for distance_to_coast and elevation show
clustering at the left side of the graph. This does not inherently
violate assumptions, but instead indicate that there is a greater
proportion of cities close to the coast and at lower levels of
elevation. The spread of values for these variables is fairly evenly
spread across the zero line, albeit with slightly more points above than
below in both cases. The dispersion of values for wind
appears straightforwardly random, and to a lesser extent
lat which does have some outliers in the top right. A
significant outlier impacts avg_annual_precip in the top
right of the graph.
The process of checking assumptions above establishes grounds for a
Box-Cox transformation to avg_temp. Box-Cox transformations
can normalise residual distribution, which may reduce the heavy tails
identified in the q-q plot. It can also reduce heteroscadisticity, which
should help correct the smile pattern in the residuals vs fitted graph.
And it can enhance linearity between dependent variable and predictors,
which could lead to a more random dispersion of points in the residual
vs covariate graphs. The first step is identifying the lambda to
use.
## [1] 0.4646465
That gives a lambda of 0.50505, which can now be incorporated into the stepwise model. This is almost equivalent to a square-root transformation, but I will use the calculated value for extra precision.
lambda_opt <- 0.5050505
weather$avg_temp_trans <- (weather$avg_temp^lambda_opt - 1) / lambda_opt
stepwise_transformed <- lm(avg_temp_trans ~ lat + elevation + wind + avg_annual_precip + distance_to_coast,
data = weather
)
summary(stepwise_transformed)##
## Call:
## lm(formula = avg_temp_trans ~ lat + elevation + wind + avg_annual_precip +
## distance_to_coast, data = weather)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.45079 -0.20982 -0.03978 0.19133 1.65601
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.5859626 0.2725420 71.864 < 2e-16 ***
## lat -0.1653973 0.0057028 -29.003 < 2e-16 ***
## elevation -0.0008520 0.0001037 -8.214 6.98e-14 ***
## wind -0.0750371 0.0465047 -1.614 0.1086
## avg_annual_precip -0.0049575 0.0026046 -1.903 0.0588 .
## distance_to_coast -0.0003594 0.0001600 -2.246 0.0261 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4251 on 159 degrees of freedom
## Multiple R-squared: 0.8921, Adjusted R-squared: 0.8888
## F-statistic: 263.1 on 5 and 159 DF, p-value: < 2.2e-16
Assessing the impact of the Box-Cox transformation on the model we can see:
which means the Box-Cox transformation ensures a slightly larger proportion of the variance in average temperature is explained by the model.
There is also a slight increase in the F-statistic (from 247.5 to 263.1) and the p-value is still <0.001. So the Box-Cox transformation does appear to have contributed to a further improvement to the model.
Re-running the assumption checks:
# Extract residuals & fitted values
residuals_t_stepwise <- residuals(stepwise_transformed)
fitted_values_t_stepwise <- fitted(stepwise_transformed)
# Replot the Q-Q plot. I have adjusted so that it uses the same scale as the first Q-Q plot to make the graphs easier to compare
ggplot(data = data.frame(Residuals = residuals_t_stepwise), aes(sample = Residuals)) +
stat_qq() +
stat_qq_line() +
labs(
title = "Q-Q Plot of Residuals (Stepwise Model)",
x = "Theoretical Quantiles",
y = "Sample Quantiles"
) +
scale_y_continuous(limits = c(-10, 10), breaks = seq(-10, 10, by = 5)) +
theme_minimal()# Residuals vs Fitted
ggplot() +
geom_point(aes(x = fitted_values_t_stepwise, y = residuals_t_stepwise)) +
geom_hline(yintercept = 0, linetype = "dashed") +
geom_smooth(aes(x = fitted_values_t_stepwise, y = residuals_t_stepwise),
method = "loess", se = FALSE, color = "red"
)## `geom_smooth()` using formula = 'y ~ x'
# Residuals vs covariates
df_long$Residuals <- rep(residuals_t_stepwise, each = length(covariates))
ggplot(df_long, aes(x = Value, y = Residuals)) +
geom_point(color = "blue", alpha = 0.7) +
geom_hline(yintercept = 0, linetype = "dashed", color = "black") +
# stat_smooth(method = "loess", se = FALSE, color = "red") +
facet_wrap(~Covariate, scales = "free_x") +
labs(
title = "Residuals vs Covariates",
x = "Covariate Value",
y = "Residuals"
) +
theme_minimal()The Q-Q plot is flatter and the tails are less pronounced, reflecting the reduced variance in residuals and higher normality. Turning to the residuals vs fitted graph, the LOESS curve has also flattened, which means greater homoscedasticity (i.e. constant variance). However, the residuals vs covariate graphs have not been significantly impacted. I will therefore explore applying transformations to the dependent variables to assess whether that will strengthen the model.
library(gridExtra)
weather <- weather %>%
mutate(
log_lat = log(lat + 1),
log_elevation = log(elevation + 1),
log_distance_to_coast = log(distance_to_coast + 1),
log_avg_annual_precip = log(avg_annual_precip + 1),
log_wind = log(wind + 1),
sqrt_lat = sqrt(lat),
sqrt_elevation = sqrt(pmax(elevation, 0)), # Ensure non-negative values.
sqrt_distance_to_coast = sqrt(distance_to_coast),
sqrt_avg_annual_precip = sqrt(avg_annual_precip),
sqrt_wind = sqrt(wind)
)
# Fit the models with transformed predictors
model_log <- lm(
avg_temp_trans ~ log_lat + log_elevation + log_wind +
log_avg_annual_precip + log_distance_to_coast,
data = weather
)
model_sqrt <- lm(
avg_temp_trans ~ sqrt_lat + sqrt_elevation + sqrt_wind +
sqrt_avg_annual_precip + sqrt_distance_to_coast,
data = weather
)
## Doing each graph was unwieldy so using a function to cut down on code:
# Extract residuals and fitted values from the new models
weather <- weather %>%
mutate(
residuals_transformed_log = residuals(model_log),
residuals_transformed_sqrt = residuals(model_sqrt),
residuals_t_stepwise = residuals(stepwise_transformed),
)
# Define variables and transformations
variables <- c("lat", "elevation", "distance_to_coast", "avg_annual_precip", "wind")
transformations <- c("original", "log", "sqrt")
residuals_vars <- c("residuals_t_stepwise", "residuals_transformed_log", "residuals_transformed_sqrt")
colors <- c("blue", "darkgreen", "purple")
transformation_labels <- c("", "log( + 1)", "sqrt()")
# Function to create plots
create_plots <- function(var, transformation, residuals_var, color, transformation_label) {
x_var <- if (transformation == "original") {
var
} else {
paste0(transformation, "_", var)
}
x_label <- if (transformation == "original") {
gsub("_", " ", var) # Replace underscore with space in label if raw
} else {
paste0(transformation_label, "(", gsub("_", " ", var), ")")
}
ggplot(weather, aes(x = .data[[x_var]], y = .data[[residuals_var]])) +
geom_point(alpha = 0.7, color = color) +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(x = x_label, y = "Residuals") +
theme_minimal()
}
# Loop through variables and transformations, printing each variable in 1 row
for (var in variables) {
plot_row <- list()
for (i in 1:length(transformations)) {
transformation <- transformations[i]
residuals_var <- residuals_vars[i]
color <- colors[i]
transformation_label <- transformation_labels[i]
plot_row[[i]] <- create_plots(var, transformation, residuals_var, color, transformation_label)
}
print(grid.arrange(grobs = plot_row, ncol = 3))
}## TableGrob (1 x 3) "arrange": 3 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
## 3 3 (1-1,3-3) arrange gtable[layout]
## TableGrob (1 x 3) "arrange": 3 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
## 3 3 (1-1,3-3) arrange gtable[layout]
## TableGrob (1 x 3) "arrange": 3 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
## 3 3 (1-1,3-3) arrange gtable[layout]
## TableGrob (1 x 3) "arrange": 3 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
## 3 3 (1-1,3-3) arrange gtable[layout]
## TableGrob (1 x 3) "arrange": 3 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
## 3 3 (1-1,3-3) arrange gtable[layout]
I will also assess how these transformations impact the model more broadly:
## Again, use of functions will lead to cleaner and replicable output:
# Based on my readings, I will also assess including just sqrt transformations for precip and distance to coast
model_combined <- lm(
avg_temp_trans ~ lat + elevation + wind +
sqrt_avg_annual_precip + sqrt_distance_to_coast,
data = weather
)
model_combined_2 <- lm(
avg_temp_trans ~ lat + sqrt_elevation + wind +
sqrt_avg_annual_precip + sqrt_distance_to_coast,
data = weather
)
# F-statistic function
get_f_stat <- function(model) {
f <- summary(model)$fstatistic
p_value <- pf(f[1], f[2], f[3], lower.tail = FALSE)
data.frame(
F_statistic = f[1],
df1 = f[2],
df2 = f[3],
p_value = p_value
)
}
# PRESS statistic function
get_press <- function(model) {
h <- hatvalues(model)
press <- sum((residuals(model) / (1 - h))^2)
return(press)
}
# Calculate F-statistics
f_original <- get_f_stat(stepwise_transformed)
f_log <- get_f_stat(model_log)
f_sqrt <- get_f_stat(model_sqrt)
f_combined <- get_f_stat(model_combined)
f_combined_2 <- get_f_stat(model_combined_2)
# Calculate PRESS statistics
press_original <- get_press(stepwise_transformed)
press_log <- get_press(model_log)
press_sqrt <- get_press(model_sqrt)
press_combined <- get_press(model_combined)
press_combined_2 <- get_press(model_combined_2)
# Combine results for easy comparison
comparison <- rbind(
Original = cbind(f_original, PRESS = press_original),
Log_Transform = cbind(f_log, PRESS = press_log),
Sqrt_Transform = cbind(f_sqrt, PRESS = press_sqrt),
Combined_Transform = cbind(f_combined, PRESS = press_combined),
Combined_Transform_2 = cbind(f_combined_2, PRESS = press_combined_2)
)
print(comparison)## F_statistic df1 df2 p_value PRESS
## Original 263.0511 5 159 5.932585e-75 33.43613
## Log_Transform 189.8122 5 159 4.034483e-65 42.40488
## Sqrt_Transform 256.3958 5 159 3.628297e-74 33.17682
## Combined_Transform 276.6720 5 159 1.649814e-76 31.42069
## Combined_Transform_2 258.8112 5 159 1.871536e-74 33.04713
Most of the graphs show fairly minor improvements in the residual vs
covariate graphs. Latitude and wind are reasonably dispersed across 3
version so I do not see clear grounds for transforming. The square-root
transformation for distance_to_coast does seem slightly
improved, and there may be some very slight improvements for both the
log and square-root transformations of avg_annual_precip. I
did also consider the square-root transformation elevation
which may offer some slight improvements. Subsequently, I have
considered Combined_Transform with square-root
transformations to distance_to_coast and
ave_annual_precip whilst Combined_Transform_2
also includes a square-root transformation of
elevation.
The statistics included for comparison shows model
Combined_Transform with square-root transformations for
distance_to_coast and avg_annual_precip as
having the highest F-statistic and the lowest PRESS statistic which
suggest this model explains the most variation and has the best
predictive accuracy. Whilst the residual vs covariates graph may
slightly improve elevation when a square-root
transformation is applied, the effect was not drastic and including this
in the model lowered the F-stat and raised the PRESS statistic relative
to the Combined_Transform. On these grounds, I have decided
to keep it untransformed.
With so many transformations, there is a risk of overfitting. However, the lower PRESS statistic gives some confidence that the model’s improvements are not just fitting noise but genuinely capturing underlying relationships.
The final model is therefore a stepwise model, with
a Box-Cox transformation of the response variable
avg_temperature and a square-root
transformation applied to distance_to_coast and
avg_annual_precip. Since the Box-Cox lambda is so close to
0.5 (which would be a square-root transformation), the model is
essentially using a square-root transformation for the response variable
too.
These transformations have contributed to better adherence to linear regression assumptions, greater reliability in model predictions across all fitted values, and improved accuracy. We can therefore have a high degree of confidence in the model’s predictions.
Now that the model is complete, I can begin using it for my report for the travel agency. The agency wants the model tested against a city in Florida. I will also test Springfield, Ohio with the model to see if it meets the agency’s temperature criteria.
| Name | weather |
| Number of rows | 165 |
| Number of columns | 26 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| numeric | 23 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| State | 0 | 1 | 2 | 2 | 0 | 51 | 0 |
| City | 0 | 1 | 4 | 15 | 0 | 158 | 0 |
| koppen | 0 | 1 | 2 | 3 | 0 | 15 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| lon | 0 | 1 | -94.05 | 16.73 | -157.94 | -104.51 | -89.68 | -81.59 | -68.01 | ▁▁▃▇▇ |
| lat | 0 | 1 | 38.49 | 6.28 | 21.33 | 34.02 | 39.11 | 42.38 | 64.80 | ▁▆▇▁▁ |
| elevation | 0 | 1 | 351.48 | 467.22 | -0.67 | 48.60 | 192.13 | 353.42 | 2133.27 | ▇▁▁▁▁ |
| distance_to_coast | 0 | 1 | 323.99 | 297.49 | 0.26 | 36.14 | 249.02 | 568.23 | 1125.15 | ▇▃▃▂▁ |
| wind | 0 | 1 | 3.40 | 0.82 | 1.15 | 2.79 | 3.36 | 3.99 | 5.64 | ▁▅▇▅▁ |
| elevation_change_four | 0 | 1 | 92.83 | 136.47 | 0.00 | 23.71 | 43.91 | 99.11 | 809.87 | ▇▁▁▁▁ |
| elevation_change_eight | 0 | 1 | 144.36 | 177.45 | 0.46 | 37.15 | 82.13 | 152.62 | 851.60 | ▇▂▁▁▁ |
| avg_annual_precip | 0 | 1 | 39.09 | 16.99 | 3.90 | 23.75 | 43.36 | 51.19 | 111.10 | ▅▆▇▁▁ |
| avg_temp | 0 | 1 | 50.77 | 8.91 | 23.40 | 45.32 | 49.68 | 55.71 | 75.91 | ▁▃▇▃▁ |
| avg_temp_trans | 0 | 1 | 12.36 | 1.27 | 7.75 | 11.61 | 12.25 | 13.10 | 15.65 | ▁▂▇▆▂ |
| log_lat | 0 | 1 | 3.66 | 0.16 | 3.11 | 3.56 | 3.69 | 3.77 | 4.19 | ▁▃▇▅▁ |
| log_elevation | 0 | 1 | 4.77 | 1.89 | -1.11 | 3.90 | 5.26 | 5.87 | 7.67 | ▁▂▂▇▃ |
| log_distance_to_coast | 0 | 1 | 4.75 | 2.02 | 0.23 | 3.61 | 5.52 | 6.34 | 7.03 | ▂▂▂▃▇ |
| log_avg_annual_precip | 0 | 1 | 3.57 | 0.55 | 1.59 | 3.21 | 3.79 | 3.95 | 4.72 | ▁▂▂▇▁ |
| log_wind | 0 | 1 | 1.46 | 0.20 | 0.77 | 1.33 | 1.47 | 1.61 | 1.89 | ▁▂▆▇▃ |
| sqrt_lat | 0 | 1 | 6.18 | 0.50 | 4.62 | 5.83 | 6.25 | 6.51 | 8.05 | ▁▅▇▂▁ |
| sqrt_elevation | 0 | 1 | 15.16 | 11.07 | 0.00 | 6.97 | 13.86 | 18.80 | 46.19 | ▆▇▂▂▂ |
| sqrt_distance_to_coast | 0 | 1 | 15.21 | 9.65 | 0.51 | 6.01 | 15.78 | 23.84 | 33.54 | ▇▆▆▇▅ |
| sqrt_avg_annual_precip | 0 | 1 | 6.07 | 1.49 | 1.97 | 4.87 | 6.58 | 7.15 | 10.54 | ▂▃▇▅▁ |
| sqrt_wind | 0 | 1 | 1.83 | 0.23 | 1.07 | 1.67 | 1.83 | 2.00 | 2.37 | ▁▂▇▇▂ |
| residuals_transformed_log | 0 | 1 | 0.00 | 0.48 | -2.27 | -0.22 | 0.08 | 0.24 | 1.86 | ▁▂▇▃▁ |
| residuals_transformed_sqrt | 0 | 1 | 0.00 | 0.42 | -1.63 | -0.18 | 0.03 | 0.16 | 1.67 | ▁▂▇▂▁ |
| residuals_t_stepwise | 0 | 1 | 0.00 | 0.42 | -1.45 | -0.21 | -0.04 | 0.19 | 1.66 | ▁▃▇▂▁ |
# There are 165 columns but only 158 unique values for cities so I will check for cities sharing the same name from different states
cities_multiple_states <- weather %>%
group_by(City) %>%
summarise(n_states = n_distinct(State)) %>%
filter(n_states > 1)
print(cities_multiple_states)## # A tibble: 6 × 2
## City n_states
## <chr> <int>
## 1 BUFFALO 2
## 2 CHARLESTON 2
## 3 COLUMBUS 2
## 4 PORTLAND 2
## 5 RICHMOND 3
## 6 SPRINGFIELD 2
# Now testing the cities
# We are only provided values for lat, elevation, distance_to_coast and avg_annual_precip for both of them
model_subset <- update(model_combined, . ~ lat + elevation + sqrt_distance_to_coast + sqrt_avg_annual_precip)
# Create dataframes for the Florida city and for Springfield, Ohio
new_city <- data.frame(
lat = 29.65,
elevation = 13,
sqrt_distance_to_coast = sqrt(3.25),
sqrt_avg_annual_precip = sqrt(51.04)
)
springfield <- data.frame(
lat = 39.93,
elevation = 298,
sqrt_distance_to_coast = sqrt(453),
sqrt_avg_annual_precip = sqrt(38.512)
)
# Calculate the predicted temperatures
predicted_temperature_new_city <- predict(model_subset, newdata = new_city, interval = "confidence", level = 0.95)
predicted_temp_springfield <- predict(model_subset, newdata = springfield, interval = "confidence", level = 0.95)
# Because the dependent variable has a Box-Cox transformation applied, the inverse function must now be used on the computed variables
lambda <- 0.5050505
final_temp_new_city <- (lambda * predicted_temperature_new_city + 1)^(1 / lambda)
final_temp_springfield <- (lambda * predicted_temp_springfield + 1)^(1 / lambda)
print(final_temp_new_city)## fit lwr upr
## 1 64.26314 63.25711 65.27702
## fit lwr upr
## 1 48.36741 47.79292 48.9453
# We are asked to compare the new_city to the average in Florida
florida_mean_temp <- weather %>%
filter(State == "FL") %>%
summarise(mean_temp = mean(avg_temp, na.rm = TRUE))
florida_sd <- weather %>%
filter(State == "FL") %>%
summarise(sd_temp = sd(avg_temp, na.rm = TRUE))
print(florida_mean_temp)## # A tibble: 1 × 1
## mean_temp
## <dbl>
## 1 68.4
## # A tibble: 1 × 1
## sd_temp
## <dbl>
## 1 5.53
# I will do the same for OH just as a sense-check
oh_mean_temp <- weather %>%
filter(State == "OH") %>%
summarise(mean_temp = mean(avg_temp, na.rm = TRUE))
print(oh_mean_temp)## # A tibble: 1 × 1
## mean_temp
## <dbl>
## 1 48.6
I want to generate a table to use in my report comparing the original model to the final model:
## Compare original 'model' vs. final 'model_sqrt'
# 1. R-squared
R2_orig <- summary(model)$r.squared
R2_final <- summary(model_combined)$r.squared
# 2. F-statistic
F_orig <- summary(model)$fstatistic[1]
F_final <- summary(model_combined)$fstatistic[1]
# 3. PRESS
PRESS_orig <- sum((resid(model) / (1 - hatvalues(model)))^2)
PRESS_final <- sum((resid(model_combined) / (1 - hatvalues(model_sqrt)))^2)
# Build data frame for the comparison
comparison_table <- data.frame(
Model = c("Original model", "Final model"),
R_squared = c(R2_orig, R2_final),
F_statistic = c(F_orig, F_final),
PRESS = c(PRESS_orig, PRESS_final)
)
# Create a formatted table to view
knitr::kable(
comparison_table,
caption = "Comparison of Original vs Final Model"
)| Model | R_squared | F_statistic | PRESS |
|---|---|---|---|
| Original model | 0.8891160 | 179.842 | 1745.95489 |
| Final model | 0.8969112 | 276.672 | 30.99494 |
# Latex version for my written report
latex_table <- kable(
comparison_table,
format = "latex",
booktabs = TRUE,
caption = "Comparison of Original vs. Final Model: R², F, and PRESS",
label = "tab:comparison",
align = c("l", "r", "r", "r")
) %>%
kable_styling(
latex_options = c("striped", "hold_position")
)
cat(latex_table, file = "latex_table.tex")Adding a graph for distance to coast and calculating coefficients
p <- ggplot(weather, aes(x = sqrt_distance_to_coast, y = avg_temp_trans)) +
geom_point() +
geom_smooth(method = "lm", se = TRUE) +
labs(
x = "sqrt(Distance to Coast)",
y = "Average Temperature (°F) with Box-Cox"
)
ggsave("distance_temp_plot.pdf", plot = p, width = 6, height = 3)## `geom_smooth()` using formula = 'y ~ x'
# Just for reference I want to have a look at an untransformed graph too
ggplot(weather, aes(x = distance_to_coast, y = avg_temp)) +
geom_point() +
geom_smooth(method = "lm", se = TRUE) +
labs(
x = "sqrt(Distance to Coast)",
y = "Average Temperature (°F) with Box-Cox"
)## `geom_smooth()` using formula = 'y ~ x'
##
## Call:
## lm(formula = avg_temp_trans ~ lat + elevation + wind + sqrt_avg_annual_precip +
## sqrt_distance_to_coast, data = weather)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.35602 -0.18813 -0.02756 0.19058 1.61208
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.0034793 0.3018436 66.271 < 2e-16 ***
## lat -0.1662721 0.0055436 -29.993 < 2e-16 ***
## elevation -0.0009080 0.0001041 -8.722 3.46e-15 ***
## wind -0.0602448 0.0446149 -1.350 0.17883
## sqrt_avg_annual_precip -0.0885253 0.0295319 -2.998 0.00316 **
## sqrt_distance_to_coast -0.0122956 0.0046091 -2.668 0.00843 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4157 on 159 degrees of freedom
## Multiple R-squared: 0.8969, Adjusted R-squared: 0.8937
## F-statistic: 276.7 on 5 and 159 DF, p-value: < 2.2e-16