This tutorial builds on the descriptive statistical analysis of Part 1 by performing statistical modeling to understand the differences between sensor measurements and NOAA reference data. We’ll use Bland-Altman plots, paired t-tests, collect additional meteorological variables, and build linear regression models to identify factors that influence measurement errors.
# Install additional packages if needed
# install.packages(c("ggExtra", "broom", "car", "lmtest", "ggpubr"))
library(dplyr)
library(ggplot2)
library(plotly)
library(leaflet)
library(lubridate)
library(ggExtra) # For marginal plots
library(broom) # For tidy model output
library(car) # For regression diagnostics
library(lmtest) # For statistical tests
library(ggpubr) # For publication-ready plots
library(httr)
library(jsonlite)
library(readr)
# Assuming you have the comparison_data from Part 1 saved as "CDT_Weather_Comparison.csv" in your working directory (part 9 from the first the activity).
# If not, you'll need to run the first activity again first to generate it. Make sure to (uncomment and then) run the code in part 9 that saves the data to the csv file. This code is: write.csv(comparison_data, "CDT_Weather_Comparison.csv", row.names = FALSE)
#or, optionally you can just download this csv file from canvas and put it in your working directory.
comparison_data <- read_csv("CDT_Weather_Comparison.csv")
# For this tutorial, we'll assume comparison_data exists
cat("Data dimensions:", nrow(comparison_data), "rows x", ncol(comparison_data), "columns\n")
## Data dimensions: 100 rows x 26 columns
cat("Variables available:\n")
## Variables available:
print(names(comparison_data))
## [1] "TimeStampFull" "Date" "Time"
## [4] "Lat" "Lon" "alt_m"
## [7] "Temp_F" "Hum" "Press"
## [10] "DP" "Device" "Temp_C"
## [13] "NOAA_Temp_F" "NOAA_Humidity" "NOAA_Pressure_hPa"
## [16] "Temp_Diff" "Temp_Pct_Diff" "Temp_Abs_Diff"
## [19] "Hum_Diff" "Hum_Abs_Diff" "Press_Diff"
## [22] "Press_Abs_Diff" "Temp_Error_Category" "Temp_Error_Type"
## [25] "Hum_Error_Type" "Press_Error_Type"
Bland-Altman plots are the gold standard for assessing agreement between two measurement methods.
# Calculate mean and difference for Bland-Altman
comparison_data <- comparison_data %>%
mutate(
Temp_Mean = (Temp_F + NOAA_Temp_F) / 2,
Hum_Mean = (Hum + NOAA_Humidity) / 2,
Press_Mean = (Press + NOAA_Pressure_hPa) / 2
)
# Calculate bias and limits of agreement
temp_bias <- mean(comparison_data$Temp_Diff, na.rm = TRUE)
temp_sd <- sd(comparison_data$Temp_Diff, na.rm = TRUE)
temp_upper_loa <- temp_bias + 1.96 * temp_sd
temp_lower_loa <- temp_bias - 1.96 * temp_sd
# Create Bland-Altman plot
ba_temp <- ggplot(comparison_data, aes(x = Temp_Mean, y = Temp_Diff)) +
geom_point(aes(color = Device), alpha = 0.6, size = 3) +
geom_hline(yintercept = temp_bias, linetype = "solid", color = "blue", linewidth = 1.2) +
geom_hline(yintercept = temp_upper_loa, linetype = "dashed", color = "red", linewidth = 1) +
geom_hline(yintercept = temp_lower_loa, linetype = "dashed", color = "red", linewidth = 1) +
geom_hline(yintercept = 0, linetype = "dotted", color = "black", linewidth = 0.8) +
annotate("text", x = min(comparison_data$Temp_Mean), y = temp_bias,
label = paste0("Bias = ", round(temp_bias, 2), "°F"), hjust = 0, vjust = -0.5) +
annotate("text", x = min(comparison_data$Temp_Mean), y = temp_upper_loa,
label = paste0("+1.96 SD = ", round(temp_upper_loa, 2), "°F"), hjust = 0, vjust = -0.5) +
annotate("text", x = min(comparison_data$Temp_Mean), y = temp_lower_loa,
label = paste0("-1.96 SD = ", round(temp_lower_loa, 2), "°F"), hjust = 0, vjust = 1.5) +
labs(title = "Bland-Altman Plot: Temperature",
subtitle = "Agreement between Sensor and NOAA measurements",
x = "Mean Temperature [(Sensor + NOAA) / 2] (°F)",
y = "Difference (Sensor - NOAA) (°F)") +
theme_minimal() +
scale_color_manual(values = c("Red" = "red", "Purple" = "purple"))
# Add marginal density plots
ggMarginal(ba_temp, type = "density", groupColour = TRUE, groupFill = TRUE)
Interpretation: - Blue line: Mean bias (systematic difference) - Red dashed lines: 95% limits of agreement (±1.96 SD) - Points within limits: Good agreement (95% of differences should fall within these limits) - Pattern detection: Look for trends or heteroscedasticity
# Calculate bias and limits for humidity
hum_bias <- mean(comparison_data$Hum_Diff, na.rm = TRUE)
hum_sd <- sd(comparison_data$Hum_Diff, na.rm = TRUE)
hum_upper_loa <- hum_bias + 1.96 * hum_sd
hum_lower_loa <- hum_bias - 1.96 * hum_sd
ba_hum <- ggplot(comparison_data, aes(x = Hum_Mean, y = Hum_Diff)) +
geom_point(aes(color = Device), alpha = 0.6, size = 3) +
geom_hline(yintercept = hum_bias, linetype = "solid", color = "blue", linewidth = 1.2) +
geom_hline(yintercept = hum_upper_loa, linetype = "dashed", color = "red", linewidth = 1) +
geom_hline(yintercept = hum_lower_loa, linetype = "dashed", color = "red", linewidth = 1) +
geom_hline(yintercept = 0, linetype = "dotted", color = "black", linewidth = 0.8) +
annotate("text", x = min(comparison_data$Hum_Mean), y = hum_bias,
label = paste0("Bias = ", round(hum_bias, 2), "%"), hjust = 0, vjust = -0.5) +
annotate("text", x = min(comparison_data$Hum_Mean), y = hum_upper_loa,
label = paste0("+1.96 SD = ", round(hum_upper_loa, 2), "%"), hjust = 0, vjust = -0.5) +
annotate("text", x = min(comparison_data$Hum_Mean), y = hum_lower_loa,
label = paste0("-1.96 SD = ", round(hum_lower_loa, 2), "%"), hjust = 0, vjust = 1.5) +
labs(title = "Bland-Altman Plot: Humidity",
x = "Mean Humidity [(Sensor + NOAA) / 2] (%)",
y = "Difference (Sensor - NOAA) (%)") +
theme_minimal() +
scale_color_manual(values = c("Red" = "red", "Purple" = "purple"))
ggMarginal(ba_hum, type = "density", groupColour = TRUE, groupFill = TRUE)
Exercise 1: Create a Bland-Altman plot for pressure following the same pattern as temperature and humidity above. Copy and paste the code from the example above into the code chunk below and modify it accordingly.
#put code for Excercise 1 below
# Create summary table
ba_summary <- data.frame(
Variable = c("Temperature (°F)", "Humidity (%)", "Pressure (hPa)"),
Bias = c(temp_bias, hum_bias, mean(comparison_data$Press_Diff, na.rm = TRUE)),
SD = c(temp_sd, hum_sd, sd(comparison_data$Press_Diff, na.rm = TRUE)),
Lower_LOA = c(temp_lower_loa, hum_lower_loa,
mean(comparison_data$Press_Diff, na.rm = TRUE) -
1.96 * sd(comparison_data$Press_Diff, na.rm = TRUE)),
Upper_LOA = c(temp_upper_loa, hum_upper_loa,
mean(comparison_data$Press_Diff, na.rm = TRUE) +
1.96 * sd(comparison_data$Press_Diff, na.rm = TRUE))
)
knitr::kable(ba_summary, digits = 2,
caption = "Bland-Altman Analysis: Bias and Limits of Agreement",
col.names = c("Variable", "Bias", "SD", "Lower LOA", "Upper LOA"))
| Variable | Bias | SD | Lower LOA | Upper LOA |
|---|---|---|---|---|
| Temperature (°F) | -0.08 | 9.19 | -18.10 | 17.94 |
| Humidity (%) | 5.02 | 14.92 | -24.23 | 34.27 |
| Pressure (hPa) | -6.22 | 7.15 | -20.23 | 7.80 |
Paired t-tests assess whether there is a statistically significant systematic difference between sensor and NOAA measurements.
# Perform paired t-test
temp_ttest <- t.test(comparison_data$Temp_F, comparison_data$NOAA_Temp_F,
paired = TRUE, alternative = "two.sided")
# Calculate effect size (Cohen's d)
temp_cohens_d <- temp_bias / temp_sd
cat("Temperature Paired T-Test Results:\n")
## Temperature Paired T-Test Results:
cat("=====================================\n")
## =====================================
cat("Mean difference (Bias):", round(temp_bias, 3), "°F\n")
## Mean difference (Bias): -0.08 °F
cat("t-statistic:", round(temp_ttest$statistic, 3), "\n")
## t-statistic: -0.087
cat("Degrees of freedom:", temp_ttest$parameter, "\n")
## Degrees of freedom: 99
cat("p-value:", format.pval(temp_ttest$p.value, digits = 3), "\n")
## p-value: 0.931
cat("95% CI:", round(temp_ttest$conf.int[1], 3), "to", round(temp_ttest$conf.int[2], 3), "°F\n")
## 95% CI: -1.904 to 1.744 °F
cat("Cohen's d (effect size):", round(temp_cohens_d, 3), "\n")
## Cohen's d (effect size): -0.009
cat("\nInterpretation:\n")
##
## Interpretation:
if (temp_ttest$p.value < 0.05) {
cat("There IS a statistically significant difference between sensor and NOAA temperatures (p < 0.05).\n")
} else {
cat("There is NO statistically significant difference between sensor and NOAA temperatures (p >= 0.05).\n")
}
## There is NO statistically significant difference between sensor and NOAA temperatures (p >= 0.05).
# Humidity t-test
hum_ttest <- t.test(comparison_data$Hum, comparison_data$NOAA_Humidity,
paired = TRUE, alternative = "two.sided")
# Pressure t-test
press_ttest <- t.test(comparison_data$Press, comparison_data$NOAA_Pressure_hPa,
paired = TRUE, alternative = "two.sided")
# Create summary table
ttest_summary <- data.frame(
Variable = c("Temperature", "Humidity", "Pressure"),
Mean_Diff = c(temp_bias, hum_bias, mean(comparison_data$Press_Diff, na.rm = TRUE)),
t_statistic = c(temp_ttest$statistic, hum_ttest$statistic, press_ttest$statistic),
df = c(temp_ttest$parameter, hum_ttest$parameter, press_ttest$parameter),
p_value = c(temp_ttest$p.value, hum_ttest$p.value, press_ttest$p.value),
CI_Lower = c(temp_ttest$conf.int[1], hum_ttest$conf.int[1], press_ttest$conf.int[1]),
CI_Upper = c(temp_ttest$conf.int[2], hum_ttest$conf.int[2], press_ttest$conf.int[2]),
Significant = c(temp_ttest$p.value < 0.05, hum_ttest$p.value < 0.05, press_ttest$p.value < 0.05)
)
knitr::kable(ttest_summary, digits = 4,
caption = "Paired T-Test Results Summary",
col.names = c("Variable", "Mean Diff", "t", "df", "p-value", "CI Lower", "CI Upper", "Significant (α=0.05)"))
| Variable | Mean Diff | t | df | p-value | CI Lower | CI Upper | Significant (α=0.05) |
|---|---|---|---|---|---|---|---|
| Temperature | -0.080 | -0.0870 | 99 | 0.9308 | -1.9038 | 1.7438 | FALSE |
| Humidity | 5.020 | 3.3637 | 99 | 0.0011 | 2.0587 | 7.9813 | TRUE |
| Pressure | -6.216 | -8.6941 | 99 | 0.0000 | -7.6347 | -4.7973 | TRUE |
# Test each device separately
red_data <- comparison_data %>% filter(Device == "Red")
purple_data <- comparison_data %>% filter(Device == "Purple")
if (nrow(red_data) > 1) {
red_temp_ttest <- t.test(red_data$Temp_F, red_data$NOAA_Temp_F, paired = TRUE)
cat("Red Device Temperature T-Test:\n")
cat("Mean difference:", round(mean(red_data$Temp_Diff, na.rm = TRUE), 3), "°F\n")
cat("p-value:", format.pval(red_temp_ttest$p.value, digits = 3), "\n\n")
}
## Red Device Temperature T-Test:
## Mean difference: -0.829 °F
## p-value: 0.6
if (nrow(purple_data) > 1) {
purple_temp_ttest <- t.test(purple_data$Temp_F, purple_data$NOAA_Temp_F, paired = TRUE)
cat("Purple Device Temperature T-Test:\n")
cat("Mean difference:", round(mean(purple_data$Temp_Diff, na.rm = TRUE), 3), "°F\n")
cat("p-value:", format.pval(purple_temp_ttest$p.value, digits = 3), "\n")
}
## Purple Device Temperature T-Test:
## Mean difference: 0.836 °F
## p-value: 0.232
Exercise 2:
Interpret the t-test results. Write your interpretation below each question part below.
2a) From 4.1 and 4.2, which weather variable(s) (temp, pressure, humidity) were significant?
2b) From 4.3, which device (if any) shows systematic bias for temperature?
2c) Modify the code from 4.3 do a paired t test for each device but for one of the other variables (pressure or humidity). Then interpret your results.
#put your modified 4.3 code here:
Let’s fetch additional meteorological variables that might explain measurement differences.
get_extended_weather_data <- function(lat, lon, datetime, debug = FALSE) {
date <- as.Date(datetime)
base_url <- "https://archive-api.open-meteo.com/v1/archive"
url <- paste0(
base_url,
"?latitude=", lat,
"&longitude=", lon,
"&start_date=", date,
"&end_date=", date,
"&hourly=cloudcover,shortwave_radiation,windspeed_10m,winddirection_10m,precipitation",
"&temperature_unit=fahrenheit"
)
default_result <- list(
cloudcover = NA_real_,
shortwave_radiation = NA_real_,
windspeed_10m = NA_real_,
winddirection_10m = NA_real_,
precipitation = NA_real_
)
result <- tryCatch({
response <- GET(url)
if (status_code(response) == 200) {
data <- content(response, "parsed")
if (is.null(data$hourly)) {
return(default_result)
}
hourly_data <- data$hourly
target_hour <- hour(datetime)
idx <- target_hour + 1
# Extract values with NULL checks
cloudcover_val <- if (!is.null(hourly_data$cloudcover) && idx <= length(hourly_data$cloudcover)) {
val <- hourly_data$cloudcover[idx]
if (!is.null(val)) as.numeric(val) else NA_real_
} else NA_real_
radiation_val <- if (!is.null(hourly_data$shortwave_radiation) && idx <= length(hourly_data$shortwave_radiation)) {
val <- hourly_data$shortwave_radiation[idx]
if (!is.null(val)) as.numeric(val) else NA_real_
} else NA_real_
windspeed_val <- if (!is.null(hourly_data$windspeed_10m) && idx <= length(hourly_data$windspeed_10m)) {
val <- hourly_data$windspeed_10m[idx]
if (!is.null(val)) as.numeric(val) else NA_real_
} else NA_real_
winddir_val <- if (!is.null(hourly_data$winddirection_10m) && idx <= length(hourly_data$winddirection_10m)) {
val <- hourly_data$winddirection_10m[idx]
if (!is.null(val)) as.numeric(val) else NA_real_
} else NA_real_
precip_val <- if (!is.null(hourly_data$precipitation) && idx <= length(hourly_data$precipitation)) {
val <- hourly_data$precipitation[idx]
if (!is.null(val)) as.numeric(val) else NA_real_
} else NA_real_
return(list(
cloudcover = cloudcover_val,
shortwave_radiation = radiation_val,
windspeed_10m = windspeed_val,
winddirection_10m = winddir_val,
precipitation = precip_val
))
} else {
return(default_result)
}
}, error = function(e) {
if (debug) cat("Error:", e$message, "\n")
return(default_result)
})
return(result)
}
OpenMeteo has other NOAA data variables we can add to our dataset. This might be helpful to investigate if any of the differences in weather data are dependent on any of these environmental variables.
# Initialize columns
comparison_data$cloudcover <- NA
comparison_data$solar_radiation <- NA
comparison_data$windspeed <- NA
comparison_data$wind_direction <- NA
comparison_data$precipitation <- NA
cat("Fetching extended weather data...\n")
## Fetching extended weather data...
cat("NOTE: Due to API rate limits, we'll fetch with longer delays.\n\n")
## NOTE: Due to API rate limits, we'll fetch with longer delays.
success_count <- 0
for (i in 1:nrow(comparison_data)) {
if (i %% 5 == 0) {
cat("Processing point", i, "of", nrow(comparison_data), "\n")
}
ext_data <- get_extended_weather_data(
comparison_data$Lat[i],
comparison_data$Lon[i],
comparison_data$TimeStampFull[i]
)
# Safely assign with length checks
if (length(ext_data$cloudcover) > 0) {
comparison_data$cloudcover[i] <- ext_data$cloudcover
}
if (length(ext_data$shortwave_radiation) > 0) {
comparison_data$solar_radiation[i] <- ext_data$shortwave_radiation
}
if (length(ext_data$windspeed_10m) > 0) {
comparison_data$windspeed[i] <- ext_data$windspeed_10m
}
if (length(ext_data$winddirection_10m) > 0) {
comparison_data$wind_direction[i] <- ext_data$winddirection_10m
}
if (length(ext_data$precipitation) > 0) {
comparison_data$precipitation[i] <- ext_data$precipitation
}
if (!is.na(comparison_data$cloudcover[i])) {
success_count <- success_count + 1
}
Sys.sleep(0.5) # delay to respect rate limits. If you download too fast or too much data your IP address will get cut off for the day!
}
## Processing point 5 of 100
## Processing point 10 of 100
## Processing point 15 of 100
## Processing point 20 of 100
## Processing point 25 of 100
## Processing point 30 of 100
## Processing point 35 of 100
## Processing point 40 of 100
## Processing point 45 of 100
## Processing point 50 of 100
## Processing point 55 of 100
## Processing point 60 of 100
## Processing point 65 of 100
## Processing point 70 of 100
## Processing point 75 of 100
## Processing point 80 of 100
## Processing point 85 of 100
## Processing point 90 of 100
## Processing point 95 of 100
## Processing point 100 of 100
cat("\nExtended data fetch complete!\n")
##
## Extended data fetch complete!
cat("Successfully retrieved:", success_count, "of", nrow(comparison_data), "points\n")
## Successfully retrieved: 100 of 100 points
# Cloud cover vs temperature error
p1 <- ggplot(comparison_data %>% filter(!is.na(cloudcover)),
aes(x = cloudcover, y = Temp_Diff, color = Device)) +
geom_point(alpha = 0.6, size = 3) +
geom_smooth(method = "loess", se = TRUE) +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(title = "Temperature Error vs Cloud Cover",
x = "Cloud Cover (%)",
y = "Temperature Error (°F)") +
theme_minimal() +
scale_color_manual(values = c("Red" = "red", "Purple" = "purple"))
# Solar radiation vs temperature error
p2 <- ggplot(comparison_data %>% filter(!is.na(solar_radiation)),
aes(x = solar_radiation, y = Temp_Diff, color = Device)) +
geom_point(alpha = 0.6, size = 3) +
geom_smooth(method = "loess", se = TRUE) +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(title = "Temperature Error vs Solar Radiation",
x = "Solar Radiation (W/m²)",
y = "Temperature Error (°F)") +
theme_minimal() +
scale_color_manual(values = c("Red" = "red", "Purple" = "purple"))
# Wind speed vs temperature error
p3 <- ggplot(comparison_data %>% filter(!is.na(windspeed)),
aes(x = windspeed, y = Temp_Diff, color = Device)) +
geom_point(alpha = 0.6, size = 3) +
geom_smooth(method = "loess", se = TRUE) +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(title = "Temperature Error vs Wind Speed",
x = "Wind Speed (km/h)",
y = "Temperature Error (°F)") +
theme_minimal() +
scale_color_manual(values = c("Red" = "red", "Purple" = "purple"))
print(p1)
print(p2)
print(p3)
# Create color palette for cloud cover
cloud_pal <- colorNumeric(
palette = "Blues",
domain = comparison_data$cloudcover,
na.color = "transparent"
)
# Create map
env_map <- leaflet(comparison_data %>% filter(!is.na(cloudcover))) %>%
addProviderTiles("Esri.WorldImagery", group = "Satellite") %>%
addProviderTiles(providers$Esri.WorldTopoMap, group = "Topo") %>%
# Cloud cover markers
addCircleMarkers(
lng = ~Lon,
lat = ~Lat,
radius = 6,
color = ~cloud_pal(cloudcover),
fillOpacity = 0.8,
stroke = TRUE,
weight = 1,
group = "Cloud Cover",
label = ~paste0("Cloud Cover: ", round(cloudcover, 0), "%"),
popup = ~paste0(
"<b>Environmental Conditions</b><br>",
"Cloud Cover: ", round(cloudcover, 0), "%<br>",
"Solar Radiation: ", round(solar_radiation, 0), " W/m²<br>",
"Wind Speed: ", round(windspeed, 1), " km/h<br>",
"Temperature Error: ", round(Temp_Diff, 1), "°F"
)
) %>%
# Solar radiation markers
addCircleMarkers(
lng = ~Lon,
lat = ~Lat,
radius = ~scales::rescale(solar_radiation, to = c(3, 10)),
color = "orange",
fillOpacity = 0.7,
stroke = TRUE,
weight = 1,
group = "Solar Radiation",
label = ~paste0("Solar: ", round(solar_radiation, 0), " W/m²"),
popup = ~paste0(
"<b>Solar Radiation</b><br>",
"Value: ", round(solar_radiation, 0), " W/m²<br>",
"Cloud Cover: ", round(cloudcover, 0), "%<br>",
"Temperature Error: ", round(Temp_Diff, 1), "°F"
)
) %>%
addLayersControl(
baseGroups = c("Satellite", "Topo"),
overlayGroups = c("Cloud Cover", "Solar Radiation"),
options = layersControlOptions(collapsed = FALSE)
) %>%
hideGroup("Solar Radiation") %>%
addLegend(
position = "bottomright",
pal = cloud_pal,
values = ~cloudcover,
title = "Cloud Cover (%)",
group = "Cloud Cover"
)
env_map
Build models to predict temperature error based on multiple factors.
# Add time-based variables
model_data <- comparison_data %>%
filter(!is.na(Temp_Diff)) %>%
mutate(
Hour = hour(TimeStampFull),
Hour_sin = sin(2 * pi * Hour / 24), # Circular encoding of hour
Hour_cos = cos(2 * pi * Hour / 24),
DayOfYear = yday(Date)
)
cat("Modeling dataset:", nrow(model_data), "observations\n")
## Modeling dataset: 100 observations
cat("Complete cases (no missing values):", sum(complete.cases(model_data)), "\n")
## Complete cases (no missing values): 100
model1 <- lm(Temp_Diff ~ Lat + alt_m + Hour_sin + Hour_cos + Device,
data = model_data)
summary(model1)
##
## Call:
## lm(formula = Temp_Diff ~ Lat + alt_m + Hour_sin + Hour_cos +
## Device, data = model_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.934 -3.241 0.131 4.417 19.027
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.523e+03 2.985e+02 -5.102 1.74e-06 ***
## Lat 3.571e+01 7.132e+00 5.007 2.58e-06 ***
## alt_m -2.007e-03 4.519e-03 -0.444 0.6579
## Hour_sin -2.086e+00 1.272e+00 -1.639 0.1045
## Hour_cos 2.505e+00 1.165e+00 2.150 0.0341 *
## DeviceRed -4.772e-01 1.634e+00 -0.292 0.7709
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.996 on 94 degrees of freedom
## Multiple R-squared: 0.2814, Adjusted R-squared: 0.2432
## F-statistic: 7.363 on 5 and 94 DF, p-value: 7.309e-06
# Tidy output
model1_tidy <- tidy(model1, conf.int = TRUE)
knitr::kable(model1_tidy, digits = 4,
caption = "Model 1: Location and Time Effects on Temperature Error")
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | -1523.0441 | 298.5085 | -5.1022 | 0.0000 | -2115.7397 | -930.3485 |
| Lat | 35.7100 | 7.1320 | 5.0070 | 0.0000 | 21.5492 | 49.8708 |
| alt_m | -0.0020 | 0.0045 | -0.4442 | 0.6579 | -0.0110 | 0.0070 |
| Hour_sin | -2.0858 | 1.2723 | -1.6394 | 0.1045 | -4.6119 | 0.4403 |
| Hour_cos | 2.5052 | 1.1652 | 2.1501 | 0.0341 | 0.1917 | 4.8186 |
| DeviceRed | -0.4772 | 1.6343 | -0.2920 | 0.7709 | -3.7222 | 2.7678 |
# Filter to complete cases for fair comparison
model_data_complete <- model_data %>%
filter(!is.na(cloudcover) & !is.na(solar_radiation) & !is.na(windspeed))
if (nrow(model_data_complete) > 10) {
model2 <- lm(Temp_Diff ~ Lat + alt_m + Hour_sin + Hour_cos + Device +
cloudcover + solar_radiation + windspeed,
data = model_data_complete)
summary(model2)
model2_tidy <- tidy(model2, conf.int = TRUE)
knitr::kable(model2_tidy, digits = 4,
caption = "Model 2: Location, Time, and Environmental Effects")
# Compare models
cat("\nModel Comparison:\n")
cat("Model 1 R-squared:", round(summary(model1)$r.squared, 4), "\n")
cat("Model 1 Adjusted R-squared:", round(summary(model1)$adj.r.squared, 4), "\n")
cat("Model 2 R-squared:", round(summary(model2)$r.squared, 4), "\n")
cat("Model 2 Adjusted R-squared:", round(summary(model2)$adj.r.squared, 4), "\n")
} else {
cat("Insufficient complete cases for Model 2.\n")
cat("Available complete cases:", nrow(model_data_complete), "\n")
}
##
## Model Comparison:
## Model 1 R-squared: 0.2814
## Model 1 Adjusted R-squared: 0.2432
## Model 2 R-squared: 0.3561
## Model 2 Adjusted R-squared: 0.2994
Exercise 3: Add Interaction Term(s)
In Model 2 above, add in one or more interaction term to the model to see if the Device makes any difference on the variable coefficients. Try with the most significant variables first. Example: add a Device*Lat variable. Copy model2 code above and paste into code chunk below, then and add in your interaction term. Rename your model the name model3 and then also compute R^2 and adjusted R^2 for model3
Before proceeding, let’s check for problematic observations that might be affecting our model.
# Check for multicollinearity
vif_values <- vif(model1)
cat("Variance Inflation Factors (VIF):\n")
## Variance Inflation Factors (VIF):
print(round(vif_values, 2))
## Lat alt_m Hour_sin Hour_cos Device
## 1.43 1.40 1.02 1.02 1.03
cat("\nNote: VIF > 10 indicates problematic multicollinearity\n\n")
##
## Note: VIF > 10 indicates problematic multicollinearity
# Normality test
shapiro_test <- shapiro.test(residuals(model1))
shapiro_test2 <- shapiro.test(residuals(model2))
cat("Shapiro-Wilk Normality Test:\n")
## Shapiro-Wilk Normality Test:
cat("W statistic:", round(shapiro_test$statistic, 4), "\n")
## W statistic: 0.9523
cat("p-value:", format.pval(shapiro_test$p.value, digits = 3), "\n")
## p-value: 0.00119
if (shapiro_test$p.value < 0.05) {
cat("WARNING: Residuals are NOT normally distributed (p < 0.05)\n\n")
} else {
cat("Residuals appear normally distributed (p >= 0.05)\n\n")
}
## WARNING: Residuals are NOT normally distributed (p < 0.05)
# Heteroscedasticity test
bp_test <- bptest(model1)
bp_test2 <- bptest(model2)
cat("Breusch-Pagan Heteroscedasticity Test:\n")
## Breusch-Pagan Heteroscedasticity Test:
cat("BP statistic:", round(bp_test$statistic, 4), "\n")
## BP statistic: 39.2025
cat("p-value:", format.pval(bp_test$p.value, digits = 3), "\n")
## p-value: 2.16e-07
if (bp_test$p.value < 0.05) {
cat("WARNING: Heteroscedasticity detected (p < 0.05)\n\n")
} else {
cat("Homoscedasticity assumption met (p >= 0.05)\n\n")
}
## WARNING: Heteroscedasticity detected (p < 0.05)
Interpretation: If we see violations of normality or homoscedasticity, this often indicates the presence of outliers or influential points. Let’s investigate!
Cook’s Distance measures how much a single observation influences the entire regression model. We will calculate this for model1
# Calculate Cook's distance
model_data$cooks_d <- cooks.distance(model1)
model_data$residuals <- residuals(model1)
model_data$predicted <- predict(model1)
# Find influential points (Cook's D > 4/n is a common threshold)
threshold <- 4 / nrow(model_data)
influential_points <- model_data %>%
filter(cooks_d > threshold) %>%
arrange(desc(cooks_d))
# Create influence plot
ggplot(model_data, aes(x = seq_along(cooks_d), y = cooks_d, color = Device)) +
geom_point(alpha = 0.6, size = 3) +
geom_hline(yintercept = threshold, linetype = "dashed", color = "red", linewidth = 1.2) +
annotate("text", x = nrow(model_data) * 0.7, y = threshold,
label = paste0("Threshold (4/n) = ", round(threshold, 5)),
vjust = -0.5, color = "red") +
labs(title = "Cook's Distance - Influence Plot",
subtitle = "Points above red line are potentially influential",
x = "Observation Index",
y = "Cook's Distance") +
theme_minimal() +
scale_color_manual(values = c("Red" = "red", "Purple" = "purple"))
cat("\nINFLUENTIAL POINTS ANALYSIS\n")
##
## INFLUENTIAL POINTS ANALYSIS
cat("===========================\n")
## ===========================
cat("Threshold (4/n):", round(threshold, 5), "\n")
## Threshold (4/n): 0.04
cat("Number of influential points:", nrow(influential_points), "\n\n")
## Number of influential points: 8
if (nrow(influential_points) > 0) {
cat("Top 10 most influential observations:\n")
influential_summary <- influential_points %>%
select(TimeStampFull, Device, Temp_F, NOAA_Temp_F, Temp_Diff,
Temp_Abs_Diff, alt_m, cooks_d) %>%
head(10)
print(knitr::kable(influential_summary, digits = 3,
caption = "Most Influential Points"))
# Visualize the influential points
cat("\n")
cat("Characteristics of influential points:\n")
cat("- Mean temperature difference:", round(mean(influential_points$Temp_Diff), 2), "°F\n")
cat("- Mean absolute error:", round(mean(influential_points$Temp_Abs_Diff), 2), "°F\n")
cat("- Compare to overall mean abs error:", round(mean(model_data$Temp_Abs_Diff), 2), "°F\n")
}
## Top 10 most influential observations:
##
##
## Table: Most Influential Points
##
## |TimeStampFull |Device | Temp_F| NOAA_Temp_F| Temp_Diff| Temp_Abs_Diff| alt_m| cooks_d|
## |:-------------------|:------|------:|-----------:|---------:|-------------:|-----:|-------:|
## |2025-09-11 21:19:44 |Red | 34.70| 64.8| -30.10| 30.10| 2871| 0.110|
## |2025-09-11 23:32:56 |Red | 34.70| 63.4| -28.70| 28.70| 2871| 0.101|
## |2025-09-11 23:52:20 |Red | 34.70| 63.3| -28.60| 28.60| 2871| 0.100|
## |2025-09-11 18:11:56 |Red | 34.70| 64.0| -29.30| 29.30| 2871| 0.089|
## |2025-09-11 10:36:21 |Red | 44.06| 38.0| 6.06| 6.06| 2877| 0.070|
## |2025-09-11 17:16:44 |Red | 34.70| 57.9| -23.20| 23.20| 2871| 0.044|
## |2025-09-08 20:51:17 |Red | 78.98| 62.0| 16.98| 16.98| 2758| 0.042|
## |2025-09-11 17:10:02 |Red | 34.70| 57.6| -22.90| 22.90| 2871| 0.042|
##
## Characteristics of influential points:
## - Mean temperature difference: -17.47 °F
## - Mean absolute error: 23.23 °F
## - Compare to overall mean abs error: 6.31 °F
# Scatter plot highlighting influential points
model_data$is_influential <- model_data$cooks_d > threshold
ggplot(model_data, aes(x = Temp_Mean, y = Temp_Diff)) +
geom_point(aes(color = is_influential, size = is_influential, alpha = is_influential)) +
geom_hline(yintercept = 0, linetype = "dashed", color = "black") +
scale_color_manual(values = c("FALSE" = "gray60", "TRUE" = "red"),
labels = c("Normal", "Influential"),
name = "Point Type") +
scale_size_manual(values = c("FALSE" = 2, "TRUE" = 4),
labels = c("Normal", "Influential"),
name = "Point Type") +
scale_alpha_manual(values = c("FALSE" = 0.5, "TRUE" = 1),
labels = c("Normal", "Influential"),
name = "Point Type") +
labs(title = "Temperature Errors with Influential Points Highlighted",
subtitle = paste("Red points exceed Cook's D threshold of", round(threshold, 5)),
x = "Mean Temperature (°F)",
y = "Temperature Difference (Sensor - NOAA) (°F)") +
theme_minimal()
# Box plot comparing influential vs non-influential
ggplot(model_data, aes(x = is_influential, y = Temp_Abs_Diff, fill = is_influential)) +
geom_boxplot() +
scale_fill_manual(values = c("FALSE" = "lightblue", "TRUE" = "red"),
labels = c("Normal Points", "Influential Points"),
name = "") +
labs(title = "Absolute Temperature Error: Normal vs Influential Points",
x = "Point Type",
y = "Absolute Temperature Error (°F)") +
scale_x_discrete(labels = c("FALSE" = "Normal", "TRUE" = "Influential")) +
theme_minimal()
Now let’s remove the influential points and see how it affects our models.
cat("DATA CLEANING DECISION\n")
## DATA CLEANING DECISION
cat("======================\n")
## ======================
cat("Original dataset:", nrow(model_data), "observations\n")
## Original dataset: 100 observations
cat("Influential points to remove:", sum(model_data$is_influential), "observations\n")
## Influential points to remove: 8 observations
cat("Cleaned dataset:", sum(!model_data$is_influential), "observations\n")
## Cleaned dataset: 92 observations
cat("Percentage removed:", round(sum(model_data$is_influential) / nrow(model_data) * 100, 2), "%\n\n")
## Percentage removed: 8 %
# Create cleaned dataset
model_data_clean <- model_data %>%
filter(!is_influential)
# Refit the model with cleaned data
model1_clean <- lm(Temp_Diff ~ Lat + alt_m + Hour_sin + Hour_cos + Device,
data = model_data_clean)
model2_clean <- lm(Temp_Diff ~ Lat + alt_m + Hour_sin + Hour_cos + Device +
cloudcover + solar_radiation + windspeed,
data = model_data_clean)
cat("MODEL COMPARISON: BEFORE vs AFTER REMOVING INFLUENTIAL POINTS\n")
## MODEL COMPARISON: BEFORE vs AFTER REMOVING INFLUENTIAL POINTS
cat("=============================================================\n\n")
## =============================================================
cat("ORIGINAL MODEL 1 (with influential points):\n")
## ORIGINAL MODEL 1 (with influential points):
cat("R-squared:", round(summary(model1)$r.squared, 4), "\n")
## R-squared: 0.2814
cat("Adjusted R-squared:", round(summary(model1)$adj.r.squared, 4), "\n")
## Adjusted R-squared: 0.2432
cat("RMSE:", round(sqrt(mean(residuals(model1)^2)), 4), "°F\n")
## RMSE: 7.7523 °F
cat("Residual Std Error:", round(summary(model1)$sigma, 4), "°F\n\n")
## Residual Std Error: 7.9959 °F
cat("CLEANED MODEL 1 (influential points removed):\n")
## CLEANED MODEL 1 (influential points removed):
cat("R-squared:", round(summary(model1_clean)$r.squared, 4), "\n")
## R-squared: 0.3849
cat("Adjusted R-squared:", round(summary(model1_clean)$adj.r.squared, 4), "\n")
## Adjusted R-squared: 0.3491
cat("RMSE:", round(sqrt(mean(residuals(model1_clean)^2)), 4), "°F\n")
## RMSE: 4.755 °F
cat("Residual Std Error:", round(summary(model1_clean)$sigma, 4), "°F\n\n")
## Residual Std Error: 4.9181 °F
cat("ORIGINAL MODEL 2 (with influential points):\n")
## ORIGINAL MODEL 2 (with influential points):
cat("R-squared:", round(summary(model2)$r.squared, 4), "\n")
## R-squared: 0.3561
cat("Adjusted R-squared:", round(summary(model2)$adj.r.squared, 4), "\n")
## Adjusted R-squared: 0.2994
cat("RMSE:", round(sqrt(mean(residuals(model2)^2)), 4), "°F\n")
## RMSE: 7.3388 °F
cat("Residual Std Error:", round(summary(model2)$sigma, 4), "°F\n\n")
## Residual Std Error: 7.6931 °F
cat("CLEANED MODEL 2 (influential points removed):\n")
## CLEANED MODEL 2 (influential points removed):
cat("R-squared:", round(summary(model2_clean)$r.squared, 4), "\n")
## R-squared: 0.4463
cat("Adjusted R-squared:", round(summary(model2_clean)$adj.r.squared, 4), "\n")
## Adjusted R-squared: 0.393
cat("RMSE:", round(sqrt(mean(residuals(model2_clean)^2)), 4), "°F\n")
## RMSE: 4.5113 °F
cat("Residual Std Error:", round(summary(model2_clean)$sigma, 4), "°F\n\n")
## Residual Std Error: 4.7496 °F
# Improvement metrics
r2_improvement <- (summary(model1_clean)$r.squared - summary(model1)$r.squared) * 100
rmse_improvement <- ((sqrt(mean(residuals(model1)^2)) - sqrt(mean(residuals(model1_clean)^2))) /
sqrt(mean(residuals(model1)^2))) * 100
cat("IMPROVEMENTS MODEL 1:\n")
## IMPROVEMENTS MODEL 1:
cat("R-squared increase:", round(r2_improvement, 2), "percentage points\n")
## R-squared increase: 10.35 percentage points
cat("RMSE reduction:", round(rmse_improvement, 2), "%\n")
## RMSE reduction: 38.66 %
r2_improvement2 <- (summary(model2_clean)$r.squared - summary(model2)$r.squared) * 100
rmse_improvement2 <- ((sqrt(mean(residuals(model2)^2)) - sqrt(mean(residuals(model2_clean)^2))) /
sqrt(mean(residuals(model2)^2))) * 100
cat("IMPROVEMENTS MODEL 2:\n")
## IMPROVEMENTS MODEL 2:
cat("R-squared increase:", round(r2_improvement2, 2), "percentage points\n")
## R-squared increase: 9.03 percentage points
cat("RMSE reduction:", round(rmse_improvement2, 2), "%\n")
## RMSE reduction: 38.53 %
# Show full model 1 summary
cat("\n")
summary(model1_clean)
##
## Call:
## lm(formula = Temp_Diff ~ Lat + alt_m + Hour_sin + Hour_cos +
## Device, data = model_data_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.4648 -2.6905 0.0476 2.7094 14.3754
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.286e+02 2.148e+02 -1.995 0.0492 *
## Lat 9.684e+00 5.122e+00 1.891 0.0621 .
## alt_m 4.441e-03 2.840e-03 1.564 0.1216
## Hour_sin -3.692e+00 8.113e-01 -4.551 1.74e-05 ***
## Hour_cos 3.499e+00 7.415e-01 4.719 9.11e-06 ***
## DeviceRed 1.863e+00 1.045e+00 1.783 0.0781 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.918 on 86 degrees of freedom
## Multiple R-squared: 0.3849, Adjusted R-squared: 0.3491
## F-statistic: 10.76 on 5 and 86 DF, p-value: 4.692e-08
model1_clean_tidy <- tidy(model1_clean, conf.int = TRUE)
knitr::kable(model1_clean_tidy, digits = 4,
caption = "Cleaned Model 1: Location and Time Effects on Temperature Error (Influential Points Removed)")
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | -428.5773 | 214.8415 | -1.9949 | 0.0492 | -855.6681 | -1.4864 |
| Lat | 9.6837 | 5.1222 | 1.8905 | 0.0621 | -0.4989 | 19.8663 |
| alt_m | 0.0044 | 0.0028 | 1.5636 | 0.1216 | -0.0012 | 0.0101 |
| Hour_sin | -3.6919 | 0.8113 | -4.5507 | 0.0000 | -5.3047 | -2.0791 |
| Hour_cos | 3.4989 | 0.7415 | 4.7189 | 0.0000 | 2.0249 | 4.9728 |
| DeviceRed | 1.8635 | 1.0449 | 1.7833 | 0.0781 | -0.2138 | 3.9407 |
# Show full model 2 summary
cat("\n")
summary(model2_clean)
##
## Call:
## lm(formula = Temp_Diff ~ Lat + alt_m + Hour_sin + Hour_cos +
## Device + cloudcover + solar_radiation + windspeed, data = model_data_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.8425 -2.2988 0.1258 2.7823 13.5543
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.326e+02 2.656e+02 -2.758 0.00715 **
## Lat 1.687e+01 6.332e+00 2.664 0.00928 **
## alt_m 3.336e-03 3.102e-03 1.075 0.28531
## Hour_sin -3.848e+00 2.038e+00 -1.888 0.06248 .
## Hour_cos 3.216e+00 1.188e+00 2.707 0.00823 **
## DeviceRed 2.973e-01 1.134e+00 0.262 0.79382
## cloudcover 5.865e-02 2.045e-02 2.868 0.00524 **
## solar_radiation 1.068e-03 4.259e-03 0.251 0.80263
## windspeed -1.424e-01 1.274e-01 -1.118 0.26688
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.75 on 83 degrees of freedom
## Multiple R-squared: 0.4463, Adjusted R-squared: 0.393
## F-statistic: 8.363 on 8 and 83 DF, p-value: 2.949e-08
model2_clean_tidy <- tidy(model2_clean, conf.int = TRUE)
knitr::kable(model2_clean_tidy, digits = 4,
caption = "Cleaned Model 2: Location, Time, and Environmental Effects on Temperature Error (Influential Points Removed)")
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | -732.5966 | 265.6347 | -2.7579 | 0.0072 | -1260.9333 | -204.2598 |
| Lat | 16.8665 | 6.3320 | 2.6637 | 0.0093 | 4.2723 | 29.4607 |
| alt_m | 0.0033 | 0.0031 | 1.0754 | 0.2853 | -0.0028 | 0.0095 |
| Hour_sin | -3.8479 | 2.0377 | -1.8883 | 0.0625 | -7.9008 | 0.2051 |
| Hour_cos | 3.2156 | 1.1877 | 2.7074 | 0.0082 | 0.8533 | 5.5780 |
| DeviceRed | 0.2973 | 1.1340 | 0.2622 | 0.7938 | -1.9581 | 2.5527 |
| cloudcover | 0.0587 | 0.0205 | 2.8677 | 0.0052 | 0.0180 | 0.0993 |
| solar_radiation | 0.0011 | 0.0043 | 0.2507 | 0.8026 | -0.0074 | 0.0095 |
| windspeed | -0.1424 | 0.1274 | -1.1178 | 0.2669 | -0.3957 | 0.1110 |
cat("DIAGNOSTIC TESTS ON CLEANED MODEL 1\n")
## DIAGNOSTIC TESTS ON CLEANED MODEL 1
cat("==================================\n\n")
## ==================================
# Normality test
shapiro_test_clean <- shapiro.test(residuals(model1_clean))
cat("Shapiro-Wilk Normality Test:\n")
## Shapiro-Wilk Normality Test:
cat("W statistic:", round(shapiro_test_clean$statistic, 4), "\n")
## W statistic: 0.955
cat("p-value:", format.pval(shapiro_test_clean$p.value, digits = 3), "\n")
## p-value: 0.00305
if (shapiro_test_clean$p.value < 0.05) {
cat("Result: Residuals still show departure from normality\n\n")
} else {
cat("Result: ✓ Residuals are normally distributed\n\n")
}
## Result: Residuals still show departure from normality
# Heteroscedasticity test
bp_test_clean <- bptest(model1_clean)
cat("Breusch-Pagan Heteroscedasticity Test:\n")
## Breusch-Pagan Heteroscedasticity Test:
cat("BP statistic:", round(bp_test_clean$statistic, 4), "\n")
## BP statistic: 8.5065
cat("p-value:", format.pval(bp_test_clean$p.value, digits = 3), "\n")
## p-value: 0.13
if (bp_test_clean$p.value < 0.05) {
cat("Result: Some heteroscedasticity still present\n\n")
} else {
cat("Result: ✓ Homoscedasticity assumption met\n\n")
}
## Result: ✓ Homoscedasticity assumption met
# Compare test statistics
cat("COMPARISON OF DIAGNOSTIC TESTS:\n")
## COMPARISON OF DIAGNOSTIC TESTS:
cat(" Original Model Cleaned Model\n")
## Original Model Cleaned Model
cat("Shapiro-Wilk p-value: ", format.pval(shapiro_test$p.value, digits = 4), " ",
format.pval(shapiro_test_clean$p.value, digits = 4), "\n")
## Shapiro-Wilk p-value: 0.001186 0.003045
cat("Breusch-Pagan p-value: ", format.pval(bp_test$p.value, digits = 4), " ",
format.pval(bp_test_clean$p.value, digits = 4), "\n")
## Breusch-Pagan p-value: 2.162e-07 0.1304
# Diagnostic plots for cleaned model
par(mfrow = c(2, 2))
plot(model1_clean, which = 1:4)
par(mfrow = c(1, 1))
cat("DIAGNOSTIC TESTS ON CLEANED MODEL 2\n")
## DIAGNOSTIC TESTS ON CLEANED MODEL 2
cat("==================================\n\n")
## ==================================
# Normality test
shapiro_test_clean2 <- shapiro.test(residuals(model2_clean))
cat("Shapiro-Wilk Normality Test:\n")
## Shapiro-Wilk Normality Test:
cat("W statistic:", round(shapiro_test_clean2$statistic, 4), "\n")
## W statistic: 0.953
cat("p-value:", format.pval(shapiro_test_clean2$p.value, digits = 3), "\n")
## p-value: 0.00225
if (shapiro_test_clean2$p.value < 0.05) {
cat("Result: Residuals still show departure from normality\n\n")
} else {
cat("Result: ✓ Residuals are normally distributed\n\n")
}
## Result: Residuals still show departure from normality
# Heteroscedasticity test
bp_test_clean2 <- bptest(model2_clean)
cat("Breusch-Pagan Heteroscedasticity Test:\n")
## Breusch-Pagan Heteroscedasticity Test:
cat("BP statistic:", round(bp_test_clean2$statistic, 4), "\n")
## BP statistic: 12.1473
cat("p-value:", format.pval(bp_test_clean2$p.value, digits = 3), "\n")
## p-value: 0.145
if (bp_test_clean2$p.value < 0.05) {
cat("Result: Some heteroscedasticity still present\n\n")
} else {
cat("Result: ✓ Homoscedasticity assumption met\n\n")
}
## Result: ✓ Homoscedasticity assumption met
# Compare test statistics
cat("COMPARISON OF DIAGNOSTIC TESTS:\n")
## COMPARISON OF DIAGNOSTIC TESTS:
cat(" Original Model Cleaned Model\n")
## Original Model Cleaned Model
cat("Shapiro-Wilk p-value: ", format.pval(shapiro_test2$p.value, digits = 4), " ",
format.pval(shapiro_test_clean2$p.value, digits = 4), "\n")
## Shapiro-Wilk p-value: 0.001141 0.002252
cat("Breusch-Pagan p-value: ", format.pval(bp_test2$p.value, digits = 4), " ",
format.pval(bp_test_clean2$p.value, digits = 4), "\n")
## Breusch-Pagan p-value: 7.469e-06 0.1448
# Diagnostic plots for cleaned model
par(mfrow = c(2, 2))
plot(model2_clean, which = 1:4)
par(mfrow = c(1, 1))
Key Learning Points: - Removing influential points can dramatically improve model fit and assumption compliance - However, consider WHY these points are influential - they may contain important information - Document any data removal decisions and justify them scientifically - The cleaned model provides more reliable coefficient estimates - There still are issues with the normality assumption (failed Shapiro-Wilk test) - There still seem to be additional points that are still influential. Consider addressing issues with them as well.
Exercise 4: Investigate the influential points. Are they measurement errors, unusual environmental conditions, or legitimate extreme events? Should they be removed or studied separately? What do you think we should do with these influential points?
We will use model2_clean (with the influential points removed) with the dataset model_data_clean for subsequent analysis. If you think you like a different model better, replace every instance of “model2_clean” with “model1”, “model1_clean”, or “model2_clean” and model_data_clean with model_data accordingly.
# Add predictions and residuals for model 2 clean
model_data_clean$predicted <- predict(model2_clean, newdata = model_data_clean)
model_data_clean$residuals <- residuals(model2_clean, newdata = model_data_clean)
# Predicted vs Actual for model 1
ggplot(model_data_clean, aes(x = predicted, y = Temp_Diff)) +
geom_point(aes(color = Device), alpha = 0.6, size = 3) +
geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "black") +
labs(title = "Model 2 Clean: Predicted vs Actual Temperature Error",
x = "Predicted Temperature Error (°F)",
y = "Actual Temperature Error (°F)") +
theme_minimal() +
scale_color_manual(values = c("Red" = "red", "Purple" = "purple"))
# Residuals by device for model 1
ggplot(model_data_clean, aes(x = Device, y = residuals, fill = Device)) +
geom_boxplot() +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(title = "Model 2 Clean: Residuals by Device",
x = "Device",
y = "Residuals (°F)") +
theme_minimal() +
scale_fill_manual(values = c("Red" = "red", "Purple" = "purple"))
# Create comprehensive model summary
model_summary <- data.frame(
Model = c("Model 1", if(exists("model2")) "Model 2" else NULL),
Predictors = c("Location + Time + Device",
if(exists("model2")) "Location + Time + Device + Environment" else NULL),
N = c(nobs(model1), if(exists("model2")) nobs(model2) else NULL),
R_squared = c(summary(model1)$r.squared,
if(exists("model2")) summary(model2)$r.squared else NULL),
Adj_R_squared = c(summary(model1)$adj.r.squared,
if(exists("model2")) summary(model2)$adj.r.squared else NULL),
RMSE = c(sqrt(mean(residuals(model1)^2)),
if(exists("model2")) sqrt(mean(residuals(model2)^2)) else NULL),
AIC = c(AIC(model1), if(exists("model2")) AIC(model2) else NULL),
BIC = c(BIC(model1), if(exists("model2")) BIC(model2) else NULL)
)
knitr::kable(model_summary, digits = 4,
caption = "Model Performance Comparison for Full Data",
col.names = c("Model", "Predictors", "N", "R²", "Adj R²", "RMSE", "AIC", "BIC"))
| Model | Predictors | N | R² | Adj R² | RMSE | AIC | BIC |
|---|---|---|---|---|---|---|---|
| Model 1 | Location + Time + Device | 100 | 0.2814 | 0.2432 | 7.7523 | 707.3865 | 725.6227 |
| Model 2 | Location + Time + Device + Environment | 100 | 0.3561 | 0.2994 | 7.3388 | 702.4219 | 728.4736 |
# Create comprehensive model summary for cleaned models
model_summary2 <- data.frame(
Model = c("Model 1 Clean", if(exists("model2_clean")) "Model 2 Clean" else NULL),
Predictors = c("Location + Time + Device",
if(exists("model2_clean")) "Location + Time + Device + Environment" else NULL),
N = c(nobs(model1_clean), if(exists("model2_clean")) nobs(model2_clean) else NULL),
R_squared = c(summary(model1_clean)$r.squared,
if(exists("model2_clean")) summary(model2_clean)$r.squared else NULL),
Adj_R_squared = c(summary(model1_clean)$adj.r.squared,
if(exists("model2_clean")) summary(model2_clean)$adj.r.squared else NULL),
RMSE = c(sqrt(mean(residuals(model1)^2)),
if(exists("model2_clean")) sqrt(mean(residuals(model2_clean)^2)) else NULL),
AIC = c(AIC(model1_clean), if(exists("model2_clean")) AIC(model2_clean) else NULL),
BIC = c(BIC(model1_clean), if(exists("model2_clean")) BIC(model2_clean) else NULL)
)
knitr::kable(model_summary2, digits = 4,
caption = "Model Performance Comparison for Cleaned Data",
col.names = c("Model", "Predictors", "N", "R²", "Adj R²", "RMSE", "AIC", "BIC"))
| Model | Predictors | N | R² | Adj R² | RMSE | AIC | BIC |
|---|---|---|---|---|---|---|---|
| Model 1 Clean | Location + Time + Device | 92 | 0.3849 | 0.3491 | 7.7523 | 561.9764 | 579.6290 |
| Model 2 Clean | Location + Time + Device + Environment | 92 | 0.4463 | 0.3930 | 4.5113 | 558.2972 | 583.5151 |
# Create model 1 coefficient plot with confidence intervals
coef_plot_data <- model1_clean_tidy %>%
filter(term != "(Intercept)") %>%
mutate(term = factor(term, levels = rev(term)))
ggplot(coef_plot_data, aes(x = estimate, y = term)) +
geom_vline(xintercept = 0, linetype = "dashed", color = "gray50") +
geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0.2, linewidth = 1) +
geom_point(size = 3, color = "blue") +
labs(title = "Model 1 (Clean) Coefficients with 95% Confidence Intervals",
subtitle = "Effects on Temperature Error (°F)",
x = "Coefficient Estimate",
y = "Predictor Variable") +
theme_minimal() +
theme(axis.text.y = element_text(hjust = 1))
# Create model 2 coefficient plot with confidence intervals
coef_plot_data2 <- model2_clean_tidy %>%
filter(term != "(Intercept)") %>%
mutate(term = factor(term, levels = rev(term)))
ggplot(coef_plot_data2, aes(x = estimate, y = term)) +
geom_vline(xintercept = 0, linetype = "dashed", color = "gray50") +
geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0.2, linewidth = 1) +
geom_point(size = 3, color = "blue") +
labs(title = "Model 2 (Clean) Coefficients with 95% Confidence Intervals",
subtitle = "Effects on Temperature Error (°F)",
x = "Coefficient Estimate",
y = "Predictor Variable") +
theme_minimal() +
theme(axis.text.y = element_text(hjust = 1))
# 3D scatter plot: Lat, Lon, Altitude colored by temp error
plot_ly(model_data,
x = ~Lon, y = ~Lat, z = ~alt_m,
color = ~Temp_Diff,
colors = colorRamp(c("blue", "white", "red")),
marker = list(size = 5),
text = ~paste0("Device: ", Device,
"<br>Temp Error: ", round(Temp_Diff, 2), "°F",
"<br>Altitude: ", round(alt_m, 0), "m"),
hoverinfo = "text") %>%
add_markers() %>%
layout(title = "Temperature Error in 3D Space",
scene = list(
xaxis = list(title = "Longitude"),
yaxis = list(title = "Latitude"),
zaxis = list(title = "Altitude (m)")
))
# Extract significant predictors (p < 0.05)
significant_predictors <- model2_clean_tidy %>%
filter(p.value < 0.05, term != "(Intercept)")
cat("REGRESSION ANALYSIS SUMMARY FOR MODEL 2 CLEAN (with influential points removed)\n")
## REGRESSION ANALYSIS SUMMARY FOR MODEL 2 CLEAN (with influential points removed)
cat("======================================================\n\n")
## ======================================================
cat("Model Performance:\n")
## Model Performance:
cat("- R-squared:", round(summary(model2_clean)$r.squared, 4),
"(model explains", round(summary(model2_clean)$r.squared * 100, 1), "% of variance)\n")
## - R-squared: 0.4463 (model explains 44.6 % of variance)
cat("- Adjusted R-squared:", round(summary(model2_clean)$adj.r.squared, 4), "\n")
## - Adjusted R-squared: 0.393
cat("- RMSE:", round(sqrt(mean(residuals(model2_clean)^2)), 3), "°F\n")
## - RMSE: 4.511 °F
cat("- F-statistic:", round(summary(model2_clean)$fstatistic[1], 2),
"on", summary(model2_clean)$fstatistic[2], "and", summary(model2_clean)$fstatistic[3], "DF\n")
## - F-statistic: 8.36 on 8 and 83 DF
cat("- Overall p-value:", format.pval(pf(summary(model2_clean)$fstatistic[1],
summary(model2_clean)$fstatistic[2], summary(model2_clean)$fstatistic[3],
lower.tail = FALSE), digits = 3), "\n\n")
## - Overall p-value: 2.95e-08
if (nrow(significant_predictors) > 0) {
cat("Significant Predictors (p < 0.05):\n")
for (i in 1:nrow(significant_predictors)) {
cat("-", significant_predictors$term[i], ":",
round(significant_predictors$estimate[i], 4),
"(p =", format.pval(significant_predictors$p.value[i], digits = 3), ")\n")
}
} else {
cat("No significant predictors at p < 0.05 level.\n")
}
## Significant Predictors (p < 0.05):
## - Lat : 16.8665 (p = 0.00928 )
## - Hour_cos : 3.2156 (p = 0.00823 )
## - cloudcover : 0.0587 (p = 0.00524 )
cat("\nInterpretation:\n")
##
## Interpretation:
cat("- Device effect:",
if("DeviceRed" %in% significant_predictors$term) {
"Red device shows significantly different error pattern"
} else {
"No significant difference between devices"
}, "\n")
## - Device effect: No significant difference between devices
cat("- Spatial effects:",
if(any(c("Lat", "Lon", "alt_m") %in% significant_predictors$term)) {
"Location/elevation significantly affects temperature error"
} else {
"Location/elevation does not significantly affect error"
}, "\n")
## - Spatial effects: Location/elevation significantly affects temperature error
cat("- Temporal effects:",
if(any(c("Hour_sin", "Hour_cos") %in% significant_predictors$term)) {
"Time of day significantly affects temperature error"
} else {
"Time of day does not significantly affect error"
}, "\n")
## - Temporal effects: Time of day significantly affects temperature error
cat("- Environmental effects:",
if(any(c("cloudcover", "solar_radiation", "windspeed") %in% significant_predictors$term)) {
"Environment significantly affects temperature error"
} else {
"Environment does not significantly affect error"
}, "\n")
## - Environmental effects: Environment significantly affects temperature error
Exercise 5: Based on the regression results, what are the most important factors influencing temperature measurement errors and how did you identify them?
# Enhanced residual plot
ggplot(model_data_clean, aes(x = predicted, y = residuals, color = Device)) +
geom_point(alpha = 0.6, size = 3) +
geom_hline(yintercept = 0, linetype = "dashed", color = "black") +
geom_smooth(method = "loess", se = TRUE, linewidth = 1) +
labs(title = "Residual Analysis by Device",
x = "Fitted Values (°F)",
y = "Residuals (°F)") +
theme_minimal() +
scale_color_manual(values = c("Red" = "red", "Purple" = "purple"))
ggplot(model_data_clean, aes(sample = residuals, color = Device)) +
stat_qq() +
stat_qq_line() +
facet_wrap(~Device) +
labs(title = "Normal Q-Q Plots by Device",
x = "Theoretical Quantiles",
y = "Sample Quantiles") +
theme_minimal() +
scale_color_manual(values = c("Red" = "red", "Purple" = "purple"))
model_data_clean$sqrt_abs_residuals <- sqrt(abs(model_data_clean$residuals))
ggplot(model_data_clean, aes(x = predicted, y = sqrt_abs_residuals, color = Device)) +
geom_point(alpha = 0.6, size = 3) +
geom_smooth(method = "loess", se = TRUE) +
labs(title = "Scale-Location Plot",
subtitle = "Check for homoscedasticity",
x = "Fitted Values (°F)",
y = expression(sqrt("|Standardized Residuals|"))) +
theme_minimal() +
scale_color_manual(values = c("Red" = "red", "Purple" = "purple"))
# Calculate Cook's distance
model_data_clean$cooks_d <- cooks.distance(model2_clean)
# Find influential points (Cook's D > 4/n)
threshold <- 4 / nrow(model_data_clean)
influential_points <- model_data_clean %>%
filter(cooks_d > threshold) %>%
arrange(desc(cooks_d))
ggplot(model_data_clean, aes(x = seq_along(cooks_d), y = cooks_d, color = Device)) +
geom_point(alpha = 0.6, size = 3) +
geom_hline(yintercept = threshold, linetype = "dashed", color = "red") +
labs(title = "Cook's Distance - Influence Plot",
subtitle = paste("Red line indicates threshold (4/n =", round(threshold, 4), ")"),
x = "Observation Index",
y = "Cook's Distance") +
theme_minimal() +
scale_color_manual(values = c("Red" = "red", "Purple" = "purple"))
if (nrow(influential_points) > 0) {
cat("\nInfluential observations (Cook's D >", round(threshold, 4), "):\n")
cat("Number of influential points:", nrow(influential_points), "\n")
cat("Top 5 most influential:\n")
print(head(influential_points %>%
select(TimeStampFull, Device, Temp_Diff, predicted, cooks_d), 5))
}
##
## Influential observations (Cook's D > 0.0435 ):
## Number of influential points: 6
## Top 5 most influential:
## # A tibble: 5 × 5
## TimeStampFull Device Temp_Diff predicted cooks_d
## <dttm> <chr> <dbl> <dbl> <dbl>
## 1 2025-09-11 16:30:02 Red -20.6 -1.76 0.239
## 2 2025-09-08 23:18:47 Red 11.9 0.681 0.0662
## 3 2025-09-09 15:25:35 Red -3.18 2.93 0.0538
## 4 2024-09-12 16:52:36 Purple -2.42 2.99 0.0481
## 5 2025-09-04 21:49:32 Red 21.4 7.81 0.0469
# Create error magnitude categories
model_data_clean <- model_data_clean %>%
mutate(
Error_Magnitude = cut(Temp_Abs_Diff,
breaks = c(0, 1, 3, 5, Inf),
labels = c("Very Low (<1°F)", "Low (1-3°F)",
"Medium (3-5°F)", "High (>5°F)"))
)
# Color palette
error_pal <- colorFactor(
palette = c("#2ecc71", "#f1c40f", "#e67e22", "#e74c3c"),
domain = model_data_clean$Error_Magnitude
)
hotspot_map <- leaflet(model_data_clean) %>%
addProviderTiles("Esri.WorldImagery", group = "Satellite") %>%
addProviderTiles(providers$Esri.WorldTopoMap, group = "Topo") %>%
addWMSTiles(
baseUrl = "https://basemap.nationalmap.gov/arcgis/services/USGSTopo/MapServer/WMSServer?",
layers = "0",
options = WMSTileOptions(format = "image/png", transparent = TRUE),
attribution = "USGS The National Map",
group = "USGS Topo"
) %>%
addCircleMarkers(
lng = ~Lon,
lat = ~Lat,
radius = ~Temp_Abs_Diff * 2, # Size proportional to error magnitude
color = ~error_pal(Error_Magnitude),
fillOpacity = 0.7,
stroke = TRUE,
weight = 1,
label = ~paste0("Error: ", round(Temp_Diff, 2), "°F | Magnitude: ", Error_Magnitude),
popup = ~paste0(
"<b>Error Hotspot Analysis</b><br>",
"<b>Device:</b> ", Device, "<br>",
"<b>Temperature Error:</b> ", round(Temp_Diff, 2), "°F<br>",
"<b>Absolute Error:</b> ", round(Temp_Abs_Diff, 2), "°F<br>",
"<b>Magnitude:</b> ", Error_Magnitude, "<br>",
"<b>Model Prediction:</b> ", round(predicted, 2), "°F<br>",
"<b>Residual:</b> ", round(residuals, 2), "°F<br>",
"<hr>",
"<b>Location:</b><br>",
"Lat: ", round(Lat, 5), "<br>",
"Lon: ", round(Lon, 5), "<br>",
"Altitude: ", round(alt_m, 0), "m<br>",
"<b>Time:</b> ", format(TimeStampFull, "%Y-%m-%d %H:%M")
)
) %>%
addLayersControl(
baseGroups = c("Satellite", "Topo", "USGS Topo"),
options = layersControlOptions(collapsed = FALSE)
) %>%
addLegend(
position = "bottomright",
pal = error_pal,
values = ~Error_Magnitude,
title = "Error Magnitude",
opacity = 1
) %>%
addScaleBar(position = "bottomleft")
hotspot_map
# Color palette for residuals
residual_pal <- colorNumeric(
palette = "RdBu",
domain = c(-max(abs(model_data_clean$residuals)), max(abs(model_data_clean$residuals))),
reverse = TRUE
)
residuals_map <- leaflet(model_data_clean) %>%
addProviderTiles("Esri.WorldTopoMap") %>%
addCircleMarkers(
lng = ~Lon,
lat = ~Lat,
radius = 6,
color = ~residual_pal(residuals),
fillOpacity = 0.8,
stroke = TRUE,
weight = 1,
label = ~paste0("Residual: ", round(residuals, 2), "°F"),
popup = ~paste0(
"<b>Model Residual Analysis</b><br>",
"<b>Observed:</b> ", round(Temp_Diff, 2), "°F<br>",
"<b>Predicted:</b> ", round(predicted, 2), "°F<br>",
"<b>Residual:</b> ", round(residuals, 2), "°F<br>",
"<b>Device:</b> ", Device, "<br>",
"<b>Time:</b> ", format(TimeStampFull, "%Y-%m-%d %H:%M")
)
) %>%
addLegend(
position = "bottomright",
pal = residual_pal,
values = ~residuals,
title = "Residuals (°F)<br>Red = Over-predicted<br>Blue = Under-predicted",
opacity = 1
)
residuals_map
cat("STATISTICAL ANALYSIS SUMMARY\n")
## STATISTICAL ANALYSIS SUMMARY
cat("============================\n\n")
## ============================
cat("1. BLAND-ALTMAN ANALYSIS\n")
## 1. BLAND-ALTMAN ANALYSIS
cat(" Temperature:\n")
## Temperature:
cat(" - Mean Bias:", round(temp_bias, 3), "°F\n")
## - Mean Bias: -0.08 °F
cat(" - 95% Limits of Agreement:", round(temp_lower_loa, 3), "to", round(temp_upper_loa, 3), "°F\n")
## - 95% Limits of Agreement: -18.095 to 17.935 °F
cat(" - Within limits:",
round(sum(comparison_data$Temp_Diff >= temp_lower_loa &
comparison_data$Temp_Diff <= temp_upper_loa, na.rm = TRUE) /
sum(!is.na(comparison_data$Temp_Diff)) * 100, 1), "%\n\n")
## - Within limits: 91 %
cat("2. PAIRED T-TESTS\n")
## 2. PAIRED T-TESTS
cat(" Temperature: t =", round(temp_ttest$statistic, 3),
", p =", format.pval(temp_ttest$p.value, digits = 3), "\n")
## Temperature: t = -0.087 , p = 0.931
cat(" Humidity: t =", round(hum_ttest$statistic, 3),
", p =", format.pval(hum_ttest$p.value, digits = 3), "\n")
## Humidity: t = 3.364 , p = 0.00109
cat(" Pressure: t =", round(press_ttest$statistic, 3),
", p =", format.pval(press_ttest$p.value, digits = 3), "\n\n")
## Pressure: t = -8.694 , p = 7.65e-14
cat("3. REGRESSION MODEL 2 CLEAN (with influential points removed)\n")
## 3. REGRESSION MODEL 2 CLEAN (with influential points removed)
cat(" R-squared:", round(summary(model2_clean)$r.squared, 4), "\n")
## R-squared: 0.4463
cat(" Adjusted R-squared:", round(summary(model2_clean)$adj.r.squared, 4), "\n")
## Adjusted R-squared: 0.393
cat(" RMSE:", round(sqrt(mean(residuals(model2_clean)^2)), 3), "°F\n")
## RMSE: 4.511 °F
cat(" Number of significant predictors:", nrow(significant_predictors), "\n")
## Number of significant predictors: 3
Based on the statistical analysis:
Exercise 6: Design and implement one of the following extensions and put your code in the chunk below.
Easier Extensions:
Perform seperate regression models for each device
Remove some variables from model2_clean that are not significant. Try to get the Baysian Information Criterion (BIC) or Akaike’s Information Criterion (AIC) as low as possible.
Harder Extensions:
Deal with the issue of influential points. There are still influential points in the model made from the cleaned data. Are there errors in the data set?
Deal with the issue of non-normality (failed Shapiro-Wilk test). Ideas: inspect distribution, maybe do a transform on the dependent variable (logarithm?), or use robust regression.
Deal with the issue of heteroscedasticity (failed Breusch-Pagan test). Ideas: weighted least squares, or robust standard errors.
Try a mixed effect model (try function lmer from package lme4 instead of lm). Create data clusters based on time/space bins and a term “+ (1 | cluster_variable_name)” to your model. This might also help deal with the issues of non-normality and/or heteroscedasticity
#put your code for exercise 5 extension below
Publishing to RPubs:
Knit this document to HTML
Click “Publish” in the preview window
Select RPubs and follow prompts
Submit your statistical analysis by pasting the url in Canvas
Data Export:
# Export enhanced dataset with model predictions
# write.csv(model_data, "weather_modeling_results.csv", row.names = FALSE)
cat("\nAnalysis complete! Enhanced dataset includes:\n")
##
## Analysis complete! Enhanced dataset includes:
cat("- NOAA reference data\n")
## - NOAA reference data
cat("- Extended environmental variables\n")
## - Extended environmental variables
cat("- Model predictions and residuals\n")
## - Model predictions and residuals
cat("- Statistical test results\n")
## - Statistical test results