Introduction

This tutorial builds on the descriptive statistical analysis of Part 1 by performing statistical modeling to understand the differences between sensor measurements and NOAA reference data. We’ll use Bland-Altman plots, paired t-tests, collect additional meteorological variables, and build linear regression models to identify factors that influence measurement errors.

1. Load Required Packages

# Install additional packages if needed
# install.packages(c("ggExtra", "broom", "car", "lmtest", "ggpubr"))

library(dplyr)
library(ggplot2)
library(plotly)
library(leaflet)
library(lubridate)
library(ggExtra)     # For marginal plots
library(broom)       # For tidy model output
library(car)         # For regression diagnostics
library(lmtest)      # For statistical tests
library(ggpubr)      # For publication-ready plots
library(httr)
library(jsonlite)
library(readr)

2. Load Data from Part 1

# Assuming you have the comparison_data from Part 1 saved as "CDT_Weather_Comparison.csv" in your working directory (part 9 from the first the activity).

# If not, you'll need to run the first activity again first to generate it. Make sure to (uncomment and then) run the code in part 9 that saves the data to the csv file. This code is:  write.csv(comparison_data, "CDT_Weather_Comparison.csv", row.names = FALSE)

#or, optionally you can just download this csv file from canvas and put it in your working directory.
 
comparison_data <- read_csv("CDT_Weather_Comparison.csv")


# For this tutorial, we'll assume comparison_data exists
cat("Data dimensions:", nrow(comparison_data), "rows x", ncol(comparison_data), "columns\n")
## Data dimensions: 100 rows x 26 columns
cat("Variables available:\n")
## Variables available:
print(names(comparison_data))
##  [1] "TimeStampFull"       "Date"                "Time"               
##  [4] "Lat"                 "Lon"                 "alt_m"              
##  [7] "Temp_F"              "Hum"                 "Press"              
## [10] "DP"                  "Device"              "Temp_C"             
## [13] "NOAA_Temp_F"         "NOAA_Humidity"       "NOAA_Pressure_hPa"  
## [16] "Temp_Diff"           "Temp_Pct_Diff"       "Temp_Abs_Diff"      
## [19] "Hum_Diff"            "Hum_Abs_Diff"        "Press_Diff"         
## [22] "Press_Abs_Diff"      "Temp_Error_Category" "Temp_Error_Type"    
## [25] "Hum_Error_Type"      "Press_Error_Type"

3. Bland-Altman Analysis

Bland-Altman plots are the gold standard for assessing agreement between two measurement methods.

3.1 Temperature Bland-Altman Plot

# Calculate mean and difference for Bland-Altman
comparison_data <- comparison_data %>%
  mutate(
    Temp_Mean = (Temp_F + NOAA_Temp_F) / 2,
    Hum_Mean = (Hum + NOAA_Humidity) / 2,
    Press_Mean = (Press + NOAA_Pressure_hPa) / 2
  )

# Calculate bias and limits of agreement
temp_bias <- mean(comparison_data$Temp_Diff, na.rm = TRUE)
temp_sd <- sd(comparison_data$Temp_Diff, na.rm = TRUE)
temp_upper_loa <- temp_bias + 1.96 * temp_sd
temp_lower_loa <- temp_bias - 1.96 * temp_sd

# Create Bland-Altman plot
ba_temp <- ggplot(comparison_data, aes(x = Temp_Mean, y = Temp_Diff)) +
  geom_point(aes(color = Device), alpha = 0.6, size = 3) +
  geom_hline(yintercept = temp_bias, linetype = "solid", color = "blue", linewidth = 1.2) +
  geom_hline(yintercept = temp_upper_loa, linetype = "dashed", color = "red", linewidth = 1) +
  geom_hline(yintercept = temp_lower_loa, linetype = "dashed", color = "red", linewidth = 1) +
  geom_hline(yintercept = 0, linetype = "dotted", color = "black", linewidth = 0.8) +
  annotate("text", x = min(comparison_data$Temp_Mean), y = temp_bias, 
           label = paste0("Bias = ", round(temp_bias, 2), "°F"), hjust = 0, vjust = -0.5) +
  annotate("text", x = min(comparison_data$Temp_Mean), y = temp_upper_loa, 
           label = paste0("+1.96 SD = ", round(temp_upper_loa, 2), "°F"), hjust = 0, vjust = -0.5) +
  annotate("text", x = min(comparison_data$Temp_Mean), y = temp_lower_loa, 
           label = paste0("-1.96 SD = ", round(temp_lower_loa, 2), "°F"), hjust = 0, vjust = 1.5) +
  labs(title = "Bland-Altman Plot: Temperature",
       subtitle = "Agreement between Sensor and NOAA measurements",
       x = "Mean Temperature [(Sensor + NOAA) / 2] (°F)",
       y = "Difference (Sensor - NOAA) (°F)") +
  theme_minimal() +
  scale_color_manual(values = c("Red" = "red", "Purple" = "purple"))

# Add marginal density plots
ggMarginal(ba_temp, type = "density", groupColour = TRUE, groupFill = TRUE)

Interpretation: - Blue line: Mean bias (systematic difference) - Red dashed lines: 95% limits of agreement (±1.96 SD) - Points within limits: Good agreement (95% of differences should fall within these limits) - Pattern detection: Look for trends or heteroscedasticity

3.2 Humidity Bland-Altman Plot

# Calculate bias and limits for humidity
hum_bias <- mean(comparison_data$Hum_Diff, na.rm = TRUE)
hum_sd <- sd(comparison_data$Hum_Diff, na.rm = TRUE)
hum_upper_loa <- hum_bias + 1.96 * hum_sd
hum_lower_loa <- hum_bias - 1.96 * hum_sd

ba_hum <- ggplot(comparison_data, aes(x = Hum_Mean, y = Hum_Diff)) +
  geom_point(aes(color = Device), alpha = 0.6, size = 3) +
  geom_hline(yintercept = hum_bias, linetype = "solid", color = "blue", linewidth = 1.2) +
  geom_hline(yintercept = hum_upper_loa, linetype = "dashed", color = "red", linewidth = 1) +
  geom_hline(yintercept = hum_lower_loa, linetype = "dashed", color = "red", linewidth = 1) +
  geom_hline(yintercept = 0, linetype = "dotted", color = "black", linewidth = 0.8) +
  annotate("text", x = min(comparison_data$Hum_Mean), y = hum_bias, 
           label = paste0("Bias = ", round(hum_bias, 2), "%"), hjust = 0, vjust = -0.5) +
  annotate("text", x = min(comparison_data$Hum_Mean), y = hum_upper_loa, 
           label = paste0("+1.96 SD = ", round(hum_upper_loa, 2), "%"), hjust = 0, vjust = -0.5) +
  annotate("text", x = min(comparison_data$Hum_Mean), y = hum_lower_loa, 
           label = paste0("-1.96 SD = ", round(hum_lower_loa, 2), "%"), hjust = 0, vjust = 1.5) +
  labs(title = "Bland-Altman Plot: Humidity",
       x = "Mean Humidity [(Sensor + NOAA) / 2] (%)",
       y = "Difference (Sensor - NOAA) (%)") +
  theme_minimal() +
  scale_color_manual(values = c("Red" = "red", "Purple" = "purple"))

ggMarginal(ba_hum, type = "density", groupColour = TRUE, groupFill = TRUE)

3.3 Pressure Bland-Altman Plot

Exercise 1

Exercise 1: Create a Bland-Altman plot for pressure following the same pattern as temperature and humidity above. Copy and paste the code from the example above into the code chunk below and modify it accordingly.

#put code for Excercise 1 below

3.4 Bland-Altman Summary Statistics

# Create summary table
ba_summary <- data.frame(
  Variable = c("Temperature (°F)", "Humidity (%)", "Pressure (hPa)"),
  Bias = c(temp_bias, hum_bias, mean(comparison_data$Press_Diff, na.rm = TRUE)),
  SD = c(temp_sd, hum_sd, sd(comparison_data$Press_Diff, na.rm = TRUE)),
  Lower_LOA = c(temp_lower_loa, hum_lower_loa, 
                mean(comparison_data$Press_Diff, na.rm = TRUE) - 
                  1.96 * sd(comparison_data$Press_Diff, na.rm = TRUE)),
  Upper_LOA = c(temp_upper_loa, hum_upper_loa, 
                mean(comparison_data$Press_Diff, na.rm = TRUE) + 
                  1.96 * sd(comparison_data$Press_Diff, na.rm = TRUE))
)

knitr::kable(ba_summary, digits = 2, 
             caption = "Bland-Altman Analysis: Bias and Limits of Agreement",
             col.names = c("Variable", "Bias", "SD", "Lower LOA", "Upper LOA"))
Bland-Altman Analysis: Bias and Limits of Agreement
Variable Bias SD Lower LOA Upper LOA
Temperature (°F) -0.08 9.19 -18.10 17.94
Humidity (%) 5.02 14.92 -24.23 34.27
Pressure (hPa) -6.22 7.15 -20.23 7.80

4. Paired T-Tests

Paired t-tests assess whether there is a statistically significant systematic difference between sensor and NOAA measurements.

4.1 Temperature Paired T-Test

# Perform paired t-test
temp_ttest <- t.test(comparison_data$Temp_F, comparison_data$NOAA_Temp_F, 
                     paired = TRUE, alternative = "two.sided")

# Calculate effect size (Cohen's d)
temp_cohens_d <- temp_bias / temp_sd

cat("Temperature Paired T-Test Results:\n")
## Temperature Paired T-Test Results:
cat("=====================================\n")
## =====================================
cat("Mean difference (Bias):", round(temp_bias, 3), "°F\n")
## Mean difference (Bias): -0.08 °F
cat("t-statistic:", round(temp_ttest$statistic, 3), "\n")
## t-statistic: -0.087
cat("Degrees of freedom:", temp_ttest$parameter, "\n")
## Degrees of freedom: 99
cat("p-value:", format.pval(temp_ttest$p.value, digits = 3), "\n")
## p-value: 0.931
cat("95% CI:", round(temp_ttest$conf.int[1], 3), "to", round(temp_ttest$conf.int[2], 3), "°F\n")
## 95% CI: -1.904 to 1.744 °F
cat("Cohen's d (effect size):", round(temp_cohens_d, 3), "\n")
## Cohen's d (effect size): -0.009
cat("\nInterpretation:\n")
## 
## Interpretation:
if (temp_ttest$p.value < 0.05) {
  cat("There IS a statistically significant difference between sensor and NOAA temperatures (p < 0.05).\n")
} else {
  cat("There is NO statistically significant difference between sensor and NOAA temperatures (p >= 0.05).\n")
}
## There is NO statistically significant difference between sensor and NOAA temperatures (p >= 0.05).

4.2 Humidity and Pressure Paired T-Tests

# Humidity t-test
hum_ttest <- t.test(comparison_data$Hum, comparison_data$NOAA_Humidity, 
                    paired = TRUE, alternative = "two.sided")

# Pressure t-test
press_ttest <- t.test(comparison_data$Press, comparison_data$NOAA_Pressure_hPa, 
                      paired = TRUE, alternative = "two.sided")

# Create summary table
ttest_summary <- data.frame(
  Variable = c("Temperature", "Humidity", "Pressure"),
  Mean_Diff = c(temp_bias, hum_bias, mean(comparison_data$Press_Diff, na.rm = TRUE)),
  t_statistic = c(temp_ttest$statistic, hum_ttest$statistic, press_ttest$statistic),
  df = c(temp_ttest$parameter, hum_ttest$parameter, press_ttest$parameter),
  p_value = c(temp_ttest$p.value, hum_ttest$p.value, press_ttest$p.value),
  CI_Lower = c(temp_ttest$conf.int[1], hum_ttest$conf.int[1], press_ttest$conf.int[1]),
  CI_Upper = c(temp_ttest$conf.int[2], hum_ttest$conf.int[2], press_ttest$conf.int[2]),
  Significant = c(temp_ttest$p.value < 0.05, hum_ttest$p.value < 0.05, press_ttest$p.value < 0.05)
)

knitr::kable(ttest_summary, digits = 4, 
             caption = "Paired T-Test Results Summary",
             col.names = c("Variable", "Mean Diff", "t", "df", "p-value", "CI Lower", "CI Upper", "Significant (α=0.05)"))
Paired T-Test Results Summary
Variable Mean Diff t df p-value CI Lower CI Upper Significant (α=0.05)
Temperature -0.080 -0.0870 99 0.9308 -1.9038 1.7438 FALSE
Humidity 5.020 3.3637 99 0.0011 2.0587 7.9813 TRUE
Pressure -6.216 -8.6941 99 0.0000 -7.6347 -4.7973 TRUE

4.3 Paired T-Test by Device

# Test each device separately
red_data <- comparison_data %>% filter(Device == "Red")
purple_data <- comparison_data %>% filter(Device == "Purple")

if (nrow(red_data) > 1) {
  red_temp_ttest <- t.test(red_data$Temp_F, red_data$NOAA_Temp_F, paired = TRUE)
  cat("Red Device Temperature T-Test:\n")
  cat("Mean difference:", round(mean(red_data$Temp_Diff, na.rm = TRUE), 3), "°F\n")
  cat("p-value:", format.pval(red_temp_ttest$p.value, digits = 3), "\n\n")
}
## Red Device Temperature T-Test:
## Mean difference: -0.829 °F
## p-value: 0.6
if (nrow(purple_data) > 1) {
  purple_temp_ttest <- t.test(purple_data$Temp_F, purple_data$NOAA_Temp_F, paired = TRUE)
  cat("Purple Device Temperature T-Test:\n")
  cat("Mean difference:", round(mean(purple_data$Temp_Diff, na.rm = TRUE), 3), "°F\n")
  cat("p-value:", format.pval(purple_temp_ttest$p.value, digits = 3), "\n")
}
## Purple Device Temperature T-Test:
## Mean difference: 0.836 °F
## p-value: 0.232

Exercise 2

Exercise 2:

Interpret the t-test results. Write your interpretation below each question part below.

2a) From 4.1 and 4.2, which weather variable(s) (temp, pressure, humidity) were significant?

2b) From 4.3, which device (if any) shows systematic bias for temperature?

2c) Modify the code from 4.3 do a paired t test for each device but for one of the other variables (pressure or humidity). Then interpret your results.

#put your modified 4.3 code here:

5. Collect Additional NOAA Variables

Let’s fetch additional meteorological variables that might explain measurement differences.

5.1 Function to Fetch Extended Weather Data

get_extended_weather_data <- function(lat, lon, datetime, debug = FALSE) {
  date <- as.Date(datetime)
  
  base_url <- "https://archive-api.open-meteo.com/v1/archive"
  
  url <- paste0(
    base_url,
    "?latitude=", lat,
    "&longitude=", lon,
    "&start_date=", date,
    "&end_date=", date,
    "&hourly=cloudcover,shortwave_radiation,windspeed_10m,winddirection_10m,precipitation",
    "&temperature_unit=fahrenheit"
  )
  
  default_result <- list(
    cloudcover = NA_real_,
    shortwave_radiation = NA_real_,
    windspeed_10m = NA_real_,
    winddirection_10m = NA_real_,
    precipitation = NA_real_
  )
  
  result <- tryCatch({
    response <- GET(url)
    
    if (status_code(response) == 200) {
      data <- content(response, "parsed")
      
      if (is.null(data$hourly)) {
        return(default_result)
      }
      
      hourly_data <- data$hourly
      target_hour <- hour(datetime)
      idx <- target_hour + 1
      
      # Extract values with NULL checks
      cloudcover_val <- if (!is.null(hourly_data$cloudcover) && idx <= length(hourly_data$cloudcover)) {
        val <- hourly_data$cloudcover[idx]
        if (!is.null(val)) as.numeric(val) else NA_real_
      } else NA_real_
      
      radiation_val <- if (!is.null(hourly_data$shortwave_radiation) && idx <= length(hourly_data$shortwave_radiation)) {
        val <- hourly_data$shortwave_radiation[idx]
        if (!is.null(val)) as.numeric(val) else NA_real_
      } else NA_real_
      
      windspeed_val <- if (!is.null(hourly_data$windspeed_10m) && idx <= length(hourly_data$windspeed_10m)) {
        val <- hourly_data$windspeed_10m[idx]
        if (!is.null(val)) as.numeric(val) else NA_real_
      } else NA_real_
      
      winddir_val <- if (!is.null(hourly_data$winddirection_10m) && idx <= length(hourly_data$winddirection_10m)) {
        val <- hourly_data$winddirection_10m[idx]
        if (!is.null(val)) as.numeric(val) else NA_real_
      } else NA_real_
      
      precip_val <- if (!is.null(hourly_data$precipitation) && idx <= length(hourly_data$precipitation)) {
        val <- hourly_data$precipitation[idx]
        if (!is.null(val)) as.numeric(val) else NA_real_
      } else NA_real_
      
      return(list(
        cloudcover = cloudcover_val,
        shortwave_radiation = radiation_val,
        windspeed_10m = windspeed_val,
        winddirection_10m = winddir_val,
        precipitation = precip_val
      ))
    } else {
      return(default_result)
    }
  }, error = function(e) {
    if (debug) cat("Error:", e$message, "\n")
    return(default_result)
  })
  
  return(result)
}

5.2 Fetch Extended Weather Variables

OpenMeteo has other NOAA data variables we can add to our dataset. This might be helpful to investigate if any of the differences in weather data are dependent on any of these environmental variables.

# Initialize columns
comparison_data$cloudcover <- NA
comparison_data$solar_radiation <- NA
comparison_data$windspeed <- NA
comparison_data$wind_direction <- NA
comparison_data$precipitation <- NA

cat("Fetching extended weather data...\n")
## Fetching extended weather data...
cat("NOTE: Due to API rate limits, we'll fetch with longer delays.\n\n")
## NOTE: Due to API rate limits, we'll fetch with longer delays.
success_count <- 0

for (i in 1:nrow(comparison_data)) {
  if (i %% 5 == 0) {
    cat("Processing point", i, "of", nrow(comparison_data), "\n")
  }
  
  ext_data <- get_extended_weather_data(
    comparison_data$Lat[i],
    comparison_data$Lon[i],
    comparison_data$TimeStampFull[i]
  )
  
  # Safely assign with length checks
  if (length(ext_data$cloudcover) > 0) {
    comparison_data$cloudcover[i] <- ext_data$cloudcover
  }
  if (length(ext_data$shortwave_radiation) > 0) {
    comparison_data$solar_radiation[i] <- ext_data$shortwave_radiation
  }
  if (length(ext_data$windspeed_10m) > 0) {
    comparison_data$windspeed[i] <- ext_data$windspeed_10m
  }
  if (length(ext_data$winddirection_10m) > 0) {
    comparison_data$wind_direction[i] <- ext_data$winddirection_10m
  }
  if (length(ext_data$precipitation) > 0) {
    comparison_data$precipitation[i] <- ext_data$precipitation
  }
  
  if (!is.na(comparison_data$cloudcover[i])) {
    success_count <- success_count + 1
  }
  
  Sys.sleep(0.5)  # delay to respect rate limits. If you download too fast or too much data your IP address will get cut off for the day!
}
## Processing point 5 of 100 
## Processing point 10 of 100 
## Processing point 15 of 100 
## Processing point 20 of 100 
## Processing point 25 of 100 
## Processing point 30 of 100 
## Processing point 35 of 100 
## Processing point 40 of 100 
## Processing point 45 of 100 
## Processing point 50 of 100 
## Processing point 55 of 100 
## Processing point 60 of 100 
## Processing point 65 of 100 
## Processing point 70 of 100 
## Processing point 75 of 100 
## Processing point 80 of 100 
## Processing point 85 of 100 
## Processing point 90 of 100 
## Processing point 95 of 100 
## Processing point 100 of 100
cat("\nExtended data fetch complete!\n")
## 
## Extended data fetch complete!
cat("Successfully retrieved:", success_count, "of", nrow(comparison_data), "points\n")
## Successfully retrieved: 100 of 100 points

5.3 Visualize Additional Variables

# Cloud cover vs temperature error
p1 <- ggplot(comparison_data %>% filter(!is.na(cloudcover)), 
       aes(x = cloudcover, y = Temp_Diff, color = Device)) +
  geom_point(alpha = 0.6, size = 3) +
  geom_smooth(method = "loess", se = TRUE) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Temperature Error vs Cloud Cover",
       x = "Cloud Cover (%)",
       y = "Temperature Error (°F)") +
  theme_minimal() +
  scale_color_manual(values = c("Red" = "red", "Purple" = "purple"))

# Solar radiation vs temperature error
p2 <- ggplot(comparison_data %>% filter(!is.na(solar_radiation)), 
       aes(x = solar_radiation, y = Temp_Diff, color = Device)) +
  geom_point(alpha = 0.6, size = 3) +
  geom_smooth(method = "loess", se = TRUE) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Temperature Error vs Solar Radiation",
       x = "Solar Radiation (W/m²)",
       y = "Temperature Error (°F)") +
  theme_minimal() +
  scale_color_manual(values = c("Red" = "red", "Purple" = "purple"))

# Wind speed vs temperature error
p3 <- ggplot(comparison_data %>% filter(!is.na(windspeed)), 
       aes(x = windspeed, y = Temp_Diff, color = Device)) +
  geom_point(alpha = 0.6, size = 3) +
  geom_smooth(method = "loess", se = TRUE) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Temperature Error vs Wind Speed",
       x = "Wind Speed (km/h)",
       y = "Temperature Error (°F)") +
  theme_minimal() +
  scale_color_manual(values = c("Red" = "red", "Purple" = "purple"))

print(p1)

print(p2)

print(p3)

5.4 Interactive Map of Environmental Variables

# Create color palette for cloud cover
cloud_pal <- colorNumeric(
  palette = "Blues",
  domain = comparison_data$cloudcover,
  na.color = "transparent"
)

# Create map
env_map <- leaflet(comparison_data %>% filter(!is.na(cloudcover))) %>%
  addProviderTiles("Esri.WorldImagery", group = "Satellite") %>%
  addProviderTiles(providers$Esri.WorldTopoMap, group = "Topo") %>%
  
  # Cloud cover markers
  addCircleMarkers(
    lng = ~Lon,
    lat = ~Lat,
    radius = 6,
    color = ~cloud_pal(cloudcover),
    fillOpacity = 0.8,
    stroke = TRUE,
    weight = 1,
    group = "Cloud Cover",
    label = ~paste0("Cloud Cover: ", round(cloudcover, 0), "%"),
    popup = ~paste0(
      "<b>Environmental Conditions</b><br>",
      "Cloud Cover: ", round(cloudcover, 0), "%<br>",
      "Solar Radiation: ", round(solar_radiation, 0), " W/m²<br>",
      "Wind Speed: ", round(windspeed, 1), " km/h<br>",
      "Temperature Error: ", round(Temp_Diff, 1), "°F"
    )
  ) %>%
  
  # Solar radiation markers
  addCircleMarkers(
    lng = ~Lon,
    lat = ~Lat,
    radius = ~scales::rescale(solar_radiation, to = c(3, 10)),
    color = "orange",
    fillOpacity = 0.7,
    stroke = TRUE,
    weight = 1,
    group = "Solar Radiation",
    label = ~paste0("Solar: ", round(solar_radiation, 0), " W/m²"),
    popup = ~paste0(
      "<b>Solar Radiation</b><br>",
      "Value: ", round(solar_radiation, 0), " W/m²<br>",
      "Cloud Cover: ", round(cloudcover, 0), "%<br>",
      "Temperature Error: ", round(Temp_Diff, 1), "°F"
    )
  ) %>%
  
  addLayersControl(
    baseGroups = c("Satellite", "Topo"),
    overlayGroups = c("Cloud Cover", "Solar Radiation"),
    options = layersControlOptions(collapsed = FALSE)
  ) %>%
  
  hideGroup("Solar Radiation") %>%
  
  addLegend(
    position = "bottomright",
    pal = cloud_pal,
    values = ~cloudcover,
    title = "Cloud Cover (%)",
    group = "Cloud Cover"
  )

env_map

6. Linear Regression Modeling

Build models to predict temperature error based on multiple factors.

6.1 Prepare Modeling Data

# Add time-based variables
model_data <- comparison_data %>%
  filter(!is.na(Temp_Diff)) %>%
  mutate(
    Hour = hour(TimeStampFull),
    Hour_sin = sin(2 * pi * Hour / 24),  # Circular encoding of hour
    Hour_cos = cos(2 * pi * Hour / 24),
    DayOfYear = yday(Date)
  )

cat("Modeling dataset:", nrow(model_data), "observations\n")
## Modeling dataset: 100 observations
cat("Complete cases (no missing values):", sum(complete.cases(model_data)), "\n")
## Complete cases (no missing values): 100

6.2 Model 1: Basic Model (Location and Time)

model1 <- lm(Temp_Diff ~ Lat  + alt_m + Hour_sin + Hour_cos + Device, 
             data = model_data)

summary(model1)
## 
## Call:
## lm(formula = Temp_Diff ~ Lat + alt_m + Hour_sin + Hour_cos + 
##     Device, data = model_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.934  -3.241   0.131   4.417  19.027 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.523e+03  2.985e+02  -5.102 1.74e-06 ***
## Lat          3.571e+01  7.132e+00   5.007 2.58e-06 ***
## alt_m       -2.007e-03  4.519e-03  -0.444   0.6579    
## Hour_sin    -2.086e+00  1.272e+00  -1.639   0.1045    
## Hour_cos     2.505e+00  1.165e+00   2.150   0.0341 *  
## DeviceRed   -4.772e-01  1.634e+00  -0.292   0.7709    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.996 on 94 degrees of freedom
## Multiple R-squared:  0.2814, Adjusted R-squared:  0.2432 
## F-statistic: 7.363 on 5 and 94 DF,  p-value: 7.309e-06
# Tidy output
model1_tidy <- tidy(model1, conf.int = TRUE)
knitr::kable(model1_tidy, digits = 4, 
             caption = "Model 1: Location and Time Effects on Temperature Error")
Model 1: Location and Time Effects on Temperature Error
term estimate std.error statistic p.value conf.low conf.high
(Intercept) -1523.0441 298.5085 -5.1022 0.0000 -2115.7397 -930.3485
Lat 35.7100 7.1320 5.0070 0.0000 21.5492 49.8708
alt_m -0.0020 0.0045 -0.4442 0.6579 -0.0110 0.0070
Hour_sin -2.0858 1.2723 -1.6394 0.1045 -4.6119 0.4403
Hour_cos 2.5052 1.1652 2.1501 0.0341 0.1917 4.8186
DeviceRed -0.4772 1.6343 -0.2920 0.7709 -3.7222 2.7678

6.2 Model 2: Extended Model (Adding Environmental Variables)

# Filter to complete cases for fair comparison
model_data_complete <- model_data %>%
  filter(!is.na(cloudcover) & !is.na(solar_radiation) & !is.na(windspeed))

if (nrow(model_data_complete) > 10) {
  model2 <- lm(Temp_Diff ~ Lat + alt_m + Hour_sin + Hour_cos + Device +
                 cloudcover + solar_radiation + windspeed, 
               data = model_data_complete)
  
  summary(model2)
  
  model2_tidy <- tidy(model2, conf.int = TRUE)
  knitr::kable(model2_tidy, digits = 4, 
               caption = "Model 2: Location, Time, and Environmental Effects")
  
  # Compare models
  cat("\nModel Comparison:\n")
  cat("Model 1 R-squared:", round(summary(model1)$r.squared, 4), "\n")
  cat("Model 1 Adjusted R-squared:", round(summary(model1)$adj.r.squared, 4), "\n")
  cat("Model 2 R-squared:", round(summary(model2)$r.squared, 4), "\n")
  cat("Model 2 Adjusted R-squared:", round(summary(model2)$adj.r.squared, 4), "\n")
} else {
  cat("Insufficient complete cases for Model 2.\n")
  cat("Available complete cases:", nrow(model_data_complete), "\n")
}
## 
## Model Comparison:
## Model 1 R-squared: 0.2814 
## Model 1 Adjusted R-squared: 0.2432 
## Model 2 R-squared: 0.3561 
## Model 2 Adjusted R-squared: 0.2994

Exercise 3

Exercise 3: Add Interaction Term(s)

In Model 2 above, add in one or more interaction term to the model to see if the Device makes any difference on the variable coefficients. Try with the most significant variables first. Example: add a Device*Lat variable. Copy model2 code above and paste into code chunk below, then and add in your interaction term. Rename your model the name model3 and then also compute R^2 and adjusted R^2 for model3

6.3 Initial Model Diagnostics

Before proceeding, let’s check for problematic observations that might be affecting our model.

# Check for multicollinearity
vif_values <- vif(model1)
cat("Variance Inflation Factors (VIF):\n")
## Variance Inflation Factors (VIF):
print(round(vif_values, 2))
##      Lat    alt_m Hour_sin Hour_cos   Device 
##     1.43     1.40     1.02     1.02     1.03
cat("\nNote: VIF > 10 indicates problematic multicollinearity\n\n")
## 
## Note: VIF > 10 indicates problematic multicollinearity
# Normality test
shapiro_test <- shapiro.test(residuals(model1))
shapiro_test2 <- shapiro.test(residuals(model2))
cat("Shapiro-Wilk Normality Test:\n")
## Shapiro-Wilk Normality Test:
cat("W statistic:", round(shapiro_test$statistic, 4), "\n")
## W statistic: 0.9523
cat("p-value:", format.pval(shapiro_test$p.value, digits = 3), "\n")
## p-value: 0.00119
if (shapiro_test$p.value < 0.05) {
  cat("WARNING: Residuals are NOT normally distributed (p < 0.05)\n\n")
} else {
  cat("Residuals appear normally distributed (p >= 0.05)\n\n")
}
## WARNING: Residuals are NOT normally distributed (p < 0.05)
# Heteroscedasticity test
bp_test <- bptest(model1)
bp_test2 <- bptest(model2)
cat("Breusch-Pagan Heteroscedasticity Test:\n")
## Breusch-Pagan Heteroscedasticity Test:
cat("BP statistic:", round(bp_test$statistic, 4), "\n")
## BP statistic: 39.2025
cat("p-value:", format.pval(bp_test$p.value, digits = 3), "\n")
## p-value: 2.16e-07
if (bp_test$p.value < 0.05) {
  cat("WARNING: Heteroscedasticity detected (p < 0.05)\n\n")
} else {
  cat("Homoscedasticity assumption met (p >= 0.05)\n\n")
}
## WARNING: Heteroscedasticity detected (p < 0.05)

Interpretation: If we see violations of normality or homoscedasticity, this often indicates the presence of outliers or influential points. Let’s investigate!

6.4 Identify Influential Points (Cook’s Distance)

Cook’s Distance measures how much a single observation influences the entire regression model. We will calculate this for model1

# Calculate Cook's distance
model_data$cooks_d <- cooks.distance(model1)
model_data$residuals <- residuals(model1)
model_data$predicted <- predict(model1)

# Find influential points (Cook's D > 4/n is a common threshold)
threshold <- 4 / nrow(model_data)
influential_points <- model_data %>%
  filter(cooks_d > threshold) %>%
  arrange(desc(cooks_d))

# Create influence plot
ggplot(model_data, aes(x = seq_along(cooks_d), y = cooks_d, color = Device)) +
  geom_point(alpha = 0.6, size = 3) +
  geom_hline(yintercept = threshold, linetype = "dashed", color = "red", linewidth = 1.2) +
  annotate("text", x = nrow(model_data) * 0.7, y = threshold, 
           label = paste0("Threshold (4/n) = ", round(threshold, 5)), 
           vjust = -0.5, color = "red") +
  labs(title = "Cook's Distance - Influence Plot",
       subtitle = "Points above red line are potentially influential",
       x = "Observation Index",
       y = "Cook's Distance") +
  theme_minimal() +
  scale_color_manual(values = c("Red" = "red", "Purple" = "purple"))

cat("\nINFLUENTIAL POINTS ANALYSIS\n")
## 
## INFLUENTIAL POINTS ANALYSIS
cat("===========================\n")
## ===========================
cat("Threshold (4/n):", round(threshold, 5), "\n")
## Threshold (4/n): 0.04
cat("Number of influential points:", nrow(influential_points), "\n\n")
## Number of influential points: 8
if (nrow(influential_points) > 0) {
  cat("Top 10 most influential observations:\n")
  influential_summary <- influential_points %>%
    select(TimeStampFull, Device, Temp_F, NOAA_Temp_F, Temp_Diff, 
           Temp_Abs_Diff, alt_m, cooks_d) %>%
    head(10)
  
  print(knitr::kable(influential_summary, digits = 3,
                     caption = "Most Influential Points"))
  
  # Visualize the influential points
  cat("\n")
  cat("Characteristics of influential points:\n")
  cat("- Mean temperature difference:", round(mean(influential_points$Temp_Diff), 2), "°F\n")
  cat("- Mean absolute error:", round(mean(influential_points$Temp_Abs_Diff), 2), "°F\n")
  cat("- Compare to overall mean abs error:", round(mean(model_data$Temp_Abs_Diff), 2), "°F\n")
}
## Top 10 most influential observations:
## 
## 
## Table: Most Influential Points
## 
## |TimeStampFull       |Device | Temp_F| NOAA_Temp_F| Temp_Diff| Temp_Abs_Diff| alt_m| cooks_d|
## |:-------------------|:------|------:|-----------:|---------:|-------------:|-----:|-------:|
## |2025-09-11 21:19:44 |Red    |  34.70|        64.8|    -30.10|         30.10|  2871|   0.110|
## |2025-09-11 23:32:56 |Red    |  34.70|        63.4|    -28.70|         28.70|  2871|   0.101|
## |2025-09-11 23:52:20 |Red    |  34.70|        63.3|    -28.60|         28.60|  2871|   0.100|
## |2025-09-11 18:11:56 |Red    |  34.70|        64.0|    -29.30|         29.30|  2871|   0.089|
## |2025-09-11 10:36:21 |Red    |  44.06|        38.0|      6.06|          6.06|  2877|   0.070|
## |2025-09-11 17:16:44 |Red    |  34.70|        57.9|    -23.20|         23.20|  2871|   0.044|
## |2025-09-08 20:51:17 |Red    |  78.98|        62.0|     16.98|         16.98|  2758|   0.042|
## |2025-09-11 17:10:02 |Red    |  34.70|        57.6|    -22.90|         22.90|  2871|   0.042|
## 
## Characteristics of influential points:
## - Mean temperature difference: -17.47 °F
## - Mean absolute error: 23.23 °F
## - Compare to overall mean abs error: 6.31 °F

6.5 Visualize Influential Points

# Scatter plot highlighting influential points
model_data$is_influential <- model_data$cooks_d > threshold

ggplot(model_data, aes(x = Temp_Mean, y = Temp_Diff)) +
  geom_point(aes(color = is_influential, size = is_influential, alpha = is_influential)) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "black") +
  scale_color_manual(values = c("FALSE" = "gray60", "TRUE" = "red"),
                     labels = c("Normal", "Influential"),
                     name = "Point Type") +
  scale_size_manual(values = c("FALSE" = 2, "TRUE" = 4),
                    labels = c("Normal", "Influential"),
                    name = "Point Type") +
  scale_alpha_manual(values = c("FALSE" = 0.5, "TRUE" = 1),
                     labels = c("Normal", "Influential"),
                     name = "Point Type") +
  labs(title = "Temperature Errors with Influential Points Highlighted",
       subtitle = paste("Red points exceed Cook's D threshold of", round(threshold, 5)),
       x = "Mean Temperature (°F)",
       y = "Temperature Difference (Sensor - NOAA) (°F)") +
  theme_minimal()

# Box plot comparing influential vs non-influential
ggplot(model_data, aes(x = is_influential, y = Temp_Abs_Diff, fill = is_influential)) +
  geom_boxplot() +
  scale_fill_manual(values = c("FALSE" = "lightblue", "TRUE" = "red"),
                    labels = c("Normal Points", "Influential Points"),
                    name = "") +
  labs(title = "Absolute Temperature Error: Normal vs Influential Points",
       x = "Point Type",
       y = "Absolute Temperature Error (°F)") +
  scale_x_discrete(labels = c("FALSE" = "Normal", "TRUE" = "Influential")) +
  theme_minimal()

6.6 Remove Influential Points and Refit Model

Now let’s remove the influential points and see how it affects our models.

cat("DATA CLEANING DECISION\n")
## DATA CLEANING DECISION
cat("======================\n")
## ======================
cat("Original dataset:", nrow(model_data), "observations\n")
## Original dataset: 100 observations
cat("Influential points to remove:", sum(model_data$is_influential), "observations\n")
## Influential points to remove: 8 observations
cat("Cleaned dataset:", sum(!model_data$is_influential), "observations\n")
## Cleaned dataset: 92 observations
cat("Percentage removed:", round(sum(model_data$is_influential) / nrow(model_data) * 100, 2), "%\n\n")
## Percentage removed: 8 %
# Create cleaned dataset
model_data_clean <- model_data %>%
  filter(!is_influential)

# Refit the model with cleaned data
model1_clean <- lm(Temp_Diff ~ Lat + alt_m + Hour_sin + Hour_cos + Device, 
                   data = model_data_clean)

model2_clean <- lm(Temp_Diff ~ Lat + alt_m + Hour_sin + Hour_cos + Device +
                 cloudcover + solar_radiation + windspeed, 
               data = model_data_clean)

cat("MODEL COMPARISON: BEFORE vs AFTER REMOVING INFLUENTIAL POINTS\n")
## MODEL COMPARISON: BEFORE vs AFTER REMOVING INFLUENTIAL POINTS
cat("=============================================================\n\n")
## =============================================================
cat("ORIGINAL MODEL 1 (with influential points):\n")
## ORIGINAL MODEL 1 (with influential points):
cat("R-squared:", round(summary(model1)$r.squared, 4), "\n")
## R-squared: 0.2814
cat("Adjusted R-squared:", round(summary(model1)$adj.r.squared, 4), "\n")
## Adjusted R-squared: 0.2432
cat("RMSE:", round(sqrt(mean(residuals(model1)^2)), 4), "°F\n")
## RMSE: 7.7523 °F
cat("Residual Std Error:", round(summary(model1)$sigma, 4), "°F\n\n")
## Residual Std Error: 7.9959 °F
cat("CLEANED MODEL 1 (influential points removed):\n")
## CLEANED MODEL 1 (influential points removed):
cat("R-squared:", round(summary(model1_clean)$r.squared, 4), "\n")
## R-squared: 0.3849
cat("Adjusted R-squared:", round(summary(model1_clean)$adj.r.squared, 4), "\n")
## Adjusted R-squared: 0.3491
cat("RMSE:", round(sqrt(mean(residuals(model1_clean)^2)), 4), "°F\n")
## RMSE: 4.755 °F
cat("Residual Std Error:", round(summary(model1_clean)$sigma, 4), "°F\n\n")
## Residual Std Error: 4.9181 °F
cat("ORIGINAL MODEL 2 (with influential points):\n")
## ORIGINAL MODEL 2 (with influential points):
cat("R-squared:", round(summary(model2)$r.squared, 4), "\n")
## R-squared: 0.3561
cat("Adjusted R-squared:", round(summary(model2)$adj.r.squared, 4), "\n")
## Adjusted R-squared: 0.2994
cat("RMSE:", round(sqrt(mean(residuals(model2)^2)), 4), "°F\n")
## RMSE: 7.3388 °F
cat("Residual Std Error:", round(summary(model2)$sigma, 4), "°F\n\n")
## Residual Std Error: 7.6931 °F
cat("CLEANED MODEL 2 (influential points removed):\n")
## CLEANED MODEL 2 (influential points removed):
cat("R-squared:", round(summary(model2_clean)$r.squared, 4), "\n")
## R-squared: 0.4463
cat("Adjusted R-squared:", round(summary(model2_clean)$adj.r.squared, 4), "\n")
## Adjusted R-squared: 0.393
cat("RMSE:", round(sqrt(mean(residuals(model2_clean)^2)), 4), "°F\n")
## RMSE: 4.5113 °F
cat("Residual Std Error:", round(summary(model2_clean)$sigma, 4), "°F\n\n")
## Residual Std Error: 4.7496 °F
# Improvement metrics
r2_improvement <- (summary(model1_clean)$r.squared - summary(model1)$r.squared) * 100
rmse_improvement <- ((sqrt(mean(residuals(model1)^2)) - sqrt(mean(residuals(model1_clean)^2))) / 
                     sqrt(mean(residuals(model1)^2))) * 100

cat("IMPROVEMENTS MODEL 1:\n")
## IMPROVEMENTS MODEL 1:
cat("R-squared increase:", round(r2_improvement, 2), "percentage points\n")
## R-squared increase: 10.35 percentage points
cat("RMSE reduction:", round(rmse_improvement, 2), "%\n")
## RMSE reduction: 38.66 %
r2_improvement2 <- (summary(model2_clean)$r.squared - summary(model2)$r.squared) * 100
rmse_improvement2 <- ((sqrt(mean(residuals(model2)^2)) - sqrt(mean(residuals(model2_clean)^2))) / 
                     sqrt(mean(residuals(model2)^2))) * 100

cat("IMPROVEMENTS MODEL 2:\n")
## IMPROVEMENTS MODEL 2:
cat("R-squared increase:", round(r2_improvement2, 2), "percentage points\n")
## R-squared increase: 9.03 percentage points
cat("RMSE reduction:", round(rmse_improvement2, 2), "%\n")
## RMSE reduction: 38.53 %
# Show full model 1 summary
cat("\n")
summary(model1_clean)
## 
## Call:
## lm(formula = Temp_Diff ~ Lat + alt_m + Hour_sin + Hour_cos + 
##     Device, data = model_data_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -20.4648  -2.6905   0.0476   2.7094  14.3754 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.286e+02  2.148e+02  -1.995   0.0492 *  
## Lat          9.684e+00  5.122e+00   1.891   0.0621 .  
## alt_m        4.441e-03  2.840e-03   1.564   0.1216    
## Hour_sin    -3.692e+00  8.113e-01  -4.551 1.74e-05 ***
## Hour_cos     3.499e+00  7.415e-01   4.719 9.11e-06 ***
## DeviceRed    1.863e+00  1.045e+00   1.783   0.0781 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.918 on 86 degrees of freedom
## Multiple R-squared:  0.3849, Adjusted R-squared:  0.3491 
## F-statistic: 10.76 on 5 and 86 DF,  p-value: 4.692e-08
model1_clean_tidy <- tidy(model1_clean, conf.int = TRUE)
knitr::kable(model1_clean_tidy, digits = 4, 
             caption = "Cleaned Model 1: Location and Time Effects on Temperature Error (Influential Points Removed)")
Cleaned Model 1: Location and Time Effects on Temperature Error (Influential Points Removed)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) -428.5773 214.8415 -1.9949 0.0492 -855.6681 -1.4864
Lat 9.6837 5.1222 1.8905 0.0621 -0.4989 19.8663
alt_m 0.0044 0.0028 1.5636 0.1216 -0.0012 0.0101
Hour_sin -3.6919 0.8113 -4.5507 0.0000 -5.3047 -2.0791
Hour_cos 3.4989 0.7415 4.7189 0.0000 2.0249 4.9728
DeviceRed 1.8635 1.0449 1.7833 0.0781 -0.2138 3.9407
# Show full model 2 summary
cat("\n")
summary(model2_clean)
## 
## Call:
## lm(formula = Temp_Diff ~ Lat + alt_m + Hour_sin + Hour_cos + 
##     Device + cloudcover + solar_radiation + windspeed, data = model_data_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.8425  -2.2988   0.1258   2.7823  13.5543 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)   
## (Intercept)     -7.326e+02  2.656e+02  -2.758  0.00715 **
## Lat              1.687e+01  6.332e+00   2.664  0.00928 **
## alt_m            3.336e-03  3.102e-03   1.075  0.28531   
## Hour_sin        -3.848e+00  2.038e+00  -1.888  0.06248 . 
## Hour_cos         3.216e+00  1.188e+00   2.707  0.00823 **
## DeviceRed        2.973e-01  1.134e+00   0.262  0.79382   
## cloudcover       5.865e-02  2.045e-02   2.868  0.00524 **
## solar_radiation  1.068e-03  4.259e-03   0.251  0.80263   
## windspeed       -1.424e-01  1.274e-01  -1.118  0.26688   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.75 on 83 degrees of freedom
## Multiple R-squared:  0.4463, Adjusted R-squared:  0.393 
## F-statistic: 8.363 on 8 and 83 DF,  p-value: 2.949e-08
model2_clean_tidy <- tidy(model2_clean, conf.int = TRUE)
knitr::kable(model2_clean_tidy, digits = 4, 
             caption = "Cleaned Model 2: Location, Time, and Environmental Effects on Temperature Error (Influential Points Removed)")
Cleaned Model 2: Location, Time, and Environmental Effects on Temperature Error (Influential Points Removed)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) -732.5966 265.6347 -2.7579 0.0072 -1260.9333 -204.2598
Lat 16.8665 6.3320 2.6637 0.0093 4.2723 29.4607
alt_m 0.0033 0.0031 1.0754 0.2853 -0.0028 0.0095
Hour_sin -3.8479 2.0377 -1.8883 0.0625 -7.9008 0.2051
Hour_cos 3.2156 1.1877 2.7074 0.0082 0.8533 5.5780
DeviceRed 0.2973 1.1340 0.2622 0.7938 -1.9581 2.5527
cloudcover 0.0587 0.0205 2.8677 0.0052 0.0180 0.0993
solar_radiation 0.0011 0.0043 0.2507 0.8026 -0.0074 0.0095
windspeed -0.1424 0.1274 -1.1178 0.2669 -0.3957 0.1110

6.7 Verify Diagnostic Assumptions on Cleaned Model

cat("DIAGNOSTIC TESTS ON CLEANED MODEL 1\n")
## DIAGNOSTIC TESTS ON CLEANED MODEL 1
cat("==================================\n\n")
## ==================================
# Normality test
shapiro_test_clean <- shapiro.test(residuals(model1_clean))
cat("Shapiro-Wilk Normality Test:\n")
## Shapiro-Wilk Normality Test:
cat("W statistic:", round(shapiro_test_clean$statistic, 4), "\n")
## W statistic: 0.955
cat("p-value:", format.pval(shapiro_test_clean$p.value, digits = 3), "\n")
## p-value: 0.00305
if (shapiro_test_clean$p.value < 0.05) {
  cat("Result: Residuals still show departure from normality\n\n")
} else {
  cat("Result: ✓ Residuals are normally distributed\n\n")
}
## Result: Residuals still show departure from normality
# Heteroscedasticity test
bp_test_clean <- bptest(model1_clean)
cat("Breusch-Pagan Heteroscedasticity Test:\n")
## Breusch-Pagan Heteroscedasticity Test:
cat("BP statistic:", round(bp_test_clean$statistic, 4), "\n")
## BP statistic: 8.5065
cat("p-value:", format.pval(bp_test_clean$p.value, digits = 3), "\n")
## p-value: 0.13
if (bp_test_clean$p.value < 0.05) {
  cat("Result: Some heteroscedasticity still present\n\n")
} else {
  cat("Result: ✓ Homoscedasticity assumption met\n\n")
}
## Result: ✓ Homoscedasticity assumption met
# Compare test statistics
cat("COMPARISON OF DIAGNOSTIC TESTS:\n")
## COMPARISON OF DIAGNOSTIC TESTS:
cat("                           Original Model    Cleaned Model\n")
##                            Original Model    Cleaned Model
cat("Shapiro-Wilk p-value:     ", format.pval(shapiro_test$p.value, digits = 4), "      ", 
    format.pval(shapiro_test_clean$p.value, digits = 4), "\n")
## Shapiro-Wilk p-value:      0.001186        0.003045
cat("Breusch-Pagan p-value:    ", format.pval(bp_test$p.value, digits = 4), "      ", 
    format.pval(bp_test_clean$p.value, digits = 4), "\n")
## Breusch-Pagan p-value:     2.162e-07        0.1304
# Diagnostic plots for cleaned model
par(mfrow = c(2, 2))
plot(model1_clean, which = 1:4)

par(mfrow = c(1, 1))

cat("DIAGNOSTIC TESTS ON CLEANED MODEL 2\n")
## DIAGNOSTIC TESTS ON CLEANED MODEL 2
cat("==================================\n\n")
## ==================================
# Normality test
shapiro_test_clean2 <- shapiro.test(residuals(model2_clean))
cat("Shapiro-Wilk Normality Test:\n")
## Shapiro-Wilk Normality Test:
cat("W statistic:", round(shapiro_test_clean2$statistic, 4), "\n")
## W statistic: 0.953
cat("p-value:", format.pval(shapiro_test_clean2$p.value, digits = 3), "\n")
## p-value: 0.00225
if (shapiro_test_clean2$p.value < 0.05) {
  cat("Result: Residuals still show departure from normality\n\n")
} else {
  cat("Result: ✓ Residuals are normally distributed\n\n")
}
## Result: Residuals still show departure from normality
# Heteroscedasticity test
bp_test_clean2 <- bptest(model2_clean)
cat("Breusch-Pagan Heteroscedasticity Test:\n")
## Breusch-Pagan Heteroscedasticity Test:
cat("BP statistic:", round(bp_test_clean2$statistic, 4), "\n")
## BP statistic: 12.1473
cat("p-value:", format.pval(bp_test_clean2$p.value, digits = 3), "\n")
## p-value: 0.145
if (bp_test_clean2$p.value < 0.05) {
  cat("Result: Some heteroscedasticity still present\n\n")
} else {
  cat("Result: ✓ Homoscedasticity assumption met\n\n")
}
## Result: ✓ Homoscedasticity assumption met
# Compare test statistics
cat("COMPARISON OF DIAGNOSTIC TESTS:\n")
## COMPARISON OF DIAGNOSTIC TESTS:
cat("                           Original Model    Cleaned Model\n")
##                            Original Model    Cleaned Model
cat("Shapiro-Wilk p-value:     ", format.pval(shapiro_test2$p.value, digits = 4), "      ", 
    format.pval(shapiro_test_clean2$p.value, digits = 4), "\n")
## Shapiro-Wilk p-value:      0.001141        0.002252
cat("Breusch-Pagan p-value:    ", format.pval(bp_test2$p.value, digits = 4), "      ", 
    format.pval(bp_test_clean2$p.value, digits = 4), "\n")
## Breusch-Pagan p-value:     7.469e-06        0.1448
# Diagnostic plots for cleaned model
par(mfrow = c(2, 2))
plot(model2_clean, which = 1:4)

par(mfrow = c(1, 1))

Key Learning Points: - Removing influential points can dramatically improve model fit and assumption compliance - However, consider WHY these points are influential - they may contain important information - Document any data removal decisions and justify them scientifically - The cleaned model provides more reliable coefficient estimates - There still are issues with the normality assumption (failed Shapiro-Wilk test) - There still seem to be additional points that are still influential. Consider addressing issues with them as well.

Exercise 4

Exercise 4: Investigate the influential points. Are they measurement errors, unusual environmental conditions, or legitimate extreme events? Should they be removed or studied separately? What do you think we should do with these influential points?

6.8 Model Predictions and Visualization

We will use model2_clean (with the influential points removed) with the dataset model_data_clean for subsequent analysis. If you think you like a different model better, replace every instance of “model2_clean” with “model1”, “model1_clean”, or “model2_clean” and model_data_clean with model_data accordingly.

# Add predictions and residuals for model 2 clean
model_data_clean$predicted <- predict(model2_clean, newdata = model_data_clean)
model_data_clean$residuals <- residuals(model2_clean, newdata = model_data_clean)

# Predicted vs Actual for model 1
ggplot(model_data_clean, aes(x = predicted, y = Temp_Diff)) +
  geom_point(aes(color = Device), alpha = 0.6, size = 3) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "black") +
  labs(title = "Model 2 Clean: Predicted vs Actual Temperature Error",
       x = "Predicted Temperature Error (°F)",
       y = "Actual Temperature Error (°F)") +
  theme_minimal() +
  scale_color_manual(values = c("Red" = "red", "Purple" = "purple"))

# Residuals by device for model 1
ggplot(model_data_clean, aes(x = Device, y = residuals, fill = Device)) +
  geom_boxplot() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Model 2 Clean: Residuals by Device",
       x = "Device",
       y = "Residuals (°F)") +
  theme_minimal() +
  scale_fill_manual(values = c("Red" = "red", "Purple" = "purple"))

6.9 Model Summary Table

# Create comprehensive model summary
model_summary <- data.frame(
  Model = c("Model 1", if(exists("model2")) "Model 2" else NULL),
  Predictors = c("Location + Time + Device", 
                 if(exists("model2")) "Location + Time + Device + Environment" else NULL),
  N = c(nobs(model1), if(exists("model2")) nobs(model2) else NULL),
  R_squared = c(summary(model1)$r.squared, 
                if(exists("model2")) summary(model2)$r.squared else NULL),
  Adj_R_squared = c(summary(model1)$adj.r.squared, 
                    if(exists("model2")) summary(model2)$adj.r.squared else NULL),
  RMSE = c(sqrt(mean(residuals(model1)^2)), 
           if(exists("model2")) sqrt(mean(residuals(model2)^2)) else NULL),
  AIC = c(AIC(model1), if(exists("model2")) AIC(model2) else NULL),
  BIC = c(BIC(model1), if(exists("model2")) BIC(model2) else NULL)
)

knitr::kable(model_summary, digits = 4, 
             caption = "Model Performance Comparison for Full Data",
             col.names = c("Model", "Predictors", "N", "R²", "Adj R²", "RMSE", "AIC", "BIC"))
Model Performance Comparison for Full Data
Model Predictors N Adj R² RMSE AIC BIC
Model 1 Location + Time + Device 100 0.2814 0.2432 7.7523 707.3865 725.6227
Model 2 Location + Time + Device + Environment 100 0.3561 0.2994 7.3388 702.4219 728.4736
# Create comprehensive model summary for cleaned models
model_summary2 <- data.frame(
  Model = c("Model 1 Clean", if(exists("model2_clean")) "Model 2 Clean" else NULL),
  Predictors = c("Location + Time + Device", 
                 if(exists("model2_clean")) "Location + Time + Device + Environment" else NULL),
  N = c(nobs(model1_clean), if(exists("model2_clean")) nobs(model2_clean) else NULL),
  R_squared = c(summary(model1_clean)$r.squared, 
                if(exists("model2_clean")) summary(model2_clean)$r.squared else NULL),
  Adj_R_squared = c(summary(model1_clean)$adj.r.squared, 
                    if(exists("model2_clean")) summary(model2_clean)$adj.r.squared else NULL),
  RMSE = c(sqrt(mean(residuals(model1)^2)), 
           if(exists("model2_clean")) sqrt(mean(residuals(model2_clean)^2)) else NULL),
  AIC = c(AIC(model1_clean), if(exists("model2_clean")) AIC(model2_clean) else NULL),
  BIC = c(BIC(model1_clean), if(exists("model2_clean")) BIC(model2_clean) else NULL)
)

knitr::kable(model_summary2, digits = 4, 
             caption = "Model Performance Comparison for Cleaned Data",
             col.names = c("Model", "Predictors", "N", "R²", "Adj R²", "RMSE", "AIC", "BIC"))
Model Performance Comparison for Cleaned Data
Model Predictors N Adj R² RMSE AIC BIC
Model 1 Clean Location + Time + Device 92 0.3849 0.3491 7.7523 561.9764 579.6290
Model 2 Clean Location + Time + Device + Environment 92 0.4463 0.3930 4.5113 558.2972 583.5151

6.10 Coefficient Plot

# Create model 1 coefficient plot with confidence intervals
coef_plot_data <- model1_clean_tidy %>%
  filter(term != "(Intercept)") %>%
  mutate(term = factor(term, levels = rev(term)))

ggplot(coef_plot_data, aes(x = estimate, y = term)) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "gray50") +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0.2, linewidth = 1) +
  geom_point(size = 3, color = "blue") +
  labs(title = "Model 1 (Clean) Coefficients with 95% Confidence Intervals",
       subtitle = "Effects on Temperature Error (°F)",
       x = "Coefficient Estimate",
       y = "Predictor Variable") +
  theme_minimal() +
  theme(axis.text.y = element_text(hjust = 1))

# Create model 2 coefficient plot with confidence intervals
coef_plot_data2 <- model2_clean_tidy %>%
  filter(term != "(Intercept)") %>%
  mutate(term = factor(term, levels = rev(term)))

ggplot(coef_plot_data2, aes(x = estimate, y = term)) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "gray50") +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0.2, linewidth = 1) +
  geom_point(size = 3, color = "blue") +
  labs(title = "Model 2 (Clean) Coefficients with 95% Confidence Intervals",
       subtitle = "Effects on Temperature Error (°F)",
       x = "Coefficient Estimate",
       y = "Predictor Variable") +
  theme_minimal() +
  theme(axis.text.y = element_text(hjust = 1))

6.11 Interactive 3D Visualization

# 3D scatter plot: Lat, Lon, Altitude colored by temp error
plot_ly(model_data, 
        x = ~Lon, y = ~Lat, z = ~alt_m,
        color = ~Temp_Diff, 
        colors = colorRamp(c("blue", "white", "red")),
        marker = list(size = 5),
        text = ~paste0("Device: ", Device, 
                      "<br>Temp Error: ", round(Temp_Diff, 2), "°F",
                      "<br>Altitude: ", round(alt_m, 0), "m"),
        hoverinfo = "text") %>%
  add_markers() %>%
  layout(title = "Temperature Error in 3D Space",
         scene = list(
           xaxis = list(title = "Longitude"),
           yaxis = list(title = "Latitude"),
           zaxis = list(title = "Altitude (m)")
         ))

6.12 Key Findings and Interpretation

# Extract significant predictors (p < 0.05)
significant_predictors <- model2_clean_tidy %>%
  filter(p.value < 0.05, term != "(Intercept)")

cat("REGRESSION ANALYSIS SUMMARY FOR MODEL 2 CLEAN (with influential points removed)\n")
## REGRESSION ANALYSIS SUMMARY FOR MODEL 2 CLEAN (with influential points removed)
cat("======================================================\n\n")
## ======================================================
cat("Model Performance:\n")
## Model Performance:
cat("- R-squared:", round(summary(model2_clean)$r.squared, 4), 
    "(model explains", round(summary(model2_clean)$r.squared * 100, 1), "% of variance)\n")
## - R-squared: 0.4463 (model explains 44.6 % of variance)
cat("- Adjusted R-squared:", round(summary(model2_clean)$adj.r.squared, 4), "\n")
## - Adjusted R-squared: 0.393
cat("- RMSE:", round(sqrt(mean(residuals(model2_clean)^2)), 3), "°F\n")
## - RMSE: 4.511 °F
cat("- F-statistic:", round(summary(model2_clean)$fstatistic[1], 2), 
    "on", summary(model2_clean)$fstatistic[2], "and", summary(model2_clean)$fstatistic[3], "DF\n")
## - F-statistic: 8.36 on 8 and 83 DF
cat("- Overall p-value:", format.pval(pf(summary(model2_clean)$fstatistic[1], 
    summary(model2_clean)$fstatistic[2], summary(model2_clean)$fstatistic[3], 
    lower.tail = FALSE), digits = 3), "\n\n")
## - Overall p-value: 2.95e-08
if (nrow(significant_predictors) > 0) {
  cat("Significant Predictors (p < 0.05):\n")
  for (i in 1:nrow(significant_predictors)) {
    cat("-", significant_predictors$term[i], ":", 
        round(significant_predictors$estimate[i], 4),
        "(p =", format.pval(significant_predictors$p.value[i], digits = 3), ")\n")
  }
} else {
  cat("No significant predictors at p < 0.05 level.\n")
}
## Significant Predictors (p < 0.05):
## - Lat : 16.8665 (p = 0.00928 )
## - Hour_cos : 3.2156 (p = 0.00823 )
## - cloudcover : 0.0587 (p = 0.00524 )
cat("\nInterpretation:\n")
## 
## Interpretation:
cat("- Device effect:", 
    if("DeviceRed" %in% significant_predictors$term) {
      "Red device shows significantly different error pattern"
    } else {
      "No significant difference between devices"
    }, "\n")
## - Device effect: No significant difference between devices
cat("- Spatial effects:", 
    if(any(c("Lat", "Lon", "alt_m") %in% significant_predictors$term)) {
      "Location/elevation significantly affects temperature error"
    } else {
      "Location/elevation does not significantly affect error"
    }, "\n")
## - Spatial effects: Location/elevation significantly affects temperature error
cat("- Temporal effects:", 
    if(any(c("Hour_sin", "Hour_cos") %in% significant_predictors$term)) {
      "Time of day significantly affects temperature error"
    } else {
      "Time of day does not significantly affect error"
    }, "\n")
## - Temporal effects: Time of day significantly affects temperature error
cat("- Environmental effects:", 
    if(any(c("cloudcover", "solar_radiation", "windspeed") %in% significant_predictors$term)) {
      "Environment significantly affects temperature error"
    } else {
      "Environment does not significantly affect error"
    }, "\n")
## - Environmental effects: Environment significantly affects temperature error

Exercise 5

Exercise 5: Based on the regression results, what are the most important factors influencing temperature measurement errors and how did you identify them?

7. Advanced Diagnostic Plots

7.1 Residuals vs Fitted by Device

# Enhanced residual plot
ggplot(model_data_clean, aes(x = predicted, y = residuals, color = Device)) +
  geom_point(alpha = 0.6, size = 3) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "black") +
  geom_smooth(method = "loess", se = TRUE, linewidth = 1) +
  labs(title = "Residual Analysis by Device",
       x = "Fitted Values (°F)",
       y = "Residuals (°F)") +
  theme_minimal() +
  scale_color_manual(values = c("Red" = "red", "Purple" = "purple"))

7.2 Q-Q Plot by Device

ggplot(model_data_clean, aes(sample = residuals, color = Device)) +
  stat_qq() +
  stat_qq_line() +
  facet_wrap(~Device) +
  labs(title = "Normal Q-Q Plots by Device",
       x = "Theoretical Quantiles",
       y = "Sample Quantiles") +
  theme_minimal() +
  scale_color_manual(values = c("Red" = "red", "Purple" = "purple"))

7.3 Scale-Location Plot

model_data_clean$sqrt_abs_residuals <- sqrt(abs(model_data_clean$residuals))

ggplot(model_data_clean, aes(x = predicted, y = sqrt_abs_residuals, color = Device)) +
  geom_point(alpha = 0.6, size = 3) +
  geom_smooth(method = "loess", se = TRUE) +
  labs(title = "Scale-Location Plot",
       subtitle = "Check for homoscedasticity",
       x = "Fitted Values (°F)",
       y = expression(sqrt("|Standardized Residuals|"))) +
  theme_minimal() +
  scale_color_manual(values = c("Red" = "red", "Purple" = "purple"))

7.4 Influence Plot

# Calculate Cook's distance
model_data_clean$cooks_d <- cooks.distance(model2_clean)

# Find influential points (Cook's D > 4/n)
threshold <- 4 / nrow(model_data_clean)
influential_points <- model_data_clean %>%
  filter(cooks_d > threshold) %>%
  arrange(desc(cooks_d))

ggplot(model_data_clean, aes(x = seq_along(cooks_d), y = cooks_d, color = Device)) +
  geom_point(alpha = 0.6, size = 3) +
  geom_hline(yintercept = threshold, linetype = "dashed", color = "red") +
  labs(title = "Cook's Distance - Influence Plot",
       subtitle = paste("Red line indicates threshold (4/n =", round(threshold, 4), ")"),
       x = "Observation Index",
       y = "Cook's Distance") +
  theme_minimal() +
  scale_color_manual(values = c("Red" = "red", "Purple" = "purple"))

if (nrow(influential_points) > 0) {
  cat("\nInfluential observations (Cook's D >", round(threshold, 4), "):\n")
  cat("Number of influential points:", nrow(influential_points), "\n")
  cat("Top 5 most influential:\n")
  print(head(influential_points %>% 
               select(TimeStampFull, Device, Temp_Diff, predicted, cooks_d), 5))
}
## 
## Influential observations (Cook's D > 0.0435 ):
## Number of influential points: 6 
## Top 5 most influential:
## # A tibble: 5 × 5
##   TimeStampFull       Device Temp_Diff predicted cooks_d
##   <dttm>              <chr>      <dbl>     <dbl>   <dbl>
## 1 2025-09-11 16:30:02 Red       -20.6     -1.76   0.239 
## 2 2025-09-08 23:18:47 Red        11.9      0.681  0.0662
## 3 2025-09-09 15:25:35 Red        -3.18     2.93   0.0538
## 4 2024-09-12 16:52:36 Purple     -2.42     2.99   0.0481
## 5 2025-09-04 21:49:32 Red        21.4      7.81   0.0469

8. Spatial Regression Analysis

8.1 Error Hotspot Map

# Create error magnitude categories
model_data_clean <- model_data_clean %>%
  mutate(
    Error_Magnitude = cut(Temp_Abs_Diff, 
                          breaks = c(0, 1, 3, 5, Inf),
                          labels = c("Very Low (<1°F)", "Low (1-3°F)", 
                                   "Medium (3-5°F)", "High (>5°F)"))
  )

# Color palette
error_pal <- colorFactor(
  palette = c("#2ecc71", "#f1c40f", "#e67e22", "#e74c3c"),
  domain = model_data_clean$Error_Magnitude
)

hotspot_map <- leaflet(model_data_clean) %>%
  addProviderTiles("Esri.WorldImagery", group = "Satellite") %>%
  addProviderTiles(providers$Esri.WorldTopoMap, group = "Topo") %>%
  addWMSTiles(
    baseUrl = "https://basemap.nationalmap.gov/arcgis/services/USGSTopo/MapServer/WMSServer?",
    layers = "0",
    options = WMSTileOptions(format = "image/png", transparent = TRUE),
    attribution = "USGS The National Map",
    group = "USGS Topo"  
  ) %>%
  
  addCircleMarkers(
    lng = ~Lon,
    lat = ~Lat,
    radius = ~Temp_Abs_Diff * 2,  # Size proportional to error magnitude
    color = ~error_pal(Error_Magnitude),
    fillOpacity = 0.7,
    stroke = TRUE,
    weight = 1,
    label = ~paste0("Error: ", round(Temp_Diff, 2), "°F | Magnitude: ", Error_Magnitude),
    popup = ~paste0(
      "<b>Error Hotspot Analysis</b><br>",
      "<b>Device:</b> ", Device, "<br>",
      "<b>Temperature Error:</b> ", round(Temp_Diff, 2), "°F<br>",
      "<b>Absolute Error:</b> ", round(Temp_Abs_Diff, 2), "°F<br>",
      "<b>Magnitude:</b> ", Error_Magnitude, "<br>",
      "<b>Model Prediction:</b> ", round(predicted, 2), "°F<br>",
      "<b>Residual:</b> ", round(residuals, 2), "°F<br>",
      "<hr>",
      "<b>Location:</b><br>",
      "Lat: ", round(Lat, 5), "<br>",
      "Lon: ", round(Lon, 5), "<br>",
      "Altitude: ", round(alt_m, 0), "m<br>",
      "<b>Time:</b> ", format(TimeStampFull, "%Y-%m-%d %H:%M")
    )
  ) %>%
  
  addLayersControl(
    baseGroups = c("Satellite", "Topo", "USGS Topo"),
    options = layersControlOptions(collapsed = FALSE)
  ) %>%
  
  addLegend(
    position = "bottomright",
    pal = error_pal,
    values = ~Error_Magnitude,
    title = "Error Magnitude",
    opacity = 1
  ) %>%
  
  addScaleBar(position = "bottomleft")

hotspot_map

8.2 Model Residuals Map

# Color palette for residuals
residual_pal <- colorNumeric(
  palette = "RdBu",
  domain = c(-max(abs(model_data_clean$residuals)), max(abs(model_data_clean$residuals))),
  reverse = TRUE
)

residuals_map <- leaflet(model_data_clean) %>%
  addProviderTiles("Esri.WorldTopoMap") %>%
  
  addCircleMarkers(
    lng = ~Lon,
    lat = ~Lat,
    radius = 6,
    color = ~residual_pal(residuals),
    fillOpacity = 0.8,
    stroke = TRUE,
    weight = 1,
    label = ~paste0("Residual: ", round(residuals, 2), "°F"),
    popup = ~paste0(
      "<b>Model Residual Analysis</b><br>",
      "<b>Observed:</b> ", round(Temp_Diff, 2), "°F<br>",
      "<b>Predicted:</b> ", round(predicted, 2), "°F<br>",
      "<b>Residual:</b> ", round(residuals, 2), "°F<br>",
      "<b>Device:</b> ", Device, "<br>",
      "<b>Time:</b> ", format(TimeStampFull, "%Y-%m-%d %H:%M")
    )
  ) %>%
  
  addLegend(
    position = "bottomright",
    pal = residual_pal,
    values = ~residuals,
    title = "Residuals (°F)<br>Red = Over-predicted<br>Blue = Under-predicted",
    opacity = 1
  )

residuals_map

9. Summary and Conclusions

9.1 Key Statistical Findings

cat("STATISTICAL ANALYSIS SUMMARY\n")
## STATISTICAL ANALYSIS SUMMARY
cat("============================\n\n")
## ============================
cat("1. BLAND-ALTMAN ANALYSIS\n")
## 1. BLAND-ALTMAN ANALYSIS
cat("   Temperature:\n")
##    Temperature:
cat("   - Mean Bias:", round(temp_bias, 3), "°F\n")
##    - Mean Bias: -0.08 °F
cat("   - 95% Limits of Agreement:", round(temp_lower_loa, 3), "to", round(temp_upper_loa, 3), "°F\n")
##    - 95% Limits of Agreement: -18.095 to 17.935 °F
cat("   - Within limits:", 
    round(sum(comparison_data$Temp_Diff >= temp_lower_loa & 
              comparison_data$Temp_Diff <= temp_upper_loa, na.rm = TRUE) / 
          sum(!is.na(comparison_data$Temp_Diff)) * 100, 1), "%\n\n")
##    - Within limits: 91 %
cat("2. PAIRED T-TESTS\n")
## 2. PAIRED T-TESTS
cat("   Temperature: t =", round(temp_ttest$statistic, 3), 
    ", p =", format.pval(temp_ttest$p.value, digits = 3), "\n")
##    Temperature: t = -0.087 , p = 0.931
cat("   Humidity: t =", round(hum_ttest$statistic, 3), 
    ", p =", format.pval(hum_ttest$p.value, digits = 3), "\n")
##    Humidity: t = 3.364 , p = 0.00109
cat("   Pressure: t =", round(press_ttest$statistic, 3), 
    ", p =", format.pval(press_ttest$p.value, digits = 3), "\n\n")
##    Pressure: t = -8.694 , p = 7.65e-14
cat("3. REGRESSION MODEL 2 CLEAN (with influential points removed)\n")
## 3. REGRESSION MODEL 2 CLEAN (with influential points removed)
cat("   R-squared:", round(summary(model2_clean)$r.squared, 4), "\n")
##    R-squared: 0.4463
cat("   Adjusted R-squared:", round(summary(model2_clean)$adj.r.squared, 4), "\n")
##    Adjusted R-squared: 0.393
cat("   RMSE:", round(sqrt(mean(residuals(model2_clean)^2)), 3), "°F\n")
##    RMSE: 4.511 °F
cat("   Number of significant predictors:", nrow(significant_predictors), "\n")
##    Number of significant predictors: 3

9.2 Recommendations

Based on the statistical analysis:

  1. Sensor Accuracy:
    • Evaluate the magnitude of bias from Bland-Altman plots
    • Consider sensor calibration if systematic bias is detected
    • Assess if errors are within acceptable limits for your application
  2. Influential Factors:
    • Review regression coefficients to identify key error drivers
    • Consider environmental corrections for cloud cover
    • Account for time-of-day effects in data interpretation
  3. Model Improvement:
    • Deal with influential data points. Where there days/time when the sensors were obviously not working correctly or were there errors in the data set?
    • Consider transformations on the dependent variable to correct for non-normality of residuals
    • Consider non-linear relationships or interactions
    • Consider a mixed-effects model, where data points are grouped into time/space clusters
    • Validate model on independent test data

9.3 Further Analysis Suggestions

Exercise 6

Exercise 6: Design and implement one of the following extensions and put your code in the chunk below.

Easier Extensions:

  • Perform seperate regression models for each device

  • Remove some variables from model2_clean that are not significant. Try to get the Baysian Information Criterion (BIC) or Akaike’s Information Criterion (AIC) as low as possible.

Harder Extensions:

  • Deal with the issue of influential points. There are still influential points in the model made from the cleaned data. Are there errors in the data set?

  • Deal with the issue of non-normality (failed Shapiro-Wilk test). Ideas: inspect distribution, maybe do a transform on the dependent variable (logarithm?), or use robust regression.

  • Deal with the issue of heteroscedasticity (failed Breusch-Pagan test). Ideas: weighted least squares, or robust standard errors.

  • Try a mixed effect model (try function lmer from package lme4 instead of lm). Create data clusters based on time/space bins and a term “+ (1 | cluster_variable_name)” to your model. This might also help deal with the issues of non-normality and/or heteroscedasticity


#put your code for exercise 5 extension below

Publishing to RPubs:

  1. Knit this document to HTML

  2. Click “Publish” in the preview window

  3. Select RPubs and follow prompts

  4. Submit your statistical analysis by pasting the url in Canvas

Data Export:

# Export enhanced dataset with model predictions
# write.csv(model_data, "weather_modeling_results.csv", row.names = FALSE)

cat("\nAnalysis complete! Enhanced dataset includes:\n")
## 
## Analysis complete! Enhanced dataset includes:
cat("- NOAA reference data\n")
## - NOAA reference data
cat("- Extended environmental variables\n")
## - Extended environmental variables
cat("- Model predictions and residuals\n")
## - Model predictions and residuals
cat("- Statistical test results\n")
## - Statistical test results