Introduction

This tutorial builds on the descriptive statistical analysis of Part 1 by performing statistical modeling to understand the differences between sensor measurements and NOAA reference data. We’ll use Bland-Altman plots, paired t-tests, collect additional meteorological variables, and build linear regression models to identify factors that influence measurement errors.

1. Load Required Packages

# Install additional packages if needed
# install.packages(c("ggExtra", "broom", "car", "lmtest", "ggpubr"))

library(dplyr)
library(ggplot2)
library(plotly)
library(leaflet)
library(lubridate)
library(ggExtra)     # For marginal plots
library(broom)       # For tidy model output
library(car)         # For regression diagnostics
library(lmtest)      # For statistical tests
library(ggpubr)      # For publication-ready plots
library(httr)
library(jsonlite)
library(readr)

2. Load Data from Part 1

# Assuming you have the comparison_data from Part 1 saved as "CDT_Weather_Comparison.csv" in your working directory (part 9 from the first the activity).

# If not, you'll need to run the first activity again first to generate it. Make sure to (uncomment and then) run the code in part 9 that saves the data to the csv file. This code is:  write.csv(comparison_data, "CDT_Weather_Comparison.csv", row.names = FALSE)

#or, optionally you can just download this csv file from canvas and put it in your working directory.
 
comparison_data <- read_csv("CDT_Weather_Comparison.csv")


# For this tutorial, we'll assume comparison_data exists
cat("Data dimensions:", nrow(comparison_data), "rows x", ncol(comparison_data), "columns\n")

## Data dimensions: 100 rows x 26 columns

cat("Variables available:\n")

## Variables available:

print(names(comparison_data))

##  [1] "TimeStampFull"       "Date"                "Time"               
##  [4] "Lat"                 "Lon"                 "alt_m"              
##  [7] "Temp_F"              "Hum"                 "Press"              
## [10] "DP"                  "Device"              "Temp_C"             
## [13] "NOAA_Temp_F"         "NOAA_Humidity"       "NOAA_Pressure_hPa"  
## [16] "Temp_Diff"           "Temp_Pct_Diff"       "Temp_Abs_Diff"      
## [19] "Hum_Diff"            "Hum_Abs_Diff"        "Press_Diff"         
## [22] "Press_Abs_Diff"      "Temp_Error_Category" "Temp_Error_Type"    
## [25] "Hum_Error_Type"      "Press_Error_Type"

3. Bland-Altman Analysis

Bland-Altman plots are the gold standard for assessing agreement between two measurement methods.

3.1 Temperature Bland-Altman Plot

# Calculate mean and difference for Bland-Altman
comparison_data <- comparison_data %>%
  mutate(
    Temp_Mean = (Temp_F + NOAA_Temp_F) / 2,
    Hum_Mean = (Hum + NOAA_Humidity) / 2,
    Press_Mean = (Press + NOAA_Pressure_hPa) / 2
  )

# Calculate bias and limits of agreement
temp_bias <- mean(comparison_data$Temp_Diff, na.rm = TRUE)
temp_sd <- sd(comparison_data$Temp_Diff, na.rm = TRUE)
temp_upper_loa <- temp_bias + 1.96 * temp_sd
temp_lower_loa <- temp_bias - 1.96 * temp_sd

# Create Bland-Altman plot
ba_temp <- ggplot(comparison_data, aes(x = Temp_Mean, y = Temp_Diff)) +
  geom_point(aes(color = Device), alpha = 0.6, size = 3) +
  geom_hline(yintercept = temp_bias, linetype = "solid", color = "blue", linewidth = 1.2) +
  geom_hline(yintercept = temp_upper_loa, linetype = "dashed", color = "red", linewidth = 1) +
  geom_hline(yintercept = temp_lower_loa, linetype = "dashed", color = "red", linewidth = 1) +
  geom_hline(yintercept = 0, linetype = "dotted", color = "black", linewidth = 0.8) +
  annotate("text", x = min(comparison_data$Temp_Mean), y = temp_bias, 
           label = paste0("Bias = ", round(temp_bias, 2), "°F"), hjust = 0, vjust = -0.5) +
  annotate("text", x = min(comparison_data$Temp_Mean), y = temp_upper_loa, 
           label = paste0("+1.96 SD = ", round(temp_upper_loa, 2), "°F"), hjust = 0, vjust = -0.5) +
  annotate("text", x = min(comparison_data$Temp_Mean), y = temp_lower_loa, 
           label = paste0("-1.96 SD = ", round(temp_lower_loa, 2), "°F"), hjust = 0, vjust = 1.5) +
  labs(title = "Bland-Altman Plot: Temperature",
       subtitle = "Agreement between Sensor and NOAA measurements",
       x = "Mean Temperature [(Sensor + NOAA) / 2] (°F)",
       y = "Difference (Sensor - NOAA) (°F)") +
  theme_minimal() +
  scale_color_manual(values = c("Red" = "red", "Purple" = "purple"))

# Add marginal density plots
ggMarginal(ba_temp, type = "density", groupColour = TRUE, groupFill = TRUE)

Interpretation: - Blue line: Mean bias (systematic difference) - Red dashed lines: 95% limits of agreement (±1.96 SD) - Points within limits: Good agreement (95% of differences should fall within these limits) - Pattern detection: Look for trends or heteroscedasticity

3.2 Humidity Bland-Altman Plot

# Calculate bias and limits for humidity
hum_bias <- mean(comparison_data$Hum_Diff, na.rm = TRUE)
hum_sd <- sd(comparison_data$Hum_Diff, na.rm = TRUE)
hum_upper_loa <- hum_bias + 1.96 * hum_sd
hum_lower_loa <- hum_bias - 1.96 * hum_sd

ba_hum <- ggplot(comparison_data, aes(x = Hum_Mean, y = Hum_Diff)) +
  geom_point(aes(color = Device), alpha = 0.6, size = 3) +
  geom_hline(yintercept = hum_bias, linetype = "solid", color = "blue", linewidth = 1.2) +
  geom_hline(yintercept = hum_upper_loa, linetype = "dashed", color = "red", linewidth = 1) +
  geom_hline(yintercept = hum_lower_loa, linetype = "dashed", color = "red", linewidth = 1) +
  geom_hline(yintercept = 0, linetype = "dotted", color = "black", linewidth = 0.8) +
  annotate("text", x = min(comparison_data$Hum_Mean), y = hum_bias, 
           label = paste0("Bias = ", round(hum_bias, 2), "%"), hjust = 0, vjust = -0.5) +
  annotate("text", x = min(comparison_data$Hum_Mean), y = hum_upper_loa, 
           label = paste0("+1.96 SD = ", round(hum_upper_loa, 2), "%"), hjust = 0, vjust = -0.5) +
  annotate("text", x = min(comparison_data$Hum_Mean), y = hum_lower_loa, 
           label = paste0("-1.96 SD = ", round(hum_lower_loa, 2), "%"), hjust = 0, vjust = 1.5) +
  labs(title = "Bland-Altman Plot: Humidity",
       x = "Mean Humidity [(Sensor + NOAA) / 2] (%)",
       y = "Difference (Sensor - NOAA) (%)") +
  theme_minimal() +
  scale_color_manual(values = c("Red" = "red", "Purple" = "purple"))

ggMarginal(ba_hum, type = "density", groupColour = TRUE, groupFill = TRUE)

3.3 Pressure Bland-Altman Plot

Exercise 1

Exercise 1: Create a Bland-Altman plot for pressure following the same pattern as temperature and humidity above. Copy and paste the code from the example above into the code chunk below and modify it accordingly.

#put code for Excercise 1 below

3.4 Bland-Altman Summary Statistics

# Create summary table
ba_summary <- data.frame(
  Variable = c("Temperature (°F)", "Humidity (%)", "Pressure (hPa)"),
  Bias = c(temp_bias, hum_bias, mean(comparison_data$Press_Diff, na.rm = TRUE)),
  SD = c(temp_sd, hum_sd, sd(comparison_data$Press_Diff, na.rm = TRUE)),
  Lower_LOA = c(temp_lower_loa, hum_lower_loa, 
                mean(comparison_data$Press_Diff, na.rm = TRUE) - 
                  1.96 * sd(comparison_data$Press_Diff, na.rm = TRUE)),
  Upper_LOA = c(temp_upper_loa, hum_upper_loa, 
                mean(comparison_data$Press_Diff, na.rm = TRUE) + 
                  1.96 * sd(comparison_data$Press_Diff, na.rm = TRUE))
)

knitr::kable(ba_summary, digits = 2, 
             caption = "Bland-Altman Analysis: Bias and Limits of Agreement",
             col.names = c("Variable", "Bias", "SD", "Lower LOA", "Upper LOA"))

Bland-Altman Analysis: Bias and Limits of Agreement
Variable	Bias	SD	Lower LOA	Upper LOA
Temperature (°F)	-0.08	9.19	-18.10	17.94
Humidity (%)	5.02	14.92	-24.23	34.27
Pressure (hPa)	-6.22	7.15	-20.23	7.80

4. Paired T-Tests

Paired t-tests assess whether there is a statistically significant systematic difference between sensor and NOAA measurements.

4.1 Temperature Paired T-Test

# Perform paired t-test
temp_ttest <- t.test(comparison_data$Temp_F, comparison_data$NOAA_Temp_F, 
                     paired = TRUE, alternative = "two.sided")

# Calculate effect size (Cohen's d)
temp_cohens_d <- temp_bias / temp_sd

cat("Temperature Paired T-Test Results:\n")

## Temperature Paired T-Test Results:

cat("=====================================\n")

## =====================================

cat("Mean difference (Bias):", round(temp_bias, 3), "°F\n")

## Mean difference (Bias): -0.08 °F

cat("t-statistic:", round(temp_ttest$statistic, 3), "\n")

## t-statistic: -0.087

cat("Degrees of freedom:", temp_ttest$parameter, "\n")

## Degrees of freedom: 99

cat("p-value:", format.pval(temp_ttest$p.value, digits = 3), "\n")

## p-value: 0.931

cat("95% CI:", round(temp_ttest$conf.int[1], 3), "to", round(temp_ttest$conf.int[2], 3), "°F\n")

## 95% CI: -1.904 to 1.744 °F

cat("Cohen's d (effect size):", round(temp_cohens_d, 3), "\n")

## Cohen's d (effect size): -0.009

cat("\nInterpretation:\n")

## 
## Interpretation:

if (temp_ttest$p.value < 0.05) {
  cat("There IS a statistically significant difference between sensor and NOAA temperatures (p < 0.05).\n")
} else {
  cat("There is NO statistically significant difference between sensor and NOAA temperatures (p >= 0.05).\n")
}

## There is NO statistically significant difference between sensor and NOAA temperatures (p >= 0.05).

4.2 Humidity and Pressure Paired T-Tests

# Humidity t-test
hum_ttest <- t.test(comparison_data$Hum, comparison_data$NOAA_Humidity, 
                    paired = TRUE, alternative = "two.sided")

# Pressure t-test
press_ttest <- t.test(comparison_data$Press, comparison_data$NOAA_Pressure_hPa, 
                      paired = TRUE, alternative = "two.sided")

# Create summary table
ttest_summary <- data.frame(
  Variable = c("Temperature", "Humidity", "Pressure"),
  Mean_Diff = c(temp_bias, hum_bias, mean(comparison_data$Press_Diff, na.rm = TRUE)),
  t_statistic = c(temp_ttest$statistic, hum_ttest$statistic, press_ttest$statistic),
  df = c(temp_ttest$parameter, hum_ttest$parameter, press_ttest$parameter),
  p_value = c(temp_ttest$p.value, hum_ttest$p.value, press_ttest$p.value),
  CI_Lower = c(temp_ttest$conf.int[1], hum_ttest$conf.int[1], press_ttest$conf.int[1]),
  CI_Upper = c(temp_ttest$conf.int[2], hum_ttest$conf.int[2], press_ttest$conf.int[2]),
  Significant = c(temp_ttest$p.value < 0.05, hum_ttest$p.value < 0.05, press_ttest$p.value < 0.05)
)

knitr::kable(ttest_summary, digits = 4, 
             caption = "Paired T-Test Results Summary",
             col.names = c("Variable", "Mean Diff", "t", "df", "p-value", "CI Lower", "CI Upper", "Significant (α=0.05)"))

Paired T-Test Results Summary
Variable	Mean Diff	t	df	p-value	CI Lower	CI Upper	Significant (α=0.05)
Temperature	-0.080	-0.0870	99	0.9308	-1.9038	1.7438	FALSE
Humidity	5.020	3.3637	99	0.0011	2.0587	7.9813	TRUE
Pressure	-6.216	-8.6941	99	0.0000	-7.6347	-4.7973	TRUE

4.3 Paired T-Test by Device

# Test each device separately
red_data <- comparison_data %>% filter(Device == "Red")
purple_data <- comparison_data %>% filter(Device == "Purple")

if (nrow(red_data) > 1) {
  red_temp_ttest <- t.test(red_data$Temp_F, red_data$NOAA_Temp_F, paired = TRUE)
  cat("Red Device Temperature T-Test:\n")
  cat("Mean difference:", round(mean(red_data$Temp_Diff, na.rm = TRUE), 3), "°F\n")
  cat("p-value:", format.pval(red_temp_ttest$p.value, digits = 3), "\n\n")
}

## Red Device Temperature T-Test:
## Mean difference: -0.829 °F
## p-value: 0.6

if (nrow(purple_data) > 1) {
  purple_temp_ttest <- t.test(purple_data$Temp_F, purple_data$NOAA_Temp_F, paired = TRUE)
  cat("Purple Device Temperature T-Test:\n")
  cat("Mean difference:", round(mean(purple_data$Temp_Diff, na.rm = TRUE), 3), "°F\n")
  cat("p-value:", format.pval(purple_temp_ttest$p.value, digits = 3), "\n")
}

## Purple Device Temperature T-Test:
## Mean difference: 0.836 °F
## p-value: 0.232

Exercise 2

Exercise 2:

Interpret the t-test results. Write your interpretation below each question part below.

2a) From 4.1 and 4.2, which weather variable(s) (temp, pressure, humidity) were significant?

2b) From 4.3, which device (if any) shows systematic bias for temperature?

2c) Modify the code from 4.3 do a paired t test for each device but for one of the other variables (pressure or humidity). Then interpret your results.

#put your modified 4.3 code here:

5. Collect Additional NOAA Variables

Let’s fetch additional meteorological variables that might explain measurement differences.

5.1 Function to Fetch Extended Weather Data

get_extended_weather_data <- function(lat, lon, datetime, debug = FALSE) {
  date <- as.Date(datetime)
  
  base_url <- "https://archive-api.open-meteo.com/v1/archive"
  
  url <- paste0(
    base_url,
    "?latitude=", lat,
    "&longitude=", lon,
    "&start_date=", date,
    "&end_date=", date,
    "&hourly=cloudcover,shortwave_radiation,windspeed_10m,winddirection_10m,precipitation",
    "&temperature_unit=fahrenheit"
  )
  
  default_result <- list(
    cloudcover = NA_real_,
    shortwave_radiation = NA_real_,
    windspeed_10m = NA_real_,
    winddirection_10m = NA_real_,
    precipitation = NA_real_
  )
  
  result <- tryCatch({
    response <- GET(url)
    
    if (status_code(response) == 200) {
      data <- content(response, "parsed")
      
      if (is.null(data$hourly)) {
        return(default_result)
      }
      
      hourly_data <- data$hourly
      target_hour <- hour(datetime)
      idx <- target_hour + 1
      
      # Extract values with NULL checks
      cloudcover_val <- if (!is.null(hourly_data$cloudcover) && idx <= length(hourly_data$cloudcover)) {
        val <- hourly_data$cloudcover[idx]
        if (!is.null(val)) as.numeric(val) else NA_real_
      } else NA_real_
      
      radiation_val <- if (!is.null(hourly_data$shortwave_radiation) && idx <= length(hourly_data$shortwave_radiation)) {
        val <- hourly_data$shortwave_radiation[idx]
        if (!is.null(val)) as.numeric(val) else NA_real_
      } else NA_real_
      
      windspeed_val <- if (!is.null(hourly_data$windspeed_10m) && idx <= length(hourly_data$windspeed_10m)) {
        val <- hourly_data$windspeed_10m[idx]
        if (!is.null(val)) as.numeric(val) else NA_real_
      } else NA_real_
      
      winddir_val <- if (!is.null(hourly_data$winddirection_10m) && idx <= length(hourly_data$winddirection_10m)) {
        val <- hourly_data$winddirection_10m[idx]
        if (!is.null(val)) as.numeric(val) else NA_real_
      } else NA_real_
      
      precip_val <- if (!is.null(hourly_data$precipitation) && idx <= length(hourly_data$precipitation)) {
        val <- hourly_data$precipitation[idx]
        if (!is.null(val)) as.numeric(val) else NA_real_
      } else NA_real_
      
      return(list(
        cloudcover = cloudcover_val,
        shortwave_radiation = radiation_val,
        windspeed_10m = windspeed_val,
        winddirection_10m = winddir_val,
        precipitation = precip_val
      ))
    } else {
      return(default_result)
    }
  }, error = function(e) {
    if (debug) cat("Error:", e$message, "\n")
    return(default_result)
  })
  
  return(result)
}

5.2 Fetch Extended Weather Variables

OpenMeteo has other NOAA data variables we can add to our dataset. This might be helpful to investigate if any of the differences in weather data are dependent on any of these environmental variables.

# Initialize columns
comparison_data$cloudcover <- NA
comparison_data$solar_radiation <- NA
comparison_data$windspeed <- NA
comparison_data$wind_direction <- NA
comparison_data$precipitation <- NA

cat("Fetching extended weather data...\n")

## Fetching extended weather data...

cat("NOTE: Due to API rate limits, we'll fetch with longer delays.\n\n")

## NOTE: Due to API rate limits, we'll fetch with longer delays.

success_count <- 0

for (i in 1:nrow(comparison_data)) {
  if (i %% 5 == 0) {
    cat("Processing point", i, "of", nrow(comparison_data), "\n")
  }
  
  ext_data <- get_extended_weather_data(
    comparison_data$Lat[i],
    comparison_data$Lon[i],
    comparison_data$TimeStampFull[i]
  )
  
  # Safely assign with length checks
  if (length(ext_data$cloudcover) > 0) {
    comparison_data$cloudcover[i] <- ext_data$cloudcover
  }
  if (length(ext_data$shortwave_radiation) > 0) {
    comparison_data$solar_radiation[i] <- ext_data$shortwave_radiation
  }
  if (length(ext_data$windspeed_10m) > 0) {
    comparison_data$windspeed[i] <- ext_data$windspeed_10m
  }
  if (length(ext_data$winddirection_10m) > 0) {
    comparison_data$wind_direction[i] <- ext_data$winddirection_10m
  }
  if (length(ext_data$precipitation) > 0) {
    comparison_data$precipitation[i] <- ext_data$precipitation
  }
  
  if (!is.na(comparison_data$cloudcover[i])) {
    success_count <- success_count + 1
  }
  
  Sys.sleep(0.5)  # delay to respect rate limits. If you download too fast or too much data your IP address will get cut off for the day!
}

## Processing point 5 of 100 
## Processing point 10 of 100 
## Processing point 15 of 100 
## Processing point 20 of 100 
## Processing point 25 of 100 
## Processing point 30 of 100 
## Processing point 35 of 100 
## Processing point 40 of 100 
## Processing point 45 of 100 
## Processing point 50 of 100 
## Processing point 55 of 100 
## Processing point 60 of 100 
## Processing point 65 of 100 
## Processing point 70 of 100 
## Processing point 75 of 100 
## Processing point 80 of 100 
## Processing point 85 of 100 
## Processing point 90 of 100 
## Processing point 95 of 100 
## Processing point 100 of 100

cat("\nExtended data fetch complete!\n")

## 
## Extended data fetch complete!

cat("Successfully retrieved:", success_count, "of", nrow(comparison_data), "points\n")

## Successfully retrieved: 100 of 100 points

5.3 Visualize Additional Variables

# Cloud cover vs temperature error
p1 <- ggplot(comparison_data %>% filter(!is.na(cloudcover)), 
       aes(x = cloudcover, y = Temp_Diff, color = Device)) +
  geom_point(alpha = 0.6, size = 3) +
  geom_smooth(method = "loess", se = TRUE) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Temperature Error vs Cloud Cover",
       x = "Cloud Cover (%)",
       y = "Temperature Error (°F)") +
  theme_minimal() +
  scale_color_manual(values = c("Red" = "red", "Purple" = "purple"))

# Solar radiation vs temperature error
p2 <- ggplot(comparison_data %>% filter(!is.na(solar_radiation)), 
       aes(x = solar_radiation, y = Temp_Diff, color = Device)) +
  geom_point(alpha = 0.6, size = 3) +
  geom_smooth(method = "loess", se = TRUE) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Temperature Error vs Solar Radiation",
       x = "Solar Radiation (W/m²)",
       y = "Temperature Error (°F)") +
  theme_minimal() +
  scale_color_manual(values = c("Red" = "red", "Purple" = "purple"))

# Wind speed vs temperature error
p3 <- ggplot(comparison_data %>% filter(!is.na(windspeed)), 
       aes(x = windspeed, y = Temp_Diff, color = Device)) +
  geom_point(alpha = 0.6, size = 3) +
  geom_smooth(method = "loess", se = TRUE) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Temperature Error vs Wind Speed",
       x = "Wind Speed (km/h)",
       y = "Temperature Error (°F)") +
  theme_minimal() +
  scale_color_manual(values = c("Red" = "red", "Purple" = "purple"))

print(p1)

print(p2)

print(p3)

5.4 Interactive Map of Environmental Variables

# Create color palette for cloud cover
cloud_pal <- colorNumeric(
  palette = "Blues",
  domain = comparison_data$cloudcover,
  na.color = "transparent"
)

# Create map
env_map <- leaflet(comparison_data %>% filter(!is.na(cloudcover))) %>%
  addProviderTiles("Esri.WorldImagery", group = "Satellite") %>%
  addProviderTiles(providers$Esri.WorldTopoMap, group = "Topo") %>%
  
  # Cloud cover markers
  addCircleMarkers(
    lng = ~Lon,
    lat = ~Lat,
    radius = 6,
    color = ~cloud_pal(cloudcover),
    fillOpacity = 0.8,
    stroke = TRUE,
    weight = 1,
    group = "Cloud Cover",
    label = ~paste0("Cloud Cover: ", round(cloudcover, 0), "%"),
    popup = ~paste0(
      "<b>Environmental Conditions</b><br>",
      "Cloud Cover: ", round(cloudcover, 0), "%<br>",
      "Solar Radiation: ", round(solar_radiation, 0), " W/m²<br>",
      "Wind Speed: ", round(windspeed, 1), " km/h<br>",
      "Temperature Error: ", round(Temp_Diff, 1), "°F"
    )
  ) %>%
  
  # Solar radiation markers
  addCircleMarkers(
    lng = ~Lon,
    lat = ~Lat,
    radius = ~scales::rescale(solar_radiation, to = c(3, 10)),
    color = "orange",
    fillOpacity = 0.7,
    stroke = TRUE,
    weight = 1,
    group = "Solar Radiation",
    label = ~paste0("Solar: ", round(solar_radiation, 0), " W/m²"),
    popup = ~paste0(
      "<b>Solar Radiation</b><br>",
      "Value: ", round(solar_radiation, 0), " W/m²<br>",
      "Cloud Cover: ", round(cloudcover, 0), "%<br>",
      "Temperature Error: ", round(Temp_Diff, 1), "°F"
    )
  ) %>%
  
  addLayersControl(
    baseGroups = c("Satellite", "Topo"),
    overlayGroups = c("Cloud Cover", "Solar Radiation"),
    options = layersControlOptions(collapsed = FALSE)
  ) %>%
  
  hideGroup("Solar Radiation") %>%
  
  addLegend(
    position = "bottomright",
    pal = cloud_pal,
    values = ~cloudcover,
    title = "Cloud Cover (%)",
    group = "Cloud Cover"
  )

env_map

6. Linear Regression Modeling

Build models to predict temperature error based on multiple factors.

6.1 Prepare Modeling Data

# Add time-based variables
model_data <- comparison_data %>%
  filter(!is.na(Temp_Diff)) %>%
  mutate(
    Hour = hour(TimeStampFull),
    Hour_sin = sin(2 * pi * Hour / 24),  # Circular encoding of hour
    Hour_cos = cos(2 * pi * Hour / 24),
    DayOfYear = yday(Date)
  )

cat("Modeling dataset:", nrow(model_data), "observations\n")

## Modeling dataset: 100 observations

cat("Complete cases (no missing values):", sum(complete.cases(model_data)), "\n")

## Complete cases (no missing values): 100

6.2 Model 1: Basic Model (Location and Time)

model1 <- lm(Temp_Diff ~ Lat  + alt_m + Hour_sin + Hour_cos + Device, 
             data = model_data)

summary(model1)

## 
## Call:
## lm(formula = Temp_Diff ~ Lat + alt_m + Hour_sin + Hour_cos + 
##     Device, data = model_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.934  -3.241   0.131   4.417  19.027 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.523e+03  2.985e+02  -5.102 1.74e-06 ***
## Lat          3.571e+01  7.132e+00   5.007 2.58e-06 ***
## alt_m       -2.007e-03  4.519e-03  -0.444   0.6579    
## Hour_sin    -2.086e+00  1.272e+00  -1.639   0.1045    
## Hour_cos     2.505e+00  1.165e+00   2.150   0.0341 *  
## DeviceRed   -4.772e-01  1.634e+00  -0.292   0.7709    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.996 on 94 degrees of freedom
## Multiple R-squared:  0.2814, Adjusted R-squared:  0.2432 
## F-statistic: 7.363 on 5 and 94 DF,  p-value: 7.309e-06

# Tidy output
model1_tidy <- tidy(model1, conf.int = TRUE)
knitr::kable(model1_tidy, digits = 4, 
             caption = "Model 1: Location and Time Effects on Temperature Error")

Model 1: Location and Time Effects on Temperature Error
term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	-1523.0441	298.5085	-5.1022	0.0000	-2115.7397	-930.3485
Lat	35.7100	7.1320	5.0070	0.0000	21.5492	49.8708
alt_m	-0.0020	0.0045	-0.4442	0.6579	-0.0110	0.0070
Hour_sin	-2.0858	1.2723	-1.6394	0.1045	-4.6119	0.4403
Hour_cos	2.5052	1.1652	2.1501	0.0341	0.1917	4.8186
DeviceRed	-0.4772	1.6343	-0.2920	0.7709	-3.7222	2.7678

6.2 Model 2: Extended Model (Adding Environmental Variables)

# Filter to complete cases for fair comparison
model_data_complete <- model_data %>%
  filter(!is.na(cloudcover) & !is.na(solar_radiation) & !is.na(windspeed))

if (nrow(model_data_complete) > 10) {
  model2 <- lm(Temp_Diff ~ Lat + alt_m + Hour_sin + Hour_cos + Device +
                 cloudcover + solar_radiation + windspeed, 
               data = model_data_complete)
  
  summary(model2)
  
  model2_tidy <- tidy(model2, conf.int = TRUE)
  knitr::kable(model2_tidy, digits = 4, 
               caption = "Model 2: Location, Time, and Environmental Effects")
  
  # Compare models
  cat("\nModel Comparison:\n")
  cat("Model 1 R-squared:", round(summary(model1)$r.squared, 4), "\n")
  cat("Model 1 Adjusted R-squared:", round(summary(model1)$adj.r.squared, 4), "\n")
  cat("Model 2 R-squared:", round(summary(model2)$r.squared, 4), "\n")
  cat("Model 2 Adjusted R-squared:", round(summary(model2)$adj.r.squared, 4), "\n")
} else {
  cat("Insufficient complete cases for Model 2.\n")
  cat("Available complete cases:", nrow(model_data_complete), "\n")
}

## 
## Model Comparison:
## Model 1 R-squared: 0.2814 
## Model 1 Adjusted R-squared: 0.2432 
## Model 2 R-squared: 0.3561 
## Model 2 Adjusted R-squared: 0.2994

Exercise 3

Exercise 3: Add Interaction Term(s)

In Model 2 above, add in one or more interaction term to the model to see if the Device makes any difference on the variable coefficients. Try with the most significant variables first. Example: add a Device*Lat variable. Copy model2 code above and paste into code chunk below, then and add in your interaction term. Rename your model the name model3 and then also compute R^2 and adjusted R^2 for model3

6.3 Initial Model Diagnostics

Before proceeding, let’s check for problematic observations that might be affecting our model.

# Check for multicollinearity
vif_values <- vif(model1)
cat("Variance Inflation Factors (VIF):\n")

## Variance Inflation Factors (VIF):

print(round(vif_values, 2))

##      Lat    alt_m Hour_sin Hour_cos   Device 
##     1.43     1.40     1.02     1.02     1.03

cat("\nNote: VIF > 10 indicates problematic multicollinearity\n\n")

## 
## Note: VIF > 10 indicates problematic multicollinearity

# Normality test
shapiro_test <- shapiro.test(residuals(model1))
shapiro_test2 <- shapiro.test(residuals(model2))
cat("Shapiro-Wilk Normality Test:\n")

## Shapiro-Wilk Normality Test:

cat("W statistic:", round(shapiro_test$statistic, 4), "\n")

## W statistic: 0.9523

cat("p-value:", format.pval(shapiro_test$p.value, digits = 3), "\n")

## p-value: 0.00119

if (shapiro_test$p.value < 0.05) {
  cat("WARNING: Residuals are NOT normally distributed (p < 0.05)\n\n")
} else {
  cat("Residuals appear normally distributed (p >= 0.05)\n\n")
}

## WARNING: Residuals are NOT normally distributed (p < 0.05)

# Heteroscedasticity test
bp_test <- bptest(model1)
bp_test2 <- bptest(model2)
cat("Breusch-Pagan Heteroscedasticity Test:\n")

## Breusch-Pagan Heteroscedasticity Test:

cat("BP statistic:", round(bp_test$statistic, 4), "\n")

## BP statistic: 39.2025

cat("p-value:", format.pval(bp_test$p.value, digits = 3), "\n")

## p-value: 2.16e-07

if (bp_test$p.value < 0.05) {
  cat("WARNING: Heteroscedasticity detected (p < 0.05)\n\n")
} else {
  cat("Homoscedasticity assumption met (p >= 0.05)\n\n")
}

## WARNING: Heteroscedasticity detected (p < 0.05)

Interpretation: If we see violations of normality or homoscedasticity, this often indicates the presence of outliers or influential points. Let’s investigate!

6.4 Identify Influential Points (Cook’s Distance)

Cook’s Distance measures how much a single observation influences the entire regression model. We will calculate this for model1

# Calculate Cook's distance
model_data$cooks_d <- cooks.distance(model1)
model_data$residuals <- residuals(model1)
model_data$predicted <- predict(model1)

# Find influential points (Cook's D > 4/n is a common threshold)
threshold <- 4 / nrow(model_data)
influential_points <- model_data %>%
  filter(cooks_d > threshold) %>%
  arrange(desc(cooks_d))

# Create influence plot
ggplot(model_data, aes(x = seq_along(cooks_d), y = cooks_d, color = Device)) +
  geom_point(alpha = 0.6, size = 3) +
  geom_hline(yintercept = threshold, linetype = "dashed", color = "red", linewidth = 1.2) +
  annotate("text", x = nrow(model_data) * 0.7, y = threshold, 
           label = paste0("Threshold (4/n) = ", round(threshold, 5)), 
           vjust = -0.5, color = "red") +
  labs(title = "Cook's Distance - Influence Plot",
       subtitle = "Points above red line are potentially influential",
       x = "Observation Index",
       y = "Cook's Distance") +
  theme_minimal() +
  scale_color_manual(values = c("Red" = "red", "Purple" = "purple"))

cat("\nINFLUENTIAL POINTS ANALYSIS\n")

## 
## INFLUENTIAL POINTS ANALYSIS

cat("===========================\n")

## ===========================

cat("Threshold (4/n):", round(threshold, 5), "\n")

## Threshold (4/n): 0.04

cat("Number of influential points:", nrow(influential_points), "\n\n")

## Number of influential points: 8

if (nrow(influential_points) > 0) {
  cat("Top 10 most influential observations:\n")
  influential_summary <- influential_points %>%
    select(TimeStampFull, Device, Temp_F, NOAA_Temp_F, Temp_Diff, 
           Temp_Abs_Diff, alt_m, cooks_d) %>%
    head(10)
  
  print(knitr::kable(influential_summary, digits = 3,
                     caption = "Most Influential Points"))
  
  # Visualize the influential points
  cat("\n")
  cat("Characteristics of influential points:\n")
  cat("- Mean temperature difference:", round(mean(influential_points$Temp_Diff), 2), "°F\n")
  cat("- Mean absolute error:", round(mean(influential_points$Temp_Abs_Diff), 2), "°F\n")
  cat("- Compare to overall mean abs error:", round(mean(model_data$Temp_Abs_Diff), 2), "°F\n")
}

## Top 10 most influential observations:
## 
## 
## Table: Most Influential Points
## 
## |TimeStampFull       |Device | Temp_F| NOAA_Temp_F| Temp_Diff| Temp_Abs_Diff| alt_m| cooks_d|
## |:-------------------|:------|------:|-----------:|---------:|-------------:|-----:|-------:|
## |2025-09-11 21:19:44 |Red    |  34.70|        64.8|    -30.10|         30.10|  2871|   0.110|
## |2025-09-11 23:32:56 |Red    |  34.70|        63.4|    -28.70|         28.70|  2871|   0.101|
## |2025-09-11 23:52:20 |Red    |  34.70|        63.3|    -28.60|         28.60|  2871|   0.100|
## |2025-09-11 18:11:56 |Red    |  34.70|        64.0|    -29.30|         29.30|  2871|   0.089|
## |2025-09-11 10:36:21 |Red    |  44.06|        38.0|      6.06|          6.06|  2877|   0.070|
## |2025-09-11 17:16:44 |Red    |  34.70|        57.9|    -23.20|         23.20|  2871|   0.044|
## |2025-09-08 20:51:17 |Red    |  78.98|        62.0|     16.98|         16.98|  2758|   0.042|
## |2025-09-11 17:10:02 |Red    |  34.70|        57.6|    -22.90|         22.90|  2871|   0.042|
## 
## Characteristics of influential points:
## - Mean temperature difference: -17.47 °F
## - Mean absolute error: 23.23 °F
## - Compare to overall mean abs error: 6.31 °F

6.5 Visualize Influential Points

# Scatter plot highlighting influential points
model_data$is_influential <- model_data$cooks_d > threshold

ggplot(model_data, aes(x = Temp_Mean, y = Temp_Diff)) +
  geom_point(aes(color = is_influential, size = is_influential, alpha = is_influential)) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "black") +
  scale_color_manual(values = c("FALSE" = "gray60", "TRUE" = "red"),
                     labels = c("Normal", "Influential"),
                     name = "Point Type") +
  scale_size_manual(values = c("FALSE" = 2, "TRUE" = 4),
                    labels = c("Normal", "Influential"),
                    name = "Point Type") +
  scale_alpha_manual(values = c("FALSE" = 0.5, "TRUE" = 1),
                     labels = c("Normal", "Influential"),
                     name = "Point Type") +
  labs(title = "Temperature Errors with Influential Points Highlighted",
       subtitle = paste("Red points exceed Cook's D threshold of", round(threshold, 5)),
       x = "Mean Temperature (°F)",
       y = "Temperature Difference (Sensor - NOAA) (°F)") +
  theme_minimal()

# Box plot comparing influential vs non-influential
ggplot(model_data, aes(x = is_influential, y = Temp_Abs_Diff, fill = is_influential)) +
  geom_boxplot() +
  scale_fill_manual(values = c("FALSE" = "lightblue", "TRUE" = "red"),
                    labels = c("Normal Points", "Influential Points"),
                    name = "") +
  labs(title = "Absolute Temperature Error: Normal vs Influential Points",
       x = "Point Type",
       y = "Absolute Temperature Error (°F)") +
  scale_x_discrete(labels = c("FALSE" = "Normal", "TRUE" = "Influential")) +
  theme_minimal()

6.6 Remove Influential Points and Refit Model

Now let’s remove the influential points and see how it affects our models.

cat("DATA CLEANING DECISION\n")

## DATA CLEANING DECISION

cat("======================\n")

## ======================

cat("Original dataset:", nrow(model_data), "observations\n")

## Original dataset: 100 observations

cat("Influential points to remove:", sum(model_data$is_influential), "observations\n")

## Influential points to remove: 8 observations

cat("Cleaned dataset:", sum(!model_data$is_influential), "observations\n")

## Cleaned dataset: 92 observations

cat("Percentage removed:", round(sum(model_data$is_influential) / nrow(model_data) * 100, 2), "%\n\n")

## Percentage removed: 8 %

# Create cleaned dataset
model_data_clean <- model_data %>%
  filter(!is_influential)

# Refit the model with cleaned data
model1_clean <- lm(Temp_Diff ~ Lat + alt_m + Hour_sin + Hour_cos + Device, 
                   data = model_data_clean)

model2_clean <- lm(Temp_Diff ~ Lat + alt_m + Hour_sin + Hour_cos + Device +
                 cloudcover + solar_radiation + windspeed, 
               data = model_data_clean)

cat("MODEL COMPARISON: BEFORE vs AFTER REMOVING INFLUENTIAL POINTS\n")

## MODEL COMPARISON: BEFORE vs AFTER REMOVING INFLUENTIAL POINTS

cat("=============================================================\n\n")

## =============================================================

cat("ORIGINAL MODEL 1 (with influential points):\n")

## ORIGINAL MODEL 1 (with influential points):

cat("R-squared:", round(summary(model1)$r.squared, 4), "\n")

## R-squared: 0.2814

cat("Adjusted R-squared:", round(summary(model1)$adj.r.squared, 4), "\n")

## Adjusted R-squared: 0.2432

cat("RMSE:", round(sqrt(mean(residuals(model1)^2)), 4), "°F\n")

## RMSE: 7.7523 °F

cat("Residual Std Error:", round(summary(model1)$sigma, 4), "°F\n\n")

## Residual Std Error: 7.9959 °F

cat("CLEANED MODEL 1 (influential points removed):\n")

## CLEANED MODEL 1 (influential points removed):

cat("R-squared:", round(summary(model1_clean)$r.squared, 4), "\n")

## R-squared: 0.3849

cat("Adjusted R-squared:", round(summary(model1_clean)$adj.r.squared, 4), "\n")

## Adjusted R-squared: 0.3491

cat("RMSE:", round(sqrt(mean(residuals(model1_clean)^2)), 4), "°F\n")

## RMSE: 4.755 °F

cat("Residual Std Error:", round(summary(model1_clean)$sigma, 4), "°F\n\n")

## Residual Std Error: 4.9181 °F

cat("ORIGINAL MODEL 2 (with influential points):\n")

## ORIGINAL MODEL 2 (with influential points):

cat("R-squared:", round(summary(model2)$r.squared, 4), "\n")

## R-squared: 0.3561

cat("Adjusted R-squared:", round(summary(model2)$adj.r.squared, 4), "\n")

## Adjusted R-squared: 0.2994

cat("RMSE:", round(sqrt(mean(residuals(model2)^2)), 4), "°F\n")

## RMSE: 7.3388 °F

cat("Residual Std Error:", round(summary(model2)$sigma, 4), "°F\n\n")

## Residual Std Error: 7.6931 °F

cat("CLEANED MODEL 2 (influential points removed):\n")

## CLEANED MODEL 2 (influential points removed):

cat("R-squared:", round(summary(model2_clean)$r.squared, 4), "\n")

## R-squared: 0.4463

cat("Adjusted R-squared:", round(summary(model2_clean)$adj.r.squared, 4), "\n")

## Adjusted R-squared: 0.393

cat("RMSE:", round(sqrt(mean(residuals(model2_clean)^2)), 4), "°F\n")

## RMSE: 4.5113 °F

cat("Residual Std Error:", round(summary(model2_clean)$sigma, 4), "°F\n\n")

## Residual Std Error: 4.7496 °F

# Improvement metrics
r2_improvement <- (summary(model1_clean)$r.squared - summary(model1)$r.squared) * 100
rmse_improvement <- ((sqrt(mean(residuals(model1)^2)) - sqrt(mean(residuals(model1_clean)^2))) / 
                     sqrt(mean(residuals(model1)^2))) * 100

cat("IMPROVEMENTS MODEL 1:\n")

## IMPROVEMENTS MODEL 1:

cat("R-squared increase:", round(r2_improvement, 2), "percentage points\n")

## R-squared increase: 10.35 percentage points

cat("RMSE reduction:", round(rmse_improvement, 2), "%\n")

## RMSE reduction: 38.66 %

r2_improvement2 <- (summary(model2_clean)$r.squared - summary(model2)$r.squared) * 100
rmse_improvement2 <- ((sqrt(mean(residuals(model2)^2)) - sqrt(mean(residuals(model2_clean)^2))) / 
                     sqrt(mean(residuals(model2)^2))) * 100

cat("IMPROVEMENTS MODEL 2:\n")

## IMPROVEMENTS MODEL 2:

cat("R-squared increase:", round(r2_improvement2, 2), "percentage points\n")

## R-squared increase: 9.03 percentage points

cat("RMSE reduction:", round(rmse_improvement2, 2), "%\n")

## RMSE reduction: 38.53 %

# Show full model 1 summary
cat("\n")

summary(model1_clean)

## 
## Call:
## lm(formula = Temp_Diff ~ Lat + alt_m + Hour_sin + Hour_cos + 
##     Device, data = model_data_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -20.4648  -2.6905   0.0476   2.7094  14.3754 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.286e+02  2.148e+02  -1.995   0.0492 *  
## Lat          9.684e+00  5.122e+00   1.891   0.0621 .  
## alt_m        4.441e-03  2.840e-03   1.564   0.1216    
## Hour_sin    -3.692e+00  8.113e-01  -4.551 1.74e-05 ***
## Hour_cos     3.499e+00  7.415e-01   4.719 9.11e-06 ***
## DeviceRed    1.863e+00  1.045e+00   1.783   0.0781 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.918 on 86 degrees of freedom
## Multiple R-squared:  0.3849, Adjusted R-squared:  0.3491 
## F-statistic: 10.76 on 5 and 86 DF,  p-value: 4.692e-08

model1_clean_tidy <- tidy(model1_clean, conf.int = TRUE)
knitr::kable(model1_clean_tidy, digits = 4, 
             caption = "Cleaned Model 1: Location and Time Effects on Temperature Error (Influential Points Removed)")

Cleaned Model 1: Location and Time Effects on Temperature Error (Influential Points Removed)
term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	-428.5773	214.8415	-1.9949	0.0492	-855.6681	-1.4864
Lat	9.6837	5.1222	1.8905	0.0621	-0.4989	19.8663
alt_m	0.0044	0.0028	1.5636	0.1216	-0.0012	0.0101
Hour_sin	-3.6919	0.8113	-4.5507	0.0000	-5.3047	-2.0791
Hour_cos	3.4989	0.7415	4.7189	0.0000	2.0249	4.9728
DeviceRed	1.8635	1.0449	1.7833	0.0781	-0.2138	3.9407

# Show full model 2 summary
cat("\n")

summary(model2_clean)

## 
## Call:
## lm(formula = Temp_Diff ~ Lat + alt_m + Hour_sin + Hour_cos + 
##     Device + cloudcover + solar_radiation + windspeed, data = model_data_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.8425  -2.2988   0.1258   2.7823  13.5543 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)   
## (Intercept)     -7.326e+02  2.656e+02  -2.758  0.00715 **
## Lat              1.687e+01  6.332e+00   2.664  0.00928 **
## alt_m            3.336e-03  3.102e-03   1.075  0.28531   
## Hour_sin        -3.848e+00  2.038e+00  -1.888  0.06248 . 
## Hour_cos         3.216e+00  1.188e+00   2.707  0.00823 **
## DeviceRed        2.973e-01  1.134e+00   0.262  0.79382   
## cloudcover       5.865e-02  2.045e-02   2.868  0.00524 **
## solar_radiation  1.068e-03  4.259e-03   0.251  0.80263   
## windspeed       -1.424e-01  1.274e-01  -1.118  0.26688   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.75 on 83 degrees of freedom
## Multiple R-squared:  0.4463, Adjusted R-squared:  0.393 
## F-statistic: 8.363 on 8 and 83 DF,  p-value: 2.949e-08

model2_clean_tidy <- tidy(model2_clean, conf.int = TRUE)
knitr::kable(model2_clean_tidy, digits = 4, 
             caption = "Cleaned Model 2: Location, Time, and Environmental Effects on Temperature Error (Influential Points Removed)")

Cleaned Model 2: Location, Time, and Environmental Effects on Temperature Error (Influential Points Removed)
term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	-732.5966	265.6347	-2.7579	0.0072	-1260.9333	-204.2598
Lat	16.8665	6.3320	2.6637	0.0093	4.2723	29.4607
alt_m	0.0033	0.0031	1.0754	0.2853	-0.0028	0.0095
Hour_sin	-3.8479	2.0377	-1.8883	0.0625	-7.9008	0.2051
Hour_cos	3.2156	1.1877	2.7074	0.0082	0.8533	5.5780
DeviceRed	0.2973	1.1340	0.2622	0.7938	-1.9581	2.5527
cloudcover	0.0587	0.0205	2.8677	0.0052	0.0180	0.0993
solar_radiation	0.0011	0.0043	0.2507	0.8026	-0.0074	0.0095
windspeed	-0.1424	0.1274	-1.1178	0.2669	-0.3957	0.1110

6.7 Verify Diagnostic Assumptions on Cleaned Model

cat("DIAGNOSTIC TESTS ON CLEANED MODEL 1\n")

## DIAGNOSTIC TESTS ON CLEANED MODEL 1

cat("==================================\n\n")

## ==================================

# Normality test
shapiro_test_clean <- shapiro.test(residuals(model1_clean))
cat("Shapiro-Wilk Normality Test:\n")

## Shapiro-Wilk Normality Test:

cat("W statistic:", round(shapiro_test_clean$statistic, 4), "\n")

## W statistic: 0.955

cat("p-value:", format.pval(shapiro_test_clean$p.value, digits = 3), "\n")

## p-value: 0.00305

if (shapiro_test_clean$p.value < 0.05) {
  cat("Result: Residuals still show departure from normality\n\n")
} else {
  cat("Result: ✓ Residuals are normally distributed\n\n")
}

## Result: Residuals still show departure from normality

# Heteroscedasticity test
bp_test_clean <- bptest(model1_clean)
cat("Breusch-Pagan Heteroscedasticity Test:\n")

## Breusch-Pagan Heteroscedasticity Test:

cat("BP statistic:", round(bp_test_clean$statistic, 4), "\n")

## BP statistic: 8.5065

cat("p-value:", format.pval(bp_test_clean$p.value, digits = 3), "\n")

## p-value: 0.13

if (bp_test_clean$p.value < 0.05) {
  cat("Result: Some heteroscedasticity still present\n\n")
} else {
  cat("Result: ✓ Homoscedasticity assumption met\n\n")
}

## Result: ✓ Homoscedasticity assumption met

# Compare test statistics
cat("COMPARISON OF DIAGNOSTIC TESTS:\n")

## COMPARISON OF DIAGNOSTIC TESTS:

cat("                           Original Model    Cleaned Model\n")

##                            Original Model    Cleaned Model

cat("Shapiro-Wilk p-value:     ", format.pval(shapiro_test$p.value, digits = 4), "      ", 
    format.pval(shapiro_test_clean$p.value, digits = 4), "\n")

## Shapiro-Wilk p-value:      0.001186        0.003045

cat("Breusch-Pagan p-value:    ", format.pval(bp_test$p.value, digits = 4), "      ", 
    format.pval(bp_test_clean$p.value, digits = 4), "\n")

## Breusch-Pagan p-value:     2.162e-07        0.1304

# Diagnostic plots for cleaned model
par(mfrow = c(2, 2))
plot(model1_clean, which = 1:4)

par(mfrow = c(1, 1))

cat("DIAGNOSTIC TESTS ON CLEANED MODEL 2\n")

## DIAGNOSTIC TESTS ON CLEANED MODEL 2

cat("==================================\n\n")

## ==================================

# Normality test
shapiro_test_clean2 <- shapiro.test(residuals(model2_clean))
cat("Shapiro-Wilk Normality Test:\n")

## Shapiro-Wilk Normality Test:

cat("W statistic:", round(shapiro_test_clean2$statistic, 4), "\n")

## W statistic: 0.953

cat("p-value:", format.pval(shapiro_test_clean2$p.value, digits = 3), "\n")

## p-value: 0.00225

if (shapiro_test_clean2$p.value < 0.05) {
  cat("Result: Residuals still show departure from normality\n\n")
} else {
  cat("Result: ✓ Residuals are normally distributed\n\n")
}

## Result: Residuals still show departure from normality

# Heteroscedasticity test
bp_test_clean2 <- bptest(model2_clean)
cat("Breusch-Pagan Heteroscedasticity Test:\n")

## Breusch-Pagan Heteroscedasticity Test:

cat("BP statistic:", round(bp_test_clean2$statistic, 4), "\n")

## BP statistic: 12.1473

cat("p-value:", format.pval(bp_test_clean2$p.value, digits = 3), "\n")

## p-value: 0.145

if (bp_test_clean2$p.value < 0.05) {
  cat("Result: Some heteroscedasticity still present\n\n")
} else {
  cat("Result: ✓ Homoscedasticity assumption met\n\n")
}

## Result: ✓ Homoscedasticity assumption met

# Compare test statistics
cat("COMPARISON OF DIAGNOSTIC TESTS:\n")

## COMPARISON OF DIAGNOSTIC TESTS:

cat("                           Original Model    Cleaned Model\n")

##                            Original Model    Cleaned Model

cat("Shapiro-Wilk p-value:     ", format.pval(shapiro_test2$p.value, digits = 4), "      ", 
    format.pval(shapiro_test_clean2$p.value, digits = 4), "\n")

## Shapiro-Wilk p-value:      0.001141        0.002252

cat("Breusch-Pagan p-value:    ", format.pval(bp_test2$p.value, digits = 4), "      ", 
    format.pval(bp_test_clean2$p.value, digits = 4), "\n")

## Breusch-Pagan p-value:     7.469e-06        0.1448

# Diagnostic plots for cleaned model
par(mfrow = c(2, 2))
plot(model2_clean, which = 1:4)

par(mfrow = c(1, 1))

Key Learning Points: - Removing influential points can dramatically improve model fit and assumption compliance - However, consider WHY these points are influential - they may contain important information - Document any data removal decisions and justify them scientifically - The cleaned model provides more reliable coefficient estimates - There still are issues with the normality assumption (failed Shapiro-Wilk test) - There still seem to be additional points that are still influential. Consider addressing issues with them as well.

Exercise 4

Exercise 4: Investigate the influential points. Are they measurement errors, unusual environmental conditions, or legitimate extreme events? Should they be removed or studied separately? What do you think we should do with these influential points?

6.8 Model Predictions and Visualization

We will use model2_clean (with the influential points removed) with the dataset model_data_clean for subsequent analysis. If you think you like a different model better, replace every instance of “model2_clean” with “model1”, “model1_clean”, or “model2_clean” and model_data_clean with model_data accordingly.

# Add predictions and residuals for model 2 clean
model_data_clean$predicted <- predict(model2_clean, newdata = model_data_clean)
model_data_clean$residuals <- residuals(model2_clean, newdata = model_data_clean)

# Predicted vs Actual for model 1
ggplot(model_data_clean, aes(x = predicted, y = Temp_Diff)) +
  geom_point(aes(color = Device), alpha = 0.6, size = 3) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "black") +
  labs(title = "Model 2 Clean: Predicted vs Actual Temperature Error",
       x = "Predicted Temperature Error (°F)",
       y = "Actual Temperature Error (°F)") +
  theme_minimal() +
  scale_color_manual(values = c("Red" = "red", "Purple" = "purple"))

# Residuals by device for model 1
ggplot(model_data_clean, aes(x = Device, y = residuals, fill = Device)) +
  geom_boxplot() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Model 2 Clean: Residuals by Device",
       x = "Device",
       y = "Residuals (°F)") +
  theme_minimal() +
  scale_fill_manual(values = c("Red" = "red", "Purple" = "purple"))

6.9 Model Summary Table

# Create comprehensive model summary
model_summary <- data.frame(
  Model = c("Model 1", if(exists("model2")) "Model 2" else NULL),
  Predictors = c("Location + Time + Device", 
                 if(exists("model2")) "Location + Time + Device + Environment" else NULL),
  N = c(nobs(model1), if(exists("model2")) nobs(model2) else NULL),
  R_squared = c(summary(model1)$r.squared, 
                if(exists("model2")) summary(model2)$r.squared else NULL),
  Adj_R_squared = c(summary(model1)$adj.r.squared, 
                    if(exists("model2")) summary(model2)$adj.r.squared else NULL),
  RMSE = c(sqrt(mean(residuals(model1)^2)), 
           if(exists("model2")) sqrt(mean(residuals(model2)^2)) else NULL),
  AIC = c(AIC(model1), if(exists("model2")) AIC(model2) else NULL),
  BIC = c(BIC(model1), if(exists("model2")) BIC(model2) else NULL)
)

knitr::kable(model_summary, digits = 4, 
             caption = "Model Performance Comparison for Full Data",
             col.names = c("Model", "Predictors", "N", "R²", "Adj R²", "RMSE", "AIC", "BIC"))

Model Performance Comparison for Full Data
Model	Predictors	N	R²	Adj R²	RMSE	AIC	BIC
Model 1	Location + Time + Device	100	0.2814	0.2432	7.7523	707.3865	725.6227
Model 2	Location + Time + Device + Environment	100	0.3561	0.2994	7.3388	702.4219	728.4736

# Create comprehensive model summary for cleaned models
model_summary2 <- data.frame(
  Model = c("Model 1 Clean", if(exists("model2_clean")) "Model 2 Clean" else NULL),
  Predictors = c("Location + Time + Device", 
                 if(exists("model2_clean")) "Location + Time + Device + Environment" else NULL),
  N = c(nobs(model1_clean), if(exists("model2_clean")) nobs(model2_clean) else NULL),
  R_squared = c(summary(model1_clean)$r.squared, 
                if(exists("model2_clean")) summary(model2_clean)$r.squared else NULL),
  Adj_R_squared = c(summary(model1_clean)$adj.r.squared, 
                    if(exists("model2_clean")) summary(model2_clean)$adj.r.squared else NULL),
  RMSE = c(sqrt(mean(residuals(model1)^2)), 
           if(exists("model2_clean")) sqrt(mean(residuals(model2_clean)^2)) else NULL),
  AIC = c(AIC(model1_clean), if(exists("model2_clean")) AIC(model2_clean) else NULL),
  BIC = c(BIC(model1_clean), if(exists("model2_clean")) BIC(model2_clean) else NULL)
)

knitr::kable(model_summary2, digits = 4, 
             caption = "Model Performance Comparison for Cleaned Data",
             col.names = c("Model", "Predictors", "N", "R²", "Adj R²", "RMSE", "AIC", "BIC"))

Model Performance Comparison for Cleaned Data
Model	Predictors	N	R²	Adj R²	RMSE	AIC	BIC
Model 1 Clean	Location + Time + Device	92	0.3849	0.3491	7.7523	561.9764	579.6290
Model 2 Clean	Location + Time + Device + Environment	92	0.4463	0.3930	4.5113	558.2972	583.5151

6.10 Coefficient Plot

# Create model 1 coefficient plot with confidence intervals
coef_plot_data <- model1_clean_tidy %>%
  filter(term != "(Intercept)") %>%
  mutate(term = factor(term, levels = rev(term)))

ggplot(coef_plot_data, aes(x = estimate, y = term)) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "gray50") +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0.2, linewidth = 1) +
  geom_point(size = 3, color = "blue") +
  labs(title = "Model 1 (Clean) Coefficients with 95% Confidence Intervals",
       subtitle = "Effects on Temperature Error (°F)",
       x = "Coefficient Estimate",
       y = "Predictor Variable") +
  theme_minimal() +
  theme(axis.text.y = element_text(hjust = 1))

# Create model 2 coefficient plot with confidence intervals
coef_plot_data2 <- model2_clean_tidy %>%
  filter(term != "(Intercept)") %>%
  mutate(term = factor(term, levels = rev(term)))

ggplot(coef_plot_data2, aes(x = estimate, y = term)) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "gray50") +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0.2, linewidth = 1) +
  geom_point(size = 3, color = "blue") +
  labs(title = "Model 2 (Clean) Coefficients with 95% Confidence Intervals",
       subtitle = "Effects on Temperature Error (°F)",
       x = "Coefficient Estimate",
       y = "Predictor Variable") +
  theme_minimal() +
  theme(axis.text.y = element_text(hjust = 1))

6.11 Interactive 3D Visualization

# 3D scatter plot: Lat, Lon, Altitude colored by temp error
plot_ly(model_data, 
        x = ~Lon, y = ~Lat, z = ~alt_m,
        color = ~Temp_Diff, 
        colors = colorRamp(c("blue", "white", "red")),
        marker = list(size = 5),
        text = ~paste0("Device: ", Device, 
                      "<br>Temp Error: ", round(Temp_Diff, 2), "°F",
                      "<br>Altitude: ", round(alt_m, 0), "m"),
        hoverinfo = "text") %>%
  add_markers() %>%
  layout(title = "Temperature Error in 3D Space",
         scene = list(
           xaxis = list(title = "Longitude"),
           yaxis = list(title = "Latitude"),
           zaxis = list(title = "Altitude (m)")
         ))

6.12 Key Findings and Interpretation

# Extract significant predictors (p < 0.05)
significant_predictors <- model2_clean_tidy %>%
  filter(p.value < 0.05, term != "(Intercept)")

cat("REGRESSION ANALYSIS SUMMARY FOR MODEL 2 CLEAN (with influential points removed)\n")

## REGRESSION ANALYSIS SUMMARY FOR MODEL 2 CLEAN (with influential points removed)

cat("======================================================\n\n")

## ======================================================

cat("Model Performance:\n")

## Model Performance:

cat("- R-squared:", round(summary(model2_clean)$r.squared, 4), 
    "(model explains", round(summary(model2_clean)$r.squared * 100, 1), "% of variance)\n")

## - R-squared: 0.4463 (model explains 44.6 % of variance)

cat("- Adjusted R-squared:", round(summary(model2_clean)$adj.r.squared, 4), "\n")

## - Adjusted R-squared: 0.393

cat("- RMSE:", round(sqrt(mean(residuals(model2_clean)^2)), 3), "°F\n")

## - RMSE: 4.511 °F

cat("- F-statistic:", round(summary(model2_clean)$fstatistic[1], 2), 
    "on", summary(model2_clean)$fstatistic[2], "and", summary(model2_clean)$fstatistic[3], "DF\n")

## - F-statistic: 8.36 on 8 and 83 DF

cat("- Overall p-value:", format.pval(pf(summary(model2_clean)$fstatistic[1], 
    summary(model2_clean)$fstatistic[2], summary(model2_clean)$fstatistic[3], 
    lower.tail = FALSE), digits = 3), "\n\n")

## - Overall p-value: 2.95e-08

if (nrow(significant_predictors) > 0) {
  cat("Significant Predictors (p < 0.05):\n")
  for (i in 1:nrow(significant_predictors)) {
    cat("-", significant_predictors$term[i], ":", 
        round(significant_predictors$estimate[i], 4),
        "(p =", format.pval(significant_predictors$p.value[i], digits = 3), ")\n")
  }
} else {
  cat("No significant predictors at p < 0.05 level.\n")
}

## Significant Predictors (p < 0.05):
## - Lat : 16.8665 (p = 0.00928 )
## - Hour_cos : 3.2156 (p = 0.00823 )
## - cloudcover : 0.0587 (p = 0.00524 )

cat("\nInterpretation:\n")

## 
## Interpretation:

cat("- Device effect:", 
    if("DeviceRed" %in% significant_predictors$term) {
      "Red device shows significantly different error pattern"
    } else {
      "No significant difference between devices"
    }, "\n")

## - Device effect: No significant difference between devices

cat("- Spatial effects:", 
    if(any(c("Lat", "Lon", "alt_m") %in% significant_predictors$term)) {
      "Location/elevation significantly affects temperature error"
    } else {
      "Location/elevation does not significantly affect error"
    }, "\n")

## - Spatial effects: Location/elevation significantly affects temperature error

cat("- Temporal effects:", 
    if(any(c("Hour_sin", "Hour_cos") %in% significant_predictors$term)) {
      "Time of day significantly affects temperature error"
    } else {
      "Time of day does not significantly affect error"
    }, "\n")

## - Temporal effects: Time of day significantly affects temperature error

cat("- Environmental effects:", 
    if(any(c("cloudcover", "solar_radiation", "windspeed") %in% significant_predictors$term)) {
      "Environment significantly affects temperature error"
    } else {
      "Environment does not significantly affect error"
    }, "\n")

## - Environmental effects: Environment significantly affects temperature error

Exercise 5

Exercise 5: Based on the regression results, what are the most important factors influencing temperature measurement errors and how did you identify them?

7. Advanced Diagnostic Plots

7.1 Residuals vs Fitted by Device

# Enhanced residual plot
ggplot(model_data_clean, aes(x = predicted, y = residuals, color = Device)) +
  geom_point(alpha = 0.6, size = 3) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "black") +
  geom_smooth(method = "loess", se = TRUE, linewidth = 1) +
  labs(title = "Residual Analysis by Device",
       x = "Fitted Values (°F)",
       y = "Residuals (°F)") +
  theme_minimal() +
  scale_color_manual(values = c("Red" = "red", "Purple" = "purple"))

7.2 Q-Q Plot by Device

ggplot(model_data_clean, aes(sample = residuals, color = Device)) +
  stat_qq() +
  stat_qq_line() +
  facet_wrap(~Device) +
  labs(title = "Normal Q-Q Plots by Device",
       x = "Theoretical Quantiles",
       y = "Sample Quantiles") +
  theme_minimal() +
  scale_color_manual(values = c("Red" = "red", "Purple" = "purple"))

7.3 Scale-Location Plot

model_data_clean$sqrt_abs_residuals <- sqrt(abs(model_data_clean$residuals))

ggplot(model_data_clean, aes(x = predicted, y = sqrt_abs_residuals, color = Device)) +
  geom_point(alpha = 0.6, size = 3) +
  geom_smooth(method = "loess", se = TRUE) +
  labs(title = "Scale-Location Plot",
       subtitle = "Check for homoscedasticity",
       x = "Fitted Values (°F)",
       y = expression(sqrt("|Standardized Residuals|"))) +
  theme_minimal() +
  scale_color_manual(values = c("Red" = "red", "Purple" = "purple"))

7.4 Influence Plot

# Calculate Cook's distance
model_data_clean$cooks_d <- cooks.distance(model2_clean)

# Find influential points (Cook's D > 4/n)
threshold <- 4 / nrow(model_data_clean)
influential_points <- model_data_clean %>%
  filter(cooks_d > threshold) %>%
  arrange(desc(cooks_d))

ggplot(model_data_clean, aes(x = seq_along(cooks_d), y = cooks_d, color = Device)) +
  geom_point(alpha = 0.6, size = 3) +
  geom_hline(yintercept = threshold, linetype = "dashed", color = "red") +
  labs(title = "Cook's Distance - Influence Plot",
       subtitle = paste("Red line indicates threshold (4/n =", round(threshold, 4), ")"),
       x = "Observation Index",
       y = "Cook's Distance") +
  theme_minimal() +
  scale_color_manual(values = c("Red" = "red", "Purple" = "purple"))

if (nrow(influential_points) > 0) {
  cat("\nInfluential observations (Cook's D >", round(threshold, 4), "):\n")
  cat("Number of influential points:", nrow(influential_points), "\n")
  cat("Top 5 most influential:\n")
  print(head(influential_points %>% 
               select(TimeStampFull, Device, Temp_Diff, predicted, cooks_d), 5))
}

## 
## Influential observations (Cook's D > 0.0435 ):
## Number of influential points: 6 
## Top 5 most influential:
## # A tibble: 5 × 5
##   TimeStampFull       Device Temp_Diff predicted cooks_d
##   <dttm>              <chr>      <dbl>     <dbl>   <dbl>
## 1 2025-09-11 16:30:02 Red       -20.6     -1.76   0.239 
## 2 2025-09-08 23:18:47 Red        11.9      0.681  0.0662
## 3 2025-09-09 15:25:35 Red        -3.18     2.93   0.0538
## 4 2024-09-12 16:52:36 Purple     -2.42     2.99   0.0481
## 5 2025-09-04 21:49:32 Red        21.4      7.81   0.0469

8. Spatial Regression Analysis

8.1 Error Hotspot Map

# Create error magnitude categories
model_data_clean <- model_data_clean %>%
  mutate(
    Error_Magnitude = cut(Temp_Abs_Diff, 
                          breaks = c(0, 1, 3, 5, Inf),
                          labels = c("Very Low (<1°F)", "Low (1-3°F)", 
                                   "Medium (3-5°F)", "High (>5°F)"))
  )

# Color palette
error_pal <- colorFactor(
  palette = c("#2ecc71", "#f1c40f", "#e67e22", "#e74c3c"),
  domain = model_data_clean$Error_Magnitude
)

hotspot_map <- leaflet(model_data_clean) %>%
  addProviderTiles("Esri.WorldImagery", group = "Satellite") %>%
  addProviderTiles(providers$Esri.WorldTopoMap, group = "Topo") %>%
  addWMSTiles(
    baseUrl = "https://basemap.nationalmap.gov/arcgis/services/USGSTopo/MapServer/WMSServer?",
    layers = "0",
    options = WMSTileOptions(format = "image/png", transparent = TRUE),
    attribution = "USGS The National Map",
    group = "USGS Topo"  
  ) %>%
  
  addCircleMarkers(
    lng = ~Lon,
    lat = ~Lat,
    radius = ~Temp_Abs_Diff * 2,  # Size proportional to error magnitude
    color = ~error_pal(Error_Magnitude),
    fillOpacity = 0.7,
    stroke = TRUE,
    weight = 1,
    label = ~paste0("Error: ", round(Temp_Diff, 2), "°F | Magnitude: ", Error_Magnitude),
    popup = ~paste0(
      "<b>Error Hotspot Analysis</b><br>",
      "<b>Device:</b> ", Device, "<br>",
      "<b>Temperature Error:</b> ", round(Temp_Diff, 2), "°F<br>",
      "<b>Absolute Error:</b> ", round(Temp_Abs_Diff, 2), "°F<br>",
      "<b>Magnitude:</b> ", Error_Magnitude, "<br>",
      "<b>Model Prediction:</b> ", round(predicted, 2), "°F<br>",
      "<b>Residual:</b> ", round(residuals, 2), "°F<br>",
      "<hr>",
      "<b>Location:</b><br>",
      "Lat: ", round(Lat, 5), "<br>",
      "Lon: ", round(Lon, 5), "<br>",
      "Altitude: ", round(alt_m, 0), "m<br>",
      "<b>Time:</b> ", format(TimeStampFull, "%Y-%m-%d %H:%M")
    )
  ) %>%
  
  addLayersControl(
    baseGroups = c("Satellite", "Topo", "USGS Topo"),
    options = layersControlOptions(collapsed = FALSE)
  ) %>%
  
  addLegend(
    position = "bottomright",
    pal = error_pal,
    values = ~Error_Magnitude,
    title = "Error Magnitude",
    opacity = 1
  ) %>%
  
  addScaleBar(position = "bottomleft")

hotspot_map

8.2 Model Residuals Map

# Color palette for residuals
residual_pal <- colorNumeric(
  palette = "RdBu",
  domain = c(-max(abs(model_data_clean$residuals)), max(abs(model_data_clean$residuals))),
  reverse = TRUE
)

residuals_map <- leaflet(model_data_clean) %>%
  addProviderTiles("Esri.WorldTopoMap") %>%
  
  addCircleMarkers(
    lng = ~Lon,
    lat = ~Lat,
    radius = 6,
    color = ~residual_pal(residuals),
    fillOpacity = 0.8,
    stroke = TRUE,
    weight = 1,
    label = ~paste0("Residual: ", round(residuals, 2), "°F"),
    popup = ~paste0(
      "<b>Model Residual Analysis</b><br>",
      "<b>Observed:</b> ", round(Temp_Diff, 2), "°F<br>",
      "<b>Predicted:</b> ", round(predicted, 2), "°F<br>",
      "<b>Residual:</b> ", round(residuals, 2), "°F<br>",
      "<b>Device:</b> ", Device, "<br>",
      "<b>Time:</b> ", format(TimeStampFull, "%Y-%m-%d %H:%M")
    )
  ) %>%
  
  addLegend(
    position = "bottomright",
    pal = residual_pal,
    values = ~residuals,
    title = "Residuals (°F)<br>Red = Over-predicted<br>Blue = Under-predicted",
    opacity = 1
  )

residuals_map

9. Summary and Conclusions

9.1 Key Statistical Findings

cat("STATISTICAL ANALYSIS SUMMARY\n")

## STATISTICAL ANALYSIS SUMMARY

cat("============================\n\n")

## ============================

cat("1. BLAND-ALTMAN ANALYSIS\n")

## 1. BLAND-ALTMAN ANALYSIS

cat("   Temperature:\n")

##    Temperature:

cat("   - Mean Bias:", round(temp_bias, 3), "°F\n")

##    - Mean Bias: -0.08 °F

cat("   - 95% Limits of Agreement:", round(temp_lower_loa, 3), "to", round(temp_upper_loa, 3), "°F\n")

##    - 95% Limits of Agreement: -18.095 to 17.935 °F

cat("   - Within limits:", 
    round(sum(comparison_data$Temp_Diff >= temp_lower_loa & 
              comparison_data$Temp_Diff <= temp_upper_loa, na.rm = TRUE) / 
          sum(!is.na(comparison_data$Temp_Diff)) * 100, 1), "%\n\n")

##    - Within limits: 91 %

cat("2. PAIRED T-TESTS\n")

## 2. PAIRED T-TESTS

cat("   Temperature: t =", round(temp_ttest$statistic, 3), 
    ", p =", format.pval(temp_ttest$p.value, digits = 3), "\n")

##    Temperature: t = -0.087 , p = 0.931

cat("   Humidity: t =", round(hum_ttest$statistic, 3), 
    ", p =", format.pval(hum_ttest$p.value, digits = 3), "\n")

##    Humidity: t = 3.364 , p = 0.00109

cat("   Pressure: t =", round(press_ttest$statistic, 3), 
    ", p =", format.pval(press_ttest$p.value, digits = 3), "\n\n")

##    Pressure: t = -8.694 , p = 7.65e-14

cat("3. REGRESSION MODEL 2 CLEAN (with influential points removed)\n")

## 3. REGRESSION MODEL 2 CLEAN (with influential points removed)

cat("   R-squared:", round(summary(model2_clean)$r.squared, 4), "\n")

##    R-squared: 0.4463

cat("   Adjusted R-squared:", round(summary(model2_clean)$adj.r.squared, 4), "\n")

##    Adjusted R-squared: 0.393

cat("   RMSE:", round(sqrt(mean(residuals(model2_clean)^2)), 3), "°F\n")

##    RMSE: 4.511 °F

cat("   Number of significant predictors:", nrow(significant_predictors), "\n")

##    Number of significant predictors: 3

9.2 Recommendations

Based on the statistical analysis:

Sensor Accuracy:
- Evaluate the magnitude of bias from Bland-Altman plots
- Consider sensor calibration if systematic bias is detected
- Assess if errors are within acceptable limits for your application
Influential Factors:
- Review regression coefficients to identify key error drivers
- Consider environmental corrections for cloud cover
- Account for time-of-day effects in data interpretation
Model Improvement:
- Deal with influential data points. Where there days/time when the sensors were obviously not working correctly or were there errors in the data set?
- Consider transformations on the dependent variable to correct for non-normality of residuals
- Consider non-linear relationships or interactions
- Consider a mixed-effects model, where data points are grouped into time/space clusters
- Validate model on independent test data

9.3 Further Analysis Suggestions

Exercise 6

Exercise 6: Design and implement one of the following extensions and put your code in the chunk below.

Easier Extensions:

Perform seperate regression models for each device
Remove some variables from model2_clean that are not significant. Try to get the Baysian Information Criterion (BIC) or Akaike’s Information Criterion (AIC) as low as possible.

Harder Extensions:

Deal with the issue of influential points. There are still influential points in the model made from the cleaned data. Are there errors in the data set?
Deal with the issue of non-normality (failed Shapiro-Wilk test). Ideas: inspect distribution, maybe do a transform on the dependent variable (logarithm?), or use robust regression.
Deal with the issue of heteroscedasticity (failed Breusch-Pagan test). Ideas: weighted least squares, or robust standard errors.
Try a mixed effect model (try function lmer from package lme4 instead of lm). Create data clusters based on time/space bins and a term “+ (1 | cluster_variable_name)” to your model. This might also help deal with the issues of non-normality and/or heteroscedasticity

#put your code for exercise 5 extension below

Publishing to RPubs:

Knit this document to HTML
Click “Publish” in the preview window
Select RPubs and follow prompts
Submit your statistical analysis by pasting the url in Canvas

Data Export:

# Export enhanced dataset with model predictions
# write.csv(model_data, "weather_modeling_results.csv", row.names = FALSE)

cat("\nAnalysis complete! Enhanced dataset includes:\n")

## 
## Analysis complete! Enhanced dataset includes:

cat("- NOAA reference data\n")

## - NOAA reference data

cat("- Extended environmental variables\n")

## - Extended environmental variables

cat("- Model predictions and residuals\n")

## - Model predictions and residuals

cat("- Statistical test results\n")

## - Statistical test results

Weather Data Statistical Modeling

Your Name

2025-11-02