Introduction

This tutorial builds on the descriptive statistical analysis of Part 1 by performing statistical modeling to understand the differences between sensor measurements and NOAA reference data. We’ll use Bland-Altman plots, paired t-tests, collect additional meteorological variables, and build linear regression models to identify factors that influence measurement errors.

1. Load Required Packages

# Install additional packages if needed
# install.packages(c("ggExtra", "broom", "car", "lmtest", "ggpubr"))

library(dplyr)
library(ggplot2)
library(plotly)
library(leaflet)
library(lubridate)
library(ggExtra)     # For marginal plots
library(broom)       # For tidy model output
library(car)         # For regression diagnostics
library(lmtest)      # For statistical tests
library(ggpubr)      # For publication-ready plots
library(httr)
library(jsonlite)
library(readr)

2. Load Data from Part 1

# Assuming you have the comparison_data from Part 1 saved as "CDT_Weather_Comparison.csv" in your working directory (part 9 from the first the activity).

# If not, you'll need to run the first activity again first to generate it. Make sure to (uncomment and then) run the code in part 9 that saves the data to the csv file. This code is:  write.csv(comparison_data, "CDT_Weather_Comparison.csv", row.names = FALSE)

#or, optionally you can just download this csv file from canvas and put it in your working directory.
 
comparison_data <- read_csv("CDT_Weather_Comparison.csv")


# For this tutorial, we'll assume comparison_data exists
cat("Data dimensions:", nrow(comparison_data), "rows x", ncol(comparison_data), "columns\n")

## Data dimensions: 100 rows x 27 columns

cat("Variables available:\n")

## Variables available:

print(names(comparison_data))

##  [1] "TimeStampFull"        "Date"                 "Lat"                 
##  [4] "Lon"                  "alt_m"                "Temp_F"              
##  [7] "Hum"                  "Press"                "DP"                  
## [10] "Device"               "Temp_C"               "NOAA_Temp_F"         
## [13] "NOAA_Humidity"        "NOAA_Pressure_hPa"    "Temp_Diff"           
## [16] "Temp_Pct_Diff"        "Temp_Abs_Diff"        "Hum_Diff"            
## [19] "Hum_Abs_Diff"         "Press_Diff"           "Press_Abs_Diff"      
## [22] "Temp_Error_Category"  "Hum_Error_Category"   "Press_Error_Category"
## [25] "Temp_Error_Type"      "Hum_Error_Type"       "Press_Error_Type"

3. Bland-Altman Analysis

Bland-Altman plots are the gold standard for assessing agreement between two measurement methods.

3.1 Temperature Bland-Altman Plot

# Calculate mean and difference for Bland-Altman
comparison_data <- comparison_data %>%
  mutate(
    Temp_Mean = (Temp_F + NOAA_Temp_F) / 2,
    Hum_Mean = (Hum + NOAA_Humidity) / 2,
    Press_Mean = (Press + NOAA_Pressure_hPa) / 2
  )

# Calculate bias and limits of agreement
temp_bias <- mean(comparison_data$Temp_Diff, na.rm = TRUE)
temp_sd <- sd(comparison_data$Temp_Diff, na.rm = TRUE)
temp_upper_loa <- temp_bias + 1.96 * temp_sd
temp_lower_loa <- temp_bias - 1.96 * temp_sd

# Create Bland-Altman plot
ba_temp <- ggplot(comparison_data, aes(x = Temp_Mean, y = Temp_Diff)) +
  geom_point(aes(color = Device), alpha = 0.6, size = 3) +
  geom_hline(yintercept = temp_bias, linetype = "solid", color = "blue", linewidth = 1.2) +
  geom_hline(yintercept = temp_upper_loa, linetype = "dashed", color = "red", linewidth = 1) +
  geom_hline(yintercept = temp_lower_loa, linetype = "dashed", color = "red", linewidth = 1) +
  geom_hline(yintercept = 0, linetype = "dotted", color = "black", linewidth = 0.8) +
  annotate("text", x = min(comparison_data$Temp_Mean), y = temp_bias, 
           label = paste0("Bias = ", round(temp_bias, 2), "°F"), hjust = 0, vjust = -0.5) +
  annotate("text", x = min(comparison_data$Temp_Mean), y = temp_upper_loa, 
           label = paste0("+1.96 SD = ", round(temp_upper_loa, 2), "°F"), hjust = 0, vjust = -0.5) +
  annotate("text", x = min(comparison_data$Temp_Mean), y = temp_lower_loa, 
           label = paste0("-1.96 SD = ", round(temp_lower_loa, 2), "°F"), hjust = 0, vjust = 1.5) +
  labs(title = "Bland-Altman Plot: Temperature",
       subtitle = "Agreement between Sensor and NOAA measurements",
       x = "Mean Temperature [(Sensor + NOAA) / 2] (°F)",
       y = "Difference (Sensor - NOAA) (°F)") +
  theme_minimal() +
  scale_color_manual(values = c("Red" = "red", "Purple" = "purple"))

# Add marginal density plots
ggMarginal(ba_temp, type = "density", groupColour = TRUE, groupFill = TRUE)

Interpretation: - Blue line: Mean bias (systematic difference) - Red dashed lines: 95% limits of agreement (±1.96 SD) - Points within limits: Good agreement (95% of differences should fall within these limits) - Pattern detection: Look for trends or heteroscedasticity

3.2 Humidity Bland-Altman Plot

# Calculate bias and limits for humidity
hum_bias <- mean(comparison_data$Hum_Diff, na.rm = TRUE)
hum_sd <- sd(comparison_data$Hum_Diff, na.rm = TRUE)
hum_upper_loa <- hum_bias + 1.96 * hum_sd
hum_lower_loa <- hum_bias - 1.96 * hum_sd

ba_hum <- ggplot(comparison_data, aes(x = Hum_Mean, y = Hum_Diff)) +
  geom_point(aes(color = Device), alpha = 0.6, size = 3) +
  geom_hline(yintercept = hum_bias, linetype = "solid", color = "blue", linewidth = 1.2) +
  geom_hline(yintercept = hum_upper_loa, linetype = "dashed", color = "red", linewidth = 1) +
  geom_hline(yintercept = hum_lower_loa, linetype = "dashed", color = "red", linewidth = 1) +
  geom_hline(yintercept = 0, linetype = "dotted", color = "black", linewidth = 0.8) +
  annotate("text", x = min(comparison_data$Hum_Mean), y = hum_bias, 
           label = paste0("Bias = ", round(hum_bias, 2), "%"), hjust = 0, vjust = -0.5) +
  annotate("text", x = min(comparison_data$Hum_Mean), y = hum_upper_loa, 
           label = paste0("+1.96 SD = ", round(hum_upper_loa, 2), "%"), hjust = 0, vjust = -0.5) +
  annotate("text", x = min(comparison_data$Hum_Mean), y = hum_lower_loa, 
           label = paste0("-1.96 SD = ", round(hum_lower_loa, 2), "%"), hjust = 0, vjust = 1.5) +
  labs(title = "Bland-Altman Plot: Humidity",
       x = "Mean Humidity [(Sensor + NOAA) / 2] (%)",
       y = "Difference (Sensor - NOAA) (%)") +
  theme_minimal() +
  scale_color_manual(values = c("Red" = "red", "Purple" = "purple"))

ggMarginal(ba_hum, type = "density", groupColour = TRUE, groupFill = TRUE)

3.3 Pressure Bland-Altman Plot

Exercise 1

Exercise 1: Create a Bland-Altman plot for pressure following the same pattern as temperature and humidity above. Copy and paste the code from the example above into the code chunk below and modify it accordingly.

#put code for Excercise 1 below
# Calculate bias and limits for pressure
press_bias <- mean(comparison_data$Press_Diff, na.rm = TRUE)
press_sd <- sd(comparison_data$Press_Diff, na.rm = TRUE)
press_upper_loa <- press_bias + 1.96 * press_sd
press_lower_loa <- press_bias - 1.96 * press_sd

ba_press <- ggplot(comparison_data, aes(x = Press_Mean, y = Press_Diff)) +
  geom_point(aes(color = Device), alpha = 0.6, size = 3) +
  geom_hline(yintercept = press_bias, linetype = "solid", color = "blue", linewidth = 1.2) +
  geom_hline(yintercept = press_upper_loa, linetype = "dashed", color = "red", linewidth = 1) +
  geom_hline(yintercept = press_lower_loa, linetype = "dashed", color = "red", linewidth = 1) +
  geom_hline(yintercept = 0, linetype = "dotted", color = "black", linewidth = 0.8) +
  annotate("text", x = min(comparison_data$Press_Mean), y = press_bias, 
           label = paste0("Bias = ", round(press_bias, 2), "%"), hjust = 0, vjust = -0.5) +
  annotate("text", x = min(comparison_data$Press_Mean), y = press_upper_loa, 
           label = paste0("+1.96 SD = ", round(press_upper_loa, 2), "%"), hjust = 0, vjust = -0.5) +
  annotate("text", x = min(comparison_data$Press_Mean), y = press_lower_loa, 
           label = paste0("-1.96 SD = ", round(press_lower_loa, 2), "%"), hjust = 0, vjust = 1.5) +
  labs(title = "Bland-Altman Plot: Pressure",
       x = "Mean Pressure [(Sensor + NOAA) / 2] (%)",
       y = "Difference (Sensor - NOAA) (%)") +
  theme_minimal() +
  scale_color_manual(values = c("Red" = "red", "Purple" = "purple"))

ggMarginal(ba_press, type = "density", groupColour = TRUE, groupFill = TRUE)

3.4 Bland-Altman Summary Statistics

# Create summary table
ba_summary <- data.frame(
  Variable = c("Temperature (°F)", "Humidity (%)", "Pressure (hPa)"),
  Bias = c(temp_bias, hum_bias, mean(comparison_data$Press_Diff, na.rm = TRUE)),
  SD = c(temp_sd, hum_sd, sd(comparison_data$Press_Diff, na.rm = TRUE)),
  Lower_LOA = c(temp_lower_loa, hum_lower_loa, 
                mean(comparison_data$Press_Diff, na.rm = TRUE) - 
                  1.96 * sd(comparison_data$Press_Diff, na.rm = TRUE)),
  Upper_LOA = c(temp_upper_loa, hum_upper_loa, 
                mean(comparison_data$Press_Diff, na.rm = TRUE) + 
                  1.96 * sd(comparison_data$Press_Diff, na.rm = TRUE))
)

knitr::kable(ba_summary, digits = 2, 
             caption = "Bland-Altman Analysis: Bias and Limits of Agreement",
             col.names = c("Variable", "Bias", "SD", "Lower LOA", "Upper LOA"))

Bland-Altman Analysis: Bias and Limits of Agreement
Variable	Bias	SD	Lower LOA	Upper LOA
Temperature (°F)	-0.12	8.99	-17.75	17.50
Humidity (%)	7.26	15.73	-23.57	38.09
Pressure (hPa)	-4.53	5.31	-14.95	5.89

4. Paired T-Tests

Paired t-tests assess whether there is a statistically significant systematic difference between sensor and NOAA measurements.

4.1 Temperature Paired T-Test

# Perform paired t-test
temp_ttest <- t.test(comparison_data$Temp_F, comparison_data$NOAA_Temp_F, 
                     paired = TRUE, alternative = "two.sided")

# Calculate effect size (Cohen's d)
temp_cohens_d <- temp_bias / temp_sd

cat("Temperature Paired T-Test Results:\n")

## Temperature Paired T-Test Results:

cat("=====================================\n")

## =====================================

cat("Mean difference (Bias):", round(temp_bias, 3), "°F\n")

## Mean difference (Bias): -0.124 °F

cat("t-statistic:", round(temp_ttest$statistic, 3), "\n")

## t-statistic: -0.138

cat("Degrees of freedom:", temp_ttest$parameter, "\n")

## Degrees of freedom: 99

cat("p-value:", format.pval(temp_ttest$p.value, digits = 3), "\n")

## p-value: 0.891

cat("95% CI:", round(temp_ttest$conf.int[1], 3), "to", round(temp_ttest$conf.int[2], 3), "°F\n")

## 95% CI: -1.908 to 1.66 °F

cat("Cohen's d (effect size):", round(temp_cohens_d, 3), "\n")

## Cohen's d (effect size): -0.014

cat("\nInterpretation:\n")

## 
## Interpretation:

if (temp_ttest$p.value < 0.05) {
  cat("There IS a statistically significant difference between sensor and NOAA temperatures (p < 0.05).\n")
} else {
  cat("There is NO statistically significant difference between sensor and NOAA temperatures (p >= 0.05).\n")
}

## There is NO statistically significant difference between sensor and NOAA temperatures (p >= 0.05).

4.2 Humidity and Pressure Paired T-Tests

# Humidity t-test
hum_ttest <- t.test(comparison_data$Hum, comparison_data$NOAA_Humidity, 
                    paired = TRUE, alternative = "two.sided")

# Pressure t-test
press_ttest <- t.test(comparison_data$Press, comparison_data$NOAA_Pressure_hPa, 
                      paired = TRUE, alternative = "two.sided")

# Create summary table
ttest_summary <- data.frame(
  Variable = c("Temperature", "Humidity", "Pressure"),
  Mean_Diff = c(temp_bias, hum_bias, mean(comparison_data$Press_Diff, na.rm = TRUE)),
  t_statistic = c(temp_ttest$statistic, hum_ttest$statistic, press_ttest$statistic),
  df = c(temp_ttest$parameter, hum_ttest$parameter, press_ttest$parameter),
  p_value = c(temp_ttest$p.value, hum_ttest$p.value, press_ttest$p.value),
  CI_Lower = c(temp_ttest$conf.int[1], hum_ttest$conf.int[1], press_ttest$conf.int[1]),
  CI_Upper = c(temp_ttest$conf.int[2], hum_ttest$conf.int[2], press_ttest$conf.int[2]),
  Significant = c(temp_ttest$p.value < 0.05, hum_ttest$p.value < 0.05, press_ttest$p.value < 0.05)
)

knitr::kable(ttest_summary, digits = 4, 
             caption = "Paired T-Test Results Summary",
             col.names = c("Variable", "Mean Diff", "t", "df", "p-value", "CI Lower", "CI Upper", "Significant (α=0.05)"))

Paired T-Test Results Summary
Variable	Mean Diff	t	df	p-value	CI Lower	CI Upper	Significant (α=0.05)
Temperature	-0.124	-0.1379	99	0.8906	-1.9081	1.6601	FALSE
Humidity	7.260	4.6149	99	0.0000	4.1385	10.3815	TRUE
Pressure	-4.528	-8.5194	99	0.0000	-5.5826	-3.4734	TRUE

4.3 Paired T-Test by Device

# Test each device separately
red_data <- comparison_data %>% filter(Device == "Red")
purple_data <- comparison_data %>% filter(Device == "Purple")

if (nrow(red_data) > 1) {
  red_temp_ttest <- t.test(red_data$Temp_F, red_data$NOAA_Temp_F, paired = TRUE)
  cat("Red Device Temperature T-Test:\n")
  cat("Mean difference:", round(mean(red_data$Temp_Diff, na.rm = TRUE), 3), "°F\n")
  cat("p-value:", format.pval(red_temp_ttest$p.value, digits = 3), "\n\n")
}

## Red Device Temperature T-Test:
## Mean difference: 0.096 °F
## p-value: 0.941

if (nrow(purple_data) > 1) {
  purple_temp_ttest <- t.test(purple_data$Temp_F, purple_data$NOAA_Temp_F, paired = TRUE)
  cat("Purple Device Temperature T-Test:\n")
  cat("Mean difference:", round(mean(purple_data$Temp_Diff, na.rm = TRUE), 3), "°F\n")
  cat("p-value:", format.pval(purple_temp_ttest$p.value, digits = 3), "\n")
}

## Purple Device Temperature T-Test:
## Mean difference: -0.483 °F
## p-value: 0.661

Exercise 2

Exercise 2:

Interpret the t-test results. Write your interpretation below each question part below.

2a) From 4.1 and 4.2, which weather variable(s) (temp, pressure, humidity) were significant?

There is a significant difference between sensor and NOAA humidity (p < 0.05) as well as between sensor and NOAA pressure (p < 0.05). There is no significant difference between sensor and NOAA temperature (p > 0.015).

2b) From 4.3, which device (if any) shows systematic bias for temperature? Red device has a difference of 0.096 degrees, resulting in no significant bias. The Purple device has a slightly negative difference of -0.483 degrees, which is also not significant, but mean it has more of a consistent bias, reading slightly lower than NOAA.

2c) Modify the code from 4.3 do a paired t test for each device but for one of the other variables (pressure or humidity). Then interpret your results.

#put your modified 4.3 code here:
# Test each device separately
red_data <- comparison_data %>% filter(Device == "Red")
purple_data <- comparison_data %>% filter(Device == "Purple")

if (nrow(red_data) > 1) {
  red_hum_ttest <- t.test(red_data$Hum, red_data$NOAA_Humidity, paired = TRUE)
  cat("Red Device Humidity T-Test:\n")
  cat("Mean difference:", round(mean(red_data$Hum_Diff, na.rm = TRUE), 3), "%\n")
  cat("p-value:", format.pval(red_hum_ttest$p.value, digits = 3), "\n\n")
}

## Red Device Humidity T-Test:
## Mean difference: 4.371 %
## p-value: 0.0424

if (nrow(purple_data) > 1) {
  purple_hum_ttest <- t.test(purple_data$Hum, purple_data$NOAA_Humidity, paired = TRUE)
  cat("Purple Device Humidity T-Test:\n")
  cat("Mean difference:", round(mean(purple_data$Hum_Diff, na.rm = TRUE), 3), "%\n")
  cat("p-value:", format.pval(purple_hum_ttest$p.value, digits = 3), "\n")
}

## Purple Device Humidity T-Test:
## Mean difference: 11.974 %
## p-value: 1.9e-06

red_data <- comparison_data %>% filter(Device == "Red")
purple_data <- comparison_data %>% filter(Device == "Purple")

if (nrow(red_data) > 1) {
  red_press_ttest <- t.test(red_data$Press, red_data$NOAA_Pressure_hPa, paired = TRUE)
  cat("Red Device Pressure T-Test:\n")
  cat("Mean difference:", round(mean(red_data$Press_Diff, na.rm = TRUE), 3), "hPa\n")
  cat("p-value:", format.pval(red_press_ttest$p.value, digits = 3), "\n\n")
}

## Red Device Pressure T-Test:
## Mean difference: -4.748 hPa
## p-value: 6.89e-08

if (nrow(purple_data) > 1) {
  purple_press_ttest <- t.test(purple_data$Press, purple_data$NOAA_Pressure_hPa, paired = TRUE)
  cat("Purple Device Pressure T-Test:\n")
  cat("Mean difference:", round(mean(purple_data$Press_Diff, na.rm = TRUE), 3), "hPa\n")
  cat("p-value:", format.pval(purple_hum_ttest$p.value, digits = 3), "\n")
}

## Purple Device Pressure T-Test:
## Mean difference: -4.168 hPa
## p-value: 1.9e-06

Results for red show: Average difference of 4.37% humidity with a p-value of 0.042 Average difference of -4.748hPa pressure with a p-value of 6.89e-08

Results for purple show: Average difference of 11.974% humidity with a p-value of 1.9e-06 Average difference of -4.168hPa pressure with a p-value of 1.9e-06

All of these p values show a significant difference between the NOAA and device readings. For humidity, both sensors read higher than NOAA, but purple device shows a significantly larger bias compared to red. For pressure, both sensors read lower than NOAA pressure.

5. Collect Additional NOAA Variables

Let’s fetch additional meteorological variables that might explain measurement differences.

5.1 Function to Fetch Extended Weather Data

get_extended_weather_data <- function(lat, lon, datetime, debug = FALSE) {
  date <- as.Date(datetime)
  
  base_url <- "https://archive-api.open-meteo.com/v1/archive"
  
  url <- paste0(
    base_url,
    "?latitude=", lat,
    "&longitude=", lon,
    "&start_date=", date,
    "&end_date=", date,
    "&hourly=cloudcover,shortwave_radiation,windspeed_10m,winddirection_10m,precipitation",
    "&temperature_unit=fahrenheit"
  )
  
  default_result <- list(
    cloudcover = NA_real_,
    shortwave_radiation = NA_real_,
    windspeed_10m = NA_real_,
    winddirection_10m = NA_real_,
    precipitation = NA_real_
  )
  
  result <- tryCatch({
    response <- GET(url)
    
    if (status_code(response) == 200) {
      data <- content(response, "parsed")
      
      if (is.null(data$hourly)) {
        return(default_result)
      }
      
      hourly_data <- data$hourly
      target_hour <- hour(datetime)
      idx <- target_hour + 1
      
      # Extract values with NULL checks
      cloudcover_val <- if (!is.null(hourly_data$cloudcover) && idx <= length(hourly_data$cloudcover)) {
        val <- hourly_data$cloudcover[idx]
        if (!is.null(val)) as.numeric(val) else NA_real_
      } else NA_real_
      
      radiation_val <- if (!is.null(hourly_data$shortwave_radiation) && idx <= length(hourly_data$shortwave_radiation)) {
        val <- hourly_data$shortwave_radiation[idx]
        if (!is.null(val)) as.numeric(val) else NA_real_
      } else NA_real_
      
      windspeed_val <- if (!is.null(hourly_data$windspeed_10m) && idx <= length(hourly_data$windspeed_10m)) {
        val <- hourly_data$windspeed_10m[idx]
        if (!is.null(val)) as.numeric(val) else NA_real_
      } else NA_real_
      
      winddir_val <- if (!is.null(hourly_data$winddirection_10m) && idx <= length(hourly_data$winddirection_10m)) {
        val <- hourly_data$winddirection_10m[idx]
        if (!is.null(val)) as.numeric(val) else NA_real_
      } else NA_real_
      
      precip_val <- if (!is.null(hourly_data$precipitation) && idx <= length(hourly_data$precipitation)) {
        val <- hourly_data$precipitation[idx]
        if (!is.null(val)) as.numeric(val) else NA_real_
      } else NA_real_
      
      return(list(
        cloudcover = cloudcover_val,
        shortwave_radiation = radiation_val,
        windspeed_10m = windspeed_val,
        winddirection_10m = winddir_val,
        precipitation = precip_val
      ))
    } else {
      return(default_result)
    }
  }, error = function(e) {
    if (debug) cat("Error:", e$message, "\n")
    return(default_result)
  })
  
  return(result)
}

5.2 Fetch Extended Weather Variables

OpenMeteo has other NOAA data variables we can add to our dataset. This might be helpful to investigate if any of the differences in weather data are dependent on any of these environmental variables.

# Initialize columns
comparison_data$cloudcover <- NA
comparison_data$solar_radiation <- NA
comparison_data$windspeed <- NA
comparison_data$wind_direction <- NA
comparison_data$precipitation <- NA

cat("Fetching extended weather data...\n")

## Fetching extended weather data...

cat("NOTE: Due to API rate limits, we'll fetch with longer delays.\n\n")

## NOTE: Due to API rate limits, we'll fetch with longer delays.

success_count <- 0

for (i in 1:nrow(comparison_data)) {
  if (i %% 5 == 0) {
    cat("Processing point", i, "of", nrow(comparison_data), "\n")
  }
  
  ext_data <- get_extended_weather_data(
    comparison_data$Lat[i],
    comparison_data$Lon[i],
    comparison_data$TimeStampFull[i]
  )
  
  # Safely assign with length checks
  if (length(ext_data$cloudcover) > 0) {
    comparison_data$cloudcover[i] <- ext_data$cloudcover
  }
  if (length(ext_data$shortwave_radiation) > 0) {
    comparison_data$solar_radiation[i] <- ext_data$shortwave_radiation
  }
  if (length(ext_data$windspeed_10m) > 0) {
    comparison_data$windspeed[i] <- ext_data$windspeed_10m
  }
  if (length(ext_data$winddirection_10m) > 0) {
    comparison_data$wind_direction[i] <- ext_data$winddirection_10m
  }
  if (length(ext_data$precipitation) > 0) {
    comparison_data$precipitation[i] <- ext_data$precipitation
  }
  
  if (!is.na(comparison_data$cloudcover[i])) {
    success_count <- success_count + 1
  }
  
  Sys.sleep(0.5)  # delay to respect rate limits. If you download too fast or too much data your IP address will get cut off for the day!
}

## Processing point 5 of 100 
## Processing point 10 of 100 
## Processing point 15 of 100 
## Processing point 20 of 100 
## Processing point 25 of 100 
## Processing point 30 of 100 
## Processing point 35 of 100 
## Processing point 40 of 100 
## Processing point 45 of 100 
## Processing point 50 of 100 
## Processing point 55 of 100 
## Processing point 60 of 100 
## Processing point 65 of 100 
## Processing point 70 of 100 
## Processing point 75 of 100 
## Processing point 80 of 100 
## Processing point 85 of 100 
## Processing point 90 of 100 
## Processing point 95 of 100 
## Processing point 100 of 100

cat("\nExtended data fetch complete!\n")

## 
## Extended data fetch complete!

cat("Successfully retrieved:", success_count, "of", nrow(comparison_data), "points\n")

## Successfully retrieved: 100 of 100 points

5.3 Visualize Additional Variables

# Cloud cover vs temperature error
p1 <- ggplot(comparison_data %>% filter(!is.na(cloudcover)), 
       aes(x = cloudcover, y = Temp_Diff, color = Device)) +
  geom_point(alpha = 0.6, size = 3) +
  geom_smooth(method = "loess", se = TRUE) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Temperature Error vs Cloud Cover",
       x = "Cloud Cover (%)",
       y = "Temperature Error (°F)") +
  theme_minimal() +
  scale_color_manual(values = c("Red" = "red", "Purple" = "purple"))

# Solar radiation vs temperature error
p2 <- ggplot(comparison_data %>% filter(!is.na(solar_radiation)), 
       aes(x = solar_radiation, y = Temp_Diff, color = Device)) +
  geom_point(alpha = 0.6, size = 3) +
  geom_smooth(method = "loess", se = TRUE) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Temperature Error vs Solar Radiation",
       x = "Solar Radiation (W/m²)",
       y = "Temperature Error (°F)") +
  theme_minimal() +
  scale_color_manual(values = c("Red" = "red", "Purple" = "purple"))

# Wind speed vs temperature error
p3 <- ggplot(comparison_data %>% filter(!is.na(windspeed)), 
       aes(x = windspeed, y = Temp_Diff, color = Device)) +
  geom_point(alpha = 0.6, size = 3) +
  geom_smooth(method = "loess", se = TRUE) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Temperature Error vs Wind Speed",
       x = "Wind Speed (km/h)",
       y = "Temperature Error (°F)") +
  theme_minimal() +
  scale_color_manual(values = c("Red" = "red", "Purple" = "purple"))

print(p1)

print(p2)

print(p3)

5.4 Interactive Map of Environmental Variables

# Create color palette for cloud cover
cloud_pal <- colorNumeric(
  palette = "Blues",
  domain = comparison_data$cloudcover,
  na.color = "transparent"
)

# Create map
env_map <- leaflet(comparison_data %>% filter(!is.na(cloudcover))) %>%
  addProviderTiles("Esri.WorldImagery", group = "Satellite") %>%
  addProviderTiles(providers$Esri.WorldTopoMap, group = "Topo") %>%
  
  # Cloud cover markers
  addCircleMarkers(
    lng = ~Lon,
    lat = ~Lat,
    radius = 6,
    color = ~cloud_pal(cloudcover),
    fillOpacity = 0.8,
    stroke = TRUE,
    weight = 1,
    group = "Cloud Cover",
    label = ~paste0("Cloud Cover: ", round(cloudcover, 0), "%"),
    popup = ~paste0(
      "<b>Environmental Conditions</b><br>",
      "Cloud Cover: ", round(cloudcover, 0), "%<br>",
      "Solar Radiation: ", round(solar_radiation, 0), " W/m²<br>",
      "Wind Speed: ", round(windspeed, 1), " km/h<br>",
      "Temperature Error: ", round(Temp_Diff, 1), "°F"
    )
  ) %>%
  
  # Solar radiation markers
  addCircleMarkers(
    lng = ~Lon,
    lat = ~Lat,
    radius = ~scales::rescale(solar_radiation, to = c(3, 10)),
    color = "orange",
    fillOpacity = 0.7,
    stroke = TRUE,
    weight = 1,
    group = "Solar Radiation",
    label = ~paste0("Solar: ", round(solar_radiation, 0), " W/m²"),
    popup = ~paste0(
      "<b>Solar Radiation</b><br>",
      "Value: ", round(solar_radiation, 0), " W/m²<br>",
      "Cloud Cover: ", round(cloudcover, 0), "%<br>",
      "Temperature Error: ", round(Temp_Diff, 1), "°F"
    )
  ) %>%
  
  addLayersControl(
    baseGroups = c("Satellite", "Topo"),
    overlayGroups = c("Cloud Cover", "Solar Radiation"),
    options = layersControlOptions(collapsed = FALSE)
  ) %>%
  
  hideGroup("Solar Radiation") %>%
  
  addLegend(
    position = "bottomright",
    pal = cloud_pal,
    values = ~cloudcover,
    title = "Cloud Cover (%)",
    group = "Cloud Cover"
  )

env_map

6. Linear Regression Modeling

Build models to predict temperature error based on multiple factors.

6.1 Prepare Modeling Data

# Add time-based variables
model_data <- comparison_data %>%
  filter(!is.na(Temp_Diff)) %>%
  mutate(
    Hour = hour(TimeStampFull),
    Hour_sin = sin(2 * pi * Hour / 24),  # Circular encoding of hour
    Hour_cos = cos(2 * pi * Hour / 24),
    DayOfYear = yday(Date)
  )

cat("Modeling dataset:", nrow(model_data), "observations\n")

## Modeling dataset: 100 observations

cat("Complete cases (no missing values):", sum(complete.cases(model_data)), "\n")

## Complete cases (no missing values): 100

6.2 Model 1: Basic Model (Location and Time)

model1 <- lm(Temp_Diff ~ Lat  + alt_m + Hour_sin + Hour_cos + Device, 
             data = model_data)

summary(model1)

## 
## Call:
## lm(formula = Temp_Diff ~ Lat + alt_m + Hour_sin + Hour_cos + 
##     Device, data = model_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -27.8536  -2.6879  -0.2452   3.8138  17.6757 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -9.552e+02  4.153e+02  -2.300 0.023655 *  
## Lat          2.248e+01  1.040e+01   2.161 0.033241 *  
## alt_m       -2.531e-03  1.182e-02  -0.214 0.830823    
## Hour_sin    -2.427e+00  1.146e+00  -2.117 0.036895 *  
## Hour_cos     4.221e+00  1.174e+00   3.596 0.000518 ***
## DeviceRed    1.449e+00  1.742e+00   0.832 0.407360    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.981 on 94 degrees of freedom
## Multiple R-squared:  0.252,  Adjusted R-squared:  0.2122 
## F-statistic: 6.332 on 5 and 94 DF,  p-value: 4.147e-05

# Tidy output
model1_tidy <- tidy(model1, conf.int = TRUE)
knitr::kable(model1_tidy, digits = 4, 
             caption = "Model 1: Location and Time Effects on Temperature Error")

Model 1: Location and Time Effects on Temperature Error
term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	-955.2139	415.2918	-2.3001	0.0237	-1779.7855	-130.6424
Lat	22.4786	10.4022	2.1609	0.0332	1.8248	43.1324
alt_m	-0.0025	0.0118	-0.2142	0.8308	-0.0260	0.0209
Hour_sin	-2.4267	1.1463	-2.1171	0.0369	-4.7026	-0.1508
Hour_cos	4.2211	1.1740	3.5956	0.0005	1.8902	6.5521
DeviceRed	1.4495	1.7416	0.8323	0.4074	-2.0085	4.9074

6.2 Model 2: Extended Model (Adding Environmental Variables)

# Filter to complete cases for fair comparison
model_data_complete <- model_data %>%
  filter(!is.na(cloudcover) & !is.na(solar_radiation) & !is.na(windspeed))

if (nrow(model_data_complete) > 10) {
  
  model2 <- lm(Temp_Diff ~ Lat + alt_m + Hour_sin + Hour_cos + Device +
                 cloudcover + solar_radiation + windspeed, 
               data = model_data_complete)
  
  summary(model2)
  
  model2_tidy <- tidy(model2, conf.int = TRUE)
  knitr::kable(model2_tidy, digits = 4, 
               caption = "Model 2: Location, Time, and Environmental Effects")
  
  # Compare models
  cat("\nModel Comparison:\n")
  cat("Model 1 R-squared:", round(summary(model1)$r.squared, 4), "\n")
  cat("Model 1 Adjusted R-squared:", round(summary(model1)$adj.r.squared, 4), "\n")
  cat("Model 2 R-squared:", round(summary(model2)$r.squared, 4), "\n")
  cat("Model 2 Adjusted R-squared:", round(summary(model2)$adj.r.squared, 4), "\n")
  
} else {
  cat("Insufficient complete cases for Model 2.\n")
  cat("Available complete cases:", nrow(model_data_complete), "\n")
}

## 
## Model Comparison:
## Model 1 R-squared: 0.252 
## Model 1 Adjusted R-squared: 0.2122 
## Model 2 R-squared: 0.3245 
## Model 2 Adjusted R-squared: 0.2651

Exercise 3

Exercise 3: Add Interaction Term(s)

In Model 2 above, add in one or more interaction term to the model to see if the Device makes any difference on the variable coefficients. Try with the most significant variables first. Example: add a Device*Lat variable. Copy model2 code above and paste into code chunk below, then and add in your interaction term. Rename your model the name model3 and then also compute R^2 and adjusted R^2 for model3

# Filter to complete cases (same as above to ensure fairness)
model_data_complete <- model_data %>%
  filter(!is.na(cloudcover) & !is.na(solar_radiation) & !is.na(windspeed))

if (nrow(model_data_complete) > 10) {
  
  # Model 3 with interaction term Device * Lat
  model3 <- lm(Temp_Diff ~ Lat * Device + alt_m + Hour_sin + Hour_cos + 
                 cloudcover + solar_radiation + windspeed, 
               data = model_data_complete)
  
  summary(model3)
  
  # Display tidy coefficient table
  model3_tidy <- tidy(model3, conf.int = TRUE)
  knitr::kable(model3_tidy, digits = 4, 
               caption = "Model 3: Added Interaction Term (Device * Lat)")
  
  # Compare R-squared values across models
  cat("\nModel Comparison:\n")
  cat("Model 1 R-squared:", round(summary(model1)$r.squared, 4), "\n")
  cat("Model 1 Adjusted R-squared:", round(summary(model1)$adj.r.squared, 4), "\n")
  cat("Model 2 R-squared:", round(summary(model2)$r.squared, 4), "\n")
  cat("Model 2 Adjusted R-squared:", round(summary(model2)$adj.r.squared, 4), "\n")
  cat("Model 3 R-squared:", round(summary(model3)$r.squared, 4), "\n")
  cat("Model 3 Adjusted R-squared:", round(summary(model3)$adj.r.squared, 4), "\n")
  
} else {
  cat("Insufficient complete cases for Model 3.\n")
  cat("Available complete cases:", nrow(model_data_complete), "\n")
}

## 
## Model Comparison:
## Model 1 R-squared: 0.252 
## Model 1 Adjusted R-squared: 0.2122 
## Model 2 R-squared: 0.3245 
## Model 2 Adjusted R-squared: 0.2651 
## Model 3 R-squared: 0.3523 
## Model 3 Adjusted R-squared: 0.2876

6.3 Initial Model Diagnostics

Before proceeding, let’s check for problematic observations that might be affecting our model.

# Check for multicollinearity
vif_values <- vif(model1)
cat("Variance Inflation Factors (VIF):\n")

## Variance Inflation Factors (VIF):

print(round(vif_values, 2))

##      Lat    alt_m Hour_sin Hour_cos   Device 
##     4.45     4.26     1.04     1.07     1.12

cat("\nNote: VIF > 10 indicates problematic multicollinearity\n\n")

## 
## Note: VIF > 10 indicates problematic multicollinearity

# Normality test
shapiro_test <- shapiro.test(residuals(model1))
shapiro_test2 <- shapiro.test(residuals(model2))
cat("Shapiro-Wilk Normality Test:\n")

## Shapiro-Wilk Normality Test:

cat("W statistic:", round(shapiro_test$statistic, 4), "\n")

## W statistic: 0.9032

cat("p-value:", format.pval(shapiro_test$p.value, digits = 3), "\n")

## p-value: 2.01e-06

if (shapiro_test$p.value < 0.05) {
  cat("WARNING: Residuals are NOT normally distributed (p < 0.05)\n\n")
} else {
  cat("Residuals appear normally distributed (p >= 0.05)\n\n")
}

## WARNING: Residuals are NOT normally distributed (p < 0.05)

# Heteroscedasticity test
bp_test <- bptest(model1)
bp_test2 <- bptest(model2)
cat("Breusch-Pagan Heteroscedasticity Test:\n")

## Breusch-Pagan Heteroscedasticity Test:

cat("BP statistic:", round(bp_test$statistic, 4), "\n")

## BP statistic: 25.9562

cat("p-value:", format.pval(bp_test$p.value, digits = 3), "\n")

## p-value: 9.1e-05

if (bp_test$p.value < 0.05) {
  cat("WARNING: Heteroscedasticity detected (p < 0.05)\n\n")
} else {
  cat("Homoscedasticity assumption met (p >= 0.05)\n\n")
}

## WARNING: Heteroscedasticity detected (p < 0.05)

Interpretation: If we see violations of normality or homoscedasticity, this often indicates the presence of outliers or influential points. Let’s investigate!

6.4 Identify Influential Points (Cook’s Distance)

Cook’s Distance measures how much a single observation influences the entire regression model. We will calculate this for model1

# Calculate Cook's distance
model_data$cooks_d <- cooks.distance(model1)
model_data$residuals <- residuals(model1)
model_data$predicted <- predict(model1)

# Find influential points (Cook's D > 4/n is a common threshold)
threshold <- 4 / nrow(model_data)
influential_points <- model_data %>%
  filter(cooks_d > threshold) %>%
  arrange(desc(cooks_d))

# Create influence plot
ggplot(model_data, aes(x = seq_along(cooks_d), y = cooks_d, color = Device)) +
  geom_point(alpha = 0.6, size = 3) +
  geom_hline(yintercept = threshold, linetype = "dashed", color = "red", linewidth = 1.2) +
  annotate("text", x = nrow(model_data) * 0.7, y = threshold, 
           label = paste0("Threshold (4/n) = ", round(threshold, 5)), 
           vjust = -0.5, color = "red") +
  labs(title = "Cook's Distance - Influence Plot",
       subtitle = "Points above red line are potentially influential",
       x = "Observation Index",
       y = "Cook's Distance") +
  theme_minimal() +
  scale_color_manual(values = c("Red" = "red", "Purple" = "purple"))

cat("\nINFLUENTIAL POINTS ANALYSIS\n")

## 
## INFLUENTIAL POINTS ANALYSIS

cat("===========================\n")

## ===========================

cat("Threshold (4/n):", round(threshold, 5), "\n")

## Threshold (4/n): 0.04

cat("Number of influential points:", nrow(influential_points), "\n\n")

## Number of influential points: 4

if (nrow(influential_points) > 0) {
  cat("Top 10 most influential observations:\n")
  influential_summary <- influential_points %>%
    select(TimeStampFull, Device, Temp_F, NOAA_Temp_F, Temp_Diff, 
           Temp_Abs_Diff, alt_m, cooks_d) %>%
    head(10)
  
  print(knitr::kable(influential_summary, digits = 3,
                     caption = "Most Influential Points"))
  
  # Visualize the influential points
  cat("\n")
  cat("Characteristics of influential points:\n")
  cat("- Mean temperature difference:", round(mean(influential_points$Temp_Diff), 2), "°F\n")
  cat("- Mean absolute error:", round(mean(influential_points$Temp_Abs_Diff), 2), "°F\n")
  cat("- Compare to overall mean abs error:", round(mean(model_data$Temp_Abs_Diff), 2), "°F\n")
}

## Top 10 most influential observations:
## 
## 
## Table: Most Influential Points
## 
## |TimeStampFull       |Device | Temp_F| NOAA_Temp_F| Temp_Diff| Temp_Abs_Diff| alt_m| cooks_d|
## |:-------------------|:------|------:|-----------:|---------:|-------------:|-----:|-------:|
## |2025-09-11 20:51:29 |Red    |   34.7|        63.2|     -28.5|          28.5|  2871|   0.161|
## |2025-09-11 20:26:56 |Red    |   34.7|        63.2|     -28.5|          28.5|  2871|   0.161|
## |2025-09-11 19:58:47 |Red    |   34.7|        63.2|     -28.5|          28.5|  2871|   0.151|
## |2025-09-11 16:48:53 |Red    |   34.7|        55.5|     -20.8|          20.8|  2871|   0.056|
## 
## Characteristics of influential points:
## - Mean temperature difference: -26.58 °F
## - Mean absolute error: 26.58 °F
## - Compare to overall mean abs error: 6.3 °F

6.5 Visualize Influential Points

# Scatter plot highlighting influential points
model_data$is_influential <- model_data$cooks_d > threshold

ggplot(model_data, aes(x = Temp_Mean, y = Temp_Diff)) +
  geom_point(aes(color = is_influential, size = is_influential, alpha = is_influential)) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "black") +
  scale_color_manual(values = c("FALSE" = "gray60", "TRUE" = "red"),
                     labels = c("Normal", "Influential"),
                     name = "Point Type") +
  scale_size_manual(values = c("FALSE" = 2, "TRUE" = 4),
                    labels = c("Normal", "Influential"),
                    name = "Point Type") +
  scale_alpha_manual(values = c("FALSE" = 0.5, "TRUE" = 1),
                     labels = c("Normal", "Influential"),
                     name = "Point Type") +
  labs(title = "Temperature Errors with Influential Points Highlighted",
       subtitle = paste("Red points exceed Cook's D threshold of", round(threshold, 5)),
       x = "Mean Temperature (°F)",
       y = "Temperature Difference (Sensor - NOAA) (°F)") +
  theme_minimal()

# Box plot comparing influential vs non-influential
ggplot(model_data, aes(x = is_influential, y = Temp_Abs_Diff, fill = is_influential)) +
  geom_boxplot() +
  scale_fill_manual(values = c("FALSE" = "lightblue", "TRUE" = "red"),
                    labels = c("Normal Points", "Influential Points"),
                    name = "") +
  labs(title = "Absolute Temperature Error: Normal vs Influential Points",
       x = "Point Type",
       y = "Absolute Temperature Error (°F)") +
  scale_x_discrete(labels = c("FALSE" = "Normal", "TRUE" = "Influential")) +
  theme_minimal()

6.6 Remove Influential Points and Refit Model

Now let’s remove the influential points and see how it affects our models.

cat("DATA CLEANING DECISION\n")

## DATA CLEANING DECISION

cat("======================\n")

## ======================

cat("Original dataset:", nrow(model_data), "observations\n")

## Original dataset: 100 observations

cat("Influential points to remove:", sum(model_data$is_influential), "observations\n")

## Influential points to remove: 4 observations

cat("Cleaned dataset:", sum(!model_data$is_influential), "observations\n")

## Cleaned dataset: 96 observations

cat("Percentage removed:", round(sum(model_data$is_influential) / nrow(model_data) * 100, 2), "%\n\n")

## Percentage removed: 4 %

# Create cleaned dataset
model_data_clean <- model_data %>%
  filter(!is_influential)

# Refit the model with cleaned data
model1_clean <- lm(Temp_Diff ~ Lat + alt_m + Hour_sin + Hour_cos + Device, 
                   data = model_data_clean)

model2_clean <- lm(Temp_Diff ~ Lat + alt_m + Hour_sin + Hour_cos + Device +
                 cloudcover + solar_radiation + windspeed, 
               data = model_data_clean)

cat("MODEL COMPARISON: BEFORE vs AFTER REMOVING INFLUENTIAL POINTS\n")

## MODEL COMPARISON: BEFORE vs AFTER REMOVING INFLUENTIAL POINTS

cat("=============================================================\n\n")

## =============================================================

cat("ORIGINAL MODEL 1 (with influential points):\n")

## ORIGINAL MODEL 1 (with influential points):

cat("R-squared:", round(summary(model1)$r.squared, 4), "\n")

## R-squared: 0.252

cat("Adjusted R-squared:", round(summary(model1)$adj.r.squared, 4), "\n")

## Adjusted R-squared: 0.2122

cat("RMSE:", round(sqrt(mean(residuals(model1)^2)), 4), "°F\n")

## RMSE: 7.7378 °F

cat("Residual Std Error:", round(summary(model1)$sigma, 4), "°F\n\n")

## Residual Std Error: 7.981 °F

cat("CLEANED MODEL 1 (influential points removed):\n")

## CLEANED MODEL 1 (influential points removed):

cat("R-squared:", round(summary(model1_clean)$r.squared, 4), "\n")

## R-squared: 0.4978

cat("Adjusted R-squared:", round(summary(model1_clean)$adj.r.squared, 4), "\n")

## Adjusted R-squared: 0.4699

cat("RMSE:", round(sqrt(mean(residuals(model1_clean)^2)), 4), "°F\n")

## RMSE: 5.137 °F

cat("Residual Std Error:", round(summary(model1_clean)$sigma, 4), "°F\n\n")

## Residual Std Error: 5.3054 °F

cat("ORIGINAL MODEL 2 (with influential points):\n")

## ORIGINAL MODEL 2 (with influential points):

cat("R-squared:", round(summary(model2)$r.squared, 4), "\n")

## R-squared: 0.3245

cat("Adjusted R-squared:", round(summary(model2)$adj.r.squared, 4), "\n")

## Adjusted R-squared: 0.2651

cat("RMSE:", round(sqrt(mean(residuals(model2)^2)), 4), "°F\n")

## RMSE: 7.3531 °F

cat("Residual Std Error:", round(summary(model2)$sigma, 4), "°F\n\n")

## Residual Std Error: 7.7082 °F

cat("CLEANED MODEL 2 (influential points removed):\n")

## CLEANED MODEL 2 (influential points removed):

cat("R-squared:", round(summary(model2_clean)$r.squared, 4), "\n")

## R-squared: 0.5231

cat("Adjusted R-squared:", round(summary(model2_clean)$adj.r.squared, 4), "\n")

## Adjusted R-squared: 0.4792

cat("RMSE:", round(sqrt(mean(residuals(model2_clean)^2)), 4), "°F\n")

## RMSE: 5.0059 °F

cat("Residual Std Error:", round(summary(model2_clean)$sigma, 4), "°F\n\n")

## Residual Std Error: 5.2585 °F

# Improvement metrics
r2_improvement <- (summary(model1_clean)$r.squared - summary(model1)$r.squared) * 100
rmse_improvement <- ((sqrt(mean(residuals(model1)^2)) - sqrt(mean(residuals(model1_clean)^2))) / 
                     sqrt(mean(residuals(model1)^2))) * 100

cat("IMPROVEMENTS MODEL 1:\n")

## IMPROVEMENTS MODEL 1:

cat("R-squared increase:", round(r2_improvement, 2), "percentage points\n")

## R-squared increase: 24.58 percentage points

cat("RMSE reduction:", round(rmse_improvement, 2), "%\n")

## RMSE reduction: 33.61 %

r2_improvement2 <- (summary(model2_clean)$r.squared - summary(model2)$r.squared) * 100
rmse_improvement2 <- ((sqrt(mean(residuals(model2)^2)) - sqrt(mean(residuals(model2_clean)^2))) / 
                     sqrt(mean(residuals(model2)^2))) * 100

cat("IMPROVEMENTS MODEL 2:\n")

## IMPROVEMENTS MODEL 2:

cat("R-squared increase:", round(r2_improvement2, 2), "percentage points\n")

## R-squared increase: 19.86 percentage points

cat("RMSE reduction:", round(rmse_improvement2, 2), "%\n")

## RMSE reduction: 31.92 %

# Show full model 1 summary
cat("\n")

summary(model1_clean)

## 
## Call:
## lm(formula = Temp_Diff ~ Lat + alt_m + Hour_sin + Hour_cos + 
##     Device, data = model_data_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.4044  -3.1643  -0.7376   4.0193  14.5993 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 177.520405 294.592679   0.603   0.5483    
## Lat          -5.415407   7.363753  -0.735   0.4640    
## alt_m         0.018286   0.008077   2.264   0.0260 *  
## Hour_sin     -4.722624   0.790716  -5.973 4.58e-08 ***
## Hour_cos      3.984848   0.786164   5.069 2.13e-06 ***
## DeviceRed     2.482078   1.162244   2.136   0.0354 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.305 on 90 degrees of freedom
## Multiple R-squared:  0.4978, Adjusted R-squared:  0.4699 
## F-statistic: 17.84 on 5 and 90 DF,  p-value: 2.972e-12

model1_clean_tidy <- tidy(model1_clean, conf.int = TRUE)
knitr::kable(model1_clean_tidy, digits = 4, 
             caption = "Cleaned Model 1: Location and Time Effects on Temperature Error (Influential Points Removed)")

Cleaned Model 1: Location and Time Effects on Temperature Error (Influential Points Removed)
term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	177.5204	294.5927	0.6026	0.5483	-407.7394	762.7802
Lat	-5.4154	7.3638	-0.7354	0.4640	-20.0448	9.2140
alt_m	0.0183	0.0081	2.2640	0.0260	0.0022	0.0343
Hour_sin	-4.7226	0.7907	-5.9726	0.0000	-6.2935	-3.1517
Hour_cos	3.9848	0.7862	5.0687	0.0000	2.4230	5.5467
DeviceRed	2.4821	1.1622	2.1356	0.0354	0.1731	4.7911

# Show full model 2 summary
cat("\n")

summary(model2_clean)

## 
## Call:
## lm(formula = Temp_Diff ~ Lat + alt_m + Hour_sin + Hour_cos + 
##     Device + cloudcover + solar_radiation + windspeed, data = model_data_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3415 -3.5583 -0.6919  3.6300 14.8179 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)  
## (Intercept)      85.993676 332.234158   0.259   0.7964  
## Lat              -3.515410   8.205586  -0.428   0.6694  
## alt_m             0.020995   0.008854   2.371   0.0199 *
## Hour_sin         -2.424847   1.956798  -1.239   0.2186  
## Hour_cos          2.623081   1.199107   2.188   0.0314 *
## DeviceRed         2.503957   1.163711   2.152   0.0342 *
## cloudcover        0.030269   0.020971   1.443   0.1525  
## solar_radiation   0.007797   0.005398   1.444   0.1522  
## windspeed        -0.116878   0.136331  -0.857   0.3936  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.258 on 87 degrees of freedom
## Multiple R-squared:  0.5231, Adjusted R-squared:  0.4792 
## F-statistic: 11.93 on 8 and 87 DF,  p-value: 2.456e-11

model2_clean_tidy <- tidy(model2_clean, conf.int = TRUE)
knitr::kable(model2_clean_tidy, digits = 4, 
             caption = "Cleaned Model 2: Location, Time, and Environmental Effects on Temperature Error (Influential Points Removed)")

Cleaned Model 2: Location, Time, and Environmental Effects on Temperature Error (Influential Points Removed)
term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	85.9937	332.2342	0.2588	0.7964	-574.3577	746.3450
Lat	-3.5154	8.2056	-0.4284	0.6694	-19.8249	12.7941
alt_m	0.0210	0.0089	2.3711	0.0199	0.0034	0.0386
Hour_sin	-2.4248	1.9568	-1.2392	0.2186	-6.3142	1.4645
Hour_cos	2.6231	1.1991	2.1875	0.0314	0.2397	5.0064
DeviceRed	2.5040	1.1637	2.1517	0.0342	0.1910	4.8170
cloudcover	0.0303	0.0210	1.4434	0.1525	-0.0114	0.0720
solar_radiation	0.0078	0.0054	1.4445	0.1522	-0.0029	0.0185
windspeed	-0.1169	0.1363	-0.8573	0.3936	-0.3879	0.1541

6.7 Verify Diagnostic Assumptions on Cleaned Model

cat("DIAGNOSTIC TESTS ON CLEANED MODEL 1\n")

## DIAGNOSTIC TESTS ON CLEANED MODEL 1

cat("==================================\n\n")

## ==================================

# Normality test
shapiro_test_clean <- shapiro.test(residuals(model1_clean))
cat("Shapiro-Wilk Normality Test:\n")

## Shapiro-Wilk Normality Test:

cat("W statistic:", round(shapiro_test_clean$statistic, 4), "\n")

## W statistic: 0.9826

cat("p-value:", format.pval(shapiro_test_clean$p.value, digits = 3), "\n")

## p-value: 0.232

if (shapiro_test_clean$p.value < 0.05) {
  cat("Result: Residuals still show departure from normality\n\n")
} else {
  cat("Result: ✓ Residuals are normally distributed\n\n")
}

## Result: ✓ Residuals are normally distributed

# Heteroscedasticity test
bp_test_clean <- bptest(model1_clean)
cat("Breusch-Pagan Heteroscedasticity Test:\n")

## Breusch-Pagan Heteroscedasticity Test:

cat("BP statistic:", round(bp_test_clean$statistic, 4), "\n")

## BP statistic: 17.4146

cat("p-value:", format.pval(bp_test_clean$p.value, digits = 3), "\n")

## p-value: 0.00378

if (bp_test_clean$p.value < 0.05) {
  cat("Result: Some heteroscedasticity still present\n\n")
} else {
  cat("Result: ✓ Homoscedasticity assumption met\n\n")
}

## Result: Some heteroscedasticity still present

# Compare test statistics
cat("COMPARISON OF DIAGNOSTIC TESTS:\n")

## COMPARISON OF DIAGNOSTIC TESTS:

cat("                           Original Model    Cleaned Model\n")

##                            Original Model    Cleaned Model

cat("Shapiro-Wilk p-value:     ", format.pval(shapiro_test$p.value, digits = 4), "      ", 
    format.pval(shapiro_test_clean$p.value, digits = 4), "\n")

## Shapiro-Wilk p-value:      2.009e-06        0.2319

cat("Breusch-Pagan p-value:    ", format.pval(bp_test$p.value, digits = 4), "      ", 
    format.pval(bp_test_clean$p.value, digits = 4), "\n")

## Breusch-Pagan p-value:     9.1e-05        0.003777

# Diagnostic plots for cleaned model
par(mfrow = c(2, 2))
plot(model1_clean, which = 1:4)

par(mfrow = c(1, 1))

cat("DIAGNOSTIC TESTS ON CLEANED MODEL 2\n")

## DIAGNOSTIC TESTS ON CLEANED MODEL 2

cat("==================================\n\n")

## ==================================

# Normality test
shapiro_test_clean2 <- shapiro.test(residuals(model2_clean))
cat("Shapiro-Wilk Normality Test:\n")

## Shapiro-Wilk Normality Test:

cat("W statistic:", round(shapiro_test_clean2$statistic, 4), "\n")

## W statistic: 0.9715

cat("p-value:", format.pval(shapiro_test_clean2$p.value, digits = 3), "\n")

## p-value: 0.0347

if (shapiro_test_clean2$p.value < 0.05) {
  cat("Result: Residuals still show departure from normality\n\n")
} else {
  cat("Result: ✓ Residuals are normally distributed\n\n")
}

## Result: Residuals still show departure from normality

# Heteroscedasticity test
bp_test_clean2 <- bptest(model2_clean)
cat("Breusch-Pagan Heteroscedasticity Test:\n")

## Breusch-Pagan Heteroscedasticity Test:

cat("BP statistic:", round(bp_test_clean2$statistic, 4), "\n")

## BP statistic: 21.2285

cat("p-value:", format.pval(bp_test_clean2$p.value, digits = 3), "\n")

## p-value: 0.00656

if (bp_test_clean2$p.value < 0.05) {
  cat("Result: Some heteroscedasticity still present\n\n")
} else {
  cat("Result: ✓ Homoscedasticity assumption met\n\n")
}

## Result: Some heteroscedasticity still present

# Compare test statistics
cat("COMPARISON OF DIAGNOSTIC TESTS:\n")

## COMPARISON OF DIAGNOSTIC TESTS:

cat("                           Original Model    Cleaned Model\n")

##                            Original Model    Cleaned Model

cat("Shapiro-Wilk p-value:     ", format.pval(shapiro_test2$p.value, digits = 4), "      ", 
    format.pval(shapiro_test_clean2$p.value, digits = 4), "\n")

## Shapiro-Wilk p-value:      4.513e-05        0.03468

cat("Breusch-Pagan p-value:    ", format.pval(bp_test2$p.value, digits = 4), "      ", 
    format.pval(bp_test_clean2$p.value, digits = 4), "\n")

## Breusch-Pagan p-value:     2.631e-05        0.006565

# Diagnostic plots for cleaned model
par(mfrow = c(2, 2))
plot(model2_clean, which = 1:4)

par(mfrow = c(1, 1))

Key Learning Points: - Removing influential points can dramatically improve model fit and assumption compliance - However, consider WHY these points are influential - they may contain important information - Document any data removal decisions and justify them scientifically - The cleaned model provides more reliable coefficient estimates - There still are issues with the normality assumption (failed Shapiro-Wilk test) - There still seem to be additional points that are still influential. Consider addressing issues with them as well.

Exercise 4

Exercise 4: Investigate the influential points. Are they measurement errors, unusual environmental conditions, or legitimate extreme events? Should they be removed or studied separately? What do you think we should do with these influential points?

When looking at the Cook’s Distance plot, most points being under the threshold shows that most data points are not influencing the model. However, out of the ones that are above the threshold, they are all red. Thus, red has greater variability at a higher observation index. This could mean that the sensor is less durable than purple, and measurements begin to have variability the longer it is collecting data. Additionally, later in our expedition we had more shade coverage, so the red device could be more sensitive to this change. These points should be removed from the data set since most of the data is under the threshold, but they should be studied separately - looking at battery life and environment.

6.8 Model Predictions and Visualization

We will use model2_clean (with the influential points removed) with the dataset model_data_clean for subsequent analysis. If you think you like a different model better, replace every instance of “model2_clean” with “model1”, “model1_clean”, or “model2_clean” and model_data_clean with model_data accordingly.

# Add predictions and residuals for model 2 clean
model_data_clean$predicted <- predict(model2_clean, newdata = model_data_clean)
model_data_clean$residuals <- residuals(model2_clean, newdata = model_data_clean)

# Predicted vs Actual for model 1
ggplot(model_data_clean, aes(x = predicted, y = Temp_Diff)) +
  geom_point(aes(color = Device), alpha = 0.6, size = 3) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "black") +
  labs(title = "Model 2 Clean: Predicted vs Actual Temperature Error",
       x = "Predicted Temperature Error (°F)",
       y = "Actual Temperature Error (°F)") +
  theme_minimal() +
  scale_color_manual(values = c("Red" = "red", "Purple" = "purple"))

# Residuals by device for model 1
ggplot(model_data_clean, aes(x = Device, y = residuals, fill = Device)) +
  geom_boxplot() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Model 2 Clean: Residuals by Device",
       x = "Device",
       y = "Residuals (°F)") +
  theme_minimal() +
  scale_fill_manual(values = c("Red" = "red", "Purple" = "purple"))

6.9 Model Summary Table

# Create comprehensive model summary
model_summary <- data.frame(
  Model = c("Model 1", if(exists("model2")) "Model 2" else NULL),
  Predictors = c("Location + Time + Device", 
                 if(exists("model2")) "Location + Time + Device + Environment" else NULL),
  N = c(nobs(model1), if(exists("model2")) nobs(model2) else NULL),
  R_squared = c(summary(model1)$r.squared, 
                if(exists("model2")) summary(model2)$r.squared else NULL),
  Adj_R_squared = c(summary(model1)$adj.r.squared, 
                    if(exists("model2")) summary(model2)$adj.r.squared else NULL),
  RMSE = c(sqrt(mean(residuals(model1)^2)), 
           if(exists("model2")) sqrt(mean(residuals(model2)^2)) else NULL),
  AIC = c(AIC(model1), if(exists("model2")) AIC(model2) else NULL),
  BIC = c(BIC(model1), if(exists("model2")) BIC(model2) else NULL)
)

knitr::kable(model_summary, digits = 4, 
             caption = "Model Performance Comparison for Full Data",
             col.names = c("Model", "Predictors", "N", "R²", "Adj R²", "RMSE", "AIC", "BIC"))

Model Performance Comparison for Full Data
Model	Predictors	N	R²	Adj R²	RMSE	AIC	BIC
Model 1	Location + Time + Device	100	0.2520	0.2122	7.7378	707.0121	725.2483
Model 2	Location + Time + Device + Environment	100	0.3245	0.2651	7.3531	702.8132	728.8649

# Create comprehensive model summary for cleaned models
model_summary2 <- data.frame(
  Model = c("Model 1 Clean", if(exists("model2_clean")) "Model 2 Clean" else NULL),
  Predictors = c("Location + Time + Device", 
                 if(exists("model2_clean")) "Location + Time + Device + Environment" else NULL),
  N = c(nobs(model1_clean), if(exists("model2_clean")) nobs(model2_clean) else NULL),
  R_squared = c(summary(model1_clean)$r.squared, 
                if(exists("model2_clean")) summary(model2_clean)$r.squared else NULL),
  Adj_R_squared = c(summary(model1_clean)$adj.r.squared, 
                    if(exists("model2_clean")) summary(model2_clean)$adj.r.squared else NULL),
  RMSE = c(sqrt(mean(residuals(model1)^2)), 
           if(exists("model2_clean")) sqrt(mean(residuals(model2_clean)^2)) else NULL),
  AIC = c(AIC(model1_clean), if(exists("model2_clean")) AIC(model2_clean) else NULL),
  BIC = c(BIC(model1_clean), if(exists("model2_clean")) BIC(model2_clean) else NULL)
)

knitr::kable(model_summary2, digits = 4, 
             caption = "Model Performance Comparison for Cleaned Data",
             col.names = c("Model", "Predictors", "N", "R²", "Adj R²", "RMSE", "AIC", "BIC"))

Model Performance Comparison for Cleaned Data
Model	Predictors	N	R²	Adj R²	RMSE	AIC	BIC
Model 1 Clean	Location + Time + Device	96	0.4978	0.4699	7.7378	600.6365	618.5869
Model 2 Clean	Location + Time + Device + Environment	96	0.5231	0.4792	5.0059	601.6759	627.3194

6.10 Coefficient Plot

# Create model 1 coefficient plot with confidence intervals
coef_plot_data <- model1_clean_tidy %>%
  filter(term != "(Intercept)") %>%
  mutate(term = factor(term, levels = rev(term)))

ggplot(coef_plot_data, aes(x = estimate, y = term)) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "gray50") +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0.2, linewidth = 1) +
  geom_point(size = 3, color = "blue") +
  labs(title = "Model 1 (Clean) Coefficients with 95% Confidence Intervals",
       subtitle = "Effects on Temperature Error (°F)",
       x = "Coefficient Estimate",
       y = "Predictor Variable") +
  theme_minimal() +
  theme(axis.text.y = element_text(hjust = 1))

# Create model 2 coefficient plot with confidence intervals
coef_plot_data2 <- model2_clean_tidy %>%
  filter(term != "(Intercept)") %>%
  mutate(term = factor(term, levels = rev(term)))

ggplot(coef_plot_data2, aes(x = estimate, y = term)) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "gray50") +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0.2, linewidth = 1) +
  geom_point(size = 3, color = "blue") +
  labs(title = "Model 2 (Clean) Coefficients with 95% Confidence Intervals",
       subtitle = "Effects on Temperature Error (°F)",
       x = "Coefficient Estimate",
       y = "Predictor Variable") +
  theme_minimal() +
  theme(axis.text.y = element_text(hjust = 1))

6.11 Interactive 3D Visualization

# 3D scatter plot: Lat, Lon, Altitude colored by temp error
plot_ly(model_data, 
        x = ~Lon, y = ~Lat, z = ~alt_m,
        color = ~Temp_Diff, 
        colors = colorRamp(c("blue", "white", "red")),
        marker = list(size = 5),
        text = ~paste0("Device: ", Device, 
                      "<br>Temp Error: ", round(Temp_Diff, 2), "°F",
                      "<br>Altitude: ", round(alt_m, 0), "m"),
        hoverinfo = "text") %>%
  add_markers() %>%
  layout(title = "Temperature Error in 3D Space",
         scene = list(
           xaxis = list(title = "Longitude"),
           yaxis = list(title = "Latitude"),
           zaxis = list(title = "Altitude (m)")
         ))

6.12 Key Findings and Interpretation

# Extract significant predictors (p < 0.05)
significant_predictors <- model2_clean_tidy %>%
  filter(p.value < 0.05, term != "(Intercept)")

cat("REGRESSION ANALYSIS SUMMARY FOR MODEL 2 CLEAN (with influential points removed)\n")

## REGRESSION ANALYSIS SUMMARY FOR MODEL 2 CLEAN (with influential points removed)

cat("======================================================\n\n")

## ======================================================

cat("Model Performance:\n")

## Model Performance:

cat("- R-squared:", round(summary(model2_clean)$r.squared, 4), 
    "(model explains", round(summary(model2_clean)$r.squared * 100, 1), "% of variance)\n")

## - R-squared: 0.5231 (model explains 52.3 % of variance)

cat("- Adjusted R-squared:", round(summary(model2_clean)$adj.r.squared, 4), "\n")

## - Adjusted R-squared: 0.4792

cat("- RMSE:", round(sqrt(mean(residuals(model2_clean)^2)), 3), "°F\n")

## - RMSE: 5.006 °F

cat("- F-statistic:", round(summary(model2_clean)$fstatistic[1], 2), 
    "on", summary(model2_clean)$fstatistic[2], "and", summary(model2_clean)$fstatistic[3], "DF\n")

## - F-statistic: 11.93 on 8 and 87 DF

cat("- Overall p-value:", format.pval(pf(summary(model2_clean)$fstatistic[1], 
    summary(model2_clean)$fstatistic[2], summary(model2_clean)$fstatistic[3], 
    lower.tail = FALSE), digits = 3), "\n\n")

## - Overall p-value: 2.46e-11

if (nrow(significant_predictors) > 0) {
  cat("Significant Predictors (p < 0.05):\n")
  for (i in 1:nrow(significant_predictors)) {
    cat("-", significant_predictors$term[i], ":", 
        round(significant_predictors$estimate[i], 4),
        "(p =", format.pval(significant_predictors$p.value[i], digits = 3), ")\n")
  }
} else {
  cat("No significant predictors at p < 0.05 level.\n")
}

## Significant Predictors (p < 0.05):
## - alt_m : 0.021 (p = 0.0199 )
## - Hour_cos : 2.6231 (p = 0.0314 )
## - DeviceRed : 2.504 (p = 0.0342 )

cat("\nInterpretation:\n")

## 
## Interpretation:

cat("- Device effect:", 
    if("DeviceRed" %in% significant_predictors$term) {
      "Red device shows significantly different error pattern"
    } else {
      "No significant difference between devices"
    }, "\n")

## - Device effect: Red device shows significantly different error pattern

cat("- Spatial effects:", 
    if(any(c("Lat", "Lon", "alt_m") %in% significant_predictors$term)) {
      "Location/elevation significantly affects temperature error"
    } else {
      "Location/elevation does not significantly affect error"
    }, "\n")

## - Spatial effects: Location/elevation significantly affects temperature error

cat("- Temporal effects:", 
    if(any(c("Hour_sin", "Hour_cos") %in% significant_predictors$term)) {
      "Time of day significantly affects temperature error"
    } else {
      "Time of day does not significantly affect error"
    }, "\n")

## - Temporal effects: Time of day significantly affects temperature error

cat("- Environmental effects:", 
    if(any(c("cloudcover", "solar_radiation", "windspeed") %in% significant_predictors$term)) {
      "Environment significantly affects temperature error"
    } else {
      "Environment does not significantly affect error"
    }, "\n")

## - Environmental effects: Environment does not significantly affect error

Exercise 5

Exercise 5: Based on the regression results, what are the most important factors influencing temperature measurement errors and how did you identify them?

With red’s influential points removed, there is a temperature bias of 2.5 degrees higher than NOAA which is significant as the p-value is 0.034. Temperature increases with elevation, with the altitude of 0.021 degrees per meter and a p-value of 0.020, which is significant. Thus, the sensor calibration could vary under higher altitude conditions. Additionally, the variation occurs with time of day in a consistent pattern with a coefficient of 2.62 degrees modulation and a significant p-value of 0.031.This could be due to the sensor getting too hot. Thus, overall, red has a consistent error related to the device, not outside effects. Model r squared is 0.52, which means half of this variation in temperature error can be explained by these factors.

7. Advanced Diagnostic Plots

7.1 Residuals vs Fitted by Device

# Enhanced residual plot
ggplot(model_data_clean, aes(x = predicted, y = residuals, color = Device)) +
  geom_point(alpha = 0.6, size = 3) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "black") +
  geom_smooth(method = "loess", se = TRUE, linewidth = 1) +
  labs(title = "Residual Analysis by Device",
       x = "Fitted Values (°F)",
       y = "Residuals (°F)") +
  theme_minimal() +
  scale_color_manual(values = c("Red" = "red", "Purple" = "purple"))

7.2 Q-Q Plot by Device

ggplot(model_data_clean, aes(sample = residuals, color = Device)) +
  stat_qq() +
  stat_qq_line() +
  facet_wrap(~Device) +
  labs(title = "Normal Q-Q Plots by Device",
       x = "Theoretical Quantiles",
       y = "Sample Quantiles") +
  theme_minimal() +
  scale_color_manual(values = c("Red" = "red", "Purple" = "purple"))

7.3 Scale-Location Plot

model_data_clean$sqrt_abs_residuals <- sqrt(abs(model_data_clean$residuals))

ggplot(model_data_clean, aes(x = predicted, y = sqrt_abs_residuals, color = Device)) +
  geom_point(alpha = 0.6, size = 3) +
  geom_smooth(method = "loess", se = TRUE) +
  labs(title = "Scale-Location Plot",
       subtitle = "Check for homoscedasticity",
       x = "Fitted Values (°F)",
       y = expression(sqrt("|Standardized Residuals|"))) +
  theme_minimal() +
  scale_color_manual(values = c("Red" = "red", "Purple" = "purple"))

7.4 Influence Plot

# Calculate Cook's distance
model_data_clean$cooks_d <- cooks.distance(model2_clean)

# Find influential points (Cook's D > 4/n)
threshold <- 4 / nrow(model_data_clean)
influential_points <- model_data_clean %>%
  filter(cooks_d > threshold) %>%
  arrange(desc(cooks_d))

ggplot(model_data_clean, aes(x = seq_along(cooks_d), y = cooks_d, color = Device)) +
  geom_point(alpha = 0.6, size = 3) +
  geom_hline(yintercept = threshold, linetype = "dashed", color = "red") +
  labs(title = "Cook's Distance - Influence Plot",
       subtitle = paste("Red line indicates threshold (4/n =", round(threshold, 4), ")"),
       x = "Observation Index",
       y = "Cook's Distance") +
  theme_minimal() +
  scale_color_manual(values = c("Red" = "red", "Purple" = "purple"))

if (nrow(influential_points) > 0) {
  cat("\nInfluential observations (Cook's D >", round(threshold, 4), "):\n")
  cat("Number of influential points:", nrow(influential_points), "\n")
  cat("Top 5 most influential:\n")
  print(head(influential_points %>% 
               select(TimeStampFull, Device, Temp_Diff, predicted, cooks_d), 5))
}

## 
## Influential observations (Cook's D > 0.0417 ):
## Number of influential points: 6 
## Top 5 most influential:
## # A tibble: 5 × 5
##   TimeStampFull       Device Temp_Diff predicted cooks_d
##   <dttm>              <chr>      <dbl>     <dbl>   <dbl>
## 1 2025-09-10 21:49:17 Red         1.22      9.75  0.0763
## 2 2025-09-11 00:30:00 Red        -6.38      2.96  0.0611
## 3 2025-09-10 18:43:20 Purple      9.3       1.21  0.0553
## 4 2025-09-04 22:02:11 Red        24.4       9.54  0.0480
## 5 2025-09-04 22:49:14 Red        24.4       9.54  0.0480

8. Spatial Regression Analysis

8.1 Error Hotspot Map

# Create error magnitude categories
model_data_clean <- model_data_clean %>%
  mutate(
    Error_Magnitude = cut(Temp_Abs_Diff, 
                          breaks = c(0, 1, 3, 5, Inf),
                          labels = c("Very Low (<1°F)", "Low (1-3°F)", 
                                   "Medium (3-5°F)", "High (>5°F)"))
  )

# Color palette
error_pal <- colorFactor(
  palette = c("#2ecc71", "#f1c40f", "#e67e22", "#e74c3c"),
  domain = model_data_clean$Error_Magnitude
)

hotspot_map <- leaflet(model_data_clean) %>%
  addProviderTiles("Esri.WorldImagery", group = "Satellite") %>%
  addProviderTiles(providers$Esri.WorldTopoMap, group = "Topo") %>%
  addWMSTiles(
    baseUrl = "https://basemap.nationalmap.gov/arcgis/services/USGSTopo/MapServer/WMSServer?",
    layers = "0",
    options = WMSTileOptions(format = "image/png", transparent = TRUE),
    attribution = "USGS The National Map",
    group = "USGS Topo"  
  ) %>%
  
  addCircleMarkers(
    lng = ~Lon,
    lat = ~Lat,
    radius = ~Temp_Abs_Diff * 2,  # Size proportional to error magnitude
    color = ~error_pal(Error_Magnitude),
    fillOpacity = 0.7,
    stroke = TRUE,
    weight = 1,
    label = ~paste0("Error: ", round(Temp_Diff, 2), "°F | Magnitude: ", Error_Magnitude),
    popup = ~paste0(
      "<b>Error Hotspot Analysis</b><br>",
      "<b>Device:</b> ", Device, "<br>",
      "<b>Temperature Error:</b> ", round(Temp_Diff, 2), "°F<br>",
      "<b>Absolute Error:</b> ", round(Temp_Abs_Diff, 2), "°F<br>",
      "<b>Magnitude:</b> ", Error_Magnitude, "<br>",
      "<b>Model Prediction:</b> ", round(predicted, 2), "°F<br>",
      "<b>Residual:</b> ", round(residuals, 2), "°F<br>",
      "<hr>",
      "<b>Location:</b><br>",
      "Lat: ", round(Lat, 5), "<br>",
      "Lon: ", round(Lon, 5), "<br>",
      "Altitude: ", round(alt_m, 0), "m<br>",
      "<b>Time:</b> ", format(TimeStampFull, "%Y-%m-%d %H:%M")
    )
  ) %>%
  
  addLayersControl(
    baseGroups = c("Satellite", "Topo", "USGS Topo"),
    options = layersControlOptions(collapsed = FALSE)
  ) %>%
  
  addLegend(
    position = "bottomright",
    pal = error_pal,
    values = ~Error_Magnitude,
    title = "Error Magnitude",
    opacity = 1
  ) %>%
  
  addScaleBar(position = "bottomleft")

hotspot_map

8.2 Model Residuals Map

# Color palette for residuals
residual_pal <- colorNumeric(
  palette = "RdBu",
  domain = c(-max(abs(model_data_clean$residuals)), max(abs(model_data_clean$residuals))),
  reverse = TRUE
)

residuals_map <- leaflet(model_data_clean) %>%
  addProviderTiles("Esri.WorldTopoMap") %>%
  
  addCircleMarkers(
    lng = ~Lon,
    lat = ~Lat,
    radius = 6,
    color = ~residual_pal(residuals),
    fillOpacity = 0.8,
    stroke = TRUE,
    weight = 1,
    label = ~paste0("Residual: ", round(residuals, 2), "°F"),
    popup = ~paste0(
      "<b>Model Residual Analysis</b><br>",
      "<b>Observed:</b> ", round(Temp_Diff, 2), "°F<br>",
      "<b>Predicted:</b> ", round(predicted, 2), "°F<br>",
      "<b>Residual:</b> ", round(residuals, 2), "°F<br>",
      "<b>Device:</b> ", Device, "<br>",
      "<b>Time:</b> ", format(TimeStampFull, "%Y-%m-%d %H:%M")
    )
  ) %>%
  
  addLegend(
    position = "bottomright",
    pal = residual_pal,
    values = ~residuals,
    title = "Residuals (°F)<br>Red = Over-predicted<br>Blue = Under-predicted",
    opacity = 1
  )

residuals_map

9. Summary and Conclusions

9.1 Key Statistical Findings

cat("STATISTICAL ANALYSIS SUMMARY\n")

## STATISTICAL ANALYSIS SUMMARY

cat("============================\n\n")

## ============================

cat("1. BLAND-ALTMAN ANALYSIS\n")

## 1. BLAND-ALTMAN ANALYSIS

cat("   Temperature:\n")

##    Temperature:

cat("   - Mean Bias:", round(temp_bias, 3), "°F\n")

##    - Mean Bias: -0.124 °F

cat("   - 95% Limits of Agreement:", round(temp_lower_loa, 3), "to", round(temp_upper_loa, 3), "°F\n")

##    - 95% Limits of Agreement: -17.748 to 17.5 °F

cat("   - Within limits:", 
    round(sum(comparison_data$Temp_Diff >= temp_lower_loa & 
              comparison_data$Temp_Diff <= temp_upper_loa, na.rm = TRUE) / 
          sum(!is.na(comparison_data$Temp_Diff)) * 100, 1), "%\n\n")

##    - Within limits: 92 %

cat("2. PAIRED T-TESTS\n")

## 2. PAIRED T-TESTS

cat("   Temperature: t =", round(temp_ttest$statistic, 3), 
    ", p =", format.pval(temp_ttest$p.value, digits = 3), "\n")

##    Temperature: t = -0.138 , p = 0.891

cat("   Humidity: t =", round(hum_ttest$statistic, 3), 
    ", p =", format.pval(hum_ttest$p.value, digits = 3), "\n")

##    Humidity: t = 4.615 , p = 1.18e-05

cat("   Pressure: t =", round(press_ttest$statistic, 3), 
    ", p =", format.pval(press_ttest$p.value, digits = 3), "\n\n")

##    Pressure: t = -8.519 , p = 1.83e-13

cat("3. REGRESSION MODEL 2 CLEAN (with influential points removed)\n")

## 3. REGRESSION MODEL 2 CLEAN (with influential points removed)

cat("   R-squared:", round(summary(model2_clean)$r.squared, 4), "\n")

##    R-squared: 0.5231

cat("   Adjusted R-squared:", round(summary(model2_clean)$adj.r.squared, 4), "\n")

##    Adjusted R-squared: 0.4792

cat("   RMSE:", round(sqrt(mean(residuals(model2_clean)^2)), 3), "°F\n")

##    RMSE: 5.006 °F

cat("   Number of significant predictors:", nrow(significant_predictors), "\n")

##    Number of significant predictors: 3

9.2 Recommendations

Based on the statistical analysis:

Sensor Accuracy:
- Evaluate the magnitude of bias from Bland-Altman plots
- Consider sensor calibration if systematic bias is detected
- Assess if errors are within acceptable limits for your application

The temperature bias is only -0.124 degrees, which is very small, and so the devices don’t differ from the NOAA data on average. The error is wide with 95% of the limits ranging from -17.7 to 17.5 degrees, so even though the data is unbiased, the individual readings differ substantially. The paired t-tests show that humidity and pressure both are biased and have significant p-values. The humidity measurements are consistently higher than the NOAA measurements and the pressure measurements are consistently lower than NPAA. This means that these devices need calibration corrections. The data is likely ok for general use in outdoor settings, but for research purposes, correction equations should be used.

Influential Factors:
- Review regression coefficients to identify key error drivers
- Consider environmental corrections for cloud cover
- Account for time-of-day effects in data interpretation

For the influential factors, the red device tends to give warmer temperature readings than the purple device and has a bias of 2.5 degrees. This accuracy decreases in higher altitudes and as the day goes on, possibly meaning that sun exposure affects the readings.

Model Improvement:
- Deal with influential data points. Where there days/time when the sensors were obviously not working correctly or were there errors in the data set?
- Consider transformations on the dependent variable to correct for non-normality of residuals
- Consider non-linear relationships or interactions
- Consider a mixed-effects model, where data points are grouped into time/space clusters
- Validate model on independent test data

The sensors had larger temperature errors over the threshold later in the data collection period, possibly meaning that as it was being used more the measurements became less accurate. Additionally, battery life was getting lower which could also affect the accuracy. Lastly, only red was collecting data in the last few days and purple would no longer connect to the phones, indicating further that there was problems with the sensors as they remained in use for an extended period of time.

9.3 Further Analysis Suggestions

Exercise 6

Exercise 6: Design and implement one of the following extensions and put your code in the chunk below.

Easier Extensions:

Perform separate regression models for each device
Remove some variables from model2_clean that are not significant. Try to get the Baysian Information Criterion (BIC) or Akaike’s Information Criterion (AIC) as low as possible.

Harder Extensions:

Deal with the issue of influential points. There are still influential points in the model made from the cleaned data. Are there errors in the data set?
Deal with the issue of non-normality (failed Shapiro-Wilk test). Ideas: inspect distribution, maybe do a transform on the dependent variable (logarithm?), or use robust regression.
Deal with the issue of heteroscedasticity (failed Breusch-Pagan test). Ideas: weighted least squares, or robust standard errors.
Try a mixed effect model (try function lmer from package lme4 instead of lm). Create data clusters based on time/space bins and a term “+ (1 | cluster_variable_name)” to your model. This might also help deal with the issues of non-normality and/or heteroscedasticity

#put your code for exercise 5 extension below
model2_clean <- lm(Temp_Diff ~ Lat + alt_m + Hour_sin + Hour_cos + Device +
                     cloudcover + solar_radiation + windspeed, 
                   data = model_data_clean)

summary(model2_clean)

## 
## Call:
## lm(formula = Temp_Diff ~ Lat + alt_m + Hour_sin + Hour_cos + 
##     Device + cloudcover + solar_radiation + windspeed, data = model_data_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3415 -3.5583 -0.6919  3.6300 14.8179 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)  
## (Intercept)      85.993676 332.234158   0.259   0.7964  
## Lat              -3.515410   8.205586  -0.428   0.6694  
## alt_m             0.020995   0.008854   2.371   0.0199 *
## Hour_sin         -2.424847   1.956798  -1.239   0.2186  
## Hour_cos          2.623081   1.199107   2.188   0.0314 *
## DeviceRed         2.503957   1.163711   2.152   0.0342 *
## cloudcover        0.030269   0.020971   1.443   0.1525  
## solar_radiation   0.007797   0.005398   1.444   0.1522  
## windspeed        -0.116878   0.136331  -0.857   0.3936  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.258 on 87 degrees of freedom
## Multiple R-squared:  0.5231, Adjusted R-squared:  0.4792 
## F-statistic: 11.93 on 8 and 87 DF,  p-value: 2.456e-11

AIC(model2_clean)

## [1] 601.6759

BIC(model2_clean)

## [1] 627.3194

model_reduced <- lm(Temp_Diff ~ alt_m + Hour_cos + Device, 
                    data = model_data_clean)

summary(model_reduced)

## 
## Call:
## lm(formula = Temp_Diff ~ alt_m + Hour_cos + Device, data = model_data_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.4542  -3.9073  -0.9581   3.3844  16.9495 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -48.411019  13.402736  -3.612 0.000495 ***
## alt_m         0.016422   0.004506   3.644 0.000444 ***
## Hour_cos      4.379344   0.902665   4.852 4.97e-06 ***
## DeviceRed     1.712755   1.310806   1.307 0.194591    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.217 on 92 degrees of freedom
## Multiple R-squared:  0.2951, Adjusted R-squared:  0.2721 
## F-statistic: 12.84 on 3 and 92 DF,  p-value: 4.436e-07

AIC(model_reduced)

## [1] 629.1873

BIC(model_reduced)

## [1] 642.009

cat("Model 2 Clean AIC:", AIC(model2_clean), "\n")

## Model 2 Clean AIC: 601.6759

cat("Reduced Model AIC:", AIC(model_reduced), "\n\n")

## Reduced Model AIC: 629.1873

cat("Model 2 Clean BIC:", BIC(model2_clean), "\n")

## Model 2 Clean BIC: 627.3194

cat("Reduced Model BIC:", BIC(model_reduced), "\n")

## Reduced Model BIC: 642.009

I simplified the model by removing what were not statistically significant in Model 2 Clean and refit a reduced model. The reduced model has alt, time of day, and device but it showed a higher AIC, BIC, and a lower adjusted r2 with increased error. This, the non-significant predictors should remain in the model and so model 2 clean should be used.

Publishing to RPubs:

Knit this document to HTML
Click “Publish” in the preview window
Select RPubs and follow prompts
Submit your statistical analysis by pasting the url in Canvas

Data Export:

# Export enhanced dataset with model predictions
 write.csv(model_data, "weather_modeling_results.csv", row.names = FALSE)

cat("\nAnalysis complete! Enhanced dataset includes:\n")

## 
## Analysis complete! Enhanced dataset includes:

cat("- NOAA reference data\n")

## - NOAA reference data

cat("- Extended environmental variables\n")

## - Extended environmental variables

cat("- Model predictions and residuals\n")

## - Model predictions and residuals

cat("- Statistical test results\n")

## - Statistical test results

Weather Data Statistical Modeling

Bridget Willson

2025-11-10