Introduction

This tutorial demonstrates how to import, clean, analyze, and visualize meteorological tracking data in R. We’ll compare sensor measurements against NOAA weather data to evaluate sensor accuracy and determine which device performs better.

1. Load Required Packages

First, we’ll install and load all necessary packages:

# Install packages if needed (uncomment the lines below)
# install.packages(c("readxl", "dplyr", "ggplot2", "leaflet", "lubridate", "summarytools", "httr", "jsonlite", "plotly"))

library(readxl)      # For reading Excel files
library(dplyr)       # For data manipulation
library(ggplot2)     # For plotting
library(leaflet)     # For interactive maps
library(lubridate)   # For date-time manipulation
library(summarytools)# For summary statistics
library(httr)        # For API calls
library(jsonlite)    # For JSON parsing
library(plotly)      # For interactive plots

2. Import and Prepare Data

2.1 Import the Dataset

# Import the Excel file
weather_data <- read_excel("finalMTdata.xlsx")

# View the first few rows
head(weather_data)

## # A tibble: 6 × 10
##   TimeStampFull   Date                  Lat   Lon alt_m Temp_F   Hum Press    DP
##   <chr>           <dttm>              <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>
## 1 2025-09-01T09:… 2025-09-01 00:00:00  43.0 -110.  2826   22.4    20   732  -1.5
## 2 2025-09-01T09:… 2025-09-01 00:00:00  43.0 -110.  2825   22.4    20   732  -1.5
## 3 2025-09-01T09:… 2025-09-01 00:00:00  43.0 -110.  2826   22.4    20   732  -1.5
## 4 2025-09-01T09:… 2025-09-01 00:00:00  43.0 -110.  2826   22.4    20   732  -1.5
## 5 2025-09-01T09:… 2025-09-01 00:00:00  43.0 -110.  2825   22.4    20   732  -1.3
## 6 2025-09-01T09:… 2025-09-01 00:00:00  43.0 -110.  2824   22.4    20   732  -1.1
## # ℹ 1 more variable: Device <chr>

2.2 Assign Correct Data Types

Pay special attention to the timestamp, date, and time variables:

# Convert data types
weather_data <- weather_data %>%
  mutate(
    # Parse ISO 8601 timestamp with timezone
    TimeStampFull = ymd_hms(TimeStampFull, tz = "UTC"),
    
    # Parse date
    Date = ymd(Date),
    
    
    # Ensure numeric variables are numeric
    Lat = as.numeric(Lat),
    Lon = as.numeric(Lon),
    alt_m = as.numeric(alt_m),
    Hum = as.numeric(Hum),
    Press = as.numeric(Press),
    DP = as.numeric(DP),
    
    # Ensure Device is a factor
    Device = as.factor(Device)
  )

# Check if temperature is actually in Celsius (common issue with sensor data)
# If the column is named Temp_F but values are small (typical for C), convert to F
if (mean(weather_data$Temp_F, na.rm = TRUE) < 30) {
  cat("NOTE: Temperature values appear to be in Celsius. Converting to Fahrenheit...\n")
  weather_data <- weather_data %>%
    mutate(
      Temp_C = Temp_F,  # Save original Celsius values
      Temp_F = (Temp_F * 9/5) + 32  # Convert to Fahrenheit
    )
  cat("Conversion complete. Original Celsius values saved in 'Temp_C' column.\n\n")
} else {
  # Ensure Temp_F is numeric
  weather_data$Temp_F <- as.numeric(weather_data$Temp_F)
}

## NOTE: Temperature values appear to be in Celsius. Converting to Fahrenheit...
## Conversion complete. Original Celsius values saved in 'Temp_C' column.

# Check the structure
str(weather_data)

## tibble [14,441 × 11] (S3: tbl_df/tbl/data.frame)
##  $ TimeStampFull: POSIXct[1:14441], format: "2025-09-01 15:48:18" "2025-09-01 15:48:21" ...
##  $ Date         : Date[1:14441], format: "2025-09-01" "2025-09-01" ...
##  $ Lat          : num [1:14441] 43 43 43 43 43 ...
##  $ Lon          : num [1:14441] -110 -110 -110 -110 -110 ...
##  $ alt_m        : num [1:14441] 2826 2825 2826 2826 2825 ...
##  $ Temp_F       : num [1:14441] 72.3 72.3 72.3 72.3 72.3 ...
##  $ Hum          : num [1:14441] 20 20 20 20 20 20 20 20 20 22 ...
##  $ Press        : num [1:14441] 732 732 732 732 732 732 732 732 733 732 ...
##  $ DP           : num [1:14441] -1.5 -1.5 -1.5 -1.5 -1.3 -1.1 -1.9 -2.3 -2.6 -2.1 ...
##  $ Device       : Factor w/ 3 levels "Device","Purple",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Temp_C       : num [1:14441] 22.4 22.4 22.4 22.4 22.4 22.4 22 21.5 20.6 19.9 ...

# Display temperature range to verify
cat("\nTemperature range after conversion:\n")

## 
## Temperature range after conversion:

cat("Min:", round(min(weather_data$Temp_F, na.rm = TRUE), 1), "°F\n")

## Min: 29.1 °F

cat("Max:", round(max(weather_data$Temp_F, na.rm = TRUE), 1), "°F\n")

## Max: 92.5 °F

cat("Mean:", round(mean(weather_data$Temp_F, na.rm = TRUE), 1), "°F\n")

## Mean: 52.1 °F

3. Mapping The Data

# Load necessary library
library(leaflet)

# Clean up and standardize the Device column
weather_data$Device <- trimws(weather_data$Device) # remove extra spaces
weather_data$Device <- factor(weather_data$Device, levels = c("Red", "Purple"))

# Create color palette for devices (Red and Purple)
device_colors <- colorFactor(
  palette = c("red", "purple"),
  domain = levels(weather_data$Device)
)

# Create the leaflet map
leaflet_map <- leaflet(weather_data) %>%
  # Add multiple base layers
  addProviderTiles(providers$OpenTopoMap, group = "OpenTopoMap") %>%
  addProviderTiles(providers$Esri.WorldTopoMap, group = "ESRI Topo") %>%
  addProviderTiles("Esri.WorldImagery", group = "Satellite") %>%
  addWMSTiles(
    baseUrl = "https://basemap.nationalmap.gov/arcgis/services/USGSTopo/MapServer/WMSServer?",
    layers = "0",
    options = WMSTileOptions(format = "image/png", transparent = TRUE),
    attribution = "USGS The National Map",
    group = "USGS Topo"
  ) %>%
  addProviderTiles(providers$OpenStreetMap, group = "OpenStreetMap") %>%
  
  # Add markers with popups and labels
  addCircleMarkers(
    lng = ~Lon,
    lat = ~Lat,
    radius = 5,
    color = ~device_colors(Device),
    fillOpacity = 0.7,
    stroke = TRUE,
    weight = 1,
    label = ~paste0(
      "Device: ", Device, 
      " | Temp: ", round(Temp_F, 1), "°F",
      " | Hum: ", Hum, "%",
      " | Press: ", Press, " hPa",
      " | DP: ", round(DP, 1), "°F"
    ),
    popup = ~paste0(
      "<strong>Device:</strong> ", Device, "<br/>",
      "<strong>Temperature:</strong> ", round(Temp_F, 1), "°F<br/>",
      "<strong>Humidity:</strong> ", Hum, "%<br/>",
      "<strong>Pressure:</strong> ", Press, " hPa<br/>",
      "<strong>Dew Point:</strong> ", round(DP, 1), "°F<br/>",
      "<strong>Altitude:</strong> ", round(alt_m, 1), " m<br/>",
      "<strong>Time:</strong> ", format(TimeStampFull, "%Y-%m-%d %H:%M:%S")
    )
  ) %>%
  
  # Add layer control
  addLayersControl(
    baseGroups = c("USGS Topo","OpenTopoMap", "ESRI Topo", "Satellite",  "OpenStreetMap"),
    options = layersControlOptions(collapsed = FALSE)
  ) %>%
  
  # Add legend
  addLegend(
    position = "bottomright",
    pal = device_colors,
    values = ~Device,
    title = "Device",
    opacity = 1
  ) %>%
  
  # Add scale bar
  addScaleBar(position = "bottomleft")

# Display the map
leaflet_map

Exercise 1

Exercise 1: In the map above, try changing the base layer that loads by default. Currently it loads OpenTopoMap first - can you make it load a different layer first instead? (Hint: The order in which you add the provider tiles matters!)

4. Basic Data Exploration

4.1 Dataset Overview

# Display dataset dimensions
cat("Dataset dimensions:", nrow(weather_data), "rows x", ncol(weather_data), "columns\n\n")

## Dataset dimensions: 14441 rows x 11 columns

# Check date range
cat("Dataset date range:\n")

## Dataset date range:

cat("Earliest:", as.character(min(weather_data$Date, na.rm = TRUE)), "\n")

## Earliest: 2025-09-01

cat("Latest:", as.character(max(weather_data$Date, na.rm = TRUE)), "\n\n")

## Latest: 2025-09-11

# Summary of all variables
summary(weather_data)

##  TimeStampFull                      Date                 Lat       
##  Min.   :2025-09-01 15:48:18   Min.   :2025-09-01   Min.   :42.52  
##  1st Qu.:2025-09-03 05:09:43   1st Qu.:2025-09-02   1st Qu.:42.64  
##  Median :2025-09-04 22:42:13   Median :2025-09-04   Median :42.86  
##  Mean   :2025-09-06 06:06:15   Mean   :2025-09-05   Mean   :42.82  
##  3rd Qu.:2025-09-09 20:54:02   3rd Qu.:2025-09-09   3rd Qu.:42.98  
##  Max.   :2025-09-12 00:47:54   Max.   :2025-09-11   Max.   :43.02  
##  NA's   :1                     NA's   :1            NA's   :1      
##       Lon             alt_m          Temp_F           Hum       
##  Min.   :-109.8   Min.   :2669   Min.   :29.12   Min.   :13.00  
##  1st Qu.:-109.6   1st Qu.:2839   1st Qu.:41.00   1st Qu.:32.00  
##  Median :-109.4   Median :2991   Median :50.90   Median :53.00  
##  Mean   :-109.4   Mean   :2976   Mean   :52.06   Mean   :52.78  
##  3rd Qu.:-109.2   3rd Qu.:3119   3rd Qu.:63.55   3rd Qu.:71.00  
##  Max.   :-109.0   Max.   :3320   Max.   :92.48   Max.   :96.00  
##  NA's   :1        NA's   :1      NA's   :1       NA's   :1      
##      Press             DP             Device         Temp_C     
##  Min.   :691.0   Min.   :-5.6000   Red   :8576   Min.   :-1.60  
##  1st Qu.:708.0   1st Qu.:-1.3000   Purple:5864   1st Qu.: 5.00  
##  Median :716.0   Median : 0.2000   NA's  :   1   Median :10.50  
##  Mean   :717.4   Mean   : 0.5241                 Mean   :11.14  
##  3rd Qu.:727.0   3rd Qu.: 2.3000                 3rd Qu.:17.52  
##  Max.   :740.0   Max.   :11.8000                 Max.   :33.60  
##  NA's   :1       NA's   :1                       NA's   :1

4.2 Summary Statistics by Device

# Summary statistics for numeric variables by device
weather_data %>%
  group_by(Device) %>%
  summarise(
    n = n(),
    Mean_Temp_F = mean(Temp_F, na.rm = TRUE),
    SD_Temp_F = sd(Temp_F, na.rm = TRUE),
    Mean_Humidity = mean(Hum, na.rm = TRUE),
    SD_Humidity = sd(Hum, na.rm = TRUE),
    Mean_Pressure = mean(Press, na.rm = TRUE),
    SD_Pressure = sd(Press, na.rm = TRUE),
    Mean_Altitude_m = mean(alt_m, na.rm = TRUE),
    Min_Altitude_m = min(alt_m, na.rm = TRUE),
    Max_Altitude_m = max(alt_m, na.rm = TRUE)
  ) %>%
  knitr::kable(digits = 2, caption = "Summary Statistics by Device")

Summary Statistics by Device
Device	n	Mean_Temp_F	SD_Temp_F	Mean_Humidity	SD_Humidity	Mean_Pressure	SD_Pressure	Mean_Altitude_m	Min_Altitude_m	Max_Altitude_m
Red	8576	51.23	12.75	53.07	20.94	717.93	10.09	2974.97	2766	3320
Purple	5864	53.27	13.74	52.36	22.83	716.74	10.90	2976.57	2669	3318
NA	1	NaN	NA	NaN	NA	NaN	NA	NaN	Inf	-Inf

5. Integrate NOAA Weather Data

5.1 Function to Fetch NOAA Data

We’ll use the Open-Meteo API to retrieve historical weather data. There is a limit on how much data you can take, if you exceed this you will be cut-off for a time period. If we want to eventually download all the data corresponding for all of our points we will need to be a bit strategic about doing it slower or in chunks. Here are the terms for the free tier of Open_Meteo: Less than 10’000 API calls per day, 5’000 per hour and 600 per minute. For more information visit: https://open-meteo.com/

# Function to get weather data from Open-Meteo API
get_weather_data <- function(lat, lon, datetime, debug = FALSE) {
  # Convert datetime to date
  date <- as.Date(datetime)
  
  # Build API URL
  base_url <- "https://archive-api.open-meteo.com/v1/archive"
  
  url <- paste0(
    base_url,
    "?latitude=", lat,
    "&longitude=", lon,
    "&start_date=", date,
    "&end_date=", date,
    "&hourly=temperature_2m,relative_humidity_2m,surface_pressure",
    "&temperature_unit=fahrenheit"
  )
  
  # Default return values
  default_result <- list(
    NOAA_Temp_F = NA_real_,
    NOAA_Humidity = NA_real_,
    NOAA_Pressure_hPa = NA_real_
  )
  
  # Make API request
  result <- tryCatch({
    response <- GET(url)
    
    if (debug) {
      cat("URL:", url, "\n")
      cat("Status:", status_code(response), "\n")
    }
    
    if (status_code(response) == 200) {
      data <- content(response, "parsed")
      
      # Check if data exists
      if (is.null(data$hourly)) {
        if (debug) cat("No hourly data in response\n")
        return(default_result)
      }
      
      if (is.null(data$hourly$time) || length(data$hourly$time) == 0) {
        if (debug) cat("No time data\n")
        return(default_result)
      }
      
      if (is.null(data$hourly$temperature_2m) || length(data$hourly$temperature_2m) == 0) {
        if (debug) cat("No temperature data\n")
        return(default_result)
      }
      
hourly_data <- data$hourly

# Unwrap lists if needed
hourly_data$time <- unlist(hourly_data$time)
hourly_data$temperature_2m <- unlist(hourly_data$temperature_2m)
hourly_data$relative_humidity_2m <- unlist(hourly_data$relative_humidity_2m)
hourly_data$surface_pressure <- unlist(hourly_data$surface_pressure)

if (debug) {
  cat("Number of hourly records:", length(hourly_data$time), "\n")
  cat("First time value:", hourly_data$time[1], "\n")
  cat("Target datetime:", as.character(datetime), "\n")
  cat("Target hour:", hour(datetime), "\n")
}
      
      # Get the hour from our datetime (0-23)
      target_hour <- hour(datetime)
      
      # The API returns 24 hours (0-23) for the requested date
      # Index is target_hour + 1 (since R is 1-indexed)
      idx <- target_hour + 1
      
      if (debug) {
        cat("Using index:", idx, "\n")
        cat("Temperature at index:", hourly_data$temperature_2m[idx], "\n")
      }
      
      # Safely extract values with bounds checking
      temp_val <- if (idx > 0 && idx <= length(hourly_data$temperature_2m)) {
        val <- hourly_data$temperature_2m[idx]
        if (!is.null(val)) as.numeric(val) else NA_real_
      } else NA_real_
      
      hum_val <- if (idx > 0 && idx <= length(hourly_data$relative_humidity_2m)) {
        val <- hourly_data$relative_humidity_2m[idx]
        if (!is.null(val)) as.numeric(val) else NA_real_
      } else NA_real_
      
      press_val <- if (idx > 0 && idx <= length(hourly_data$surface_pressure)) {
        val <- hourly_data$surface_pressure[idx]
        if (!is.null(val)) as.numeric(val) else NA_real_
      } else NA_real_
      
      if (debug) {
        cat("Extracted - Temp:", temp_val, "Hum:", hum_val, "Press:", press_val, "\n\n")
      }
      
      return(list(
        NOAA_Temp_F = temp_val,
        NOAA_Humidity = hum_val,
        NOAA_Pressure_hPa = press_val
      ))
    } else {
      if (debug) cat("API returned error status\n")
      return(default_result)
    }
  }, error = function(e) {
    if (debug) cat("Error:", e$message, "\n")
    return(default_result)
  })
  
  return(result)
}

5.2 Fetch NOAA Data for Sample Points

# Sample 1000 points to demonstrate (to avoid excessive API calls and wait time)
set.seed(123)
sample_indices <- sample(1:nrow(weather_data), min(100, nrow(weather_data)))

# Initialize columns
weather_data$NOAA_Temp_F <- NA
weather_data$NOAA_Humidity <- NA
weather_data$NOAA_Pressure_hPa <- NA

# Test with sample point to see if API is working
cat("Testing API with first sample point...\n")

## Testing API with first sample point...

test_result <- get_weather_data(
  weather_data$Lat[sample_indices[1]],
  weather_data$Lon[sample_indices[1]],
  weather_data$TimeStampFull[sample_indices[1]],
  debug = TRUE
)

## URL: https://archive-api.open-meteo.com/v1/archive?latitude=42.983403&longitude=-109.58012&start_date=2025-09-03&end_date=2025-09-03&hourly=temperature_2m,relative_humidity_2m,surface_pressure&temperature_unit=fahrenheit 
## Status: 200 
## Number of hourly records: 24 
## First time value: 2025-09-03T00:00 
## Target datetime: 2025-09-03 03:30:02 
## Target hour: 3 
## Using index: 4 
## Temperature at index: 46.1 
## Extracted - Temp: 46.1 Hum: 61 Press: 710.4

if (!is.na(test_result$NOAA_Temp_F)) {
  cat("SUCCESS! API is returning data.\n\n")
} else {
  cat("WARNING: API test failed. Check the debug output above.\n\n")
}

## SUCCESS! API is returning data.

# Fetch NOAA data for sample points
cat("Fetching NOAA weather data for", length(sample_indices), "sample points...\n")

## Fetching NOAA weather data for 100 sample points...

cat("This will take approximately", length(sample_indices) * 0.5, "seconds.\n\n")

## This will take approximately 50 seconds.

# Counter for successful retrievals
success_count <- 0
fail_count <- 0
for (i in sample_indices) {
  res <- get_weather_data(
    weather_data$Lat[i],
    weather_data$Lon[i],
    weather_data$TimeStampFull[i]
  )
  
  weather_data$NOAA_Temp_F[i] <- res$NOAA_Temp_F
  weather_data$NOAA_Humidity[i] <- res$NOAA_Humidity
  weather_data$NOAA_Pressure_hPa[i] <- res$NOAA_Pressure_hPa
  
    # Track success/failure
  if (!is.na(weather_data$NOAA_Temp_F[i])) {
    success_count <- success_count + 1
  } else {
    fail_count <- fail_count + 1
  }
  
   # add countdown feedback to know how long data retrieval is taking
  if (i %% 10 == 0) {
    cat("Processing point", which(sample_indices == i), "of", length(sample_indices),
        "(", round(which(sample_indices == i) / length(sample_indices) * 100, 1), "% complete )\n")
  }
  
  Sys.sleep(0.5)
}

## Processing point 26 of 100 ( 26 % complete )
## Processing point 32 of 100 ( 32 % complete )
## Processing point 39 of 100 ( 39 % complete )
## Processing point 58 of 100 ( 58 % complete )
## Processing point 61 of 100 ( 61 % complete )
## Processing point 99 of 100 ( 99 % complete )

cat("\nNOAA data fetch complete!\n")

## 
## NOAA data fetch complete!

cat("Successfully retrieved:", success_count, "of", length(sample_indices), "sample points\n")

## Successfully retrieved: 100 of 100 sample points

cat("Failed to retrieve:", fail_count, "points\n\n")

## Failed to retrieve: 0 points

# Create comparison dataset with only points that have NOAA data
comparison_data <- weather_data %>%
  filter(!is.na(NOAA_Temp_F)) %>%
  mutate(
    Temp_Diff = Temp_F - NOAA_Temp_F,
    Temp_Pct_Diff = ((Temp_F - NOAA_Temp_F) / NOAA_Temp_F) * 100,
    Temp_Abs_Diff = abs(Temp_Diff),
    Hum_Diff = Hum - NOAA_Humidity,
    Hum_Abs_Diff = abs(Hum_Diff),
    Press_Diff = Press - NOAA_Pressure_hPa,
    Press_Abs_Diff = abs(Press_Diff)
  )

cat("Comparison dataset created with", nrow(comparison_data), "points\n")

## Comparison dataset created with 100 points

Note: We’re sampling 100 points to demonstrate the API integration without excessive wait times. To fetch NOAA data for all points, change the sample line accordingly.

6. Sensor Accuracy Analysis

Now let’s analyze how well each sensor performs compared to NOAA data.

6.1 Overall Accuracy Summary

#Option to Load in comparison data instead of making API call if you have the csv file saved to your working directory
# library(readr)
# comparison_data <- read_csv("CDT_Weather_Comparison.csv")

accuracy_summary <- comparison_data %>%
  group_by(Device) %>%
  summarise(
    n = n(),
    # Temperature accuracy
    Mean_Temp_Error = mean(Temp_Diff, na.rm = TRUE),
    MAE_Temp = mean(Temp_Abs_Diff, na.rm = TRUE),
    RMSE_Temp = sqrt(mean(Temp_Diff^2, na.rm = TRUE)),
    # Humidity accuracy
    Mean_Hum_Error = mean(Hum_Diff, na.rm = TRUE),
    MAE_Hum = mean(Hum_Abs_Diff, na.rm = TRUE),
    # Pressure accuracy
    Mean_Press_Error = mean(Press_Diff, na.rm = TRUE),
    MAE_Press = mean(Press_Abs_Diff, na.rm = TRUE)
  )

knitr::kable(accuracy_summary, digits = 2, 
             caption = "Sensor Accuracy Comparison (Lower values = better accuracy)")

Sensor Accuracy Comparison (Lower values = better accuracy)
Device	n	Mean_Temp_Error	MAE_Temp	RMSE_Temp	Mean_Hum_Error	MAE_Hum	Mean_Press_Error	MAE_Press
Red	62	0.10	6.76	10.10	4.37	12.11	-4.75	4.93
Purple	38	-0.48	5.54	6.66	11.97	14.82	-4.17	4.28

Interpretation: - Mean Error: Average difference (positive = sensor reads higher than NOAA) - MAE (Mean Absolute Error): Average magnitude of errors (ignoring direction) - RMSE (Root Mean Square Error): Emphasizes larger errors

6.2 Humidity Error Distribution

ggplot(comparison_data, aes(x = Hum_Diff, color = Device, fill = Device)) +
  geom_density(alpha = 0.3, linewidth = 1.2) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "black", linewidth = 1) +
  labs(title = "Humidity Error Distribution (Sensor - NOAA)",
       subtitle = "Curves centered near 0 indicate accurate measurements",
       x = "Humidity Error (%)",
       y = "Density") +
  theme_minimal() +
  scale_fill_manual(values = c("Red" = "red", "Purple" = "purple")) +
  scale_color_manual(values = c("Red" = "darkred", "Purple" = "darkviolet"))

6.3 Temperature Error Distribution

ggplot(comparison_data, aes(x = Temp_Diff, color = Device, fill = Device)) +
  geom_density(alpha = 0.3, linewidth = 1.2) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "black", linewidth = 1) +
  labs(title = "Temperature Error Distribution (Sensor - NOAA)",
       subtitle = "Curves centered near 0 indicate accurate measurements",
       x = "Temperature Error (°F)",
       y = "Density") +
  theme_minimal() +
  scale_fill_manual(values = c("Red" = "red", "purple" = "purple")) +
  scale_color_manual(values = c("Red" = "darkred", "purple" = "darkviolet"))

Exercise 2

Exercise 2: Modify the code above to create Pressure difference distributions instead of Temperature. Change the x-axis variable and update the title and x-axis label accordingly. Hint: look at what the column names for comparison_data are.

## put your exercise 1 code here


ggplot(comparison_data, aes(x = Press_Diff, color = Device, fill = Device)) +
  geom_density(alpha = 0.3, linewidth = 1.2) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "black", linewidth = 1) +
  labs(title = "Pressure Error Distribution (Sensor - NOAA)",
       subtitle = "Curves centered near 0 indicate accurate measurements",
       x = "Pressure Error (mBar)",
       y = "Density") +
  theme_minimal() +
  scale_fill_manual(values = c("Red" = "red", "Purple" = "purple")) +
  scale_color_manual(values = c("Red" = "darkred", "Purple" = "darkviolet"))

6.4 Temperature Error Box Plot

ggplot(comparison_data, aes(x = Device, y = Temp_Abs_Diff, fill = Device)) +
  geom_boxplot() +
  labs(title = "Absolute Temperature Error by Device",
       subtitle = "Lower values and tighter boxes indicate better accuracy",
       x = "Device",
       y = "Absolute Temperature Error (°F)") +
  theme_minimal() +
  scale_fill_manual(values = c("Red" = "red", "Purple" = "purple"))

Exercise 3

Exercise 3: Modify the code above to create a box plot showing the Pressure absolute difference distribution instead of Temperature. Change the y-axis variable and update the title and y-axis label accordingly.

## put your exercise 2 code here
ggplot(comparison_data, aes(x = Device, y = Press_Abs_Diff, fill = Device)) +
  geom_boxplot() +
  labs(title = "Absolute Pressure Error by Device",
       subtitle = "Lower values and tighter boxes indicate better accuracy",
       x = "Device",
       y = "Absolute Pressure Error (mBar)") +
  theme_minimal() +
  scale_fill_manual(values = c("Red" = "red", "Purple" = "purple"))

6.5 Error Count by Device

# Categorize errors as small medium and large. change according to what might make the most sense
comparison_data <- comparison_data %>%
  mutate(
    Temp_Error_Category = case_when(
      Temp_Abs_Diff < 3 ~ "Small (<3°F)",
      Temp_Abs_Diff < 6 ~ "Medium (3-6°F)",
      TRUE ~ "Large (>6°F)"
    )
  )

error_counts <- comparison_data %>%
  count(Device, Temp_Error_Category) %>%
  group_by(Device) %>%
  mutate(Percentage = (n / sum(n)) * 100)

ggplot(error_counts, aes(x = Device, y = Percentage, fill = Temp_Error_Category)) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_text(aes(label = paste0(round(Percentage, 1), "%")), 
            position = position_dodge(width = 0.9), vjust = -0.5, size = 3) +
  labs(title = "Temperature Error Categories by Device",
       subtitle = "Higher percentage of 'Small' errors indicates better sensor",
       x = "Device",
       y = "Percentage of Measurements",
       fill = "Error Size") +
  theme_minimal() +
  scale_fill_brewer(palette = "RdYlGn", direction = -1)

# Categorize errors as small medium and large. change according to what might make the most sense
comparison_data <- comparison_data %>%
  mutate(
    Hum_Error_Category = case_when(
      Hum_Abs_Diff < 5 ~ "Small (<5%)",
      Hum_Abs_Diff < 10 ~ "Medium (5-10%)",
      TRUE ~ "Large (>10%)"
    )
  )

error_counts <- comparison_data %>%
  count(Device, Hum_Error_Category) %>%
  group_by(Device) %>%
  mutate(Percentage = (n / sum(n)) * 100)

ggplot(error_counts, aes(x = Device, y = Percentage, fill = Hum_Error_Category)) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_text(aes(label = paste0(round(Percentage, 1), "%")), 
            position = position_dodge(width = 0.9), vjust = -0.5, size = 3) +
  labs(title = "Humidity Error Categories by Device",
       subtitle = "Higher percentage of 'Small' errors indicates better sensor",
       x = "Device",
       y = "Percentage of Measurements",
       fill = "Error Size") +
  theme_minimal() +
  scale_fill_brewer(palette = "RdYlGn", direction = -1)

Exercise 4

Exercise 4: Modify the code above to create bar charts showing the error sizes for Pressure absolute differences. Update the title accordingly.

## put your exercise 3 code here
comparison_data <- comparison_data %>%
  mutate(
   Press_Error_Category = case_when(
      Press_Abs_Diff < 2 ~ "Small (<3mBar)",
      Press_Abs_Diff < 6 ~ "Medium (3-6mBar)",
      TRUE ~ "Large (>6mBar)"
    )
  )

error_counts <- comparison_data %>%
  count(Device, Press_Error_Category) %>%
  group_by(Device) %>%
  mutate(Percentage = (n / sum(n)) * 100)

ggplot(error_counts, aes(x = Device, y = Percentage, fill = Press_Error_Category)) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_text(aes(label = paste0(round(Percentage, 1), "%")), 
            position = position_dodge(width = 0.9), vjust = -0.5, size = 3) +
  labs(title = "Pressure Error Categories by Device",
       subtitle = "Higher percentage of 'Small' errors indicates better sensor",
       x = "Device",
       y = "Percentage of Measurements",
       fill = "Error Size") +
  theme_minimal() +
  scale_fill_brewer(palette = "RdYlGn", direction = -1)

6.6 Interactive Temperature Comparison

# Create interactive scatter plot
p <- ggplot(comparison_data, aes(x = NOAA_Temp_F, y = Temp_F, color = Device,
                                 text = paste0(
                                   "Device: ", Device, "<br>",
                                   "Sensor Temp: ", round(Temp_F, 2), "°F<br>",
                                   "NOAA Temp: ", round(NOAA_Temp_F, 2), "°F<br>",
                                   "Difference: ", round(Temp_Diff, 2), "°F<br>",
                                   "% Difference: ", round(Temp_Pct_Diff, 2), "%<br>",
                                   "Date/Time: ", format(TimeStampFull, "%Y-%m-%d %H:%M")
                                 ))) +
  geom_point(size = 3, alpha = 0.7) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "black", linewidth = 1) +
  scale_color_manual(values = c("Red" = "red", "Purple" = "purple")) +
  theme_minimal() +
   labs(title = "Sensor Temperature vs NOAA Temperature",
       subtitle = "Points closer to the diagonal line indicate better accuracy",
       x = "NOAA Temperature (°F)",
       y = "Sensor Temperature (°F)")

ggplotly(p, tooltip = "text") %>%
  layout(hovermode = "closest")

Exercise 5

Exercise 5: Create a similar interactive scatter plot for Humidity (Sensor Humidity vs NOAA Humidity).

## put your exercise 4 code here
p <- ggplot(comparison_data, aes(x = NOAA_Humidity, y = Hum, color = Device,
                                 text = paste0(
                                   "Device: ", Device, "<br>",
                                   "Sensor Hum: ", round(Hum, 2), "%<br>",
                                   "NOAA Hum: ", round(NOAA_Humidity, 2), "%<br>",
                                   "Difference: ", round(Hum_Diff, 2), "%<br>",
                                   "% Difference: ", round(Hum_Diff, 2), "%<br>",
                                   "Date/Time: ", format(TimeStampFull, "%Y-%m-%d %H:%M")
                                 ))) +
  geom_point(size = 3, alpha = 0.7) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "black", linewidth = 1) +
  scale_color_manual(values = c("Red" = "red", "Purple" = "purple")) +
  theme_minimal() +
   labs(title = "Sensor Humidity vs NOAA Humidity",
       subtitle = "Points closer to the diagonal line indicate better accuracy",
       x = "NOAA Humidity (%)",
       y = "Sensor Humidity (%)")

ggplotly(p, tooltip = "text") %>%
  layout(hovermode = "closest")

## put your exercise 4 code here
p <- ggplot(comparison_data, aes(x = NOAA_Pressure_hPa, y = Press, color = Device,
                                 text = paste0(
                                   "Device: ", Device, "<br>",
                                   "Sensor Press: ", round(Hum, 2), "hPa<br>",
                                   "NOAA hPa: ", round(NOAA_Pressure_hPa, 2), "hPa<br>",
                                   "Difference: ", round(Press_Diff, 2), "hPa<br>",
                                   "% Difference: ", round(Press_Diff, 2), "%<br>",
                                   "Date/Time: ", format(TimeStampFull, "%Y-%m-%d %H:%M")
                                 ))) +
  geom_point(size = 3, alpha = 0.7) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "black", linewidth = 1) +
  scale_color_manual(values = c("Red" = "red", "Purple" = "purple")) +
  theme_minimal() +
   labs(title = "Sensor Pressure vs NOAA Pressure",
       subtitle = "Points closer to the diagonal line indicate better accuracy",
       x = "NOAA Pressure (hPa)",
       y = "Sensor Pressure (hPa)")

ggplotly(p, tooltip = "text") %>%
  layout(hovermode = "closest")

7. Error Pattern Analysis

Let’s investigate if there are patterns to the measurement errors based on date. Do the sensors underestimate or overestimate the noaa weather data?

7.1 Errors by Date

# Group by date to see if certain days have more errors
daily_errors <- comparison_data %>%
  filter(Device =="Red") %>%
  group_by(Date, Device) %>%
  summarise(
    Mean_Temp_Error = mean(Temp_Diff, na.rm = TRUE),
    n = n(),
    .groups = "drop"
  )

ggplot(daily_errors, aes(x = Date, y = Mean_Temp_Error, color = Device, size = n)) +
  geom_point(alpha = 0.7) +
  geom_line(aes(group = Device), linewidth = 1) +
  labs(title = "Mean Temperature Error by Date",
       subtitle = "Point size indicates number of measurements that day",
       x = "Date",
       y = "Mean Absolute Error (°F)",
       size = "Number of\nMeasurements") +
  theme_minimal() +
  scale_color_manual(values = c("Red" = "red", "Purple" = "purple"))

Exercise 6

Exercise 6: Create a similar mean absolute temperature error by date plot for the purple device.

## put your exercise 5 code here
daily_errors <- comparison_data %>%
  filter(Device =="Purple") %>%
  group_by(Date, Device) %>%
  summarise(
    Mean_Temp_Error = mean(Temp_Diff, na.rm = TRUE),
    n = n(),
    .groups = "drop"
  )

ggplot(daily_errors, aes(x = Date, y = Mean_Temp_Error, color = Device, size = n)) +
  geom_point(alpha = 0.7) +
  geom_line(aes(group = Device), linewidth = 1) +
  labs(title = "Mean Temperature Error by Date",
       subtitle = "Point size indicates number of measurements that day",
       x = "Date",
       y = "Mean Absolute Error (°F)",
       size = "Number of\nMeasurements") +
  theme_minimal() +
  scale_color_manual(values = c("Red" = "red", "Purple" = "purple"))

7.2 Positive vs Negative Error Analysis

# Count positive (overestimate) vs negative (underestimate) errors
error_direction <- comparison_data %>%
  mutate(Error_Direction = ifelse(Temp_Diff > 0, "Overestimate", "Underestimate")) %>%
  count(Device, Error_Direction) %>%
  group_by(Device) %>%
  mutate(Percentage = (n / sum(n)) * 100)

ggplot(error_direction, aes(x = Device, y = Percentage, fill = Error_Direction)) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_text(aes(label = paste0(round(Percentage, 1), "%")), 
            position = position_dodge(width = 0.9), vjust = -0.5) +
  labs(title = "Direction of Temperature Errors",
       subtitle = "Does the sensor tend to read high or low?",
       x = "Device",
       y = "Percentage of Measurements",
       fill = "Error Direction") +
  theme_minimal() +
  scale_fill_manual(values = c("Overestimate" = "red", "Underestimate" = "blue"))

8. Error Visualization Map

Now let’s create an interactive map that shows over/underestimates for temperature, humidity, and pressure:

# Create error classification for visualization
comparison_data <- comparison_data %>%
  mutate(
    Temp_Error_Type = case_when(
      Temp_Diff > 1 ~ "Overestimate",
      Temp_Diff < -1 ~ "Underestimate",
      TRUE ~ "Accurate"
    ),
    Hum_Error_Type = case_when(
      Hum_Diff > 5 ~ "Overestimate",
      Hum_Diff < -5 ~ "Underestimate",
      TRUE ~ "Accurate"
    ),
    Press_Error_Type = case_when(
      Press_Diff > 2 ~ "Overestimate",
      Press_Diff < -2 ~ "Underestimate",
      TRUE ~ "Accurate"
    )
  )

# Create color palettes for each measurement type
temp_error_colors <- colorFactor(
  palette = c("blue", "green", "red"),
  domain = c("Underestimate", "Accurate", "Overestimate")
)

hum_error_colors <- colorFactor(
  palette = c("blue", "green", "red"),
  domain = c("Underestimate", "Accurate", "Overestimate")
)

press_error_colors <- colorFactor(
  palette = c("blue", "green", "red"),
  domain = c("Underestimate", "Accurate", "Overestimate")
)

# Create the interactive error map
error_map <- leaflet(comparison_data) %>%
  # Add base layers
  addProviderTiles(providers$Esri.WorldTopoMap, group = "ESRI Topo") %>%
  addProviderTiles("Esri.WorldImagery", group = "Satellite") %>%
  addWMSTiles(
    baseUrl = "https://basemap.nationalmap.gov/arcgis/services/USGSTopo/MapServer/WMSServer?",
    layers = "0",
    options = WMSTileOptions(format = "image/png", transparent = TRUE),
    attribution = "USGS The National Map",
    group = "USGS Topo"  
  ) %>%
  addProviderTiles(providers$OpenStreetMap, group = "OpenStreetMap") %>%
  
  # Temperature error markers
  addCircleMarkers(
    lng = ~Lon,
    lat = ~Lat,
    radius = 7,
    color = ~temp_error_colors(Temp_Error_Type),
    fillOpacity = 0.7,
    stroke = TRUE,
    weight = 2,
    group = "Temperature Errors",
    label = ~paste0(
      "Device: ", Device, 
      " | ", Temp_Error_Type,
      " | Error: ", round(Temp_Diff, 1), "°F"
    ),
    popup = ~paste0(
      "<strong>Device:</strong> ", Device, "<br/>",
      "<strong>Temperature Error:</strong> ", Temp_Error_Type, "<br/>",
      "Sensor: ", round(Temp_F, 1), "°F<br/>",
      "NOAA: ", round(NOAA_Temp_F, 1), "°F<br/>",
      "Difference: ", round(Temp_Diff, 1), "°F<br/>",
      "<strong>Time:</strong> ", format(TimeStampFull, "%Y-%m-%d %H:%M")
    )
  ) %>%
  
  # Humidity error markers
  addCircleMarkers(
    lng = ~Lon,
    lat = ~Lat,
    radius = 7,
    color = ~hum_error_colors(Hum_Error_Type),
    fillOpacity = 0.7,
    stroke = TRUE,
    weight = 2,
    group = "Humidity Errors",
    label = ~paste0(
      "Device: ", Device, 
      " | ", Hum_Error_Type,
      " | Error: ", round(Hum_Diff, 1), "%"
    ),
    popup = ~paste0(
      "<strong>Device:</strong> ", Device, "<br/>",
      "<strong>Humidity Error:</strong> ", Hum_Error_Type, "<br/>",
      "Sensor: ", Hum, "%<br/>",
      "NOAA: ", round(NOAA_Humidity, 1), "%<br/>",
      "Difference: ", round(Hum_Diff, 1), "%<br/>",
      "<strong>Time:</strong> ", format(TimeStampFull, "%Y-%m-%d %H:%M")
    )
  ) %>%
  
  # Pressure error markers
  addCircleMarkers(
    lng = ~Lon,
    lat = ~Lat,
    radius = 7,
    color = ~press_error_colors(Press_Error_Type),
    fillOpacity = 0.7,
    stroke = TRUE,
    weight = 2,
    group = "Pressure Errors",
    label = ~paste0(
      "Device: ", Device, 
      " | ", Press_Error_Type,
      " | Error: ", round(Press_Diff, 1), " hPa"
    ),
    popup = ~paste0(
      "<strong>Device:</strong> ", Device, "<br/>",
      "<strong>Pressure Error:</strong> ", Press_Error_Type, "<br/>",
      "Sensor: ", Press, " hPa<br/>",
      "NOAA: ", round(NOAA_Pressure_hPa, 1), " hPa<br/>",
      "Difference: ", round(Press_Diff, 1), " hPa<br/>",
      "<strong>Time:</strong> ", format(TimeStampFull, "%Y-%m-%d %H:%M")
    )
  ) %>%
  
  # Add layer controls
  addLayersControl(
    baseGroups = c( "USGS Topo", "ESRI Topo", "Satellite", "USGS Topo", "OpenStreetMap"),
    overlayGroups = c("Temperature Errors", "Humidity Errors", "Pressure Errors"),
    options = layersControlOptions(collapsed = FALSE)
  ) %>%
  
  # Show only temperature errors by default
  hideGroup(c("Humidity Errors", "Pressure Errors")) %>%
  
  # Add legends
  addLegend(
    position = "bottomright",
    pal = temp_error_colors,
    values = ~Temp_Error_Type,
    title = "Error Type",
    opacity = 1,
    group = "Temperature Errors"
  ) %>%
  
  addLegend(
    position = "bottomright",
    pal = hum_error_colors,
    values = ~Hum_Error_Type,
    title = "Error Type",
    opacity = 1,
    group = "Humidity Errors"
  ) %>%
  
  addLegend(
    position = "bottomright",
    pal = press_error_colors,
    values = ~Press_Error_Type,
    title = "Error Type",
    opacity = 1,
    group = "Pressure Errors"
  ) %>%
  
  # Add scale bar
  addScaleBar(position = "bottomleft")

# Display the map
error_map

This interactive map allows you to: - Switch base layers: Choose between ESRI Topo, Satellite, USGS Topo, or OpenStreetMap - Toggle error types: Check/uncheck boxes to show Temperature, Humidity, or Pressure errors - Color coding: - Blue = Underestimate (sensor reads too low) - Green = Accurate (within threshold) - Red = Overestimate (sensor reads too high) - Hover labels: Quick view of device, error type, and magnitude - Click popups: Detailed comparison data

9. Export Enhanced Dataset

# Save the enhanced dataset with NOAA data
# write.csv(comparison_data, "CDT_Weather_Comparison.csv", row.names = FALSE)
# write_xlsx(comparison_data, "CDT_Weather_Comparison.xlsx")

cat("Analysis complete!\n")

## Analysis complete!

cat("Comparison dataset contains", nrow(comparison_data), "points with NOAA data.\n")

## Comparison dataset contains 100 points with NOAA data.

10. Summary and Conclusions

This tutorial demonstrated:

Data Import and Cleaning: Loading Excel data and properly formatting timestamps and temperatures. Checking for data accuracy and proper data labeling and making corrections as necessary
External Data Integration: Fetching NOAA weather data via API
Sensor Accuracy Analysis: Comparing sensor measurements against reference data
Statistical Analysis: Computing error metrics (MAE, RMSE) to evaluate sensor performance
Visualization: Creating distribution curves, box plots, scatter plots, interactive plots to explore sensor accuracy
Interactive Mapping: Using Leaflet with multiple base layers including satellite and USGS topo maps and also hover labels and click labels

Key Findings

Based on the comparison with NOAA data: - Total comparison points: 100 - Red device measurements: 62 - Purple device measurements: 38

Exercise 7

Exercise 7: Review the analysis comparing the red and purple meteo devices. Type your answers below each question. Provide at least three sentences per response:

7a: Which device would you say was more accurate compared to NOAA weather data (Temperature, Pressure, Humidity) and why would you say so? Be sure to mention what graph(s) or statistic(s) support your conclusion. In the ‘Pressure Error Categories by Device’ bar graph, the purple device has a higher percentage of both small and large error when compared to red. However, the small error percentage is a larger difference than the larger error percentage, meaning, overall, it is likely more accurate than the red device. In the ‘Humidity Error Categories by Device’ bar graph, the purple device has a lower percentage of small error and a higher percentage of large error when compared to red, so the red device is more accurate in this category. In ‘Humidity Error Categories by Device’ bar graph, the purple device has a smaller, but likely comparable, percentage of error compared to red, but a larger percentage of error when compared to red and a smaller percentage of medium error when compared to red, so red is likely the more accurate device in the temperature category. Thus even though it has higher error in pressure measurements, red is likely the more accurate sensor overall

7b: What was something interesting you noticed about the difference between device data and the NOAA temperature data? Where there any patterns you noticed in this descriptive statistical analysis regarding the potential accuracy of these devices compared to NOAA historical data? Mention what graph(s) or statistic(s) helped you notice this pattern. On the scatter plot ‘Sensor Temperature vs NOAA Temperature,’ red has more clear outliers than purple. This is interesting because in the ‘Temperature Error Categories by Device’ bar graph purple has a higher percentage of large errors. This means that, on the scatter plot, the red data is, overall, closer to the line of best fit than the purple even though it has at least four major outlines. This confirms that red is the more accurate device for temperature data.

To submit this assignment, publish on RPubs and post link url on Canvas

Click the “Knit” button in RStudio
Once the HTML is generated, click “Publish” in the preview window
Select “RPubs” as the destination
Sign in or create an RPubs account
Give your document a title and click “Publish”
Copy and paste your html web address for your completed project posted on RPubs and upload the web address to canvas.

weather data analysis with CDT MT data

Bridget Willson

2025-11-03