Introduction

This tutorial demonstrates how to import, clean, analyze, and visualize meteorological tracking data in R. We’ll compare sensor measurements against NOAA weather data to evaluate sensor accuracy and determine which device performs better.

1. Load Required Packages

First, we’ll install and load all necessary packages:

# Install packages if needed (uncomment the lines below)
# install.packages(c("readxl", "dplyr", "ggplot2", "leaflet", "lubridate", "summarytools", "httr", "jsonlite", "plotly"))

library(readxl)      # For reading Excel files
library(dplyr)       # For data manipulation
library(ggplot2)     # For plotting
library(leaflet)     # For interactive maps
library(lubridate)   # For date-time manipulation
library(summarytools)# For summary statistics
library(httr)        # For API calls
library(jsonlite)    # For JSON parsing
library(plotly)      # For interactive plots

2. Import and Prepare Data

2.1 Import the Dataset

# Import the Excel file
weather_data <- read_excel("CDT_Meteo_Tracker_Data.xlsx")

# View the first few rows
head(weather_data)

## # A tibble: 6 × 11
##   TimeStampFull Date                Time                  Lat   Lon alt_m Temp_F
##   <chr>         <dttm>              <dttm>              <dbl> <dbl> <dbl>  <dbl>
## 1 2025-09-09T0… 2025-09-09 00:00:00 1899-12-31 08:53:32  42.7 -109.  2779   10.6
## 2 2025-09-09T0… 2025-09-09 00:00:00 1899-12-31 08:53:35  42.7 -109.  2778   10.6
## 3 2025-09-09T0… 2025-09-09 00:00:00 1899-12-31 08:53:38  42.7 -109.  2779   10.6
## 4 2025-09-09T0… 2025-09-09 00:00:00 1899-12-31 08:53:41  42.7 -109.  2779   10.6
## 5 2025-09-09T0… 2025-09-09 00:00:00 1899-12-31 08:53:44  42.7 -109.  2778   10.6
## 6 2025-09-09T0… 2025-09-09 00:00:00 1899-12-31 08:53:47  42.7 -109.  2779   10.6
## # ℹ 4 more variables: Hum <dbl>, Press <dbl>, DP <dbl>, Device <chr>

2.2 Assign Correct Data Types

Pay special attention to the timestamp, date, and time variables:

# Convert data types
weather_data <- weather_data %>%
  mutate(
    # Parse ISO 8601 timestamp with timezone
    TimeStampFull = ymd_hms(TimeStampFull, tz = "UTC"),
    
    # Parse date
    Date = ymd(Date),
    
    # Parse time (HMS format)
    Time = hms::as_hms(Time),
    
    # Ensure numeric variables are numeric
    Lat = as.numeric(Lat),
    Lon = as.numeric(Lon),
    alt_m = as.numeric(alt_m),
    Hum = as.numeric(Hum),
    Press = as.numeric(Press),
    DP = as.numeric(DP),
    
    # Ensure Device is a factor
    Device = as.factor(Device)
  )

# Check if temperature is actually in Celsius (common issue with sensor data)
# If the column is named Temp_F but values are small (typical for C), convert to F
if (mean(weather_data$Temp_F, na.rm = TRUE) < 30) {
  cat("NOTE: Temperature values appear to be in Celsius. Converting to Fahrenheit...\n")
  weather_data <- weather_data %>%
    mutate(
      Temp_C = Temp_F,  # Save original Celsius values
      Temp_F = (Temp_F * 9/5) + 32  # Convert to Fahrenheit
    )
  cat("Conversion complete. Original Celsius values saved in 'Temp_C' column.\n\n")
} else {
  # Ensure Temp_F is numeric
  weather_data$Temp_F <- as.numeric(weather_data$Temp_F)
}

## NOTE: Temperature values appear to be in Celsius. Converting to Fahrenheit...
## Conversion complete. Original Celsius values saved in 'Temp_C' column.

# Check the structure
str(weather_data)

## tibble [13,647 × 12] (S3: tbl_df/tbl/data.frame)
##  $ TimeStampFull: POSIXct[1:13647], format: "2025-09-09 14:53:32" "2025-09-09 14:53:35" ...
##  $ Date         : Date[1:13647], format: "2025-09-09" "2025-09-09" ...
##  $ Time         : 'hms' num [1:13647] 08:53:32 08:53:35 08:53:38 08:53:41 ...
##   ..- attr(*, "units")= chr "secs"
##  $ Lat          : num [1:13647] 42.7 42.7 42.7 42.7 42.7 ...
##  $ Lon          : num [1:13647] -109 -109 -109 -109 -109 ...
##  $ alt_m        : num [1:13647] 2779 2778 2779 2779 2778 ...
##  $ Temp_F       : num [1:13647] 51.1 51.1 51.1 51.1 51.1 ...
##  $ Hum          : num [1:13647] 69 69 68 68 68 68 68 67 66 66 ...
##  $ Press        : num [1:13647] 733 733 733 733 733 733 733 733 733 733 ...
##  $ DP           : num [1:13647] 5.3 5.2 5.1 5.1 5.1 5.1 5.1 4.9 4.6 4.6 ...
##  $ Device       : Factor w/ 2 levels "Purple","Red": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Temp_C       : num [1:13647] 10.6 10.6 10.6 10.6 10.6 10.6 10.7 10.7 10.6 10.5 ...

# Display temperature range to verify
cat("\nTemperature range after conversion:\n")

## 
## Temperature range after conversion:

cat("Min:", round(min(weather_data$Temp_F, na.rm = TRUE), 1), "°F\n")

## Min: 29.7 °F

cat("Max:", round(max(weather_data$Temp_F, na.rm = TRUE), 1), "°F\n")

## Max: 92.5 °F

cat("Mean:", round(mean(weather_data$Temp_F, na.rm = TRUE), 1), "°F\n")

## Mean: 52.9 °F

3. Mapping The Data

# Create color palette for devices (Red and Purple)
device_colors <- colorFactor(palette = c("red", "purple"), 
                              domain = weather_data$Device)

# Create the leaflet map
leaflet_map <- leaflet(weather_data) %>%
  # Add multiple base layers
  addProviderTiles(providers$OpenTopoMap, group = "OpenTopoMap") %>%
  addProviderTiles(providers$Esri.WorldTopoMap, group = "ESRI Topo") %>%
  addProviderTiles("Esri.WorldImagery", group = "Satellite") %>%
  addWMSTiles(
    baseUrl = "https://basemap.nationalmap.gov/arcgis/services/USGSTopo/MapServer/WMSServer?",
    layers = "0",
    options = WMSTileOptions(format = "image/png", transparent = TRUE),
    attribution = "USGS The National Map",
    group = "USGS Topo"  
  ) %>%
  addProviderTiles(providers$OpenStreetMap, group = "OpenStreetMap") %>%
  
  # Add markers with popups and labels
  addCircleMarkers(
    lng = ~Lon,
    lat = ~Lat,
    radius = 5,
    color = ~device_colors(Device),
    fillOpacity = 0.7,
    stroke = TRUE,
    weight = 1,
    
    # Hover label - expanded to include all key variables
    label = ~paste0(
      "Device: ", Device, 
      " | Temp: ", round(Temp_F, 1), "°F",
      " | Hum: ", Hum, "%",
      " | Press: ", Press, " hPa",
      " | DP: ", round(DP, 1), "°F"
    ),
    
    # Click popup
    popup = ~paste0(
      "<strong>Device:</strong> ", Device, "<br/>",
      "<strong>Temperature:</strong> ", round(Temp_F, 1), "°F<br/>",
      "<strong>Humidity:</strong> ", Hum, "%<br/>",
      "<strong>Pressure:</strong> ", Press, " hPa<br/>",
      "<strong>Dew Point:</strong> ", round(DP, 1), "°F<br/>",
      "<strong>Altitude:</strong> ", round(alt_m, 1), " m<br/>",
      "<strong>Time:</strong> ", format(TimeStampFull, "%Y-%m-%d %H:%M:%S")
    )
  ) %>%
  
  # Add layer control with radio buttons
  addLayersControl(
    baseGroups = c("OpenTopoMap", "ESRI Topo", "Satellite", "USGS Topo", "OpenStreetMap"),
    options = layersControlOptions(collapsed = FALSE)
  ) %>%
  
  # Add legend
  addLegend(
    position = "bottomright",
    pal = device_colors,
    values = ~Device,
    title = "Device",
    opacity = 1
  ) %>%
  
  # Add scale bar
  addScaleBar(position = "bottomleft")

# Display the map
leaflet_map

Exercise 1

Exercise 1: In the map above, try changing the base layer that loads by default. Currently it loads OpenTopoMap first - can you make it load a different layer first instead? (Hint: The order in which you add the provider tiles matters!)

4. Basic Data Exploration

4.1 Dataset Overview

# Display dataset dimensions
cat("Dataset dimensions:", nrow(weather_data), "rows x", ncol(weather_data), "columns\n\n")

## Dataset dimensions: 13647 rows x 12 columns

# Check date range
cat("Dataset date range:\n")

## Dataset date range:

cat("Earliest:", as.character(min(weather_data$Date, na.rm = TRUE)), "\n")

## Earliest: 2024-09-05

cat("Latest:", as.character(max(weather_data$Date, na.rm = TRUE)), "\n\n")

## Latest: 2025-09-11

# Summary of all variables
summary(weather_data)

##  TimeStampFull                      Date                 Time                
##  Min.   :2024-09-05 17:27:21   Min.   :2024-09-05   Min.   :00:00:20.000000  
##  1st Qu.:2024-09-15 19:31:45   1st Qu.:2024-09-15   1st Qu.:08:38:12.500000  
##  Median :2025-09-03 03:10:32   Median :2025-09-02   Median :13:31:44.000000  
##  Mean   :2025-05-03 09:33:49   Mean   :2025-05-02   Mean   :12:49:47.503041  
##  3rd Qu.:2025-09-07 13:52:33   3rd Qu.:2025-09-07   3rd Qu.:17:18:42.000000  
##  Max.   :2025-09-12 00:47:54   Max.   :2025-09-11   Max.   :23:59:43.000000  
##       Lat             Lon             alt_m          Temp_F     
##  Min.   :42.52   Min.   :-109.8   Min.   :2178   Min.   :29.66  
##  1st Qu.:42.70   1st Qu.:-109.5   1st Qu.:2824   1st Qu.:42.26  
##  Median :42.79   Median :-109.3   Median :2985   Median :52.88  
##  Mean   :42.80   Mean   :-109.3   Mean   :2950   Mean   :52.93  
##  3rd Qu.:42.93   3rd Qu.:-109.1   3rd Qu.:3118   3rd Qu.:62.60  
##  Max.   :43.02   Max.   :-108.9   Max.   :3561   Max.   :92.48  
##       Hum            Press             DP             Device    
##  Min.   :13.00   Min.   :660.0   Min.   :-5.6000   Purple:4788  
##  1st Qu.:32.00   1st Qu.:708.0   1st Qu.:-1.5000   Red   :8859  
##  Median :49.00   Median :718.0   Median : 0.5000                
##  Mean   :51.54   Mean   :718.8   Mean   : 0.6537                
##  3rd Qu.:70.00   3rd Qu.:728.0   3rd Qu.: 2.6000                
##  Max.   :99.00   Max.   :780.0   Max.   :13.4000                
##      Temp_C     
##  Min.   :-1.30  
##  1st Qu.: 5.70  
##  Median :11.60  
##  Mean   :11.63  
##  3rd Qu.:17.00  
##  Max.   :33.60

4.2 Summary Statistics by Device

# Summary statistics for numeric variables by device
weather_data %>%
  group_by(Device) %>%
  summarise(
    n = n(),
    Mean_Temp_F = mean(Temp_F, na.rm = TRUE),
    SD_Temp_F = sd(Temp_F, na.rm = TRUE),
    Mean_Humidity = mean(Hum, na.rm = TRUE),
    SD_Humidity = sd(Hum, na.rm = TRUE),
    Mean_Pressure = mean(Press, na.rm = TRUE),
    SD_Pressure = sd(Press, na.rm = TRUE),
    Mean_Altitude_m = mean(alt_m, na.rm = TRUE),
    Min_Altitude_m = min(alt_m, na.rm = TRUE),
    Max_Altitude_m = max(alt_m, na.rm = TRUE)
  ) %>%
  knitr::kable(digits = 2, caption = "Summary Statistics by Device")

Summary Statistics by Device
Device	n	Mean_Temp_F	SD_Temp_F	Mean_Humidity	SD_Humidity	Mean_Pressure	SD_Pressure	Mean_Altitude_m	Min_Altitude_m	Max_Altitude_m
Purple	4788	54.72	11.62	50.43	21.83	719.40	26.97	2915.23	2178	3561
Red	8859	51.97	13.20	52.14	21.24	718.43	10.30	2968.11	2749	3320

5. Integrate NOAA Weather Data

5.1 Function to Fetch NOAA Data

We’ll use the Open-Meteo API to retrieve historical weather data. There is a limit on how much data you can take, if you exceed this you will be cut-off for a time period. If we want to eventually download all the data corresponding for all of our points we will need to be a bit strategic about doing it slower or in chunks. Here are the terms for the free tier of Open_Meteo: Less than 10’000 API calls per day, 5’000 per hour and 600 per minute. For more information visit: https://open-meteo.com/

# Function to get weather data from Open-Meteo API
get_weather_data <- function(lat, lon, datetime, debug = FALSE) {
  # Convert datetime to date
  date <- as.Date(datetime)
  
  # Build API URL
  base_url <- "https://archive-api.open-meteo.com/v1/archive"
  
  url <- paste0(
    base_url,
    "?latitude=", lat,
    "&longitude=", lon,
    "&start_date=", date,
    "&end_date=", date,
    "&hourly=temperature_2m,relative_humidity_2m,surface_pressure",
    "&temperature_unit=fahrenheit"
  )
  
  # Default return values
  default_result <- list(
    NOAA_Temp_F = NA_real_,
    NOAA_Humidity = NA_real_,
    NOAA_Pressure_hPa = NA_real_
  )
  
  # Make API request
  result <- tryCatch({
    response <- GET(url)
    
    if (debug) {
      cat("URL:", url, "\n")
      cat("Status:", status_code(response), "\n")
    }
    
    if (status_code(response) == 200) {
      data <- content(response, "parsed")
      
      # Check if data exists
      if (is.null(data$hourly)) {
        if (debug) cat("No hourly data in response\n")
        return(default_result)
      }
      
      if (is.null(data$hourly$time) || length(data$hourly$time) == 0) {
        if (debug) cat("No time data\n")
        return(default_result)
      }
      
      if (is.null(data$hourly$temperature_2m) || length(data$hourly$temperature_2m) == 0) {
        if (debug) cat("No temperature data\n")
        return(default_result)
      }
      
hourly_data <- data$hourly

# Unwrap lists if needed
hourly_data$time <- unlist(hourly_data$time)
hourly_data$temperature_2m <- unlist(hourly_data$temperature_2m)
hourly_data$relative_humidity_2m <- unlist(hourly_data$relative_humidity_2m)
hourly_data$surface_pressure <- unlist(hourly_data$surface_pressure)

if (debug) {
  cat("Number of hourly records:", length(hourly_data$time), "\n")
  cat("First time value:", hourly_data$time[1], "\n")
  cat("Target datetime:", as.character(datetime), "\n")
  cat("Target hour:", hour(datetime), "\n")
}
      
      # Get the hour from our datetime (0-23)
      target_hour <- hour(datetime)
      
      # The API returns 24 hours (0-23) for the requested date
      # Index is target_hour + 1 (since R is 1-indexed)
      idx <- target_hour + 1
      
      if (debug) {
        cat("Using index:", idx, "\n")
        cat("Temperature at index:", hourly_data$temperature_2m[idx], "\n")
      }
      
      # Safely extract values with bounds checking
      temp_val <- if (idx > 0 && idx <= length(hourly_data$temperature_2m)) {
        val <- hourly_data$temperature_2m[idx]
        if (!is.null(val)) as.numeric(val) else NA_real_
      } else NA_real_
      
      hum_val <- if (idx > 0 && idx <= length(hourly_data$relative_humidity_2m)) {
        val <- hourly_data$relative_humidity_2m[idx]
        if (!is.null(val)) as.numeric(val) else NA_real_
      } else NA_real_
      
      press_val <- if (idx > 0 && idx <= length(hourly_data$surface_pressure)) {
        val <- hourly_data$surface_pressure[idx]
        if (!is.null(val)) as.numeric(val) else NA_real_
      } else NA_real_
      
      if (debug) {
        cat("Extracted - Temp:", temp_val, "Hum:", hum_val, "Press:", press_val, "\n\n")
      }
      
      return(list(
        NOAA_Temp_F = temp_val,
        NOAA_Humidity = hum_val,
        NOAA_Pressure_hPa = press_val
      ))
    } else {
      if (debug) cat("API returned error status\n")
      return(default_result)
    }
  }, error = function(e) {
    if (debug) cat("Error:", e$message, "\n")
    return(default_result)
  })
  
  return(result)
}

5.2 Fetch NOAA Data for Sample Points

# Sample 100 points to demonstrate (to avoid excessive API calls and wait time)
set.seed(123)
sample_indices <- sample(1:nrow(weather_data), min(100, nrow(weather_data)))

# Initialize columns
weather_data$NOAA_Temp_F <- NA
weather_data$NOAA_Humidity <- NA
weather_data$NOAA_Pressure_hPa <- NA

# Test with sample point to see if API is working
cat("Testing API with first sample point...\n")

## Testing API with first sample point...

test_result <- get_weather_data(
  weather_data$Lat[sample_indices[1]],
  weather_data$Lon[sample_indices[1]],
  weather_data$TimeStampFull[sample_indices[1]],
  debug = TRUE
)

## URL: https://archive-api.open-meteo.com/v1/archive?latitude=42.585104&longitude=-109.086524&start_date=2025-09-11&end_date=2025-09-11&hourly=temperature_2m,relative_humidity_2m,surface_pressure&temperature_unit=fahrenheit 
## Status: 200 
## Number of hourly records: 24 
## First time value: 2025-09-11T00:00 
## Target datetime: 2025-09-11 16:30:02 
## Target hour: 16 
## Using index: 17 
## Temperature at index: 55.3 
## Extracted - Temp: 55.3 Hum: 51 Press: 742.8

if (!is.na(test_result$NOAA_Temp_F)) {
  cat("SUCCESS! API is returning data.\n\n")
} else {
  cat("WARNING: API test failed. Check the debug output above.\n\n")
}

## SUCCESS! API is returning data.

# Fetch NOAA data for sample points
cat("Fetching NOAA weather data for", length(sample_indices), "sample points...\n")

## Fetching NOAA weather data for 100 sample points...

cat("This will take approximately", length(sample_indices) * 0.5, "seconds.\n\n")

## This will take approximately 50 seconds.

# Counter for successful retrievals
success_count <- 0
fail_count <- 0
for (i in sample_indices) {
  res <- get_weather_data(
    weather_data$Lat[i],
    weather_data$Lon[i],
    weather_data$TimeStampFull[i]
  )
  
  weather_data$NOAA_Temp_F[i] <- res$NOAA_Temp_F
  weather_data$NOAA_Humidity[i] <- res$NOAA_Humidity
  weather_data$NOAA_Pressure_hPa[i] <- res$NOAA_Pressure_hPa
  
    # Track success/failure
  if (!is.na(weather_data$NOAA_Temp_F[i])) {
    success_count <- success_count + 1
  } else {
    fail_count <- fail_count + 1
  }
  
   # add countdown feedback to know how long data retrieval is taking
  if (i %% 10 == 0) {
    cat("Processing point", which(sample_indices == i), "of", length(sample_indices),
        "(", round(which(sample_indices == i) / length(sample_indices) * 100, 1), "% complete )\n")
  }
  
  Sys.sleep(0.5)
}

## Processing point 25 of 100 ( 25 % complete )
## Processing point 30 of 100 ( 30 % complete )
## Processing point 36 of 100 ( 36 % complete )
## Processing point 54 of 100 ( 54 % complete )
## Processing point 57 of 100 ( 57 % complete )
## Processing point 93 of 100 ( 93 % complete )
## Processing point 96 of 100 ( 96 % complete )
## Processing point 99 of 100 ( 99 % complete )

cat("\nNOAA data fetch complete!\n")

## 
## NOAA data fetch complete!

cat("Successfully retrieved:", success_count, "of", length(sample_indices), "sample points\n")

## Successfully retrieved: 100 of 100 sample points

cat("Failed to retrieve:", fail_count, "points\n\n")

## Failed to retrieve: 0 points

# Create comparison dataset with only points that have NOAA data
comparison_data <- weather_data %>%
  filter(!is.na(NOAA_Temp_F)) %>%
  mutate(
    Temp_Diff = Temp_F - NOAA_Temp_F,
    Temp_Pct_Diff = ((Temp_F - NOAA_Temp_F) / NOAA_Temp_F) * 100,
    Temp_Abs_Diff = abs(Temp_Diff),
    Hum_Diff = Hum - NOAA_Humidity,
    Hum_Abs_Diff = abs(Hum_Diff),
    Press_Diff = Press - NOAA_Pressure_hPa,
    Press_Abs_Diff = abs(Press_Diff)
  )

cat("Comparison dataset created with", nrow(comparison_data), "points\n")

## Comparison dataset created with 100 points

Note: We’re sampling 100 points to demonstrate the API integration without excessive wait times. To fetch NOAA data for all points, change the sample line accordingly.

6. Sensor Accuracy Analysis

Now let’s analyze how well each sensor performs compared to NOAA data.

6.1 Overall Accuracy Summary

#Option to Load in comparison data instead of making API call if you have the csv file saved to your working directory
# library(readr)
# comparison_data <- read_csv("CDT_Weather_Comparison.csv")

accuracy_summary <- comparison_data %>%
  group_by(Device) %>%
  summarise(
    n = n(),
    # Temperature accuracy
    Mean_Temp_Error = mean(Temp_Diff, na.rm = TRUE),
    MAE_Temp = mean(Temp_Abs_Diff, na.rm = TRUE),
    RMSE_Temp = sqrt(mean(Temp_Diff^2, na.rm = TRUE)),
    # Humidity accuracy
    Mean_Hum_Error = mean(Hum_Diff, na.rm = TRUE),
    MAE_Hum = mean(Hum_Abs_Diff, na.rm = TRUE),
    # Pressure accuracy
    Mean_Press_Error = mean(Press_Diff, na.rm = TRUE),
    MAE_Press = mean(Press_Abs_Diff, na.rm = TRUE)
  )

knitr::kable(accuracy_summary, digits = 2, 
             caption = "Sensor Accuracy Comparison (Lower values = better accuracy)")

Sensor Accuracy Comparison (Lower values = better accuracy)
Device	n	Mean_Temp_Error	MAE_Temp	RMSE_Temp	Mean_Hum_Error	MAE_Hum	Mean_Press_Error	MAE_Press
Purple	45	0.84	3.89	4.65	6.71	9.56	-4.23	4.31
Red	55	-0.83	8.30	11.59	3.64	11.93	-7.84	7.98

Interpretation: - Mean Error: Average difference (positive = sensor reads higher than NOAA) - MAE (Mean Absolute Error): Average magnitude of errors (ignoring direction) - RMSE (Root Mean Square Error): Emphasizes larger errors

6.2 Humidity Error Distribution

ggplot(comparison_data, aes(x = Hum_Diff, color = Device, fill = Device)) +
  geom_density(alpha = 0.3, linewidth = 1.2) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "black", linewidth = 1) +
  labs(title = "Humidity Error Distribution (Sensor - NOAA)",
       subtitle = "Curves centered near 0 indicate accurate measurements",
       x = "Humidity Error (%)",
       y = "Density") +
  theme_minimal() +
  scale_fill_manual(values = c("Red" = "red", "Purple" = "purple")) +
  scale_color_manual(values = c("Red" = "darkred", "Purple" = "darkviolet"))

6.3 Temperature Error Distribution

ggplot(comparison_data, aes(x = Temp_Diff, color = Device, fill = Device)) +
  geom_density(alpha = 0.3, linewidth = 1.2) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "black", linewidth = 1) +
  labs(title = "Temperature Error Distribution (Sensor - NOAA)",
       subtitle = "Curves centered near 0 indicate accurate measurements",
       x = "Temperature Error (°F)",
       y = "Density") +
  theme_minimal() +
  scale_fill_manual(values = c("Red" = "red", "Purple" = "purple")) +
  scale_color_manual(values = c("Red" = "darkred", "Purple" = "darkviolet"))

Exercise 2

Exercise 2: Modify the code above to create Pressure difference distributions instead of Temperature. Change the x-axis variable and update the title and x-axis label accordingly. Hint: look at what the column names for comparison_data are.

## put your exercise 1 code here

6.4 Temperature Error Box Plot

ggplot(comparison_data, aes(x = Device, y = Temp_Abs_Diff, fill = Device)) +
  geom_boxplot() +
  labs(title = "Absolute Temperature Error by Device",
       subtitle = "Lower values and tighter boxes indicate better accuracy",
       x = "Device",
       y = "Absolute Temperature Error (°F)") +
  theme_minimal() +
  scale_fill_manual(values = c("Red" = "red", "Purple" = "purple"))

Exercise 3

Exercise 3: Modify the code above to create a box plot showing the Pressure absolute difference distribution instead of Temperature. Change the y-axis variable and update the title and y-axis label accordingly.

## put your exercise 2 code here

6.5 Error Count by Device

# Categorize errors as small medium and large. change according to what might make the most sense
comparison_data <- comparison_data %>%
  mutate(
    Temp_Error_Category = case_when(
      Temp_Abs_Diff < 3 ~ "Small (<3°F)",
      Temp_Abs_Diff < 6 ~ "Medium (3-6°F)",
      TRUE ~ "Large (>6°F)"
    )
  )

error_counts <- comparison_data %>%
  count(Device, Temp_Error_Category) %>%
  group_by(Device) %>%
  mutate(Percentage = (n / sum(n)) * 100)

ggplot(error_counts, aes(x = Device, y = Percentage, fill = Temp_Error_Category)) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_text(aes(label = paste0(round(Percentage, 1), "%")), 
            position = position_dodge(width = 0.9), vjust = -0.5, size = 3) +
  labs(title = "Temperature Error Categories by Device",
       subtitle = "Higher percentage of 'Small' errors indicates better sensor",
       x = "Device",
       y = "Percentage of Measurements",
       fill = "Error Size") +
  theme_minimal() +
  scale_fill_brewer(palette = "RdYlGn", direction = -1)

Exercise 4

Exercise 4: Modify the code above to create bar charts showing the error sizes for Pressure absolute differences. Update the title accordingly.

## put your exercise 3 code here

6.6 Interactive Temperature Comparison

# Create interactive scatter plot
p <- ggplot(comparison_data, aes(x = NOAA_Temp_F, y = Temp_F, color = Device,
                                 text = paste0(
                                   "Device: ", Device, "<br>",
                                   "Sensor Temp: ", round(Temp_F, 2), "°F<br>",
                                   "NOAA Temp: ", round(NOAA_Temp_F, 2), "°F<br>",
                                   "Difference: ", round(Temp_Diff, 2), "°F<br>",
                                   "% Difference: ", round(Temp_Pct_Diff, 2), "%<br>",
                                   "Date/Time: ", format(TimeStampFull, "%Y-%m-%d %H:%M")
                                 ))) +
  geom_point(size = 3, alpha = 0.7) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "black", linewidth = 1) +
  scale_color_manual(values = c("Red" = "red", "Purple" = "purple")) +
  theme_minimal() +
   labs(title = "Sensor Temperature vs NOAA Temperature",
       subtitle = "Points closer to the diagonal line indicate better accuracy",
       x = "NOAA Temperature (°F)",
       y = "Sensor Temperature (°F)")

ggplotly(p, tooltip = "text") %>%
  layout(hovermode = "closest")

Exercise 5

Exercise 5: Create a similar interactive scatter plot for Humidity (Sensor Humidity vs NOAA Humidity).

## put your exercise 4 code here

7. Error Pattern Analysis

Let’s investigate if there are patterns to the measurement errors based on date. Do the sensors underestimate or overestimate the noaa weather data?

7.1 Errors by Date

# Group by date to see if certain days have more errors
daily_errors <- comparison_data %>%
  filter(Device =="Red") %>%
  group_by(Date, Device) %>%
  summarise(
    Mean_Temp_Error = mean(Temp_Diff, na.rm = TRUE),
    n = n(),
    .groups = "drop"
  )

ggplot(daily_errors, aes(x = Date, y = Mean_Temp_Error, color = Device, size = n)) +
  geom_point(alpha = 0.7) +
  geom_line(aes(group = Device), linewidth = 1) +
  labs(title = "Mean Temperature Error by Date",
       subtitle = "Point size indicates number of measurements that day",
       x = "Date",
       y = "Mean Absolute Error (°F)",
       size = "Number of\nMeasurements") +
  theme_minimal() +
  scale_color_manual(values = c("Red" = "red", "Purple" = "purple"))

Exercise 6

Exercise 6: Create a similar mean absolute temperature error by date plot for the Purple device.

## put your exercise 5 code here

7.2 Positive vs Negative Error Analysis

# Count positive (overestimate) vs negative (underestimate) errors
error_direction <- comparison_data %>%
  mutate(Error_Direction = ifelse(Temp_Diff > 0, "Overestimate", "Underestimate")) %>%
  count(Device, Error_Direction) %>%
  group_by(Device) %>%
  mutate(Percentage = (n / sum(n)) * 100)

ggplot(error_direction, aes(x = Device, y = Percentage, fill = Error_Direction)) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_text(aes(label = paste0(round(Percentage, 1), "%")), 
            position = position_dodge(width = 0.9), vjust = -0.5) +
  labs(title = "Direction of Temperature Errors",
       subtitle = "Does the sensor tend to read high or low?",
       x = "Device",
       y = "Percentage of Measurements",
       fill = "Error Direction") +
  theme_minimal() +
  scale_fill_manual(values = c("Overestimate" = "red", "Underestimate" = "blue"))

8. Error Visualization Map

Now let’s create an interactive map that shows over/underestimates for temperature, humidity, and pressure:

# Create error classification for visualization
comparison_data <- comparison_data %>%
  mutate(
    Temp_Error_Type = case_when(
      Temp_Diff > 1 ~ "Overestimate",
      Temp_Diff < -1 ~ "Underestimate",
      TRUE ~ "Accurate"
    ),
    Hum_Error_Type = case_when(
      Hum_Diff > 5 ~ "Overestimate",
      Hum_Diff < -5 ~ "Underestimate",
      TRUE ~ "Accurate"
    ),
    Press_Error_Type = case_when(
      Press_Diff > 2 ~ "Overestimate",
      Press_Diff < -2 ~ "Underestimate",
      TRUE ~ "Accurate"
    )
  )

# Create color palettes for each measurement type
temp_error_colors <- colorFactor(
  palette = c("blue", "green", "red"),
  domain = c("Underestimate", "Accurate", "Overestimate")
)

hum_error_colors <- colorFactor(
  palette = c("blue", "green", "red"),
  domain = c("Underestimate", "Accurate", "Overestimate")
)

press_error_colors <- colorFactor(
  palette = c("blue", "green", "red"),
  domain = c("Underestimate", "Accurate", "Overestimate")
)

# Create the interactive error map
error_map <- leaflet(comparison_data) %>%
  # Add base layers
  addProviderTiles(providers$Esri.WorldTopoMap, group = "ESRI Topo") %>%
  addProviderTiles("Esri.WorldImagery", group = "Satellite") %>%
  addWMSTiles(
    baseUrl = "https://basemap.nationalmap.gov/arcgis/services/USGSTopo/MapServer/WMSServer?",
    layers = "0",
    options = WMSTileOptions(format = "image/png", transparent = TRUE),
    attribution = "USGS The National Map",
    group = "USGS Topo"  
  ) %>%
  addProviderTiles(providers$OpenStreetMap, group = "OpenStreetMap") %>%
  
  # Temperature error markers
  addCircleMarkers(
    lng = ~Lon,
    lat = ~Lat,
    radius = 7,
    color = ~temp_error_colors(Temp_Error_Type),
    fillOpacity = 0.7,
    stroke = TRUE,
    weight = 2,
    group = "Temperature Errors",
    label = ~paste0(
      "Device: ", Device, 
      " | ", Temp_Error_Type,
      " | Error: ", round(Temp_Diff, 1), "°F"
    ),
    popup = ~paste0(
      "<strong>Device:</strong> ", Device, "<br/>",
      "<strong>Temperature Error:</strong> ", Temp_Error_Type, "<br/>",
      "Sensor: ", round(Temp_F, 1), "°F<br/>",
      "NOAA: ", round(NOAA_Temp_F, 1), "°F<br/>",
      "Difference: ", round(Temp_Diff, 1), "°F<br/>",
      "<strong>Time:</strong> ", format(TimeStampFull, "%Y-%m-%d %H:%M")
    )
  ) %>%
  
  # Humidity error markers
  addCircleMarkers(
    lng = ~Lon,
    lat = ~Lat,
    radius = 7,
    color = ~hum_error_colors(Hum_Error_Type),
    fillOpacity = 0.7,
    stroke = TRUE,
    weight = 2,
    group = "Humidity Errors",
    label = ~paste0(
      "Device: ", Device, 
      " | ", Hum_Error_Type,
      " | Error: ", round(Hum_Diff, 1), "%"
    ),
    popup = ~paste0(
      "<strong>Device:</strong> ", Device, "<br/>",
      "<strong>Humidity Error:</strong> ", Hum_Error_Type, "<br/>",
      "Sensor: ", Hum, "%<br/>",
      "NOAA: ", round(NOAA_Humidity, 1), "%<br/>",
      "Difference: ", round(Hum_Diff, 1), "%<br/>",
      "<strong>Time:</strong> ", format(TimeStampFull, "%Y-%m-%d %H:%M")
    )
  ) %>%
  
  # Pressure error markers
  addCircleMarkers(
    lng = ~Lon,
    lat = ~Lat,
    radius = 7,
    color = ~press_error_colors(Press_Error_Type),
    fillOpacity = 0.7,
    stroke = TRUE,
    weight = 2,
    group = "Pressure Errors",
    label = ~paste0(
      "Device: ", Device, 
      " | ", Press_Error_Type,
      " | Error: ", round(Press_Diff, 1), " hPa"
    ),
    popup = ~paste0(
      "<strong>Device:</strong> ", Device, "<br/>",
      "<strong>Pressure Error:</strong> ", Press_Error_Type, "<br/>",
      "Sensor: ", Press, " hPa<br/>",
      "NOAA: ", round(NOAA_Pressure_hPa, 1), " hPa<br/>",
      "Difference: ", round(Press_Diff, 1), " hPa<br/>",
      "<strong>Time:</strong> ", format(TimeStampFull, "%Y-%m-%d %H:%M")
    )
  ) %>%
  
  # Add layer controls
  addLayersControl(
    baseGroups = c("ESRI Topo", "Satellite", "USGS Topo", "OpenStreetMap"),
    overlayGroups = c("Temperature Errors", "Humidity Errors", "Pressure Errors"),
    options = layersControlOptions(collapsed = FALSE)
  ) %>%
  
  # Show only temperature errors by default
  hideGroup(c("Humidity Errors", "Pressure Errors")) %>%
  
  # Add legends
  addLegend(
    position = "bottomright",
    pal = temp_error_colors,
    values = ~Temp_Error_Type,
    title = "Error Type",
    opacity = 1,
    group = "Temperature Errors"
  ) %>%
  
  addLegend(
    position = "bottomright",
    pal = hum_error_colors,
    values = ~Hum_Error_Type,
    title = "Error Type",
    opacity = 1,
    group = "Humidity Errors"
  ) %>%
  
  addLegend(
    position = "bottomright",
    pal = press_error_colors,
    values = ~Press_Error_Type,
    title = "Error Type",
    opacity = 1,
    group = "Pressure Errors"
  ) %>%
  
  # Add scale bar
  addScaleBar(position = "bottomleft")

# Display the map
error_map

This interactive map allows you to: - Switch base layers: Choose between ESRI Topo, Satellite, USGS Topo, or OpenStreetMap - Toggle error types: Check/uncheck boxes to show Temperature, Humidity, or Pressure errors - Color coding: - Blue = Underestimate (sensor reads too low) - Green = Accurate (within threshold) - Red = Overestimate (sensor reads too high) - Hover labels: Quick view of device, error type, and magnitude - Click popups: Detailed comparison data

9. Export Enhanced Dataset

# Save the enhanced dataset with NOAA data
# write.csv(comparison_data, "CDT_Weather_Comparison.csv", row.names = FALSE)
# write_xlsx(comparison_data, "CDT_Weather_Comparison.xlsx")

cat("Analysis complete!\n")

## Analysis complete!

cat("Comparison dataset contains", nrow(comparison_data), "points with NOAA data.\n")

## Comparison dataset contains 100 points with NOAA data.

10. Summary and Conclusions

This tutorial demonstrated:

Data Import and Cleaning: Loading Excel data and properly formatting timestamps and temperatures. Checking for data accuracy and proper data labeling and making corrections as necessary
External Data Integration: Fetching NOAA weather data via API
Sensor Accuracy Analysis: Comparing sensor measurements against reference data
Statistical Analysis: Computing error metrics (MAE, RMSE) to evaluate sensor performance
Visualization: Creating distribution curves, box plots, scatter plots, interactive plots to explore sensor accuracy
Interactive Mapping: Using Leaflet with multiple base layers including satellite and USGS topo maps and also hover labels and click labels

Key Findings

Based on the comparison with NOAA data: - Total comparison points: 100 - Red device measurements: 55 - Purple device measurements: 45

Exercise 7

Exercise 7: Review the analysis comparing the red and purple meteo devices. Type your answers below each question. Provide at least three sentences per response:

7a: Which device would you say was more accurate compared to NOAA weather data (Temperature, Pressure, Humidity) and why would you say so? Be sure to mention what graph(s) or statistic(s) support your conclusion.

7b: What was something interesting you noticed about the difference between device data and the NOAA temperature data? Where there any patterns you noticed in this descriptive statistical analysis regarding the potential accuracy of these devices compared to NOAA historical data? Mention what graph(s) or statistic(s) helped you notice this pattern.

To submit this assignment, publish on RPubs and post link url on Canvas

Click the “Knit” button in RStudio
Once the HTML is generated, click “Publish” in the preview window
Select “RPubs” as the destination
Sign in or create an RPubs account
Give your document a title and click “Publish”
Copy and paste your html web address for your completed project posted on RPubs and upload the web address to canvas.

Weather Data Analysis with R

Type Your Name Here

2025-10-24