This tutorial demonstrates how to import, clean, analyze, and visualize meteorological tracking data in R. We’ll compare sensor measurements against NOAA weather data to evaluate sensor accuracy and determine which device performs better.
First, we’ll install and load all necessary packages:
# Install packages if needed (uncomment the lines below)
# install.packages(c("readxl", "dplyr", "ggplot2", "leaflet", "lubridate", "summarytools", "httr", "jsonlite", "plotly"))
library(readxl) # For reading Excel files
library(dplyr) # For data manipulation
library(ggplot2) # For plotting
library(leaflet) # For interactive maps
library(lubridate) # For date-time manipulation
library(summarytools)# For summary statistics
library(httr) # For API calls
library(jsonlite) # For JSON parsing
library(plotly) # For interactive plots
# Import the Excel file
weather_data <- read_excel("finalMTdata.xlsx")
# View the first few rows
head(weather_data)
## # A tibble: 6 × 10
## TimeStampFull Date Lat Lon alt_m Temp_F Hum Press DP
## <chr> <dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2025-09-01T09:… 2025-09-01 00:00:00 43.0 -110. 2826 22.4 20 732 -1.5
## 2 2025-09-01T09:… 2025-09-01 00:00:00 43.0 -110. 2825 22.4 20 732 -1.5
## 3 2025-09-01T09:… 2025-09-01 00:00:00 43.0 -110. 2826 22.4 20 732 -1.5
## 4 2025-09-01T09:… 2025-09-01 00:00:00 43.0 -110. 2826 22.4 20 732 -1.5
## 5 2025-09-01T09:… 2025-09-01 00:00:00 43.0 -110. 2825 22.4 20 732 -1.3
## 6 2025-09-01T09:… 2025-09-01 00:00:00 43.0 -110. 2824 22.4 20 732 -1.1
## # ℹ 1 more variable: Device <chr>
Pay special attention to the timestamp, date, and time variables:
# Convert data types
weather_data <- weather_data %>%
mutate(
# Parse ISO 8601 timestamp with timezone
TimeStampFull = ymd_hms(TimeStampFull, tz = "UTC"),
# Parse date
Date = ymd(Date),
# Ensure numeric variables are numeric
Lat = as.numeric(Lat),
Lon = as.numeric(Lon),
alt_m = as.numeric(alt_m),
Hum = as.numeric(Hum),
Press = as.numeric(Press),
DP = as.numeric(DP),
# Ensure Device is a factor
Device = as.factor(Device)
)
# Check if temperature is actually in Celsius (common issue with sensor data)
# If the column is named Temp_F but values are small (typical for C), convert to F
if (mean(weather_data$Temp_F, na.rm = TRUE) < 30) {
cat("NOTE: Temperature values appear to be in Celsius. Converting to Fahrenheit...\n")
weather_data <- weather_data %>%
mutate(
Temp_C = Temp_F, # Save original Celsius values
Temp_F = (Temp_F * 9/5) + 32 # Convert to Fahrenheit
)
cat("Conversion complete. Original Celsius values saved in 'Temp_C' column.\n\n")
} else {
# Ensure Temp_F is numeric
weather_data$Temp_F <- as.numeric(weather_data$Temp_F)
}
## NOTE: Temperature values appear to be in Celsius. Converting to Fahrenheit...
## Conversion complete. Original Celsius values saved in 'Temp_C' column.
# Check the structure
str(weather_data)
## tibble [14,441 × 11] (S3: tbl_df/tbl/data.frame)
## $ TimeStampFull: POSIXct[1:14441], format: "2025-09-01 15:48:18" "2025-09-01 15:48:21" ...
## $ Date : Date[1:14441], format: "2025-09-01" "2025-09-01" ...
## $ Lat : num [1:14441] 43 43 43 43 43 ...
## $ Lon : num [1:14441] -110 -110 -110 -110 -110 ...
## $ alt_m : num [1:14441] 2826 2825 2826 2826 2825 ...
## $ Temp_F : num [1:14441] 72.3 72.3 72.3 72.3 72.3 ...
## $ Hum : num [1:14441] 20 20 20 20 20 20 20 20 20 22 ...
## $ Press : num [1:14441] 732 732 732 732 732 732 732 732 733 732 ...
## $ DP : num [1:14441] -1.5 -1.5 -1.5 -1.5 -1.3 -1.1 -1.9 -2.3 -2.6 -2.1 ...
## $ Device : Factor w/ 3 levels "Device","Purple",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Temp_C : num [1:14441] 22.4 22.4 22.4 22.4 22.4 22.4 22 21.5 20.6 19.9 ...
# Display temperature range to verify
cat("\nTemperature range after conversion:\n")
##
## Temperature range after conversion:
cat("Min:", round(min(weather_data$Temp_F, na.rm = TRUE), 1), "°F\n")
## Min: 29.1 °F
cat("Max:", round(max(weather_data$Temp_F, na.rm = TRUE), 1), "°F\n")
## Max: 92.5 °F
cat("Mean:", round(mean(weather_data$Temp_F, na.rm = TRUE), 1), "°F\n")
## Mean: 52.1 °F
# Load necessary library
library(leaflet)
# Clean up and standardize the Device column
weather_data$Device <- trimws(weather_data$Device) # remove extra spaces
weather_data$Device <- factor(weather_data$Device, levels = c("Red", "Purple"))
# Create color palette for devices (Red and Purple)
device_colors <- colorFactor(
palette = c("red", "purple"),
domain = levels(weather_data$Device)
)
# Create the leaflet map
leaflet_map <- leaflet(weather_data) %>%
# Add multiple base layers
addProviderTiles(providers$OpenTopoMap, group = "OpenTopoMap") %>%
addProviderTiles(providers$Esri.WorldTopoMap, group = "ESRI Topo") %>%
addProviderTiles("Esri.WorldImagery", group = "Satellite") %>%
addWMSTiles(
baseUrl = "https://basemap.nationalmap.gov/arcgis/services/USGSTopo/MapServer/WMSServer?",
layers = "0",
options = WMSTileOptions(format = "image/png", transparent = TRUE),
attribution = "USGS The National Map",
group = "USGS Topo"
) %>%
addProviderTiles(providers$OpenStreetMap, group = "OpenStreetMap") %>%
# Add markers with popups and labels
addCircleMarkers(
lng = ~Lon,
lat = ~Lat,
radius = 5,
color = ~device_colors(Device),
fillOpacity = 0.7,
stroke = TRUE,
weight = 1,
label = ~paste0(
"Device: ", Device,
" | Temp: ", round(Temp_F, 1), "°F",
" | Hum: ", Hum, "%",
" | Press: ", Press, " hPa",
" | DP: ", round(DP, 1), "°F"
),
popup = ~paste0(
"<strong>Device:</strong> ", Device, "<br/>",
"<strong>Temperature:</strong> ", round(Temp_F, 1), "°F<br/>",
"<strong>Humidity:</strong> ", Hum, "%<br/>",
"<strong>Pressure:</strong> ", Press, " hPa<br/>",
"<strong>Dew Point:</strong> ", round(DP, 1), "°F<br/>",
"<strong>Altitude:</strong> ", round(alt_m, 1), " m<br/>",
"<strong>Time:</strong> ", format(TimeStampFull, "%Y-%m-%d %H:%M:%S")
)
) %>%
# Add layer control
addLayersControl(
baseGroups = c("USGS Topo","OpenTopoMap", "ESRI Topo", "Satellite", "OpenStreetMap"),
options = layersControlOptions(collapsed = FALSE)
) %>%
# Add legend
addLegend(
position = "bottomright",
pal = device_colors,
values = ~Device,
title = "Device",
opacity = 1
) %>%
# Add scale bar
addScaleBar(position = "bottomleft")
# Display the map
leaflet_map
Exercise 1: In the map above, try changing the base layer that loads by default. Currently it loads OpenTopoMap first - can you make it load a different layer first instead? (Hint: The order in which you add the provider tiles matters!)
# Display dataset dimensions
cat("Dataset dimensions:", nrow(weather_data), "rows x", ncol(weather_data), "columns\n\n")
## Dataset dimensions: 14441 rows x 11 columns
# Check date range
cat("Dataset date range:\n")
## Dataset date range:
cat("Earliest:", as.character(min(weather_data$Date, na.rm = TRUE)), "\n")
## Earliest: 2025-09-01
cat("Latest:", as.character(max(weather_data$Date, na.rm = TRUE)), "\n\n")
## Latest: 2025-09-11
# Summary of all variables
summary(weather_data)
## TimeStampFull Date Lat
## Min. :2025-09-01 15:48:18 Min. :2025-09-01 Min. :42.52
## 1st Qu.:2025-09-03 05:09:43 1st Qu.:2025-09-02 1st Qu.:42.64
## Median :2025-09-04 22:42:13 Median :2025-09-04 Median :42.86
## Mean :2025-09-06 06:06:15 Mean :2025-09-05 Mean :42.82
## 3rd Qu.:2025-09-09 20:54:02 3rd Qu.:2025-09-09 3rd Qu.:42.98
## Max. :2025-09-12 00:47:54 Max. :2025-09-11 Max. :43.02
## NA's :1 NA's :1 NA's :1
## Lon alt_m Temp_F Hum
## Min. :-109.8 Min. :2669 Min. :29.12 Min. :13.00
## 1st Qu.:-109.6 1st Qu.:2839 1st Qu.:41.00 1st Qu.:32.00
## Median :-109.4 Median :2991 Median :50.90 Median :53.00
## Mean :-109.4 Mean :2976 Mean :52.06 Mean :52.78
## 3rd Qu.:-109.2 3rd Qu.:3119 3rd Qu.:63.55 3rd Qu.:71.00
## Max. :-109.0 Max. :3320 Max. :92.48 Max. :96.00
## NA's :1 NA's :1 NA's :1 NA's :1
## Press DP Device Temp_C
## Min. :691.0 Min. :-5.6000 Red :8576 Min. :-1.60
## 1st Qu.:708.0 1st Qu.:-1.3000 Purple:5864 1st Qu.: 5.00
## Median :716.0 Median : 0.2000 NA's : 1 Median :10.50
## Mean :717.4 Mean : 0.5241 Mean :11.14
## 3rd Qu.:727.0 3rd Qu.: 2.3000 3rd Qu.:17.52
## Max. :740.0 Max. :11.8000 Max. :33.60
## NA's :1 NA's :1 NA's :1
# Summary statistics for numeric variables by device
weather_data %>%
group_by(Device) %>%
summarise(
n = n(),
Mean_Temp_F = mean(Temp_F, na.rm = TRUE),
SD_Temp_F = sd(Temp_F, na.rm = TRUE),
Mean_Humidity = mean(Hum, na.rm = TRUE),
SD_Humidity = sd(Hum, na.rm = TRUE),
Mean_Pressure = mean(Press, na.rm = TRUE),
SD_Pressure = sd(Press, na.rm = TRUE),
Mean_Altitude_m = mean(alt_m, na.rm = TRUE),
Min_Altitude_m = min(alt_m, na.rm = TRUE),
Max_Altitude_m = max(alt_m, na.rm = TRUE)
) %>%
knitr::kable(digits = 2, caption = "Summary Statistics by Device")
| Device | n | Mean_Temp_F | SD_Temp_F | Mean_Humidity | SD_Humidity | Mean_Pressure | SD_Pressure | Mean_Altitude_m | Min_Altitude_m | Max_Altitude_m |
|---|---|---|---|---|---|---|---|---|---|---|
| Red | 8576 | 51.23 | 12.75 | 53.07 | 20.94 | 717.93 | 10.09 | 2974.97 | 2766 | 3320 |
| Purple | 5864 | 53.27 | 13.74 | 52.36 | 22.83 | 716.74 | 10.90 | 2976.57 | 2669 | 3318 |
| NA | 1 | NaN | NA | NaN | NA | NaN | NA | NaN | Inf | -Inf |
We’ll use the Open-Meteo API to retrieve historical weather data. There is a limit on how much data you can take, if you exceed this you will be cut-off for a time period. If we want to eventually download all the data corresponding for all of our points we will need to be a bit strategic about doing it slower or in chunks. Here are the terms for the free tier of Open_Meteo: Less than 10’000 API calls per day, 5’000 per hour and 600 per minute. For more information visit: https://open-meteo.com/
# Function to get weather data from Open-Meteo API
get_weather_data <- function(lat, lon, datetime, debug = FALSE) {
# Convert datetime to date
date <- as.Date(datetime)
# Build API URL
base_url <- "https://archive-api.open-meteo.com/v1/archive"
url <- paste0(
base_url,
"?latitude=", lat,
"&longitude=", lon,
"&start_date=", date,
"&end_date=", date,
"&hourly=temperature_2m,relative_humidity_2m,surface_pressure",
"&temperature_unit=fahrenheit"
)
# Default return values
default_result <- list(
NOAA_Temp_F = NA_real_,
NOAA_Humidity = NA_real_,
NOAA_Pressure_hPa = NA_real_
)
# Make API request
result <- tryCatch({
response <- GET(url)
if (debug) {
cat("URL:", url, "\n")
cat("Status:", status_code(response), "\n")
}
if (status_code(response) == 200) {
data <- content(response, "parsed")
# Check if data exists
if (is.null(data$hourly)) {
if (debug) cat("No hourly data in response\n")
return(default_result)
}
if (is.null(data$hourly$time) || length(data$hourly$time) == 0) {
if (debug) cat("No time data\n")
return(default_result)
}
if (is.null(data$hourly$temperature_2m) || length(data$hourly$temperature_2m) == 0) {
if (debug) cat("No temperature data\n")
return(default_result)
}
hourly_data <- data$hourly
# Unwrap lists if needed
hourly_data$time <- unlist(hourly_data$time)
hourly_data$temperature_2m <- unlist(hourly_data$temperature_2m)
hourly_data$relative_humidity_2m <- unlist(hourly_data$relative_humidity_2m)
hourly_data$surface_pressure <- unlist(hourly_data$surface_pressure)
if (debug) {
cat("Number of hourly records:", length(hourly_data$time), "\n")
cat("First time value:", hourly_data$time[1], "\n")
cat("Target datetime:", as.character(datetime), "\n")
cat("Target hour:", hour(datetime), "\n")
}
# Get the hour from our datetime (0-23)
target_hour <- hour(datetime)
# The API returns 24 hours (0-23) for the requested date
# Index is target_hour + 1 (since R is 1-indexed)
idx <- target_hour + 1
if (debug) {
cat("Using index:", idx, "\n")
cat("Temperature at index:", hourly_data$temperature_2m[idx], "\n")
}
# Safely extract values with bounds checking
temp_val <- if (idx > 0 && idx <= length(hourly_data$temperature_2m)) {
val <- hourly_data$temperature_2m[idx]
if (!is.null(val)) as.numeric(val) else NA_real_
} else NA_real_
hum_val <- if (idx > 0 && idx <= length(hourly_data$relative_humidity_2m)) {
val <- hourly_data$relative_humidity_2m[idx]
if (!is.null(val)) as.numeric(val) else NA_real_
} else NA_real_
press_val <- if (idx > 0 && idx <= length(hourly_data$surface_pressure)) {
val <- hourly_data$surface_pressure[idx]
if (!is.null(val)) as.numeric(val) else NA_real_
} else NA_real_
if (debug) {
cat("Extracted - Temp:", temp_val, "Hum:", hum_val, "Press:", press_val, "\n\n")
}
return(list(
NOAA_Temp_F = temp_val,
NOAA_Humidity = hum_val,
NOAA_Pressure_hPa = press_val
))
} else {
if (debug) cat("API returned error status\n")
return(default_result)
}
}, error = function(e) {
if (debug) cat("Error:", e$message, "\n")
return(default_result)
})
return(result)
}
# Sample 1000 points to demonstrate (to avoid excessive API calls and wait time)
set.seed(123)
sample_indices <- sample(1:nrow(weather_data), min(100, nrow(weather_data)))
# Initialize columns
weather_data$NOAA_Temp_F <- NA
weather_data$NOAA_Humidity <- NA
weather_data$NOAA_Pressure_hPa <- NA
# Test with sample point to see if API is working
cat("Testing API with first sample point...\n")
## Testing API with first sample point...
test_result <- get_weather_data(
weather_data$Lat[sample_indices[1]],
weather_data$Lon[sample_indices[1]],
weather_data$TimeStampFull[sample_indices[1]],
debug = TRUE
)
## URL: https://archive-api.open-meteo.com/v1/archive?latitude=42.983403&longitude=-109.58012&start_date=2025-09-03&end_date=2025-09-03&hourly=temperature_2m,relative_humidity_2m,surface_pressure&temperature_unit=fahrenheit
## Status: 200
## Number of hourly records: 24
## First time value: 2025-09-03T00:00
## Target datetime: 2025-09-03 03:30:02
## Target hour: 3
## Using index: 4
## Temperature at index: 46.1
## Extracted - Temp: 46.1 Hum: 61 Press: 710.4
if (!is.na(test_result$NOAA_Temp_F)) {
cat("SUCCESS! API is returning data.\n\n")
} else {
cat("WARNING: API test failed. Check the debug output above.\n\n")
}
## SUCCESS! API is returning data.
# Fetch NOAA data for sample points
cat("Fetching NOAA weather data for", length(sample_indices), "sample points...\n")
## Fetching NOAA weather data for 100 sample points...
cat("This will take approximately", length(sample_indices) * 0.5, "seconds.\n\n")
## This will take approximately 50 seconds.
# Counter for successful retrievals
success_count <- 0
fail_count <- 0
for (i in sample_indices) {
res <- get_weather_data(
weather_data$Lat[i],
weather_data$Lon[i],
weather_data$TimeStampFull[i]
)
weather_data$NOAA_Temp_F[i] <- res$NOAA_Temp_F
weather_data$NOAA_Humidity[i] <- res$NOAA_Humidity
weather_data$NOAA_Pressure_hPa[i] <- res$NOAA_Pressure_hPa
# Track success/failure
if (!is.na(weather_data$NOAA_Temp_F[i])) {
success_count <- success_count + 1
} else {
fail_count <- fail_count + 1
}
# add countdown feedback to know how long data retrieval is taking
if (i %% 10 == 0) {
cat("Processing point", which(sample_indices == i), "of", length(sample_indices),
"(", round(which(sample_indices == i) / length(sample_indices) * 100, 1), "% complete )\n")
}
Sys.sleep(0.5)
}
## Processing point 26 of 100 ( 26 % complete )
## Processing point 32 of 100 ( 32 % complete )
## Processing point 39 of 100 ( 39 % complete )
## Processing point 58 of 100 ( 58 % complete )
## Processing point 61 of 100 ( 61 % complete )
## Processing point 99 of 100 ( 99 % complete )
cat("\nNOAA data fetch complete!\n")
##
## NOAA data fetch complete!
cat("Successfully retrieved:", success_count, "of", length(sample_indices), "sample points\n")
## Successfully retrieved: 100 of 100 sample points
cat("Failed to retrieve:", fail_count, "points\n\n")
## Failed to retrieve: 0 points
# Create comparison dataset with only points that have NOAA data
comparison_data <- weather_data %>%
filter(!is.na(NOAA_Temp_F)) %>%
mutate(
Temp_Diff = Temp_F - NOAA_Temp_F,
Temp_Pct_Diff = ((Temp_F - NOAA_Temp_F) / NOAA_Temp_F) * 100,
Temp_Abs_Diff = abs(Temp_Diff),
Hum_Diff = Hum - NOAA_Humidity,
Hum_Abs_Diff = abs(Hum_Diff),
Press_Diff = Press - NOAA_Pressure_hPa,
Press_Abs_Diff = abs(Press_Diff)
)
cat("Comparison dataset created with", nrow(comparison_data), "points\n")
## Comparison dataset created with 100 points
Note: We’re sampling 100 points to demonstrate the API integration without excessive wait times. To fetch NOAA data for all points, change the sample line accordingly.
Now let’s analyze how well each sensor performs compared to NOAA data.
#Option to Load in comparison data instead of making API call if you have the csv file saved to your working directory
# library(readr)
# comparison_data <- read_csv("CDT_Weather_Comparison.csv")
accuracy_summary <- comparison_data %>%
group_by(Device) %>%
summarise(
n = n(),
# Temperature accuracy
Mean_Temp_Error = mean(Temp_Diff, na.rm = TRUE),
MAE_Temp = mean(Temp_Abs_Diff, na.rm = TRUE),
RMSE_Temp = sqrt(mean(Temp_Diff^2, na.rm = TRUE)),
# Humidity accuracy
Mean_Hum_Error = mean(Hum_Diff, na.rm = TRUE),
MAE_Hum = mean(Hum_Abs_Diff, na.rm = TRUE),
# Pressure accuracy
Mean_Press_Error = mean(Press_Diff, na.rm = TRUE),
MAE_Press = mean(Press_Abs_Diff, na.rm = TRUE)
)
knitr::kable(accuracy_summary, digits = 2,
caption = "Sensor Accuracy Comparison (Lower values = better accuracy)")
| Device | n | Mean_Temp_Error | MAE_Temp | RMSE_Temp | Mean_Hum_Error | MAE_Hum | Mean_Press_Error | MAE_Press |
|---|---|---|---|---|---|---|---|---|
| Red | 62 | 0.10 | 6.76 | 10.10 | 4.37 | 12.11 | -4.75 | 4.93 |
| Purple | 38 | -0.48 | 5.54 | 6.66 | 11.97 | 14.82 | -4.17 | 4.28 |
Interpretation: - Mean Error: Average difference (positive = sensor reads higher than NOAA) - MAE (Mean Absolute Error): Average magnitude of errors (ignoring direction) - RMSE (Root Mean Square Error): Emphasizes larger errors
ggplot(comparison_data, aes(x = Hum_Diff, color = Device, fill = Device)) +
geom_density(alpha = 0.3, linewidth = 1.2) +
geom_vline(xintercept = 0, linetype = "dashed", color = "black", linewidth = 1) +
labs(title = "Humidity Error Distribution (Sensor - NOAA)",
subtitle = "Curves centered near 0 indicate accurate measurements",
x = "Humidity Error (%)",
y = "Density") +
theme_minimal() +
scale_fill_manual(values = c("Red" = "red", "Purple" = "purple")) +
scale_color_manual(values = c("Red" = "darkred", "Purple" = "darkviolet"))
ggplot(comparison_data, aes(x = Temp_Diff, color = Device, fill = Device)) +
geom_density(alpha = 0.3, linewidth = 1.2) +
geom_vline(xintercept = 0, linetype = "dashed", color = "black", linewidth = 1) +
labs(title = "Temperature Error Distribution (Sensor - NOAA)",
subtitle = "Curves centered near 0 indicate accurate measurements",
x = "Temperature Error (°F)",
y = "Density") +
theme_minimal() +
scale_fill_manual(values = c("Red" = "red", "purple" = "purple")) +
scale_color_manual(values = c("Red" = "darkred", "purple" = "darkviolet"))
Exercise 2: Modify the code above to create Pressure difference distributions instead of Temperature. Change the x-axis variable and update the title and x-axis label accordingly. Hint: look at what the column names for comparison_data are.
## put your exercise 1 code here
ggplot(comparison_data, aes(x = Press_Diff, color = Device, fill = Device)) +
geom_density(alpha = 0.3, linewidth = 1.2) +
geom_vline(xintercept = 0, linetype = "dashed", color = "black", linewidth = 1) +
labs(title = "Pressure Error Distribution (Sensor - NOAA)",
subtitle = "Curves centered near 0 indicate accurate measurements",
x = "Pressure Error (mBar)",
y = "Density") +
theme_minimal() +
scale_fill_manual(values = c("Red" = "red", "Purple" = "purple")) +
scale_color_manual(values = c("Red" = "darkred", "Purple" = "darkviolet"))
ggplot(comparison_data, aes(x = Device, y = Temp_Abs_Diff, fill = Device)) +
geom_boxplot() +
labs(title = "Absolute Temperature Error by Device",
subtitle = "Lower values and tighter boxes indicate better accuracy",
x = "Device",
y = "Absolute Temperature Error (°F)") +
theme_minimal() +
scale_fill_manual(values = c("Red" = "red", "Purple" = "purple"))
Exercise 3: Modify the code above to create a box plot showing the Pressure absolute difference distribution instead of Temperature. Change the y-axis variable and update the title and y-axis label accordingly.
## put your exercise 2 code here
ggplot(comparison_data, aes(x = Device, y = Press_Abs_Diff, fill = Device)) +
geom_boxplot() +
labs(title = "Absolute Pressure Error by Device",
subtitle = "Lower values and tighter boxes indicate better accuracy",
x = "Device",
y = "Absolute Pressure Error (mBar)") +
theme_minimal() +
scale_fill_manual(values = c("Red" = "red", "Purple" = "purple"))
# Categorize errors as small medium and large. change according to what might make the most sense
comparison_data <- comparison_data %>%
mutate(
Temp_Error_Category = case_when(
Temp_Abs_Diff < 3 ~ "Small (<3°F)",
Temp_Abs_Diff < 6 ~ "Medium (3-6°F)",
TRUE ~ "Large (>6°F)"
)
)
error_counts <- comparison_data %>%
count(Device, Temp_Error_Category) %>%
group_by(Device) %>%
mutate(Percentage = (n / sum(n)) * 100)
ggplot(error_counts, aes(x = Device, y = Percentage, fill = Temp_Error_Category)) +
geom_bar(stat = "identity", position = "dodge") +
geom_text(aes(label = paste0(round(Percentage, 1), "%")),
position = position_dodge(width = 0.9), vjust = -0.5, size = 3) +
labs(title = "Temperature Error Categories by Device",
subtitle = "Higher percentage of 'Small' errors indicates better sensor",
x = "Device",
y = "Percentage of Measurements",
fill = "Error Size") +
theme_minimal() +
scale_fill_brewer(palette = "RdYlGn", direction = -1)
# Categorize errors as small medium and large. change according to what might make the most sense
comparison_data <- comparison_data %>%
mutate(
Hum_Error_Category = case_when(
Hum_Abs_Diff < 5 ~ "Small (<5%)",
Hum_Abs_Diff < 10 ~ "Medium (5-10%)",
TRUE ~ "Large (>10%)"
)
)
error_counts <- comparison_data %>%
count(Device, Hum_Error_Category) %>%
group_by(Device) %>%
mutate(Percentage = (n / sum(n)) * 100)
ggplot(error_counts, aes(x = Device, y = Percentage, fill = Hum_Error_Category)) +
geom_bar(stat = "identity", position = "dodge") +
geom_text(aes(label = paste0(round(Percentage, 1), "%")),
position = position_dodge(width = 0.9), vjust = -0.5, size = 3) +
labs(title = "Humidity Error Categories by Device",
subtitle = "Higher percentage of 'Small' errors indicates better sensor",
x = "Device",
y = "Percentage of Measurements",
fill = "Error Size") +
theme_minimal() +
scale_fill_brewer(palette = "RdYlGn", direction = -1)
Exercise 4: Modify the code above to create bar charts showing the error sizes for Pressure absolute differences. Update the title accordingly.
## put your exercise 3 code here
comparison_data <- comparison_data %>%
mutate(
Press_Error_Category = case_when(
Press_Abs_Diff < 2 ~ "Small (<3mBar)",
Press_Abs_Diff < 6 ~ "Medium (3-6mBar)",
TRUE ~ "Large (>6mBar)"
)
)
error_counts <- comparison_data %>%
count(Device, Press_Error_Category) %>%
group_by(Device) %>%
mutate(Percentage = (n / sum(n)) * 100)
ggplot(error_counts, aes(x = Device, y = Percentage, fill = Press_Error_Category)) +
geom_bar(stat = "identity", position = "dodge") +
geom_text(aes(label = paste0(round(Percentage, 1), "%")),
position = position_dodge(width = 0.9), vjust = -0.5, size = 3) +
labs(title = "Pressure Error Categories by Device",
subtitle = "Higher percentage of 'Small' errors indicates better sensor",
x = "Device",
y = "Percentage of Measurements",
fill = "Error Size") +
theme_minimal() +
scale_fill_brewer(palette = "RdYlGn", direction = -1)
# Create interactive scatter plot
p <- ggplot(comparison_data, aes(x = NOAA_Temp_F, y = Temp_F, color = Device,
text = paste0(
"Device: ", Device, "<br>",
"Sensor Temp: ", round(Temp_F, 2), "°F<br>",
"NOAA Temp: ", round(NOAA_Temp_F, 2), "°F<br>",
"Difference: ", round(Temp_Diff, 2), "°F<br>",
"% Difference: ", round(Temp_Pct_Diff, 2), "%<br>",
"Date/Time: ", format(TimeStampFull, "%Y-%m-%d %H:%M")
))) +
geom_point(size = 3, alpha = 0.7) +
geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "black", linewidth = 1) +
scale_color_manual(values = c("Red" = "red", "Purple" = "purple")) +
theme_minimal() +
labs(title = "Sensor Temperature vs NOAA Temperature",
subtitle = "Points closer to the diagonal line indicate better accuracy",
x = "NOAA Temperature (°F)",
y = "Sensor Temperature (°F)")
ggplotly(p, tooltip = "text") %>%
layout(hovermode = "closest")
Exercise 5: Create a similar interactive scatter plot for Humidity (Sensor Humidity vs NOAA Humidity).
## put your exercise 4 code here
p <- ggplot(comparison_data, aes(x = NOAA_Humidity, y = Hum, color = Device,
text = paste0(
"Device: ", Device, "<br>",
"Sensor Hum: ", round(Hum, 2), "%<br>",
"NOAA Hum: ", round(NOAA_Humidity, 2), "%<br>",
"Difference: ", round(Hum_Diff, 2), "%<br>",
"% Difference: ", round(Hum_Diff, 2), "%<br>",
"Date/Time: ", format(TimeStampFull, "%Y-%m-%d %H:%M")
))) +
geom_point(size = 3, alpha = 0.7) +
geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "black", linewidth = 1) +
scale_color_manual(values = c("Red" = "red", "Purple" = "purple")) +
theme_minimal() +
labs(title = "Sensor Humidity vs NOAA Humidity",
subtitle = "Points closer to the diagonal line indicate better accuracy",
x = "NOAA Humidity (%)",
y = "Sensor Humidity (%)")
ggplotly(p, tooltip = "text") %>%
layout(hovermode = "closest")
## put your exercise 4 code here
p <- ggplot(comparison_data, aes(x = NOAA_Pressure_hPa, y = Press, color = Device,
text = paste0(
"Device: ", Device, "<br>",
"Sensor Press: ", round(Hum, 2), "hPa<br>",
"NOAA hPa: ", round(NOAA_Pressure_hPa, 2), "hPa<br>",
"Difference: ", round(Press_Diff, 2), "hPa<br>",
"% Difference: ", round(Press_Diff, 2), "%<br>",
"Date/Time: ", format(TimeStampFull, "%Y-%m-%d %H:%M")
))) +
geom_point(size = 3, alpha = 0.7) +
geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "black", linewidth = 1) +
scale_color_manual(values = c("Red" = "red", "Purple" = "purple")) +
theme_minimal() +
labs(title = "Sensor Pressure vs NOAA Pressure",
subtitle = "Points closer to the diagonal line indicate better accuracy",
x = "NOAA Pressure (hPa)",
y = "Sensor Pressure (hPa)")
ggplotly(p, tooltip = "text") %>%
layout(hovermode = "closest")
Let’s investigate if there are patterns to the measurement errors based on date. Do the sensors underestimate or overestimate the noaa weather data?
# Group by date to see if certain days have more errors
daily_errors <- comparison_data %>%
filter(Device =="Red") %>%
group_by(Date, Device) %>%
summarise(
Mean_Temp_Error = mean(Temp_Diff, na.rm = TRUE),
n = n(),
.groups = "drop"
)
ggplot(daily_errors, aes(x = Date, y = Mean_Temp_Error, color = Device, size = n)) +
geom_point(alpha = 0.7) +
geom_line(aes(group = Device), linewidth = 1) +
labs(title = "Mean Temperature Error by Date",
subtitle = "Point size indicates number of measurements that day",
x = "Date",
y = "Mean Absolute Error (°F)",
size = "Number of\nMeasurements") +
theme_minimal() +
scale_color_manual(values = c("Red" = "red", "Purple" = "purple"))
Exercise 6: Create a similar mean absolute temperature error by date plot for the purple device.
## put your exercise 5 code here
daily_errors <- comparison_data %>%
filter(Device =="Purple") %>%
group_by(Date, Device) %>%
summarise(
Mean_Temp_Error = mean(Temp_Diff, na.rm = TRUE),
n = n(),
.groups = "drop"
)
ggplot(daily_errors, aes(x = Date, y = Mean_Temp_Error, color = Device, size = n)) +
geom_point(alpha = 0.7) +
geom_line(aes(group = Device), linewidth = 1) +
labs(title = "Mean Temperature Error by Date",
subtitle = "Point size indicates number of measurements that day",
x = "Date",
y = "Mean Absolute Error (°F)",
size = "Number of\nMeasurements") +
theme_minimal() +
scale_color_manual(values = c("Red" = "red", "Purple" = "purple"))
# Count positive (overestimate) vs negative (underestimate) errors
error_direction <- comparison_data %>%
mutate(Error_Direction = ifelse(Temp_Diff > 0, "Overestimate", "Underestimate")) %>%
count(Device, Error_Direction) %>%
group_by(Device) %>%
mutate(Percentage = (n / sum(n)) * 100)
ggplot(error_direction, aes(x = Device, y = Percentage, fill = Error_Direction)) +
geom_bar(stat = "identity", position = "dodge") +
geom_text(aes(label = paste0(round(Percentage, 1), "%")),
position = position_dodge(width = 0.9), vjust = -0.5) +
labs(title = "Direction of Temperature Errors",
subtitle = "Does the sensor tend to read high or low?",
x = "Device",
y = "Percentage of Measurements",
fill = "Error Direction") +
theme_minimal() +
scale_fill_manual(values = c("Overestimate" = "red", "Underestimate" = "blue"))
Now let’s create an interactive map that shows over/underestimates for temperature, humidity, and pressure:
# Create error classification for visualization
comparison_data <- comparison_data %>%
mutate(
Temp_Error_Type = case_when(
Temp_Diff > 1 ~ "Overestimate",
Temp_Diff < -1 ~ "Underestimate",
TRUE ~ "Accurate"
),
Hum_Error_Type = case_when(
Hum_Diff > 5 ~ "Overestimate",
Hum_Diff < -5 ~ "Underestimate",
TRUE ~ "Accurate"
),
Press_Error_Type = case_when(
Press_Diff > 2 ~ "Overestimate",
Press_Diff < -2 ~ "Underestimate",
TRUE ~ "Accurate"
)
)
# Create color palettes for each measurement type
temp_error_colors <- colorFactor(
palette = c("blue", "green", "red"),
domain = c("Underestimate", "Accurate", "Overestimate")
)
hum_error_colors <- colorFactor(
palette = c("blue", "green", "red"),
domain = c("Underestimate", "Accurate", "Overestimate")
)
press_error_colors <- colorFactor(
palette = c("blue", "green", "red"),
domain = c("Underestimate", "Accurate", "Overestimate")
)
# Create the interactive error map
error_map <- leaflet(comparison_data) %>%
# Add base layers
addProviderTiles(providers$Esri.WorldTopoMap, group = "ESRI Topo") %>%
addProviderTiles("Esri.WorldImagery", group = "Satellite") %>%
addWMSTiles(
baseUrl = "https://basemap.nationalmap.gov/arcgis/services/USGSTopo/MapServer/WMSServer?",
layers = "0",
options = WMSTileOptions(format = "image/png", transparent = TRUE),
attribution = "USGS The National Map",
group = "USGS Topo"
) %>%
addProviderTiles(providers$OpenStreetMap, group = "OpenStreetMap") %>%
# Temperature error markers
addCircleMarkers(
lng = ~Lon,
lat = ~Lat,
radius = 7,
color = ~temp_error_colors(Temp_Error_Type),
fillOpacity = 0.7,
stroke = TRUE,
weight = 2,
group = "Temperature Errors",
label = ~paste0(
"Device: ", Device,
" | ", Temp_Error_Type,
" | Error: ", round(Temp_Diff, 1), "°F"
),
popup = ~paste0(
"<strong>Device:</strong> ", Device, "<br/>",
"<strong>Temperature Error:</strong> ", Temp_Error_Type, "<br/>",
"Sensor: ", round(Temp_F, 1), "°F<br/>",
"NOAA: ", round(NOAA_Temp_F, 1), "°F<br/>",
"Difference: ", round(Temp_Diff, 1), "°F<br/>",
"<strong>Time:</strong> ", format(TimeStampFull, "%Y-%m-%d %H:%M")
)
) %>%
# Humidity error markers
addCircleMarkers(
lng = ~Lon,
lat = ~Lat,
radius = 7,
color = ~hum_error_colors(Hum_Error_Type),
fillOpacity = 0.7,
stroke = TRUE,
weight = 2,
group = "Humidity Errors",
label = ~paste0(
"Device: ", Device,
" | ", Hum_Error_Type,
" | Error: ", round(Hum_Diff, 1), "%"
),
popup = ~paste0(
"<strong>Device:</strong> ", Device, "<br/>",
"<strong>Humidity Error:</strong> ", Hum_Error_Type, "<br/>",
"Sensor: ", Hum, "%<br/>",
"NOAA: ", round(NOAA_Humidity, 1), "%<br/>",
"Difference: ", round(Hum_Diff, 1), "%<br/>",
"<strong>Time:</strong> ", format(TimeStampFull, "%Y-%m-%d %H:%M")
)
) %>%
# Pressure error markers
addCircleMarkers(
lng = ~Lon,
lat = ~Lat,
radius = 7,
color = ~press_error_colors(Press_Error_Type),
fillOpacity = 0.7,
stroke = TRUE,
weight = 2,
group = "Pressure Errors",
label = ~paste0(
"Device: ", Device,
" | ", Press_Error_Type,
" | Error: ", round(Press_Diff, 1), " hPa"
),
popup = ~paste0(
"<strong>Device:</strong> ", Device, "<br/>",
"<strong>Pressure Error:</strong> ", Press_Error_Type, "<br/>",
"Sensor: ", Press, " hPa<br/>",
"NOAA: ", round(NOAA_Pressure_hPa, 1), " hPa<br/>",
"Difference: ", round(Press_Diff, 1), " hPa<br/>",
"<strong>Time:</strong> ", format(TimeStampFull, "%Y-%m-%d %H:%M")
)
) %>%
# Add layer controls
addLayersControl(
baseGroups = c( "USGS Topo", "ESRI Topo", "Satellite", "USGS Topo", "OpenStreetMap"),
overlayGroups = c("Temperature Errors", "Humidity Errors", "Pressure Errors"),
options = layersControlOptions(collapsed = FALSE)
) %>%
# Show only temperature errors by default
hideGroup(c("Humidity Errors", "Pressure Errors")) %>%
# Add legends
addLegend(
position = "bottomright",
pal = temp_error_colors,
values = ~Temp_Error_Type,
title = "Error Type",
opacity = 1,
group = "Temperature Errors"
) %>%
addLegend(
position = "bottomright",
pal = hum_error_colors,
values = ~Hum_Error_Type,
title = "Error Type",
opacity = 1,
group = "Humidity Errors"
) %>%
addLegend(
position = "bottomright",
pal = press_error_colors,
values = ~Press_Error_Type,
title = "Error Type",
opacity = 1,
group = "Pressure Errors"
) %>%
# Add scale bar
addScaleBar(position = "bottomleft")
# Display the map
error_map
This interactive map allows you to: - Switch base layers: Choose between ESRI Topo, Satellite, USGS Topo, or OpenStreetMap - Toggle error types: Check/uncheck boxes to show Temperature, Humidity, or Pressure errors - Color coding: - Blue = Underestimate (sensor reads too low) - Green = Accurate (within threshold) - Red = Overestimate (sensor reads too high) - Hover labels: Quick view of device, error type, and magnitude - Click popups: Detailed comparison data
# Save the enhanced dataset with NOAA data
# write.csv(comparison_data, "CDT_Weather_Comparison.csv", row.names = FALSE)
# write_xlsx(comparison_data, "CDT_Weather_Comparison.xlsx")
cat("Analysis complete!\n")
## Analysis complete!
cat("Comparison dataset contains", nrow(comparison_data), "points with NOAA data.\n")
## Comparison dataset contains 100 points with NOAA data.
This tutorial demonstrated:
Based on the comparison with NOAA data: - Total comparison points: 100 - Red device measurements: 62 - Purple device measurements: 38
Exercise 7: Review the analysis comparing the red and purple meteo devices. Type your answers below each question. Provide at least three sentences per response:
7a: Which device would you say was more accurate compared to NOAA weather data (Temperature, Pressure, Humidity) and why would you say so? Be sure to mention what graph(s) or statistic(s) support your conclusion. In the ‘Pressure Error Categories by Device’ bar graph, the purple device has a higher percentage of both small and large error when compared to red. However, the small error percentage is a larger difference than the larger error percentage, meaning, overall, it is likely more accurate than the red device. In the ‘Humidity Error Categories by Device’ bar graph, the purple device has a lower percentage of small error and a higher percentage of large error when compared to red, so the red device is more accurate in this category. In ‘Humidity Error Categories by Device’ bar graph, the purple device has a smaller, but likely comparable, percentage of error compared to red, but a larger percentage of error when compared to red and a smaller percentage of medium error when compared to red, so red is likely the more accurate device in the temperature category. Thus even though it has higher error in pressure measurements, red is likely the more accurate sensor overall
7b: What was something interesting you noticed about the difference between device data and the NOAA temperature data? Where there any patterns you noticed in this descriptive statistical analysis regarding the potential accuracy of these devices compared to NOAA historical data? Mention what graph(s) or statistic(s) helped you notice this pattern. On the scatter plot ‘Sensor Temperature vs NOAA Temperature,’ red has more clear outliers than purple. This is interesting because in the ‘Temperature Error Categories by Device’ bar graph purple has a higher percentage of large errors. This means that, on the scatter plot, the red data is, overall, closer to the line of best fit than the purple even though it has at least four major outlines. This confirms that red is the more accurate device for temperature data.
To submit this assignment, publish on RPubs and post link url on Canvas