Updated by JLF 9/30/24
You have been hired as a consultant by Disney to create a location for a new amusement park. Your job is to analyze weather data from different locations to pick out the best option.
Begin by looking at the climate normal data. Normal data is the predicted weather for a specific date and location. It is not tied to an individual year.
Load your climate normal datafiles. You will need to do some clean-up. Be sure to look at the data carefully. Below are a list of suggested dplyr activities.
Suggested tasks:
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(dplyr)
url_ghcn_daily <- 'https://raw.githubusercontent.com/papafre11/ACCT426RProject/refs/heads/main/GHCN_Daily_v2.csv'
url_cnd <- 'https://raw.githubusercontent.com/papafre11/ACCT426RProject/refs/heads/main/ClimateNormalData_v2.csv'
url_cnd_stations <- 'https://raw.githubusercontent.com/papafre11/ACCT426RProject/refs/heads/main/ClimateNormalData_stations_v2.csv'
t_raw_ghcn_daily <- read_csv(file = url_ghcn_daily)
## Rows: 88644 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): station, date_as_text
## dbl (1): tmax_actual_in_c
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
t_raw_cnd <- read_csv(file = url_cnd)
## New names:
## Rows: 1463 Columns: 4
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (4): US Climate Normal Data, ...2, ...3, ...4
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
t_raw_cnd_stations <- read_csv(file = url_cnd_stations)
## New names:
## Rows: 6 Columns: 5
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (5): Stations, ...2, ...3, ...4, ...5
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
colnames(t_raw_cnd) <- t_raw_cnd[3,]
colnames(t_raw_cnd_stations) <- t_raw_cnd_stations[2,]
t_slice_raw_cnd <- t_raw_cnd %>%
slice(-c(1,2,3))
t_slice_raw_cnd_stations <- t_raw_cnd_stations %>%
slice(-c(1,2))
t_ghcn_daily <- t_raw_ghcn_daily %>%
janitor::clean_names()
t_cnd <- t_slice_raw_cnd %>%
janitor::clean_names()
t_cnd_stations <- t_slice_raw_cnd_stations %>%
janitor::clean_names()
t_cnd_string_date <- t_cnd %>%
mutate(date_string = paste(month, day, 2023, sep = "/"))
t_cnd_date <- t_cnd_string_date %>%
mutate(date = mdy(date_string))
t_cnd_all <- left_join(t_cnd_date, t_cnd_stations, by = 'station')
t_cnd_necessary <- t_cnd_all %>%
select(-month, -day, -date_string)
t_cnd_ordered <- t_cnd_necessary %>%
select(station, name, date, tmax, latitude, longitude, elevation)
Now that you have the data, create some basic summary data. Show a table with the average temperature by month and location. Have the stations as rows, and the month as columns.
Hint: you may need to use dplyr pivot. First group, then summarise, then pivot. You should end up with a table showing the name for each row, and then each month as a column. You may want to use dplyr and lubridate to create a month column.
t_cnd_needed <- t_cnd_ordered %>%
select(station, date, tmax)
t_cnd_needed_month <- t_cnd_needed %>%
mutate(month = month(date, label = TRUE))
t_cnd_monthly_avg_tmax <- t_cnd_needed_month %>%
mutate(tmax = as.numeric(tmax)) %>%
group_by(station, month) %>%
summarize(avg_tmax = mean(tmax), .groups = "drop")
t_cnd_stations_monthly_avg_tmax <- t_cnd_monthly_avg_tmax %>%
pivot_wider(names_from = month, values_from = avg_tmax)
We want to find the best location for an amusement park that isn’t too hot, or too cold.
Define an appropriate temperature range where it is comfortable to be outside. Then, create a graph showing how different locations meet your temperature requirement.
Hint: use mutate to create a new field using ifelse (and some temperature range). Set this value to either 1 (for good) or 0 (for bad). Then look at how much of your dataset falls into this ‘good’ range for each station.
Write a brief 2-3 sentence explanation of your findings.
too_hot <- 85
too_cold <- 65
comfort_range <- c(too_cold, too_hot)
t_cnd_comfort <- t_cnd_monthly_avg_tmax %>%
mutate(comfort = ifelse(avg_tmax <= too_hot & avg_tmax >= too_cold, 1, 0))
t_station_comfort <- t_cnd_comfort %>%
group_by(station) %>%
summarize(comfortable = sum(comfort == 1),
uncomfortable = sum(comfort == 0))
#Based on my range of comfortable temperatures of 65 - 85 degrees, station USW00014762, which is Pittsburgh, is the best station to house an amusement park. Station USW00014762 has 5 out of 12 months in the year where the average max temperature was within the temperature range.This rate is better than the other stations, which had either 4 or 0 comfortable months.
Now, you need to figure out how much the average daily weather for your best site varies from the climate normals for 2023.
Load up the GHCN_daily dataset. You’ll want to filter it down to your chosen site, and then turn the date_as_text column into a proper date. Then, join it to your climate normals (again, filtered to your chosen site) using the date.
Note that tmax is stored as Celsius. You’ll need to convert it.
Create two predictions.
First, compare the actual tmax versus predicted tmax. What is the error? Graph your results and give a 2-3 sentence explanation.
Second, compare the number of days that are predicted to be nice, versus the actual number of days that were nice. Use the same definition as the prior question.
What is the accuracy, precision, and recall of the climate normal data? Give a 2-3 sentence explanation of your results.
url_ghcn_daily <- 'https://raw.githubusercontent.com/papafre11/ACCT426RProject/refs/heads/main/GHCN_Daily_v2.csv'
t_raw_ghcn_daily <- read_csv(file = url_ghcn_daily)
## Rows: 88644 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): station, date_as_text
## dbl (1): tmax_actual_in_c
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
t_ghcn_daily <- t_raw_ghcn_daily %>%
janitor::clean_names()
t_ghcn_daily_2023 <- t_ghcn_daily %>%
filter(grepl("2023$", date_as_text) & station == "USW00014762")
t_ghcn_daily_fixed <- t_ghcn_daily %>%
summarize(tmax_in_f = (1.8*tmax_actual_in_c + 32)) %>%
mutate(date_string = t_ghcn_daily$`date_as_text`) %>%
mutate(station = t_ghcn_daily$`station`)
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
## always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
t_ghcn_daily_fixed_date <- t_ghcn_daily_fixed %>%
mutate(date = mdy(date_string))
t_ghcn_correct <- t_ghcn_daily_fixed_date %>%
select(station, date, tmax_in_f) %>%
filter(year(date) == 2023 & station == "USW00014762")
t_cnd_correct <- t_cnd_needed %>%
filter(station == "USW00014762")
t_cnd_ghcn_join <- t_cnd_correct %>%
left_join(t_ghcn_correct, by = "date") %>%
arrange(date)
t_cnd_ghcn_join_needed <- t_cnd_ghcn_join %>%
select(-station.y) %>%
rename(station = station.x,
actual = tmax_in_f,
predicted = tmax)
t_cnd_ghcn_join_needed_2 <- t_cnd_ghcn_join_needed %>%
mutate(predicted_numeric = as.numeric(predicted)) %>%
filter(!is.na(actual))
t_cnd_ghcn_comparison <- t_cnd_ghcn_join_needed_2 %>%
mutate(actual_minus_predicted = actual - predicted_numeric) %>%
mutate(abs_actual_minus_predicted = abs(actual - predicted_numeric))
t_cnd_ghcn_comparison_needed <- t_cnd_ghcn_comparison %>%
select(-predicted)
avg_actual_minus_predicted <- mean(t_cnd_ghcn_comparison_needed$actual_minus_predicted, na.rm = TRUE)
avg_abs_actual_minus_predicted <- mean(t_cnd_ghcn_comparison_needed$abs_actual_minus_predicted, na.rm = TRUE)
colnames(t_cnd_ghcn_comparison_needed)
## [1] "station" "date"
## [3] "actual" "predicted_numeric"
## [5] "actual_minus_predicted" "abs_actual_minus_predicted"
ggplot(t_cnd_ghcn_comparison_needed, aes(x = predicted_numeric, y = actual_minus_predicted)) +
geom_point() +
geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
labs(title = "Residual Plot",
x = "Predicted Values",
y = "Residuals (Actual - Predicted)") +
theme_minimal()
# The error on average is 5 degrees, and the absolute error on average is about 8 degrees. This means that the actual temperature is usually 5 degrees above the predicted temperature and that there is an 8 degree error on average, whether in the positive or negative direction, from the predicted temperature.
too_hot <- 85
too_cold <- 65
comfort_range <- c(too_cold, too_hot)
t_cnd_ghcn_comfort <- t_cnd_ghcn_comparison_needed %>%
mutate(comfort_actual = ifelse(actual <= too_hot & actual >= too_cold, 1, 0)) %>%
mutate(comfort_predicted_numeric = ifelse(predicted_numeric <= too_hot & predicted_numeric >= too_cold, 1, 0))
t_cnd_ghcn_comfort_correct_wrong <- t_cnd_ghcn_comfort %>%
mutate(
predicted_1_equals_1 = ifelse(comfort_predicted_numeric == 1 & comfort_actual == 1, 1, 0),
predicted_1_equals_0 = ifelse(comfort_predicted_numeric == 1 & comfort_actual == 0, 1, 0),
predicted_0_equals_1 = ifelse(comfort_predicted_numeric == 0 & comfort_actual == 1, 1, 0),
predicted_0_equals_0 = ifelse(comfort_predicted_numeric == 0 & comfort_actual == 0, 1, 0)
)
summary_table <- t_cnd_ghcn_comfort_correct_wrong %>%
summarise(
true_positives = sum(predicted_1_equals_1),
false_positives = sum(predicted_1_equals_0),
false_negatives = sum(predicted_0_equals_1),
true_negatives = sum(predicted_0_equals_0))
total <- summary_table$true_positives + summary_table$false_positives + summary_table$true_negatives + summary_table$false_negatives
accuracy <- (summary_table$true_positives + summary_table$true_negatives)/ total
precision <- summary_table$true_positives/(summary_table$true_positives + summary_table$false_positives)
recall <- summary_table$true_positives/(summary_table$true_positives + summary_table$false_negatives)
# The accuracy of the prediction is 82.64%, which means that the model accurately predicts whether a day is within or outside of the comfortable range roughly 82.64% of the time. The precision of the prediction is 80.74%, which means that 80.74% of the time that the model predicts that the day is within the comfortable range, the temperature is actually within the comfortable range. The recall of the prediction is 80.25%, which means that whenever the actual temperature is within the comfortable range, the prediction model predicts it 80.25% of the time. Just as a heads up, there was missing data on July 29th and July 30th of 2023, so that data was left out.