Project 1

Updated by NDG 9/18/24

Description

You have been hired as a consultant by Disney to create a location for a new amusement park. Your job is to analyze weather data from different locations to pick out the best option.

Part 1: Normal data cleanup (5 points)

Begin by looking at the climate normal data. Normal data is the predicted weather for a specific date and location. It is not tied to an individual year.

Load your climate normal data files. You will need to do some clean-up. Be sure to look at the data carefully. Below is a list of suggested dplyr activities.

Suggested tasks:

Load both CSV normal files as raw data
Clean up the data
Use janitor to clean field names
Create a new string with paste or paste0 that has m/d/y (use 2023 for year)
Use mdy from lubridate to create an actual date column
Join the two tables together to get the titles for each station.
Remove any fields you don’t want to use.
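
The date-building steps in the list above can be tried on a toy table first. This is only a minimal sketch: the data frame `toy` and its values are made up for illustration, and `make_date()` from lubridate is shown as an optional alternative to the suggested `paste0()` plus `mdy()` route, not as part of the assignment.

```r
library(dplyr)
library(lubridate)

# Hypothetical stand-in for the climate normal file (values are made up)
toy <- data.frame(
  station = c("A", "A", "B"),
  month   = c(1, 2, 12),
  day     = c(15, 28, 31),
  tmax    = c(61.2, 65.0, 30.4)
)

toy_dates <- toy %>%
  mutate(
    date_as_text = paste0(month, "/", day, "/2023"),  # e.g. "1/15/2023"
    date         = mdy(date_as_text),                 # parse the text into a Date
    date_alt     = make_date(2023, month, day)        # same Date without the text step
  )
```

Both routes should give the same dates; the text route matches the suggested tasks, while `make_date()` skips the intermediate string.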

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(lubridate)

t_raw <- read_csv('ClimateNormalData_v2.csv', skip = 3)

## Rows: 1460 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): STATION
## dbl (3): month, day, tmax
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

t_raw_2 <- read_csv('ClimateNormalData_stations_v2.csv', skip = 2)

## Rows: 4 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): STATION, NAME
## dbl (3): LATITUDE, LONGITUDE, ELEVATION
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

t_normal <- t_raw %>% 
  janitor::clean_names() %>% 
  # Build an m/d/y string (year fixed at 2023), then parse it into a Date
  mutate(date_as_text = paste0(month, '/', day, '/', '2023'), 
         date = mdy(date_as_text)) %>%
  # Attach the station names; the station ID columns differ only in case
  inner_join(t_raw_2, by = c("station" = "STATION")) %>% 
  # Keep only the fields used later
  select(NAME, date, tmax)

Part 2: Summary table (5 points)

Now that you have the data, create some basic summary data. Show a table with the average temperature by month and location. Have the stations as rows, and the month as columns.

Hint: you may need to use pivot_wider from tidyr. First group, then summarise, then pivot. You should end up with a table showing the name for each row, and then each month as a column. You may want to use dplyr and lubridate to create a month column.
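
The group, summarise, pivot sequence in the hint can be sketched on a small made-up table. The data frame `toy`, the station names, and the column name `mon` are assumptions for illustration only; `month(date, label = TRUE)` is used here so that the pivoted columns carry month names rather than numbers.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Hypothetical daily normals for two made-up stations
toy <- data.frame(
  NAME = c("SITE A", "SITE A", "SITE A", "SITE B", "SITE B", "SITE B"),
  date = as.Date(c("2023-01-01", "2023-01-02", "2023-02-01",
                   "2023-01-01", "2023-01-02", "2023-02-01")),
  tmax = c(60, 62, 66, 20, 24, 30)
)

toy_wide <- toy %>%
  mutate(mon = month(date, label = TRUE)) %>%               # month as an ordered factor
  group_by(NAME, mon) %>%                                   # one group per station and month
  summarise(avg_tmax = mean(tmax), .groups = "drop") %>%    # average within each group
  pivot_wider(names_from = mon, values_from = avg_tmax)     # months become columns
```

The result has one row per station and one column per month, which is the layout the summary table asks for.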

t_summary <- t_normal %>% 
  mutate(m = month(date)) %>%                              # month number, 1-12
  group_by(NAME, m) %>% 
  summarise(avg_temp = mean(tmax), .groups = "drop") %>%   # mean daily high per station and month
  pivot_wider(names_from = m, values_from = avg_temp)      # one row per station, one column per month

Part 3: Best location for amusement park (5 points)

We want to find the best location for an amusement park that isn’t too hot or too cold.

Define an appropriate temperature range where it is comfortable to be outside. Then, create a graph showing how different locations meet your temperature requirement.

Hint: use mutate to create a new field using ifelse (and some temperature range). Set this value to either 1 (for good) or 0 (for bad). Then look at how much of your dataset falls into this ‘good’ range for each station.
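
As a minimal sketch of the hint, the share of "good" days per station is the mean of the 0/1 flag. The data frame `toy` and its values are made up, and the band of 55 to 85 degrees is the one chosen later in this report, used here only as an assumed example.

```r
library(dplyr)

# Hypothetical daily highs (degrees F) for two made-up stations
toy <- data.frame(
  NAME = rep(c("SITE A", "SITE B"), each = 4),
  tmax = c(50, 60, 70, 90, 56, 65, 75, 84)
)

# Assumed comfort band for illustration: strictly between 55 and 85 degrees F
toy_share <- toy %>%
  mutate(good = ifelse(tmax > 55 & tmax < 85, 1, 0)) %>%   # 1 = good day, 0 = bad day
  group_by(NAME) %>%
  summarise(share_good = mean(good), .groups = "drop")     # fraction of good days
```

Because the flag is 0 or 1, `mean(good)` is the fraction of the dataset that falls in the good range for each station.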

Write a brief 2-3 sentence explanation of your findings.

t_best <- t_normal %>% 
  mutate(m = month(date)) %>% 
  group_by(NAME, m) %>% 
  summarise(avg_temp = mean(tmax), .groups = "drop") %>% 
  # Flag months whose average high is in the comfortable band: 1 = good, 0 = bad
  mutate(range_temp = ifelse(avg_temp > 55 & avg_temp < 85, 1, 0))

library(ggplot2)

ggplot(t_best) +
  aes(x = m, y = avg_temp, colour = range_temp) +
  geom_line(linewidth = 1.15) +
  scale_color_gradient(low = "#F80404", high = "#299D0F") +
  theme_minimal() +
  theme(axis.text.y = element_text(face = "bold", size = 12L), 
        axis.text.x = element_text(face = "bold", size = 12L)) +
  facet_wrap(vars(NAME)) +
  scale_x_continuous(breaks = 1:12) +
  scale_y_continuous(breaks = c(40, 55, 70, 85))

## Findings: I chose a range of 55 to 85 degrees Fahrenheit because people dressed appropriately are comfortable walking around outside at these temperatures. Disney's current parks often see very high temperatures, so 85 is a reasonable upper limit. Because tmax is the highest temperature of the day and the rest of the day is colder, 55 is a reasonable lower limit. With this range, Austin, TX has the most months whose predicted average temperature is comfortable and Alaska has the fewest, as shown by the green (good) and red (bad) lines in the graphs.

Part 4: Prediction (5 points)

Now, you need to figure out how much the average daily weather for your best site varies from the climate normals for 2023.

Load up the GHCN_daily dataset. You’ll want to filter it down to your chosen site, and then turn the date_as_text column into a proper date. Then, join it to your climate normals (again, filtered to your chosen site) using the date.

Note that tmax in the GHCN_daily dataset is stored in Celsius. You’ll need to convert it to Fahrenheit.
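
The conversion is F = C * 9/5 + 32. A small sketch follows; the helper name `c_to_f` is hypothetical and not part of the assignment or of any package.

```r
# Convert degrees Celsius to degrees Fahrenheit: F = C * 9/5 + 32
# (c_to_f is a hypothetical helper name used only for this illustration)
c_to_f <- function(celsius) {
  celsius * 9 / 5 + 32
}

# Freezing point, a warm day, and boiling point of water
converted <- c_to_f(c(0, 25, 100))
converted  # 32 77 212
```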

Create two predictions.

First, compare the actual tmax versus predicted tmax. What is the error? Graph your results and give a 2-3 sentence explanation.

Second, compare the number of days that are predicted to be nice, versus the actual number of days that were nice. Use the same definition as the prior question.

What are the accuracy, precision, and recall of the climate normal data? Give a 2-3 sentence explanation of your results.

t_raw_3 <- read_csv('GHCN_Daily_v2.csv')

## Rows: 88644 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): station, date_as_text
## dbl (1): tmax_actual_in_c
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Daily observations for the chosen site, with a real date and tmax in Fahrenheit
t_daily <- t_raw_3 %>%
  filter(station == 'USW00013904') %>% 
  mutate(date = mdy(date_as_text),
         tmax_actual = tmax_actual_in_c * (9/5) + 32) %>% 
  # Join to the normals for the same site only, so each date matches one row
  inner_join(filter(t_normal, NAME == 'AUSTIN BERGSTROM AP, TX US'), by = 'date') %>% 
  select(date, NAME, station, tmax, tmax_actual) %>% 
  arrange(date)

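t_predict_1 <- t_daily %>% 
  mutate(difference = tmax_actual - tmax,   # positive = day was warmer than the normal
         abs_value_of_difference = abs(difference))
# Mean absolute error of the normals, in degrees Fahrenheit
avg_temp_error <- mean(t_predict_1$abs_value_of_difference)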


library(ggplot2)

ggplot(t_predict_1) +
 aes(x = difference) +
 geom_histogram(bins = 30L, fill = "#4682B4") +
 labs(title = "Distribution of Error") +
 theme_minimal() +
 theme(plot.title = element_text(size = 20L, face = "bold", hjust = 0.5))

## The mean absolute error is 7.70 degrees Fahrenheit, so on average the climate normal differs from the actual high by 7 to 8 degrees. The errors are widely distributed: at one extreme the normal was 30.62 degrees too high, and at the other it was 23.24 degrees too low.
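
For reference, the average error reported above is a mean absolute error. A minimal sketch with made-up numbers shows how it relates to the signed bias and to the root mean squared error; the vectors `predicted` and `observed` are hypothetical and are not taken from the datasets.

```r
# Hypothetical predicted (normal) and observed highs in degrees F
predicted <- c(70, 75, 80, 85)
observed  <- c(72, 70, 80, 91)

error <- observed - predicted    # positive means the day ran warmer than the normal
bias  <- mean(error)             # average signed error; misses can cancel out
mae   <- mean(abs(error))        # mean absolute error; misses cannot cancel out
rmse  <- sqrt(mean(error^2))     # weights large misses more heavily than MAE
```

A bias near zero alongside a large MAE would indicate that the normals miss in both directions by similar amounts.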

t_predict_2 <- t_daily %>% 
   mutate(nice_day_predict = ifelse(tmax > 55 & tmax < 85, 1, 0), 
          nice_day_actual = ifelse(tmax_actual > 55 & tmax_actual < 85, 1, 0))

table(t_predict_2$nice_day_predict, t_predict_2$nice_day_actual)

##    
##       0   1
##   0 135  10
##   1  33 187

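# Counts come from the table above (rows = predicted, columns = actual)
accuracy <- (187 + 135) / (187 + 33 + 135 + 10)  # correct calls / all days
precision <- 187 / (187 + 33)                    # actually nice / predicted nice
recall <- 187 / (187 + 10)                       # predicted nice / actually nice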

## The accuracy shows that the normal data correctly predicted whether a day would be nice roughly 88% of the time. The precision shows that 85% of the days the normal data predicted to be nice actually were nice. The recall shows that the normal data predicted 95% of the days that actually were nice.
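
The same three metrics can also be computed from the 0/1 columns rather than from typed-in counts. The sketch below is an illustration under one assumption: the vectors `predicted` and `actual` are rebuilt with `rep()` so that they reproduce the four counts in the table above (135, 10, 33, 187); they stand in for the `nice_day_predict` and `nice_day_actual` columns.

```r
# Rebuild 0/1 vectors that reproduce the confusion table above
# (rows = predicted, columns = actual): 135, 10, 33, 187
predicted <- c(rep(0, 135), rep(0, 10), rep(1, 33), rep(1, 187))
actual    <- c(rep(0, 135), rep(1, 10), rep(0, 33), rep(1, 187))

tp <- sum(predicted == 1 & actual == 1)  # predicted nice, was nice
fp <- sum(predicted == 1 & actual == 0)  # predicted nice, was not nice
fn <- sum(predicted == 0 & actual == 1)  # predicted not nice, was nice
tn <- sum(predicted == 0 & actual == 0)  # predicted not nice, was not nice

metrics <- c(
  accuracy  = (tp + tn) / length(actual),  # correct calls over all days
  precision = tp / (tp + fp),              # share of predicted-nice days that were nice
  recall    = tp / (tp + fn)               # share of nice days that were predicted
)
round(metrics, 3)
```

Computing the counts with `sum()` avoids copying numbers from printed output, so the metrics stay correct if the temperature range or the chosen site changes.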

Kylie Carr