Updated by CSH 10/1/24

Description

You have been hired as a consultant by Disney to choose a location for a new amusement park. Your job is to analyze weather data from different locations and pick the best option.

Part 1: Normal data cleanup (5 points)

Begin by looking at the climate normal data. Normal data is the predicted weather for a specific date and location. It is not tied to an individual year.

Load your climate normal data files. You will need to do some clean-up, so look at the data carefully. Below is a list of suggested dplyr activities.

Suggested tasks:

Part 2: Summary table (5 points)

Now that you have the data, create some basic summary data. Show a table with the average temperature by month and location, with the stations as rows and the months as columns.

Hint: you may need to pivot the data (pivot_wider comes from tidyr, not dplyr). First group, then summarise, then pivot. You should end up with a table showing the station for each row, and then each month as a column. You may want to use dplyr and lubridate to create a month column.

## # A tibble: 4 × 13
## # Groups:   station [4]
##   station       `1`   `2`   `3`   `4`   `5`   `6`   `7`   `8`   `9`  `10`  `11`
##   <chr>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 USW00004725  27.7  29.0  39    52.4  64.2  72.0  77.0  74.4  68.0  56.0  43.2
## 2 USW00013904  61.2  64.9  72.8  78.8  85.4  92.0  94.3  95.9  89.3  81.5  70.8
## 3 USW00014762  34.0  35.7  47.5  60.0  69.8  77.5  80.8  79.4  73.8  61.5  49.0
## 4 USW00025309  31.9  33.4  37.3  46.4  55.9  60.2  62.0  60.5  54.9  45.7  35.5
## # ℹ 1 more variable: `12` <dbl>
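
The group, summarise, pivot sequence from the hint can be sketched as follows. This is a minimal illustration on made-up data, not the assignment solution: the column names `station`, `date`, and `tavg` and all the values are assumptions, so the result will not match the table above.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Hypothetical toy data: column names and values are assumptions,
# not taken from the assignment's data files.
normals <- tibble(
  station = c("A", "A", "A", "B", "B", "B"),
  date    = as.Date(c("2023-01-01", "2023-01-02", "2023-02-01",
                      "2023-01-01", "2023-02-01", "2023-02-02")),
  tavg    = c(30, 32, 35, 60, 64, 66)
)

monthly_table <- normals %>%
  mutate(month = month(date)) %>%                        # month number via lubridate
  group_by(station, month) %>%                           # one group per station-month
  summarise(avg_temp = mean(tavg), .groups = "drop") %>% # average within each group
  pivot_wider(names_from = month, values_from = avg_temp) # months become columns

print(monthly_table)
```

Each station ends up on its own row, with one column per month number found in the data.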

Part 3: Best location for amusement park (5 points)

We want to find the best location for an amusement park: one that is neither too hot nor too cold.

Define an appropriate temperature range where it is comfortable to be outside. Then, create a graph showing how different locations meet your temperature requirement.

Hint: use mutate to create a new field using ifelse (and some temperature range). Set this value to either 1 (for good) or 0 (for bad). Then look at how much of your dataset falls into this ‘good’ range for each station.

## # A tibble: 4 × 6
## # Groups:   station [4]
##   station     name              count_good_days count_bad_days percent_good_days
##   <chr>       <chr>                       <dbl>          <dbl>             <dbl>
## 1 USW00014762 PITTSBURGH ALLEG…             166            199              45.5
## 2 USW00013904 AUSTIN BERGSTROM…             148            217              40.5
## 3 USW00004725 BINGHAMTON, NY US             133            232              36.4
## 4 USW00025309 JUNEAU INTL AP, …               0            365               0  
## # ℹ 1 more variable: percent_bad_days <dbl>
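
The flag-and-count approach from the hint might look like the sketch below. The 64 to 85 degree range is the one used in the findings that follow. The toy data and the choice of a `tmax` column are assumptions made only for illustration.

```r
library(dplyr)

# Hypothetical toy data: station labels, values, and the use of tmax
# as the comparison column are assumptions.
normals <- tibble(
  station = c("A", "A", "A", "A", "B", "B", "B", "B"),
  tmax    = c(50, 70, 80, 90, 30, 40, 66, 50)
)

lower <- 64  # lower bound of the comfortable range
upper <- 85  # upper bound of the comfortable range

good_days <- normals %>%
  mutate(good = ifelse(tmax >= lower & tmax <= upper, 1, 0)) %>%  # 1 = good, 0 = bad
  group_by(station) %>%
  summarise(
    count_good_days   = sum(good),
    count_bad_days    = sum(1 - good),
    percent_good_days = 100 * mean(good),
    .groups = "drop"
  ) %>%
  arrange(desc(percent_good_days))  # best station first

print(good_days)
```

Because `good` is 0 or 1, its sum counts the good days and its mean gives their share. The `percent_good_days` column could then feed a bar chart, for example with ggplot2's `geom_col()`.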

Write a brief 2-3 sentence explanation of your findings.

Pittsburgh is the best location for an amusement park, based on a temperature range of 64 to 85 degrees. It has the highest percentage of days in the year (45.5%) that fall within this range.

Part 4: Prediction (5 points)

Now, you need to figure out how much the actual daily weather in 2023 at your best site varies from the climate normals.

Load up the GHCN_daily dataset. You’ll want to filter it down to your chosen site, and then turn the date_as_text column into a proper date. Then, join it to your climate normals (again, filtered to your chosen site) using the date.

Note that tmax is stored in degrees Celsius. You’ll need to convert it so that it matches the units of the climate normals.
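
A rough sketch of the filter, parse, convert, and join steps is shown below. Several details are assumptions rather than facts about the dataset: that `date_as_text` uses a `YYYYMMDD` layout, that the normals carry a `date` column already set to 2023, that the normals are in Fahrenheit, and the column name `normal_tmax`. The station ID is the Pittsburgh one from the Part 3 table.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy data: the date layout, column names, and values are assumptions.
daily <- tibble(
  station      = "USW00014762",
  date_as_text = c("20230101", "20230102", "20230103"),
  tmax         = c(0, 10, 25)            # degrees Celsius
)
normals <- tibble(
  station     = "USW00014762",
  date        = as.Date(c("2023-01-01", "2023-01-02", "2023-01-03")),
  normal_tmax = c(36, 45, 80)            # assumed to be degrees Fahrenheit
)

joined <- daily %>%
  filter(station == "USW00014762") %>%   # keep only the chosen site
  mutate(
    date   = ymd(date_as_text),          # text to Date
    tmax_f = tmax * 9 / 5 + 32           # Celsius to Fahrenheit
  ) %>%
  inner_join(normals, by = c("station", "date"))

print(joined)
```

An inner join silently drops any date missing from either table, so it is worth comparing `nrow(joined)` against the number of days you expect.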

Create two predictions.

First, compare the actual tmax versus predicted tmax. What is the error? Graph your results and give a 2-3 sentence explanation.

Second, compare the number of days that are predicted to be nice, versus the actual number of days that were nice. Use the same definition as the prior question.

What are the accuracy, precision, and recall of the climate normal data? Give a 2-3 sentence explanation of your results.

##               
##                Bad Nice
##   Predict Bad  165   34
##   Predict Nice  32  132
## [1] "Accuracy =  0.818181818181818"
## [1] "Precision =  0.804878048780488"
## [1] "Recall =  0.795180722891566"
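
The three metrics can be recomputed from the counts in the confusion matrix above, treating "Nice" as the positive class. The sketch below uses base R only. The variable names are illustrative, and this is one way to reproduce the printed values rather than necessarily the code that generated them.

```r
# Counts copied from the confusion matrix above (rows = predicted, columns = actual).
cm <- matrix(
  c(165, 32,    # actual Bad:  predicted Bad, predicted Nice
    34, 132),   # actual Nice: predicted Bad, predicted Nice
  nrow = 2,
  dimnames = list(predicted = c("Bad", "Nice"), actual = c("Bad", "Nice"))
)

tp <- cm["Nice", "Nice"]  # predicted nice, actually nice
fp <- cm["Nice", "Bad"]   # predicted nice, actually bad
fn <- cm["Bad", "Nice"]   # predicted bad, actually nice
tn <- cm["Bad", "Bad"]    # predicted bad, actually bad

accuracy  <- (tp + tn) / sum(cm)  # 297 / 363
precision <- tp / (tp + fp)       # 132 / 164
recall    <- tp / (tp + fn)       # 132 / 166

print(c(accuracy = accuracy, precision = precision, recall = recall))
```

Precision answers "of the days predicted nice, how many were nice?", while recall answers "of the days that were nice, how many were predicted?".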

Conclusion

Based on the analysis, the climate normals have about 80% accuracy, precision, and recall when predicting nice days. Tmax error is calculated by subtracting the actual tmax from the predicted tmax, so a negative error means the normal was below the observed temperature. The data shows that for the Pittsburgh location in 2023, the climate normals consistently underestimated temperature.
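
The sign convention for the tmax error can be made concrete with a small sketch. The temperatures are invented for illustration and are not taken from the Pittsburgh data.

```r
# Hypothetical temperatures in the same units; values are assumptions.
predicted <- c(40, 55, 70)  # climate normal tmax
actual    <- c(44, 58, 71)  # observed tmax

error      <- predicted - actual  # predicted minus actual, as defined above
mean_error <- mean(error)         # average signed error (bias)
mae        <- mean(abs(error))    # average size of the error, ignoring sign

print(error)
print(c(mean_error = mean_error, mae = mae))
```

Here every error is negative, which is the pattern described above as underestimation. The mean absolute error reports the typical size of the miss regardless of direction.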