Project 1

Updated by NDG 9/18/24

Description

You have been hired as a consultant by Disney to create a location for a new amusement park. Your job is to analyze weather data from different locations to pick out the best option.

Part 1: Normal data cleanup (5 points)

Begin by looking at the climate normal data. Normal data is the predicted weather for a specific date and location. It is not tied to an individual year.

Load your climate normal data files. You will need to do some clean-up, so be sure to look at the data carefully. Below is a list of suggested dplyr activities.

Suggested tasks:

Load both csv normal files as raw data
Clean up the data
Use janitor to clean field names
Create a new string with paste or paste0 that has m/d/y (use 2023 for year)
Use mdy from lubridate to create an actual date column
Join the two tables together to get the titles for each station.
Remove any fields you don’t want to use.
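The suggested tasks above can be sketched roughly as follows. This is a minimal illustration, not the required solution: the two small data frames stand in for the csv files, and the raw column names (month, day, DLY-TMAX-NORMAL) are assumptions that may not match your files. Note that clean_names lower-cases the field names, so the result will not look exactly like the sample output.

```r
library(dplyr)
library(janitor)
library(lubridate)

# Hypothetical stand-ins for the two raw csv files
# (in practice these would come from read.csv or readr::read_csv).
normals_raw <- data.frame(
  STATION = c("USW00004725", "USW00004725"),
  month = c(10, 10),
  day = c(10, 11),
  `DLY-TMAX-NORMAL` = c(59.3, 58.9),   # assumed raw column name
  check.names = FALSE
)
stations_raw <- data.frame(
  STATION = "USW00004725",
  NAME = "BINGHAMTON, NY US"
)

normals <- normals_raw %>%
  clean_names() %>%                                  # station, month, day, dly_tmax_normal
  mutate(
    date_text = paste0(month, "/", day, "/2023"),    # m/d/y string, year fixed at 2023
    d = mdy(date_text)                               # real Date column
  ) %>%
  left_join(clean_names(stations_raw), by = "station") %>%  # add the station name
  select(station, name, tmax = dly_tmax_normal, d)          # keep only the needed fields

print(normals)
```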

## # A tibble: 1,460 × 4
##    STATION     NAME              tmax  d         
##    <chr>       <chr>             <chr> <date>    
##  1 USW00004725 BINGHAMTON, NY US 59.3  2023-10-10
##  2 USW00004725 BINGHAMTON, NY US 58.9  2023-10-11
##  3 USW00004725 BINGHAMTON, NY US 58.5  2023-10-12
##  4 USW00004725 BINGHAMTON, NY US 58    2023-10-13
##  5 USW00004725 BINGHAMTON, NY US 57.6  2023-10-14
##  6 USW00004725 BINGHAMTON, NY US 57.1  2023-10-15
##  7 USW00004725 BINGHAMTON, NY US 56.4  2023-10-16
##  8 USW00004725 BINGHAMTON, NY US 55.6  2023-10-17
##  9 USW00004725 BINGHAMTON, NY US 54.8  2023-10-18
## 10 USW00004725 BINGHAMTON, NY US 54.1  2023-10-19
## # ℹ 1,450 more rows

Part 2: Summary table (5 points)

Now that you have the data, create some basic summary data. Show a table with the average temperature by month and location. Have the stations as rows, and the month as columns.

Hint: you may need to use pivot_wider from tidyr. First group, then summarise, then pivot. You should end up with a table showing the name for each row, and then each month as a column. You may want to use dplyr and lubridate to create a month column.
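A minimal sketch of the group, summarise, pivot sequence, using a tiny made-up data frame in place of the cleaned normals. The column names station, d, and tmax are assumptions carried over for illustration. If tmax was read as text (the Part 1 sample output shows it as chr), it would need as.numeric first.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Hypothetical cleaned normals: two stations, a few days each
normals <- data.frame(
  station = c("A", "A", "A", "B", "B", "B"),
  d = as.Date(c("2023-01-01", "2023-01-02", "2023-02-01",
                "2023-01-01", "2023-01-02", "2023-02-01")),
  tmax = c(30, 32, 35, 60, 62, 65)
)

monthly <- normals %>%
  mutate(month = month(d, label = TRUE, abbr = FALSE)) %>%  # full month name as an ordered factor
  group_by(station, month) %>%
  summarise(avg_tmax = mean(tmax), .groups = "drop") %>%    # one row per station and month
  pivot_wider(names_from = month, values_from = avg_tmax)   # months become columns

print(monthly)
```

Dropping the grouping with .groups = "drop" before pivoting avoids carrying a leftover group into the final table.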

## # A tibble: 4 × 13
## # Groups:   STATION [4]
##   STATION     January February March April   May  June  July August September
##   <chr>         <dbl>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>     <dbl>
## 1 USW00004725    27.7     29.0  39    52.4  64.2  72.0  77.0   74.4      68.0
## 2 USW00013904    61.2     64.9  72.8  78.8  85.4  92.0  94.3   95.9      89.3
## 3 USW00014762    34.0     35.7  47.5  60.0  69.8  77.5  80.8   79.4      73.8
## 4 USW00025309    31.9     33.4  37.3  46.4  55.9  60.2  62.0   60.5      54.9
## # ℹ 3 more variables: October <dbl>, November <dbl>, December <dbl>

Part 3: Best location for amusement park (5 points)

We want to find the best location for an amusement park: one that isn’t too hot or too cold.

Define an appropriate temperature range where it is comfortable to be outside. Then, create a graph showing how different locations meet your temperature requirement.

Hint: use mutate to create a new field using ifelse (and some temperature range). Set this value to either 1 (for good) or 0 (for bad). Then look at how much of your dataset falls into this ‘good’ range for each station.
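The hint can be sketched as below. The 68 to 76 degree range is only a placeholder, the data frame is made up, and the column names are assumptions; substitute your own range and cleaned normals.

```r
library(dplyr)

# Hypothetical normals for two stations
normals <- data.frame(
  station = c("A", "A", "A", "A", "B", "B", "B", "B"),
  tmax = c(50, 70, 72, 90, 69, 70, 75, 80)
)

comfort_low <- 68    # assumed lower bound of the comfortable range
comfort_high <- 76   # assumed upper bound of the comfortable range

nice_share <- normals %>%
  mutate(nice_day = ifelse(tmax >= comfort_low & tmax <= comfort_high, 1, 0)) %>%
  group_by(station) %>%
  summarise(
    nice_days = sum(nice_day),     # count of days in the good range
    share_nice = mean(nice_day),   # fraction of days in the good range
    .groups = "drop"
  )

print(nice_share)
```

A per-station summary like this could then be passed to a bar chart (for example ggplot2 with geom_col) to compare locations.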

Write a brief 2-3 sentence explanation of your findings.

I googled the comfortable temperature range for humans, which was given as 68 to 76 degrees, so I made a new column flagging whether or not each tmax value counted as a comfortable temperature. For the graph, I plotted all the tmax data for each station and drew a rectangle to highlight the comfortable range. This showed me that the two best locations for an amusement park would be Austin, TX and Pittsburgh, PA.

Part 4: Prediction (5 points)

Now, you need to figure out how much the average daily weather for your best site varies from the climate normals for 2023.

Load up the GHCN_daily dataset. You’ll want to filter it down to your chosen site, and then turn the date_as_text column into a proper date. Then, join it to your climate normals (again, filtered to your chosen site) using the date.

Note that tmax is stored in Celsius. You’ll need to convert it to Fahrenheit to match the climate normals.
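A rough sketch of the filter, date conversion, unit conversion, and join. The data frames are made up, the chosen station ID is only an example, and the year-month-day text format is an assumption: check how date_as_text actually looks in your file and pick the matching lubridate parser.

```r
library(dplyr)
library(lubridate)

# Hypothetical daily observations; date text format is assumed
daily_raw <- data.frame(
  station = c("USW00013904", "USW00013904", "USW00004725"),
  date_as_text = c("2023-07-01", "2023-07-02", "2023-07-01"),
  tmax = c(35.0, 36.1, 25.0)          # degrees Celsius
)

# Hypothetical climate normals, already in Fahrenheit
normals <- data.frame(
  station = c("USW00013904", "USW00013904"),
  d = as.Date(c("2023-07-01", "2023-07-02")),
  tmax_normal = c(94.0, 94.2)
)

site <- "USW00013904"   # example site, not a recommendation

daily_site <- daily_raw %>%
  filter(station == site) %>%
  mutate(
    d = ymd(date_as_text),              # text to Date
    tmax_actual_f = tmax * 9 / 5 + 32   # Celsius to Fahrenheit
  ) %>%
  select(d, tmax_actual_f)

compared <- normals %>%
  filter(station == site) %>%
  inner_join(daily_site, by = "d")      # one row per date with both values

print(compared)
```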

Create two predictions.

First, compare the actual tmax versus predicted tmax. What is the error? Graph your results and give a 2-3 sentence explanation.

## After looking at the graph, it is clear that the predicted tmax is almost always lower than the actual tmax, though the predicted values follow the same trend as the actual values.
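One way to put a number on the error is sketched below, using made-up values. The column names tmax_normal and tmax_actual_f are assumptions, and the choice of summary statistics (mean error, mean absolute error, root mean squared error) is a suggestion rather than something the assignment prescribes.

```r
library(dplyr)

# Hypothetical joined table of predicted (normal) and actual tmax, both in Fahrenheit
compared <- data.frame(
  d = as.Date("2023-07-01") + 0:3,
  tmax_normal = c(94, 94, 95, 95),
  tmax_actual_f = c(96, 93, 99, 95)
)

errors <- compared %>%
  mutate(error = tmax_actual_f - tmax_normal)   # positive means actual was warmer than predicted

error_summary <- errors %>%
  summarise(
    mean_error = mean(error),            # average bias
    mean_abs_error = mean(abs(error)),   # typical size of the miss
    rmse = sqrt(mean(error^2))           # penalizes large misses more heavily
  )

print(error_summary)
```

A positive mean error would be consistent with predictions that run lower than the actual values.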

Second, compare the number of days that are predicted to be nice versus the actual number of days that were nice. Use the same definition as the prior question.

## The predicted and actual counts of nice days were only off by 9 days for the whole year (181 predicted versus 172 actual).

## # A tibble: 1,460 × 5
##    nice_day_predicted nice_day_actual sum_nice_d_predicted sum_nice_d_actual
##                 <dbl>           <dbl>                <dbl>             <dbl>
##  1                  0               0                  181               172
##  2                  0               0                  181               172
##  3                  0               0                  181               172
##  4                  0               0                  181               172
##  5                  0               0                  181               172
##  6                  0               0                  181               172
##  7                  0               0                  181               172
##  8                  0               0                  181               172
##  9                  0               0                  181               172
## 10                  0               0                  181               172
## # ℹ 1,450 more rows
## # ℹ 1 more variable: nice_day_difference <dbl>

What are the accuracy, precision, and recall of the climate normal data? Give a 2-3 sentence explanation of your results.

## Overall, I would say they were pretty accurate. The weather is constantly changing, so they’re not going to be 100% accurate. They were only off by 9 days in predicting nice days for the year 2023. They were also consistent with the trend of the actual tmax in the graph above.
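Accuracy, precision, and recall depend on day-by-day agreement rather than on the yearly totals, so they could be computed from the two flag columns along these lines. This is a sketch with made-up flags. It treats "nice" as the positive class and reuses the column names nice_day_predicted and nice_day_actual from the output above.

```r
library(dplyr)

# Hypothetical daily flags: 1 = nice day, 0 = not nice
flags <- data.frame(
  nice_day_predicted = c(1, 1, 1, 0, 0, 0, 1, 0),
  nice_day_actual    = c(1, 1, 0, 0, 0, 1, 1, 0)
)

metrics <- flags %>%
  summarise(
    tp = sum(nice_day_predicted == 1 & nice_day_actual == 1),  # predicted nice, was nice
    fp = sum(nice_day_predicted == 1 & nice_day_actual == 0),  # predicted nice, was not
    fn = sum(nice_day_predicted == 0 & nice_day_actual == 1),  # predicted not nice, was nice
    tn = sum(nice_day_predicted == 0 & nice_day_actual == 0)   # predicted not nice, was not
  ) %>%
  mutate(
    accuracy = (tp + tn) / (tp + fp + fn + tn),  # share of all days classified correctly
    precision = tp / (tp + fp),                  # share of predicted nice days that were nice
    recall = tp / (tp + fn)                      # share of actual nice days that were predicted
  )

print(metrics)
```

Two yearly totals can be close even when many individual days disagree, which is why the confusion counts are built per day.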