ES300: NOAA Data for PDX Airport

Intro

This lab will help you download and explore daily weather data for Portland International Airport from NOAA’s Local Climatological Data tool. After loading the data into R, we’ll clean it to remove unneeded variables and wrangle it to create the ones we need: heat index and wind chill. Then we’ll visualize the different weather values so you can identify trends over time.

Finding the Data

To find the daily weather for PDX airport we will use the Local Climatological Data tool from NOAA. From here we can download a bulk dataset for a given location and time frame, then we will filter that dataset to just our variables of interest.

On the website, select “County” as the location type, then find Oregon and Multnomah County. You’ll see two options; select Portland International Airport and add it to your cart.

Go to your cart and leave daily output checked, but choose LCD CSV. Set the date range to run from January 1, 2016 to the most recent date available.

Press continue, enter your email address, and submit the order. After a minute or two you will receive an email with a link to download the data. Click this and it should automatically download a csv file.

Organize Your Files

Don’t leave this file in your downloads and don’t leave it with an uninformative name. Make a folder for this project and move the file into that folder. For the rest of these instructions, we will call the file “noaa_pdx.csv”, but you can name it whatever you want.

For more on file structures and why they are important see this page on file organization.

Load Data Into RStudio

Open RStudio either from a desktop version or by going to Reed’s R server. From the main menu choose File -> New Project. If you have a folder you’d like this project to be in choose Existing Directory, otherwise choose New Directory. Choose the plain R New Project, give it a name, and create the project.

Now, if you are using the server, you need to move the data from your computer to the cloud-based version of RStudio that Reed provides. To do this you will upload the file. In the lower right hand panel, go to the Files tab and click Upload.

Leave Target Directory alone and click Choose File. Find the file on your computer. Add it and say OK. The file is now in your R project folder.

Now we’ll begin the script and read in the file. Click the green plus on the top left and choose R File.

Load any libraries you will use. We will be using libraries that are included in {tidyverse}, so we’ll just load that.

library(tidyverse)

Then use read_csv() to load the data. If the file is in your project folder, you only need to include the name of the file in quotes. If the file is somewhere else, you will need to include the path to the file. To find this, see the file organization link above. Save the data as an object with a name of your choosing; here it’s pdx_raw_data.

pdx_raw_data <- read_csv("noaa_pdx.csv")
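If read_csv() reports that the file does not exist, checking where R is looking usually explains why. Here is a minimal sketch using only base R; it assumes the file name noaa_pdx.csv used in these instructions, so swap in your own name if you chose a different one.

```r
# Show the folder R is currently working in
# (this is your project folder if you opened the project)
getwd()

# List the csv files R can see in that folder
csv_files <- list.files(pattern = "\\.csv$")
csv_files

# TRUE if the data file is where read_csv("noaa_pdx.csv") will look for it
file_found <- file.exists("noaa_pdx.csv")
file_found
```

If file_found is FALSE, the file is either in a different folder or has a different name than the one in your code.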

Wrangling the Data

Now we can look to see how the data is organized to help us know what variables we want and the ones we don’t. Run glimpse() on the data to see the column names. You can also click the name pdx_raw_data in the upper right pane to open the data as a viewable spreadsheet.

glimpse(pdx_raw_data)

You can see that there are many more columns than what we are interested in and also that some things we want aren’t there, but we can calculate these.

The columns we want to keep are DATE, DailyAverageDryBulbTemperature, DailyAverageRelativeHumidity, and DailyAverageWindSpeed. We’ll use the select() function to keep only those. We can also change the name of the dataframe to reflect our change.

pdx_data <- pdx_raw_data |>
  select(DATE, DailyAverageDryBulbTemperature, DailyAverageRelativeHumidity, DailyAverageWindSpeed)

The names for some of these variables are very long. If you want, you can change them to something shorter with the rename() function. The syntax for this function is rename(new_name = old_name). If you want to rename multiple columns, just add a comma after each change.

To Do

Add to the code to rename the columns to temp, humidity, and wind_speed. Also change DATE to lowercase for consistency.

pdx_data <- pdx_data |>
  ____________________

Answer

pdx_data <- pdx_data |>
  rename(date = DATE, 
         temp = DailyAverageDryBulbTemperature, 
         humidity = DailyAverageRelativeHumidity, 
         wind_speed = DailyAverageWindSpeed)

View your data to make sure the changes were successful.

Now you’ll notice that our data has a lot of NA values. If you look at the raw data you can see that this is because there are different sources for the data and only source #6 has values for daily variables.

We’ll use filter() to remove any rows with missing values in the temp column. We only need to specify that column because the missing data pattern is the same for the other columns. Within the filter we use the ! operator, which can be read as “not”, in conjunction with is.na() to drop the rows where is.na() returns TRUE.

pdx_data <- pdx_data |>
  filter(!is.na(temp))
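To see what !is.na() is doing inside the filter, it can help to try it on a small vector first. This sketch uses made-up numbers (example_temps is a hypothetical name, not part of the NOAA data) and base R only.

```r
# A made-up vector standing in for a column with missing values
example_temps <- c(55, NA, 61, NA, 48)

is.na(example_temps)     # TRUE where a value is missing
!is.na(example_temps)    # flipped: TRUE where a value is present

# Keep only the present values, which is what
# filter(!is.na(temp)) does row by row
kept <- example_temps[!is.na(example_temps)]
kept

# Count how many values were missing
n_missing <- sum(is.na(example_temps))
n_missing
```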

You can see in the Environment pane (top right) that we now have many fewer rows.

Calculating Desired Variables

We want heat index and wind chill which are not in our data directly, but we can calculate them and add them as new columns with the mutate() function.

First up is heat index. This formula is a doozy. See more about heat index here. Heat index is only calculated for temperatures of 80 degrees Fahrenheit or above, so we’re going to use the case_when() function to pick out those temperatures for the formula to apply to. If the temperature is under 80, heat index isn’t meaningful, so we’ll set those values to NA.

pdx_data <- pdx_data |>
  mutate(heat_index = case_when(
      temp >= 80 ~ -42.379 +
                    2.04901523  * temp +
                    10.14333127 * humidity +
                   -0.22475541  * temp * humidity +
                   -0.00683783  * temp^2 +
                   -0.05481717  * humidity^2 +
                    0.00122874  * temp^2 * humidity +
                    0.00085282  * temp * humidity^2 +
                   -0.00000199  * temp^2 * humidity^2,
      .default = NA
    ))

Bonkers formula, but now you can see there’s a new column of heat_index. You’ll notice it is NA for most rows, which is expected since Portland temperature is mostly under 80.
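With this many coefficients, one typo is easy to miss, so a spot check on a single value is worthwhile. The sketch below rewrites the same formula as a standalone function (heat_index_f is a hypothetical helper name, not something from a package) and tries it on one hot, humid day.

```r
# Hypothetical helper: the same heat index formula as above,
# written as a standalone function for checking by hand
heat_index_f <- function(temp, humidity) {
  -42.379 +
    2.04901523 * temp +
    10.14333127 * humidity -
    0.22475541 * temp * humidity -
    0.00683783 * temp^2 -
    0.05481717 * humidity^2 +
    0.00122874 * temp^2 * humidity +
    0.00085282 * temp * humidity^2 -
    0.00000199 * temp^2 * humidity^2
}

# Spot check: 90 degrees F at 50% relative humidity
check_value <- heat_index_f(temp = 90, humidity = 50)
round(check_value, 1)   # about 94.6
```

The result is a few degrees above the air temperature, which is what we expect on a humid day. If your own heat_index column gives wildly different numbers for similar conditions, look for a mistyped coefficient or sign.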

To Do

Now do the same thing for wind chill. That formula is a bit simpler. It is only used when temperature is 50 degrees Fahrenheit or below and wind speed is 3 mph or above, so we’ll set anything outside of that to NA, since wind chill isn’t meaningful in those conditions. Try to fill in the gaps in this code. Information on the & symbol can be found on this page about logical operators.

_____ <- _____ |>
  _____(wind_chill = _____(
      _____ & _____ ~ 35.74 +
                       0.6215 * temp +
                       -35.75 * wind_speed^0.16 +
                       0.4275 * temp * wind_speed^0.16,
      _____
    ))

Hint

Here’s a little bit more filled in. We can use <= to allow values equal to 50. Use this same format for wind speed. For the .default, use NA_real_. The _real_ part tells R the NA should be a numeric type, which matches the rest of the column.

_____ <- _____ |>
  _____(wind_chill = _____(
      temp <= 50 & _____ ~ 35.74 +
                            0.6215 * temp +
                            -35.75 * wind_speed^0.16 +
                            0.4275 * temp * wind_speed^0.16,
      _____
    ))

Answer

pdx_data <- pdx_data |>
  mutate(wind_chill = case_when(
      temp <= 50 & wind_speed >= 3 ~ 35.74 +
                               0.6215 * temp +
                              -35.75 * wind_speed^0.16 +
                               0.4275 * temp * wind_speed^0.16,
      .default = NA_real_
    ))

You should now have a dataset with 6 columns and at least 3493 rows (depending on what end date was available for download).
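As with heat index, you can spot check the wind chill formula on a single value. In this sketch wind_chill_f is a hypothetical helper name, and the inputs are made up for the check.

```r
# Hypothetical helper: the same wind chill formula as above,
# written as a standalone function for checking by hand
wind_chill_f <- function(temp, wind_speed) {
  35.74 +
    0.6215 * temp -
    35.75 * wind_speed^0.16 +
    0.4275 * temp * wind_speed^0.16
}

# Spot check: 30 degrees F with a 10 mph wind
check_value <- wind_chill_f(temp = 30, wind_speed = 10)
round(check_value, 1)   # about 21.2
```

The result is well below the air temperature, as it should be for a cold, windy day.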

Visualizing the Data

We can now graph the weather values over time. We’ll use the package {ggplot2} for this, which was loaded when we loaded {tidyverse}.

This package has a simple, common syntax for any type of graph:

ggplot(data = _______, aes(x = _______, y = _______)) +
  geom_TYPEOFPLOT()

Let’s parse that out:

ggplot() is the base command that makes graphs; it comes from the {ggplot2} package
data = is the data set we want to use
aes() stands for aesthetics; it’s where you tell R what you want to be on the graph
x = is where you name your x variable
y = is where you name your y variable
the line ends with a +, showing you that the code continues on the next line
geom_TYPEOFPLOT() is where you specify what kind of graph you want to make; the most popular options are:
- geom_point()
- geom_line()
- geom_col() or geom_bar()
- geom_histogram()
- geom_boxplot()

There are many other add-ons, but just those two lines of code will get you started for most types of graphs.
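Here is a filled-in version of that template to show the pattern in action. It uses economics, a practice dataset that ships with {ggplot2}, rather than our weather data; example_plot is just an illustrative name.

```r
library(tidyverse)

# economics is a practice dataset included with {ggplot2};
# date goes on the x axis and unemploy on the y axis
example_plot <- ggplot(data = economics, aes(x = date, y = unemploy)) +
  geom_line()

# Print the plot object to draw the graph
example_plot
```

Swapping geom_line() for another geom, such as geom_point(), changes the type of graph without changing the first line.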

The most appropriate type of graph for a time series is a line graph. Before we make our graph, we need to make sure our data is in the right format. Use glimpse() to show the data type for each column.

glimpse(pdx_data)

We can see that our weather variables are <dbl>, which means they are numeric data, which is good. Our date column is <dttm>, which means it is stored as a date-time. This is great because R knows how to treat dates as distinct from plain numbers or characters.

With our data cleaned and wrangled we should be able to make our line graph. We’ll start with temperature.

To Do

Alter the code below to have date be the x variable and temp be the y variable for our line graph.

ggplot(data = _______, aes(x = _______, y = _______)) +
  geom_TYPEOFPLOT()

Answer

ggplot(data = pdx_data, aes(x = date, y = temp)) +
  geom_line()

Making Prettier Graphs

The graph shows the data, but we can make a number of improvements to make our graph easier to understand and more professional.

Start by using the labs() command to specify the axis labels. To do this, add a + and another line of code that specifies the labels for each axis.

To Do

Add labels to your axes by putting the name inside the quotation marks.

ggplot(data = pdx_data, aes(x = date, y = temp)) +
  geom_line() + 
  labs(x = "_____",
       y = "_____")

You can also change the overall look of the graph. If you don’t like the gray background, look at the Themes section of this workshop to see how you can change it. If you don’t like the tick placement or labeling look at the Axis Ticks section of this workshop to change them.

You can also add color to your graph by adding a color option to geom_line.

ggplot(data = pdx_data, aes(x = date, y = temp)) +
  geom_line(color = "aquamarine4")

R knows a lot of colors by name. Here’s a document that lists them. You can also use hex values or specific color packages like this Wes Anderson palette (scroll down to see it).
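As a rough sketch of how these options combine, the code below checks a color name against R’s built-in list, then uses a hex value, axis labels, and a different theme on one graph. It uses the economics practice dataset from {ggplot2} so it does not give away the temperature graph; the hex value and labels are arbitrary choices for illustration.

```r
library(tidyverse)

# colors() returns every color name R knows;
# this checks that the name used above is one of them
name_is_valid <- "aquamarine4" %in% colors()
name_is_valid

styled_plot <- ggplot(data = economics, aes(x = date, y = unemploy)) +
  geom_line(color = "#2E8B57") +                    # a hex value instead of a name
  labs(x = "Date", y = "Unemployed (thousands)") +  # axis labels
  theme_minimal()                                   # replaces the gray background

styled_plot
```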

Figure it Out

Here’s a hard one for you to explore.

Let’s plot the heat index and wind chill on the same plot as the temperature. To do this, we’ll need to rearrange our data to make it easier to graph. We’ll use the pivot_longer() function to rearrange our table so that we have these columns: date, humidity, wind_speed, temp_type, and temp_value. In the temp_type column there will be three value options: temp, heat_index, and wind_chill. Their corresponding values will be in the temp_value column.

Use the pivot_longer() section of this tutorial to help you rearrange your table.

To Do

Alter this code to make the new table.

pdx_longform <- pdx_data |>
  pivot_longer(
    cols = _____, 
    names_to = "_____", 
    values_to = "_____")

Hint

You will use c(temp, heat_index, wind_chill) as the cols value.

pdx_longform <- pdx_data |>
  pivot_longer(
    cols = c(temp, heat_index, wind_chill), 
    names_to = "_____", 
    values_to = "_____")

Answer

pdx_longform <- pdx_data |>
  pivot_longer(
    cols = c(temp, heat_index, wind_chill), 
    names_to = "temp_type", 
    values_to = "temp_value")
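If it is hard to picture what pivot_longer() did, a tiny table makes the reshaping easier to see. This sketch uses made-up values and hypothetical column names (day, low, high), not the PDX data.

```r
library(tidyverse)

# Made-up values for illustration only, not real PDX data
toy_wide <- tibble(
  day  = c("Mon", "Tue"),
  low  = c(40, 42),
  high = c(55, 60)
)

# Stack the low and high columns: their names go into "measure"
# and their values go into "degrees"
toy_long <- toy_wide |>
  pivot_longer(
    cols = c(low, high),
    names_to = "measure",
    values_to = "degrees")

toy_long
```

The two-row table becomes four rows: one row per day per measure. The same thing happened to pdx_data, which is why pdx_longform has three times as many rows.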

Now we can graph the data and color the different lines by the temp_type. To do this, just add another option to the aes statement that tells it what variable to color by. Also, note that we are now using pdx_longform instead of pdx_data.

To Do

Add to the code.

ggplot(data = pdx_longform, aes(x = date, y = ____, color = _____)) +
  geom_line()

Answer

ggplot(data = pdx_longform, aes(x = date, y = temp_value, color = temp_type)) +
  geom_line()

That graph is technically accurate, but it’s very hard to learn anything from it. Now you’ll use Google and the resources below to figure out how to make your graph more informative. Maybe you want to change the line type; maybe you want to filter the data to just show one year; maybe you want to filter the data to just show wind chill and/or heat index. Up to you.

To Do

Add code to make the graph more legible and informative.

Resources:

Creating Your Final Dataset

Before you save your data ensure that everything looks the way you want it to using glimpse() or viewing it as a spreadsheet to check.

Right now, our date column includes a time with the date, and we want to remove that for consistency when combining with the other datasets. Run this line of code to convert the date-times to plain dates.

pdx_data$date <- as.Date(pdx_data$date)
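To see what as.Date() does to a single value, here is a small sketch with a made-up date-time. The time zone is set to UTC explicitly as an assumption for the example.

```r
# A made-up date-time for illustration, with the time zone set to UTC
example_dttm <- as.POSIXct("2016-01-01 23:59:00", tz = "UTC")
class(example_dttm)    # "POSIXct" "POSIXt", which glimpse() shows as <dttm>

# as.Date() drops the time and keeps only the calendar date
example_date <- as.Date(example_dttm)
class(example_date)    # "Date"
format(example_date)   # "2016-01-01"
```

After the conversion, glimpse(pdx_data) should show the date column as <date> instead of <dttm>.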

When you have your data how you want it, we will use write_csv() to save our output to a csv file. The syntax is to put the name of the data object, then the name of the file you want to save it as. By default it will save in your project folder. If you want to save it elsewhere, you will need to change the file path. For more on file paths and the importance of file structure, see this page on file organization.

Since it will be cleaner to combine the data when each day is a single row, we’ll use the pdx_data rather than the pdx_longform data.

Give your output csv an informative name so it can be combined with the other datasets.

write_csv(pdx_data, "pdx_daily_noaa.csv")
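One way to confirm a file saved the way you expected is to read it back in and look at it. The sketch below does this round trip with a made-up two-row table and a temporary file (toy_data, toy_path, and toy_check are illustrative names), so it won’t touch your project folder.

```r
library(tidyverse)

# Made-up values for illustration only, not real PDX data
toy_data <- tibble(
  date = as.Date(c("2016-01-01", "2016-01-02")),
  temp = c(41, 39)
)

# A temporary file so this sketch does not write into your project folder
toy_path <- tempfile(fileext = ".csv")
write_csv(toy_data, toy_path)

# Read the file back in and check the column names and row count
toy_check <- read_csv(toy_path, show_col_types = FALSE)
names(toy_check)
nrow(toy_check)
```

For your own file, the same check would be read_csv("pdx_daily_noaa.csv") followed by glimpse().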