Introduction

This guide will show you how to obtain and summarize weather data using R. We will be using the GSODR library to get data from the Global Surface Summary of the Day (GSOD) dataset, which contains daily weather observations from over 9000 weather stations worldwide. See here for more detailed information: https://docs.ropensci.org/GSODR/.

We will first load the necessary libraries. In this case, we need GSODR to access the GSOD data and tidyverse to manipulate and visualize the data.

library(GSODR)
library(tidyverse)

Getting the weather data

We first need to define the latitude and longitude of the location of interest. For this example, we will use latLong <- c(51.736844, -5.285334), which corresponds to Skomer Island, Wales. You can get the latitude and longitude information for your site using Google Maps.

latLong <- c(51.736844, -5.285334)

We will also define the radius (in km) of the circle where we will search for weather stations. For this example, we will use radius <- 20, which means we will search for weather stations within a 20 km radius of Skomer Island.

radius <- 20

To get a list of weather stations that are within the search area, we will use the nearest_stations() function from the GSODR library. This function takes the latitude, longitude, and radius as arguments and returns a list of weather stations that are within the search area. We will store the result in the closeStation variable.

If there are more than one you should try varying radius to find the closest one that has data. You could also (with extra data manipulation) collect data from all of the nearby weather stations and take an average of them - that is beyond this tutorial though!

(closeStation <- GSODR::nearest_stations(latLong[1], latLong[2], radius))
## [1] "036010-99999" "036040-99999" "996860-99999" "036030-99999"

We will use the get_GSOD() function from the GSODR library to obtain daily weather observations for the years 2000 to 2023. This function requires station information and returns a data frame with weather data. We will use the information for the second weather station in the closeStation list to retrieve the data.

Note that the get_GSOD() function may take some time to run, so it’s a good idea to save the output as a CSV file for future use. Additionally, all units are converted to SI units when using the get_GSOD() function. For example, inches to millimeters, Fahrenheit to Celsius and so on.

It’s possible that some stations may have no data available, in which case the get_GSOD() function will return an empty data frame. If this happens, try again with a different location.

The following code checks whether a file named “weather_data.csv” exists in the current directory. If the file does not exist, the get_GSOD() function is called to retrieve the weather data, and the data is saved as a CSV file named “weather_data.csv”. On the other hand, if the file already exists, the code reads the data from the CSV file using the read.csv() function and stores it in the weather_data variable.

if (!file.exists("weather_data.csv")) {
  weather_data <- get_GSOD(years = 2000:2023, station = closeStation[2])
  write.csv(x = weatherData, file = "weather_data.csv", row.names = FALSE)
} else {
  weather_data <- read.csv("weather_data.csv")
}

Now we can get a summary of this data.

summary(weather_data)
##     STNID               NAME               CTRY           COUNTRY_NAME      
##  Length:8448        Length:8448        Length:8448        Length:8448       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##     ISO2C              ISO3C            STATE            LATITUDE   
##  Length:8448        Length:8448        Mode:logical   Min.   :51.7  
##  Class :character   Class :character   NA's:8448      1st Qu.:51.7  
##  Mode  :character   Mode  :character                  Median :51.7  
##                                                       Mean   :51.7  
##                                                       3rd Qu.:51.7  
##                                                       Max.   :51.7  
##                                                                     
##    LONGITUDE       ELEVATION      BEGIN               END          
##  Min.   :-5.05   Min.   :32   Min.   :19730101   Min.   :20221231  
##  1st Qu.:-5.05   1st Qu.:32   1st Qu.:19730101   1st Qu.:20221231  
##  Median :-5.05   Median :32   Median :19730101   Median :20221231  
##  Mean   :-5.05   Mean   :32   Mean   :19730101   Mean   :20221231  
##  3rd Qu.:-5.05   3rd Qu.:32   3rd Qu.:19730101   3rd Qu.:20221231  
##  Max.   :-5.05   Max.   :32   Max.   :19730101   Max.   :20221231  
##                                                                    
##    YEARMODA              YEAR          MONTH             DAY       
##  Length:8448        Min.   :2000   Min.   : 1.000   Min.   : 1.00  
##  Class :character   1st Qu.:2005   1st Qu.: 3.000   1st Qu.: 8.00  
##  Mode  :character   Median :2011   Median : 7.000   Median :16.00  
##                     Mean   :2011   Mean   : 6.488   Mean   :15.73  
##                     3rd Qu.:2017   3rd Qu.:10.000   3rd Qu.:23.00  
##                     Max.   :2023   Max.   :12.000   Max.   :31.00  
##                                                                    
##       YDAY            TEMP      TEMP_ATTRIBUTES      DEWP       
##  Min.   :  1.0   Min.   :-3.9   Min.   : 4.00   Min.   :-8.600  
##  1st Qu.: 90.0   1st Qu.: 8.1   1st Qu.:24.00   1st Qu.: 5.200  
##  Median :182.0   Median :10.9   Median :24.00   Median : 8.500  
##  Mean   :182.1   Mean   :11.0   Mean   :23.71   Mean   : 8.228  
##  3rd Qu.:274.0   3rd Qu.:14.4   3rd Qu.:24.00   3rd Qu.:11.700  
##  Max.   :366.0   Max.   :24.6   Max.   :24.00   Max.   :18.600  
##                                                 NA's   :6       
##  DEWP_ATTRIBUTES      SLP         SLP_ATTRIBUTES       STP       
##  Min.   : 0.00   Min.   : 966.7   Min.   : 0.00   Min.   :  0.0  
##  1st Qu.:24.00   1st Qu.:1008.0   1st Qu.:24.00   1st Qu.:  9.0  
##  Median :24.00   Median :1015.8   Median :24.00   Median : 15.7  
##  Mean   :23.66   Mean   :1014.9   Mean   :23.68   Mean   :181.0  
##  3rd Qu.:24.00   3rd Qu.:1022.5   3rd Qu.:24.00   3rd Qu.: 25.1  
##  Max.   :24.00   Max.   :1048.2   Max.   :24.00   Max.   :999.8  
##                  NA's   :2                        NA's   :647    
##  STP_ATTRIBUTES      VISIB      VISIB_ATTRIBUTES      WDSP       
##  Min.   : 0.00   Min.   : 0.3   Min.   : 0.00    Min.   : 0.000  
##  1st Qu.:24.00   1st Qu.:14.0   1st Qu.:24.00    1st Qu.: 3.400  
##  Median :24.00   Median :20.6   Median :24.00    Median : 4.800  
##  Mean   :21.94   Mean   :22.9   Mean   :23.62    Mean   : 5.288  
##  3rd Qu.:24.00   3rd Qu.:29.3   3rd Qu.:24.00    3rd Qu.: 6.800  
##  Max.   :24.00   Max.   :74.2   Max.   :24.00    Max.   :16.300  
##                  NA's   :3                       NA's   :8       
##  WDSP_ATTRIBUTES     MXSPD             GUST           MAX       
##  Min.   : 0.00   Min.   : 2.100   Min.   : 8.7   Min.   :-2.20  
##  1st Qu.:24.00   1st Qu.: 5.700   1st Qu.:13.9   1st Qu.:10.40  
##  Median :24.00   Median : 7.700   Median :15.9   Median :13.50  
##  Mean   :23.67   Mean   : 8.161   Mean   :16.7   Mean   :13.79  
##  3rd Qu.:24.00   3rd Qu.:10.300   3rd Qu.:18.5   3rd Qu.:17.30  
##  Max.   :24.00   Max.   :22.600   Max.   :37.1   Max.   :30.80  
##                  NA's   :10       NA's   :5514                  
##  MAX_ATTRIBUTES          MIN         MIN_ATTRIBUTES          PRCP        
##  Length:8448        Min.   :-6.700   Length:8448        Min.   :  0.000  
##  Class :character   1st Qu.: 5.500   Class :character   1st Qu.:  0.000  
##  Mode  :character   Median : 8.500   Mode  :character   Median :  0.510  
##                     Mean   : 8.374                      Mean   :  2.911  
##                     3rd Qu.:11.700                      3rd Qu.:  3.560  
##                     Max.   :19.000                      Max.   :121.920  
##                                                         NA's   :38       
##  PRCP_ATTRIBUTES         SNDP           I_FOG        I_RAIN_DRIZZLE   
##  Length:8448        Min.   :10.20   Min.   :0.0000   Min.   :0.00000  
##  Class :character   1st Qu.:15.28   1st Qu.:0.0000   1st Qu.:0.00000  
##  Mode  :character   Median :35.55   Median :0.0000   Median :0.00000  
##                     Mean   :35.57   Mean   :0.1977   Mean   :0.01172  
##                     3rd Qu.:48.25   3rd Qu.:0.0000   3rd Qu.:0.00000  
##                     Max.   :71.10   Max.   :1.0000   Max.   :1.00000  
##                     NA's   :8442                                      
##    I_SNOW_ICE           I_HAIL           I_THUNDER I_TORNADO_FUNNEL
##  Min.   :0.000000   Min.   :0.000000   Min.   :0   Min.   :0       
##  1st Qu.:0.000000   1st Qu.:0.000000   1st Qu.:0   1st Qu.:0       
##  Median :0.000000   Median :0.000000   Median :0   Median :0       
##  Mean   :0.001776   Mean   :0.001658   Mean   :0   Mean   :0       
##  3rd Qu.:0.000000   3rd Qu.:0.000000   3rd Qu.:0   3rd Qu.:0       
##  Max.   :1.000000   Max.   :1.000000   Max.   :0   Max.   :0       
##  NA's   :4          NA's   :4          NA's   :8   NA's   :1478    
##        EA              ES              RH        
##  Min.   :0.300   Min.   :0.500   Min.   : 45.00  
##  1st Qu.:0.900   1st Qu.:1.100   1st Qu.: 77.60  
##  Median :1.100   Median :1.300   Median : 84.10  
##  Mean   :1.131   Mean   :1.353   Mean   : 83.43  
##  3rd Qu.:1.400   3rd Qu.:1.600   3rd Qu.: 90.20  
##  Max.   :2.100   Max.   :3.100   Max.   :100.00  
##  NA's   :6                       NA's   :6

We can also check number of rows of data, per year. This can be useful to check for data consistency and look for missing data. In this case, though there are clearly one or two problems (e.g., missing days) it looks good enough to work with.

table(weather_data$YEAR)
## 
## 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 
##  366  364  365  362  366  365  365  365  366  364  365  365  366  365  365  365 
## 2016 2017 2018 2019 2020 2021 2022 2023 
##  366  365  364  365  364  363  365   57

Summarising the weather data

Now that we have the weather data, we can summarize it to make it more usable. There are many ways to do this, depending on your question or hypothesis. For this example, we will create a new variable that categorizes each day as a storm day (1) or non-storm day (0) based on wind speed.

First, it creates a new variable called storm_day that categorizes each day as a storm day (1) or non-storm day (0) based on the wind speed (WDSP). If the wind speed is greater than 10, then the day is classified as a storm day, otherwise, it is a non-storm day.

Next, it groups the data by year and month using the group_by() function.

Finally, it calculates the total number of stormy days (n_storm_day) for each month using the summarise() function. The output is a new dataset called weather_summary that includes the number of stormy days in each month of each year.

weather_summary <- weather_data %>%
  # Create a new variable that categorizes the day as a storm day (1) or
  # non-storm day (0), based on WDSP (wind speed)
  mutate(storm_day = ifelse(WDSP > 10, 1, 0)) %>%
  group_by(year = YEAR, month = MONTH) %>%
  # Sum the storm day variable to get number of stormy days in each month
  summarise(n_storm_day = sum(storm_day))

Here are the first few lines of the summarised weather data that was generated from the previous code block.

weather_summary
## # A tibble: 278 × 3
## # Groups:   year [24]
##     year month n_storm_day
##    <int> <int>       <dbl>
##  1  2000     1           4
##  2  2000     2           5
##  3  2000     3           0
##  4  2000     4           2
##  5  2000     5           0
##  6  2000     6           0
##  7  2000     7           0
##  8  2000     8           0
##  9  2000     9           0
## 10  2000    10           6
## # … with 268 more rows

Finally, I can create a boxplot that shows the distribution of stormy days by month. The weather_summary dataset is used as the data source for this plot.

The geom_boxplot() function is used to create the boxplot itself. The outlier.colour = NA argument is used to remove the outlier points from the plot. I do this because I will add all of the points using geom_jitter(). The geom_jitter() function is used to add a jitter to the points so that they do not overlap. The height argument specifies the amount of vertical jitter to use (I don’t want much).

ggplot(weather_summary, aes(x = as.factor(month), y = n_storm_day)) +
  geom_boxplot(outlier.colour = NA) +
  geom_jitter(height = 0.1) +
  labs(x = "Month", y = "Storm days")