This guide will show you how to obtain and summarize weather data using R. We will be using the GSODR library to get data from the Global Surface Summary of the Day (GSOD) dataset, which contains daily weather observations from over 9000 weather stations worldwide. See here for more detailed information: https://docs.ropensci.org/GSODR/.
We will first load the necessary libraries. In this case, we need
GSODR to access the GSOD data and tidyverse to
manipulate and visualize the data.
library(GSODR)
library(tidyverse)
We first need to define the latitude and longitude of the location of
interest. For this example, we will use
latLong <- c(51.736844, -5.285334), which corresponds to
Skomer Island, Wales. You can get the latitude and longitude information
for your site using Google Maps.
latLong <- c(51.736844, -5.285334)
We will also define the radius (in km) of the circle where we will
search for weather stations. For this example, we will use
radius <- 20, which means we will search for weather
stations within a 20 km radius of Skomer Island.
radius <- 20
To get a list of weather stations that are within the search area, we
will use the nearest_stations() function from the
GSODR library. This function takes the latitude, longitude,
and radius as arguments and returns a list of weather stations that are
within the search area. We will store the result in the
closeStation variable.
If there are more than one you should try varying radius
to find the closest one that has data. You could also (with extra data
manipulation) collect data from all of the nearby weather stations and
take an average of them - that is beyond this tutorial though!
(closeStation <- GSODR::nearest_stations(latLong[1], latLong[2], radius))
## [1] "036010-99999" "036040-99999" "996860-99999" "036030-99999"
We will use the get_GSOD() function from the
GSODR library to obtain daily weather observations for the
years 2000 to 2023. This function requires station information and
returns a data frame with weather data. We will use the information for
the second weather station in the closeStation list to
retrieve the data.
Note that the get_GSOD() function may take some time to
run, so it’s a good idea to save the output as a CSV file for future
use. Additionally, all units are converted to SI units when using the
get_GSOD() function. For example, inches to millimeters,
Fahrenheit to Celsius and so on.
It’s possible that some stations may have no data available, in which
case the get_GSOD() function will return an empty data
frame. If this happens, try again with a different location.
The following code checks whether a file named
“weather_data.csv” exists in the current directory. If the
file does not exist, the get_GSOD() function is called to
retrieve the weather data, and the data is saved as a CSV file named
“weather_data.csv”. On the other hand, if the file already
exists, the code reads the data from the CSV file using the
read.csv() function and stores it in the
weather_data variable.
if (!file.exists("weather_data.csv")) {
weather_data <- get_GSOD(years = 2000:2023, station = closeStation[2])
write.csv(x = weatherData, file = "weather_data.csv", row.names = FALSE)
} else {
weather_data <- read.csv("weather_data.csv")
}
Now we can get a summary of this data.
summary(weather_data)
## STNID NAME CTRY COUNTRY_NAME
## Length:8448 Length:8448 Length:8448 Length:8448
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## ISO2C ISO3C STATE LATITUDE
## Length:8448 Length:8448 Mode:logical Min. :51.7
## Class :character Class :character NA's:8448 1st Qu.:51.7
## Mode :character Mode :character Median :51.7
## Mean :51.7
## 3rd Qu.:51.7
## Max. :51.7
##
## LONGITUDE ELEVATION BEGIN END
## Min. :-5.05 Min. :32 Min. :19730101 Min. :20221231
## 1st Qu.:-5.05 1st Qu.:32 1st Qu.:19730101 1st Qu.:20221231
## Median :-5.05 Median :32 Median :19730101 Median :20221231
## Mean :-5.05 Mean :32 Mean :19730101 Mean :20221231
## 3rd Qu.:-5.05 3rd Qu.:32 3rd Qu.:19730101 3rd Qu.:20221231
## Max. :-5.05 Max. :32 Max. :19730101 Max. :20221231
##
## YEARMODA YEAR MONTH DAY
## Length:8448 Min. :2000 Min. : 1.000 Min. : 1.00
## Class :character 1st Qu.:2005 1st Qu.: 3.000 1st Qu.: 8.00
## Mode :character Median :2011 Median : 7.000 Median :16.00
## Mean :2011 Mean : 6.488 Mean :15.73
## 3rd Qu.:2017 3rd Qu.:10.000 3rd Qu.:23.00
## Max. :2023 Max. :12.000 Max. :31.00
##
## YDAY TEMP TEMP_ATTRIBUTES DEWP
## Min. : 1.0 Min. :-3.9 Min. : 4.00 Min. :-8.600
## 1st Qu.: 90.0 1st Qu.: 8.1 1st Qu.:24.00 1st Qu.: 5.200
## Median :182.0 Median :10.9 Median :24.00 Median : 8.500
## Mean :182.1 Mean :11.0 Mean :23.71 Mean : 8.228
## 3rd Qu.:274.0 3rd Qu.:14.4 3rd Qu.:24.00 3rd Qu.:11.700
## Max. :366.0 Max. :24.6 Max. :24.00 Max. :18.600
## NA's :6
## DEWP_ATTRIBUTES SLP SLP_ATTRIBUTES STP
## Min. : 0.00 Min. : 966.7 Min. : 0.00 Min. : 0.0
## 1st Qu.:24.00 1st Qu.:1008.0 1st Qu.:24.00 1st Qu.: 9.0
## Median :24.00 Median :1015.8 Median :24.00 Median : 15.7
## Mean :23.66 Mean :1014.9 Mean :23.68 Mean :181.0
## 3rd Qu.:24.00 3rd Qu.:1022.5 3rd Qu.:24.00 3rd Qu.: 25.1
## Max. :24.00 Max. :1048.2 Max. :24.00 Max. :999.8
## NA's :2 NA's :647
## STP_ATTRIBUTES VISIB VISIB_ATTRIBUTES WDSP
## Min. : 0.00 Min. : 0.3 Min. : 0.00 Min. : 0.000
## 1st Qu.:24.00 1st Qu.:14.0 1st Qu.:24.00 1st Qu.: 3.400
## Median :24.00 Median :20.6 Median :24.00 Median : 4.800
## Mean :21.94 Mean :22.9 Mean :23.62 Mean : 5.288
## 3rd Qu.:24.00 3rd Qu.:29.3 3rd Qu.:24.00 3rd Qu.: 6.800
## Max. :24.00 Max. :74.2 Max. :24.00 Max. :16.300
## NA's :3 NA's :8
## WDSP_ATTRIBUTES MXSPD GUST MAX
## Min. : 0.00 Min. : 2.100 Min. : 8.7 Min. :-2.20
## 1st Qu.:24.00 1st Qu.: 5.700 1st Qu.:13.9 1st Qu.:10.40
## Median :24.00 Median : 7.700 Median :15.9 Median :13.50
## Mean :23.67 Mean : 8.161 Mean :16.7 Mean :13.79
## 3rd Qu.:24.00 3rd Qu.:10.300 3rd Qu.:18.5 3rd Qu.:17.30
## Max. :24.00 Max. :22.600 Max. :37.1 Max. :30.80
## NA's :10 NA's :5514
## MAX_ATTRIBUTES MIN MIN_ATTRIBUTES PRCP
## Length:8448 Min. :-6.700 Length:8448 Min. : 0.000
## Class :character 1st Qu.: 5.500 Class :character 1st Qu.: 0.000
## Mode :character Median : 8.500 Mode :character Median : 0.510
## Mean : 8.374 Mean : 2.911
## 3rd Qu.:11.700 3rd Qu.: 3.560
## Max. :19.000 Max. :121.920
## NA's :38
## PRCP_ATTRIBUTES SNDP I_FOG I_RAIN_DRIZZLE
## Length:8448 Min. :10.20 Min. :0.0000 Min. :0.00000
## Class :character 1st Qu.:15.28 1st Qu.:0.0000 1st Qu.:0.00000
## Mode :character Median :35.55 Median :0.0000 Median :0.00000
## Mean :35.57 Mean :0.1977 Mean :0.01172
## 3rd Qu.:48.25 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :71.10 Max. :1.0000 Max. :1.00000
## NA's :8442
## I_SNOW_ICE I_HAIL I_THUNDER I_TORNADO_FUNNEL
## Min. :0.000000 Min. :0.000000 Min. :0 Min. :0
## 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0 1st Qu.:0
## Median :0.000000 Median :0.000000 Median :0 Median :0
## Mean :0.001776 Mean :0.001658 Mean :0 Mean :0
## 3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:0 3rd Qu.:0
## Max. :1.000000 Max. :1.000000 Max. :0 Max. :0
## NA's :4 NA's :4 NA's :8 NA's :1478
## EA ES RH
## Min. :0.300 Min. :0.500 Min. : 45.00
## 1st Qu.:0.900 1st Qu.:1.100 1st Qu.: 77.60
## Median :1.100 Median :1.300 Median : 84.10
## Mean :1.131 Mean :1.353 Mean : 83.43
## 3rd Qu.:1.400 3rd Qu.:1.600 3rd Qu.: 90.20
## Max. :2.100 Max. :3.100 Max. :100.00
## NA's :6 NA's :6
We can also check number of rows of data, per year. This can be useful to check for data consistency and look for missing data. In this case, though there are clearly one or two problems (e.g., missing days) it looks good enough to work with.
table(weather_data$YEAR)
##
## 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
## 366 364 365 362 366 365 365 365 366 364 365 365 366 365 365 365
## 2016 2017 2018 2019 2020 2021 2022 2023
## 366 365 364 365 364 363 365 57
Now that we have the weather data, we can summarize it to make it more usable. There are many ways to do this, depending on your question or hypothesis. For this example, we will create a new variable that categorizes each day as a storm day (1) or non-storm day (0) based on wind speed.
First, it creates a new variable called storm_day that
categorizes each day as a storm day (1) or non-storm day (0) based on
the wind speed (WDSP). If the wind speed is greater than
10, then the day is classified as a storm day, otherwise, it is a
non-storm day.
Next, it groups the data by year and month using the
group_by() function.
Finally, it calculates the total number of stormy days
(n_storm_day) for each month using the
summarise() function. The output is a new dataset called
weather_summary that includes the number of stormy days in
each month of each year.
weather_summary <- weather_data %>%
# Create a new variable that categorizes the day as a storm day (1) or
# non-storm day (0), based on WDSP (wind speed)
mutate(storm_day = ifelse(WDSP > 10, 1, 0)) %>%
group_by(year = YEAR, month = MONTH) %>%
# Sum the storm day variable to get number of stormy days in each month
summarise(n_storm_day = sum(storm_day))
Here are the first few lines of the summarised weather data that was generated from the previous code block.
weather_summary
## # A tibble: 278 × 3
## # Groups: year [24]
## year month n_storm_day
## <int> <int> <dbl>
## 1 2000 1 4
## 2 2000 2 5
## 3 2000 3 0
## 4 2000 4 2
## 5 2000 5 0
## 6 2000 6 0
## 7 2000 7 0
## 8 2000 8 0
## 9 2000 9 0
## 10 2000 10 6
## # … with 268 more rows
Finally, I can create a boxplot that shows the distribution of stormy
days by month. The weather_summary dataset is used as the
data source for this plot.
The geom_boxplot() function is used to create the
boxplot itself. The outlier.colour = NA argument is used to
remove the outlier points from the plot. I do this because I will add
all of the points using geom_jitter(). The
geom_jitter() function is used to add a jitter to the
points so that they do not overlap. The height argument specifies the
amount of vertical jitter to use (I don’t want much).
ggplot(weather_summary, aes(x = as.factor(month), y = n_storm_day)) +
geom_boxplot(outlier.colour = NA) +
geom_jitter(height = 0.1) +
labs(x = "Month", y = "Storm days")