This notebook, “1-p1-explore-weather-trends.Rmd” uses original content from the Data Analyst Nanodegree by Udacity, used under CC BY-NC-ND 4.0. “1-p1-explore-weather-trends.Rmd” is licensed under CC0 by Clare Gibson

All supporting files can be found in the Github repo

1 Summary

In this project, I will analyse local and global temperature data and compare the temperature trends where I live to overall global temperature trends.

Udacity provided a database with three tables for this project:

city_list contains a list of cities and countries in the database.
city_data contains the average temperatures for each city by year (°C).
global_data contains the average global temperatures by year (°C).

Udacity did not provide a link to the source of this data, nor did they explain how the yearly average temperatures were calculated.

2 Preparation

2.1 Extracting the data

Using the SQL Workspace provided by Udacity, I began this project by writing a query on the city_list table to see which cities were available in my country so that I could select the one that is closest to where I live. I live in the UK, so I used 'United Kingdom' as my search parameter.

--SQL Query #1
SELECT *
FROM city_list
WHERE country = 'United Kingdom'
ORDER BY city

I downloaded the results to a CSV file. The results looked like this:

UK cities in the project database
city	country
Belfast	United Kingdom
Birmingham	United Kingdom
Cardiff	United Kingdom
Edinburgh	United Kingdom
London	United Kingdom

My nearest city in this list is London, so that is the city I will use for my analysis of local data.

I then wrote a second SQL query to extract data from the city_data database for London.

--SQL Query #2
SELECT *
FROM city_data
WHERE country = 'United Kingdom'
AND city = 'London'

This query returned 271 results. I downloaded the results to a CSV file and the first 6 rows are shown below.

Average yearly temperatures for London
year	city	country	avg_temp
1743	London	United Kingdom	7.54
1744	London	United Kingdom	10.34
1745	London	United Kingdom	4.13
1746	London	United Kingdom	NA
1747	London	United Kingdom	NA
1748	London	United Kingdom	NA

Finally, I wrote a query to extract everything from the global_data database.

--SQL Query #3
SELECT *
FROM global_data

This query returned 266 results. I downloaded the results to a CSV file and the first 6 rows are displayed below.

Average yearly global temperatures
year	avg_temp
1750	8.72
1751	7.98
1752	5.78
1753	8.39
1754	8.47
1755	8.36

2.2 Unifying the data

The data in london and global shares a similar structure. global contains two columns of data:

names(global)

## [1] "year"     "avg_temp"

while london shares those same column plus two more:

names(london)

## [1] "year"     "city"     "country"  "avg_temp"

We can add country and city columns to the global data (with the single value “Global” in each), which will then allow us to combine these two tables in a union. This will allow us to analyse both the local and the global data in the same plot.

# First add country and city columns to global
global <- global %>% 
  # create the new columns
  mutate(city = "Global",
         country = "Global") %>% 
  # arrange columns in same order as london data frame
  select(year, city, country, everything())

# Next create a merged data frame
temps <- global %>% 
  bind_rows(london) %>% 
  arrange(year, country, city)

kable(head(temps, 10),
      caption = "Average yearly temperatures for London and the globe")

Average yearly temperatures for London and the globe
year	city	country	avg_temp
1743	London	United Kingdom	7.54
1744	London	United Kingdom	10.34
1745	London	United Kingdom	4.13
1746	London	United Kingdom	NA
1747	London	United Kingdom	NA
1748	London	United Kingdom	NA
1749	London	United Kingdom	NA
1750	Global	Global	8.72
1750	London	United Kingdom	10.25
1751	Global	Global	7.98

After combining the data, I noticed that the data for London begins in the year 1743, whereas the global data only begins in 1750. If we look at the tail of the data frame, we can see that global includes data up to the year 2015, whereas london only goes as far as 2013.

kable(tail(temps),
      caption = "Tail of the `temps` table")

Tail of the `temps` table
year	city	country	avg_temp
2012	Global	Global	9.51
2012	London	United Kingdom	10.00
2013	Global	Global	9.61
2013	London	United Kingdom	9.91
2014	Global	Global	9.57
2015	Global	Global	9.83

Since we are interested in the comparison between the two regions, I will filter the temps data frame to include only records for years which have data for both london and global. Conveniently, this also serves to remove the rows with missing values from the London data.

temps <- temps %>% 
  filter(year >= 1750,
         year <= 2013)

kable(head(temps),
      caption = "Temperature table after filtering to include only data for years present in both data sets")

Temperature table after filtering to include only data for years present in both data sets
year	city	country	avg_temp
1750	Global	Global	8.72
1750	London	United Kingdom	10.25
1751	Global	Global	7.98
1751	London	United Kingdom	9.99
1752	Global	Global	5.78
1752	London	United Kingdom	6.54

3 Exploration

The project rubric asks me to calculate a moving average to use in the line charts. Moving averages are typically used for time-based data when there are known to be fluctuations or peaks and troughs on certain days or seasons. My notes on moving averages have more detail on their uses cases and how to calculate them.

It is not immediately obvious to me how a yearly average temperature statistic would have fluctuations. I’m going to start out by plotting a line graph for both the global and local data without calculating any moving averages, to get a feel for how the data fluctuates.

fig1 <- temps %>% 
  filter(city == "London") %>% 
  ggplot(mapping = aes(x = year, y = avg_temp)) +
  geom_line() +
  labs(x = "Year",
       y = "Average Temperature (°C)",
       title = "Average temperature by year in London 1750-2013")
  
fig1

Average temperature by year in London 1750-2013

fig2 <- temps %>% 
  filter(city == "Global") %>% 
  ggplot(mapping = aes(x = year, y = avg_temp)) +
  geom_line() +
  labs(x = "Year",
       y = "Average Temperature (°C)",
       title = "Global average temperature by year 1750-2013")
  
fig2

Global average temperature by year 1750-2013

From these initial plots I can make several observations:

It does appear that yearly average temperatures follow a somewhat cyclical trend of peaks and troughs, which seems to occur over a period of around 5 years. We could calculate moving averages over 5, 7 and 10 years and see which of those gives us the best insight into the data.
The london data appears to fluctuate between around 8°C and 11°C (a range of 3°C), whereas the global data appears to fluctuate between around 7°C and 9.5°C (a range of 2.5°C) with the gap between peaks and troughs becoming much narrower from around 1900 onwards.

3.1 Calculating the moving averages

In order to calculate the 5, 7 and 10 year moving averages, I will use a function called rollmean from the zoo package in R. The arguments required to be passed to rollmean are:

x an object (in this case either the london or global data frame)
k integer width of the rolling window (in our case either 5, 7 or 10 years)
fill filling values for observations outside of the rolling window.
align character specifying whether the index of the result should be left-, right- or centre-aligned compared to the rolling window of observations. In our case we’ll choose right-aligned so that the rolling window ends on the observation year.

temps <- temps %>% 
  arrange(year, country, city) %>% 
  group_by(city) %>% 
  mutate(temp_05ya = rollmean(x = avg_temp,
                              k = 5,
                              fill = NA,
                              align = "right"),
         temp_07ya = rollmean(x = avg_temp,
                              k = 7,
                              fill = NA,
                              align = "right"),
         temp_10ya = rollmean(x = avg_temp,
                              k = 10,
                              fill = NA,
                              align = "right"))
kable(head(temps, 20))

year	city	country	avg_temp	temp_05ya	temp_07ya	temp_10ya
1750	Global	Global	8.72	NA	NA	NA
1750	London	United Kingdom	10.25	NA	NA	NA
1751	Global	Global	7.98	NA	NA	NA
1751	London	United Kingdom	9.99	NA	NA	NA
1752	Global	Global	5.78	NA	NA	NA
1752	London	United Kingdom	6.54	NA	NA	NA
1753	Global	Global	8.39	NA	NA	NA
1753	London	United Kingdom	9.42	NA	NA	NA
1754	Global	Global	8.47	7.868	NA	NA
1754	London	United Kingdom	9.20	9.080	NA	NA
1755	Global	Global	8.36	7.796	NA	NA
1755	London	United Kingdom	8.95	8.820	NA	NA
1756	Global	Global	8.85	7.970	8.078571	NA
1756	London	United Kingdom	9.42	8.706	9.110000	NA
1757	Global	Global	9.02	8.618	8.121429	NA
1757	London	United Kingdom	9.34	9.266	8.980000	NA
1758	Global	Global	6.74	8.288	7.944286	NA
1758	London	United Kingdom	8.85	9.152	8.817143	NA
1759	Global	Global	7.99	8.192	8.260000	8.030
1759	London	United Kingdom	9.80	9.272	9.282857	9.176

Now let’s make a plot showing both London and global data on the same chart, using the 5-year moving average.

fig3 <- temps %>%  
  ggplot(mapping = aes(x = year, y = temp_05ya, color = city)) +
  geom_line() +
  labs(x = "Year",
       y = "Temperature (°C)",
       title = "Moving 5-year average temperature",
       subtitle = "London vs global average 1750-2013")
  
fig3

Moving 5-year average temperature for London and globally 1750-2013

Let’s repeat this for the 7-year and 10-year moving averages.

fig4 <- temps %>%  
  ggplot(mapping = aes(x = year, y = temp_07ya, color = city)) +
  geom_line() +
  labs(x = "Year",
       y = "Temperature (°C)",
       title = "Moving 7-year average temperature",
       subtitle = "London vs global average 1750-2013")
  
fig4

Moving 7-year average temperature for London and globally 1750-2013

fig5 <- temps %>%  
  ggplot(mapping = aes(x = year, y = temp_10ya, color = city)) +
  geom_line() +
  labs(x = "Year",
       y = "Temperature (°C)",
       title = "Moving 10-year average temperature",
       subtitle = "London vs global average 1750-2013")
  
fig5

Moving 10-year average temperatures for London and globally 1750-2013

I think that of all the moving average calculations I created, the 10-year moving average gives the best impression of overall trend while still retaining a sense of the peaks and troughs in the data.

3.2 Observations

From the analysis completed so far, I have the following observations:

The yearly average temperatures both globally and in London are somewhat lower than I imagined. I guess that this is because the yearly averages are calculated as an average of all times of day across all days of the year. This means that naturally those averages encompass both night time and day time temperatures, with nights generally being cooler than days. In my own lived experience, I am typically asleep during the night time, so I’m not noticing the night time temperatures.
There is a clear upward trend in temperatures both globally and in London which seems to start in around 1900, becoming much sharper from around 1970 onwards.
I was surprised to see that London shows higher average temperatures than the globe as a whole. I don’t think of London as being a particularly warm place! Looking at Figure 5, it appears that the difference between London and the global average is starting to become smaller.
I was also surprised to see that there is a cyclical trend in yearly average temperatures, and I wonder what causes this? I don’t know much about weather science but I wonder if there is some weather phenomenon such as El Niño that drives this?

4 Beyond the rubric

From my initial observations I was particularly struck by the fact that London’s average temperatures are consistently higher than the global average. My intuition would have me believe the opposite. I’m interested to know which other cities fall above the global average and which fall below.

In order to visualise this, I will need to extract the average yearly temperature data for all cities in the city_data database. I can use the following SQL query to do this:

--SQL Query #4
SELECT *
FROM city_data

This query returned 70,792 results, which I downloaded to a CSV file. The first 6 rows are shown below.

Average yearly temperatures for all cities
year	city	country	avg_temp
1849	Abidjan	Côte D’Ivoire	25.58
1850	Abidjan	Côte D’Ivoire	25.52
1851	Abidjan	Côte D’Ivoire	25.67
1852	Abidjan	Côte D’Ivoire	NA
1853	Abidjan	Côte D’Ivoire	NA
1854	Abidjan	Côte D’Ivoire	NA

For this analysis, I want to compare each city in the cities table against the global average. Therefore, instead of unioning the cities data with the global data, I am going to perform an inner join, so that the temperature statistics for the city and the global average sit side by side. I chose an inner join as I only want to keep records where we have a value for both the city and the global average.

temps_all <- cities %>% 
  inner_join(select(global,
                    year,
                    global_avg_temp = avg_temp))

Now let’s add some custom calculations to this data set:

difference between the city average temperature and the global average temperature;
moving 10-year average temperature for each city
moving 10-year global average temperature
difference between the city 10-year average and the global 10-year average.

temps_all <- temps_all %>% 
  arrange(country, city, year) %>% 
  group_by(country, city) %>% 
  mutate(diff = avg_temp - global_avg_temp,
         temp_10ya = rollmean(x = avg_temp,
                              k = 10,
                              fill = NA,
                              align = "right"),
         global_temp_10ya = rollmean(x = global_avg_temp,
                                     k = 10,
                                     fill = NA,
                                     align = "right"),
         diff_10ya = temp_10ya - global_temp_10ya)

kable(head(temps_all,10),
      caption = "Average yearly temperatures for individual cities and the global average")

Average yearly temperatures for individual cities and the global average
year	city	country	avg_temp	global_avg_temp	diff	temp_10ya	global_temp_10ya	diff_10ya
1833	Kabul	Afghanistan	13.91	8.01	5.90	NA	NA	NA
1834	Kabul	Afghanistan	13.91	8.15	5.76	NA	NA	NA
1835	Kabul	Afghanistan	14.71	7.39	7.32	NA	NA	NA
1836	Kabul	Afghanistan	NA	7.70	NA	NA	NA	NA
1837	Kabul	Afghanistan	15.47	7.38	8.09	NA	NA	NA
1838	Kabul	Afghanistan	18.50	7.51	10.99	NA	NA	NA
1839	Kabul	Afghanistan	NA	7.63	NA	NA	NA	NA
1840	Kabul	Afghanistan	13.70	7.80	5.90	NA	NA	NA
1841	Kabul	Afghanistan	14.17	7.69	6.48	NA	NA	NA
1842	Kabul	Afghanistan	15.37	8.02	7.35	NA	7.728	NA

To create the visuals for this secondary analysis, I am going to use Tableau, which is a visualisation tool that I am more familiar with. You can view the original visualisation here.

Tableau visualisation: Explore Weather Trends

4.1 Further Observations

There are many more cities with an average yearly temperature that is higher than the global average, than there are cities with an average yearly temperature that is lower than the global average. I was a little surprised by this at first. I expected that there would be more of a balance of cities falling either side of the global average. Upon reflection, however, I realised that having a sample made up only of major cities is not representative of the Earth as a whole. By their nature, cities are on land, so none of the data points come from the oceans, where temperatures may be cooler. Also, due to something called the “urban heat island” effect, urban areas in general tend to be hotter than non-urban areas.

Course 1: Welcome to the Nanodegree Program!

Project 1: Explore Weather Trends

Clare Gibson

2021-06-24