This notebook, “1-p1-explore-weather-trends.Rmd” uses original content from the Data Analyst Nanodegree by Udacity, used under CC BY-NC-ND 4.0. “1-p1-explore-weather-trends.Rmd” is licensed under CC0 by Clare Gibson

All supporting files can be found in the Github repo

1 Summary

In this project, I will analyse local and global temperature data and compare the temperature trends where I live to overall global temperature trends.

Udacity provided a database with three tables for this project:

  • city_list contains a list of cities and countries in the database.
  • city_data contains the average temperatures for each city by year (°C).
  • global_data contains the average global temperatures by year (°C).

Udacity did not provide a link to the source of this data, nor did they explain how the yearly average temperatures were calculated.

2 Preparation

2.1 Extracting the data

Using the SQL Workspace provided by Udacity, I began this project by writing a query on the city_list table to see which cities were available in my country so that I could select the one that is closest to where I live. I live in the UK, so I used 'United Kingdom' as my search parameter.

--SQL Query #1
SELECT *
FROM city_list
WHERE country = 'United Kingdom'
ORDER BY city

I downloaded the results to a CSV file. The results looked like this:

UK cities in the project database
city country
Belfast United Kingdom
Birmingham United Kingdom
Cardiff United Kingdom
Edinburgh United Kingdom
London United Kingdom

My nearest city in this list is London, so that is the city I will use for my analysis of local data.

I then wrote a second SQL query to extract data from the city_data database for London.

--SQL Query #2
SELECT *
FROM city_data
WHERE country = 'United Kingdom'
AND city = 'London'

This query returned 271 results. I downloaded the results to a CSV file and the first 6 rows are shown below.

Average yearly temperatures for London
year city country avg_temp
1743 London United Kingdom 7.54
1744 London United Kingdom 10.34
1745 London United Kingdom 4.13
1746 London United Kingdom NA
1747 London United Kingdom NA
1748 London United Kingdom NA

Finally, I wrote a query to extract everything from the global_data database.

--SQL Query #3
SELECT *
FROM global_data

This query returned 266 results. I downloaded the results to a CSV file and the first 6 rows are displayed below.

Average yearly global temperatures
year avg_temp
1750 8.72
1751 7.98
1752 5.78
1753 8.39
1754 8.47
1755 8.36

2.2 Unifying the data

The data in london and global shares a similar structure. global contains two columns of data:

names(global)
## [1] "year"     "avg_temp"

while london shares those same column plus two more:

names(london)
## [1] "year"     "city"     "country"  "avg_temp"

We can add country and city columns to the global data (with the single value “Global” in each), which will then allow us to combine these two tables in a union. This will allow us to analyse both the local and the global data in the same plot.

# First add country and city columns to global
global <- global %>% 
  # create the new columns
  mutate(city = "Global",
         country = "Global") %>% 
  # arrange columns in same order as london data frame
  select(year, city, country, everything())

# Next create a merged data frame
temps <- global %>% 
  bind_rows(london) %>% 
  arrange(year, country, city)

kable(head(temps, 10),
      caption = "Average yearly temperatures for London and the globe")
Average yearly temperatures for London and the globe
year city country avg_temp
1743 London United Kingdom 7.54
1744 London United Kingdom 10.34
1745 London United Kingdom 4.13
1746 London United Kingdom NA
1747 London United Kingdom NA
1748 London United Kingdom NA
1749 London United Kingdom NA
1750 Global Global 8.72
1750 London United Kingdom 10.25
1751 Global Global 7.98

After combining the data, I noticed that the data for London begins in the year 1743, whereas the global data only begins in 1750. If we look at the tail of the data frame, we can see that global includes data up to the year 2015, whereas london only goes as far as 2013.

kable(tail(temps),
      caption = "Tail of the `temps` table")
Tail of the temps table
year city country avg_temp
2012 Global Global 9.51
2012 London United Kingdom 10.00
2013 Global Global 9.61
2013 London United Kingdom 9.91
2014 Global Global 9.57
2015 Global Global 9.83

Since we are interested in the comparison between the two regions, I will filter the temps data frame to include only records for years which have data for both london and global. Conveniently, this also serves to remove the rows with missing values from the London data.

temps <- temps %>% 
  filter(year >= 1750,
         year <= 2013)

kable(head(temps),
      caption = "Temperature table after filtering to include only data for years present in both data sets")
Temperature table after filtering to include only data for years present in both data sets
year city country avg_temp
1750 Global Global 8.72
1750 London United Kingdom 10.25
1751 Global Global 7.98
1751 London United Kingdom 9.99
1752 Global Global 5.78
1752 London United Kingdom 6.54

3 Exploration

The project rubric asks me to calculate a moving average to use in the line charts. Moving averages are typically used for time-based data when there are known to be fluctuations or peaks and troughs on certain days or seasons. My notes on moving averages have more detail on their uses cases and how to calculate them.

It is not immediately obvious to me how a yearly average temperature statistic would have fluctuations. I’m going to start out by plotting a line graph for both the global and local data without calculating any moving averages, to get a feel for how the data fluctuates.

fig1 <- temps %>% 
  filter(city == "London") %>% 
  ggplot(mapping = aes(x = year, y = avg_temp)) +
  geom_line() +
  labs(x = "Year",
       y = "Average Temperature (°C)",
       title = "Average temperature by year in London 1750-2013")
  
fig1
Average temperature by year in London 1750-2013

Average temperature by year in London 1750-2013

fig2 <- temps %>% 
  filter(city == "Global") %>% 
  ggplot(mapping = aes(x = year, y = avg_temp)) +
  geom_line() +
  labs(x = "Year",
       y = "Average Temperature (°C)",
       title = "Global average temperature by year 1750-2013")
  
fig2
Global average temperature by year 1750-2013

Global average temperature by year 1750-2013

From these initial plots I can make several observations:

  • It does appear that yearly average temperatures follow a somewhat cyclical trend of peaks and troughs, which seems to occur over a period of around 5 years. We could calculate moving averages over 5, 7 and 10 years and see which of those gives us the best insight into the data.
  • The london data appears to fluctuate between around 8°C and 11°C (a range of 3°C), whereas the global data appears to fluctuate between around 7°C and 9.5°C (a range of 2.5°C) with the gap between peaks and troughs becoming much narrower from around 1900 onwards.

3.1 Calculating the moving averages

In order to calculate the 5, 7 and 10 year moving averages, I will use a function called rollmean from the zoo package in R. The arguments required to be passed to rollmean are:

  • x an object (in this case either the london or global data frame)
  • k integer width of the rolling window (in our case either 5, 7 or 10 years)
  • fill filling values for observations outside of the rolling window.
  • align character specifying whether the index of the result should be left-, right- or centre-aligned compared to the rolling window of observations. In our case we’ll choose right-aligned so that the rolling window ends on the observation year.
temps <- temps %>% 
  arrange(year, country, city) %>% 
  group_by(city) %>% 
  mutate(temp_05ya = rollmean(x = avg_temp,
                              k = 5,
                              fill = NA,
                              align = "right"),
         temp_07ya = rollmean(x = avg_temp,
                              k = 7,
                              fill = NA,
                              align = "right"),
         temp_10ya = rollmean(x = avg_temp,
                              k = 10,
                              fill = NA,
                              align = "right"))
kable(head(temps, 20))
year city country avg_temp temp_05ya temp_07ya temp_10ya
1750 Global Global 8.72 NA NA NA
1750 London United Kingdom 10.25 NA NA NA
1751 Global Global 7.98 NA NA NA
1751 London United Kingdom 9.99 NA NA NA
1752 Global Global 5.78 NA NA NA
1752 London United Kingdom 6.54 NA NA NA
1753 Global Global 8.39 NA NA NA
1753 London United Kingdom 9.42 NA NA NA
1754 Global Global 8.47 7.868 NA NA
1754 London United Kingdom 9.20 9.080 NA NA
1755 Global Global 8.36 7.796 NA NA
1755 London United Kingdom 8.95 8.820 NA NA
1756 Global Global 8.85 7.970 8.078571 NA
1756 London United Kingdom 9.42 8.706 9.110000 NA
1757 Global Global 9.02 8.618 8.121429 NA
1757 London United Kingdom 9.34 9.266 8.980000 NA
1758 Global Global 6.74 8.288 7.944286 NA
1758 London United Kingdom 8.85 9.152 8.817143 NA
1759 Global Global 7.99 8.192 8.260000 8.030
1759 London United Kingdom 9.80 9.272 9.282857 9.176

Now let’s make a plot showing both London and global data on the same chart, using the 5-year moving average.

fig3 <- temps %>%  
  ggplot(mapping = aes(x = year, y = temp_05ya, color = city)) +
  geom_line() +
  labs(x = "Year",
       y = "Temperature (°C)",
       title = "Moving 5-year average temperature",
       subtitle = "London vs global average 1750-2013")
  
fig3
Moving 5-year average temperature for London and globally 1750-2013

Moving 5-year average temperature for London and globally 1750-2013

Let’s repeat this for the 7-year and 10-year moving averages.

fig4 <- temps %>%  
  ggplot(mapping = aes(x = year, y = temp_07ya, color = city)) +
  geom_line() +
  labs(x = "Year",
       y = "Temperature (°C)",
       title = "Moving 7-year average temperature",
       subtitle = "London vs global average 1750-2013")
  
fig4
Moving 7-year average temperature for London and globally 1750-2013

Moving 7-year average temperature for London and globally 1750-2013

fig5 <- temps %>%  
  ggplot(mapping = aes(x = year, y = temp_10ya, color = city)) +
  geom_line() +
  labs(x = "Year",
       y = "Temperature (°C)",
       title = "Moving 10-year average temperature",
       subtitle = "London vs global average 1750-2013")
  
fig5
Moving 10-year average temperatures for London and globally 1750-2013

Moving 10-year average temperatures for London and globally 1750-2013

I think that of all the moving average calculations I created, the 10-year moving average gives the best impression of overall trend while still retaining a sense of the peaks and troughs in the data.

3.2 Observations

From the analysis completed so far, I have the following observations:

  • The yearly average temperatures both globally and in London are somewhat lower than I imagined. I guess that this is because the yearly averages are calculated as an average of all times of day across all days of the year. This means that naturally those averages encompass both night time and day time temperatures, with nights generally being cooler than days. In my own lived experience, I am typically asleep during the night time, so I’m not noticing the night time temperatures.
  • There is a clear upward trend in temperatures both globally and in London which seems to start in around 1900, becoming much sharper from around 1970 onwards.
  • I was surprised to see that London shows higher average temperatures than the globe as a whole. I don’t think of London as being a particularly warm place! Looking at Figure 5, it appears that the difference between London and the global average is starting to become smaller.
  • I was also surprised to see that there is a cyclical trend in yearly average temperatures, and I wonder what causes this? I don’t know much about weather science but I wonder if there is some weather phenomenon such as El Niño that drives this?

4 Beyond the rubric

From my initial observations I was particularly struck by the fact that London’s average temperatures are consistently higher than the global average. My intuition would have me believe the opposite. I’m interested to know which other cities fall above the global average and which fall below.

In order to visualise this, I will need to extract the average yearly temperature data for all cities in the city_data database. I can use the following SQL query to do this:

--SQL Query #4
SELECT *
FROM city_data

This query returned 70,792 results, which I downloaded to a CSV file. The first 6 rows are shown below.

Average yearly temperatures for all cities
year city country avg_temp
1849 Abidjan Côte D’Ivoire 25.58
1850 Abidjan Côte D’Ivoire 25.52
1851 Abidjan Côte D’Ivoire 25.67
1852 Abidjan Côte D’Ivoire NA
1853 Abidjan Côte D’Ivoire NA
1854 Abidjan Côte D’Ivoire NA

For this analysis, I want to compare each city in the cities table against the global average. Therefore, instead of unioning the cities data with the global data, I am going to perform an inner join, so that the temperature statistics for the city and the global average sit side by side. I chose an inner join as I only want to keep records where we have a value for both the city and the global average.

temps_all <- cities %>% 
  inner_join(select(global,
                    year,
                    global_avg_temp = avg_temp))

Now let’s add some custom calculations to this data set:

  • difference between the city average temperature and the global average temperature;
  • moving 10-year average temperature for each city
  • moving 10-year global average temperature
  • difference between the city 10-year average and the global 10-year average.
temps_all <- temps_all %>% 
  arrange(country, city, year) %>% 
  group_by(country, city) %>% 
  mutate(diff = avg_temp - global_avg_temp,
         temp_10ya = rollmean(x = avg_temp,
                              k = 10,
                              fill = NA,
                              align = "right"),
         global_temp_10ya = rollmean(x = global_avg_temp,
                                     k = 10,
                                     fill = NA,
                                     align = "right"),
         diff_10ya = temp_10ya - global_temp_10ya)

kable(head(temps_all,10),
      caption = "Average yearly temperatures for individual cities and the global average")
Average yearly temperatures for individual cities and the global average
year city country avg_temp global_avg_temp diff temp_10ya global_temp_10ya diff_10ya
1833 Kabul Afghanistan 13.91 8.01 5.90 NA NA NA
1834 Kabul Afghanistan 13.91 8.15 5.76 NA NA NA
1835 Kabul Afghanistan 14.71 7.39 7.32 NA NA NA
1836 Kabul Afghanistan NA 7.70 NA NA NA NA
1837 Kabul Afghanistan 15.47 7.38 8.09 NA NA NA
1838 Kabul Afghanistan 18.50 7.51 10.99 NA NA NA
1839 Kabul Afghanistan NA 7.63 NA NA NA NA
1840 Kabul Afghanistan 13.70 7.80 5.90 NA NA NA
1841 Kabul Afghanistan 14.17 7.69 6.48 NA NA NA
1842 Kabul Afghanistan 15.37 8.02 7.35 NA 7.728 NA

To create the visuals for this secondary analysis, I am going to use Tableau, which is a visualisation tool that I am more familiar with. You can view the original visualisation here.

Tableau visualisation: Explore Weather Trends

4.1 Further Observations

There are many more cities with an average yearly temperature that is higher than the global average, than there are cities with an average yearly temperature that is lower than the global average. I was a little surprised by this at first. I expected that there would be more of a balance of cities falling either side of the global average. Upon reflection, however, I realised that having a sample made up only of major cities is not representative of the Earth as a whole. By their nature, cities are on land, so none of the data points come from the oceans, where temperatures may be cooler. Also, due to something called the “urban heat island” effect, urban areas in general tend to be hotter than non-urban areas.