This notebook, “1-p1-explore-weather-trends.Rmd” uses original content from the Data Analyst Nanodegree by Udacity, used under CC BY-NC-ND 4.0. “1-p1-explore-weather-trends.Rmd” is licensed under CC0 by Clare Gibson
All supporting files can be found in the Github repo
In this project, I will analyse local and global temperature data and compare the temperature trends where I live to overall global temperature trends.
Udacity provided a database with three tables for this project:
city_list contains a list of cities and countries in the database.city_data contains the average temperatures for each city by year (°C).global_data contains the average global temperatures by year (°C).Udacity did not provide a link to the source of this data, nor did they explain how the yearly average temperatures were calculated.
Using the SQL Workspace provided by Udacity, I began this project by writing a query on the city_list table to see which cities were available in my country so that I could select the one that is closest to where I live. I live in the UK, so I used 'United Kingdom' as my search parameter.
--SQL Query #1
SELECT *
FROM city_list
WHERE country = 'United Kingdom'
ORDER BY city
I downloaded the results to a CSV file. The results looked like this:
| city | country |
|---|---|
| Belfast | United Kingdom |
| Birmingham | United Kingdom |
| Cardiff | United Kingdom |
| Edinburgh | United Kingdom |
| London | United Kingdom |
My nearest city in this list is London, so that is the city I will use for my analysis of local data.
I then wrote a second SQL query to extract data from the city_data database for London.
--SQL Query #2
SELECT *
FROM city_data
WHERE country = 'United Kingdom'
AND city = 'London'
This query returned 271 results. I downloaded the results to a CSV file and the first 6 rows are shown below.
| year | city | country | avg_temp |
|---|---|---|---|
| 1743 | London | United Kingdom | 7.54 |
| 1744 | London | United Kingdom | 10.34 |
| 1745 | London | United Kingdom | 4.13 |
| 1746 | London | United Kingdom | NA |
| 1747 | London | United Kingdom | NA |
| 1748 | London | United Kingdom | NA |
Finally, I wrote a query to extract everything from the global_data database.
--SQL Query #3
SELECT *
FROM global_data
This query returned 266 results. I downloaded the results to a CSV file and the first 6 rows are displayed below.
| year | avg_temp |
|---|---|
| 1750 | 8.72 |
| 1751 | 7.98 |
| 1752 | 5.78 |
| 1753 | 8.39 |
| 1754 | 8.47 |
| 1755 | 8.36 |
The data in london and global shares a similar structure. global contains two columns of data:
names(global)
## [1] "year" "avg_temp"
while london shares those same column plus two more:
names(london)
## [1] "year" "city" "country" "avg_temp"
We can add country and city columns to the global data (with the single value “Global” in each), which will then allow us to combine these two tables in a union. This will allow us to analyse both the local and the global data in the same plot.
# First add country and city columns to global
global <- global %>%
# create the new columns
mutate(city = "Global",
country = "Global") %>%
# arrange columns in same order as london data frame
select(year, city, country, everything())
# Next create a merged data frame
temps <- global %>%
bind_rows(london) %>%
arrange(year, country, city)
kable(head(temps, 10),
caption = "Average yearly temperatures for London and the globe")
| year | city | country | avg_temp |
|---|---|---|---|
| 1743 | London | United Kingdom | 7.54 |
| 1744 | London | United Kingdom | 10.34 |
| 1745 | London | United Kingdom | 4.13 |
| 1746 | London | United Kingdom | NA |
| 1747 | London | United Kingdom | NA |
| 1748 | London | United Kingdom | NA |
| 1749 | London | United Kingdom | NA |
| 1750 | Global | Global | 8.72 |
| 1750 | London | United Kingdom | 10.25 |
| 1751 | Global | Global | 7.98 |
After combining the data, I noticed that the data for London begins in the year 1743, whereas the global data only begins in 1750. If we look at the tail of the data frame, we can see that global includes data up to the year 2015, whereas london only goes as far as 2013.
kable(tail(temps),
caption = "Tail of the `temps` table")
| year | city | country | avg_temp |
|---|---|---|---|
| 2012 | Global | Global | 9.51 |
| 2012 | London | United Kingdom | 10.00 |
| 2013 | Global | Global | 9.61 |
| 2013 | London | United Kingdom | 9.91 |
| 2014 | Global | Global | 9.57 |
| 2015 | Global | Global | 9.83 |
Since we are interested in the comparison between the two regions, I will filter the temps data frame to include only records for years which have data for both london and global. Conveniently, this also serves to remove the rows with missing values from the London data.
temps <- temps %>%
filter(year >= 1750,
year <= 2013)
kable(head(temps),
caption = "Temperature table after filtering to include only data for years present in both data sets")
| year | city | country | avg_temp |
|---|---|---|---|
| 1750 | Global | Global | 8.72 |
| 1750 | London | United Kingdom | 10.25 |
| 1751 | Global | Global | 7.98 |
| 1751 | London | United Kingdom | 9.99 |
| 1752 | Global | Global | 5.78 |
| 1752 | London | United Kingdom | 6.54 |
The project rubric asks me to calculate a moving average to use in the line charts. Moving averages are typically used for time-based data when there are known to be fluctuations or peaks and troughs on certain days or seasons. My notes on moving averages have more detail on their uses cases and how to calculate them.
It is not immediately obvious to me how a yearly average temperature statistic would have fluctuations. I’m going to start out by plotting a line graph for both the global and local data without calculating any moving averages, to get a feel for how the data fluctuates.
fig1 <- temps %>%
filter(city == "London") %>%
ggplot(mapping = aes(x = year, y = avg_temp)) +
geom_line() +
labs(x = "Year",
y = "Average Temperature (°C)",
title = "Average temperature by year in London 1750-2013")
fig1
Average temperature by year in London 1750-2013
fig2 <- temps %>%
filter(city == "Global") %>%
ggplot(mapping = aes(x = year, y = avg_temp)) +
geom_line() +
labs(x = "Year",
y = "Average Temperature (°C)",
title = "Global average temperature by year 1750-2013")
fig2
Global average temperature by year 1750-2013
From these initial plots I can make several observations:
london data appears to fluctuate between around 8°C and 11°C (a range of 3°C), whereas the global data appears to fluctuate between around 7°C and 9.5°C (a range of 2.5°C) with the gap between peaks and troughs becoming much narrower from around 1900 onwards.In order to calculate the 5, 7 and 10 year moving averages, I will use a function called rollmean from the zoo package in R. The arguments required to be passed to rollmean are:
x an object (in this case either the london or global data frame)k integer width of the rolling window (in our case either 5, 7 or 10 years)fill filling values for observations outside of the rolling window.align character specifying whether the index of the result should be left-, right- or centre-aligned compared to the rolling window of observations. In our case we’ll choose right-aligned so that the rolling window ends on the observation year.temps <- temps %>%
arrange(year, country, city) %>%
group_by(city) %>%
mutate(temp_05ya = rollmean(x = avg_temp,
k = 5,
fill = NA,
align = "right"),
temp_07ya = rollmean(x = avg_temp,
k = 7,
fill = NA,
align = "right"),
temp_10ya = rollmean(x = avg_temp,
k = 10,
fill = NA,
align = "right"))
kable(head(temps, 20))
| year | city | country | avg_temp | temp_05ya | temp_07ya | temp_10ya |
|---|---|---|---|---|---|---|
| 1750 | Global | Global | 8.72 | NA | NA | NA |
| 1750 | London | United Kingdom | 10.25 | NA | NA | NA |
| 1751 | Global | Global | 7.98 | NA | NA | NA |
| 1751 | London | United Kingdom | 9.99 | NA | NA | NA |
| 1752 | Global | Global | 5.78 | NA | NA | NA |
| 1752 | London | United Kingdom | 6.54 | NA | NA | NA |
| 1753 | Global | Global | 8.39 | NA | NA | NA |
| 1753 | London | United Kingdom | 9.42 | NA | NA | NA |
| 1754 | Global | Global | 8.47 | 7.868 | NA | NA |
| 1754 | London | United Kingdom | 9.20 | 9.080 | NA | NA |
| 1755 | Global | Global | 8.36 | 7.796 | NA | NA |
| 1755 | London | United Kingdom | 8.95 | 8.820 | NA | NA |
| 1756 | Global | Global | 8.85 | 7.970 | 8.078571 | NA |
| 1756 | London | United Kingdom | 9.42 | 8.706 | 9.110000 | NA |
| 1757 | Global | Global | 9.02 | 8.618 | 8.121429 | NA |
| 1757 | London | United Kingdom | 9.34 | 9.266 | 8.980000 | NA |
| 1758 | Global | Global | 6.74 | 8.288 | 7.944286 | NA |
| 1758 | London | United Kingdom | 8.85 | 9.152 | 8.817143 | NA |
| 1759 | Global | Global | 7.99 | 8.192 | 8.260000 | 8.030 |
| 1759 | London | United Kingdom | 9.80 | 9.272 | 9.282857 | 9.176 |
Now let’s make a plot showing both London and global data on the same chart, using the 5-year moving average.
fig3 <- temps %>%
ggplot(mapping = aes(x = year, y = temp_05ya, color = city)) +
geom_line() +
labs(x = "Year",
y = "Temperature (°C)",
title = "Moving 5-year average temperature",
subtitle = "London vs global average 1750-2013")
fig3
Moving 5-year average temperature for London and globally 1750-2013
Let’s repeat this for the 7-year and 10-year moving averages.
fig4 <- temps %>%
ggplot(mapping = aes(x = year, y = temp_07ya, color = city)) +
geom_line() +
labs(x = "Year",
y = "Temperature (°C)",
title = "Moving 7-year average temperature",
subtitle = "London vs global average 1750-2013")
fig4
Moving 7-year average temperature for London and globally 1750-2013
fig5 <- temps %>%
ggplot(mapping = aes(x = year, y = temp_10ya, color = city)) +
geom_line() +
labs(x = "Year",
y = "Temperature (°C)",
title = "Moving 10-year average temperature",
subtitle = "London vs global average 1750-2013")
fig5
Moving 10-year average temperatures for London and globally 1750-2013
I think that of all the moving average calculations I created, the 10-year moving average gives the best impression of overall trend while still retaining a sense of the peaks and troughs in the data.
From the analysis completed so far, I have the following observations:
From my initial observations I was particularly struck by the fact that London’s average temperatures are consistently higher than the global average. My intuition would have me believe the opposite. I’m interested to know which other cities fall above the global average and which fall below.
In order to visualise this, I will need to extract the average yearly temperature data for all cities in the city_data database. I can use the following SQL query to do this:
--SQL Query #4
SELECT *
FROM city_data
This query returned 70,792 results, which I downloaded to a CSV file. The first 6 rows are shown below.
| year | city | country | avg_temp |
|---|---|---|---|
| 1849 | Abidjan | Côte D’Ivoire | 25.58 |
| 1850 | Abidjan | Côte D’Ivoire | 25.52 |
| 1851 | Abidjan | Côte D’Ivoire | 25.67 |
| 1852 | Abidjan | Côte D’Ivoire | NA |
| 1853 | Abidjan | Côte D’Ivoire | NA |
| 1854 | Abidjan | Côte D’Ivoire | NA |
For this analysis, I want to compare each city in the cities table against the global average. Therefore, instead of unioning the cities data with the global data, I am going to perform an inner join, so that the temperature statistics for the city and the global average sit side by side. I chose an inner join as I only want to keep records where we have a value for both the city and the global average.
temps_all <- cities %>%
inner_join(select(global,
year,
global_avg_temp = avg_temp))
Now let’s add some custom calculations to this data set:
temps_all <- temps_all %>%
arrange(country, city, year) %>%
group_by(country, city) %>%
mutate(diff = avg_temp - global_avg_temp,
temp_10ya = rollmean(x = avg_temp,
k = 10,
fill = NA,
align = "right"),
global_temp_10ya = rollmean(x = global_avg_temp,
k = 10,
fill = NA,
align = "right"),
diff_10ya = temp_10ya - global_temp_10ya)
kable(head(temps_all,10),
caption = "Average yearly temperatures for individual cities and the global average")
| year | city | country | avg_temp | global_avg_temp | diff | temp_10ya | global_temp_10ya | diff_10ya |
|---|---|---|---|---|---|---|---|---|
| 1833 | Kabul | Afghanistan | 13.91 | 8.01 | 5.90 | NA | NA | NA |
| 1834 | Kabul | Afghanistan | 13.91 | 8.15 | 5.76 | NA | NA | NA |
| 1835 | Kabul | Afghanistan | 14.71 | 7.39 | 7.32 | NA | NA | NA |
| 1836 | Kabul | Afghanistan | NA | 7.70 | NA | NA | NA | NA |
| 1837 | Kabul | Afghanistan | 15.47 | 7.38 | 8.09 | NA | NA | NA |
| 1838 | Kabul | Afghanistan | 18.50 | 7.51 | 10.99 | NA | NA | NA |
| 1839 | Kabul | Afghanistan | NA | 7.63 | NA | NA | NA | NA |
| 1840 | Kabul | Afghanistan | 13.70 | 7.80 | 5.90 | NA | NA | NA |
| 1841 | Kabul | Afghanistan | 14.17 | 7.69 | 6.48 | NA | NA | NA |
| 1842 | Kabul | Afghanistan | 15.37 | 8.02 | 7.35 | NA | 7.728 | NA |
To create the visuals for this secondary analysis, I am going to use Tableau, which is a visualisation tool that I am more familiar with. You can view the original visualisation here.
Tableau visualisation: Explore Weather Trends
There are many more cities with an average yearly temperature that is higher than the global average, than there are cities with an average yearly temperature that is lower than the global average. I was a little surprised by this at first. I expected that there would be more of a balance of cities falling either side of the global average. Upon reflection, however, I realised that having a sample made up only of major cities is not representative of the Earth as a whole. By their nature, cities are on land, so none of the data points come from the oceans, where temperatures may be cooler. Also, due to something called the “urban heat island” effect, urban areas in general tend to be hotter than non-urban areas.