Introduction

The goal of this analysis is to identify the overall global temperature trend and compare it with the average temperature of Los Angeles, the metropolitan city nearest to me, using moving averages to aid in visualization.

Tools

I will be using R, specifically the ggplot2 library, to produce my plots. I also wrote a moving average function to help prepare the data for analysis; it is included in the Procedure section. This report is produced using RMarkdown with PDF output.

Procedure

To obtain the initial .csv files, I ran two SQL queries directly in the browser workspace of Udacity’s Data Analytics Nanodegree program. The queries were as follows:

SELECT * FROM global_data;
SELECT * FROM city_data WHERE country = 'United States' AND city = 'Los Angeles';

Once the CSVs were exported, I imported them into R and loaded the ggplot2 package for the plots that follow.

## Loading required package: ggplot2
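
The import step itself is not shown above. A minimal sketch of what it might look like follows. The file names are hypothetical, and load_temps is a helper introduced here purely for illustration. The stringsAsFactors = TRUE setting is an assumption, chosen so that summary() reports counts for city and country as in the output later in this report.

```r
# Hypothetical helper: read one exported CSV and check the columns
# that the rest of the analysis relies on.
load_temps <- function(path) {
  # Assumption: text columns are read as factors (not the default in R >= 4.0)
  df <- read.csv(path, stringsAsFactors = TRUE)
  stopifnot(all(c("year", "avg_temp") %in% names(df)))
  df
}

# Hypothetical file names; substitute the names the exports were saved under.
# gl_data <- load_temps("global_data.csv")
# la_data <- load_temps("city_data_la.csv")
# require(ggplot2)
```

The column check fails early if an export is missing year or avg_temp, which is easier to debug than an error deep inside a plotting call.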

Next, I needed a function to obtain moving average data for the global and LA data sets. Since the year ranges were inconsistent, I kept the two data sets separate and ran decade-long moving averages, as you will see later in my plots.

As written, the function sets the first nine and the last ten positions to NA. For every other year i it averages the values from i through i + num, which is a forward-looking window of 11 years rather than exactly ten. The NA rows are omitted before the analysis.

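# Moving average of x. For position i, averages x[i] through x[i + num]
# (num + 1 values, looking forward). Positions without a result are NA.
moving_avg <- function(num, x) {
   n <- length(x)
   # One-column data frame, pre-filled with NA
   mv_avg <- data.frame(mvg_avg = rep(NA_real_, n))
   for (i in seq_len(n)) {
     # The first num - 1 and the last num positions stay NA
     if (i >= num && i <= (n - num)) {
       mv_avg[i, ] <- mean(x[i:(i + num)])
     }
   }
   mv_avg
}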
mvg_avg_la <- moving_avg(10,la_data$avg_temp)
head(mvg_avg_la, 15)
mvg_avg_gl <- moving_avg(10,gl_data$avg_temp)
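
For comparison, a trailing moving average, in which each year is averaged with the years before it, can be computed with base R's stats::filter. This is a swapped-in technique and not the moving_avg function above; trailing_avg is a name introduced here for illustration, and the input is a toy vector rather than the temperature data.

```r
# Trailing moving average: position i holds the mean of x[(i - n + 1):i].
# The first n - 1 positions are NA because no full window exists yet.
trailing_avg <- function(x, n = 10) {
  # sides = 1 applies the weights to the current and past values only
  as.numeric(stats::filter(x, rep(1 / n, n), sides = 1))
}

x <- 1:12                 # toy input, not temperature data
trailing_avg(x, 10)       # nine NAs, then 5.5, 6.5, 7.5
```

With this variant the window covers exactly n values and no rows are lost at the end of the series, at the cost of each average lagging behind the year it is plotted against.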

Now I omit the NA values and create new data frames that pair each year with its moving average. Notice that the row numbers in the output start at 10 rather than 1, which confirms that the leading NA rows were dropped.

full_la <- na.omit(cbind(la_data,mvg_avg_la))
full_gl <- na.omit(cbind(gl_data,mvg_avg_gl))
head(full_gl, 5)
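
The row-number offset comes from na.omit keeping the original row names of the rows that survive. A small sketch with made-up values (not the Udacity data) shows the behavior; toy and kept are names used only here.

```r
# Toy data frame with a pretend moving-average column containing NAs
toy <- data.frame(
  year     = 2001:2006,
  avg_temp = c(8.1, 8.3, 8.2, 8.6, 8.4, 8.7),   # made-up values
  mvg_avg  = c(NA, NA, 8.2, 8.4, 8.4, NA)       # made-up values
)

# na.omit() drops every row that contains an NA in any column
kept <- na.omit(toy)

rownames(kept)            # "3" "4" "5": original row names are preserved
nrow(toy) - nrow(kept)    # 3 rows were dropped
```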

To take a quick look at our data before our line plots, we can summarize the data and look over the trend of our moving average.

For Los Angeles:

summary(full_la); boxplot(full_la$mvg_avg, ylab = 'Avg Temp', xlab = 'Los Angeles', main = 'Moving Avg Distribution')
##       year               city              country       avg_temp    
##  Min.   :1858   Los Angeles:146   United States:146   Min.   :14.36  
##  1st Qu.:1894                                         1st Qu.:15.44  
##  Median :1930                                         Median :15.77  
##  Mean   :1930                                         Mean   :15.83  
##  3rd Qu.:1967                                         3rd Qu.:16.20  
##  Max.   :2003                                         Max.   :17.08  
##     mvg_avg     
##  Min.   :15.25  
##  1st Qu.:15.60  
##  Median :15.82  
##  Mean   :15.86  
##  3rd Qu.:16.05  
##  Max.   :16.72

For the global data:

summary(full_gl); boxplot(full_gl$mvg_avg, ylab = 'Avg Temp', xlab = 'Global', main = 'Moving Avg Distribution')
##       year         avg_temp        mvg_avg     
##  Min.   :1759   Min.   :6.780   Min.   :7.236  
##  1st Qu.:1820   1st Qu.:8.065   1st Qu.:8.057  
##  Median :1882   Median :8.350   Median :8.272  
##  Mean   :1882   Mean   :8.332   Mean   :8.363  
##  3rd Qu.:1944   3rd Qu.:8.650   3rd Qu.:8.642  
##  Max.   :2005   Max.   :9.700   Max.   :9.604

Data Visualization

The first plot shows the raw annual averages for every year. It is not the view we ultimately want: the lines are very jittery, which makes the overall trends hard to see.

g_temp <- ggplot(mapping = aes(year, avg_temp)) + 
  geom_line(data = la_data, aes(colour='red')) + 
  geom_line(data = gl_data, aes(colour='blue')) +
  guides(colour = guide_legend(title='Average')) + 
  scale_colour_manual(labels=c('Global', 'Los Angeles'),
                      values=c('blue','red')) + 
  labs(title = 'Average Annual Temperatures',
       x='Year', y='Average Temperature') +
  theme(plot.title = element_text(hjust = 0.5))
g_temp

Picking out anything beyond the broad direction of each line is difficult because of the sharp year-to-year spikes. Much like an over-fitted model, the plot follows every fluctuation and obscures the longer pattern. We therefore use a moving average to smooth the results.
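
The smoothing effect can be illustrated on a synthetic series. The sketch below uses simulated data (a gentle linear trend plus random noise, not the Udacity data) and a trailing ten-year average from stats::filter, which is an assumption made for brevity rather than the moving_avg function from this report. It compares the typical year-to-year change before and after smoothing.

```r
set.seed(1)  # synthetic series only, for reproducibility

years <- 1900:1999
# Simulated temperatures: slow warming plus year-to-year noise
raw <- 8 + 0.01 * (years - 1900) + rnorm(length(years), sd = 0.4)

# Trailing ten-year average; the first nine values are NA
smoothed <- as.numeric(stats::filter(raw, rep(1 / 10, 10), sides = 1))

# Spread of the year-to-year changes as a rough measure of jitter
jitter_raw      <- sd(diff(raw))
jitter_smoothed <- sd(diff(smoothed), na.rm = TRUE)

c(raw = jitter_raw, smoothed = jitter_smoothed)
```

The smoothed series changes far less from one year to the next, which is why the underlying trend becomes easier to see in a line plot.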

The next plot shows the moving averages, which make the overall trend much easier to read. Each point summarizes roughly a decade of data: as the window advances by one year, the newest year enters the average and the oldest year drops out. This gives us a sliding-window view of the data.

g <- ggplot(mapping = aes(year, mvg_avg)) + 
  geom_line(data = full_la, aes(colour='red')) + 
  geom_line(data = full_gl, aes(colour='blue')) +
  guides(colour = guide_legend(title='Average')) + 
  scale_colour_manual(labels=c('Global', 'Los Angeles'),
                      values=c('blue','red')) + 
  labs(title = 'Average Temperatures by Decade',
       x='Year', y='Average Temperature') +
  theme(plot.title = element_text(hjust = 0.5))
g
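
The plots in this report label the lines by mapping colour to the strings 'red' and 'blue' and then renaming them in the legend. An alternative sketch, under the assumption that both data sets are first stacked into one long data frame, maps colour to the series name directly. The frames gl_demo and la_demo hold made-up placeholder values standing in for full_gl and full_la, and the plot is only built if ggplot2 is installed.

```r
# Placeholder frames with made-up values (stand-ins for full_gl and full_la).
# With the real data, select the shared columns first,
# e.g. full_la[, c("year", "mvg_avg")].
gl_demo <- data.frame(year = 2000:2002, mvg_avg = c(9.2, 9.3, 9.4))
la_demo <- data.frame(year = 2000:2002, mvg_avg = c(16.3, 16.4, 16.5))

# Stack both sets and record which series each row belongs to
both <- rbind(
  cbind(gl_demo, series = "Global"),
  cbind(la_demo, series = "Los Angeles")
)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # Named colour values tie each colour to its series explicitly
  p <- ggplot(both, aes(year, mvg_avg, colour = series)) +
    geom_line() +
    scale_colour_manual(values = c("Global" = "blue", "Los Angeles" = "red")) +
    labs(title = "Average Temperatures by Decade",
         x = "Year", y = "Average Temperature", colour = "Average")
}
```

Naming the colour values removes the dependence on alphabetical ordering of the strings, so the legend cannot silently swap labels if a colour is changed.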

We can see an upward trend over the course of several decades, indicating warming both globally and locally. Los Angeles also has a much higher temperature than the global average. This makes sense, since LA is in Southern California, at a relatively low latitude of about 34°N, whereas the global figure also averages over much colder regions.

Furthermore, many of the global trends seem to be much more pronounced in LA. Looking at the 1900-1950 warming trend, we can see a major spike in the moving average line for Los Angeles over the same interval. The same holds for the years after 1975.

Alarmingly, the global moving average appears to accelerate after about 1950, curving upward rather than rising at a steady rate. Los Angeles shows warming over the same period, but its rise looks closer to linear.
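
One way to put a number on such visual impressions is to fit a straight line over a chosen period and report its slope. The sketch below is an illustration under stated assumptions and not part of the original analysis: trend_per_decade is a hypothetical helper, and the demo data are synthetic, constructed to warm by exactly 0.02 degrees per year.

```r
# Slope of a least-squares line through avg_temp vs. year,
# expressed in degrees per decade, for years in [from, to].
trend_per_decade <- function(df, from, to) {
  sel <- df[df$year >= from & df$year <= to, ]
  fit <- lm(avg_temp ~ year, data = sel)
  unname(coef(fit)["year"]) * 10
}

# Synthetic check data (not the Udacity data): 0.02 degrees per year
demo <- data.frame(year = 1950:2000,
                   avg_temp = 8 + 0.02 * (1950:2000 - 1950))

trend_per_decade(demo, 1950, 2000)   # 0.2 degrees per decade
```

Applied to gl_data and la_data for periods before and after 1950, the two slopes would show whether the rate of warming actually increased, and by how much in each series.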

Lastly, I wanted to include a ‘zoomed’ plot representing temperature trends during my lifetime. This plot shows the moving average over the 2.5 decades I’ve been here on Earth. Interestingly, even though my lifetime covers only a fraction of the data set’s range, there is a noticeable warming trend both globally and locally.

g_lifetime <- ggplot(mapping = aes(year, mvg_avg)) + 
  geom_line(data = full_la, aes(colour='red')) + 
  geom_line(data = full_gl, aes(colour='blue')) +
  xlim(1990, max(full_gl$year))+
  guides(colour = guide_legend(title='Average')) + 
  scale_colour_manual(labels=c('Global', 'Los Angeles'),
                      values=c('blue','red')) + 
  labs(title = 'Average Temperatures Relative to My Lifetime',
       x='Year', y='Average Temperature') +
  theme(plot.title = element_text(hjust = 0.5))
g_lifetime

Conclusion

Overall, we can see warming trends both in Los Angeles and globally. The periods of warming seen locally can also be traced in the global trend, so the two are fairly consistent. Los Angeles sees more drastic changes, but this is expected, since the global figure averages over all regions of Earth. I cannot comment directly on global warming since I am not an expert in the area, but even a very elementary analysis like this points to a warming trend. It would be interesting to trace \(\text{CO}_2\) levels alongside the temperatures above.