Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

0. Introduction

This is the report for the first project of Udacity’s Data Analyst Nanodegree.

The project is done in R, and I have tried to respond clearly to the assessment requirements. Beyond the basics, I have built an applet that takes advantage of interactive visualization using world temperature data. The applet can be found in my Shiny library, here.

#require(smooth) # for calculation of the simple moving average sma()
require(dplyr)
require(ggplot2)
require(TTR)
require(zoo) # for imputation

1. Data Extraction and Preparation

What is the goal of the project?

To create a visualization and prepare a write-up describing the similarities and differences between global temperature trends and temperature trends in the closest big city to where you live.

I live in Barcelona, which was a big city the last time I measured it!

The SQL query to extract the Barcelona data:

SELECT * FROM city_data WHERE country = 'Spain' AND city = 'Barcelona';

The SQL query to extract the global data:

SELECT * FROM global_data ;

Let’s have a look at the global temperature data, as well as Barcelona’s.

global <- read.csv(file = "World.csv")
barcelona <- read.csv(file = "Barcelona.csv")
#global 
global_ts<- ts(global)
summary(global)
##       year         avg_temp    
##  Min.   :1750   Min.   :5.780  
##  1st Qu.:1816   1st Qu.:8.082  
##  Median :1882   Median :8.375  
##  Mean   :1882   Mean   :8.369  
##  3rd Qu.:1949   3rd Qu.:8.707  
##  Max.   :2015   Max.   :9.830
range(global$year)
## [1] 1750 2015
#local (Barcelona)
barcelona_ts <- ts(barcelona)
summary(barcelona)
##       year             city      country       avg_temp    
##  Min.   :1743   Barcelona:271   Spain:271   Min.   :10.78  
##  1st Qu.:1810                               1st Qu.:15.83  
##  Median :1878                               Median :16.09  
##  Mean   :1878                               Mean   :16.12  
##  3rd Qu.:1946                               3rd Qu.:16.47  
##  Max.   :2013                               Max.   :17.90  
##                                             NA's   :4
range(barcelona$year)
## [1] 1743 2013

As can be seen, there are 4 NA (missing) values in the Barcelona data. Also, the ranges of the two time series are slightly different.

With missing values in the data, it is not easy to use a moving average. We have to deal with them by cutting out the part of the series that ends in the missing values, changing the number of periods to average over, or imputing the missing values. Having 4 missing values out of 271 observations is nothing to worry about, so I go for imputation. How to impute? By replacing the NAs using linear interpolation.

barcelona$avg_temp <- na.approx(object = barcelona$avg_temp)
barcelona$moving_avg <- SMA(x = barcelona$avg_temp , n = 10 ) 

global$moving_avg <- SMA(x = global$avg_temp , n = 10 )
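To make the interpolation step concrete, here is a minimal sketch on a toy vector. It uses base R’s approx() as a stand-in for zoo’s na.approx(), and the values are hypothetical, not the Barcelona data. It only covers gaps that have known values on both sides.

```r
# Toy series with interior gaps (hypothetical values, not the Barcelona data)
temps <- c(15.0, NA, 16.0, NA, NA, 17.5)
idx   <- seq_along(temps)
known <- !is.na(temps)

# Linear interpolation between the nearest known neighbours on each side
filled <- approx(x = idx[known], y = temps[known], xout = idx)$y
print(filled)
```

Each missing value is placed on the straight line joining its two nearest known neighbours, so a gap of one year gets the midpoint of the years around it.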

It is possible to write a simple moving average function and use it instead of SMA().

ma <- function(vector, n) {
        # stop() raises an error; break is only valid inside a loop
        if (n > length(vector)) {
                stop("n cannot be larger than the vector's length")
        }
        if (anyNA(vector)) {
                stop("The vector includes NA values.")
        }

        # mean of each window of n consecutive values ending at position X
        t <- sapply(X = n:length(vector), function(X) mean(vector[(X - n + 1):X]))
        return(t)
}


a<- ma(barcelona$avg_temp,10)
b<- na.exclude(SMA(x = barcelona$avg_temp , n = 10 ))
table(round(a,3) == round(b,3))
## 
## TRUE 
##  262

So there is no difference between the manually crafted function ma() and TTR::SMA().
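As a third point of comparison, a trailing moving average can also be sketched with base R’s stats::filter(). This is only an illustration on a toy vector; the explicit stats:: prefix matters here because dplyr, which is loaded above, masks filter().

```r
# Toy input (hypothetical values)
x <- c(2, 4, 6, 8, 10)
n <- 2

# Equal weights over the current and the previous n - 1 values (sides = 1)
trailing_ma <- stats::filter(x, filter = rep(1 / n, n), sides = 1)

# The first n - 1 entries are NA because their window is incomplete
as.numeric(trailing_ma)
```

Like SMA(), this keeps the output the same length as the input and pads the start with NA, whereas ma() returns only the complete windows.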

2. Visual Exploration

The first graph, figure1, shows Barcelona’s average annual temperature. The original temperature values are plotted in pale blue, and the moving-average smoothed values are depicted in solid blue.

 ggplot(data = barcelona) + 
        geom_line(aes(x = year , y = avg_temp ), color = "blue", alpha = 0.3) + 
        theme_linedraw() + 
        geom_line(aes(x = year , y = moving_avg), color = "blue") +
        xlab("Year") + 
        ylab("Average Temperature C") + 
        ggtitle("The Local (Barcelona) Average Temperature Trend")

Figure1: The Local (Barcelona) Average Temperature Trend

Some interesting points can be seen. There are two sudden falls in the moving average, one around 1820 and another around 1890. Also, the temperature has clearly increased in the 21st century. Recent years seem more constant, with zero slope, but the slope of 1970-2000 is steep.

Let’s do the same with global data. The result is shown in figure2.

 ggplot(data = global) + 
        geom_line(aes(x = year , y = avg_temp ), color = "red", alpha = 0.3) + 
        theme_linedraw() + 
        geom_line(aes(x = year , y = moving_avg), color = "red") + 
        xlab("Year") + 
        ylab("Average Temperature C") + 
        ggtitle("The Global Average Temperature Trend")

Figure2: The Global Average Temperature Trend

The temperature is clearly increasing in the 20th century and later. Again, very cold years seem to occur around 1820.

To plot both time series together on one plot, I prefer to bind the two datasets.

colnames(barcelona)[which(colnames(barcelona)=="city")] <- "Location"
global$Location <- "global"
barca_glob_df<-rbind(barcelona[c("year","avg_temp","moving_avg","Location")], global)

Now we can plot them nicely together

ggplot(data = barca_glob_df) +
        geom_line(aes(x = year , y = avg_temp , color = Location),
                  alpha = 0.3) +
        geom_line(aes(x = year , y = moving_avg, color = Location)) + 
        theme_linedraw() + 
        xlab(label = "Year") +
        ylab(label = "Temp C") + 
        ggtitle("Comparison of the Global and Local\nAverage Temperature Trends")

Figure3: The Comparison Plot of Average Temperature Trends

3. Question/Hypothesis Generation from Visual Exploration

What are the interesting points of figure3, from my point of view?

  1. The variation of the smoothed average annual temperature of Barcelona is considerably higher than the global variation. In other words, the smoothed line of Barcelona fluctuates more than the globe’s line.

  2. There are two sudden drops of temperature in the Barcelona SMA, one around 1820 and one around 1890. In the global data, we clearly see the 1820 drop as well. What happened around 1820? Is it an outlier?

  3. As for the trends, I prefer to divide the time series into two parts: before 1900 and after 1900. Seemingly, before 1900 there is no considerable change in the average temperature. However, from 1900 to the second decade of the 21st century, we witness an increasing trend. Especially after 1975, this increase accelerates.

  4. Barcelona’s temperature is clearly higher than the globe’s average. Are the patterns and trends of colder-than-average locations similar to the globe’s or to Barcelona’s? In other words, is there any difference between the behaviour of colder-than-average places and hotter-than-average places?

  5. The general variation of the globe’s time series reduces over time. So the moving average line fluctuates more smoothly in recent decades than in the early periods of the data. Does it mean anything?
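One possible way to put a number on the shrinking variation raised in the last point is a rolling standard deviation. The sketch below is only an illustration: rolling_sd() is a hypothetical helper written in the same style as ma(), and the series is simulated, not the real global data.

```r
# Rolling standard deviation over windows of n consecutive values
# (hypothetical helper, mirroring the structure of ma())
rolling_sd <- function(values, n) {
        stopifnot(n <= length(values), !anyNA(values))
        sapply(n:length(values), function(i) sd(values[(i - n + 1):i]))
}

# Simulated series whose noise shrinks over time (not the real data)
set.seed(1)
toy <- 8 + rnorm(200, sd = seq(1, 0.2, length.out = 200))

spread <- rolling_sd(toy, n = 30)

# Average spread in the earliest and in the latest windows
c(early = mean(head(spread, 30)), late = mean(tail(spread, 30)))
```

Applied to global$avg_temp, a clearly smaller value for the late windows would support the visual impression; whether that reflects the climate or simply better measurements is a separate question.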

Let’s focus on the data since 1975.

barca_glob_df %>% filter(year>=1975) %>% 
ggplot() +
        geom_line(aes(x = year , y = avg_temp , color = Location),
                  alpha = 0.3) +
        geom_line(aes(x = year , y = moving_avg, color = Location)) + 
        theme_linedraw() + 
        xlab(label = "Year") +
        ylab(label = "Temp C") + 
        ggtitle("Average Temperature Since 1975")

Figure4: Average Temperature Since 1975

Figure4 shows an increase of more than 1 degree Celsius in Barcelona’s moving-average temperature over almost 40 years. Over the same period, the increase in the global average temperature is less than about 1 degree, and clearly less than Barcelona’s. Nevertheless, the slopes of the two lines over the last 40 years differ: Barcelona’s average temperature seems constant, with zero slope, while the global average temperature increases gradually with a very mild slope.
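The slopes discussed above are read off the plot by eye. A straight-line fit per location is one way to quantify them. The following is a minimal sketch on a toy data frame shaped like barca_glob_df; the values are hypothetical and chosen only so that the expected slopes are obvious.

```r
# Toy data frame in the same shape as barca_glob_df (hypothetical values)
toy <- data.frame(
        year       = rep(1975:2013, times = 2),
        Location   = rep(c("Barcelona", "global"), each = 39),
        moving_avg = c(15.5 + 0.03 * (0:38), 8.5 + 0.02 * (0:38))
)

# Slope of a straight-line fit, in degrees C per year, for each location
slopes <- sapply(split(toy, toy$Location), function(d) {
        unname(coef(lm(moving_avg ~ year, data = d))["year"])
})

# Expressed per decade for easier reading
round(slopes * 10, 2)
```

On the real data, the same call on barca_glob_df filtered to year >= 1975 would give one slope per location, which can then be compared directly instead of judging the lines visually.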

4. Cross-correlation

I am no expert in time-series analysis, and I guess this report doesn’t expect me to be one. As far as I could find, cross-correlation is a concept similar to the correlation of two numeric vectors, except that one of the time series is shifted along the time axis in order to find the lag with the maximum correlation coefficient. Let’s try it here.

ccf(x = barcelona$avg_temp , global$avg_temp)

The maximum cross-correlation is less than 0.4, which indicates a weak correlation. Interestingly, the maximum occurs at lags +1 and +7, which means that a high world annual average temperature in period t is followed by a high annual average temperature in Barcelona. Still, the two series are only weakly correlated.
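One caveat worth checking: as far as I understand, ccf() pairs two plain numeric vectors by position, not by year. Since the Barcelona series starts in 1743 and the global one in 1750, lag 0 in the call above may not correspond to the same calendar year. A minimal sketch of aligning by year first, using merge() on toy data frames (the names local_df and global_df and their values are hypothetical):

```r
# Toy stand-ins with the same year ranges as the two datasets
# (random values, not the real temperatures)
set.seed(1)
local_df  <- data.frame(year = 1743:2013, avg_temp = rnorm(271, mean = 16))
global_df <- data.frame(year = 1750:2015, avg_temp = rnorm(266, mean = 8))

# Keep only the years present in both series, one row per year
aligned <- merge(local_df, global_df, by = "year",
                 suffixes = c("_local", "_global"))
range(aligned$year)

# Cross-correlation on the year-aligned vectors
cc <- ccf(aligned$avg_temp_local, aligned$avg_temp_global, plot = FALSE)
```

With the real barcelona and global data frames in place of the toy ones, the lags reported by ccf() would then refer to the same calendar years in both series.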

I am also curious about the cross-correlation over the most recent 40-50 years, rather than over the full series. I guess the maximum coefficient will be higher.

recent_barcelona <- barcelona %>% filter(year>1975)
recent_globe <- global %>% filter(year > 1975)
ccf(recent_barcelona$avg_temp , recent_globe$avg_temp )

As I expected, the maximum correlation has gone higher. The maximum correlation is now close to 0.6, a moderate amount.

Let’s check the values

(ccf(recent_barcelona$avg_temp , recent_globe$avg_temp, plot = FALSE ))
## 
## Autocorrelations of series 'X', by lag
## 
##    -12    -11    -10     -9     -8     -7     -6     -5     -4     -3 
##  0.191  0.137  0.231  0.374  0.453  0.410  0.434  0.414  0.466  0.361 
##     -2     -1      0      1      2      3      4      5      6      7 
##  0.489  0.643  0.630  0.543  0.498  0.261  0.341  0.197  0.251  0.206 
##      8      9     10     11     12 
##  0.248  0.139 -0.039 -0.046 -0.141

As can be seen, the correlation coefficients peak at lags -1 and 0. The value at lag -1 is 0.643, which means that Barcelona’s average annual temperature in \(year_t\) is positively and moderately correlated with the world’s average annual temperature in \(year_{t+1}\). So, using Barcelona’s temperature, we can predict the world’s to some extent.
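The sign convention of the lag is easy to get backwards, so here is a small sanity check on simulated data. According to the R documentation, the lag k value returned by ccf(x, y) estimates the correlation between x[t+k] and y[t]. In the sketch below, y is built as a copy of x delayed by one step, so x leads y and the peak should land at lag -1. All names and values are hypothetical.

```r
# Simulated white noise (not the temperature data)
set.seed(42)
noise <- rnorm(201)

x <- noise[2:201]  # x[t] = noise[t + 1]
y <- noise[1:200]  # y[t] = noise[t] = x[t - 1], so x leads y by one step

cc <- ccf(x, y, plot = FALSE)

# Lag at which the cross-correlation is largest
peak_lag <- cc$lag[which.max(cc$acf)]
peak_lag
```

This matches the reading above: a peak at a negative lag means that the first series passed to ccf() (here Barcelona) moves before the second one (the world).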

5. Interactive Visualization

Finally, I built an applet for interactive visualization of the whole dataset. It is possible to add several cities and the world’s average, and compare them. It has some other features, especially changing the moving-average period in order to see the effect on the smoothed line.

The applet can be found in my shinyapps library, here, and the source code can be found on my GitHub.