Preparations:

getwd()
## [1] "C:/Users/laure/Desktop"
setwd("C:/Users/laure/Desktop")
getwd()
## [1] "C:/Users/laure/Desktop"
rm(list=ls())                 
Sys.setenv(LANG = "en")     

EXERCISE 1 - European Statistical Office (eurostat)

a) Information about the data set

The European Statistical Office, known as Eurostat, is a Directorate-General of the European Commission located in Luxembourg.

Its main responsibilities are to provide statistical information to the institutions of the European Union (EU) and to promote the harmonisation of statistical methods across its member states as well as EFTA countries. Its mission is to provide high quality statistics that enable comparisons between countries and regions.

Eurostat's statistical work covers a wide variety of subjects, such as economy and finance, population and social conditions, science and technology, industry, and agriculture.

Its statistics and statistical databases are accessible to the public.

Sources: https://ec.europa.eu/eurostat/home? https://es.wikipedia.org/wiki/Eurostat

b) Install & load the “eurostat” package

#install.packages("eurostat", repos = "https://cran.r-project.org/")
# install the "eurostat" package from CRAN (only needed once)
library(eurostat)
# load the installed package

c) Explanation of the variables

Data set chosen: Life Expectancy by age and sex: “demo_mlexpec”

Life expectancy is a statistical measure of the average time an organism is expected to live, based on the year of its birth, its current age and other demographic factors including gender.

Source of the “demo_mlexpec” data set: https://appsso.eurostat.ec.europa.eu/nui/show.do?dataset=demo_mlexpec&lang=en

We base our analysis on comparing the life expectancy of men across European regions and generations:

  • European regions: Scandinavia, CentralEU, EasternEU, Mediterranean

  • Scandinavia is composed of Denmark (DK), Finland (FI), Norway (NO), Sweden (SE)

  • CentralEU is composed of Switzerland (CH), Germany (DE), Belgium (BE), Netherlands (NL), Luxembourg (LU), France (FR), Austria (AT)

  • EasternEU is composed of Slovenia (SI), Bulgaria (BG), Hungary (HU), Poland (PL), Ukraine (UA), Czech Republic (CZ)

  • Mediterranean is composed of Italy (IT), Spain (ES), Portugal (PT), Greece (EL)

  • Generations: Silent, Boomers, Generation X, Millennials, Generation Z

We choose the year “2017” for our analysis, that is, the life expectancy as estimated for that year.

d) Download the data

Let’s download our data “demo_mlexpec” with the eurostat package and assign it to “LifeExp”, which stands for “Life Expectancy”

LifeExp <- get_eurostat("demo_mlexpec")   
## Table demo_mlexpec cached at C:\Users\laure\AppData\Local\Temp\RtmpwjGuoe/eurostat/demo_mlexpec_date_code_TF.rds

Let’s take a look at our data and open it in a table

LifeExp                                  
## # A tibble: 434,988 x 6
##    unit  sex   age   geo   time       values
##    <fct> <fct> <fct> <fct> <date>      <dbl>
##  1 YR    F     Y1    AL    2017-01-01   79.7
##  2 YR    F     Y1    AM    2017-01-01   78.5
##  3 YR    F     Y1    AT    2017-01-01   83.2
##  4 YR    F     Y1    AZ    2017-01-01   77.7
##  5 YR    F     Y1    BE    2017-01-01   83.2
##  6 YR    F     Y1    BG    2017-01-01   77.9
##  7 YR    F     Y1    BY    2017-01-01   78.6
##  8 YR    F     Y1    CH    2017-01-01   84.9
##  9 YR    F     Y1    CY    2017-01-01   83.3
## 10 YR    F     Y1    CZ    2017-01-01   81.2
## # ... with 434,978 more rows
View(LifeExp)                             

Let’s take a closer look at the variables

summary(LifeExp)                          
##  unit        sex             age              geo              time           
##  YR:434988   F:145168   Y1     :  5058   BE     : 14964   Min.   :1960-01-01  
##              M:144910   Y10    :  5058   BG     : 14964   1st Qu.:1985-01-01  
##              T:144910   Y11    :  5058   CH     : 14964   Median :1999-01-01  
##                         Y12    :  5058   CZ     : 14964   Mean   :1996-01-24  
##                         Y13    :  5058   DE_TOT : 14964   3rd Qu.:2009-01-01  
##                         Y14    :  5058   EE     : 14964   Max.   :2017-01-01  
##                         (Other):404640   (Other):345204                       
##      values     
##  Min.   : 1.50  
##  1st Qu.:17.30  
##  Median :35.60  
##  Mean   :37.32  
##  3rd Qu.:56.10  
##  Max.   :86.80  
##  NA's   :258
str(LifeExp)                             
## Classes 'tbl_df', 'tbl' and 'data.frame':    434988 obs. of  6 variables:
##  $ unit  : Factor w/ 1 level "YR": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex   : Factor w/ 3 levels "F","M","T": 1 1 1 1 1 1 1 1 1 1 ...
##  $ age   : Factor w/ 86 levels "Y1","Y10","Y11",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ geo   : Factor w/ 55 levels "AL","AM","AT",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ time  : Date, format: "2017-01-01" "2017-01-01" ...
##  $ values: num  79.7 78.5 83.2 77.7 83.2 77.9 78.6 84.9 83.3 81.2 ...
LifeExp$age[1]                        
## [1] Y1
## 86 Levels: Y1 Y10 Y11 Y12 Y13 Y14 Y15 Y16 Y17 Y18 Y19 Y2 Y20 Y21 Y22 ... Y_LT1

Our analysis shows that the chosen data set contains the following variables:

  • unit: years
  • sex: 3 levels (M = male, F = female, T = total, i.e. both sexes combined)
  • age: age in completed years (86 levels: Y_LT1, Y1, Y2, …, Y83, Y84, Y_GE85)
  • where “Y_LT1” means “less than 1 year”
  • where “Y_GE85” means “85 years or over”
  • geo: country or region code (55 levels, e.g. AT, BE, CH)
  • time: date (ranging from 01-01-1960 to 01-01-2017)
  • values: the life expectancy in years at the given age
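Because `age` is stored as a factor of text labels, its levels sort alphabetically (Y1, Y10, Y11, …) rather than numerically. The following is a minimal sketch of how the labels could be turned into numbers; the helper name `age_to_years` is hypothetical and not part of the eurostat package, and mapping “Y_GE85” to 85 is a simplifying assumption.

```r
# Hypothetical helper: convert Eurostat age labels to whole years.
# Assumption: "Y_LT1" is treated as 0 and "Y_GE85" as 85.
age_to_years <- function(label) {
  label <- as.character(label)                    # works for factors too
  years <- suppressWarnings(as.integer(sub("^Y", "", label)))
  years[label == "Y_LT1"]  <- 0L                  # less than 1 year
  years[label == "Y_GE85"] <- 85L                 # 85 years or over
  years
}

age_to_years(c("Y_LT1", "Y1", "Y10", "Y2", "Y_GE85"))
```

With such a numeric column, rows can be ordered or filtered by age range instead of by listing every label.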

We choose this data set as an interesting example because our R mini-project team consists of people from different European countries, so we want to find out which country has the longest life expectancy.

#Vignette
browseVignettes(package = "eurostat")
## starting httpd help server ... done

e) Subsetting, Statistics, Plotting

Subsetting idea:

  • We want to subset by year 2017, because that’s the most recent data available in this data set
  • We want to subset by the males, because the life expectancy of men differs from the life expectancy of women
  • We want to subset by regions (Scandinavia, EasternEU, Mediterranean, CentralEU), because we want to compare the life expectancy between regions
  • We want to subset by age (GenZ, Millenials, GenX, Boomers, Silent), because we want to compare the life expectancy between generations

Subset by Gender (M) and Date (2017):

LifeExp_M <- subset(LifeExp, sex == "M" & time == "2017-01-01")   
# We assign this subset to "LifeExp_M" which means the life expectancy only for males for the date 01-01-2017
LifeExp_M
## # A tibble: 4,300 x 6
##    unit  sex   age   geo   time       values
##    <fct> <fct> <fct> <fct> <date>      <dbl>
##  1 YR    M     Y1    AL    2017-01-01   76.7
##  2 YR    M     Y1    AM    2017-01-01   72.1
##  3 YR    M     Y1    AT    2017-01-01   78.7
##  4 YR    M     Y1    AZ    2017-01-01   73.1
##  5 YR    M     Y1    BE    2017-01-01   78.5
##  6 YR    M     Y1    BG    2017-01-01   70.9
##  7 YR    M     Y1    BY    2017-01-01   68.5
##  8 YR    M     Y1    CH    2017-01-01   80.9
##  9 YR    M     Y1    CY    2017-01-01   79.3
## 10 YR    M     Y1    CZ    2017-01-01   75.3
## # ... with 4,290 more rows
View(LifeExp_M)
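The subset above compares the `time` column, which is a Date, with a character string and relies on R converting the string. A small sketch on made-up rows shows the same filter with the date made explicit; the data frame `toy` is an illustration only, not the Eurostat table.

```r
# Toy rows with the same column types as LifeExp (values are for illustration)
toy <- data.frame(
  sex    = c("M", "F", "M"),
  time   = as.Date(c("2017-01-01", "2017-01-01", "2016-01-01")),
  values = c(78.7, 83.2, 78.5),
  stringsAsFactors = FALSE
)

# Keep only males observed on 1 January 2017, using an explicit Date
toy_m_2017 <- subset(toy, sex == "M" & time == as.Date("2017-01-01"))
toy_m_2017
```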

Subset by Generations (age)

Source: https://www.pewresearch.org/topics/generations-and-age/

GenZ <- c("Y_LT1","Y1","Y2","Y3","Y4","Y5","Y6","Y7","Y8","Y9","Y10","Y11","Y12","Y13","Y14","Y15","Y16","Y17","Y18","Y19")       
# Generation Z "GenZ" is defined as people from the age of less than 1yr to 19yrs

Millenials <- c("Y20","Y21","Y22","Y23","Y24","Y25","Y26","Y27","Y28","Y29","Y30","Y31","Y32","Y33","Y34","Y35","Y36","Y37","Y38")
# Generation Y "Millenials" is defined as people from the age of 20yrs to 38yrs

GenX <- c("Y39","Y40","Y41","Y42","Y43","Y44","Y45","Y46","Y47","Y48","Y49","Y50","Y51","Y52","Y53","Y54")
# Generation X "GenX" is defined as people from the age of 39yrs to 54yrs

Boomers <- c("Y55","Y56","Y57","Y58","Y59","Y60","Y61","Y62","Y63","Y64","Y65","Y66","Y67","Y68","Y69","Y70","Y71","Y72","Y73")
# The Baby Boomer Generation "Boomers" is defined as people from the age of 55yrs to 73yrs

Silent <- c("Y74","Y75","Y76","Y77","Y78","Y79","Y80","Y81","Y82","Y83","Y84","Y_GE85")                                           
# The Silent Generation "Silent" is defined as people from the age of 74yrs to greater than 85yrs
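Typing every age label by hand makes it easy to drop one. As a sketch, the same vectors could be generated from the age boundaries with `paste0()`; the helper `age_labels` and the list `generations` are hypothetical names introduced here for illustration.

```r
# Hypothetical helper: Eurostat age labels for a closed range of whole years
age_labels <- function(from, to) paste0("Y", from:to)

generations <- list(
  GenZ       = c("Y_LT1", age_labels(1, 19)),
  Millenials = age_labels(20, 38),
  GenX       = age_labels(39, 54),
  Boomers    = age_labels(55, 73),
  Silent     = c(age_labels(74, 84), "Y_GE85")
)

# Number of age labels per generation; together they cover all 86 levels
lengths(generations)
sum(lengths(generations))
```

A quick check that the lengths add up to the 86 levels of `age` catches a missing or duplicated label.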

Subset by Region (geo):

#The Life Expectancy for the Scandinavian countries for each Generation
LifeExp_M_Scandinavia_GenZ <- subset(LifeExp_M, geo %in% c("DK","FI","NO","SE") & age %in% GenZ)
LifeExp_M_Scandinavia_Millenials <- subset(LifeExp_M, geo %in% c("DK","FI","NO","SE") & age %in% Millenials)
LifeExp_M_Scandinavia_GenX <- subset(LifeExp_M, geo %in% c("DK","FI","NO","SE") & age %in% GenX)
LifeExp_M_Scandinavia_Boomers <- subset(LifeExp_M, geo %in% c("DK","FI","NO","SE") & age %in% Boomers)
LifeExp_M_Scandinavia_Silent <- subset(LifeExp_M, geo %in% c("DK","FI","NO","SE") & age %in% Silent)

#The Life Expectancy for the Eastern EU countries for each Generation
# Eurostat's geo code for Poland is "PL" ("PO" matches no rows)
LifeExp_M_EasternEU_GenZ <- subset(LifeExp_M, geo %in% c("SI","BG","HU","PL","UA","CZ") & age %in% GenZ)
LifeExp_M_EasternEU_Millenials <- subset(LifeExp_M, geo %in% c("SI","BG","HU","PL","UA","CZ") & age %in% Millenials)
LifeExp_M_EasternEU_GenX <- subset(LifeExp_M, geo %in% c("SI","BG","HU","PL","UA","CZ") & age %in% GenX)
LifeExp_M_EasternEU_Boomers <- subset(LifeExp_M, geo %in% c("SI","BG","HU","PL","UA","CZ") & age %in% Boomers)
LifeExp_M_EasternEU_Silent <- subset(LifeExp_M, geo %in% c("SI","BG","HU","PL","UA","CZ") & age %in% Silent)

#The Life Expectancy for the Mediterranean countries for each Generation
# Eurostat's geo code for Greece is "EL" ("GR" matches no rows)
LifeExp_M_Mediterranean_GenZ <- subset(LifeExp_M, geo %in% c("IT","ES","PT","EL") & age %in% GenZ)
LifeExp_M_Mediterranean_Millenials <- subset(LifeExp_M, geo %in% c("IT","ES","PT","EL") & age %in% Millenials)
LifeExp_M_Mediterranean_GenX <- subset(LifeExp_M, geo %in% c("IT","ES","PT","EL") & age %in% GenX)
LifeExp_M_Mediterranean_Boomers <- subset(LifeExp_M, geo %in% c("IT","ES","PT","EL") & age %in% Boomers)
LifeExp_M_Mediterranean_Silent <- subset(LifeExp_M, geo %in% c("IT","ES","PT","EL") & age %in% Silent)

#The Life Expectancy for the CentralEU countries for each Generation
LifeExp_M_CentralEU_GenZ <- subset(LifeExp_M, geo %in% c("CH","DE","BE","NL","LU","FR","AT") & age %in% GenZ)
LifeExp_M_CentralEU_Millenials <- subset(LifeExp_M, geo %in% c("CH","DE","BE","NL","LU","FR","AT") & age %in% Millenials)
LifeExp_M_CentralEU_GenX <- subset(LifeExp_M, geo %in% c("CH","DE","BE","NL","LU","FR","AT") & age %in% GenX)
LifeExp_M_CentralEU_Boomers <- subset(LifeExp_M, geo %in% c("CH","DE","BE","NL","LU","FR","AT") & age %in% Boomers)
LifeExp_M_CentralEU_Silent <- subset(LifeExp_M, geo %in% c("CH","DE","BE","NL","LU","FR","AT") & age %in% Silent)

Statistics idea:

  • We want to compare the average life expectancy for each generation against each region for the date 2017
  • We want to compare the standard deviation for generation Z against each region for the date 2017
  • We want to compare the maximum life expectancy for “Y_LT1” between men and women for the date 2017
  • We want to compare the minimum life expectancy for “Y_LT1” between men and women for the date 2017
  • We want to compute the quantiles as an additional descriptive statistic for the date 2017

Mean:

mean_Scandinavia_GenZ <- mean(LifeExp_M_Scandinavia_GenZ$values)
mean_EasternEU_GenZ <- mean(LifeExp_M_EasternEU_GenZ$values)                
mean_Mediterranean_GenZ <- mean(LifeExp_M_Mediterranean_GenZ$values)
mean_CentralEU_GenZ <- mean(LifeExp_M_CentralEU_GenZ$values)

mean_Scandinavia_Millenials <- mean(LifeExp_M_Scandinavia_Millenials$values)
mean_EasternEU_Millenials <- mean(LifeExp_M_EasternEU_Millenials$values)
mean_Mediterranean_Millenials <- mean(LifeExp_M_Mediterranean_Millenials$values)
mean_CentralEU_Millenials <- mean(LifeExp_M_CentralEU_Millenials$values)

mean_Scandinavia_GenX <- mean(LifeExp_M_Scandinavia_GenX$values)
mean_EasternEU_GenX <- mean(LifeExp_M_EasternEU_GenX$values)
mean_Mediterranean_GenX <- mean(LifeExp_M_Mediterranean_GenX$values)
mean_CentralEU_GenX <- mean(LifeExp_M_CentralEU_GenX$values)

mean_Scandinavia_Boomers <- mean(LifeExp_M_Scandinavia_Boomers$values)
mean_EasternEU_Boomers <- mean(LifeExp_M_EasternEU_Boomers$values)
mean_Mediterranean_Boomers <- mean(LifeExp_M_Mediterranean_Boomers$values)
mean_CentralEU_Boomers <- mean(LifeExp_M_CentralEU_Boomers$values)

mean_Scandinavia_Silent <- mean(LifeExp_M_Scandinavia_Silent$values)
mean_EasternEU_Silent <- mean(LifeExp_M_EasternEU_Silent$values)
mean_Mediterranean_Silent <- mean(LifeExp_M_Mediterranean_Silent$values)
mean_CentralEU_Silent <- mean(LifeExp_M_CentralEU_Silent$values)
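The twenty means above all follow one pattern: group the rows by region and generation, then average `values`. A minimal sketch of that pattern with base R's `tapply()` is shown below on a made-up data frame; the lookup vectors `region_of` and `generation_of` and the numbers in `toy` are assumptions for illustration, not results from the data set.

```r
# Toy rows (made-up values) with the same columns used in the analysis
toy <- data.frame(
  geo    = c("DK", "DK", "SE", "IT", "IT", "ES"),
  age    = c("Y1", "Y40", "Y1", "Y1", "Y40", "Y40"),
  values = c(78.0, 40.0, 80.0, 80.0, 42.0, 42.5),
  stringsAsFactors = FALSE
)

# Hypothetical lookup tables: country code -> region, age label -> generation
region_of     <- c(DK = "Scandinavia", SE = "Scandinavia",
                   IT = "Mediterranean", ES = "Mediterranean")
generation_of <- c(Y1 = "GenZ", Y40 = "GenX")

# as.character() guards against indexing by factor codes instead of names
toy$region     <- unname(region_of[as.character(toy$geo)])
toy$generation <- unname(generation_of[as.character(toy$age)])

# One matrix of means: rows are regions, columns are generations
means <- tapply(toy$values, list(toy$region, toy$generation), mean)
means
```

The resulting matrix holds one mean per region and generation, so a single object replaces the individually named variables.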

Standard Deviation:

stDev_GenZ <- NULL    # Create an empty vector in order to assign the standard deviation of each region to an element
stDev_GenZ[1] <- sd(LifeExp_M_Scandinavia_GenZ$values)
stDev_GenZ[2] <- sd(LifeExp_M_EasternEU_GenZ$values)
stDev_GenZ[3] <- sd(LifeExp_M_Mediterranean_GenZ$values)
stDev_GenZ[4] <- sd(LifeExp_M_CentralEU_GenZ$values)
header <- c("Scandinavia", "EasternEU", "Mediterranean", "CentralEU")
names(stDev_GenZ) <- header
stDev_GenZ            
##   Scandinavia     EasternEU Mediterranean     CentralEU 
##      5.821170      6.612482      5.851897      5.779247
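The element-by-element assignment above could also be written with `sapply()` over a named list, which produces the named vector in one step. The sketch below uses made-up numbers; `toy_values` is a hypothetical stand-in for the four `$values` columns.

```r
# Made-up values standing in for the per-region `values` columns
toy_values <- list(
  Scandinavia = c(60, 70, 80),
  EasternEU   = c(50, 65, 80)
)

# sapply() applies sd() to each element and keeps the list names
stDev_toy <- sapply(toy_values, sd)
stDev_toy
```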

Conclusions:

  • We see that in EasternEU the standard deviation is the highest, whereas in all other regions the standard deviation is similar.

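  • We hypothesize that in EasternEU the poor people have a relatively low life expectancy, but the rich people have about the same life expectancy as in the other regions. Note, however, that this standard deviation is computed across ages (Y_LT1 to Y19) and countries, so it mainly reflects differences between countries and ages and cannot by itself confirm a difference between income groups.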

Quantiles:

quantile(LifeExp_M$values)
##   0%  25%  50%  75% 100% 
##  4.0 17.8 35.9 56.5 81.6

Min & Max:

#Subset men (M), time 2017, age Y_LT1
LifeExp_M_YLT1 <- subset(LifeExp, sex == "M" & time == "2017-01-01" & age == "Y_LT1")
LifeExp_M_YLT1
## # A tibble: 50 x 6
##    unit  sex   age   geo   time       values
##    <fct> <fct> <fct> <fct> <date>      <dbl>
##  1 YR    M     Y_LT1 AL    2017-01-01   77.1
##  2 YR    M     Y_LT1 AM    2017-01-01   72.5
##  3 YR    M     Y_LT1 AT    2017-01-01   79.4
##  4 YR    M     Y_LT1 AZ    2017-01-01   73.2
##  5 YR    M     Y_LT1 BE    2017-01-01   79.2
##  6 YR    M     Y_LT1 BG    2017-01-01   71.4
##  7 YR    M     Y_LT1 BY    2017-01-01   69.3
##  8 YR    M     Y_LT1 CH    2017-01-01   81.6
##  9 YR    M     Y_LT1 CY    2017-01-01   80.2
## 10 YR    M     Y_LT1 CZ    2017-01-01   76.1
## # ... with 40 more rows
View(LifeExp_M_YLT1)

#Subset women (F), time 2017, age Y_LT1
LifeExp_F_YLT1 <- subset(LifeExp, sex == "F" & time == "2017-01-01" & age == "Y_LT1")
LifeExp_F_YLT1
## # A tibble: 50 x 6
##    unit  sex   age   geo   time       values
##    <fct> <fct> <fct> <fct> <date>      <dbl>
##  1 YR    F     Y_LT1 AL    2017-01-01   80.1
##  2 YR    F     Y_LT1 AM    2017-01-01   78.9
##  3 YR    F     Y_LT1 AT    2017-01-01   84  
##  4 YR    F     Y_LT1 AZ    2017-01-01   77.9
##  5 YR    F     Y_LT1 BE    2017-01-01   83.9
##  6 YR    F     Y_LT1 BG    2017-01-01   78.4
##  7 YR    F     Y_LT1 BY    2017-01-01   79.3
##  8 YR    F     Y_LT1 CH    2017-01-01   85.6
##  9 YR    F     Y_LT1 CY    2017-01-01   84.2
## 10 YR    F     Y_LT1 CZ    2017-01-01   82  
## # ... with 40 more rows
View(LifeExp_F_YLT1)

#Min-Max Life Expectancy (M,2017,Y_LT1)
max_LifeExp_M_YLT1 <- LifeExp_M_YLT1[which.max(LifeExp_M_YLT1$values),c(4,6)]    
max_LifeExp_M_YLT1                                                              # Men (M) have the highest life expectancy in Switzerland of 81.6yrs for "less than 1 year (Y_LT1)", year 2017
## # A tibble: 1 x 2
##   geo   values
##   <fct>  <dbl>
## 1 CH      81.6
min_LifeExp_M_YLT1 <- LifeExp_M_YLT1[which.min(LifeExp_M_YLT1$values),c(4,6)]    
min_LifeExp_M_YLT1                                                              # Men (M) have the lowest life expectancy in Ukraine of 68.3yrs for "less than 1 year (Y_LT1)", year 2017
## # A tibble: 1 x 2
##   geo   values
##   <fct>  <dbl>
## 1 UA      68.3
#Min-Max Life Expectancy (F,2017,Y_LT1)
max_LifeExp_F_YLT1 <- LifeExp_F_YLT1[which.max(LifeExp_F_YLT1$values),c(4,6)]   
max_LifeExp_F_YLT1                                                              # Women (F) have the highest life expectancy in Spain of 86.1yrs for "less than 1 year (Y_LT1)", year 2017
## # A tibble: 1 x 2
##   geo   values
##   <fct>  <dbl>
## 1 ES      86.1
min_LifeExp_F_YLT1 <- LifeExp_F_YLT1[which.min(LifeExp_F_YLT1$values),c(4,6)]   
min_LifeExp_F_YLT1                                                              # Women (F) have the lowest life expectancy in Georgia of 77.8yrs for "less than 1 year (Y_LT1)", year 2017
## # A tibble: 1 x 2
##   geo   values
##   <fct>  <dbl>
## 1 GE      77.8
#How much longer do women live than men when considering the maximum life expectancy (2017,Y_LT1)?
max_LifeExp_YLT1_Difference <- (max_LifeExp_F_YLT1[,2] - max_LifeExp_M_YLT1[,2])
max_LifeExp_YLT1_Difference                                                     # Women have a 4.5 years longer maximum life expectancy than men
##   values
## 1    4.5
#How much longer do women live than men when considering the minimum life expectancy (2017,Y_LT1)?
min_LifeExp_YLT1_Difference <- (min_LifeExp_F_YLT1[,2] - min_LifeExp_M_YLT1[,2])
min_LifeExp_YLT1_Difference                                                     # Women have a 9.5 years longer minimum life expectancy than men
##   values
## 1    9.5
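The two differences can be reproduced from the four extreme values printed above. The sketch below uses only those four numbers in small stand-in data frames; `extremes` is a hypothetical helper name.

```r
# The four extreme values reported above (Y_LT1, 2017)
toy_m <- data.frame(geo = c("CH", "UA"), values = c(81.6, 68.3),
                    stringsAsFactors = FALSE)
toy_f <- data.frame(geo = c("ES", "GE"), values = c(86.1, 77.8),
                    stringsAsFactors = FALSE)

# Hypothetical helper: smallest and largest life expectancy in a table
extremes <- function(d) c(min = min(d$values), max = max(d$values))

# Women minus men, for the minimum and for the maximum
gap <- extremes(toy_f) - extremes(toy_m)
round(gap, 1)
```

Note that the extremes for men and women come from different countries, so the gap compares the best and worst cases rather than the sexes within one country.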

Plotting idea:

  • We want to create a grouped bar plot, where each group of bars displays the life expectancies of generations by region

Instructions:

  • The package “plotly” is able to “Create interactive web graphics from ‘ggplot2’ graphs and/or a custom interface to the (MIT-licensed) JavaScript library ‘plotly.js’ inspired by the grammar of graphics.”

  • In our plot you can double click on the legend to isolate one trace (useful to compare between countries within a certain generation).

  • In our plot you can hover over the bar to display the life expectancy

Source: https://www.rdocumentation.org/packages/plotly/versions/4.9.1

Install package “plotly” & enable it in the library:

#install.packages("plotly", repos = "http://cran.rstudio.com/")
library(plotly)
## Loading required package: ggplot2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout

Plot (M, 2017, mean per generation and region)

x <- c("Scandinavia","EasternEU","Mediterranean","CentralEU")
y_GenZ <- round(c(mean_Scandinavia_GenZ,mean_EasternEU_GenZ,mean_Mediterranean_GenZ,mean_CentralEU_GenZ), digits = 1)
y_Millenials <- round(c(mean_Scandinavia_Millenials,mean_EasternEU_Millenials,mean_Mediterranean_Millenials,mean_CentralEU_Millenials), digits = 1)
y_GenX <- round(c(mean_Scandinavia_GenX,mean_EasternEU_GenX,mean_Mediterranean_GenX,mean_CentralEU_GenX), digits = 1)
y_Boomers <- round(c(mean_Scandinavia_Boomers,mean_EasternEU_Boomers,mean_Mediterranean_Boomers,mean_CentralEU_Boomers), digits = 1)
y_Silent <- round(c(mean_Scandinavia_Silent,mean_EasternEU_Silent,mean_Mediterranean_Silent,mean_CentralEU_Silent), digits = 1)

data <- data.frame(x, y_GenZ, y_Millenials, y_GenX, y_Boomers, y_Silent)   # include y_Silent so every trace is found in the data
data %>% 
  plot_ly() %>%
  add_trace(x = ~x, y = ~y_Silent, name = 'Silent', type = 'bar', 
            text = y_Silent, textposition = 'auto',
            marker = list(color = 'rgb(204,229,255)',
                          line = list(color = 'rgb(0,51,102)', width = 1.5))) %>%
  add_trace(x = ~x, y = ~y_Boomers, name = 'Boomers', type = 'bar', 
            text = y_Boomers, textposition = 'auto',
            marker = list(color = 'rgb(153,204,255)',
                          line = list(color = 'rgb(0,51,102)', width = 1.5))) %>%
  add_trace(x = ~x, y = ~y_GenX, name = 'GenX', type = 'bar', 
            text = y_GenX, textposition = 'auto',
            marker = list(color = 'rgb(102,178,255)',
                          line = list(color = 'rgb(0,51,102)', width = 1.5))) %>%
  add_trace(x = ~x, y = ~y_Millenials, name = 'Millenials', type = 'bar', 
            text = y_Millenials, textposition = 'auto',
            marker = list(color = 'rgb(51,153,255)',
                          line = list(color = 'rgb(0,51,102)', width = 1.5))) %>%
  add_trace(x = ~x, y = ~y_GenZ, name = 'GenZ', type = 'bar', 
            text = y_GenZ, textposition = 'auto',
            marker = list(color = 'rgb(0,102,204)',
                          line = list(color = 'rgb(0,51,102)', width = 1.5))) %>%
  layout(title = "Comparison of average Life Expectancy per Generation by Region",
         barmode = 'group',
         xaxis = list(title = "Region"),
         yaxis = list(title = "Mean of Life Expectancy per Generation"))
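For a static version of the same grouped layout, base R's `barplot()` with `beside = TRUE` could be used. This is only a sketch with made-up numbers; `toy_means` is a hypothetical matrix, not the means computed above.

```r
# Made-up means purely for illustration (not results from the data set)
toy_means <- matrix(
  c(70, 50, 68, 48),
  nrow = 2,
  dimnames = list(Generation = c("GenZ", "Millenials"),
                  Region     = c("Scandinavia", "EasternEU"))
)

# beside = TRUE draws the rows (generations) side by side within each region
bar_mid <- barplot(toy_means, beside = TRUE,
                   legend.text = rownames(toy_means),
                   xlab = "Region",
                   ylab = "Mean of Life Expectancy per Generation")
```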

f) Native data format

  • The data as shown on the website ( https://appsso.eurostat.ec.europa.eu/nui/show.do?dataset=demo_mlexpec&lang=en ) is presented with one row per country and the time periods spread across separate columns, therefore we conclude that its format is wide.

  • In contrast, the downloaded data as shown by View(LifeExp) is long: each row holds a single value for one combination of sex, age, country and time.

View(LifeExp)

g) Reshape data format from long to wide

# install the package "tidyverse"
#install.packages("tidyverse", repos = "https://cran.r-project.org/")
library(tidyverse)
## -- Attaching packages ---------------------------------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v tibble  2.1.3     v purrr   0.3.3
## v tidyr   1.0.0     v dplyr   0.8.3
## v readr   1.3.1     v stringr 1.4.0
## v tibble  2.1.3     v forcats 0.4.0
## -- Conflicts ------------------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks plotly::filter(), stats::filter()
## x dplyr::lag()    masks stats::lag()
# Long
LifeExp_M_CentralEU_Boomers
## # A tibble: 133 x 6
##    unit  sex   age   geo   time       values
##    <fct> <fct> <fct> <fct> <date>      <dbl>
##  1 YR    M     Y55   AT    2017-01-01   26.7
##  2 YR    M     Y55   BE    2017-01-01   26.6
##  3 YR    M     Y55   CH    2017-01-01   28.5
##  4 YR    M     Y55   DE    2017-01-01   26  
##  5 YR    M     Y55   FR    2017-01-01   27.5
##  6 YR    M     Y55   LU    2017-01-01   27.1
##  7 YR    M     Y55   NL    2017-01-01   27.2
##  8 YR    M     Y56   AT    2017-01-01   25.8
##  9 YR    M     Y56   BE    2017-01-01   25.8
## 10 YR    M     Y56   CH    2017-01-01   27.6
## # ... with 123 more rows
View(LifeExp_M_CentralEU_Boomers)

# Wide
LifeExp_M_CentralEU_Boomers_wide <- spread(data = LifeExp_M_CentralEU_Boomers, key = geo, value = values)
View(LifeExp_M_CentralEU_Boomers_wide)
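In tidyr 1.0.0 and later, `pivot_wider()` is the newer interface for the same reshaping as `spread()`. The sketch below applies it to four rows taken from the long table printed above; `toy_long` and `toy_wide` are names introduced here for illustration.

```r
library(tidyr)

# Four rows from the long table above (ages Y55 and Y56, Austria and Belgium)
toy_long <- data.frame(
  age    = c("Y55", "Y55", "Y56", "Y56"),
  geo    = c("AT", "BE", "AT", "BE"),
  values = c(26.7, 26.6, 25.8, 25.8),
  stringsAsFactors = FALSE
)

# One row per age, one column per country
toy_wide <- pivot_wider(toy_long, names_from = geo, values_from = values)
toy_wide
```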

EXERCISE 2 - Deutscher Wetterdienst (rdwd)

a) Finding a new data set

Deutscher Wetterdienst (accessed through the “rdwd” R package)

?rdwd
## No documentation for 'rdwd' in specified packages and libraries:
## you could try '??rdwd'
browseVignettes(package = "rdwd")

b) Information about the data set

  • The Deutscher Wetterdienst (German Meteorological Service, DWD), founded in 1952, is a public institution with partial legal capacity under the Federal Ministry of Transport and Digital Infrastructure. It is responsible for meeting meteorological requirements arising from all areas of the economy and society in Germany.
  • One of the DWD’s main tasks is to ensure constant improvement of its meteorological prediction models by recording, analysing and monitoring the physical and chemical processes in the atmosphere. The DWD provides thousands of datasets with weather observations.
  • The R package rdwd contains code to select, download and read weather data from measuring stations across Germany.

Sources: https://www.dwd.de/SharedDocs/broschueren/EN/press/kurzportraet_en.pdf?__blob=publicationFile&v=9 https://www.dwd.de/EN/aboutus/aboutus_node.html

#??rdwd

c) Install & load the “rdwd” package

# install the package "rdwd"
#install.packages("rdwd", repos = "https://cran.r-project.org/")
library(rdwd)                                           

# install the package "lubridate"
#install.packages("lubridate", repos = "https://cran.r-project.org/")                         
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
# install the package "ggplot2"
#install.packages("ggplot2", repos = "https://cran.r-project.org/")                           
library(ggplot2)

d) Choose an interesting example. Identify the variables you are interested in and the respective codes. Obtain basic information about the variables. Write 1-2 paragraphs and specify the source (URL) of your information.

  • rdwd is an R package that gives access to data from the German weather service. Through this package you can find data from about 6,000 meteorological recording stations (of which about 2,500 are still active).
  • The data covers rain, temperature, wind, sunshine, pressure, cloudiness, humidity, snow and other weather information for different periods, but it is restricted to German territory. Source: https://cran.r-project.org/web/packages/rdwd/vignettes/rdwd.html

e) Download the data

  • The mountain peak “Zugspitze” has an altitude of 2962 m and is therefore, in our opinion, an “interesting example”. We compare it with the city of Regensburg.
findID("Zugspitze")     
## Zugspitze 
##      5792
findID("Regensburg")
## Regensburg 
##       4104

selectDWD: Select files for downloading

  • name: Choose the name of the weather station
  • res: Choose resolution, e.g. “hourly”, “daily”, “monthly”
  • var: Choose variable of interest, e.g. “air_temperature”, “cloudiness”
  • per: Choose desired time period
  • “recent”: data from the last year, up to date usually within a few days
  • “historical”: long time series

Zugspitze

link1 <- selectDWD(name="Zugspitze", res="daily", var="kl", per="recent")     


file1 <- dataDWD(link1, read=FALSE)                                           
## rmarkdown::render -> knitr::knit -> call_block -> block_exec -> in_dir -> evaluate -> evaluate::evaluate -> evaluate_call -> timing_fn -> handle -> dataDWD -> dirDWD: adding to directory 'C:/Users/laure/Desktop/DWDdata'
## rmarkdown::render -> knitr::knit -> call_block -> block_exec -> in_dir -> evaluate -> evaluate::evaluate -> evaluate_call -> timing_fn -> handle -> dataDWD: 1 file already existing and not downloaded again:  'daily_kl_recent_tageswerte_KL_05792_akt.zip'
## Now downloading 0 files...
# dataDWD: Get climate data from the German Weather Service (DWD) FTP-server. 
# The desired .zip (or .txt) dataset is downloaded
# Complete file URL(s) (including base and filename.zip) as returned by 'selectDWD'

Zugspitze <- readDWD(file1, varnames=TRUE)                                    
# Read climate data that was downloaded with dataDWD
# The data is unzipped and subsequently, the file is read, processed and returned as a data.frame
# varnames: TRUE to obtain more informative column names

Regensburg

link2 <- selectDWD(name="Regensburg", res="daily", var="kl", per="recent")    
# selectDWD: Select files for downloading
# name: Choose the name of the weather station
# res: Choose resolution, e.g. "hourly","daily", "monthly"
# var: Choose variable of interest, e.g. "air_temperature", "cloudiness"
# per: Choose desired time period
# "recent": data from the last year, up to date usually within a few days
# "historical": long time series
file2 <- dataDWD(link2, read=FALSE)                                                     
## rmarkdown::render -> knitr::knit -> call_block -> block_exec -> in_dir -> evaluate -> evaluate::evaluate -> evaluate_call -> timing_fn -> handle -> dataDWD -> dirDWD: adding to directory 'C:/Users/laure/Desktop/DWDdata'
## rmarkdown::render -> knitr::knit -> call_block -> block_exec -> in_dir -> evaluate -> evaluate::evaluate -> evaluate_call -> timing_fn -> handle -> dataDWD: 1 file already existing and not downloaded again:  'daily_kl_recent_tageswerte_KL_04104_akt.zip'
## Now downloading 0 files...
# dataDWD: Get climate data from the German Weather Service (DWD) FTP-server.
# The desired .zip (or .txt) dataset is downloaded
# Complete file URL(s) (including base and filename.zip) as returned by 'selectDWD'
Regensburg <- readDWD(file2, varnames=TRUE)                                   
# Read climate data that was downloaded with dataDWD
# The data is unzipped and subsequently, the file is read, processed and returned as a data.frame
# varnames: TRUE to obtain more informative column names

f) Subsetting, Statistics, Plotting

#Subset of Zugspitze
tempZugspitze <- Zugspitze[,c(2,14)]

# Subset of Regensburg
tempRegensburg <- Regensburg[,c(2,14)]

Statistics of Zugspitze

length(Zugspitze$MESS_DATUM)                          # number of data points
## [1] 550
mean(Zugspitze$TMK.Lufttemperatur)                    # mean of air temperature
## [1] -1.525636
max(Zugspitze$TMK.Lufttemperatur)                     # max air temperature
## [1] 13.6
min(Zugspitze$TMK.Lufttemperatur)                     # min air temperature
## [1] -21
mean(Zugspitze$SHK_TAG.Schneehoehe, na.rm=TRUE)       # mean snow depth
## [1] 191.0712

Statistics of Regensburg

length(Regensburg$MESS_DATUM)                         # number of data points
## [1] 550
mean(Regensburg$TMK.Lufttemperatur)                   # mean of air temperature
## [1] 12.59127
max(Regensburg$TMK.Lufttemperatur)                    # max air temperature
## [1] 27.3
min(Regensburg$TMK.Lufttemperatur)                    # min air temperature
## [1] -5.4
mean(Regensburg$SHK_TAG.Schneehoehe, na.rm=TRUE)      # mean snow depth
## [1] 0.7896825

Comparison of the Statistics: Zugspitze vs. Regensburg

Air temperature (°C):

  • Zugspitze: -1.53 (mean), 13.6 (max), -21 (min)

  • Regensburg: 12.59 (mean), 27.3 (max), -5.4 (min)

Snow depth (cm):

  • Zugspitze: 191.07 (mean)

  • Regensburg: 0.79 (mean)
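The same set of statistics is computed for both stations, so it could be wrapped in one function and applied to each. The sketch below runs on made-up data frames that only borrow the column names used above; `station_summary`, `toy_peak` and `toy_city` are hypothetical names.

```r
# Hypothetical helper: the statistics used above, for one station
station_summary <- function(station) {
  c(n         = nrow(station),
    temp_mean = mean(station$TMK.Lufttemperatur, na.rm = TRUE),
    temp_max  = max(station$TMK.Lufttemperatur, na.rm = TRUE),
    temp_min  = min(station$TMK.Lufttemperatur, na.rm = TRUE),
    snow_mean = mean(station$SHK_TAG.Schneehoehe, na.rm = TRUE))
}

# Made-up observations (not DWD data); NA marks a missing snow reading
toy_peak <- data.frame(TMK.Lufttemperatur  = c(-5, -1, 3),
                       SHK_TAG.Schneehoehe = c(200, NA, 180))
toy_city <- data.frame(TMK.Lufttemperatur  = c(10, 12, 17),
                       SHK_TAG.Schneehoehe = c(0, 0, NA))

# One row per station, one column per statistic
summary_table <- rbind(Peak = station_summary(toy_peak),
                       City = station_summary(toy_city))
summary_table
```

Building the comparison from one table keeps the reported numbers in step with the computed ones.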

# Plot of Zugspitze & Regensburg
par(mar=c(4,4,2,0.5), mgp=c(2.7, 0.8, 0), cex=0.8)
# Zugspitze is drawn in blue and Regensburg in red; the legend colours follow the same order
plot(Zugspitze[,c(2,14)], type="l", ylim=c(-25,30), col="blue", xaxt="n", las=1, main="Daily temp Zugspitze vs. Regensburg")
lines(Regensburg[,c(2,14)], col="red")
berryFunctions::monthAxis()   ;   abline(h=0)
legend("top", c("Zugspitze","Regensburg"), cex=.8, col=c("blue","red"), lty=c(1,1))
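One way to keep line colours and legend entries from drifting apart is to define the colours once in a named vector and reuse it for both. The sketch below uses made-up temperatures; `station_cols`, `days` and `toy_temp` are hypothetical names.

```r
# One colour per station, defined once and reused for lines and legend
station_cols <- c(Zugspitze = "blue", Regensburg = "red")

# Made-up daily temperatures for illustration only
days     <- as.Date("2019-01-01") + 0:4
toy_temp <- list(Zugspitze  = c(-12, -10, -11, -9, -8),
                 Regensburg = c(1, 2, 0, 3, 4))

plot(days, toy_temp$Zugspitze, type = "l", las = 1,
     col = station_cols[["Zugspitze"]],
     ylim = range(unlist(toy_temp)),       # make room for both series
     xlab = "Date", ylab = "Temperature",
     main = "Daily temp Zugspitze vs. Regensburg")
lines(days, toy_temp$Regensburg, col = station_cols[["Regensburg"]])
abline(h = 0)
legend("top", legend = names(station_cols), col = station_cols, lty = 1)
```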

g) Native data format

View(Zugspitze)
# The data format is wide, because each row holds one day's observations for the station and each measured variable is in a separate column.