Welcome to the second script of Module 1!
At the example of phenological observations, you will here learn and further practice how to:

  • import and edit data
  • plot data in diagrams
  • plot data in maps

Please read the following code instructions carefully, try to understand the code that follows each instruction, execute it and see what happens. Do not hesitate to change the code or write your own code, and execute it as well!


1. Import and edit data

We start off again with defining our working directory. This time, we load all packages at the beginning of our script:

dir <- "C:/Users/Max/Desktop/IAMO_neu/eLearning/Module_1_Data_visualization/"

library(reshape2)
library(dplyr)
library(ggplot2)
library(viridis)
library(pals)
library(rgdal)
library(tigris)

We will work with the phenological observations of potato. The original dates when a phenological phase was observed were previously converted to numbers from 1 to 366, referring to the day of the year:

data <- read.csv(paste0(dir, "M01_Part02_Phenological-Observations/data/phenology_by_crop/potato.csv"), sep=";")
head(data)
year station crop variety X01_Seeding X02_Sprouts X03_Lateral_shoots_formation X04_Inflorescences_formation X05_Blossoming X06_Cease_blossoming X07_Withering_of_aboveground_part X08_Harvest
1995 Odzun potato Lorkh 105 134 146 167 211 216 243 250
1996 Odzun potato Lorkh 133 152 162 172 180 202 244 260
1997 Odzun potato Lorkh 114 138 148 169 177 201 222 260
1998 Odzun potato Lorkh 97 136 146 161 169 185 212 251
1999 Odzun potato Lorkh 105 140 146 167 179 201 243 250
2000 Odzun potato Lorkh 105 141 149 168 180 192 233 248

Let’s have a look at the different phenological phases, years, stations and varieties contained in our dataset:

names(data)[5:12]             ## There are observations of 8 different phenological phases...
## [1] "X01_Seeding"                       "X02_Sprouts"                      
## [3] "X03_Lateral_shoots_formation"      "X04_Inflorescences_formation"     
## [5] "X05_Blossoming"                    "X06_Cease_blossoming"             
## [7] "X07_Withering_of_aboveground_part" "X08_Harvest"
sort(unique(data$year))       ## ...from the years 1995 to 2020
##  [1] 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
## [16] 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020
unique(data$station)    
##  [1] "Odzun"      "Stepanavan" "Ijevan"     "Shorza"     "Martuni"   
##  [6] "Masrik"     "Goris"      "Tashir"     "Amasia"     "Vanadzor"  
## [11] "Sevan"      "Gavar"
length(unique(data$station))  ## ...from 12 different agrometeorological stations
## [1] 12
unique(data$variety)          ## ...of 12 different varieties of potato.
##  [1] "Lorkh"    "Impala"   "Victoria" "Ogonyok"  "Monalisa" "Pamir"   
##  [7] "Cosmos"   "Holland"  "Marfona"  "Diamant"  "Diana"    "German "

There are many layers in our data, and therefore many possible ways in which to visualize it. But before we visualize it, let’s clean and transform the data:

Our dataset contains NA values. We identify all rows that do not contain any NA values with the function complete.cases(), and only keep these rows:

data <- data[complete.cases(data),]

We transform our dataset with melt() to shorten the amount of columns:

data_melted <- melt(data, id.vars = c("year", "station", "crop", "variety"))
names(data_melted)[5:6] <- c("phase", "DOY")
head(data_melted)
year station crop variety phase DOY
1995 Odzun potato Lorkh X01_Seeding 105
1996 Odzun potato Lorkh X01_Seeding 133
1997 Odzun potato Lorkh X01_Seeding 114
1998 Odzun potato Lorkh X01_Seeding 97
1999 Odzun potato Lorkh X01_Seeding 105
2000 Odzun potato Lorkh X01_Seeding 105

To remove the leading “X” in the column “phase”, we can do the following:

data_melted$phase <- substring(data_melted$phase, 2)
head(data_melted)
year station crop variety phase DOY
1995 Odzun potato Lorkh 01_Seeding 105
1996 Odzun potato Lorkh 01_Seeding 133
1997 Odzun potato Lorkh 01_Seeding 114
1998 Odzun potato Lorkh 01_Seeding 97
1999 Odzun potato Lorkh 01_Seeding 105
2000 Odzun potato Lorkh 01_Seeding 105

To be able to group our data by “year” in the following plots, we have to transform the column “year” to a factor variable:

data_melted$year <- as.factor(data_melted$year)

We can use functions from the package dplyr to calculate some statistics across groups defined by certain parameter combinations. For example, we can calculate the mean phase date for each year in the following way:

mean_yearly <- data_melted %>% group_by(year, phase) %>% summarize(mean_DOY=mean(DOY)) %>% as.data.frame()
## `summarise()` has grouped output by 'year'. You can override using the `.groups` argument.

Or, calculate the latest date of each phase for each existing combination of variety and station:

max_station_variety <- data_melted %>% group_by(station, variety, phase) %>% summarize(max_DOY=max(DOY)) %>% as.data.frame()
## `summarise()` has grouped output by 'station', 'variety'. You can override using the `.groups` argument.

2. Plot data in diagrams

The phenological data consists of many repeated observations of each phenological phase, because these phases were observed in different years, and at different agrometeorological stations. Let’s plot the DOY of all observations of “01_Seeding” as points in one plot, using the function geom_jitter(). We can control the color, size and transparency (argument “alpha”) of the points inside geom_jitter():

seeding <- filter(data_melted, phase == "01_Seeding")

ggplot(data=seeding, aes(x=phase, y=DOY)) +
  geom_jitter(color="darkblue", size=2, alpha=0.2)

Since we deal with repeated observations, we should visually summarize them accordingly to make them more easily understandable. Box plots are ideal for that:

ggplot(data=seeding, aes(x=phase, y=DOY)) +
  geom_jitter(color="darkblue", size=2, alpha=0.2) +
  geom_boxplot(alpha=0.7)

The seeding DOY ranges from 83 to 161. We know that the data in “seeding” contains observations from different years, and different stations. Maybe the great variation in seeding DOYs stems from the fact that in some years or some stations, seeding starts consistently earlier or later? Let us find out by ‘splitting up’ the data and create two new figures, with separate box plots for each year, and each station, respectively. To do so, we have to add “year” or “station” both to the arguments “x” and “fill” in the aes() function inside ggplot(), and to the argument “color” in the aes() function inside geom_jitter():

ggplot(data=seeding, aes(x=year, y=DOY, fill=year)) +
  geom_jitter(aes(color=year), size=2, alpha=0.2) +
  geom_boxplot(alpha=0.7) +
  theme(axis.text.x=element_text(angle=45, hjust=1))

ggplot(data=seeding, aes(x=station, y=DOY, fill=station)) +
  geom_jitter(aes(color=station), size=2, alpha=0.2) +
  geom_boxplot(alpha=0.7) +
  theme(axis.text.x=element_text(angle=45, hjust=1))

It seems there are considerable differences between stations! However, let us ignore these differences for a moment, and create a figure with one boxplot for each phenological phase. Let us use some colors from the palette “kovesi.rainbow” from the package pals, and let us get all text elements of the plot right:

plot1 <- ggplot(data=data_melted, aes(x=phase, y=DOY, fill=phase)) +
           geom_jitter(aes(color=phase), size=2, alpha=0.4) + 
           geom_boxplot(alpha=0.7) +
           scale_fill_manual(values = as.vector(kovesi.rainbow(8))) +
           scale_colour_manual(values = as.vector(kovesi.rainbow(8))) +
           theme(axis.text.x=element_text(angle=45, hjust=1), legend.position="none") +
           ggtitle("Phenological stages of potato (all stations, all varieties)") + 
           xlab("") + 
           ylab("Day of the year (DOY)")

plot1

The labels on the x-axis are quite long. We can shorten them by supplying a vector of pre-defined labels in the following way:

xlabels <- c("Seeding", "Sprouts", "Lateral shoots", "Inflorescences", "Blossoming", "Cease blossoming", "Withering", "Harvest")

plot1 + scale_x_discrete(labels=xlabels)

To explore the differences between stations for all phases separately, we can use again the function facet_wrap(). Note we should change the colors to avoid confusion, and we have to adjust the number of colours supplied to scale_fill_manual() and scale_colour_manual() to 12, according to the amount of existing stations:

ggplot(data=data_melted, aes(x=station, y=DOY, fill=station)) +
           geom_jitter(aes(color=station), size=2, alpha=0.4) + 
           geom_boxplot(alpha=0.7) +
           scale_fill_manual(values = as.vector(cubicl(12))) +
           scale_colour_manual(values = as.vector(cubicl(12))) +
           theme(axis.text.x=element_text(angle=45, hjust=1), legend.position="none") +
           ggtitle("Phenological stages of potato (by station, all varieties)") + 
           xlab("") + 
           ylab("Day of the year (DOY)") +
           facet_wrap(~phase, scales="free")

Until now, we have disregarded the temporal component in the data, i.e. the column “year”. However, it might be that there are some temporal patterns in the phenological observations. Let’s find out! For this, we use geom_point() to plot points, and these points should have “year” as x and “DOY” as y coordinate. We define each phenological phase as a separate group with a separate color, and we add a linear regression fit with the function geom_smooth(). We also customize our x axis labels by providing a vector of breaks in scale_x_discrete():

ggplot(data=data_melted, aes(x=year, y=DOY, color = phase, group = phase)) +
    geom_point(alpha=0.4, size=2) + 
    geom_smooth(method=lm, se=FALSE, size=1.3) +
    scale_color_manual(values = as.vector(kovesi.rainbow(8))) +
    theme(axis.text.x=element_text(angle=45, hjust=1)) +
    ggtitle("Potato, all stations, all varieties") +
    scale_x_discrete(breaks = c(1995,2000,2005,2010,2015,2020)) + 
    xlab("") + 
    ylab("DOY")
## `geom_smooth()` using formula 'y ~ x'

To explore the temporal trends for each station separately, we can group and color by “station”, and create a separate facet for each phase:

ggplot(data=data_melted, aes(x=year, y=DOY, color = station, group = station)) +
    geom_point(alpha=0.4, size=1.5) + 
    geom_smooth(method=lm, se=FALSE, size=1.3) +
    scale_color_manual(values = as.vector(cols25(12))) +
    theme(axis.text.x=element_text(angle=45, hjust=1)) +
    ggtitle("Potato, all varieties") +
    scale_x_discrete(breaks = c(1995,2000,2005,2010,2015,2020)) + 
    xlab("") + 
    ylab("DOY") +
    facet_wrap(~phase, scales="free_y")
## `geom_smooth()` using formula 'y ~ x'

3. Plot data in maps

The goal of this last part is to plot the location of the agrometeorological stations onto a country map, using ggplot(). To complement this map, we will also use again the yield data that we have already worked on in the first script of module 1.

First, we import the two shapefiles needed for our map:

marzes   <- readOGR(paste0(dir, "/shapefiles/ARM_marzes_short_simplified_500.shp"), verbose=F)
stations <- readOGR(paste0(dir, "/shapefiles/stations.shp"), verbose=F)

The two shapefiles do not have the same coordinate system. We therefore re-project “marzes” to match the coordinate system of “stations”, using the function spTransform(). You can ignore the message ## Warning in proj4string(marzes): CRS object has comment, which is lost in output:

proj4string(marzes)
## Warning in proj4string(marzes): CRS object has comment, which is lost in output
## [1] "+proj=utm +zone=38 +datum=WGS84 +units=m +no_defs"
proj4string(stations)
## Warning in proj4string(stations): CRS object has comment, which is lost in
## output
## [1] "+proj=longlat +datum=WGS84 +no_defs"
marzes_new <- spTransform(marzes, proj4string(stations))
## Warning in proj4string(stations): CRS object has comment, which is lost in
## output
proj4string(marzes)
## Warning in proj4string(marzes): CRS object has comment, which is lost in output
## [1] "+proj=utm +zone=38 +datum=WGS84 +units=m +no_defs"
proj4string(stations)
## Warning in proj4string(stations): CRS object has comment, which is lost in
## output
## [1] "+proj=longlat +datum=WGS84 +no_defs"

We can now quickly plot the stations on top of the marzes with the plot() function.

plot(marzes_new)
plot(stations, col="red", add=T)

The map above does not look very nice. You already learned that ggplot() provides many options to customize figures. We can make use of all these options to create maps, too. However, to be able to plot the marzes in ggplot(), we first have to convert the coordinates of “marzes” to a table which we call “marzes_fort”, using the function fortify(). Each row in “marzes_fort” refers to a single vertex of the shapefile:

marzes_fort <- fortify(marzes_new)
## Regions defined for each Polygons
head(marzes_fort)
long lat order hole piece id group
45.72718 39.60272 1 FALSE 1 0 0.1
45.69634 39.60340 2 FALSE 1 0 0.1
45.69286 39.61411 3 FALSE 1 0 0.1
45.71225 39.64223 4 FALSE 1 0 0.1
45.71604 39.68062 5 FALSE 1 0 0.1
45.72947 39.70467 6 FALSE 1 0 0.1

By default, fortify() creates a column called “group” in “marzes_fort”. For now, we base the coloring of the marzes on this column. To plot the marzes, we add the function geom_polygon() to ggplot(). The function coord_equal() ensures that the maps does not get distorted:

ggplot(data=marzes_fort, aes(x=long, y=lat, group=group, fill=group)) +  
  geom_polygon(color="black", size=0.25) +
  coord_equal() 

Let’s make the coloration more meaningful and base if on potato yields. To do so, we have to first join the yield statistics to “marzes_new”, and then to “marzes_fort”. For the latter step, we have to create a new column “id” in “marzes_new” before.

## import yield data
yields <- read.csv(paste0(dir, "M01_Part01_Agricultural-Statistics/data/all_stats.csv"))

## edit and filter yield data
yields$region[yields$region=="Vayots dzor"] <- "Vayotz Dzor"
yields_potato <- filter(yields, region %in% c("00_ARMENIA", "Yerevan c.") == FALSE, crop == "Potato", indicator == "Yield")

## reformat table to obtain one column for each year
yields_potato_dcast <- dcast(yields_potato, region~year) 

## join yields to "marzes_new" 
marzes_new <- geo_join(marzes_new, yields_potato_dcast, "div_short", "region", how = 'left')
## Warning: We recommend using the dplyr::*_join() family of functions instead.
## create id to join "marzes_new" with "marzes_fort"
marzes_new@data$id <- c(0:10)

## join "marzes_new" with "marzes_fort"
marzes_fort <- merge(marzes_fort, marzes_new@data, by="id")

head(marzes_fort)
id long lat order hole piece group div_short region X2005 X2006 X2007 X2008 X2009 X2010 X2011 X2012 X2013 X2014 X2015 X2016 X2017 X2018 X2019 X2020
0 45.72718 39.60272 1 FALSE 1 0.1 Syunik Syunik 175.1 150.2 173.2 179.8 179.5 143.5 164.7 174.4 178.9 190 189.4 190.6 169 144.3 144.6 149.8
0 45.69634 39.60340 2 FALSE 1 0.1 Syunik Syunik 175.1 150.2 173.2 179.8 179.5 143.5 164.7 174.4 178.9 190 189.4 190.6 169 144.3 144.6 149.8
0 45.69286 39.61411 3 FALSE 1 0.1 Syunik Syunik 175.1 150.2 173.2 179.8 179.5 143.5 164.7 174.4 178.9 190 189.4 190.6 169 144.3 144.6 149.8
0 45.71225 39.64223 4 FALSE 1 0.1 Syunik Syunik 175.1 150.2 173.2 179.8 179.5 143.5 164.7 174.4 178.9 190 189.4 190.6 169 144.3 144.6 149.8
0 45.71604 39.68062 5 FALSE 1 0.1 Syunik Syunik 175.1 150.2 173.2 179.8 179.5 143.5 164.7 174.4 178.9 190 189.4 190.6 169 144.3 144.6 149.8
0 45.72947 39.70467 6 FALSE 1 0.1 Syunik Syunik 175.1 150.2 173.2 179.8 179.5 143.5 164.7 174.4 178.9 190 189.4 190.6 169 144.3 144.6 149.8

We can now create a ggplot with the coloration based on the yield data of a specific year:

ggplot(data=marzes_fort, aes(x=long, y=lat, group=group, fill=X2005)) +  
  geom_polygon(color="black", size=0.25) +
  coord_equal() +
  scale_fill_gradientn(colors=viridis(10))

Select the agrometeorological stations for which there are phenological observations of potato, and add them to the plot:

## convert station names from "data_melted" to uppercase characters to match the station names in the attribute table of "stations" 
stations_potato <- toupper(unique(data_melted$station))

## compare to "stations_potato" to station names in the shapefile "stations"
stations_potato
##  [1] "ODZUN"      "STEPANAVAN" "IJEVAN"     "SHORZA"     "MARTUNI"   
##  [6] "MASRIK"     "GORIS"      "TASHIR"     "AMASIA"     "VANADZOR"  
## [11] "SEVAN"      "GAVAR"
stations@data$Station_Na
##  [1] "ASHOTSK"           "TASHIR"            "BAGRATASHEN"      
##  [4] "ODZUN"             "AMASIA"            "GYUMRI"           
##  [7] "ARTIK"             "TSAHKAHOVIT"       "STEPANAVAN"       
## [10] "PUSHKINI CANYON"   "HANKAVAN"          "APARAN"           
## [13] "VANADZOR"          "DILIJAN"           "SEMYONOVKA"       
## [16] "IDJEVAN"           "SEVAN OZERO"       "TCHAMBARAK"       
## [19] "APARAN JRAMBAR"    "TALIN"             "ARAGATS H/M"      
## [22] "HAMBERD"           "Yerevan agro"      "ASHTARAK"         
## [25] "EGVARD"            "ARMAVIR"           "YEREVAN ZVARTNOTS"
## [28] "YEREVAN ARABKIR"   "YEREVAN AERO"      "FANTAN"           
## [31] "HRAZDAN"           "GAVAR"             "SHORJA"           
## [34] "MARTUNI"           "MASRIK"            "ARTASHAT"         
## [37] "URTSADZOR"         "YANIKH"            "ARARAT"           
## [40] "ARENI"             "ANANUN CANYON"     "VOROTANI CANYON"  
## [43] "JERMUK"            "SISIAN"            "KAJARAN"          
## [46] "GORIS"             "MEGHRI"            "KAPAN"
## -> Some stations are written differently! Let's correct that.
stations_potato[3] <- "IDJEVAN"
stations_potato[4] <- "SHORJA"
stations_potato[11] <- "SEVAN OZERO"

## create a new shapefile "stations_sel" that only contains those features that match "stations_potato"
stations_sel <- stations[stations@data$Station_Na %in% stations_potato,]
stations_sel@data
station_ID Station_Na Latitude Lati_new Longitude Longi_new elev1 elev2
2 37618 TASHIR 41 07 00N 41.11667 44 16 45E 44.27917 1507.034 1509.565
4 37627 ODZUN 41 03 37N 41.06028 44 36 37E 44.61028 1105.337 1110.351
5 37682 AMASIA 40 57 01N 40.95028 43 47 01E 43.78361 1876.039 1878.491
9 37693 STEPANAVAN 41 00 07N 41.00194 44 24 46E 44.41278 1397.230 1399.82
13 37704 VANADZOR 40 50 20N 40.83889 44 26 04E 44.43444 1375.925 1379.035
16 37711 IDJEVAN 40 52 18N 40.87167 45 08 50E 45.14722 732.378 732.8
17 37717 SEVAN OZERO 40 33 55N 40.56528 45 00 30E 45.00833 1917.237 1913.179
32 37801 GAVAR 40 20 55N 40.34861 45 07 48E 45.13000 1960.309 1961.526
33 37802 SHORJA 40 30 02N 40.50056 45 16 18E 45.27167 1917.225 1922.626
34 37808 MARTUNI 40 08 13N 40.13694 45 17 49E 45.29694 1943.254 1943.124
35 37815 MASRIK 40 12 27N 40.20750 45 45 52E 45.76444 1939.114 1940.214
46 37953 GORIS 39 31 05N 39.51806 46 20 18E 46.33833 1550.129 1553.623

We can now add the agrometeorological stations as red dots to the ggplot-map, supplying the attribute table of “stations_sel” as data source to the function geom_point() and indicating which columns contain the numeric coordinates. As we are working with two different datasets now (“marzes_fort” and “stations_sel”), we have to supply a separate aesthetics mapping to both geom_polygon() and geom_point(), and do not provide a global aesthetics mapping in the ggplot() function itself.

ggplot() +  
  geom_polygon(data=marzes_fort, aes(x=long, y=lat, group=group, fill=X2005), color="black", size=0.25) +
  geom_point(data=stations_sel@data, aes(x = Longi_new, y = Lati_new, group = NULL), col="red") +
  coord_equal() +
  scale_fill_viridis_c()   ## is the same as "scale_fill_gradientn(colors=viridis(10))"

Let’s use again facets to create several maps at once, one map for each year. To have a column that we can supply to facet_wrap(), we first have to melt “marzes_fort”:

marzes_fort_melt <- melt(marzes_fort, id.vars = c("id", "long", "lat", "order", "hole", "piece", "group", "div_short", "region"))
names(marzes_fort_melt)[10:11] <- c("year", "yield")
marzes_fort_melt$year <- substring(marzes_fort_melt$year, 2)

head(marzes_fort_melt)
id long lat order hole piece group div_short region year yield
0 45.72718 39.60272 1 FALSE 1 0.1 Syunik Syunik 2005 175.1
0 45.69634 39.60340 2 FALSE 1 0.1 Syunik Syunik 2005 175.1
0 45.69286 39.61411 3 FALSE 1 0.1 Syunik Syunik 2005 175.1
0 45.71225 39.64223 4 FALSE 1 0.1 Syunik Syunik 2005 175.1
0 45.71604 39.68062 5 FALSE 1 0.1 Syunik Syunik 2005 175.1
0 45.72947 39.70467 6 FALSE 1 0.1 Syunik Syunik 2005 175.1
ggplot() +  
  geom_polygon(data=marzes_fort_melt, aes(x=long, y=lat, group=group, fill=yield), color="black", size=0.25) +
  geom_point(data=stations_sel@data, aes(x = Longi_new, y = Lati_new, group = NULL), col="red") +
  coord_equal() +
  scale_fill_viridis_c() +
  facet_wrap(~year) +
  ggtitle("Potato yield by year") +
  xlab("") + 
  ylab("")

Congratulations, you made it to the end of the second script of module 1! You can now solve exercises 6-10 in “M01_Exercises_Assignment.Rmd”.