Welcome to the second script of Module 1!
At the example of phenological observations, you will here learn and further practice how to:

import and edit data
plot data in diagrams
plot data in maps

Please read the following code instructions carefully, try to understand the code that follows each instruction, execute it and see what happens. Do not hesitate to change the code or write your own code, and execute it as well!

1. Import and edit data

We start off again with defining our working directory. This time, we load all packages at the beginning of our script:

dir <- "C:/Users/Max/Desktop/IAMO_neu/eLearning/Module_1_Data_visualization/"

library(reshape2)
library(dplyr)
library(ggplot2)
library(viridis)
library(pals)
library(rgdal)
library(tigris)

We will work with the phenological observations of potato. The original dates when a phenological phase was observed were previously converted to numbers from 1 to 366, referring to the day of the year:

data <- read.csv(paste0(dir, "M01_Part02_Phenological-Observations/data/phenology_by_crop/potato.csv"), sep=";")
head(data)

year	station	crop	variety	X01_Seeding	X02_Sprouts	X03_Lateral_shoots_formation	X04_Inflorescences_formation	X05_Blossoming	X06_Cease_blossoming	X07_Withering_of_aboveground_part	X08_Harvest
1995	Odzun	potato	Lorkh	105	134	146	167	211	216	243	250
1996	Odzun	potato	Lorkh	133	152	162	172	180	202	244	260
1997	Odzun	potato	Lorkh	114	138	148	169	177	201	222	260
1998	Odzun	potato	Lorkh	97	136	146	161	169	185	212	251
1999	Odzun	potato	Lorkh	105	140	146	167	179	201	243	250
2000	Odzun	potato	Lorkh	105	141	149	168	180	192	233	248

Let’s have a look at the different phenological phases, years, stations and varieties contained in our dataset:

names(data)[5:12]             ## There are observations of 8 different phenological phases...

## [1] "X01_Seeding"                       "X02_Sprouts"                      
## [3] "X03_Lateral_shoots_formation"      "X04_Inflorescences_formation"     
## [5] "X05_Blossoming"                    "X06_Cease_blossoming"             
## [7] "X07_Withering_of_aboveground_part" "X08_Harvest"

sort(unique(data$year))       ## ...from the years 1995 to 2020

##  [1] 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
## [16] 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020

unique(data$station)

##  [1] "Odzun"      "Stepanavan" "Ijevan"     "Shorza"     "Martuni"   
##  [6] "Masrik"     "Goris"      "Tashir"     "Amasia"     "Vanadzor"  
## [11] "Sevan"      "Gavar"

length(unique(data$station))  ## ...from 12 different agrometeorological stations

## [1] 12

unique(data$variety)          ## ...of 12 different varieties of potato.

##  [1] "Lorkh"    "Impala"   "Victoria" "Ogonyok"  "Monalisa" "Pamir"   
##  [7] "Cosmos"   "Holland"  "Marfona"  "Diamant"  "Diana"    "German "

There are many layers in our data, and therefore many possible ways in which to visualize it. But before we visualize it, let’s clean and transform the data:

Our dataset contains NA values. We identify all rows that do not contain any NA values with the function complete.cases(), and only keep these rows:

data <- data[complete.cases(data),]

We transform our dataset with melt() to shorten the amount of columns:

data_melted <- melt(data, id.vars = c("year", "station", "crop", "variety"))
names(data_melted)[5:6] <- c("phase", "DOY")
head(data_melted)

year	station	crop	variety	phase	DOY
1995	Odzun	potato	Lorkh	X01_Seeding	105
1996	Odzun	potato	Lorkh	X01_Seeding	133
1997	Odzun	potato	Lorkh	X01_Seeding	114
1998	Odzun	potato	Lorkh	X01_Seeding	97
1999	Odzun	potato	Lorkh	X01_Seeding	105
2000	Odzun	potato	Lorkh	X01_Seeding	105

To remove the leading “X” in the column “phase”, we can do the following:

data_melted$phase <- substring(data_melted$phase, 2)
head(data_melted)

year	station	crop	variety	phase	DOY
1995	Odzun	potato	Lorkh	01_Seeding	105
1996	Odzun	potato	Lorkh	01_Seeding	133
1997	Odzun	potato	Lorkh	01_Seeding	114
1998	Odzun	potato	Lorkh	01_Seeding	97
1999	Odzun	potato	Lorkh	01_Seeding	105
2000	Odzun	potato	Lorkh	01_Seeding	105

To be able to group our data by “year” in the following plots, we have to transform the column “year” to a factor variable:

data_melted$year <- as.factor(data_melted$year)

We can use functions from the package dplyr to calculate some statistics across groups defined by certain parameter combinations. For example, we can calculate the mean phase date for each year in the following way:

mean_yearly <- data_melted %>% group_by(year, phase) %>% summarize(mean_DOY=mean(DOY)) %>% as.data.frame()

## `summarise()` has grouped output by 'year'. You can override using the `.groups` argument.

Or, calculate the latest date of each phase for each existing combination of variety and station:

max_station_variety <- data_melted %>% group_by(station, variety, phase) %>% summarize(max_DOY=max(DOY)) %>% as.data.frame()

## `summarise()` has grouped output by 'station', 'variety'. You can override using the `.groups` argument.

2. Plot data in diagrams

The phenological data consists of many repeated observations of each phenological phase, because these phases were observed in different years, and at different agrometeorological stations. Let’s plot the DOY of all observations of “01_Seeding” as points in one plot, using the function geom_jitter(). We can control the color, size and transparency (argument “alpha”) of the points inside geom_jitter():

seeding <- filter(data_melted, phase == "01_Seeding")

ggplot(data=seeding, aes(x=phase, y=DOY)) +
  geom_jitter(color="darkblue", size=2, alpha=0.2)

Since we deal with repeated observations, we should visually summarize them accordingly to make them more easily understandable. Box plots are ideal for that:

ggplot(data=seeding, aes(x=phase, y=DOY)) +
  geom_jitter(color="darkblue", size=2, alpha=0.2) +
  geom_boxplot(alpha=0.7)

The seeding DOY ranges from 83 to 161. We know that the data in “seeding” contains observations from different years, and different stations. Maybe the great variation in seeding DOYs stems from the fact that in some years or some stations, seeding starts consistently earlier or later? Let us find out by ‘splitting up’ the data and create two new figures, with separate box plots for each year, and each station, respectively. To do so, we have to add “year” or “station” both to the arguments “x” and “fill” in the aes() function inside ggplot(), and to the argument “color” in the aes() function inside geom_jitter():

ggplot(data=seeding, aes(x=year, y=DOY, fill=year)) +
  geom_jitter(aes(color=year), size=2, alpha=0.2) +
  geom_boxplot(alpha=0.7) +
  theme(axis.text.x=element_text(angle=45, hjust=1))

ggplot(data=seeding, aes(x=station, y=DOY, fill=station)) +
  geom_jitter(aes(color=station), size=2, alpha=0.2) +
  geom_boxplot(alpha=0.7) +
  theme(axis.text.x=element_text(angle=45, hjust=1))

It seems there are considerable differences between stations! However, let us ignore these differences for a moment, and create a figure with one boxplot for each phenological phase. Let us use some colors from the palette “kovesi.rainbow” from the package pals, and let us get all text elements of the plot right:

plot1 <- ggplot(data=data_melted, aes(x=phase, y=DOY, fill=phase)) +
           geom_jitter(aes(color=phase), size=2, alpha=0.4) + 
           geom_boxplot(alpha=0.7) +
           scale_fill_manual(values = as.vector(kovesi.rainbow(8))) +
           scale_colour_manual(values = as.vector(kovesi.rainbow(8))) +
           theme(axis.text.x=element_text(angle=45, hjust=1), legend.position="none") +
           ggtitle("Phenological stages of potato (all stations, all varieties)") + 
           xlab("") + 
           ylab("Day of the year (DOY)")

plot1

The labels on the x-axis are quite long. We can shorten them by supplying a vector of pre-defined labels in the following way:

xlabels <- c("Seeding", "Sprouts", "Lateral shoots", "Inflorescences", "Blossoming", "Cease blossoming", "Withering", "Harvest")

plot1 + scale_x_discrete(labels=xlabels)

To explore the differences between stations for all phases separately, we can use again the function facet_wrap(). Note we should change the colors to avoid confusion, and we have to adjust the number of colours supplied to scale_fill_manual() and scale_colour_manual() to 12, according to the amount of existing stations:

ggplot(data=data_melted, aes(x=station, y=DOY, fill=station)) +
           geom_jitter(aes(color=station), size=2, alpha=0.4) + 
           geom_boxplot(alpha=0.7) +
           scale_fill_manual(values = as.vector(cubicl(12))) +
           scale_colour_manual(values = as.vector(cubicl(12))) +
           theme(axis.text.x=element_text(angle=45, hjust=1), legend.position="none") +
           ggtitle("Phenological stages of potato (by station, all varieties)") + 
           xlab("") + 
           ylab("Day of the year (DOY)") +
           facet_wrap(~phase, scales="free")

Until now, we have disregarded the temporal component in the data, i.e. the column “year”. However, it might be that there are some temporal patterns in the phenological observations. Let’s find out! For this, we use geom_point() to plot points, and these points should have “year” as x and “DOY” as y coordinate. We define each phenological phase as a separate group with a separate color, and we add a linear regression fit with the function geom_smooth(). We also customize our x axis labels by providing a vector of breaks in scale_x_discrete():

ggplot(data=data_melted, aes(x=year, y=DOY, color = phase, group = phase)) +
    geom_point(alpha=0.4, size=2) + 
    geom_smooth(method=lm, se=FALSE, size=1.3) +
    scale_color_manual(values = as.vector(kovesi.rainbow(8))) +
    theme(axis.text.x=element_text(angle=45, hjust=1)) +
    ggtitle("Potato, all stations, all varieties") +
    scale_x_discrete(breaks = c(1995,2000,2005,2010,2015,2020)) + 
    xlab("") + 
    ylab("DOY")

## `geom_smooth()` using formula 'y ~ x'

To explore the temporal trends for each station separately, we can group and color by “station”, and create a separate facet for each phase:

ggplot(data=data_melted, aes(x=year, y=DOY, color = station, group = station)) +
    geom_point(alpha=0.4, size=1.5) + 
    geom_smooth(method=lm, se=FALSE, size=1.3) +
    scale_color_manual(values = as.vector(cols25(12))) +
    theme(axis.text.x=element_text(angle=45, hjust=1)) +
    ggtitle("Potato, all varieties") +
    scale_x_discrete(breaks = c(1995,2000,2005,2010,2015,2020)) + 
    xlab("") + 
    ylab("DOY") +
    facet_wrap(~phase, scales="free_y")

## `geom_smooth()` using formula 'y ~ x'

3. Plot data in maps

The goal of this last part is to plot the location of the agrometeorological stations onto a country map, using ggplot(). To complement this map, we will also use again the yield data that we have already worked on in the first script of module 1.

First, we import the two shapefiles needed for our map:

marzes   <- readOGR(paste0(dir, "/shapefiles/ARM_marzes_short_simplified_500.shp"), verbose=F)
stations <- readOGR(paste0(dir, "/shapefiles/stations.shp"), verbose=F)

The two shapefiles do not have the same coordinate system. We therefore re-project “marzes” to match the coordinate system of “stations”, using the function spTransform(). You can ignore the message ## Warning in proj4string(marzes): CRS object has comment, which is lost in output:

proj4string(marzes)

## Warning in proj4string(marzes): CRS object has comment, which is lost in output

## [1] "+proj=utm +zone=38 +datum=WGS84 +units=m +no_defs"

proj4string(stations)

## Warning in proj4string(stations): CRS object has comment, which is lost in
## output

## [1] "+proj=longlat +datum=WGS84 +no_defs"

marzes_new <- spTransform(marzes, proj4string(stations))

## Warning in proj4string(stations): CRS object has comment, which is lost in
## output

proj4string(marzes)

## Warning in proj4string(marzes): CRS object has comment, which is lost in output

## [1] "+proj=utm +zone=38 +datum=WGS84 +units=m +no_defs"

proj4string(stations)

## Warning in proj4string(stations): CRS object has comment, which is lost in
## output

## [1] "+proj=longlat +datum=WGS84 +no_defs"

We can now quickly plot the stations on top of the marzes with the plot() function.

plot(marzes_new)
plot(stations, col="red", add=T)

The map above does not look very nice. You already learned that ggplot() provides many options to customize figures. We can make use of all these options to create maps, too. However, to be able to plot the marzes in ggplot(), we first have to convert the coordinates of “marzes” to a table which we call “marzes_fort”, using the function fortify(). Each row in “marzes_fort” refers to a single vertex of the shapefile:

marzes_fort <- fortify(marzes_new)

## Regions defined for each Polygons

head(marzes_fort)

long	lat	order	hole	piece	group
45.72718	39.60272	1	FALSE	1	0.1
45.69634	39.60340	2	FALSE	1	0.1
45.69286	39.61411	3	FALSE	1	0.1
45.71225	39.64223	4	FALSE	1	0.1
45.71604	39.68062	5	FALSE	1	0.1
45.72947	39.70467	6	FALSE	1	0.1

By default, fortify() creates a column called “group” in “marzes_fort”. For now, we base the coloring of the marzes on this column. To plot the marzes, we add the function geom_polygon() to ggplot(). The function coord_equal() ensures that the maps does not get distorted:

ggplot(data=marzes_fort, aes(x=long, y=lat, group=group, fill=group)) +  
  geom_polygon(color="black", size=0.25) +
  coord_equal()

Let’s make the coloration more meaningful and base if on potato yields. To do so, we have to first join the yield statistics to “marzes_new”, and then to “marzes_fort”. For the latter step, we have to create a new column “id” in “marzes_new” before.

## import yield data
yields <- read.csv(paste0(dir, "M01_Part01_Agricultural-Statistics/data/all_stats.csv"))

## edit and filter yield data
yields$region[yields$region=="Vayots dzor"] <- "Vayotz Dzor"
yields_potato <- filter(yields, region %in% c("00_ARMENIA", "Yerevan c.") == FALSE, crop == "Potato", indicator == "Yield")

## reformat table to obtain one column for each year
yields_potato_dcast <- dcast(yields_potato, region~year) 

## join yields to "marzes_new" 
marzes_new <- geo_join(marzes_new, yields_potato_dcast, "div_short", "region", how = 'left')

## Warning: We recommend using the dplyr::*_join() family of functions instead.

## create id to join "marzes_new" with "marzes_fort"
marzes_new@data$id <- c(0:10)

## join "marzes_new" with "marzes_fort"
marzes_fort <- merge(marzes_fort, marzes_new@data, by="id")

head(marzes_fort)

long	lat	order	hole	piece	group	div_short	region	X2005	X2006	X2007	X2008	X2009	X2010	X2011	X2012	X2013	X2014	X2015	X2016	X2017	X2018	X2019	X2020
45.72718	39.60272	1	FALSE	1	0.1	Syunik	Syunik	175.1	150.2	173.2	179.8	179.5	143.5	164.7	174.4	178.9	190	189.4	190.6	169	144.3	144.6	149.8
45.69634	39.60340	2	FALSE	1	0.1	Syunik	Syunik	175.1	150.2	173.2	179.8	179.5	143.5	164.7	174.4	178.9	190	189.4	190.6	169	144.3	144.6	149.8
45.69286	39.61411	3	FALSE	1	0.1	Syunik	Syunik	175.1	150.2	173.2	179.8	179.5	143.5	164.7	174.4	178.9	190	189.4	190.6	169	144.3	144.6	149.8
45.71225	39.64223	4	FALSE	1	0.1	Syunik	Syunik	175.1	150.2	173.2	179.8	179.5	143.5	164.7	174.4	178.9	190	189.4	190.6	169	144.3	144.6	149.8
45.71604	39.68062	5	FALSE	1	0.1	Syunik	Syunik	175.1	150.2	173.2	179.8	179.5	143.5	164.7	174.4	178.9	190	189.4	190.6	169	144.3	144.6	149.8
45.72947	39.70467	6	FALSE	1	0.1	Syunik	Syunik	175.1	150.2	173.2	179.8	179.5	143.5	164.7	174.4	178.9	190	189.4	190.6	169	144.3	144.6	149.8

We can now create a ggplot with the coloration based on the yield data of a specific year:

ggplot(data=marzes_fort, aes(x=long, y=lat, group=group, fill=X2005)) +  
  geom_polygon(color="black", size=0.25) +
  coord_equal() +
  scale_fill_gradientn(colors=viridis(10))

Select the agrometeorological stations for which there are phenological observations of potato, and add them to the plot:

## convert station names from "data_melted" to uppercase characters to match the station names in the attribute table of "stations" 
stations_potato <- toupper(unique(data_melted$station))

## compare to "stations_potato" to station names in the shapefile "stations"
stations_potato

##  [1] "ODZUN"      "STEPANAVAN" "IJEVAN"     "SHORZA"     "MARTUNI"   
##  [6] "MASRIK"     "GORIS"      "TASHIR"     "AMASIA"     "VANADZOR"  
## [11] "SEVAN"      "GAVAR"

stations@data$Station_Na

##  [1] "ASHOTSK"           "TASHIR"            "BAGRATASHEN"      
##  [4] "ODZUN"             "AMASIA"            "GYUMRI"           
##  [7] "ARTIK"             "TSAHKAHOVIT"       "STEPANAVAN"       
## [10] "PUSHKINI CANYON"   "HANKAVAN"          "APARAN"           
## [13] "VANADZOR"          "DILIJAN"           "SEMYONOVKA"       
## [16] "IDJEVAN"           "SEVAN OZERO"       "TCHAMBARAK"       
## [19] "APARAN JRAMBAR"    "TALIN"             "ARAGATS H/M"      
## [22] "HAMBERD"           "Yerevan agro"      "ASHTARAK"         
## [25] "EGVARD"            "ARMAVIR"           "YEREVAN ZVARTNOTS"
## [28] "YEREVAN ARABKIR"   "YEREVAN AERO"      "FANTAN"           
## [31] "HRAZDAN"           "GAVAR"             "SHORJA"           
## [34] "MARTUNI"           "MASRIK"            "ARTASHAT"         
## [37] "URTSADZOR"         "YANIKH"            "ARARAT"           
## [40] "ARENI"             "ANANUN CANYON"     "VOROTANI CANYON"  
## [43] "JERMUK"            "SISIAN"            "KAJARAN"          
## [46] "GORIS"             "MEGHRI"            "KAPAN"

## -> Some stations are written differently! Let's correct that.
stations_potato[3] <- "IDJEVAN"
stations_potato[4] <- "SHORJA"
stations_potato[11] <- "SEVAN OZERO"

## create a new shapefile "stations_sel" that only contains those features that match "stations_potato"
stations_sel <- stations[stations@data$Station_Na %in% stations_potato,]
stations_sel@data

	station_ID	Station_Na	Latitude	Lati_new	Longitude	Longi_new	elev1	elev2
2	37618	TASHIR	41 07 00N	41.11667	44 16 45E	44.27917	1507.034	1509.565
4	37627	ODZUN	41 03 37N	41.06028	44 36 37E	44.61028	1105.337	1110.351
5	37682	AMASIA	40 57 01N	40.95028	43 47 01E	43.78361	1876.039	1878.491
9	37693	STEPANAVAN	41 00 07N	41.00194	44 24 46E	44.41278	1397.230	1399.82
13	37704	VANADZOR	40 50 20N	40.83889	44 26 04E	44.43444	1375.925	1379.035
16	37711	IDJEVAN	40 52 18N	40.87167	45 08 50E	45.14722	732.378	732.8
17	37717	SEVAN OZERO	40 33 55N	40.56528	45 00 30E	45.00833	1917.237	1913.179
32	37801	GAVAR	40 20 55N	40.34861	45 07 48E	45.13000	1960.309	1961.526
33	37802	SHORJA	40 30 02N	40.50056	45 16 18E	45.27167	1917.225	1922.626
34	37808	MARTUNI	40 08 13N	40.13694	45 17 49E	45.29694	1943.254	1943.124
35	37815	MASRIK	40 12 27N	40.20750	45 45 52E	45.76444	1939.114	1940.214
46	37953	GORIS	39 31 05N	39.51806	46 20 18E	46.33833	1550.129	1553.623

We can now add the agrometeorological stations as red dots to the ggplot-map, supplying the attribute table of “stations_sel” as data source to the function geom_point() and indicating which columns contain the numeric coordinates. As we are working with two different datasets now (“marzes_fort” and “stations_sel”), we have to supply a separate aesthetics mapping to both geom_polygon() and geom_point(), and do not provide a global aesthetics mapping in the ggplot() function itself.

ggplot() +  
  geom_polygon(data=marzes_fort, aes(x=long, y=lat, group=group, fill=X2005), color="black", size=0.25) +
  geom_point(data=stations_sel@data, aes(x = Longi_new, y = Lati_new, group = NULL), col="red") +
  coord_equal() +
  scale_fill_viridis_c()   ## is the same as "scale_fill_gradientn(colors=viridis(10))"

Let’s use again facets to create several maps at once, one map for each year. To have a column that we can supply to facet_wrap(), we first have to melt “marzes_fort”:

marzes_fort_melt <- melt(marzes_fort, id.vars = c("id", "long", "lat", "order", "hole", "piece", "group", "div_short", "region"))
names(marzes_fort_melt)[10:11] <- c("year", "yield")
marzes_fort_melt$year <- substring(marzes_fort_melt$year, 2)

head(marzes_fort_melt)

long	lat	order	hole	piece	group	div_short	region	year	yield
45.72718	39.60272	1	FALSE	1	0.1	Syunik	Syunik	2005	175.1
45.69634	39.60340	2	FALSE	1	0.1	Syunik	Syunik	2005	175.1
45.69286	39.61411	3	FALSE	1	0.1	Syunik	Syunik	2005	175.1
45.71225	39.64223	4	FALSE	1	0.1	Syunik	Syunik	2005	175.1
45.71604	39.68062	5	FALSE	1	0.1	Syunik	Syunik	2005	175.1
45.72947	39.70467	6	FALSE	1	0.1	Syunik	Syunik	2005	175.1

ggplot() +  
  geom_polygon(data=marzes_fort_melt, aes(x=long, y=lat, group=group, fill=yield), color="black", size=0.25) +
  geom_point(data=stations_sel@data, aes(x = Longi_new, y = Lati_new, group = NULL), col="red") +
  coord_equal() +
  scale_fill_viridis_c() +
  facet_wrap(~year) +
  ggtitle("Potato yield by year") +
  xlab("") + 
  ylab("")

Congratulations, you made it to the end of the second script of module 1! You can now solve exercises 6-10 in “M01_Exercises_Assignment.Rmd”.

Module 1: Data visualization
Script 02: Phenological Observations

Max Hofmann

1. Import and edit data

2. Plot data in diagrams

3. Plot data in maps

Module 1: Data visualization Script 02: Phenological Observations

Max Hofmann

1. Import and edit data

2. Plot data in diagrams

3. Plot data in maps

Module 1: Data visualization
Script 02: Phenological Observations