Portfolio 1

Setup

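# library() stops with an error if a package is missing; require() only warns
library(tidyverse)  # also attaches ggplot2, dplyr, tidyr and readr
library(ggrepel)
library(ggridges)
library(patchwork)
library(zoo)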
theme_set(theme_minimal())

Data Description

The dataset I chose is a 650-year record of grape harvest dates (GHD) from 27 regions in Western Europe. I found it while searching for climate datasets on Data is Plural. It comes from Daux et al. (2012), who analyse its patterns to examine the impact of climate change. I had never seen such a “historical” dataset before and find it very interesting: a record this long is rare and valuable. Also, the average global temperature has increased steadily over the past hundred years, and I wonder whether a related trend shows up in the GHD series.

The data consist of two parts:

  • The main data frame: 650 years x 27 regions. Each row is a year and each column is a region. The values are the regional mean number of days from September 1st to the grape harvest date, so a negative value means a harvest before September 1st.
  • The location data frame: a sheet giving the latitude and longitude of the 27 regions.

The raw data contain some unnecessary metadata and headers, so I removed them manually and uploaded the two parts as two CSV files to Box.

ghd = read_csv("https://uwmadison.box.com/shared/static/phoz9eco2dpfk5inpquipibwsm7i00em.csv", show_col_types = FALSE)
tail(ghd, 3)
## # A tibble: 3 x 28
##    year Alsace Auvergne `Auxerre-Avalon` `Beaujolais and Maco~ Bordeaux Burgundy
##   <dbl>  <dbl>    <dbl>            <dbl>                 <dbl>    <dbl>    <dbl>
## 1  2005     29       NA               NA                    NA     -4.9     13.5
## 2  2006     NA       NA               NA                    NA     16.1     16  
## 3  2007     NA       NA               NA                    NA     NA       NA  
## # ... with 21 more variables: Champagne 1 <dbl>, Champagne 2 <dbl>,
## #   Gaillac- South-West <dbl>, Germany <dbl>, High Loire Valley <dbl>,
## #   Ile de France <dbl>, Jura <dbl>, Languedoc <dbl>, Low Loire Valley <dbl>,
## #   Luxembourg <dbl>, Maritime alps <dbl>, Northern Italy <dbl>,
## #   Northern  Lorraine <dbl>, Northern Rhone valley <dbl>, Savoie <dbl>,
## #   Spain <dbl>, Southern Lorraine <dbl>, Southern Rhone valley <dbl>,
## #   Switzerland (Leman Lake) <dbl>, Various South-East <dbl>, ...
location = read_csv("https://uwmadison.box.com/shared/static/y1ln5tqxar0ex4finhsn2d3gc3o7x031.csv", show_col_types = FALSE)
head(location, 3)
## # A tibble: 3 x 3
##   Location       Latitude Longitude
##   <chr>             <dbl>     <dbl>
## 1 Alsace             48.2      7.28
## 2 Auvergne           45.6      3.17
## 3 Auxerre-Avalon     47.8      3.57

Data Preparation

  • The time series of dates is very noisy, so to see the trend I need a moving average, computed with zoo::rollapply.
  • The main data frame is not tidy, so I use pivot_longer to tidy it.
  • It is reasonable to assume that higher-latitude regions are colder, which may affect when the grapes ripen. To explore the influence of latitude, I rearrange the locations by latitude and turn Location into a factor.
  • There are many missing values in the data set. Because mean is called without na.rm = TRUE, the moving average is NA for any 20-year window that contains a missing year.
loc_order = location %>% 
  drop_na() %>%
  arrange(desc(Latitude)) %>% 
  pull(Location)

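# 20-year moving average per region (default centre alignment);
# a window that contains any NA gives NA
ghd_roll = ghd %>% 
  mutate(across(!year, ~ rollapply(.x, width = 20, FUN = mean, fill = NA))) %>%
  pivot_longer(!year, names_to="Location", values_to = "Date") %>%
  mutate(Location = factor(Location, loc_order)) %>%
  drop_na(Location)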
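
The effect of missing values on the moving average can be seen on a toy vector. This is only a minimal sketch: the numbers are made up, the window is 3 instead of 20, and the na.rm = TRUE variant is an alternative I am assuming could be useful, not what the code above does.

```r
library(zoo)

# Made-up series with one gap (not taken from the GHD data)
x <- c(1, 2, NA, 4, 5, 6, 7)

# Same call pattern as above with a window of 3: any window that
# contains an NA yields NA, and fill = NA pads the two ends
strict <- rollapply(x, width = 3, FUN = mean, fill = NA)

# Alternative: na.rm = TRUE is passed on to mean(), so each window is
# averaged over whatever values it does contain
lenient <- rollapply(x, width = 3, FUN = mean, na.rm = TRUE, fill = NA)

print(strict)   # values: NA NA NA NA 5 6 NA
print(lenient)  # values: NA 1.5 3 4.5 5 6 NA
```

With na.rm = TRUE the line would cover more years, but a window with only one or two observations would then count as much as a full one, so the choice is a trade-off.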

ghd_pivot = pivot_longer(ghd, !year, names_to="Location", values_to = "Date") %>% 
  mutate(Location = factor(Location, loc_order)) %>%
  drop_na(Location)
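
The reshaping and ordering steps above can be traced on a toy table. This is a sketch with made-up values; the names wide, toy_order and long and the three region columns are hypothetical and only serve the illustration.

```r
library(dplyr)
library(tidyr)

# Made-up wide table: one row per year, one column per hypothetical region
wide <- tibble(
  year  = c(2000, 2001),
  South = c(20, NA),
  North = c(30, 28),
  Other = c(25, 26)   # region with no entry in the location table
)

# Hypothetical north-to-south order, standing in for loc_order
toy_order <- c("North", "South")

long <- wide %>%
  # one row per (year, region) pair
  pivot_longer(!year, names_to = "Location", values_to = "Date") %>%
  # names outside toy_order become NA in the factor ...
  mutate(Location = factor(Location, toy_order)) %>%
  # ... and are dropped here; missing dates are kept
  drop_na(Location)

print(long)
print(levels(long$Location))
```

The factor levels, not the column order of the wide table, decide the order in which the regions later appear in facets and on axes.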

Plot 1: Time series of GHD

The first plot shows the time series. The dates change erratically from year to year, so I use a scatter plot for the raw data. The moving average is drawn as a red line in the same panels.

ghd_pivot %>%
  ggplot(aes(year, Date)) +
    geom_line(data = ghd_roll, colour="red", size=2) +  
    geom_point(colour="grey", alpha=0.3) +
    facet_wrap("Location", ncol = 4)

Plot 2: Box plots of GHD by region

To better understand the differences between regions, I made box plots for all the regions, ordered by latitude.

ghd_pivot %>%
  # filter(year > 1800) %>%
  ggplot() + 
  geom_boxplot(aes(Location, Date)) +
  coord_flip()

Conclusion

I did not expect the dates to fluctuate so drastically. In fact, many records before 1800 look like random noise. However, after taking the moving average, many regions show a general downward trend, which means that harvest dates have gradually become earlier over the last century. This may be because a warmer climate lets the grapes ripen earlier, or it may be due to something else. Given the limited records and the other variables involved, we cannot determine the exact reason.
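
One possible way to put a number on such a downward trend is the slope of a straight-line fit of date against year. The sketch below uses made-up values for a single hypothetical region; it is not a result from the GHD data, and a linear fit is only an assumption about the shape of the trend.

```r
# Made-up harvest dates for one hypothetical region (days after September 1st)
toy <- data.frame(
  year = 1950:1959,
  Date = c(30, 28, 31, 27, 26, 27, 24, 25, 22, 21)
)

# Ordinary least squares: Date = intercept + slope * year
fit <- lm(Date ~ year, data = toy)
slope <- unname(coef(fit)["year"])

# A negative slope means the harvest moves earlier, in days per year
print(slope)
```

On the real data the same call could be run per region on a recent subset of ghd_pivot; lm() drops rows with a missing Date by default.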

Another finding is that latitude is not strongly related to the harvest date. Although there is a slight tendency towards later GHD in high-latitude regions, regions around the Mediterranean Sea have even later GHD.
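
A rank correlation between latitude and the median date per region would be one way to check this impression. The sketch below runs on made-up values for three hypothetical regions; toy_dates and toy_loc stand in for ghd_pivot and location, and the number it prints says nothing about the real data.

```r
library(dplyr)

# Made-up long-format dates and coordinates for three hypothetical regions
toy_dates <- tibble(
  Location = rep(c("North", "Middle", "South"), each = 3),
  Date     = c(35, 38, NA, 25, 27, 29, 33, 36, 34)
)
toy_loc <- tibble(
  Location = c("North", "Middle", "South"),
  Latitude = c(49, 46, 43)
)

# Median date per region, ignoring missing years, joined to latitude
by_region <- toy_dates %>%
  group_by(Location) %>%
  summarise(median_date = median(Date, na.rm = TRUE)) %>%
  inner_join(toy_loc, by = "Location")

# Spearman rank correlation: a value near +1 would mean
# "further north, later harvest"
rho <- cor(by_region$Latitude, by_region$median_date, method = "spearman")
print(rho)
```

A rank correlation is used here because it only asks whether the ordering of the regions agrees, which matches the claim being made; it does not assume a straight-line relationship.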

Reference

Daux, Valérie, et al. “An open-access database of grape harvest dates for climate research: data description and quality assessment.” Climate of the Past 8.5 (2012): 1403-1418.