Project 2 - Data 110

Twentieth Century Crop Statistics Around the World

The study of Outer Space is not limited to what individuals see looking out into the sky, but also what is here on the land we stand on. What goes on in space can help determine what food we grow. For years, outer space has been used to monitor the atmosphere and weather anomalies on Earth. Food, crops, our farms are all affected heavily by the weather and it is crucial for the health and well-being of our people to have a steady and sustained supply of crops. The dataset presented today is provided by NASA through the SocioEconomic Data and Applications Center (SEDAC) and hosted by Columbia University’s Center for International Earth Science Information Network. We are looking through this dataset to see the trends of global crop growth (or decay) throughout the twentieth century up until 2017. Our interest is to find similarities between Space technology growth within a country and its potential crop yield. Though of course, correlation doesn’t not imply causation, though this view, we may find insights to better understand the working relationship of farming and its reliance on Space technology advancements.

Meteorologists, the popular person you watch on tv in the morning before going out for work, study and forecast weather and its anomalies. But the complexities of Mother Nature cannot be handled through simple observations done through the human eyes. We need tools. We need technology. A common set of tools used to study hydrology is the use of satellites. They can help find rumblings within the atmosphere through physics-based hydrometeorological systems using Multi-sensor Precipitation Estimates (MPE) housed on these satellites. Researchers would be able to take the data gathered from the sensors and create modeling algorithms to predict potentials of rainfall and the appropriate runoff paths (I. Yucel). This information and foresight allows farmers to be able to prepare for what is to come, whether it be to dry off to prevent deluging their crops, or to open up additional spillways to handle lighter rain showers. As our sensor technologies advance, so to do our algorithms and the predictability for major weather events. We take a look at this data to see available trends throughout the years as advancements are made in during the Information Age and beyond.

Definition of Labels

  1. Column A
  • index starting at 0
  1. Harvest_year
  • The year the actual harvest took place
  1. admin0
  • Country
  1. admin1
  • state/region/providence
  1. crop
  • The specific crop recorded
  1. hectares (ha)
  • The amount of area, at 10,000 square meters
  1. production (tonnes)
  • The total amount of production in that given year measured in weight (1000 kg = 1 tonne)
  1. year
  • This is in regards to the year of recorded value
  1. yield(tonnes/ha)
  • Total yield from the specific year and location
  1. admin2
  • Blank
  1. notes
  • This is used for clarification, for example, if the record took place between multiple regions, we will clarify the multiple regions in this field.

Changes to the Dataset

The dataset has a variety of tabs to help split out the information. Our focus will be on a single tab called “CropStats”. This tab has the raw data that each of the additional tabs presents. For purposes of keeping the file size down, we will remove all accessory tabs other than the “Title Page”, “Wheat_Data_Sources”, and “Maize_Data_Sources” tab for purposes of source reference.

Load the Data

Packages <- c("tidyverse","dplyr","readxl","plotly","ggthemes","treemapify","RColorBrewer")
invisible(lapply(Packages, library, character.only = TRUE))
setwd("~/Data Study/Data 110 MC/Project2")
df <- read_excel("food-twentieth-century-crop-statistics-1900-2017-xlsx.xlsx",sheet = "CropStats")

EDA

str(df)
## tibble [36,707 × 11] (S3: tbl_df/tbl/data.frame)
##  $ ...1               : num [1:36707] 0 1 2 3 4 5 6 7 8 9 ...
##  $ Harvest_year       : num [1:36707] 1902 1903 1904 1905 1906 ...
##  $ admin0             : chr [1:36707] "Austria" "Austria" "Austria" "Austria" ...
##  $ admin1             : chr [1:36707] NA NA NA NA ...
##  $ crop               : chr [1:36707] "wheat" "wheat" "wheat" "wheat" ...
##  $ hectares (ha)      : num [1:36707] NA NA NA NA NA NA NA NA NA NA ...
##  $ production (tonnes): num [1:36707] NA NA NA NA NA NA NA NA NA NA ...
##  $ year               : num [1:36707] 1902 1903 1904 1905 1906 ...
##  $ yield(tonnes/ha)   : num [1:36707] 1.31 1.47 1.27 1.33 1.28 1.37 1.36 1.35 1.18 1.37 ...
##  $ admin2             : logi [1:36707] NA NA NA NA NA NA ...
##  $ notes              : logi [1:36707] NA NA NA NA NA NA ...

We find that the first column is an index, specifically starting with 0, we will exclude this from our data as it is not needed. We also find that the notes and admin2 is set to logical, but our initial guess would state that they should be char. However, for this project, we are not reviewing text, and we know that admin2 is blank. Thus, these two columns will also be removed.

colSums(is.na(df))
##                ...1        Harvest_year              admin0              admin1 
##                   0                   0                   0                2934 
##                crop       hectares (ha) production (tonnes)                year 
##                   0                1623                1998                   0 
##    yield(tonnes/ha)              admin2               notes 
##                2013               36707               36707

To aide us in our view, we will remove a few columns as noted earlier, and rename some of the columns for better clarification.

df2 <- df %>%
  select(-`...1`,-`admin2`,-`notes`)
# We are renaming Harvest_year to be harvestYear to follow syntax
# Our initial review of the data indicates that admin0 is country
# Our initial review of the data indicates that admin1 is region/providence/state, simply stated as region
# Our initial review of the data indicates that year is the reportingYear of the information
df2 <- df2 %>%
  rename("harvestYear" = "Harvest_year", "country" = "admin0", "region" = "admin1", "reportingYear" = "year")

Continuing our review of the dataset

table(df2$`country`)
## 
##      Argentina      Australia        Austria        Belgium         Brazil 
##            237           2272            116            116            555 
##         Canada          Chile          China        Croatia Czech Republic 
##            220            176           4002            116            111 
##        Finland         France          India      Indonesia          Italy 
##            116           5236            233             96           2128 
##         Mexico        Morocco    Netherlands       Portugal   South Africa 
##            246             80            234             89            216 
##          Spain         Sweden United Kingdom  United States        Uruguay 
##           1638            256            134          17846            238
sumdf <- df2 %>%
  group_by(country) %>%
  summarise("Total Yield" = sum(`yield(tonnes/ha)`,na.rm=T), 
            "Average Yield" = mean(`yield(tonnes/ha)`,na.rm=T),
            "Count of Data Records" = n(),
            "Smallest Yield" = min(`yield(tonnes/ha)`,na.rm=T),
            "Greatest Yield" = max(`yield(tonnes/ha)`,na.rm=T)) %>%
  mutate("Total Yield per 10,000" = `Total Yield`/10000)



plot1 <- ggplot(sumdf, aes(`Average Yield`,
                           `Total Yield per 10,000`, 
                           color=`country`, 
                           text = paste("Number of Records:",`Count of Data Records`))) +
  geom_point(size = 3, alpha = 0.6) +
  ggtitle("Average Yield x Total Yield by Country") +
  xlab("Average Yield (tonnes/ha)") +
  ylab("Total Yield in thousands (tonnes/ha)") +
  theme_minimal()
plot1 <- ggplotly(plot1)
plot1

Categorize countries

northAmerica <- c("United States", "Canada", "Mexico")
southAmerica <- c("Argentina","Brazil","Chile","Uruguay")
europe <- c("Austria","Belgium","Croatia","Czech Republic","Finland","France","Italy","Netherlands","Portugal","Spain","Sweden","United Kingdom")
africa <- c("Morocco","South Africa")
asia <- c("China","India","Indonesia")
australia <- c("Australia")

df3 <- df2 %>%
  mutate(continent = factor(ifelse(country %in% northAmerica,"North America",
                                   ifelse(country %in% southAmerica,"South America",
                                   ifelse(country %in% europe,"Europe",
                                   ifelse(country %in% africa,"Africa",
                                   ifelse(country %in% asia,"Asia",
                                   ifelse(country %in% australia,"Australia","Not Listed"))))))),
         numberOfCountries = ifelse(country %in% northAmerica,3,
                                   ifelse(country %in% southAmerica,4,
                                   ifelse(country %in% europe,12,
                                   ifelse(country %in% africa,2,
                                   ifelse(country %in% asia,3,
                                   ifelse(country %in% australia,1,0)))))))
str(df3)
## tibble [36,707 × 10] (S3: tbl_df/tbl/data.frame)
##  $ harvestYear        : num [1:36707] 1902 1903 1904 1905 1906 ...
##  $ country            : chr [1:36707] "Austria" "Austria" "Austria" "Austria" ...
##  $ region             : chr [1:36707] NA NA NA NA ...
##  $ crop               : chr [1:36707] "wheat" "wheat" "wheat" "wheat" ...
##  $ hectares (ha)      : num [1:36707] NA NA NA NA NA NA NA NA NA NA ...
##  $ production (tonnes): num [1:36707] NA NA NA NA NA NA NA NA NA NA ...
##  $ reportingYear      : num [1:36707] 1902 1903 1904 1905 1906 ...
##  $ yield(tonnes/ha)   : num [1:36707] 1.31 1.47 1.27 1.33 1.28 1.37 1.36 1.35 1.18 1.37 ...
##  $ continent          : Factor w/ 6 levels "Africa","Asia",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ numberOfCountries  : num [1:36707] 12 12 12 12 12 12 12 12 12 12 ...

Now that we have countries categorized by continent, how does our total versus averages look?

sumdf2 <- df3 %>%
  group_by(continent) %>%
  summarise("Total Yield" = sum(`yield(tonnes/ha)`,na.rm=T), 
            "Average Yield" = mean(`yield(tonnes/ha)`,na.rm=T),
            "Count of Data Records" = n(),
            "Smallest Yield" = min(`yield(tonnes/ha)`,na.rm=T),
            "Greatest Yield" = max(`yield(tonnes/ha)`,na.rm=T)) %>%
  mutate("Total Yield per 10,000" = `Total Yield`/10000,
         numberOfCountries = ifelse(continent == "North America",3,
                                   ifelse(continent == "South America",4,
                                   ifelse(continent == "Europe",12,
                                   ifelse(continent == "Africa",2,
                                   ifelse(continent == "Asia",3,
                                   ifelse(continent == "Australia",1,0)))))))



plot2 <- ggplot(sumdf2, aes(`Average Yield`,
                           `Total Yield per 10,000`, 
                           text = paste("Number of Countries Represented:",`numberOfCountries`))) +
  geom_point(size = 3, alpha = 0.6, color = "#FFFFFF", aes(fill=`continent`)) +
  ggtitle("Average Yield x Total Yield by Continent") +
  xlab("Average Yield (tonnes/ha)") +
  ylab("Total Yield in thousands (tonnes/ha)") +
  theme_minimal()
plot2 <- ggplotly(plot2) %>%
  layout(legend = list(x = 0.1, y = 0.85))
plot2

Analysis

We would like to take a look at two specific countries, United States and France as they have the most amount of data points to be able to look at.

dfUSA <- df3 %>%
  filter(country == "United States")
sumdfUSA <- dfUSA %>%
  group_by(harvestYear) %>%
  summarise("Total Yield" = sum(`yield(tonnes/ha)`,na.rm=T), 
            "Average Yield" = mean(`yield(tonnes/ha)`,na.rm=T),
            "Count of Data Records" = as.numeric(n()),
            "Smallest Yield" = min(`yield(tonnes/ha)`,na.rm=T),
            "Greatest Yield" = max(`yield(tonnes/ha)`,na.rm=T))

plotUSA <- ggplot(sumdfUSA, aes(x = harvestYear, y = `Average Yield`)) +
  labs(title = "Gap Between Greatest Yields and the Average Crop Yields in USA") +
  geom_line() +
  geom_area(aes(y = `Greatest Yield`), alpha = 0.3) +
  geom_area(aes(y = `Average Yield`), fill = "white", alpha = 1) +
  xlab("Year") +
  ylab("Average Yield (tonnes/ha)") +
  theme_minimal() +
  geom_line(data = sumdfUSA, color = "red", aes(x = harvestYear, y = `Greatest Yield`))

plotUSA <- ggplotly(plotUSA)
plotUSA
dfFr <- df3 %>%
  filter(country == "France")
sumdfFr <- dfFr %>%
  group_by(harvestYear) %>%
  summarise("Total Yield" = sum(`yield(tonnes/ha)`,na.rm=T), 
            "Average Yield" = mean(`yield(tonnes/ha)`,na.rm=T),
            "Count of Data Records" = as.numeric(n()),
            "Smallest Yield" = min(`yield(tonnes/ha)`,na.rm=T),
            "Greatest Yield" = max(`yield(tonnes/ha)`,na.rm=T))

plotFr <- ggplot(sumdfFr, aes(x = harvestYear, y = `Average Yield`)) +
  labs(title = "Gap Between Greatest Yields and the Average Crop Yields in France") +
  geom_line() +
  geom_area(aes(y = `Greatest Yield`), alpha = 0.3) +
  geom_area(aes(y = `Average Yield`), fill = "white", alpha = 1) +
  xlab("Year") +
  ylab("Average Yield (tonnes/ha)") +
  theme_minimal() +
  geom_line(data = sumdfFr, color = "red", aes(x = harvestYear, y = `Greatest Yield`))

plotFr <- ggplotly(plotFr)
plotFr

Through these graphs, we find a large spike that occurs in the France dataset, so we will want to dig in further into what this means.

str(dfFr)
## tibble [5,236 × 10] (S3: tbl_df/tbl/data.frame)
##  $ harvestYear        : num [1:5236] 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 ...
##  $ country            : chr [1:5236] "France" "France" "France" "France" ...
##  $ region             : chr [1:5236] "Île-de-France" "Champagne-Ardenne" "Picardy" "Upper Normandy" ...
##  $ crop               : chr [1:5236] "wheat" "wheat" "wheat" "wheat" ...
##  $ hectares (ha)      : num [1:5236] 93450 339930 363490 211700 590100 ...
##  $ production (tonnes): num [1:5236] 187052 449776 597784 311956 906851 ...
##  $ reportingYear      : num [1:5236] 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 ...
##  $ yield(tonnes/ha)   : num [1:5236] 2 1.32 1.64 1.47 1.54 ...
##  $ continent          : Factor w/ 6 levels "Africa","Asia",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ numberOfCountries  : num [1:5236] 12 12 12 12 12 12 12 12 12 12 ...
outliersFrance <- c("Provence-Alpes-Côte dAzur","Languedoc-Roussillon","Val-de-Marne")

dfFr <- dfFr %>%
  mutate(type = ifelse(region %in% outliersFrance, "Highlighted","Normal"))
statsFr <- ggplot(dfFr, aes(x = region, y = `yield(tonnes/ha)`, fill=type)) +
  geom_boxplot() +
  labs(title = "Boxplot Review of French Regional Yields with \nHighlight to Outliers") +
  scale_fill_manual(values=c("lightblue","grey")) +
  scale_alpha_manual(values=c(.8,0.1)) +
  theme(axis.text.x = element_text(angle = 45),
        legend.position = "none") +
  xlab("French Region") +
  ylab("Total Yield (tonnes/ha)")
statsFr <- ggplotly(statsFr)
statsFr

Want to look further into outliers as they visably look to have a significant impact on the data.

outlierdfFr <- dfFr %>%
  filter(region %in% outliersFrance) %>%
  arrange(region, desc(`yield(tonnes/ha)`))
head(outlierdfFr,5)
## # A tibble: 5 × 11
##   harvest…¹ country region crop  hecta…² produ…³ repor…⁴ yield…⁵ conti…⁶ numbe…⁷
##       <dbl> <chr>   <chr>  <chr>   <dbl>   <dbl>   <dbl>   <dbl> <fct>     <dbl>
## 1      2014 France  Langu… maize    1590  16015     2014   10.1  Europe       12
## 2      2015 France  Langu… maize    1655  12945     2015    7.82 Europe       12
## 3      1997 France  Langu… maize    5140  37475.    1997    7.29 Europe       12
## 4      1998 France  Langu… maize    5288  38211.    1998    7.23 Europe       12
## 5      1992 France  Langu… maize    5917  42164.    1992    7.13 Europe       12
## # … with 1 more variable: type <chr>, and abbreviated variable names
## #   ¹​harvestYear, ²​`hectares (ha)`, ³​`production (tonnes)`, ⁴​reportingYear,
## #   ⁵​`yield(tonnes/ha)`, ⁶​continent, ⁷​numberOfCountries

After reviewing some of the data points between area and production amounts, Provence-Alpes-Côte dAzur is showing a strange change in 1979. Comparing similar years, and similar areas (hectares), the values represented do not actually seem to be to considered an impossible

outlierdfFr2 <- dfFr %>%
  filter(region == "Provence-Alpes-Côte dAzur")
fit <- lm(`production (tonnes)` ~ `hectares (ha)`, data = outlierdfFr2)
outlierdfFr2$predicted <- predict(fit)
outlierdfFr2$residuals <- residuals(fit)


plotFrOutlier <- outlierdfFr2 %>%
  ggplot(aes(`hectares (ha)`, 
             `production (tonnes)`)) +
  geom_segment(aes(xend = `hectares (ha)`, yend = predicted), alpha = 0.2) +
  geom_point(aes(color = residuals)) +
  geom_smooth(method = "lm", se = FALSE, color = "lightgrey") +
  scale_color_gradient2(low = "purple", mid="white", high = "red") +
  guides(color = FALSE) +
  geom_point(aes(y = predicted),shape = 1) +
  labs(title = "A Look Into Outliers from Provence-Alpes-Côte dAzur, France")

plotFrOutlier <- ggplotly(plotFrOutlier)
plotFrOutlier

Lets take a look at the quick statistical relations in yield

min(outlierdfFr2$`yield(tonnes/ha)`)
## [1] 0.6234753
max(outlierdfFr2$`yield(tonnes/ha)`)
## [1] 17.16848
sd(outlierdfFr2$`yield(tonnes/ha)`)
## [1] 2.518399
mean(outlierdfFr2$`yield(tonnes/ha)`)
## [1] 3.053758
# General outlier calculation

genOutlier <- mean(outlierdfFr2$`yield(tonnes/ha)`) + 3*sd(outlierdfFr2$`yield(tonnes/ha)`)
outlierdfFr2$IsOutlier <- outlierdfFr2$`yield(tonnes/ha)` > genOutlier

outlierdfFr2 %>%
  filter(IsOutlier == TRUE) %>% head()
## # A tibble: 2 × 14
##   harvest…¹ country region crop  hecta…² produ…³ repor…⁴ yield…⁵ conti…⁶ numbe…⁷
##       <dbl> <chr>   <chr>  <chr>   <dbl>   <dbl>   <dbl>   <dbl> <fct>     <dbl>
## 1      1979 France  Prove… maize    3959   67970    1979    17.2 Europe       12
## 2      2014 France  Prove… maize    1994   21457    2014    10.8 Europe       12
## # … with 4 more variables: type <chr>, predicted <dbl>, residuals <dbl>,
## #   IsOutlier <lgl>, and abbreviated variable names ¹​harvestYear,
## #   ²​`hectares (ha)`, ³​`production (tonnes)`, ⁴​reportingYear,
## #   ⁵​`yield(tonnes/ha)`, ⁶​continent, ⁷​numberOfCountries

We find that 1979 really is a significant outlier, where as the 2014 data only barely does not make the cut of 3 standard deviations.

Summary

With the amount of data presented, we find that as the years progress, the general average yield by nations does tend to show favorable advancements. We specifically focused on two major players, USA and France as they were the two leading countries with high amounts of yield. They also have additional data points to breakdown by region as well. USA showed a stable trend line of average yields positively, whereas France had more of a larger jump. On the plots, I added information on the max yields for each of the year to see if there was any major differences between the yields versus the average yield, and to my surprise, there was a significant spike in one year in France. We decided to dig into outliers to see the major significance of this outlier to the rest of the data. Initial thoughts showed that all outliers were focused on the maize crop. Though in two of our generalized outlier tests, only two points really fell out, 1979 and 2014. 2014 only barely did not make the cut, but 1979 had a major difference. We determined that this was caused due to the region of Provence-Alpes-Côte dAzur. With additional access and insights into 1979 France, I would like to explore this year further to see what may have caused such an influx of production. I would like to review major global events or even specific internal to France events that could have occurred in that time frame to have caused this spike. Or maybe, it truly is data entry error…

Overall, the project’s main was to show that overtime, yield would increase as advancements in technology would increase as well. We did fall short on studying the potential correlations as this is only a one sided view on the argument. I did spend a lot more time going down the rabbit hole of outliers and its significance with the data, which may lead to future research opportunities.

Source

Anderson, W., W. Baethgen, F. Capitanio, P. Ciais, G. Rocca da Cunha, L. Goddard, B. Schauberger , K. Sonder, G. Podesta, M. van der Velde, L. You, and Y. Ru. 2022. Twentieth Century Crop Statistics, 1900-2017. Palisades, New York: NASA Socioeconomic Data and Applications Center (SEDAC). https://doi.org/10.7927/tmsp-sg82. Accessed 15-11-2022.

I. Yucel, A. Onen, K.K. Yilmaz, D.J. Gochis, Calibration and evaluation of a flood forecasting system: Utility of numerical weather prediction model, data assimilation and satellite-based rainfall, Journal of Hydrology, Volume 523, 2015, Pages 49-66, ISSN 0022-1694, https://doi.org/10.1016/j.jhydrol.2015.01.042. (https://www.sciencedirect.com/science/article/pii/S0022169415000591) Accessed 15-11-2022.