The study of Outer Space is not limited to what individuals see looking out into the sky, but also what is here on the land we stand on. What goes on in space can help determine what food we grow. For years, outer space has been used to monitor the atmosphere and weather anomalies on Earth. Food, crops, our farms are all affected heavily by the weather and it is crucial for the health and well-being of our people to have a steady and sustained supply of crops. The dataset presented today is provided by NASA through the SocioEconomic Data and Applications Center (SEDAC) and hosted by Columbia University’s Center for International Earth Science Information Network. We are looking through this dataset to see the trends of global crop growth (or decay) throughout the twentieth century up until 2017. Our interest is to find similarities between Space technology growth within a country and its potential crop yield. Though of course, correlation doesn’t not imply causation, though this view, we may find insights to better understand the working relationship of farming and its reliance on Space technology advancements.
Meteorologists, the popular person you watch on tv in the morning before going out for work, study and forecast weather and its anomalies. But the complexities of Mother Nature cannot be handled through simple observations done through the human eyes. We need tools. We need technology. A common set of tools used to study hydrology is the use of satellites. They can help find rumblings within the atmosphere through physics-based hydrometeorological systems using Multi-sensor Precipitation Estimates (MPE) housed on these satellites. Researchers would be able to take the data gathered from the sensors and create modeling algorithms to predict potentials of rainfall and the appropriate runoff paths (I. Yucel). This information and foresight allows farmers to be able to prepare for what is to come, whether it be to dry off to prevent deluging their crops, or to open up additional spillways to handle lighter rain showers. As our sensor technologies advance, so to do our algorithms and the predictability for major weather events. We take a look at this data to see available trends throughout the years as advancements are made in during the Information Age and beyond.
The dataset has a variety of tabs to help split out the information. Our focus will be on a single tab called “CropStats”. This tab has the raw data that each of the additional tabs presents. For purposes of keeping the file size down, we will remove all accessory tabs other than the “Title Page”, “Wheat_Data_Sources”, and “Maize_Data_Sources” tab for purposes of source reference.
Packages <- c("tidyverse","dplyr","readxl","plotly","ggthemes","treemapify","RColorBrewer")
invisible(lapply(Packages, library, character.only = TRUE))
setwd("~/Data Study/Data 110 MC/Project2")
df <- read_excel("food-twentieth-century-crop-statistics-1900-2017-xlsx.xlsx",sheet = "CropStats")
str(df)
## tibble [36,707 × 11] (S3: tbl_df/tbl/data.frame)
## $ ...1 : num [1:36707] 0 1 2 3 4 5 6 7 8 9 ...
## $ Harvest_year : num [1:36707] 1902 1903 1904 1905 1906 ...
## $ admin0 : chr [1:36707] "Austria" "Austria" "Austria" "Austria" ...
## $ admin1 : chr [1:36707] NA NA NA NA ...
## $ crop : chr [1:36707] "wheat" "wheat" "wheat" "wheat" ...
## $ hectares (ha) : num [1:36707] NA NA NA NA NA NA NA NA NA NA ...
## $ production (tonnes): num [1:36707] NA NA NA NA NA NA NA NA NA NA ...
## $ year : num [1:36707] 1902 1903 1904 1905 1906 ...
## $ yield(tonnes/ha) : num [1:36707] 1.31 1.47 1.27 1.33 1.28 1.37 1.36 1.35 1.18 1.37 ...
## $ admin2 : logi [1:36707] NA NA NA NA NA NA ...
## $ notes : logi [1:36707] NA NA NA NA NA NA ...
We find that the first column is an index, specifically starting with 0, we will exclude this from our data as it is not needed. We also find that the notes and admin2 is set to logical, but our initial guess would state that they should be char. However, for this project, we are not reviewing text, and we know that admin2 is blank. Thus, these two columns will also be removed.
colSums(is.na(df))
## ...1 Harvest_year admin0 admin1
## 0 0 0 2934
## crop hectares (ha) production (tonnes) year
## 0 1623 1998 0
## yield(tonnes/ha) admin2 notes
## 2013 36707 36707
To aide us in our view, we will remove a few columns as noted earlier, and rename some of the columns for better clarification.
df2 <- df %>%
select(-`...1`,-`admin2`,-`notes`)
# We are renaming Harvest_year to be harvestYear to follow syntax
# Our initial review of the data indicates that admin0 is country
# Our initial review of the data indicates that admin1 is region/providence/state, simply stated as region
# Our initial review of the data indicates that year is the reportingYear of the information
df2 <- df2 %>%
rename("harvestYear" = "Harvest_year", "country" = "admin0", "region" = "admin1", "reportingYear" = "year")
Continuing our review of the dataset
table(df2$`country`)
##
## Argentina Australia Austria Belgium Brazil
## 237 2272 116 116 555
## Canada Chile China Croatia Czech Republic
## 220 176 4002 116 111
## Finland France India Indonesia Italy
## 116 5236 233 96 2128
## Mexico Morocco Netherlands Portugal South Africa
## 246 80 234 89 216
## Spain Sweden United Kingdom United States Uruguay
## 1638 256 134 17846 238
sumdf <- df2 %>%
group_by(country) %>%
summarise("Total Yield" = sum(`yield(tonnes/ha)`,na.rm=T),
"Average Yield" = mean(`yield(tonnes/ha)`,na.rm=T),
"Count of Data Records" = n(),
"Smallest Yield" = min(`yield(tonnes/ha)`,na.rm=T),
"Greatest Yield" = max(`yield(tonnes/ha)`,na.rm=T)) %>%
mutate("Total Yield per 10,000" = `Total Yield`/10000)
plot1 <- ggplot(sumdf, aes(`Average Yield`,
`Total Yield per 10,000`,
color=`country`,
text = paste("Number of Records:",`Count of Data Records`))) +
geom_point(size = 3, alpha = 0.6) +
ggtitle("Average Yield x Total Yield by Country") +
xlab("Average Yield (tonnes/ha)") +
ylab("Total Yield in thousands (tonnes/ha)") +
theme_minimal()
plot1 <- ggplotly(plot1)
plot1
Categorize countries
northAmerica <- c("United States", "Canada", "Mexico")
southAmerica <- c("Argentina","Brazil","Chile","Uruguay")
europe <- c("Austria","Belgium","Croatia","Czech Republic","Finland","France","Italy","Netherlands","Portugal","Spain","Sweden","United Kingdom")
africa <- c("Morocco","South Africa")
asia <- c("China","India","Indonesia")
australia <- c("Australia")
df3 <- df2 %>%
mutate(continent = factor(ifelse(country %in% northAmerica,"North America",
ifelse(country %in% southAmerica,"South America",
ifelse(country %in% europe,"Europe",
ifelse(country %in% africa,"Africa",
ifelse(country %in% asia,"Asia",
ifelse(country %in% australia,"Australia","Not Listed"))))))),
numberOfCountries = ifelse(country %in% northAmerica,3,
ifelse(country %in% southAmerica,4,
ifelse(country %in% europe,12,
ifelse(country %in% africa,2,
ifelse(country %in% asia,3,
ifelse(country %in% australia,1,0)))))))
str(df3)
## tibble [36,707 × 10] (S3: tbl_df/tbl/data.frame)
## $ harvestYear : num [1:36707] 1902 1903 1904 1905 1906 ...
## $ country : chr [1:36707] "Austria" "Austria" "Austria" "Austria" ...
## $ region : chr [1:36707] NA NA NA NA ...
## $ crop : chr [1:36707] "wheat" "wheat" "wheat" "wheat" ...
## $ hectares (ha) : num [1:36707] NA NA NA NA NA NA NA NA NA NA ...
## $ production (tonnes): num [1:36707] NA NA NA NA NA NA NA NA NA NA ...
## $ reportingYear : num [1:36707] 1902 1903 1904 1905 1906 ...
## $ yield(tonnes/ha) : num [1:36707] 1.31 1.47 1.27 1.33 1.28 1.37 1.36 1.35 1.18 1.37 ...
## $ continent : Factor w/ 6 levels "Africa","Asia",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ numberOfCountries : num [1:36707] 12 12 12 12 12 12 12 12 12 12 ...
Now that we have countries categorized by continent, how does our total versus averages look?
sumdf2 <- df3 %>%
group_by(continent) %>%
summarise("Total Yield" = sum(`yield(tonnes/ha)`,na.rm=T),
"Average Yield" = mean(`yield(tonnes/ha)`,na.rm=T),
"Count of Data Records" = n(),
"Smallest Yield" = min(`yield(tonnes/ha)`,na.rm=T),
"Greatest Yield" = max(`yield(tonnes/ha)`,na.rm=T)) %>%
mutate("Total Yield per 10,000" = `Total Yield`/10000,
numberOfCountries = ifelse(continent == "North America",3,
ifelse(continent == "South America",4,
ifelse(continent == "Europe",12,
ifelse(continent == "Africa",2,
ifelse(continent == "Asia",3,
ifelse(continent == "Australia",1,0)))))))
plot2 <- ggplot(sumdf2, aes(`Average Yield`,
`Total Yield per 10,000`,
text = paste("Number of Countries Represented:",`numberOfCountries`))) +
geom_point(size = 3, alpha = 0.6, color = "#FFFFFF", aes(fill=`continent`)) +
ggtitle("Average Yield x Total Yield by Continent") +
xlab("Average Yield (tonnes/ha)") +
ylab("Total Yield in thousands (tonnes/ha)") +
theme_minimal()
plot2 <- ggplotly(plot2) %>%
layout(legend = list(x = 0.1, y = 0.85))
plot2
We would like to take a look at two specific countries, United States and France as they have the most amount of data points to be able to look at.
dfUSA <- df3 %>%
filter(country == "United States")
sumdfUSA <- dfUSA %>%
group_by(harvestYear) %>%
summarise("Total Yield" = sum(`yield(tonnes/ha)`,na.rm=T),
"Average Yield" = mean(`yield(tonnes/ha)`,na.rm=T),
"Count of Data Records" = as.numeric(n()),
"Smallest Yield" = min(`yield(tonnes/ha)`,na.rm=T),
"Greatest Yield" = max(`yield(tonnes/ha)`,na.rm=T))
plotUSA <- ggplot(sumdfUSA, aes(x = harvestYear, y = `Average Yield`)) +
labs(title = "Gap Between Greatest Yields and the Average Crop Yields in USA") +
geom_line() +
geom_area(aes(y = `Greatest Yield`), alpha = 0.3) +
geom_area(aes(y = `Average Yield`), fill = "white", alpha = 1) +
xlab("Year") +
ylab("Average Yield (tonnes/ha)") +
theme_minimal() +
geom_line(data = sumdfUSA, color = "red", aes(x = harvestYear, y = `Greatest Yield`))
plotUSA <- ggplotly(plotUSA)
plotUSA
dfFr <- df3 %>%
filter(country == "France")
sumdfFr <- dfFr %>%
group_by(harvestYear) %>%
summarise("Total Yield" = sum(`yield(tonnes/ha)`,na.rm=T),
"Average Yield" = mean(`yield(tonnes/ha)`,na.rm=T),
"Count of Data Records" = as.numeric(n()),
"Smallest Yield" = min(`yield(tonnes/ha)`,na.rm=T),
"Greatest Yield" = max(`yield(tonnes/ha)`,na.rm=T))
plotFr <- ggplot(sumdfFr, aes(x = harvestYear, y = `Average Yield`)) +
labs(title = "Gap Between Greatest Yields and the Average Crop Yields in France") +
geom_line() +
geom_area(aes(y = `Greatest Yield`), alpha = 0.3) +
geom_area(aes(y = `Average Yield`), fill = "white", alpha = 1) +
xlab("Year") +
ylab("Average Yield (tonnes/ha)") +
theme_minimal() +
geom_line(data = sumdfFr, color = "red", aes(x = harvestYear, y = `Greatest Yield`))
plotFr <- ggplotly(plotFr)
plotFr
Through these graphs, we find a large spike that occurs in the France dataset, so we will want to dig in further into what this means.
str(dfFr)
## tibble [5,236 × 10] (S3: tbl_df/tbl/data.frame)
## $ harvestYear : num [1:5236] 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 ...
## $ country : chr [1:5236] "France" "France" "France" "France" ...
## $ region : chr [1:5236] "Île-de-France" "Champagne-Ardenne" "Picardy" "Upper Normandy" ...
## $ crop : chr [1:5236] "wheat" "wheat" "wheat" "wheat" ...
## $ hectares (ha) : num [1:5236] 93450 339930 363490 211700 590100 ...
## $ production (tonnes): num [1:5236] 187052 449776 597784 311956 906851 ...
## $ reportingYear : num [1:5236] 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 ...
## $ yield(tonnes/ha) : num [1:5236] 2 1.32 1.64 1.47 1.54 ...
## $ continent : Factor w/ 6 levels "Africa","Asia",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ numberOfCountries : num [1:5236] 12 12 12 12 12 12 12 12 12 12 ...
outliersFrance <- c("Provence-Alpes-Côte dAzur","Languedoc-Roussillon","Val-de-Marne")
dfFr <- dfFr %>%
mutate(type = ifelse(region %in% outliersFrance, "Highlighted","Normal"))
statsFr <- ggplot(dfFr, aes(x = region, y = `yield(tonnes/ha)`, fill=type)) +
geom_boxplot() +
labs(title = "Boxplot Review of French Regional Yields with \nHighlight to Outliers") +
scale_fill_manual(values=c("lightblue","grey")) +
scale_alpha_manual(values=c(.8,0.1)) +
theme(axis.text.x = element_text(angle = 45),
legend.position = "none") +
xlab("French Region") +
ylab("Total Yield (tonnes/ha)")
statsFr <- ggplotly(statsFr)
statsFr
Want to look further into outliers as they visably look to have a significant impact on the data.
outlierdfFr <- dfFr %>%
filter(region %in% outliersFrance) %>%
arrange(region, desc(`yield(tonnes/ha)`))
head(outlierdfFr,5)
## # A tibble: 5 × 11
## harvest…¹ country region crop hecta…² produ…³ repor…⁴ yield…⁵ conti…⁶ numbe…⁷
## <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
## 1 2014 France Langu… maize 1590 16015 2014 10.1 Europe 12
## 2 2015 France Langu… maize 1655 12945 2015 7.82 Europe 12
## 3 1997 France Langu… maize 5140 37475. 1997 7.29 Europe 12
## 4 1998 France Langu… maize 5288 38211. 1998 7.23 Europe 12
## 5 1992 France Langu… maize 5917 42164. 1992 7.13 Europe 12
## # … with 1 more variable: type <chr>, and abbreviated variable names
## # ¹harvestYear, ²`hectares (ha)`, ³`production (tonnes)`, ⁴reportingYear,
## # ⁵`yield(tonnes/ha)`, ⁶continent, ⁷numberOfCountries
After reviewing some of the data points between area and production amounts, Provence-Alpes-Côte dAzur is showing a strange change in 1979. Comparing similar years, and similar areas (hectares), the values represented do not actually seem to be to considered an impossible
outlierdfFr2 <- dfFr %>%
filter(region == "Provence-Alpes-Côte dAzur")
fit <- lm(`production (tonnes)` ~ `hectares (ha)`, data = outlierdfFr2)
outlierdfFr2$predicted <- predict(fit)
outlierdfFr2$residuals <- residuals(fit)
plotFrOutlier <- outlierdfFr2 %>%
ggplot(aes(`hectares (ha)`,
`production (tonnes)`)) +
geom_segment(aes(xend = `hectares (ha)`, yend = predicted), alpha = 0.2) +
geom_point(aes(color = residuals)) +
geom_smooth(method = "lm", se = FALSE, color = "lightgrey") +
scale_color_gradient2(low = "purple", mid="white", high = "red") +
guides(color = FALSE) +
geom_point(aes(y = predicted),shape = 1) +
labs(title = "A Look Into Outliers from Provence-Alpes-Côte dAzur, France")
plotFrOutlier <- ggplotly(plotFrOutlier)
plotFrOutlier
Lets take a look at the quick statistical relations in yield
min(outlierdfFr2$`yield(tonnes/ha)`)
## [1] 0.6234753
max(outlierdfFr2$`yield(tonnes/ha)`)
## [1] 17.16848
sd(outlierdfFr2$`yield(tonnes/ha)`)
## [1] 2.518399
mean(outlierdfFr2$`yield(tonnes/ha)`)
## [1] 3.053758
# General outlier calculation
genOutlier <- mean(outlierdfFr2$`yield(tonnes/ha)`) + 3*sd(outlierdfFr2$`yield(tonnes/ha)`)
outlierdfFr2$IsOutlier <- outlierdfFr2$`yield(tonnes/ha)` > genOutlier
outlierdfFr2 %>%
filter(IsOutlier == TRUE) %>% head()
## # A tibble: 2 × 14
## harvest…¹ country region crop hecta…² produ…³ repor…⁴ yield…⁵ conti…⁶ numbe…⁷
## <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
## 1 1979 France Prove… maize 3959 67970 1979 17.2 Europe 12
## 2 2014 France Prove… maize 1994 21457 2014 10.8 Europe 12
## # … with 4 more variables: type <chr>, predicted <dbl>, residuals <dbl>,
## # IsOutlier <lgl>, and abbreviated variable names ¹harvestYear,
## # ²`hectares (ha)`, ³`production (tonnes)`, ⁴reportingYear,
## # ⁵`yield(tonnes/ha)`, ⁶continent, ⁷numberOfCountries
We find that 1979 really is a significant outlier, where as the 2014 data only barely does not make the cut of 3 standard deviations.
With the amount of data presented, we find that as the years progress, the general average yield by nations does tend to show favorable advancements. We specifically focused on two major players, USA and France as they were the two leading countries with high amounts of yield. They also have additional data points to breakdown by region as well. USA showed a stable trend line of average yields positively, whereas France had more of a larger jump. On the plots, I added information on the max yields for each of the year to see if there was any major differences between the yields versus the average yield, and to my surprise, there was a significant spike in one year in France. We decided to dig into outliers to see the major significance of this outlier to the rest of the data. Initial thoughts showed that all outliers were focused on the maize crop. Though in two of our generalized outlier tests, only two points really fell out, 1979 and 2014. 2014 only barely did not make the cut, but 1979 had a major difference. We determined that this was caused due to the region of Provence-Alpes-Côte dAzur. With additional access and insights into 1979 France, I would like to explore this year further to see what may have caused such an influx of production. I would like to review major global events or even specific internal to France events that could have occurred in that time frame to have caused this spike. Or maybe, it truly is data entry error…
Overall, the project’s main was to show that overtime, yield would increase as advancements in technology would increase as well. We did fall short on studying the potential correlations as this is only a one sided view on the argument. I did spend a lot more time going down the rabbit hole of outliers and its significance with the data, which may lead to future research opportunities.
Anderson, W., W. Baethgen, F. Capitanio, P. Ciais, G. Rocca da Cunha, L. Goddard, B. Schauberger , K. Sonder, G. Podesta, M. van der Velde, L. You, and Y. Ru. 2022. Twentieth Century Crop Statistics, 1900-2017. Palisades, New York: NASA Socioeconomic Data and Applications Center (SEDAC). https://doi.org/10.7927/tmsp-sg82. Accessed 15-11-2022.
I. Yucel, A. Onen, K.K. Yilmaz, D.J. Gochis, Calibration and evaluation of a flood forecasting system: Utility of numerical weather prediction model, data assimilation and satellite-based rainfall, Journal of Hydrology, Volume 523, 2015, Pages 49-66, ISSN 0022-1694, https://doi.org/10.1016/j.jhydrol.2015.01.042. (https://www.sciencedirect.com/science/article/pii/S0022169415000591) Accessed 15-11-2022.