Introduction

In this new analysis I will add two variables:

The main goal of the analysis is to explore the relationships among the variables across all continents.

Getting and Cleaning Data

The data was downloaded from Gapminder.

There are 3 datasets:

Note: Since the goal of this course is not Getting and Cleaning data, I’ve manually removed years from the Agriculture data to make it easier to read and clean. This is not a recommended practice.

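# Packages used in this report: melt(), pairs.panels(), ggplot()
library(reshape2)
library(psych)
library(ggplot2)

# read.csv turns the numeric column names into X1990, X2000, X2005
a.coverage <- read.csv("./data/agriculture land.csv",
                       col.names = c("Country", "1990", "2000", "2005"))
f.coverage <- read.csv("./data/indicator_forest coverage.csv",
                       col.names = c("Country", "1990", "2000", "2005"))
a.gdp <- read.csv("./data/agriculture p of GDP.csv",
                  col.names = c("Country", "1990", "2000", "2005"))
continents <- read.csv("./data/Countries-Continents.csv")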

a.coverage <- melt(a.coverage, id.vars = c("Country"))
names(a.coverage) <- c("Country","Year","Agriculture.Coverage")
f.coverage <- melt(f.coverage, id.vars = c("Country"))
names(f.coverage) <- c("Country","Year","Forest.Coverage")
a.gdp <- melt(a.gdp, id.vars = c("Country"))
names(a.gdp) <- c("Country","Year","Agriculture.GDP")


data <- merge(a.coverage,f.coverage, by=c("Country","Year"))
data <- merge(data, a.gdp, by =c("Country", "Year"))
data <- merge(data, continents, by = c("Country"))

data$Year <- gsub("X","",data$Year)
data$Year <- as.factor(data$Year)
data$Agriculture.Coverage <- gsub(",",".",data$Agriculture.Coverage)
data$Agriculture.Coverage <- as.numeric(data$Agriculture.Coverage)
data$Forest.Coverage <- gsub(",",".",data$Forest.Coverage)
data$Forest.Coverage <- as.numeric(data$Forest.Coverage)
data$Agriculture.GDP <- gsub(",",".", data$Agriculture.GDP)
data$Agriculture.GDP <- as.numeric(data$Agriculture.GDP)

data <- data[complete.cases(data),]

rm(a.coverage,f.coverage, a.gdp, continents)

head(data)
##       Country Year Agriculture.Coverage Forest.Coverage Agriculture.GDP
## 2 Afghanistan 2005             58.12367            1.33       39.480416
## 4     Albania 2005             39.30657           28.98       22.800000
## 5     Albania 1990             40.91241           28.80       35.900790
## 6     Albania 2000             41.75182           28.07       29.130583
## 7     Algeria 2000             16.80326            0.90        8.879884
## 8     Algeria 1990             16.23855            0.75       11.358267
##   Continent
## 2      Asia
## 4    Europe
## 5    Europe
## 6    Europe
## 7    Africa
## 8    Africa

The data seems ready to be processed.
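The cleaning steps above can be tried in isolation on a tiny table. This is only a sketch in base R: the `toy` table and its values are made up for illustration, and the hand-written reshape stands in for `melt()` so that no extra package is needed.

```r
# Hypothetical toy table; the values are made up for illustration only.
toy <- data.frame(Country = c("A", "B"),
                  X1990 = c("10,5", "20,25"),
                  X2000 = c("11,0", NA),
                  stringsAsFactors = FALSE)

# Wide-to-long by hand (a base R stand-in for reshape2::melt).
long <- data.frame(Country = rep(toy$Country, times = 2),
                   Year = rep(c("X1990", "X2000"), each = nrow(toy)),
                   Value = c(toy$X1990, toy$X2000),
                   stringsAsFactors = FALSE)

# Strip the "X" prefix that read.csv adds to numeric column names.
long$Year <- factor(sub("^X", "", long$Year))

# Turn decimal commas into decimal points, then convert to numeric.
long$Value <- as.numeric(gsub(",", ".", long$Value, fixed = TRUE))

# Drop rows with missing values, as complete.cases() does in the report.
long <- long[complete.cases(long), ]
print(long)
```

Anchoring the pattern with `^X` removes only the leading prefix, which is a slightly safer variant of the `gsub("X", "", ...)` call used in the report.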

Data Processing

Let’s draw some exploratory charts, nothing too formal, just to take a look at the data.

pairs.panels(data)

There isn’t much correlation between the variables. The most interesting relationship is between Agriculture.Coverage and Forest.Coverage, as I analyzed in my previous exercise. Judging by the Agriculture.GDP histogram, it seems that only a few countries depend on their agriculture.
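The correlations read off the panel chart can also be checked numerically with base R's `cor()`. The sketch below uses only the six rows printed by `head(data)` earlier, so its numbers say nothing about the full dataset; in the report itself the numeric columns of `data` would be passed instead. The name `sample_rows` is just an illustrative placeholder.

```r
# Six rows copied from the head(data) output shown earlier in this report.
sample_rows <- data.frame(
  Agriculture.Coverage = c(58.12367, 39.30657, 40.91241,
                           41.75182, 16.80326, 16.23855),
  Forest.Coverage      = c(1.33, 28.98, 28.80, 28.07, 0.90, 0.75),
  Agriculture.GDP      = c(39.480416, 22.800000, 35.900790,
                           29.130583, 8.879884, 11.358267)
)

# Pairwise Pearson correlations between the three numeric variables.
corr <- cor(sample_rows, method = "pearson")
print(round(corr, 2))
```

With the full data frame, something like `cor(data[, c("Agriculture.Coverage", "Forest.Coverage", "Agriculture.GDP")])` should give the same kind of matrix.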

Let’s take a closer look at Agriculture.GDP in 2005:

summary(subset(data, Year == 2005)$Agriculture.GDP)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0542  3.5210 10.1300 13.8400 21.2100 51.5700
ggplot(subset(data, Year == 2005), aes(x = Agriculture.GDP)) +
  geom_histogram(binwidth = 1)

ggplot(subset(data, Year == 2005), aes(x= Agriculture.GDP)) + 
  geom_histogram(aes(fill = Continent),binwidth = 1) + 
  facet_wrap(~ Continent)

Plotting Variable Means over Time

# fun replaces the deprecated fun.y argument (ggplot2 >= 3.3.0)
ggplot(data, aes(x = Year, y = Agriculture.Coverage, group = Continent)) +
  geom_line(stat = "summary", aes(color = Continent), fun = mean) +
  ggtitle("Agriculture Coverage mean per Continent")

ggplot(data, aes(x = Year, y = Forest.Coverage, group = Continent)) +
  geom_line(stat = "summary", aes(color = Continent), fun = mean) +
  ggtitle("Forest Coverage mean per Continent")

ggplot(data, aes(x = Year, y = Agriculture.GDP, group = Continent)) +
  geom_line(stat = "summary", aes(color = Continent), fun = mean) +
  ggtitle("Agriculture % of GDP mean per Continent")

It looks like Agriculture % of GDP decreased on every continent from 1990 to 2005, even though Agriculture Coverage increased in Africa, Asia, North America, and South America.
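The means behind these line charts can be tabulated with `aggregate()` to check the trend numerically rather than by eye. This sketch again uses only the six rows shown by `head(data)`, so it illustrates the call and not the actual continental trends; `sample_rows` and `means` are placeholder names.

```r
# Six rows copied from the head(data) output shown earlier in this report.
sample_rows <- data.frame(
  Continent = c("Asia", "Europe", "Europe", "Europe", "Africa", "Africa"),
  Year = factor(c(2005, 2005, 1990, 2000, 2000, 1990)),
  Agriculture.GDP = c(39.480416, 22.800000, 35.900790,
                      29.130583, 8.879884, 11.358267)
)

# Mean Agriculture.GDP for every Continent/Year combination present.
means <- aggregate(Agriculture.GDP ~ Continent + Year,
                   data = sample_rows, FUN = mean)

# Sort so that each continent's years appear together.
means <- means[order(means$Continent, means$Year), ]
print(means)
```

Replacing `sample_rows` with the full `data` frame should produce the table that the Agriculture % of GDP chart is drawn from.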

Agriculture Ratio

Can Agriculture.GDP over Agriculture.Coverage be used as a measure of farm performance?

If two countries have different Agriculture.GDP but the same Agriculture.Coverage, this could imply that the country with the larger Agriculture.GDP makes better use of its soil. If 10% of my land produces 20% of my GDP, then I’m making better use of the soil than a country that uses 10% of its land to produce 5% of its GDP.
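The arithmetic of that example, worked through for the two hypothetical countries in the paragraph above (the names `first` and `second` are placeholders):

```r
# Hypothetical countries from the example: same land share, different GDP share.
gdp_share  <- c(first = 20, second = 5)    # agriculture as % of GDP
land_share <- c(first = 10, second = 10)   # agricultural land as % of area

# GDP share produced per unit of land share.
ratio <- gdp_share / land_share
print(ratio)
# first second
#    2.0    0.5
```

A ratio above 1 means agriculture contributes a larger share of GDP than the share of land it occupies; below 1 means the opposite.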

data$Agriculture.Ratio <- data$Agriculture.GDP / data$Agriculture.Coverage
summary(data$Agriculture.Ratio)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##  0.008936  0.146700  0.349400  0.834700  0.619700 19.790000
ggplot(subset(data, Agriculture.Ratio < quantile(Agriculture.Ratio, probs = 0.95)),
       aes(x = Year, y = Agriculture.Ratio)) +
  geom_boxplot()

ggplot(subset(data, Agriculture.Ratio < quantile(Agriculture.Ratio, probs = 0.95)),
       aes(x = Year, y = Agriculture.Ratio)) +
  geom_boxplot() +
  facet_wrap(~ Continent)

Conclusion

Facts extracted so far:

# fun replaces the deprecated fun.y argument (ggplot2 >= 3.3.0)
ggplot(subset(data, Continent == "Asia"),
       aes(x = Year, y = Agriculture.Coverage, group = Continent)) +
  geom_line(stat = "summary", aes(color = Continent), fun = mean)

ggplot(subset(data, Continent == "Asia"),
       aes(x = Year, y = Agriculture.GDP, group = Continent)) +
  geom_line(stat = "summary", aes(color = Continent), fun = mean)