In this new analysis I will add two variables:
The main goal of the analysis is to explore the relation between all the variables across all the continents.
The data was downloaded from Gapminder.
There are 3 datasets:
Note: Since the goal of this course is not getting and cleaning data, I’ve manually removed years from the Agriculture data to make it easier to read and clean. This is not a recommended practice.
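# Packages used throughout the analysis
library(reshape2) # melt(); assumed to be reshape2, the original does not show which package was loaded
library(ggplot2)  # ggplot()
library(psych)    # pairs.panels()
a.coverage <- read.csv("./data//agriculture land.csv",
                       col.names = c("Country","1990","2000","2005"))
f.coverage <- read.csv("./data//indicator_forest coverage.csv",
                       col.names = c("Country","1990","2000","2005"))
a.gdp <- read.csv("./data//agriculture p of GDP.csv",
                  col.names = c("Country","1990","2000","2005"))
continents <- read.csv("./data//Countries-Continents.csv")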
a.coverage <- melt(a.coverage, id.vars = c("Country"))
names(a.coverage) <- c("Country","Year","Agriculture.Coverage")
f.coverage <- melt(f.coverage, id.vars = c("Country"))
names(f.coverage) <- c("Country","Year","Forest.Coverage")
a.gdp <- melt(a.gdp, id.vars = c("Country"))
names(a.gdp) <- c("Country","Year","Agriculture.GDP")
data <- merge(a.coverage,f.coverage, by=c("Country","Year"))
data <- merge(data, a.gdp, by =c("Country", "Year"))
data <- merge(data, continents, by = c("Country"))
data$Year <- gsub("X","",data$Year)
data$Year <- as.factor(data$Year)
data$Agriculture.Coverage <- gsub(",",".",data$Agriculture.Coverage)
data$Agriculture.Coverage <- as.numeric(data$Agriculture.Coverage)
data$Forest.Coverage <- gsub(",",".",data$Forest.Coverage)
data$Forest.Coverage <- as.numeric(data$Forest.Coverage)
data$Agriculture.GDP <- gsub(",",".", data$Agriculture.GDP)
data$Agriculture.GDP <- as.numeric(data$Agriculture.GDP)
data <- data[complete.cases(data),]
rm(a.coverage,f.coverage, a.gdp, continents)
head(data)
## Country Year Agriculture.Coverage Forest.Coverage Agriculture.GDP
## 2 Afghanistan 2005 58.12367 1.33 39.480416
## 4 Albania 2005 39.30657 28.98 22.800000
## 5 Albania 1990 40.91241 28.80 35.900790
## 6 Albania 2000 41.75182 28.07 29.130583
## 7 Algeria 2000 16.80326 0.90 8.879884
## 8 Algeria 1990 16.23855 0.75 11.358267
## Continent
## 2 Asia
## 4 Europe
## 5 Europe
## 6 Europe
## 7 Africa
## 8 Africa
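The cleaning code above repeats the same two steps for each measure: replace the decimal comma with a point, then cast to numeric. A minimal sketch of how that pattern could be folded into one helper follows. The name `to_numeric` is hypothetical and not part of the original analysis, and the input strings are made up for illustration.

```r
# Hypothetical helper: convert strings that use a decimal comma to numeric.
to_numeric <- function(x) {
  # fixed = TRUE treats "," as a literal character rather than a regex
  as.numeric(gsub(",", ".", as.character(x), fixed = TRUE))
}

# Illustrative input only, not taken from the Gapminder files
to_numeric(c("58,12", "1,33", "39.48"))
```

In the analysis it could be applied as `data[cols] <- lapply(data[cols], to_numeric)`, where `cols` would hold the names of the three measure columns.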
The data now seems ready to be processed.
Let’s draw some exploratory charts, nothing too formal, just to take a look at the data.
pairs.panels(data)
There isn’t much correlation between the variables. The most interesting relation is between Agriculture.Coverage and Forest.Coverage, as I analyzed in my previous exercise. Judging by the Agriculture.GDP histogram, it seems that only a few countries depend heavily on agriculture.
Let’s take a closer look at Agriculture.GDP in 2005:
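To read the correlation coefficients as plain numbers instead of from the pairs plot, base R's `cor()` could be used. The sketch below runs on a toy data frame built from the six rows printed by `head(data)`, so the values it returns say nothing about the full dataset. The names `toy` and `cor_matrix` are introduced here for illustration only.

```r
# Toy data: the six rows printed by head(data), numeric columns only
toy <- data.frame(
  Agriculture.Coverage = c(58.12367, 39.30657, 40.91241, 41.75182, 16.80326, 16.23855),
  Forest.Coverage      = c(1.33, 28.98, 28.80, 28.07, 0.90, 0.75),
  Agriculture.GDP      = c(39.480416, 22.800000, 35.900790, 29.130583, 8.879884, 11.358267)
)

# Pearson correlation matrix of the numeric columns;
# use = "complete.obs" drops rows that contain missing values
cor_matrix <- cor(toy, use = "complete.obs")
round(cor_matrix, 2)
```

On the real data the same call would be applied to the three numeric columns of `data`.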
summary(subset(data, Year == 2005)$Agriculture.GDP)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0542 3.5210 10.1300 13.8400 21.2100 51.5700
ggplot(subset(data, Year == 2005), aes(x = Agriculture.GDP)) +
  geom_histogram(binwidth = 1)
ggplot(subset(data, Year == 2005), aes(x = Agriculture.GDP)) +
  geom_histogram(aes(fill = Continent), binwidth = 1) +
  facet_wrap(~ Continent)
ggplot(data, aes(x = Year, y = Agriculture.Coverage, group = Continent)) +
  geom_line(stat = "summary", aes(color = Continent), fun.y = mean) +
  ggtitle("Agriculture Coverage mean per Continent")
ggplot(data, aes(x = Year, y = Forest.Coverage, group = Continent)) +
  geom_line(stat = "summary", aes(color = Continent), fun.y = mean) +
  ggtitle("Forest Coverage mean per Continent")
ggplot(data, aes(x = Year, y = Agriculture.GDP, group = Continent)) +
  geom_line(stat = "summary", aes(color = Continent), fun.y = mean) +
  ggtitle("Agriculture % of GDP mean per Continent")
It looks like agriculture’s share of GDP decreased worldwide from 1990 to 2005, even though Agriculture Coverage increased in Africa, Asia, North America, and South America.
Can Agriculture.GDP divided by Agriculture.Coverage be used as a measure of farm performance?
If two countries have the same Agriculture.Coverage but different Agriculture.GDP, that could imply that the country with the larger Agriculture.GDP makes better use of its soil. If 10% of my land produces 20% of my GDP, then I’m making better use of the soil than a country that uses 10% of its land to produce 5% of its GDP.
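The means drawn in the line charts can also be tabulated, which makes the change per continent easier to check than reading it off a plot. A minimal sketch with base R's `aggregate()` follows. It uses only the six rows printed by `head(data)`, so each group here contains a single country and the "mean" is just that country's value. The names `toy_gdp` and `means` are introduced for illustration.

```r
# Toy data: the six rows printed by head(data)
toy_gdp <- data.frame(
  Continent       = c("Asia", "Europe", "Europe", "Europe", "Africa", "Africa"),
  Year            = factor(c("2005", "2005", "1990", "2000", "2000", "1990")),
  Agriculture.GDP = c(39.480416, 22.800000, 35.900790, 29.130583, 8.879884, 11.358267)
)

# Mean Agriculture.GDP for every Continent/Year combination
means <- aggregate(Agriculture.GDP ~ Continent + Year, data = toy_gdp, FUN = mean)

# Sort so that each continent's years appear next to each other
means[order(means$Continent, means$Year), ]
```

Run on the full `data` frame, the same formula would give one row per continent and year, matching the points in the third line chart.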
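The two-country comparison in that paragraph can be worked through directly. The countries "A" and "B" and the data frame `example` are hypothetical, using only the percentages quoted in the text.

```r
# Hypothetical countries with the percentages from the paragraph
example <- data.frame(
  Country              = c("A", "B"),
  Agriculture.Coverage = c(10, 10),  # % of land used for agriculture
  Agriculture.GDP      = c(20, 5)    # % of GDP coming from agriculture
)

# GDP share produced per unit of land share
example$Agriculture.Ratio <- example$Agriculture.GDP / example$Agriculture.Coverage
example
```

Country A gets a ratio of 2 and country B a ratio of 0.5, so under this measure A is making four times better use of the same share of land.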
data$Agriculture.Ratio <- data$Agriculture.GDP / data$Agriculture.Coverage
summary(data$Agriculture.Ratio)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.008936 0.146700 0.349400 0.834700 0.619700 19.790000
ggplot(subset(data, Agriculture.Ratio < quantile(Agriculture.Ratio, probs = 0.95)),
       aes(x = Year, y = Agriculture.Ratio)) +
  geom_boxplot()
ggplot(subset(data, Agriculture.Ratio < quantile(Agriculture.Ratio, probs = 0.95)),
       aes(x = Year, y = Agriculture.Ratio)) +
  geom_boxplot() +
  facet_wrap(~ Continent)
Facts extracted so far:
ggplot(subset(data, Continent == "Asia"),
       aes(x = Year, y = Agriculture.Coverage, group = Continent)) +
  geom_line(stat = "summary", aes(color = Continent), fun.y = mean)
ggplot(subset(data, Continent == "Asia"),
       aes(x = Year, y = Agriculture.GDP, group = Continent)) +
  geom_line(stat = "summary", aes(color = Continent), fun.y = mean)