In my last analysis I went through the Forest Coverage in the main countries of the world. Now I’m practicing two-variable analysis, so I will add the Agriculture Coverage of each country to analyze the relationship between deforestation and agriculture in each country.
My main concern: Is the World deforesting in order to have more land for Agriculture?
Data has been obtained and downloaded from Gapminder.
There are 2 datasets: Agricultural land (% of land area) and Forest coverage (%).
Note: Since the goal of this course is not Getting and Cleaning data, I’ve manually removed years from the Agriculture data to make it easier to read and clean. This is not a recommended practice.
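For reference, the same trimming could be done in code rather than by hand. This is only a minimal sketch: it assumes the raw file has one column per year that read.csv would name X1989, X1990 and so on, and the toy table and its numbers are made up for illustration.

```r
# Toy stand-in for the raw Gapminder table (all values are invented)
raw <- data.frame(
  Country = c("A", "B"),
  X1989 = c(10.0, 20.0),
  X1990 = c(10.5, 20.5),
  X2000 = c(11.0, 21.0),
  X2005 = c(11.5, 21.5),
  X2010 = c(12.0, 22.0)
)

# Keep only the country column and the three years used in this analysis
years <- c(1990, 2000, 2005)
keep <- c("Country", paste0("X", years))
trimmed <- raw[, keep]
trimmed
```

Selecting columns by name keeps the original file untouched and makes the choice of years visible in the script.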
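library(reshape2)   # melt()
library(ggplot2)    # all plots
library(gridExtra)  # grid.arrange()

# read.csv turns the numeric column names into X1990, X2000, X2005
a.coverage <- read.csv("./data/agriculture land.csv",
                       col.names = c("Country","1990","2000","2005"))
f.coverage <- read.csv("./data/indicator_forest coverage.csv",
                       col.names = c("Country","1990","2000","2005"))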
a.coverage <- melt(a.coverage, id.vars = c("Country"))
names(a.coverage) <- c("Country","Year","Agriculture.Coverage")
f.coverage <- melt(f.coverage, id.vars = c("Country"))
names(f.coverage) <- c("Country","Year","Forest.Coverage")
data <- merge(a.coverage,f.coverage, by=c("Country","Year"))
# Drop the "X" prefix that read.csv added to the year columns
data$Year <- gsub("X","",data$Year)
data$Year <- as.factor(data$Year)
# Replace any decimal commas with decimal points before parsing the numbers
data$Agriculture.Coverage <- gsub(",",".",data$Agriculture.Coverage)
data$Agriculture.Coverage <- as.numeric(data$Agriculture.Coverage)
data$Forest.Coverage <- gsub(",",".",data$Forest.Coverage)
data$Forest.Coverage <- as.numeric(data$Forest.Coverage)
rm(a.coverage,f.coverage)
head(data)
## Country Year Agriculture.Coverage Forest.Coverage
## 1 Afghanistan 1990 58.32298 2.01
## 2 Afghanistan 2000 57.88296 1.56
## 3 Afghanistan 2005 58.12367 1.33
## 4 Albania 1990 40.91241 28.80
## 5 Albania 2000 41.75182 28.07
## 6 Albania 2005 39.30657 28.98
Wow! I’m already learning something. I didn’t know that more than half of Afghanistan’s land counts as agricultural land.
p1 <- ggplot(data, aes(x=Forest.Coverage)) + geom_histogram(binwidth = 3) +
facet_wrap(~ Year) +
ggtitle("Forest Coverage around the World per Year")
p2 <- ggplot(data, aes(x=Agriculture.Coverage)) + geom_histogram(binwidth = 3) +
facet_wrap(~ Year) +
ggtitle("Agriculture Coverage around the World per Year")
grid.arrange(p1,p2,ncol = 1)
It looks like the Agriculture Coverage tends toward a normal distribution. It’s a pity I don’t have data from the last 10 years!
ggplot(data, aes(x = Year, y = Forest.Coverage, fill = Year)) +
  geom_boxplot(alpha = 0.5) +
  scale_y_continuous(breaks = seq(0,100,5)) +
  coord_cartesian(ylim = c(0,100)) +
  ylab("Forest Coverage (%)") +
  ggtitle("Forest Coverage (%) by Year")
ggplot(data, aes(x = Year, y = Agriculture.Coverage, fill = Year)) +
  geom_boxplot(alpha = 0.5) +
  scale_y_continuous(breaks = seq(0,100,5)) +
  coord_cartesian(ylim = c(0,100)) +
  ylab("Agriculture Coverage (%)") +
  ggtitle("Agriculture Coverage (%) by Year")
Regarding Agriculture, it seems that 75% of the countries have 20% or more Agriculture Coverage, and the median is almost 40%.
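Reading quartiles off a boxplot is approximate, so the same figures could be checked numerically. A minimal sketch on an invented vector (not the real data), using base R's quantile():

```r
# Invented coverage values standing in for one year of Agriculture.Coverage
agri <- c(5, 20, 35, 40, 45, 60, 80, NA)

# First quartile, median and third quartile, ignoring missing values
q <- quantile(agri, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
q

# Share of (non-missing) countries with at least 20% coverage
share_20 <- mean(agri >= 20, na.rm = TRUE)
share_20
```

On the real data the equivalent call would be something like quantile(subset(data, Year == 2005)$Agriculture.Coverage, na.rm = TRUE).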
argentina <- subset(data, Country == "Argentina")
argentina
## Country Year Agriculture.Coverage Forest.Coverage
## 22 Argentina 1990 46.54893 12.88
## 23 Argentina 2000 47.05319 12.34
## 24 Argentina 2005 49.11042 12.07
Agriculture Coverage in Argentina grew by about 2.6 percentage points between 1990 and 2005 (from 46.5% to 49.1% of the land).
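Since the variable is already a percentage, it is worth separating the change in percentage points from the relative change. A small worked example using the two Argentina values printed above:

```r
# Values copied from the Argentina rows above
agri_1990 <- 46.54893
agri_2005 <- 49.11042

# Absolute change, in percentage points of land area
abs_change <- agri_2005 - agri_1990

# Relative change, as a percentage of the 1990 value
rel_change <- 100 * abs_change / agri_1990

round(abs_change, 2)  # 2.56
round(rel_change, 1)  # 5.5
```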
ggplot(argentina) +
  geom_line(aes(x=Year,y=Agriculture.Coverage, group = Country), stat= "identity",
            size = 1, color = "red") +
  geom_boxplot(data = data , aes(x=Year,y=Agriculture.Coverage, fill = Year), alpha = 0.25) +
  ggtitle("Agriculture Coverage (%) Changes in Argentina between 1990 and 2005\ncompared with the rest of the World") +
  ylab("Agriculture Coverage (%)")
It’s not surprising to see Argentina closer to the 75% quantile than to the median: Argentina has a long agricultural tradition. Also remember that Argentina is the 8th biggest country in the world, which gives it a lot of square kilometers dedicated to agriculture.
Now let’s start plotting some scatterplots:
ggplot(data, aes(x=Forest.Coverage, y=Agriculture.Coverage)) + geom_point()
…. Okay…. No idea how to read this plot. Let’s try plotting only one year, let’s say, the last one:
ggplot(subset(data, Year == 2005),
aes(x=Forest.Coverage, y=Agriculture.Coverage)) +
geom_point() +
ggtitle("Forest Coverage vs Agriculture Coverage (2005)")
Still no idea… :_)
But I can see a few interesting things:
* There is a diagonal trend with negative slope
* There seems to be a vertical line near 0% of Forest Coverage
* There are only a few points above the diagonal
Regarding the third point… There shouldn’t be points above the main diagonal: that would mean that Agriculture Coverage + Forest Coverage is bigger than 100% of the land. Two options: bad data, or some countries use their forest as agriculture land… Let’s dig a little bit.
ggplot(subset(data, Year == 2005),
aes(x=Forest.Coverage, y=Agriculture.Coverage)) +
geom_point() +
ggtitle("Forest Coverage vs Agriculture Coverage (2005)") +
geom_abline(intercept = 100, slope = -1, linetype = 2, color = "red") +
scale_y_continuous(breaks = seq(0,100,5)) +
scale_x_continuous(breaks = seq(0,100,5))
subset(data, Forest.Coverage + Agriculture.Coverage > 100)
## Country Year Agriculture.Coverage Forest.Coverage
## 10 American Samoa 1990 20.000000 90.00
## 11 American Samoa 2000 25.000000 90.00
## 12 American Samoa 2005 25.000000 90.00
## 202 Gabon 1990 20.013971 85.10
## 203 Gabon 2000 20.025614 84.71
## 204 Gabon 2005 19.947995 84.51
## 205 Gambia 1990 63.700000 44.20
## 206 Gambia 2000 68.000000 46.10
## 207 Gambia 2005 63.500000 47.10
## 235 Guinea-Bissau 1990 51.458037 78.81
## 236 Guinea-Bissau 2000 57.894737 75.39
## 237 Guinea-Bissau 2005 57.610242 73.68
## 374 Micronesia, Fed. Sts. 2000 31.428571 90.00
## 375 Micronesia, Fed. Sts. 2005 31.428571 90.00
## 508 Solomon Islands 1990 2.429439 98.89
## 530 Swaziland 2000 71.104651 30.12
## 531 Swaziland 2005 71.162791 31.45
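To see how far over 100% these rows go, the two coverages can be added up explicitly. A minimal sketch using three rows copied from the output above (the Total and Excess columns are names I made up for illustration):

```r
# Three rows copied from the subset() output above
over <- data.frame(
  Country = c("Gabon", "Gambia", "Solomon Islands"),
  Year = c(2005, 2005, 1990),
  Agriculture.Coverage = c(19.947995, 63.5, 2.429439),
  Forest.Coverage = c(84.51, 47.1, 98.89)
)

# Combined coverage and how much it exceeds 100% of the land
over$Total <- over$Agriculture.Coverage + over$Forest.Coverage
over$Excess <- over$Total - 100

# Largest excess first
over[order(-over$Excess), c("Country", "Year", "Total", "Excess")]
```

Gambia is more than 10 points over, which looks too large to blame on rounding alone, so overlapping land categories or inconsistent sources seem more plausible.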
It’s logical to expect Forest Coverage and Agriculture Coverage to have a negative correlation; after all, you need to deforest in order to have more land to grow crops. Let’s see if there is a correlation between Forest Coverage and Agriculture Coverage for 2005 (removing the outliers, that is, the rows where both coverages add up to 100% or more):
with(subset(data, Year == 2005 & Forest.Coverage + Agriculture.Coverage < 100),
cor.test(Forest.Coverage,Agriculture.Coverage))
##
## Pearson's product-moment correlation
##
## data: Forest.Coverage and Agriculture.Coverage
## t = -8.6851, df = 189, p-value = 1.766e-15
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.6284219 -0.4242911
## sample estimates:
## cor
## -0.5340962
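As a sanity check on how to read this output, the t statistic can be recomputed from the correlation and the degrees of freedom with t = r * sqrt(df) / sqrt(1 - r^2), and r^2 gives the share of variance a straight line would explain. A short worked example using only the numbers printed above:

```r
# Values copied from the cor.test output above
r  <- -0.5340962
df <- 189

# t statistic for testing that the true correlation is zero
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df)

# Share of variance explained by a linear fit
r_squared <- r^2

round(t_stat, 3)     # -8.685
round(r_squared, 3)  # 0.285
```

So the correlation is clearly different from zero, but Forest Coverage alone accounts for less than a third of the variation in Agriculture Coverage.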
ggplot(subset(data, Year == 2005 & Forest.Coverage + Agriculture.Coverage < 100),
aes(x=Forest.Coverage, y=Agriculture.Coverage)) +
geom_point() +
ggtitle("Forest Coverage vs Agriculture Coverage (2005)") +
geom_abline(intercept = 100, slope = -1, linetype = 2, color = "red") +
scale_y_continuous(breaks = seq(0,100,5)) +
scale_x_continuous(breaks = seq(0,100,5)) +
geom_smooth(method = "lm")
Let’s check South American countries:
south.america <- c("Argentina", "Chile", "Uruguay", "Paraguay", "Brazil",
"Peru", "Ecuador", "Bolivia", "Venezuela", "Colombia",
"Suriname", "French Guiana", "Guyana")
with(subset(data, Year == 2005 & Country %in% south.america),
cor.test(Forest.Coverage,Agriculture.Coverage))
##
## Pearson's product-moment correlation
##
## data: Forest.Coverage and Agriculture.Coverage
## t = -3.672, df = 10, p-value = 0.004303
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.928058 -0.325299
## sample estimates:
## cor
## -0.7577386
ggplot(subset(data, Year == 2005 & Country %in% south.america),
aes(x=Forest.Coverage, y=Agriculture.Coverage)) +
geom_point(aes(color = Country), size = 5, alpha = 1/2) +
ggtitle("Forest Coverage vs Agriculture Coverage (2005)") +
geom_abline(intercept = 100, slope = -1, linetype = 2, color = "red") +
scale_y_continuous(breaks = seq(0,100,5)) +
scale_x_continuous(breaks = seq(0,100,5)) +
geom_smooth(method = "lm")
The correlation in South American countries seems to be stronger than in the rest of the world.
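How much stronger is hard to say with only 12 countries. The confidence intervals reported by cor.test can be reproduced with the Fisher z-transformation, which makes it easy to check whether the two intervals overlap. A minimal sketch (fisher_ci is a helper name I made up; the sample sizes are df + 2 from the outputs above):

```r
# Confidence interval for a Pearson correlation via the Fisher z-transformation
fisher_ci <- function(r, n, level = 0.95) {
  z <- atanh(r)                                      # transform r to z
  half <- qnorm(1 - (1 - level) / 2) / sqrt(n - 3)   # half-width on the z scale
  tanh(c(z - half, z + half))                        # back to the r scale
}

world <- fisher_ci(-0.5340962, n = 191)  # df = 189
south <- fisher_ci(-0.7577386, n = 12)   # df = 10

round(world, 3)
round(south, 3)

# TRUE when the two intervals share at least one value
overlap <- max(world[1], south[1]) <= min(world[2], south[2])
overlap
```

This is only a rough comparison, since the world sample also contains the South American countries, but overlapping intervals mean the difference may be down to the small sample.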
Does it mean that countries in South America with less forest have more agricultural land? Does it mean that countries in South America deforest in order to have more land for agriculture?
The main question here is: Is there a correlation between the % of deforestation and the increase in Agriculture Coverage?
Country <- as.character(unique(data$Country))
Forest.Changes <- subset(data, Year == 2005)$Forest.Coverage - subset(data, Year == 1990)$Forest.Coverage
Agriculture.Changes <- subset(data, Year == 2005)$Agriculture.Coverage - subset(data, Year == 1990)$Agriculture.Coverage
countries.Changes <- data.frame(Country, Forest.Changes, Agriculture.Changes)
countries.Changes$Country <- factor(countries.Changes$Country,
levels = countries.Changes$Country)
summary(countries.Changes)
## Country Forest.Changes Agriculture.Changes
## Afghanistan : 1 Min. :-24.460 Min. :-40.7370
## Albania : 1 1st Qu.: -2.500 1st Qu.: -1.4486
## Algeria : 1 Median : 0.000 Median : 0.0000
## American Samoa: 1 Mean : -1.241 Mean : -0.1648
## Andorra : 1 3rd Qu.: 0.480 3rd Qu.: 2.2901
## Angola : 1 Max. : 14.480 Max. : 23.3690
## (Other) :201 NA's :10 NA's :31
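The subtraction above works by position, so it relies on every country having a row for both 1990 and 2005, in the same order. That holds here because melt keeps one row per country and year, but a more defensive version could match the rows by Country instead. A minimal sketch on invented data where country "B" has no 1990 row:

```r
# Invented long-format data; "B" is missing its 1990 row on purpose
toy <- data.frame(
  Country = c("A", "A", "B", "C", "C"),
  Year = c(1990, 2005, 2005, 1990, 2005),
  Forest.Coverage = c(50, 45, 30, 10, 12)
)

y1990 <- subset(toy, Year == 1990, select = c(Country, Forest.Coverage))
y2005 <- subset(toy, Year == 2005, select = c(Country, Forest.Coverage))

# Match rows by Country; all = TRUE keeps countries missing one of the years
changes <- merge(y1990, y2005, by = "Country", all = TRUE,
                 suffixes = c(".1990", ".2005"))
changes$Forest.Changes <- changes$Forest.Coverage.2005 - changes$Forest.Coverage.1990
changes
```

With a positional subtraction, the two subsets would have different lengths here and the values would be misaligned; with merge, "B" simply gets an NA change.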
It seems that there is a lot of missing data. Regarding Forest Coverage, it is not surprising to see those values, but is it possible that a country has changed its Agriculture Coverage by -40 percentage points?
Let’s look at the Worst 5%…
subset(countries.Changes,
Agriculture.Changes < quantile(Agriculture.Changes, probs = 0.05, na.rm = TRUE))
## Country Forest.Changes Agriculture.Changes
## 74 Greece 3.52 -40.73701
## 90 Ireland 3.31 -19.55291
## 119 Malta NA -12.50000
## 136 New Zealand 2.19 -16.97619
## 152 Puerto Rico 0.45 -26.26832
## 157 Saint Kitts and Nevis 0.00 -26.92308
## 158 Saint Lucia 0.00 -14.75410
## 192 Tuvalu 0.00 -33.33333
## 203 Virgin Islands (U.S.) -5.88 -11.42857
I can’t find anything about drastic changes in those countries, so I’m going to treat them as outliers. What about the outliers at the other end?
subset(countries.Changes,
Agriculture.Changes > quantile(Agriculture.Changes, probs = 0.95, na.rm = TRUE))
## Country Forest.Changes Agriculture.Changes
## 21 Benin -8.78 11.29995
## 43 Comoros -3.76 10.75269
## 53 Djibouti 0.00 17.34254
## 72 Ghana -8.48 10.96511
## 137 Nicaragua -11.12 10.59498
## 162 Sao Tome and Principe 0.00 13.54167
## 163 Saudi Arabia 0.00 23.36895
## 166 Sierra Leone -4.05 11.17006
## 202 Vietnam 10.96 11.76071
It seems that Saudi Arabia had an amazing increase in its agriculture without cutting even a single tree. Not so sure about the other outliers…
ggplot(subset(countries.Changes,
Agriculture.Changes > quantile(Agriculture.Changes, probs = 0.05, na.rm = TRUE)),
aes(x=Forest.Changes, y=Agriculture.Changes)) +
geom_point() +
geom_smooth(method = "lm") +
ggtitle("Forest Coverage Changes vs Agriculture Coverage Changes \n between 1990 and 2005")
ggplot(subset(countries.Changes,
Agriculture.Changes > quantile(Agriculture.Changes, probs = 0.05, na.rm = TRUE) &
Country %in% south.america),
aes(x=Forest.Changes, y=Agriculture.Changes)) +
geom_point(aes(color = Country), alpha = 1/2, size = 5) +
geom_smooth(method = "lm") +
ggtitle("Forest Coverage Changes vs Agriculture Coverage Changes \n between 1990 and 2005 \n in South America")
So, judging by the fitted lines, it seems that South America has a stronger negative correlation in Changes than the rest of the World. That could suggest that the main cause of deforestation there is agriculture (rather than population growth, cities, infrastructure, etc…), although a correlation alone cannot prove it.
Although there seems to be a negative correlation between Forest Coverage and Agriculture Coverage, mostly in South American countries, I feel I cannot rely on the data due to the strange outliers described above and the missing data.
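The plots only show fitted lines, so an actual number for the correlation between the two Changes columns would help. A minimal sketch of how that could be computed with cor.test, on invented values rather than the real countries.Changes table:

```r
# Invented changes (percentage points) for six hypothetical countries
toy.changes <- data.frame(
  Forest.Changes = c(-8, -4, 0, 2, NA, -11),
  Agriculture.Changes = c(11, 6, 1, -2, 3, 10)
)

# cor.test drops incomplete pairs, so the NA row is ignored
res <- with(toy.changes, cor.test(Forest.Changes, Agriculture.Changes))
res$estimate   # Pearson correlation of the complete pairs
res$parameter  # degrees of freedom (complete pairs minus 2)
```

On the real data the same call could be run on countries.Changes, once for all countries and once restricted to Country %in% south.america, to compare the two estimates and their confidence intervals.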