Introduction

In my last analysis I went through the Forest Coverage of the main countries of the world. Now I’m practicing two-variable analysis, so I will add the Agriculture Coverage of each country to analyze the relationship between deforestation and agriculture.

My main question: is the world deforesting in order to have more land for agriculture?

Getting and Cleaning Data

The data was downloaded from Gapminder.

There are 2 datasets: Agricultural land (% of land area) and Forest coverage (%).

Note: Since the goal of this course is not getting and cleaning data, I’ve manually removed the extra years from the Agriculture data to make it easier to read and clean. This is not a recommended practice.
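
For the record, the same trimming can be done in code instead of by hand. This is a minimal sketch, assuming a wide table with one column per year; the `wide` frame and its values are made up for illustration, and the `X` prefix mimics what `read.csv` does to numeric headers by default:

```r
# Toy stand-in for the wide Gapminder file: one row per country,
# one column per year (values are invented).
wide <- data.frame(Country = c("A", "B"),
                   X1989 = c(10, 20), X1990 = c(11, 21),
                   X2000 = c(12, 22), X2005 = c(13, 23),
                   X2006 = c(14, 24))

# Keep the country column plus only the years present in both datasets.
years  <- c(1990, 2000, 2005)
keep   <- c("Country", paste0("X", years))
narrow <- wide[, keep]

names(narrow)
```

Selecting by name keeps the raw file untouched and makes the choice of years visible in the script.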

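# melt() is assumed to come from reshape2; the plots need ggplot2 and gridExtra
library(reshape2)
library(ggplot2)
library(gridExtra)

a.coverage <- read.csv("./data/agriculture land.csv",
                   col.names = c("Country","1990","2000","2005"))
f.coverage <- read.csv("./data/indicator_forest coverage.csv",
                   col.names = c("Country","1990","2000","2005"))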

a.coverage <- melt(a.coverage, id.vars = c("Country"))
names(a.coverage) <- c("Country","Year","Agriculture.Coverage")
f.coverage <- melt(f.coverage, id.vars = c("Country"))
names(f.coverage) <- c("Country","Year","Forest.Coverage")

data <- merge(a.coverage,f.coverage, by=c("Country","Year"))

data$Year <- gsub("X","",data$Year)
data$Year <- as.factor(data$Year)
# The source files use a decimal comma; swap it for a point before parsing
data$Agriculture.Coverage <- gsub(",",".",data$Agriculture.Coverage)
data$Agriculture.Coverage <- as.numeric(data$Agriculture.Coverage)
data$Forest.Coverage <- gsub(",",".",data$Forest.Coverage)
data$Forest.Coverage <- as.numeric(data$Forest.Coverage)

rm(a.coverage,f.coverage)

head(data)
##       Country Year Agriculture.Coverage Forest.Coverage
## 1 Afghanistan 1990             58.32298            2.01
## 2 Afghanistan 2000             57.88296            1.56
## 3 Afghanistan 2005             58.12367            1.33
## 4     Albania 1990             40.91241           28.80
## 5     Albania 2000             41.75182           28.07
## 6     Albania 2005             39.30657           28.98

Wow! I’m already learning something. I didn’t know that more than half of Afghanistan’s land was agricultural.

Data Processing

First some basic histograms of the Variables:

p1 <- ggplot(data, aes(x=Forest.Coverage)) + geom_histogram(binwidth = 3) + 
  facet_wrap(~ Year) +
  ggtitle("Forest Coverage around the World per Year")

p2 <- ggplot(data, aes(x=Agriculture.Coverage)) + geom_histogram(binwidth = 3) + 
  facet_wrap(~ Year) +
  ggtitle("Agriculture Coverage around the World per Year")

grid.arrange(p1,p2,ncol = 1)

It looks like the Agriculture Coverage tends toward a normal distribution. It’s a pity I don’t have data from the last 10 years!

Let’s plot a few boxplots:

ggplot(data, aes(x = Year, y = Forest.Coverage, fill = Year)) + 
  geom_boxplot(alpha = 0.5) + 
  scale_y_continuous(breaks = seq(0,100,5)) +
  coord_cartesian(ylim = c(0,100)) + 
  ylab("Forest Coverage (%)") + 
  ggtitle("Forest Coverage (%) by Year")

ggplot(data, aes(x = Year, y = Agriculture.Coverage, fill = Year)) + 
  geom_boxplot(alpha = 0.5) + 
  scale_y_continuous(breaks = seq(0,100,5)) +
  coord_cartesian(ylim = c(0,100)) + 
  ylab("Agriculture Coverage (%)") + 
  ggtitle("Agriculture Coverage (%) by Year")

Regarding agriculture, it seems that 75% of the countries have 20% or more Agriculture Coverage, and the median is almost 40%.
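
Those figures are read off the boxplot by eye. The quartiles can also be computed directly; here is a minimal sketch on a made-up frame (the `toy` values are invented and only show the mechanics, they are not the Gapminder data):

```r
# Invented coverage values for two years, five "countries" each.
toy <- data.frame(
  Year = factor(rep(c("1990", "2005"), each = 5)),
  Agriculture.Coverage = c(10, 20, 40, 55, 70, 12, 22, 38, 57, 72))

# 25% quantile and median per year; na.rm = TRUE because the real data has gaps.
q <- sapply(split(toy$Agriculture.Coverage, toy$Year),
            quantile, probs = c(0.25, 0.5), na.rm = TRUE)
q
```

Saying that 75% of the countries sit at or above some value is the same as reading the 25% quantile, which is the lower edge of each box.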

Argentina’s Agriculture:

argentina <- subset(data, Country == "Argentina")
argentina
##      Country Year Agriculture.Coverage Forest.Coverage
## 22 Argentina 1990             46.54893           12.88
## 23 Argentina 2000             47.05319           12.34
## 24 Argentina 2005             49.11042           12.07

Agriculture Coverage in Argentina grew by about 2.6 percentage points between 1990 and 2005 (from 46.5% to 49.1%).
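
A change in percentage points and a relative change are easy to blur here. A quick check using the two Argentina values from the table above (the variable names are just for illustration):

```r
# Values copied from the Argentina rows above
agri.1990 <- 46.54893
agri.2005 <- 49.11042

abs.change <- agri.2005 - agri.1990         # change in percentage points
rel.change <- 100 * abs.change / agri.1990  # change relative to the 1990 level, in %

round(c(points = abs.change, relative = rel.change), 2)
```

The first number is a share of the country's land; the second says how much the agricultural area itself grew compared with 1990.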

ggplot(argentina) + 
  geom_line(aes(x=Year,y=Agriculture.Coverage, group = Country), stat= "identity",
            size = 1, color = "red") +
  geom_boxplot(data = data , aes(x=Year,y=Agriculture.Coverage, fill = Year), alpha = 0.25) + 
  ggtitle("Agriculture Coverage (%) Changes in Argentina between 1990 and 2005\ncompared with the rest of the World") + 
  ylab("Agriculture Coverage (%)") 

It’s not surprising to see Argentina closer to the 75% quantile than to the median: Argentina has a long agricultural tradition. Also remember that Argentina is the 8th largest country in the world, so that share amounts to a lot of square kilometers dedicated to agriculture.

Agriculture Coverage vs Forest Coverage

Now let’s start plotting some scatterplots:

ggplot(data, aes(x=Forest.Coverage, y=Agriculture.Coverage)) + geom_point()

…. Okay…. No idea how to read this plot. Let’s try plotting only one year, let’s say the last one:

ggplot(subset(data, Year == 2005),
       aes(x=Forest.Coverage, y=Agriculture.Coverage)) + 
  geom_point() +
  ggtitle("Forest Coverage vs Agriculture Coverage (2005)")

Still no idea… :_)

But I can see a few interesting things:
* There is a diagonal trend with a negative slope
* There seems to be a vertical line near 0% Forest Coverage
* There are only a few points above the diagonal

Regarding the third point… There shouldn’t be any points above the main diagonal, since that would mean that Agriculture Coverage + Forest Coverage adds up to more than 100% of the land. Two options: bad data, or some countries use their forest as agricultural land… Let’s dig a little deeper.

ggplot(subset(data, Year == 2005),
       aes(x=Forest.Coverage, y=Agriculture.Coverage)) + 
  geom_point() +
  ggtitle("Forest Coverage vs Agriculture Coverage (2005)") + 
  geom_abline(intercept = 100, slope = -1, linetype = 2, color = "red") +
  scale_y_continuous(breaks = seq(0,100,5)) +
  scale_x_continuous(breaks = seq(0,100,5)) 

subset(data, Forest.Coverage + Agriculture.Coverage > 100)
##                   Country Year Agriculture.Coverage Forest.Coverage
## 10         American Samoa 1990            20.000000           90.00
## 11         American Samoa 2000            25.000000           90.00
## 12         American Samoa 2005            25.000000           90.00
## 202                 Gabon 1990            20.013971           85.10
## 203                 Gabon 2000            20.025614           84.71
## 204                 Gabon 2005            19.947995           84.51
## 205                Gambia 1990            63.700000           44.20
## 206                Gambia 2000            68.000000           46.10
## 207                Gambia 2005            63.500000           47.10
## 235         Guinea-Bissau 1990            51.458037           78.81
## 236         Guinea-Bissau 2000            57.894737           75.39
## 237         Guinea-Bissau 2005            57.610242           73.68
## 374 Micronesia, Fed. Sts. 2000            31.428571           90.00
## 375 Micronesia, Fed. Sts. 2005            31.428571           90.00
## 508       Solomon Islands 1990             2.429439           98.89
## 530             Swaziland 2000            71.104651           30.12
## 531             Swaziland 2005            71.162791           31.45

Correlation

It’s logical to expect Forest Coverage and Agriculture Coverage to be negatively correlated; after all, you need to deforest in order to have more land to grow crops. Let’s see if there is a correlation between Forest Coverage and Agriculture Coverage for 2005 (removing the outliers above, i.e. countries where the two add up to 100% or more):

with(subset(data, Year == 2005  & Forest.Coverage + Agriculture.Coverage < 100),
     cor.test(Forest.Coverage,Agriculture.Coverage))
## 
##  Pearson's product-moment correlation
## 
## data:  Forest.Coverage and Agriculture.Coverage
## t = -8.6851, df = 189, p-value = 1.766e-15
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.6284219 -0.4242911
## sample estimates:
##        cor 
## -0.5340962

ggplot(subset(data, Year == 2005 & Forest.Coverage + Agriculture.Coverage < 100),
       aes(x=Forest.Coverage, y=Agriculture.Coverage)) + 
  geom_point() +
  ggtitle("Forest Coverage vs Agriculture Coverage (2005)") + 
  geom_abline(intercept = 100, slope = -1, linetype = 2, color = "red") +
  scale_y_continuous(breaks = seq(0,100,5)) +
  scale_x_continuous(breaks = seq(0,100,5)) +
  geom_smooth(method = "lm")
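
The t statistic and p-value in the `cor.test` output follow directly from the correlation and the degrees of freedom. A short sketch of that arithmetic, using the reported values (the variable names are mine):

```r
# Values copied from the cor.test() output above
r  <- -0.5340962
df <- 189

# Under the null hypothesis of zero correlation, this statistic
# follows a t distribution with df degrees of freedom.
t.stat  <- r * sqrt(df) / sqrt(1 - r^2)
p.value <- 2 * pt(-abs(t.stat), df)   # two-sided test

c(t = t.stat, p = p.value, r.squared = r^2)
```

The squared correlation is about 0.29, so a straight line accounts for less than a third of the variance in Agriculture Coverage: the relationship is clearly negative, but far from tight.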

South America Countries

Let’s check South America Countries:

south.america <- c("Argentina", "Chile", "Uruguay", "Paraguay", "Brazil",
                   "Peru", "Ecuador", "Bolivia", "Venezuela", "Colombia",
                   "Suriname", "French Guiana", "Guyana")

with(subset(data, Year == 2005 & Country %in% south.america),
     cor.test(Forest.Coverage,Agriculture.Coverage))
## 
##  Pearson's product-moment correlation
## 
## data:  Forest.Coverage and Agriculture.Coverage
## t = -3.672, df = 10, p-value = 0.004303
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.928058 -0.325299
## sample estimates:
##        cor 
## -0.7577386

ggplot(subset(data, Year == 2005 & Country %in% south.america),
       aes(x=Forest.Coverage, y=Agriculture.Coverage)) + 
  geom_point(aes(color = Country), size = 5, alpha = 1/2) +
  ggtitle("Forest Coverage vs Agriculture Coverage (2005)") +
  geom_abline(intercept = 100, slope = -1, linetype = 2, color = "red") +
  scale_y_continuous(breaks = seq(0,100,5)) +
  scale_x_continuous(breaks = seq(0,100,5)) +
  geom_smooth(method = "lm")

The correlation among South American countries seems to be stronger than in the world as a whole (-0.76 vs. -0.53), although it is based on only 12 countries.
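
How much weight can so few countries carry? The confidence interval that `cor.test` reports for a Pearson correlation comes from the Fisher z-transformation, and it can be reproduced by hand from the output above (the variable names are mine):

```r
# Values taken from the South America cor.test() output above
r <- -0.7577386
n <- 12                 # df + 2

z  <- atanh(r)          # Fisher z-transform of the correlation
se <- 1 / sqrt(n - 3)   # approximate standard error on the z scale

# 95% interval on the z scale, transformed back to the correlation scale
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)
ci
```

The resulting interval is wide and overlaps the worldwide one (roughly -0.63 to -0.42), so the data cannot rule out that South America simply follows the global pattern.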

Does it mean that countries in South America with less forest have more agricultural land? Does it mean that countries in South America deforest in order to have more land for agriculture?

Correlation between changes in Forest Coverage and Changes in Agriculture Coverage

The main question here is: is there a correlation between the amount of deforestation and the increase in Agriculture Coverage?

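# The differences below pair rows by position, so first check that both years
# list the same countries in the same order
d1990 <- subset(data, Year == 1990)
d2005 <- subset(data, Year == 2005)
stopifnot(identical(as.character(d1990$Country), as.character(d2005$Country)))

Country <- as.character(d1990$Country)
Forest.Changes <- d2005$Forest.Coverage - d1990$Forest.Coverage
Agriculture.Changes <- d2005$Agriculture.Coverage - d1990$Agriculture.Coverage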

countries.Changes <- data.frame(Country, Forest.Changes, Agriculture.Changes)
countries.Changes$Country <- factor(countries.Changes$Country,
                                      levels = countries.Changes$Country) 

summary(countries.Changes)
##            Country    Forest.Changes    Agriculture.Changes
##  Afghanistan   :  1   Min.   :-24.460   Min.   :-40.7370   
##  Albania       :  1   1st Qu.: -2.500   1st Qu.: -1.4486   
##  Algeria       :  1   Median :  0.000   Median :  0.0000   
##  American Samoa:  1   Mean   : -1.241   Mean   : -0.1648   
##  Andorra       :  1   3rd Qu.:  0.480   3rd Qu.:  2.2901   
##  Angola        :  1   Max.   : 14.480   Max.   : 23.3690   
##  (Other)       :201   NA's   :10        NA's   :31

It seems that there is a lot of missing data. Regarding Forest Coverage, it is not surprising to see those values, but is it possible that a country has changed its Agriculture Coverage by -40 percentage points?

Let’s look at the bottom 5%…

subset(countries.Changes, 
       Agriculture.Changes < quantile(Agriculture.Changes, probs = 0.05, na.rm = TRUE))
##                   Country Forest.Changes Agriculture.Changes
## 74                 Greece           3.52           -40.73701
## 90                Ireland           3.31           -19.55291
## 119                 Malta             NA           -12.50000
## 136           New Zealand           2.19           -16.97619
## 152           Puerto Rico           0.45           -26.26832
## 157 Saint Kitts and Nevis           0.00           -26.92308
## 158           Saint Lucia           0.00           -14.75410
## 192                Tuvalu           0.00           -33.33333
## 203 Virgin Islands (U.S.)          -5.88           -11.42857

I couldn’t find anything about drastic changes in those countries, so I’m going to treat them as outliers. What about the outliers at the other end?

subset(countries.Changes, 
       Agriculture.Changes > quantile(Agriculture.Changes, probs = 0.95, na.rm = TRUE))
##                   Country Forest.Changes Agriculture.Changes
## 21                  Benin          -8.78            11.29995
## 43                Comoros          -3.76            10.75269
## 53               Djibouti           0.00            17.34254
## 72                  Ghana          -8.48            10.96511
## 137             Nicaragua         -11.12            10.59498
## 162 Sao Tome and Principe           0.00            13.54167
## 163          Saudi Arabia           0.00            23.36895
## 166          Sierra Leone          -4.05            11.17006
## 202               Vietnam          10.96            11.76071

It seems that Saudi Arabia had an amazing increase in its agriculture without cutting down a single tree. I’m not so sure about the other outliers…
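
If both tails were to be treated as outliers, one option is to keep only the rows that fall between the 5% and 95% quantiles. This is a minimal sketch on a made-up frame; the `toy` values are invented and stand in for `countries.Changes`:

```r
# Invented changes: two extreme values, one missing value, the rest moderate.
toy <- data.frame(
  Country = LETTERS[1:21],
  Agriculture.Changes = c(-40, -9:8, 23, NA))

# Quantile cut-offs, ignoring missing values
lims <- quantile(toy$Agriculture.Changes, probs = c(0.05, 0.95), na.rm = TRUE)

# subset() drops rows where the condition is NA, so the missing value goes too
trimmed <- subset(toy,
                  Agriculture.Changes > lims[1] & Agriculture.Changes < lims[2])

range(trimmed$Agriculture.Changes)
```

Trimming by quantile always removes a fixed share of the rows, whether or not they are really errors, so it is worth looking at what gets dropped, as done above.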

Forest Changes vs Agriculture Changes

ggplot(subset(countries.Changes, 
       Agriculture.Changes > quantile(Agriculture.Changes, probs = 0.05, na.rm = TRUE)),
       aes(x=Forest.Changes, y=Agriculture.Changes)) +
  geom_point() + 
  geom_smooth(method = "lm") +
  ggtitle("Forest Coverage Changes vs Agriculture Coverage Changes\nbetween 1990 and 2005")

ggplot(subset(countries.Changes, 
       Agriculture.Changes > quantile(Agriculture.Changes, probs = 0.05, na.rm = TRUE) &
         Country %in% south.america),
       aes(x=Forest.Changes, y=Agriculture.Changes)) +
  geom_point(aes(color = Country), alpha = 1/2, size = 5) + 
  geom_smooth(method = "lm") +
  ggtitle("Forest Coverage Changes vs Agriculture Coverage Changes\nbetween 1990 and 2005\nin South America")

So it seems that South America has a stronger negative relationship between the two changes than the world as a whole, at least judging by the fitted lines. That could suggest that the main cause of deforestation there is agriculture (rather than population growth, cities, infrastructure, etc…), although a correlation alone cannot prove it.
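
The fitted lines are suggestive, but the comparison is easier to judge with a number. Here is a sketch of how the correlation between the changes could be computed, using a made-up frame in place of `countries.Changes` (the `toy` values are invented, so the result says nothing about the real data):

```r
# Invented changes in percentage points, including some missing values.
toy <- data.frame(
  Forest.Changes      = c(-8, -4, 0, -6, -10, 2, NA, 1),
  Agriculture.Changes = c( 9,  5, 1,  6,   8, -3, 2, NA))

# cor.test() only uses the pairs where both values are present
res <- with(toy, cor.test(Forest.Changes, Agriculture.Changes))

res$estimate    # Pearson correlation
res$parameter   # degrees of freedom = complete pairs - 2
res$conf.int    # 95% confidence interval
```

Running the same call on the world and on the South American subset would give two intervals that can be compared directly.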

Conclusion

Although there seems to be a negative correlation between Forest Coverage and Agriculture Coverage, mostly in South American countries, I feel I cannot rely on the data due to the strange outliers described above and the missing data.