Date: May 01, 2016

Synopsis

The goal is to analyze the Mango tracing system data and to garner insights about the data. We have explored the relationship between a set of variables and qute_mangue_prod (outcome). We have choosen on the basis of specific criteria a model that give us a better out-of-sample accuracy. With this model we have illustrated that water-stressed trees yield was less greater than irrigated trees yield.

Data processing

The data for this report come from the googlesheet server. We have downloaded the data using an interactive request. For issuing time-consuming dowloading or SSL connection error, we have stored a backup copy of the dataset in our server.

library(googlesheets)
dtable <- "agridata"
# Grab the Google Sheet
sheet <- gs_title(dtable)
if (exists("sheet")) {
    sysdata <- gs_read_csv(sheet)
} else sysdata <- mango_data
# transform parc_irrigue into numeric variable
sysdata$parc_irriguee <- gsub("Oui", "1", sysdata$parc_irriguee)
sysdata$parc_irriguee <- gsub("Non", "0", sysdata$parc_irriguee)
sysdata$parc_irriguee <- as.numeric(sysdata$parc_irriguee)

Eploratory data analysis

The goal of this task is to understand the basic relationships observed between amount of mango in production in the dataset and prepare to build a model. We have first subsetted variables that seem to have strong relationship with amount of mango in production. The graph below shows us that qute_mangue seems to have a better relationship over age_plantation and prod_an_mangue. We will be focusing on qute_mangue_prod as outcome and the set of variables: qute_mangue and parc_irrigue as explanatory variables.

library(ggplot2)
par(mfrow = c(2,2))
pairs(sysdata[ ,c(20:24)], panel = panel.smooth, main = "Exploring Mango tracing system data", col = 3 + (sysdata$parc_irriguee))

We smooth the relationship between qute_mangue and qute_mangue_prod (outcome). We are looking how this relationship is different betwen water-stressed and irrigated mango trees. There’s a positive relationship and it seems to be weak for both water-stressed and irrigated mango trees.

library(ggplot2)
sysdata = transform(sysdata, Irrigation=factor(parc_irriguee,labels=c("water-stressed","irrigated")))

# Function to generate correlation coefficient for the charts
corr_eqn <- function(x,y, digits = 2) {
    corr_coef <- round(cor(x, y), digits = digits)
    paste("italic(r) == ", corr_coef)
}
corr_data <- sysdata[ ,c(22,21)]
xy <- corr_data[which(complete.cases(corr_data)==TRUE), ]

labels = data.frame(x = 500, y = 300, label = corr_eqn(xy$qute_mangue_prod, xy$qute_mangue))

# linear relationships between computed and reading area
c <- ggplot(sysdata, aes(y=qute_mangue, x=qute_mangue_prod))
c + stat_smooth(method=lm) + geom_point() + ggtitle("Quantite mangue dans la production versus\n quantite mangue sur la parcelle") +
    geom_point(aes(colour = Irrigation)) + theme(legend.position = "bottom", legend.title = element_blank()) + 
    geom_text(data = labels, aes(x = x, y = y, label = label), size = 14, parse = TRUE)

For the explanatory variable (irrigation), which is factor variable, we create a frequency table to display the counts. We can also aggregate the amount of qute_mango_prod by parc_irrigue and group it by:

  1. Type of traitement phytosanitaire
  2. régime foncier
  3. Type culture
  4. fertilisation chimique
  5. fertilisation organique
  6. fréquence inondation
table(sysdata$Irrigation)
## 
## water-stressed      irrigated 
##           1619           3723
aggr1 <- aggregate(qute_mangue_prod ~ Irrigation + type_trait_phyto, sysdata, sum)
aggr2 <- aggregate(qute_mangue_prod ~ Irrigation + regime_foncier, sysdata, sum)
aggr3 <- aggregate(qute_mangue_prod ~ Irrigation + type_culture, sysdata, sum)
aggr4 <- aggregate(qute_mangue_prod ~ Irrigation + ferti_chimique, sysdata, sum)
aggr5 <- aggregate(qute_mangue_prod ~ Irrigation + ferti_organique, sysdata, sum)
aggr6 <- aggregate(qute_mangue_prod ~ Irrigation + freq_inond, sysdata, sum)

ggplot(aggr1, aes(Irrigation, qute_mangue_prod )) +   
  geom_bar(aes(fill = type_trait_phyto), position = "dodge", stat="identity")

ggplot(aggr2, aes(Irrigation, qute_mangue_prod )) +   
  geom_bar(aes(fill = regime_foncier), position = "dodge", stat="identity")

ggplot(aggr3, aes(Irrigation, qute_mangue_prod )) +   
  geom_bar(aes(fill = type_culture), position = "dodge", stat="identity")

ggplot(aggr4, aes(Irrigation, qute_mangue_prod )) +   
  geom_bar(aes(fill = ferti_chimique), position = "dodge", stat="identity")

ggplot(aggr5, aes(Irrigation, qute_mangue_prod )) +   
  geom_bar(aes(fill = ferti_organique), position = "dodge", stat="identity")

ggplot(aggr6, aes(Irrigation, qute_mangue_prod )) +   
  geom_bar(aes(fill = freq_inond), position = "dodge", stat="identity")

Statistical inference

T-test

We start by looking at whether there are group differences by mean for water-stressed and irrigated, which are the supplement types. The t-value is 0.278, the p-value is 0.781, and the confidence interval contains the zero value, which denotes no effect so I fail to reject the null hypothesis that there is no difference on quantité de mangues dans la production across water-stressed and irrigated conditions

t.test(qute_mangue_prod ~ Irrigation, data = sysdata)
## 
##  Welch Two Sample t-test
## 
## data:  qute_mangue_prod by Irrigation
## t = 0.27808, df = 2097.2, p-value = 0.781
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.9016031  1.1995396
## sample estimates:
## mean in group water-stressed      mean in group irrigated 
##                     3.542027                     3.393059

T-test for mean differences by sex

To examine the yield of mango trees by farmer sex, I subset the data set into 2 smaller data sets according to sex type of farmers. I then use the t-tests to see if there are differences between the 2 dosage groups. The large absolute value of t-statistics and very small p-values suggest that all differences for each test are statistically significant so that we can reject the null hypothesis that the difference in means is 0 for each of the hypothesis tests. Also, the confidence intervals do not contain zero.

sysdata_male <- subset(sysdata, sexe_prop == "M")
sysdata_female <- subset(sysdata, sexe_prop == "F")
t.test(qute_mangue_prod ~ Irrigation, data = sysdata_male)
## 
##  Welch Two Sample t-test
## 
## data:  qute_mangue_prod by Irrigation
## t = 0.43989, df = 1467.1, p-value = 0.6601
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.097216  1.731574
## sample estimates:
## mean in group water-stressed      mean in group irrigated 
##                     3.897890                     3.580711
t.test(qute_mangue_prod ~ Irrigation, data = sysdata_female)
## 
##  Welch Two Sample t-test
## 
## data:  qute_mangue_prod by Irrigation
## t = -0.5059, df = 1073.1, p-value = 0.613
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.6488806  0.3828694
## sample estimates:
## mean in group water-stressed      mean in group irrigated 
##                     2.568129                     2.701135

Regression Model

Let’s include a subset of variables in the agridata dataset as predictors and qute_mangue_prod as the outcome in our first model to understand multiple variable regression

fit0 <- lm(qute_mangue_prod ~ ., data=sysdata[ ,20:24])
summary(fit0)$coef
##                     Estimate  Std. Error    t value     Pr(>|t|)
## (Intercept)     2.0887247009 0.996496410  2.0960685 3.624673e-02
## age_plantation -0.0004816246 0.003894429 -0.1236702 9.015933e-01
## qute_mangue     0.5007819691 0.049529824 10.1107157 2.800003e-23
## prod_an_mangue  0.0027656863 0.000815800  3.3901523 7.170418e-04
## parc_irriguee  -1.6632678464 1.166483921 -1.4258815 1.541149e-01

We grap only parc_irriguee coefficient. Its estimated value is -1.66. So we estimate an expected -1.66 decrease in Numer of mango in production for every 1% decrease in number of water-stressed/irrigated trees in holding the remaining variables constant.
Notice that The t-test for \(H_0: \beta_{Irrigation} = 0\) versus \(H_a: \beta_{Irrigation} \neq 0\) has a P-value equal to 0.15. it is not significant.

fit1<-lm(qute_mangue_prod ~ Irrigation, data=sysdata)
summary(fit1)$coef
##                       Estimate Std. Error    t value     Pr(>|t|)
## (Intercept)          3.5420272  0.3658964  9.6804104 5.534353e-22
## Irrigationirrigated -0.1489683  0.4383581 -0.3398324 7.339962e-01

If we include only parc_irrigue variable, the estimate is now -0.14, but the P-value is 0.73 > 0.05. The test statistic is not significative. Notice that we have only irrigated in the table. It is because R has elected to choose water-stressed as the reference category. The number -0.15 is the estimated decrease in mpg comparing irrigated to water-stressed

fit2 <- lm(qute_mangue_prod ~ Irrigation - 1, data = sysdata)
summary(fit2)$coef
##                          Estimate Std. Error  t value     Pr(>|t|)
## Irrigationwater-stressed 3.542027  0.3658964  9.68041 5.534353e-22
## Irrigationirrigated      3.393059  0.2414077 14.05531 4.339727e-44

If we omit the intercept, then the model includes both irrigated and water-stressed irrigation condition. Now irrigated is not a linear combination of water-stressed, there’s 2 means in the dataset. The expected value of the outcome should be the mean for irrigated or water-stressed hydric condition. As we can see in the table the estimate of irrigated is about 3.39 and the estimate of water-stressed is about 3.54 and it is clearly illustrated in the boxplot below.

boxplot(qute_mangue_prod ~ Irrigation, data = sysdata,
        xlab = "hydric condition", ylab = "Nbre mangue dans la production",
        main = "Water-stressed versus irrigated condition", varwidth = TRUE, col = "lightgray")


If I were to substract 3.54 to 3.39, I would get exactly -0.14. In other words, the model are perfectly consistent. If you are going to fix the mean of the irrigated and water-stressed irrigation condition, any way you do this, the model will alway be consistent. In this model, the coefficients are interpreted as the mean. Notice the importance of outliers values in the box plot. For example we 800 effective mango production for only one parcel.

Residuals and Diagnostic

The goal of the graph below is to highlight residuals values. The first graph is the unajusted one. It shows a poor relationship. You can also see the confounder qute_mangue is poorly correlated with par_irrigue. We can remove the effect of qute_mangue by taking it out of parc_irrigue. You can see that once we factor out qute_mangue, we have a relationonship slightly in the opposite direction, and it’s basically saying if we were to fix the unajusted model fit2, what you would see is a negative slope between mpg and wt.

n<-nrow(sysdata); x2 <- sysdata$qute_mangue ; x1 <-sysdata$parc_irriguee; y = sysdata$qute_mangue_prod
plot(x1, y, pch=21,col="black",bg=topo.colors(n)[x2], frame = FALSE, cex = 1.5,main
='Unadjusted, color is Weight',cex.main=.9)
abline(lm(y ~ x1), lwd = 2)

plot(resid(lm(x1 ~ x2)), resid(lm(y ~ x2))[1:5334], pch = 21, col = "black", bg = "lightblue", frame = FALSE, cex = 1.5)
title('Adjusted',cex.main=.9)
abline(0, coef(lm(y ~ x1 + x2))[2], lwd = 2)
x<-resid(lm(x1 ~ x2))
e<-resid(lm(y ~ x2))
for (i in 1 : n)
lines(c(x[i], x[i]), c(e[i], 0), col = "red", lwd = 1)

par(mfrow = c(1, 2))
plot(fit2, which=c(1:2))


Conclusion

In the second model, We have demonstrated that water-stressed trees yield was less greater than irrigated trees yield and that the qute_mangue_prod difference is -1.66. We have attempted to fix the model 1 using Model 2. Nonetheless, Model 2 shows a poor relationship. We have used a formal model selection based on the P-value for the likelihood ratio test comparing model 1 and model 2. Further analysis is necessary in orther to find the best model for the agritech dataset