Introduction

Exploratory data analysis (EDA), as opposed to confirmatory data analysis (CDA), was founded by John Tukey (1977). Tukey often related EDA to detective work. In EDA, the role of the researcher is to explore the data in as many ways as possible until a plausible “story” of the data emerges. A detective does not collect just any information. Instead she collects evidence and clues related to the central question of the case. So, from now on, each time you are conducting EDA, you can call yourself “Detective ABC” where ABC is your name.

In this document, EDA is illustrated using data and code borrowed from Plant (2012).

Data download

Data Set 3 is an agriculture dataset collected over three cropping seasons from a total of 16 rice fields in Treinta y Tres (Uruguay). Fields are about 30 to 50 ha. Rice is typically grown in Uruguay in a 5 year rotation, with 2 years of rice followed by 3 years of pasture. Rice fields are located in three geographic regions: northern, central and southern.

Data reading

A folder with data needed to replicate this exercise can be downloaded from http://goo.gl/qXDwhv

To show locations of the Uruguayan rice fields write following code:

# install.packages("maps")
setwd("/Users/laura/documents/rpubs")
library(maps)
library(maptools)
## Loading required package: sp
## Checking rgeos availability: TRUE
uy.map <- map("world", "uruguay",
   fill=TRUE, col="transparent",  plot = FALSE)
uy.poly <- map2SpatialPolygons(uy.map, "Uruguay")
proj4string(uy.poly) <- CRS("+proj=longlat +datum=WGS84")
data.Set3 <- read.csv("set3/Set3data.csv", header = TRUE)
data.disp <- data.frame(Field = 1:16)
data.disp$Longitude <- tapply(data.Set3$Longitude, data.Set3$Field, mean)
data.disp$Latitude <- tapply(data.Set3$Latitude, data.Set3$Field, mean)
data.dispN <- data.disp[which(data.disp$Latitude > -32.8316),]
data.dispS <- data.disp[which(data.disp$Latitude < -33.7008),]
data.dispC <- data.disp[which(data.disp$Latitude < -32.8314 &
  data.disp$Latitude > -33.7007),]
data.dispN$Longitude <- data.dispN$Longitude + 4 +
  30  * (data.dispN$Longitude + 53.7548)
data.dispN$Latitude <- data.dispN$Latitude + 3 +
  30  * (data.dispN$Latitude + 32.8316)
data.dispC$Longitude <- data.dispC$Longitude + 7 +
  45  * (data.dispC$Longitude + 53.90967)
data.dispC$Latitude <- data.dispC$Latitude +
  45  * (data.dispC$Latitude + 33.21518)
data.dispS$Longitude <- data.dispS$Longitude + 3 +
  30  * (data.dispS$Longitude + 54.51)
data.dispS$Latitude <- data.dispS$Latitude - 2 +
  15  * (data.dispS$Latitude + 33.72)
coordinates(data.disp) <- c("Longitude", "Latitude")
proj4string(data.disp) <- CRS("+proj=longlat +datum=WGS84")
coordinates(data.dispN) <- c("Longitude", "Latitude")
proj4string(data.dispN) <- CRS("+proj=longlat +datum=WGS84")
coordinates(data.dispC) <- c("Longitude", "Latitude")
proj4string(data.dispC) <- CRS("+proj=longlat +datum=WGS84")
coordinates(data.dispS) <- c("Longitude", "Latitude")
proj4string(data.dispS) <- CRS("+proj=longlat +datum=WGS84")
plot(uy.poly, xlim = c(-59, -46), axes = FALSE)
title(main = "Data Set 3 Field Locations", cex.main = 2)
plot(data.disp, add = TRUE, pch = 1, cex = 0.5)
plot(data.dispN, add = TRUE, pch = 1)
plot(data.dispC, add = TRUE, pch = 1)
plot(data.dispS, add = TRUE, pch = 1)
text(coordinates(data.dispN)[,1]+0.25, coordinates(data.dispN)[,2]+0.25,
   as.character(data.dispN$Field))
text(coordinates(data.dispC)[,1]+0.25, coordinates(data.dispC)[,2]+0.25,
   as.character(data.dispC$Field))
text(coordinates(data.dispS)[,1]+0.25, coordinates(data.dispS)[,2]+0.25,
   as.character(data.dispS$Field))
lines(c(-51.8,-49), c(-30,-30))
lines(c(-51.8,-49), c(-28,-28))
lines(c(-51.8,-51.8), c(-30,-28))
lines(c(-49,-49), c(-30,-28))
arrows(-51.8,-29.7, -53.65,-32.65, length = 0.1)
lines(c(-51.1,-46.3), c(-34.2,-34.2))
lines(c(-51.1,-46.3), c(-31.9,-31.9))
lines(c(-51.1,-51.1), c(-34.2,-31.9))
lines(c(-46.3,-46.3), c(-31.9,-34.2))
arrows(-51.1,-33, -53.8,-33.2, length = 0.1)
lines(c(-52.1,-50.5), c(-36.45,-36.45))
lines(c(-52.1,-50.5), c(-35,-35))
lines(c(-52.1,-52.1), c(-36.45,-35))
lines(c(-50.5,-50.5), c(-36.45,-35))
arrows(-52.1,-35.5, -54.4,-33.8, length = 0.1)

deg.per.km <- 360 / 40075
lat <- -36
text(-57, lat+0.35, "0")
text(-55, lat+0.35, "200 km")
deg.per.km <- deg.per.km * cos(pi*lat / 180)
SpatialPolygonsRescale(layout.scale.bar(), offset = c(-57, lat),
  scale = 200 * deg.per.km, fill=c("transparent","black"),
  plot.grid = FALSE)
SpatialPolygonsRescale(layout.north.arrow(), offset = c(-55, -30),
  scale = 1, plot.grid = FALSE)

Data visualization

A plot of UTM coordinates can show that all fields in the northern region have a Northing greater than 6.340.000 m and all fields in the southern region have a Northing less than 6.280.000 m. Following code allows to confim Northing coordinates of fields. Furthermore, a boxplot of yield by farmer is created.

library(lattice)
# 

#data.Set3 <- read.table("./set3/Set3data.csv", header = TRUE, sep = ",")
names(data.Set3)
##  [1] "RiceYear"  "Season"    "Field"     "ID"        "Latitude" 
##  [6] "Longitude" "pH"        "Corg"      "SoilP"     "SoilK"    
## [11] "Sand"      "Silt"      "Clay"      "DPL"       "Emer"     
## [16] "Weeds"     "Cont"      "Irrig"     "D50"       "Yield"    
## [21] "Fert"      "N"         "P"         "K"         "Var"      
## [26] "Northing"  "Easting"   "Farmer"
summary(data.Set3)
##     RiceYear         Season          Field              ID       
##  Min.   :1.000   Min.   :1.000   Min.   : 1.000   Min.   :  1.0  
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.: 3.000   1st Qu.: 81.5  
##  Median :1.000   Median :2.000   Median : 8.000   Median :162.0  
##  Mean   :1.288   Mean   :1.923   Mean   : 8.241   Mean   :162.0  
##  3rd Qu.:2.000   3rd Qu.:3.000   3rd Qu.:13.000   3rd Qu.:242.5  
##  Max.   :2.000   Max.   :3.000   Max.   :16.000   Max.   :323.0  
##                                                                  
##     Latitude        Longitude            pH             Corg      
##  Min.   :-33.76   Min.   :-54.53   Min.   :4.500   Min.   :0.720  
##  1st Qu.:-33.23   1st Qu.:-53.97   1st Qu.:5.300   1st Qu.:1.175  
##  Median :-33.21   Median :-53.95   Median :5.600   Median :1.370  
##  Mean   :-33.16   Mean   :-53.98   Mean   :5.647   Mean   :1.537  
##  3rd Qu.:-32.83   3rd Qu.:-53.79   3rd Qu.:5.900   3rd Qu.:1.770  
##  Max.   :-32.78   Max.   :-53.75   Max.   :8.100   Max.   :4.210  
##                                                                   
##      SoilP            SoilK             Sand            Silt      
##  Min.   : 1.200   Min.   :0.1000   Min.   : 8.00   Min.   :13.00  
##  1st Qu.: 2.800   1st Qu.:0.1700   1st Qu.:14.00   1st Qu.:35.00  
##  Median : 4.100   Median :0.2100   Median :22.00   Median :49.00  
##  Mean   : 5.448   Mean   :0.2255   Mean   :27.87   Mean   :48.27  
##  3rd Qu.: 6.750   3rd Qu.:0.2600   3rd Qu.:40.00   3rd Qu.:62.00  
##  Max.   :21.300   Max.   :0.5700   Max.   :77.00   Max.   :71.00  
##                                                                   
##       Clay            DPL             Emer           Weeds      
##  Min.   :10.00   Min.   :1.000   Min.   :2.000   Min.   :1.000  
##  1st Qu.:21.00   1st Qu.:2.000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :24.00   Median :3.000   Median :4.000   Median :3.000  
##  Mean   :23.87   Mean   :2.941   Mean   :3.635   Mean   :3.019  
##  3rd Qu.:27.00   3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :40.00   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##                                                                 
##       Cont           Irrig            D50            Yield      
##  Min.   :2.000   Min.   :2.000   Min.   :15.00   Min.   : 2747  
##  1st Qu.:3.000   1st Qu.:3.000   1st Qu.:45.00   1st Qu.: 6104  
##  Median :4.000   Median :4.000   Median :46.00   Median : 6987  
##  Mean   :3.988   Mean   :3.709   Mean   :48.99   Mean   : 7232  
##  3rd Qu.:5.000   3rd Qu.:4.000   3rd Qu.:59.00   3rd Qu.: 8346  
##  Max.   :5.000   Max.   :5.000   Max.   :75.00   Max.   :14192  
##                                                                 
##       Fert             N               P               K         
##  Min.   : 96.0   Min.   :39.20   Min.   :46.00   Min.   : 0.000  
##  1st Qu.:114.0   1st Qu.:58.40   1st Qu.:55.20   1st Qu.: 0.000  
##  Median :114.0   Median :58.40   Median :62.40   Median : 0.000  
##  Mean   :127.7   Mean   :57.78   Mean   :62.99   Mean   : 7.607  
##  3rd Qu.:158.0   3rd Qu.:62.20   3rd Qu.:72.00   3rd Qu.:23.400  
##  Max.   :158.0   Max.   :62.80   Max.   :72.80   Max.   :23.400  
##                                                                  
##       Var           Northing          Easting           Farmer  
##  Min.   :1.000   Min.   :6259282   Min.   :173303   J      :55  
##  1st Qu.:2.000   1st Qu.:6319757   1st Qu.:223275   B      :50  
##  Median :2.000   Median :6321628   Median :225138   L      :50  
##  Mean   :2.214   Mean   :6326727   Mean   :222302   K      :33  
##  3rd Qu.:2.000   3rd Qu.:6364566   3rd Qu.:238383   A      :25  
##  Max.   :4.000   Max.   :6370168   Max.   :242585   H      :23  
##                                                     (Other):87
head(data.Set3)
##   RiceYear Season Field ID  Latitude Longitude  pH Corg SoilP SoilK Sand
## 1        1      1     6  1 -33.21457 -53.91451 6.0 1.07   3.4  0.18   21
## 2        1      1     6  2 -33.21631 -53.91287 6.0 1.19   4.3  0.15   21
## 3        1      1     6  3 -33.21902 -53.91334 5.8 0.72  11.0  0.31   30
## 4        1      1     6  4 -33.21755 -53.91350 8.0 1.17   4.4  0.12   29
## 5        1      1     6  5 -33.21841 -53.91543 5.8 1.00   1.8  0.14   33
## 6        1      1     6  6 -33.21831 -53.91136 5.8 0.96   4.7  0.33   18
##   Silt Clay DPL Emer Weeds Cont Irrig D50 Yield Fert    N    P K Var
## 1   58   21   3    4     3    4     3  46  6427  114 58.4 55.2 0   2
## 2   60   19   3    4     2    4     3  46  6267  114 58.4 55.2 0   2
## 3   50   20   3    3     4    4     4  46  5147  114 58.4 55.2 0   2
## 4   53   18   3    3     4    3     4  46  4853  114 58.4 55.2 0   2
## 5   48   19   3    4     3    4     4  46  6267  114 58.4 55.2 0   2
## 6   58   24   3    4     3    4     3  46  5733  114 58.4 55.2 0   2
##   Northing  Easting Farmer
## 1  6321139 228352.7      K
## 2  6320950 228510.2      K
## 3  6320649 228475.0      K
## 4  6320811 228456.1      K
## 5  6320711 228278.1      K
## 6  6320732 228657.4      K
tail(data.Set3)
##     RiceYear Season Field  ID  Latitude Longitude  pH Corg SoilP SoilK
## 318        1      3     4 318 -32.82817 -53.76871 4.8 1.46  12.7  0.23
## 319        1      3     4 319 -32.82728 -53.76559 4.7 1.17   5.3  0.14
## 320        1      3     4 320 -32.82704 -53.76273 4.7 0.99   3.1  0.11
## 321        1      3     4 321 -32.82533 -53.76451 4.7 1.05   2.8  0.16
## 322        1      3     4 322 -32.82323 -53.76446 4.9 1.14   4.1  0.21
## 323        1      3     4 323 -32.82583 -53.76868 4.9 1.12  11.1  0.24
##     Sand Silt Clay DPL Emer Weeds Cont Irrig D50 Yield Fert    N  P    K
## 318   46   34   20   3    4     3    5     5  46 11430  158 62.2 72 23.4
## 319   44   33   23   3    4     3    5     5  46  9985  158 62.2 72 23.4
## 320   35   42   23   3    4     3    5     5  46  9260  158 62.2 72 23.4
## 321   41   35   24   3    4     3    4     4  46  9130  158 62.2 72 23.4
## 322   46   30   24   3    4     3    5     4  46  8970  158 62.2 72 23.4
## 323   42   33   25   3    3     3    4     4  46  9235  158 62.2 72 23.4
##     Var Northing  Easting Farmer
## 318   1  6364366 240816.7      D
## 319   1  6364472 241105.8      D
## 320   1  6364505 241373.7      D
## 321   1  6364690 241201.7      D
## 322   1  6364924 241200.5      D
## 323   1  6364625 240812.5      D
nrow(data.Set3)
## [1] 323
plot(data.Set3$Northing)

par(mai = c(1,1,1,1))
boxplot(Yield ~ Farmer, data = data.Set3,
        main = "Rice Yields by Farmer", xlab = "Farmer", cex.main = 2,
        ylab = "Yield (kg/ha)", cex.lab = 1.5) # Fig. 7.14

data.Set3$Location <- "Center"
data.Set3$Location[(data.Set3$Northing > 6340000)] <- "North"
data.Set3$Location[(data.Set3$Northing < 6280000)] <- "South"
trellis.par.set(par.main.text = list(cex = 2))
trellis.device(color = FALSE)
data.Set3$YearFac <- factor(data.Set3$Season,
                            labels = c("2002-03", "2003-04", "2004-05"))
bwplot(Yield ~ Location | YearFac, data = data.Set3,
       main = "Rice Yields by Season",  # Fig. 7.15
       xlab = "Location", ylab = "Yield (kg/ha)", layout = c(3,1),
       aspect = 1)

data.Set3$LocN <- 2
data.Set3$LocN[(data.Set3$Northing > 6340000)] <- 1
data.Set3$LocN[(data.Set3$Northing < 6280000)] <- 3
data.Set3$LocSeason <- with(data.Set3, Season + 10 * LocN)
print(round(mean.yields <- tapply(data.Set3$Yield,
                                  data.Set3$LocSeason, mean)))
##   12   13   21   22   23   32   33 
## 8605 9173 5999 5900 5909 8832 7064
unique(data.Set3$RiceYear) 
## [1] 1 2
sort(unique(data.Set3$Field[which(data.Set3$RiceYear == 1)]))
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
sort(unique(data.Set3$Field[which(data.Set3$RiceYear == 2)]))
## [1]  3  5 12 13 14 16
t.test(data.Set3$Yield[which((data.Set3$Field == 3)
                             & (data.Set3$RiceYear == 1))],
       data.Set3$Yield[which((data.Set3$Field == 3)
                             & (data.Set3$RiceYear == 2))])
## 
##  Welch Two Sample t-test
## 
## data:  data.Set3$Yield[which((data.Set3$Field == 3) & (data.Set3$RiceYear ==  and data.Set3$Yield[which((data.Set3$Field == 3) & (data.Set3$RiceYear ==     1))] and     2))]
## t = 1.4235, df = 43.61, p-value = 0.1617
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -211.4933 1227.8933
## sample estimates:
## mean of x mean of y 
##   8472.84   7964.64
t.test(data.Set3$Yield[which((data.Set3$Field == 5) & (data.Set3$RiceYear == 1))],
       data.Set3$Yield[which((data.Set3$Field == 5) & (data.Set3$RiceYear == 2))])
## 
##  Welch Two Sample t-test
## 
## data:  data.Set3$Yield[which((data.Set3$Field == 5) & (data.Set3$RiceYear ==  and data.Set3$Yield[which((data.Set3$Field == 5) & (data.Set3$RiceYear ==     1))] and     2))]
## t = 11.513, df = 19.493, p-value = 3.834e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1982.219 2861.236
## sample estimates:
## mean of x mean of y 
##  7403.636  4981.909
t.test(data.Set3$Yield[which((data.Set3$Field == 12) & (data.Set3$RiceYear == 1))],
       data.Set3$Yield[which((data.Set3$Field == 12) & (data.Set3$RiceYear == 2))])
## 
##  Welch Two Sample t-test
## 
## data:  data.Set3$Yield[which((data.Set3$Field == 12) & (data.Set3$RiceYear ==  and data.Set3$Yield[which((data.Set3$Field == 12) & (data.Set3$RiceYear ==     1))] and     2))]
## t = -3.3103, df = 15.01, p-value = 0.004753
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1439.2478  -311.8292
## sample estimates:
## mean of x mean of y 
##  6022.923  6898.462
t.test(data.Set3$Yield[which((data.Set3$Field == 13) & (data.Set3$RiceYear == 1))],
       data.Set3$Yield[which((data.Set3$Field == 13) & (data.Set3$RiceYear == 2))])
## 
##  Welch Two Sample t-test
## 
## data:  data.Set3$Yield[which((data.Set3$Field == 13) & (data.Set3$RiceYear ==  and data.Set3$Yield[which((data.Set3$Field == 13) & (data.Set3$RiceYear ==     1))] and     2))]
## t = -2.5365, df = 11.431, p-value = 0.02695
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1323.25115   -96.74885
## sample estimates:
## mean of x mean of y 
##  5697.714  6407.714
t.test(data.Set3$Yield[which((data.Set3$Field == 14) & (data.Set3$RiceYear == 1))],
       data.Set3$Yield[which((data.Set3$Field == 14) & (data.Set3$RiceYear == 2))])
## 
##  Welch Two Sample t-test
## 
## data:  data.Set3$Yield[which((data.Set3$Field == 14) & (data.Set3$RiceYear ==  and data.Set3$Yield[which((data.Set3$Field == 14) & (data.Set3$RiceYear ==     1))] and     2))]
## t = 2.1041, df = 20.432, p-value = 0.04794
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##    4.105528 819.394472
## sample estimates:
## mean of x mean of y 
##  7153.417  6741.667
sort(unique(data.Set3$Field[which(data.Set3$Season == 1)]))
##  [1]  5  6  7  8  9 10 11 12 13 14
sort(unique(data.Set3$Field[which(data.Set3$Season == 2)]))
## [1]  1  3  5 14 15
sort(unique(data.Set3$Field[which(data.Set3$Season == 3)]))
## [1]  2  3  4 12 13 16
data.Set3$YearFarmerField <- with(data.Set3, paste(as.character(Season),Farmer,
                                                   as.character(Field)))
print(tapply(data.Set3$Yield, data.Set3$YearFarmerField, mean),
      digits = 4)
## 1 E 11  1 F 7  1 G 8 1 H 10 1 J 12 1 J 13 1 J 14  1 J 5  1 K 6  1 K 9 
##   4771   6331   5559   5995   6898   6408   7153   7404   5904   4091 
##  2 A 1  2 B 3  2 I 5 2 J 14 2 L 15  3 B 3  3 C 2  3 D 4 3 E 13 3 H 12 
##   8738   8473   4982   6742   8832   7965  11456   9542   5698   6023 
## 3 L 16 
##   7064

DEM and slope of study area

A DEM of study area, downloaded from CGIAR site, can be cropped to match study area extent. In addition, slope can be calculated. The raster package will be our alliad to conduct such a task.

library(raster)
# This file was downloaded from the CGIAR site
# http://srtm.csi.cgiar.org
dem.ras <- raster("./set3/dem.asc")
projection(dem.ras) <- CRS("+proj=longlat +datum=WGS84")
range(data.Set3$Latitude)
## [1] -33.75644 -32.77624
range(data.Set3$Longitude)
## [1] -54.52640 -53.74849
crop.extent <- matrix(c(-54.6,-53.7,-33.8,-32.7),
                      nrow = 2, byrow = TRUE)
dem.Set3 <- crop(dem.ras, extent(crop.extent))
slope.Set3 <- slopeAspect(dem.Set3, type = "slope")
## Warning in slopeAspect(dem.Set3, type = "slope"): this function is
## deprecated. Please use function "terrain" instead
dem.Set3
## class       : RasterLayer 
## dimensions  : 1320, 1080, 1425600  (nrow, ncol, ncell)
## resolution  : 0.0008333333, 0.0008333333  (x, y)
## extent      : -54.60042, -53.70042, -33.80042, -32.70042  (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0 
## data source : in memory
## names       : dem 
## values      : -4, 386  (min, max)
slope.Set3
## class       : RasterStack 
## dimensions  : 1320, 1080, 1425600, 2  (nrow, ncol, ncell, nlayers)
## resolution  : 0.0008333333, 0.0008333333  (x, y)
## extent      : -54.60042, -53.70042, -33.80042, -32.70042  (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0 
## names       :     slope,    aspect 
## min values  :         0,         0 
## max values  : 0.4669779, 6.2746734
par(mfrow=c(1,2))
plot(dem.Set3, main="DEM of Study Area")
plot(slope.Set3[[1]], main="Slope of Study Area")

To find slopes in the fiels, copy data.Set3 to a new file and then convert it to a SpatialPointsDataFrame. Then, extract the slope values from the grid and calculates the histogram:

Set3.WGS <- data.Set3
coordinates(Set3.WGS) <- c("Longitude", "Latitude")
slopes <- extract(slope.Set3, Set3.WGS)
print(range(slopes), digits = 3)
## [1] 0.00 6.18
sort(unique(data.Set3$Field[which(slopes > 0.03)]))
## [1]  3 13 14
print(cor(data.Set3$Yield,slopes), digits = 2)
##       slope aspect
## [1,] -0.034   0.11
summary(slopes)
##      slope              aspect     
##  Min.   :0.000000   Min.   :0.000  
##  1st Qu.:0.004376   1st Qu.:1.571  
##  Median :0.006473   Median :2.767  
##  Mean   :0.010050   Mean   :2.971  
##  3rd Qu.:0.011209   3rd Qu.:4.665  
##  Max.   :0.089254   Max.   :6.176
length(slopes)
## [1] 646
head(slopes)
##            slope    aspect
## [1,] 0.002104529 2.2712794
## [2,] 0.002104564 0.8703272
## [3,] 0.009015847 4.2440803
## [4,] 0.002104576 0.8703319
## [5,] 0.002104587 5.4128487
## [6,] 0.006313687 0.8703366

A set of box-and-whisker plot of rice yields for each farmer and season for the three regions follows:

bwplot(Yield ~ Farmer | Location + YearFac, data = data.Set3,
       main = "Rice Yields by Farmer and Season",
       xlab = "Farmer", ylab = "Yield (kg/ha)") # 

Above plot shows that median yield in the central region is similar from one season to the next, although variability is a bit higher in the first season. Yield in the south is smaller, and yield variability in the north has increased a lot in the third year. A closer observation of above plot can be useful to answer following questions:

  1. Which farmers obtained the highest yields?
  2. Which farmers obtained the lowest yields?
  3. Which farmers obtained the highest average yield in each geographic zone?
  4. Which farmers obtained the lowest average yield in each geographic zone?
  5. Which farmers did not participated all seasons in the study?

Exploration of mean of variables for each region

This study spanned four calendar years (three Southern Hemisphere summers). The variable Season represents growing season (i.e., 1,2, or 3). The function tapply() can be used to compute the mean yields over region. It also allows to obtain mean values of every variable over region (i.e. weeds, irrigation, SoilP, SoilK, and applied fertilizer).

# Check a few means 
tapply(data.Set3$Fert, data.Set3$Location, mean)
##   Center    North    South 
## 109.4405 158.0000 125.3200
tapply(data.Set3$Weeds, data.Set3$Location, mean)
##   Center    North    South 
## 3.458333 2.314286 3.020000
tapply(data.Set3$Cont, data.Set3$Location, mean)
##   Center    North    South 
## 3.476190 4.752381 4.100000
tapply(data.Set3$Irrig, data.Set3$Location, mean)
##   Center    North    South 
## 3.357143 4.228571 3.800000
tapply(data.Set3$SoilP, data.Set3$Location, mean)
##   Center    North    South 
## 5.562500 6.157143 3.574000
tapply(data.Set3$SoilK, data.Set3$Location, mean)
##    Center     North     South 
## 0.2185119 0.2011429 0.3004000
tapply(data.Set3$N, data.Set3$Location, mean)
##   Center    North    South 
## 54.10952 62.20000 60.80800
tapply(data.Set3$P, data.Set3$Location, mean)
##   Center    North    South 
## 57.00476 72.00000 64.16800
tapply(data.Set3$K, data.Set3$Location, mean)
## Center  North  South 
##    0.0   23.4    0.0

As seen in last results, farmers in the north applied the most fertilizer, farmers in the center applied the least, and farmers in the south applied in between. Did northen farmers applied too much fertilizer? Soil in the center is suitable for higher levels of fertilizer?

Exploration of climatic factors

Temperature, hours of sun per day and rainfall were recorded three times a month at an agricultural experiment station. Let’s plot temperature, sun hour, and rainfall trends in each year.

wthr.data <- read.csv("./set3/Set3Weather.csv", header = TRUE)
par(mai = c(1,1,1,1))
plot(wthr.data$Temp0203, type = "l", lty = 2, xaxt = "n",
     ylim = c(10,27),   main = "Regional Temperature Trends",
     ylab = expression(Temperature~"("*degree*C*")"),
     xlab = "Month", cex.main = 2, cex.lab = 1.5) # Fig. 7.17a
lines(wthr.data$Temp0304, lty = 3)
lines(wthr.data$Temp0405, lty = 4)
lines(wthr.data$TempAvg, lty = 1)
axis(side = 1, at = c(1,4,7,10,13,16,19),
     labels = c("Oct","Nov","Dec","Jan","Feb","Mar","Apr"))
legend(8, 15, c("2002-03", "2003-04", "2004-05",
                "Normal"), lty = c(2,3,4,1))

d.0203 <- wthr.data$Temp0203 - wthr.data$TempAvg
d.0304 <- wthr.data$Temp0304 - wthr.data$TempAvg
d.0405 <- wthr.data$Temp0405 - wthr.data$TempAvg
plot(d.0203, type = "l", lty = 2, xaxt = "n",
     main = "Regional Temperature Difference Trends", ylim = c(-5, 10),
     ylab = expression(Temperature~"("*degree*C*")"),
     xlab = "Month", cex.main = 2, cex.lab = 1.5)
lines(d.0304, lty = 3)
lines(d.0405, lty = 4)     # Fig. 7.17b
segments(1, 0, 21, 0)
axis(side = 1, at = c(1,4,7,10,13,16,19),
     labels = c("Oct","Nov","Dec","Jan","Feb","Mar","Apr"))
legend(4, 9, c("2002-03", "2003-04", "2004-05"), lty = c(2,3,4))

d.0203 <- wthr.data$Sun0203 - wthr.data$SunAvg
d.0304 <- wthr.data$Sun0304 - wthr.data$SunAvg
d.0405 <- wthr.data$Sun0405 - wthr.data$SunAvg
plot(d.0203, type = "l", lty = 2, xaxt = "n",
     main = "Regional Sun Hour Difference Trends", ylim = c(-5, 5),
     ylab = "Sun Hours Difference", xlab = "Month",
     cex.main = 2, cex.lab = 1.5)    # Fig. 7.17c
lines(d.0304, lty = 3)
lines(d.0405, lty = 4)
segments(1, 0, 21, 0)
axis(side = 1, at = c(1,4,7,10,13,16,19),
     labels = c("Oct","Nov","Dec","Jan","Feb","Mar","Apr"))
legend(5, 5, c("2002-03", "2003-04", "2004-05"), lty = c(2,3,4))

d.0203 <- wthr.data$Rain0203 - wthr.data$RainAvg
d.0304 <- wthr.data$Rain0304 - wthr.data$RainAvg
d.0405 <- wthr.data$Rain0405 - wthr.data$RainAvg
plot(d.0203, type = "l", lty = 2, xaxt = "n",
     main = "Regional Rain Difference Trends", ylim = c(-75,200),
     ylab = "Rain Difference (cm)", xlab = "Month",
     cex.main = 2, cex.lab = 1.5)  # Fig. 7.17d
lines(d.0304, lty = 3)
lines(d.0405, lty = 4)
axis(side = 1, at = c(1,4,7,10,13,16,19),
     labels = c("Oct","Nov","Dec","Jan","Feb","Mar","Apr"))
legend(3, 175, c("2002-03", "2003-04", "2004-05"), lty = c(2,3,4))
segments(1, 0, 21, 0)

From above temperature plot, it is clear that in each season the temperatures late in the season tended to be higher than normal, while these temperatures tended to be lower than normal earlier in the season. However, it is not easy to establish how much higher.

Regarding difference trends in terms of solar irradiation and rain, can you determine any particular pattern?

Exploration of rice varieties

In fields under study, four rice varieties were planted. They are coded numerically in the data set. Let’s assign names to them and then compute mean yields by variety:

data.Set3$Variety <- "INIA Tacuarí"
data.Set3$Variety[which(data.Set3$Var == 2)] <- "El Pasol"
data.Set3$Variety[which(data.Set3$Var == 3)] <- "Perla"
data.Set3$Variety[which(data.Set3$Var == 4)] <- "INIA Olimar"
with(data.Set3, tapply(Yield, Variety, mean))
##     El Pasol  INIA Olimar INIA Tacuarí        Perla 
##     7064.099     7737.281     6931.632     9307.667
# pay attention to how to query which farmers planted a given variety
with(data.Set3, unique(Farmer[which(Variety == "INIA Tacuarí")]))
## [1] K I L D
## Levels: A B C D E F G H I J K L
with(data.Set3, unique(Farmer[which(Variety == "El Pasol")]))
## [1] K F J G H E A L C
## Levels: A B C D E F G H I J K L
with(data.Set3, unique(Farmer[which(Variety == "INIA Olimar")]))
## [1] B E H
## Levels: A B C D E F G H I J K L
with(data.Set3, unique(Farmer[which(Variety == "Perla")]))
## [1] A
## Levels: A B C D E F G H I J K L
library(lattice)
trellis.device(color = FALSE)
bwplot(Yield ~ Farmer | Variety, data = data.Set3, xlab = "Farmer") 

Based on above code and plot answer these questions: 1. Is there a substantial difference in yields between rice varieties? 2. How many farmers planted all of the varieties? 3. How many farmers used INIA Olimar? 4. How many farmers used Perla?

Summary

According to results of this exploratory analysis, it can be said that any farmer(s) obtain consistently better yields than the rest? Does climate have an important infuence on year-to-year variability? Does terrain have much effect on yield? Does variety have a strong association with yield? Does any other management variable have influence on yield?

Now, it’s your turn to type code

Do not forget to replicate this exercise by typing the code. Remember: do not copy and paste. When you copy and paste, you almost always take a macro view and conclude with ‘claro, ya entiendo la logica’ (but you don’t). When you type it in line by line, you are seeing each line and are forced to inspect it more (and to try your own variations to improve it). As a result, you learn much better.

Again, it would of benefit if you apply code here to your own data.

REFERENCES

Plant, R.E., 2012. Spatial Data Analysis in Ecology and Agriculture using R. CRC Press, Boca Raton, FL.

Teetor, P., 2011. R Cookbox. O’Reilly, Sebastopol, CA.

Tukey, J.W., 1977. Exploratoty Data Analysis. Addison-Wesley, Reading, MA.