Exploratory data analysis (EDA), as opposed to confirmatory data analysis (CDA), was founded by John Tukey (1977). Tukey often related EDA to detective work. In EDA, the role of the researcher is to explore the data in as many ways as possible until a plausible “story” of the data emerges. A detective does not collect just any information. Instead she collects evidence and clues related to the central question of the case. So, from now on, each time you are conducting EDA, you can call yourself “Detective ABC” where ABC is your name.
In this document, EDA is illustrated using data and code borrowed from Plant (2012).
Data Set 3 is an agriculture dataset collected over three cropping seasons from a total of 16 rice fields in Treinta y Tres (Uruguay). Fields are about 30 to 50 ha. Rice is typically grown in Uruguay in a 5 year rotation, with 2 years of rice followed by 3 years of pasture. Rice fields are located in three geographic regions: northern, central and southern.
A folder with data needed to replicate this exercise can be downloaded from http://goo.gl/qXDwhv
To show locations of the Uruguayan rice fields write following code:
# install.packages("maps")
setwd("/Users/laura/documents/rpubs")
library(maps)
library(maptools)
## Loading required package: sp
## Checking rgeos availability: TRUE
uy.map <- map("world", "uruguay",
fill=TRUE, col="transparent", plot = FALSE)
uy.poly <- map2SpatialPolygons(uy.map, "Uruguay")
proj4string(uy.poly) <- CRS("+proj=longlat +datum=WGS84")
data.Set3 <- read.csv("set3/Set3data.csv", header = TRUE)
data.disp <- data.frame(Field = 1:16)
data.disp$Longitude <- tapply(data.Set3$Longitude, data.Set3$Field, mean)
data.disp$Latitude <- tapply(data.Set3$Latitude, data.Set3$Field, mean)
data.dispN <- data.disp[which(data.disp$Latitude > -32.8316),]
data.dispS <- data.disp[which(data.disp$Latitude < -33.7008),]
data.dispC <- data.disp[which(data.disp$Latitude < -32.8314 &
data.disp$Latitude > -33.7007),]
data.dispN$Longitude <- data.dispN$Longitude + 4 +
30 * (data.dispN$Longitude + 53.7548)
data.dispN$Latitude <- data.dispN$Latitude + 3 +
30 * (data.dispN$Latitude + 32.8316)
data.dispC$Longitude <- data.dispC$Longitude + 7 +
45 * (data.dispC$Longitude + 53.90967)
data.dispC$Latitude <- data.dispC$Latitude +
45 * (data.dispC$Latitude + 33.21518)
data.dispS$Longitude <- data.dispS$Longitude + 3 +
30 * (data.dispS$Longitude + 54.51)
data.dispS$Latitude <- data.dispS$Latitude - 2 +
15 * (data.dispS$Latitude + 33.72)
coordinates(data.disp) <- c("Longitude", "Latitude")
proj4string(data.disp) <- CRS("+proj=longlat +datum=WGS84")
coordinates(data.dispN) <- c("Longitude", "Latitude")
proj4string(data.dispN) <- CRS("+proj=longlat +datum=WGS84")
coordinates(data.dispC) <- c("Longitude", "Latitude")
proj4string(data.dispC) <- CRS("+proj=longlat +datum=WGS84")
coordinates(data.dispS) <- c("Longitude", "Latitude")
proj4string(data.dispS) <- CRS("+proj=longlat +datum=WGS84")
plot(uy.poly, xlim = c(-59, -46), axes = FALSE)
title(main = "Data Set 3 Field Locations", cex.main = 2)
plot(data.disp, add = TRUE, pch = 1, cex = 0.5)
plot(data.dispN, add = TRUE, pch = 1)
plot(data.dispC, add = TRUE, pch = 1)
plot(data.dispS, add = TRUE, pch = 1)
text(coordinates(data.dispN)[,1]+0.25, coordinates(data.dispN)[,2]+0.25,
as.character(data.dispN$Field))
text(coordinates(data.dispC)[,1]+0.25, coordinates(data.dispC)[,2]+0.25,
as.character(data.dispC$Field))
text(coordinates(data.dispS)[,1]+0.25, coordinates(data.dispS)[,2]+0.25,
as.character(data.dispS$Field))
lines(c(-51.8,-49), c(-30,-30))
lines(c(-51.8,-49), c(-28,-28))
lines(c(-51.8,-51.8), c(-30,-28))
lines(c(-49,-49), c(-30,-28))
arrows(-51.8,-29.7, -53.65,-32.65, length = 0.1)
lines(c(-51.1,-46.3), c(-34.2,-34.2))
lines(c(-51.1,-46.3), c(-31.9,-31.9))
lines(c(-51.1,-51.1), c(-34.2,-31.9))
lines(c(-46.3,-46.3), c(-31.9,-34.2))
arrows(-51.1,-33, -53.8,-33.2, length = 0.1)
lines(c(-52.1,-50.5), c(-36.45,-36.45))
lines(c(-52.1,-50.5), c(-35,-35))
lines(c(-52.1,-52.1), c(-36.45,-35))
lines(c(-50.5,-50.5), c(-36.45,-35))
arrows(-52.1,-35.5, -54.4,-33.8, length = 0.1)
deg.per.km <- 360 / 40075
lat <- -36
text(-57, lat+0.35, "0")
text(-55, lat+0.35, "200 km")
deg.per.km <- deg.per.km * cos(pi*lat / 180)
SpatialPolygonsRescale(layout.scale.bar(), offset = c(-57, lat),
scale = 200 * deg.per.km, fill=c("transparent","black"),
plot.grid = FALSE)
SpatialPolygonsRescale(layout.north.arrow(), offset = c(-55, -30),
scale = 1, plot.grid = FALSE)
A plot of UTM coordinates can show that all fields in the northern region have a Northing greater than 6.340.000 m and all fields in the southern region have a Northing less than 6.280.000 m. Following code allows to confim Northing coordinates of fields. Furthermore, a boxplot of yield by farmer is created.
library(lattice)
#
#data.Set3 <- read.table("./set3/Set3data.csv", header = TRUE, sep = ",")
names(data.Set3)
## [1] "RiceYear" "Season" "Field" "ID" "Latitude"
## [6] "Longitude" "pH" "Corg" "SoilP" "SoilK"
## [11] "Sand" "Silt" "Clay" "DPL" "Emer"
## [16] "Weeds" "Cont" "Irrig" "D50" "Yield"
## [21] "Fert" "N" "P" "K" "Var"
## [26] "Northing" "Easting" "Farmer"
summary(data.Set3)
## RiceYear Season Field ID
## Min. :1.000 Min. :1.000 Min. : 1.000 Min. : 1.0
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.: 3.000 1st Qu.: 81.5
## Median :1.000 Median :2.000 Median : 8.000 Median :162.0
## Mean :1.288 Mean :1.923 Mean : 8.241 Mean :162.0
## 3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.:13.000 3rd Qu.:242.5
## Max. :2.000 Max. :3.000 Max. :16.000 Max. :323.0
##
## Latitude Longitude pH Corg
## Min. :-33.76 Min. :-54.53 Min. :4.500 Min. :0.720
## 1st Qu.:-33.23 1st Qu.:-53.97 1st Qu.:5.300 1st Qu.:1.175
## Median :-33.21 Median :-53.95 Median :5.600 Median :1.370
## Mean :-33.16 Mean :-53.98 Mean :5.647 Mean :1.537
## 3rd Qu.:-32.83 3rd Qu.:-53.79 3rd Qu.:5.900 3rd Qu.:1.770
## Max. :-32.78 Max. :-53.75 Max. :8.100 Max. :4.210
##
## SoilP SoilK Sand Silt
## Min. : 1.200 Min. :0.1000 Min. : 8.00 Min. :13.00
## 1st Qu.: 2.800 1st Qu.:0.1700 1st Qu.:14.00 1st Qu.:35.00
## Median : 4.100 Median :0.2100 Median :22.00 Median :49.00
## Mean : 5.448 Mean :0.2255 Mean :27.87 Mean :48.27
## 3rd Qu.: 6.750 3rd Qu.:0.2600 3rd Qu.:40.00 3rd Qu.:62.00
## Max. :21.300 Max. :0.5700 Max. :77.00 Max. :71.00
##
## Clay DPL Emer Weeds
## Min. :10.00 Min. :1.000 Min. :2.000 Min. :1.000
## 1st Qu.:21.00 1st Qu.:2.000 1st Qu.:3.000 1st Qu.:2.000
## Median :24.00 Median :3.000 Median :4.000 Median :3.000
## Mean :23.87 Mean :2.941 Mean :3.635 Mean :3.019
## 3rd Qu.:27.00 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :40.00 Max. :5.000 Max. :5.000 Max. :5.000
##
## Cont Irrig D50 Yield
## Min. :2.000 Min. :2.000 Min. :15.00 Min. : 2747
## 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:45.00 1st Qu.: 6104
## Median :4.000 Median :4.000 Median :46.00 Median : 6987
## Mean :3.988 Mean :3.709 Mean :48.99 Mean : 7232
## 3rd Qu.:5.000 3rd Qu.:4.000 3rd Qu.:59.00 3rd Qu.: 8346
## Max. :5.000 Max. :5.000 Max. :75.00 Max. :14192
##
## Fert N P K
## Min. : 96.0 Min. :39.20 Min. :46.00 Min. : 0.000
## 1st Qu.:114.0 1st Qu.:58.40 1st Qu.:55.20 1st Qu.: 0.000
## Median :114.0 Median :58.40 Median :62.40 Median : 0.000
## Mean :127.7 Mean :57.78 Mean :62.99 Mean : 7.607
## 3rd Qu.:158.0 3rd Qu.:62.20 3rd Qu.:72.00 3rd Qu.:23.400
## Max. :158.0 Max. :62.80 Max. :72.80 Max. :23.400
##
## Var Northing Easting Farmer
## Min. :1.000 Min. :6259282 Min. :173303 J :55
## 1st Qu.:2.000 1st Qu.:6319757 1st Qu.:223275 B :50
## Median :2.000 Median :6321628 Median :225138 L :50
## Mean :2.214 Mean :6326727 Mean :222302 K :33
## 3rd Qu.:2.000 3rd Qu.:6364566 3rd Qu.:238383 A :25
## Max. :4.000 Max. :6370168 Max. :242585 H :23
## (Other):87
head(data.Set3)
## RiceYear Season Field ID Latitude Longitude pH Corg SoilP SoilK Sand
## 1 1 1 6 1 -33.21457 -53.91451 6.0 1.07 3.4 0.18 21
## 2 1 1 6 2 -33.21631 -53.91287 6.0 1.19 4.3 0.15 21
## 3 1 1 6 3 -33.21902 -53.91334 5.8 0.72 11.0 0.31 30
## 4 1 1 6 4 -33.21755 -53.91350 8.0 1.17 4.4 0.12 29
## 5 1 1 6 5 -33.21841 -53.91543 5.8 1.00 1.8 0.14 33
## 6 1 1 6 6 -33.21831 -53.91136 5.8 0.96 4.7 0.33 18
## Silt Clay DPL Emer Weeds Cont Irrig D50 Yield Fert N P K Var
## 1 58 21 3 4 3 4 3 46 6427 114 58.4 55.2 0 2
## 2 60 19 3 4 2 4 3 46 6267 114 58.4 55.2 0 2
## 3 50 20 3 3 4 4 4 46 5147 114 58.4 55.2 0 2
## 4 53 18 3 3 4 3 4 46 4853 114 58.4 55.2 0 2
## 5 48 19 3 4 3 4 4 46 6267 114 58.4 55.2 0 2
## 6 58 24 3 4 3 4 3 46 5733 114 58.4 55.2 0 2
## Northing Easting Farmer
## 1 6321139 228352.7 K
## 2 6320950 228510.2 K
## 3 6320649 228475.0 K
## 4 6320811 228456.1 K
## 5 6320711 228278.1 K
## 6 6320732 228657.4 K
tail(data.Set3)
## RiceYear Season Field ID Latitude Longitude pH Corg SoilP SoilK
## 318 1 3 4 318 -32.82817 -53.76871 4.8 1.46 12.7 0.23
## 319 1 3 4 319 -32.82728 -53.76559 4.7 1.17 5.3 0.14
## 320 1 3 4 320 -32.82704 -53.76273 4.7 0.99 3.1 0.11
## 321 1 3 4 321 -32.82533 -53.76451 4.7 1.05 2.8 0.16
## 322 1 3 4 322 -32.82323 -53.76446 4.9 1.14 4.1 0.21
## 323 1 3 4 323 -32.82583 -53.76868 4.9 1.12 11.1 0.24
## Sand Silt Clay DPL Emer Weeds Cont Irrig D50 Yield Fert N P K
## 318 46 34 20 3 4 3 5 5 46 11430 158 62.2 72 23.4
## 319 44 33 23 3 4 3 5 5 46 9985 158 62.2 72 23.4
## 320 35 42 23 3 4 3 5 5 46 9260 158 62.2 72 23.4
## 321 41 35 24 3 4 3 4 4 46 9130 158 62.2 72 23.4
## 322 46 30 24 3 4 3 5 4 46 8970 158 62.2 72 23.4
## 323 42 33 25 3 3 3 4 4 46 9235 158 62.2 72 23.4
## Var Northing Easting Farmer
## 318 1 6364366 240816.7 D
## 319 1 6364472 241105.8 D
## 320 1 6364505 241373.7 D
## 321 1 6364690 241201.7 D
## 322 1 6364924 241200.5 D
## 323 1 6364625 240812.5 D
nrow(data.Set3)
## [1] 323
plot(data.Set3$Northing)
par(mai = c(1,1,1,1))
boxplot(Yield ~ Farmer, data = data.Set3,
main = "Rice Yields by Farmer", xlab = "Farmer", cex.main = 2,
ylab = "Yield (kg/ha)", cex.lab = 1.5) # Fig. 7.14
data.Set3$Location <- "Center"
data.Set3$Location[(data.Set3$Northing > 6340000)] <- "North"
data.Set3$Location[(data.Set3$Northing < 6280000)] <- "South"
trellis.par.set(par.main.text = list(cex = 2))
trellis.device(color = FALSE)
data.Set3$YearFac <- factor(data.Set3$Season,
labels = c("2002-03", "2003-04", "2004-05"))
bwplot(Yield ~ Location | YearFac, data = data.Set3,
main = "Rice Yields by Season", # Fig. 7.15
xlab = "Location", ylab = "Yield (kg/ha)", layout = c(3,1),
aspect = 1)
data.Set3$LocN <- 2
data.Set3$LocN[(data.Set3$Northing > 6340000)] <- 1
data.Set3$LocN[(data.Set3$Northing < 6280000)] <- 3
data.Set3$LocSeason <- with(data.Set3, Season + 10 * LocN)
print(round(mean.yields <- tapply(data.Set3$Yield,
data.Set3$LocSeason, mean)))
## 12 13 21 22 23 32 33
## 8605 9173 5999 5900 5909 8832 7064
unique(data.Set3$RiceYear)
## [1] 1 2
sort(unique(data.Set3$Field[which(data.Set3$RiceYear == 1)]))
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
sort(unique(data.Set3$Field[which(data.Set3$RiceYear == 2)]))
## [1] 3 5 12 13 14 16
t.test(data.Set3$Yield[which((data.Set3$Field == 3)
& (data.Set3$RiceYear == 1))],
data.Set3$Yield[which((data.Set3$Field == 3)
& (data.Set3$RiceYear == 2))])
##
## Welch Two Sample t-test
##
## data: data.Set3$Yield[which((data.Set3$Field == 3) & (data.Set3$RiceYear == and data.Set3$Yield[which((data.Set3$Field == 3) & (data.Set3$RiceYear == 1))] and 2))]
## t = 1.4235, df = 43.61, p-value = 0.1617
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -211.4933 1227.8933
## sample estimates:
## mean of x mean of y
## 8472.84 7964.64
t.test(data.Set3$Yield[which((data.Set3$Field == 5) & (data.Set3$RiceYear == 1))],
data.Set3$Yield[which((data.Set3$Field == 5) & (data.Set3$RiceYear == 2))])
##
## Welch Two Sample t-test
##
## data: data.Set3$Yield[which((data.Set3$Field == 5) & (data.Set3$RiceYear == and data.Set3$Yield[which((data.Set3$Field == 5) & (data.Set3$RiceYear == 1))] and 2))]
## t = 11.513, df = 19.493, p-value = 3.834e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1982.219 2861.236
## sample estimates:
## mean of x mean of y
## 7403.636 4981.909
t.test(data.Set3$Yield[which((data.Set3$Field == 12) & (data.Set3$RiceYear == 1))],
data.Set3$Yield[which((data.Set3$Field == 12) & (data.Set3$RiceYear == 2))])
##
## Welch Two Sample t-test
##
## data: data.Set3$Yield[which((data.Set3$Field == 12) & (data.Set3$RiceYear == and data.Set3$Yield[which((data.Set3$Field == 12) & (data.Set3$RiceYear == 1))] and 2))]
## t = -3.3103, df = 15.01, p-value = 0.004753
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1439.2478 -311.8292
## sample estimates:
## mean of x mean of y
## 6022.923 6898.462
t.test(data.Set3$Yield[which((data.Set3$Field == 13) & (data.Set3$RiceYear == 1))],
data.Set3$Yield[which((data.Set3$Field == 13) & (data.Set3$RiceYear == 2))])
##
## Welch Two Sample t-test
##
## data: data.Set3$Yield[which((data.Set3$Field == 13) & (data.Set3$RiceYear == and data.Set3$Yield[which((data.Set3$Field == 13) & (data.Set3$RiceYear == 1))] and 2))]
## t = -2.5365, df = 11.431, p-value = 0.02695
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1323.25115 -96.74885
## sample estimates:
## mean of x mean of y
## 5697.714 6407.714
t.test(data.Set3$Yield[which((data.Set3$Field == 14) & (data.Set3$RiceYear == 1))],
data.Set3$Yield[which((data.Set3$Field == 14) & (data.Set3$RiceYear == 2))])
##
## Welch Two Sample t-test
##
## data: data.Set3$Yield[which((data.Set3$Field == 14) & (data.Set3$RiceYear == and data.Set3$Yield[which((data.Set3$Field == 14) & (data.Set3$RiceYear == 1))] and 2))]
## t = 2.1041, df = 20.432, p-value = 0.04794
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 4.105528 819.394472
## sample estimates:
## mean of x mean of y
## 7153.417 6741.667
sort(unique(data.Set3$Field[which(data.Set3$Season == 1)]))
## [1] 5 6 7 8 9 10 11 12 13 14
sort(unique(data.Set3$Field[which(data.Set3$Season == 2)]))
## [1] 1 3 5 14 15
sort(unique(data.Set3$Field[which(data.Set3$Season == 3)]))
## [1] 2 3 4 12 13 16
data.Set3$YearFarmerField <- with(data.Set3, paste(as.character(Season),Farmer,
as.character(Field)))
print(tapply(data.Set3$Yield, data.Set3$YearFarmerField, mean),
digits = 4)
## 1 E 11 1 F 7 1 G 8 1 H 10 1 J 12 1 J 13 1 J 14 1 J 5 1 K 6 1 K 9
## 4771 6331 5559 5995 6898 6408 7153 7404 5904 4091
## 2 A 1 2 B 3 2 I 5 2 J 14 2 L 15 3 B 3 3 C 2 3 D 4 3 E 13 3 H 12
## 8738 8473 4982 6742 8832 7965 11456 9542 5698 6023
## 3 L 16
## 7064
A DEM of study area, downloaded from CGIAR site, can be cropped to match study area extent. In addition, slope can be calculated. The raster package will be our alliad to conduct such a task.
library(raster)
# This file was downloaded from the CGIAR site
# http://srtm.csi.cgiar.org
dem.ras <- raster("./set3/dem.asc")
projection(dem.ras) <- CRS("+proj=longlat +datum=WGS84")
range(data.Set3$Latitude)
## [1] -33.75644 -32.77624
range(data.Set3$Longitude)
## [1] -54.52640 -53.74849
crop.extent <- matrix(c(-54.6,-53.7,-33.8,-32.7),
nrow = 2, byrow = TRUE)
dem.Set3 <- crop(dem.ras, extent(crop.extent))
slope.Set3 <- slopeAspect(dem.Set3, type = "slope")
## Warning in slopeAspect(dem.Set3, type = "slope"): this function is
## deprecated. Please use function "terrain" instead
dem.Set3
## class : RasterLayer
## dimensions : 1320, 1080, 1425600 (nrow, ncol, ncell)
## resolution : 0.0008333333, 0.0008333333 (x, y)
## extent : -54.60042, -53.70042, -33.80042, -32.70042 (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0
## data source : in memory
## names : dem
## values : -4, 386 (min, max)
slope.Set3
## class : RasterStack
## dimensions : 1320, 1080, 1425600, 2 (nrow, ncol, ncell, nlayers)
## resolution : 0.0008333333, 0.0008333333 (x, y)
## extent : -54.60042, -53.70042, -33.80042, -32.70042 (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0
## names : slope, aspect
## min values : 0, 0
## max values : 0.4669779, 6.2746734
par(mfrow=c(1,2))
plot(dem.Set3, main="DEM of Study Area")
plot(slope.Set3[[1]], main="Slope of Study Area")
To find slopes in the fiels, copy data.Set3 to a new file and then convert it to a SpatialPointsDataFrame. Then, extract the slope values from the grid and calculates the histogram:
Set3.WGS <- data.Set3
coordinates(Set3.WGS) <- c("Longitude", "Latitude")
slopes <- extract(slope.Set3, Set3.WGS)
print(range(slopes), digits = 3)
## [1] 0.00 6.18
sort(unique(data.Set3$Field[which(slopes > 0.03)]))
## [1] 3 13 14
print(cor(data.Set3$Yield,slopes), digits = 2)
## slope aspect
## [1,] -0.034 0.11
summary(slopes)
## slope aspect
## Min. :0.000000 Min. :0.000
## 1st Qu.:0.004376 1st Qu.:1.571
## Median :0.006473 Median :2.767
## Mean :0.010050 Mean :2.971
## 3rd Qu.:0.011209 3rd Qu.:4.665
## Max. :0.089254 Max. :6.176
length(slopes)
## [1] 646
head(slopes)
## slope aspect
## [1,] 0.002104529 2.2712794
## [2,] 0.002104564 0.8703272
## [3,] 0.009015847 4.2440803
## [4,] 0.002104576 0.8703319
## [5,] 0.002104587 5.4128487
## [6,] 0.006313687 0.8703366
A set of box-and-whisker plot of rice yields for each farmer and season for the three regions follows:
bwplot(Yield ~ Farmer | Location + YearFac, data = data.Set3,
main = "Rice Yields by Farmer and Season",
xlab = "Farmer", ylab = "Yield (kg/ha)") #
Above plot shows that median yield in the central region is similar from one season to the next, although variability is a bit higher in the first season. Yield in the south is smaller, and yield variability in the north has increased a lot in the third year. A closer observation of above plot can be useful to answer following questions:
This study spanned four calendar years (three Southern Hemisphere summers). The variable Season represents growing season (i.e., 1,2, or 3). The function tapply() can be used to compute the mean yields over region. It also allows to obtain mean values of every variable over region (i.e. weeds, irrigation, SoilP, SoilK, and applied fertilizer).
# Check a few means
tapply(data.Set3$Fert, data.Set3$Location, mean)
## Center North South
## 109.4405 158.0000 125.3200
tapply(data.Set3$Weeds, data.Set3$Location, mean)
## Center North South
## 3.458333 2.314286 3.020000
tapply(data.Set3$Cont, data.Set3$Location, mean)
## Center North South
## 3.476190 4.752381 4.100000
tapply(data.Set3$Irrig, data.Set3$Location, mean)
## Center North South
## 3.357143 4.228571 3.800000
tapply(data.Set3$SoilP, data.Set3$Location, mean)
## Center North South
## 5.562500 6.157143 3.574000
tapply(data.Set3$SoilK, data.Set3$Location, mean)
## Center North South
## 0.2185119 0.2011429 0.3004000
tapply(data.Set3$N, data.Set3$Location, mean)
## Center North South
## 54.10952 62.20000 60.80800
tapply(data.Set3$P, data.Set3$Location, mean)
## Center North South
## 57.00476 72.00000 64.16800
tapply(data.Set3$K, data.Set3$Location, mean)
## Center North South
## 0.0 23.4 0.0
As seen in last results, farmers in the north applied the most fertilizer, farmers in the center applied the least, and farmers in the south applied in between. Did northen farmers applied too much fertilizer? Soil in the center is suitable for higher levels of fertilizer?
Temperature, hours of sun per day and rainfall were recorded three times a month at an agricultural experiment station. Let’s plot temperature, sun hour, and rainfall trends in each year.
wthr.data <- read.csv("./set3/Set3Weather.csv", header = TRUE)
par(mai = c(1,1,1,1))
plot(wthr.data$Temp0203, type = "l", lty = 2, xaxt = "n",
ylim = c(10,27), main = "Regional Temperature Trends",
ylab = expression(Temperature~"("*degree*C*")"),
xlab = "Month", cex.main = 2, cex.lab = 1.5) # Fig. 7.17a
lines(wthr.data$Temp0304, lty = 3)
lines(wthr.data$Temp0405, lty = 4)
lines(wthr.data$TempAvg, lty = 1)
axis(side = 1, at = c(1,4,7,10,13,16,19),
labels = c("Oct","Nov","Dec","Jan","Feb","Mar","Apr"))
legend(8, 15, c("2002-03", "2003-04", "2004-05",
"Normal"), lty = c(2,3,4,1))
d.0203 <- wthr.data$Temp0203 - wthr.data$TempAvg
d.0304 <- wthr.data$Temp0304 - wthr.data$TempAvg
d.0405 <- wthr.data$Temp0405 - wthr.data$TempAvg
plot(d.0203, type = "l", lty = 2, xaxt = "n",
main = "Regional Temperature Difference Trends", ylim = c(-5, 10),
ylab = expression(Temperature~"("*degree*C*")"),
xlab = "Month", cex.main = 2, cex.lab = 1.5)
lines(d.0304, lty = 3)
lines(d.0405, lty = 4) # Fig. 7.17b
segments(1, 0, 21, 0)
axis(side = 1, at = c(1,4,7,10,13,16,19),
labels = c("Oct","Nov","Dec","Jan","Feb","Mar","Apr"))
legend(4, 9, c("2002-03", "2003-04", "2004-05"), lty = c(2,3,4))
d.0203 <- wthr.data$Sun0203 - wthr.data$SunAvg
d.0304 <- wthr.data$Sun0304 - wthr.data$SunAvg
d.0405 <- wthr.data$Sun0405 - wthr.data$SunAvg
plot(d.0203, type = "l", lty = 2, xaxt = "n",
main = "Regional Sun Hour Difference Trends", ylim = c(-5, 5),
ylab = "Sun Hours Difference", xlab = "Month",
cex.main = 2, cex.lab = 1.5) # Fig. 7.17c
lines(d.0304, lty = 3)
lines(d.0405, lty = 4)
segments(1, 0, 21, 0)
axis(side = 1, at = c(1,4,7,10,13,16,19),
labels = c("Oct","Nov","Dec","Jan","Feb","Mar","Apr"))
legend(5, 5, c("2002-03", "2003-04", "2004-05"), lty = c(2,3,4))
d.0203 <- wthr.data$Rain0203 - wthr.data$RainAvg
d.0304 <- wthr.data$Rain0304 - wthr.data$RainAvg
d.0405 <- wthr.data$Rain0405 - wthr.data$RainAvg
plot(d.0203, type = "l", lty = 2, xaxt = "n",
main = "Regional Rain Difference Trends", ylim = c(-75,200),
ylab = "Rain Difference (cm)", xlab = "Month",
cex.main = 2, cex.lab = 1.5) # Fig. 7.17d
lines(d.0304, lty = 3)
lines(d.0405, lty = 4)
axis(side = 1, at = c(1,4,7,10,13,16,19),
labels = c("Oct","Nov","Dec","Jan","Feb","Mar","Apr"))
legend(3, 175, c("2002-03", "2003-04", "2004-05"), lty = c(2,3,4))
segments(1, 0, 21, 0)
From above temperature plot, it is clear that in each season the temperatures late in the season tended to be higher than normal, while these temperatures tended to be lower than normal earlier in the season. However, it is not easy to establish how much higher.
Regarding difference trends in terms of solar irradiation and rain, can you determine any particular pattern?
In fields under study, four rice varieties were planted. They are coded numerically in the data set. Let’s assign names to them and then compute mean yields by variety:
data.Set3$Variety <- "INIA Tacuarí"
data.Set3$Variety[which(data.Set3$Var == 2)] <- "El Pasol"
data.Set3$Variety[which(data.Set3$Var == 3)] <- "Perla"
data.Set3$Variety[which(data.Set3$Var == 4)] <- "INIA Olimar"
with(data.Set3, tapply(Yield, Variety, mean))
## El Pasol INIA Olimar INIA Tacuarí Perla
## 7064.099 7737.281 6931.632 9307.667
# pay attention to how to query which farmers planted a given variety
with(data.Set3, unique(Farmer[which(Variety == "INIA Tacuarí")]))
## [1] K I L D
## Levels: A B C D E F G H I J K L
with(data.Set3, unique(Farmer[which(Variety == "El Pasol")]))
## [1] K F J G H E A L C
## Levels: A B C D E F G H I J K L
with(data.Set3, unique(Farmer[which(Variety == "INIA Olimar")]))
## [1] B E H
## Levels: A B C D E F G H I J K L
with(data.Set3, unique(Farmer[which(Variety == "Perla")]))
## [1] A
## Levels: A B C D E F G H I J K L
library(lattice)
trellis.device(color = FALSE)
bwplot(Yield ~ Farmer | Variety, data = data.Set3, xlab = "Farmer")
Based on above code and plot answer these questions: 1. Is there a substantial difference in yields between rice varieties? 2. How many farmers planted all of the varieties? 3. How many farmers used INIA Olimar? 4. How many farmers used Perla?
According to results of this exploratory analysis, it can be said that any farmer(s) obtain consistently better yields than the rest? Does climate have an important infuence on year-to-year variability? Does terrain have much effect on yield? Does variety have a strong association with yield? Does any other management variable have influence on yield?
Do not forget to replicate this exercise by typing the code. Remember: do not copy and paste. When you copy and paste, you almost always take a macro view and conclude with ‘claro, ya entiendo la logica’ (but you don’t). When you type it in line by line, you are seeing each line and are forced to inspect it more (and to try your own variations to improve it). As a result, you learn much better.
Again, it would of benefit if you apply code here to your own data.
Plant, R.E., 2012. Spatial Data Analysis in Ecology and Agriculture using R. CRC Press, Boca Raton, FL.
Teetor, P., 2011. R Cookbox. O’Reilly, Sebastopol, CA.
Tukey, J.W., 1977. Exploratoty Data Analysis. Addison-Wesley, Reading, MA.