Exercise R Lab 2.1: Pollutants in soils of the Kola Peninsula

Reading and exploring a simple data set in csv format

Agustin.Lobo@ictja.csic.es
20150121
Data compiled from those in the web pages of the book Statistical Data Analysis Explained. Applied Environmental Statistics with R
R script alobokola2_log.R

Goal: Read and inspect a simple data set of the concentrations of 5 elements in soil samples at different horizons.

Get organized

You should have a directory for this exercise (i.e. RLab2) and a subdirectory (i.e RLab2Data) with the kola2.csv file.
Create a new project in the existing RLab2 directory:
- File/New Project/Existing Directory
Create a new script for your exercise (File/New File/R Script) and name it kola2_log.R (or open the provided alobokola2_log.R file, save it with a different name and edit it at your convenience).

Required packages and paths

Packages to be used:

require(lattice)
require(scales)

We state the working directory where this script is developed for future reference (you can check the directory with getwd()) and define the name of the data directory

rwd     <- "/media/alobo/KINGSTON/Rictja/RLab2"
dirdata <- "/media/alobo/KINGSTON/Rictja/RLab2/RLab2Data" #the actual path depends on your system!

1. Read the data

…and make sure you read them right!
Select “Files” at the bottom rigth section of the RStudio GUI, then select the data directory and click on file kola.csv: the file will open in the editor. Give it a look taking care not to modify anything and close it to avoid problems. Navigate up to go back to the working directory. This data set is in text (ascii) csv format: samples ordered by rows, variables by columns, values separated by “,”. Let’s thus read it with function read.csv():

kola <- read.csv(file.path(dirdata,"mikola.csv"),
                 sep=",", header=TRUE, 
                 stringsAsFactors=FALSE)
#units of [element] are mg/kg

Stating that the names of the column variables appear in the first line (“header”) and that “,” is the character separator between values, is strictly not required as those are the values assumed by default, but we state them explicitely for clarity. I routinely set “stringsAsFactors=FALSE” at reading and decide later on my own in the code whether (and how) converting to factor is appropriate.
Let’s have a look to the object “kola” that we have just created:

class(kola)

## [1] "data.frame"

dim(kola)

## [1] 1816   11

nrow(kola)

## [1] 1816

head(kola)

##   ID   XCOO    YCOO ELEV H LITO    Al    K   Pb   S    Cu
## 1  1 547960 7693790  135 M   20  71.2 4360 1.50 783  5.63
## 2  2 770025 7679167  140 M    4 245.0 4040 4.61 880 20.40
## 3  3 498651 7668151  255 M   31 103.0 4060 2.43 809  6.65
## 4  4 795152 7569386  240 M   20 307.0 4770 3.83 894 41.70
## 5  5 437050 7855900   80 M   10 253.0 4220 2.65 714  3.85
## 6  6 752106 7626023  140 M   20 238.0 5560 3.18 953 28.90

tail(kola)

##       ID   XCOO    YCOO ELEV H LITO    Al    K   Pb  S   Cu
## 1811 786 672583 7430000  160 C    1  9510 1000  1.5 24 17.8
## 1812 901 625409 7700000  210 C   52  6090  400  1.4 27 15.2
## 1813 902 657495 7740000   90 C   21 21300 3000  7.5 66 38.3
## 1814 903 685682 7740000  170 C    9 22600 1800 20.2 48 36.7
## 1815 904 683207 7730000  180 C    4  6310 1400  2.8 68 19.8
## 1816 905 680047 7720000  200 C   21  8430 2000  3.9 46 19.5

The meaning of the variables is straigthforward from their names, except H, which is soil horizon (“M”,“0”,“C” stand for, respectively, “mosses”, “horizon 0” and “horizon C”), and LITO, which is the litologic class (unfortunately the meaning of its coding is not provided in the aforementioned data source).
A summary of the data should let us know that the values are within reasonable limits:

summary(kola)

##        ID           XCOO             YCOO              ELEV    
##  Min.   :  1   Min.   :372602   Min.   :7370000   Min.   : 15  
##  1st Qu.:185   1st Qu.:498651   1st Qu.:7510000   1st Qu.:140  
##  Median :383   Median :600600   Median :7610000   Median :200  
##  Mean   :386   Mean   :602356   Mean   :7611125   Mean   :204  
##  3rd Qu.:579   3rd Qu.:700690   3rd Qu.:7700000   3rd Qu.:260  
##  Max.   :906   Max.   :861309   Max.   :7890000   Max.   :540  
##                                                                
##       H                  LITO             Al              K        
##  Length:1816        Min.   :  1.0   Min.   :   34   Min.   :  100  
##  Class :character   1st Qu.:  4.0   1st Qu.:  290   1st Qu.:  900  
##  Mode  :character   Median : 20.0   Median : 1975   Median : 1300  
##                     Mean   : 21.2   Mean   : 5193   Mean   : 2257  
##                     3rd Qu.: 31.0   3rd Qu.: 7065   3rd Qu.: 3800  
##                     Max.   :107.0   Max.   :85900   Max.   :11000  
##                     NA's   :74                                     
##        Pb               S              Cu      
##  Min.   :   0.3   Min.   :   2   Min.   :   2  
##  1st Qu.:   1.9   1st Qu.:  46   1st Qu.:   6  
##  Median :   3.7   Median : 860   Median :  11  
##  Mean   :  10.2   Mean   : 830   Mean   :  27  
##  3rd Qu.:  14.9   3rd Qu.:1370   3rd Qu.:  21  
##  Max.   :1110.0   Max.   :3830   Max.   :4080  
##  NA's   :1

Our attention is called by variable “LITO”

table(kola$LITO)

## 
##   1   4   6   7   9  10  20  21  22  31  32  51  52  81  82  83 105 107 
## 369  70  25 118  68 145 291  67  45 125 101 190  86   3  17  11   8   3

This is an integer actually representing a categorical variable (“LITO” 2 is not twice as “LITO” 1), thus should be represented by a character (or a factor). We thus transform:

kola$LITO <- as.character(kola$LITO)
head(kola,3)

##   ID   XCOO    YCOO ELEV H LITO    Al    K   Pb   S    Cu
## 1  1 547960 7693790  135 M   20  71.2 4360 1.50 783  5.63
## 2  2 770025 7679167  140 M    4 245.0 4040 4.61 880 20.40
## 3  3 498651 7668151  255 M   31 103.0 4060 2.43 809  6.65

2. An initial inspection to the data

Boxplots are a very useful tool to graphically represent a good abridged description of the distribution. The box limits represent the 1st (“Q1”) and 3rd (“Q3”) quartile (thus the limits of the 25% and 75% of the distribution, known as the Inter-Quartile Range, IQR), with a horizontal thicker line indicating the value of the median (50% of the distribution). The vertical lines (“whiskers”) expand -/+ 1.5 * IQR (or the actual min/max of the data in case 1.5IQR expand outside these values). Values outside the 1.5IQR are individually represented by “*“, and can be regarded as suspect of being outliers. It is instructive seeing the boxplot of a gaussian variable compared to its PDF: Wikipedia Graphic of Boxplot vs PDF.
Very often boxplots are used to inspect the distribution of a quantitative variable stratified by the levels of a categorical variable. In our example, we inspect the distributions of the element concentrations at different soil horizons. For Al:

boxplot(Al~H,data=kola, notch=TRUE)

plot of chunk unnamed-chunk-8

Not too nice or useful plot. Note 2 problems:

order of H does not follow depth (levels follow the by default alphabetical order as were made “on the fly” by boxplot() from the character vector H)
boxes are too flattened to appreciate the details.

To solve the ordering issue, we explicitely state the levels at converting to factor. To solve the flattening we customize axis limits:

kola$H <- factor(kola$H, levels= c("M","0","C"))
boxplot(Al~H,data=kola,ylim=c(0,20000))

plot of chunk unnamed-chunk-9

Still not a very informative plot. It is apparent that there are quite a number of observations much higher than Q3. Let’s inspect the histogram:

hist(kola$Al)

plot of chunk unnamed-chunk-10

The histogram indicates a distribution much closer to log-normal than to gaussian, as it is common for element concentrations. We can either log transform the variables or indicate log scale in the plot. We take the second option to keep more interpretable values in the axis:

boxplot(Al~H,data=kola, log="y", main="Al")

plot of chunk unnamed-chunk-11

boxplot(Pb~H,data=kola, log="y",main="Pb")

plot of chunk unnamed-chunk-11

boxplot(S~H, data=kola, log="y",main="S")

plot of chunk unnamed-chunk-11

boxplot(Cu~H,data=kola, log="y",main="Cu")

plot of chunk unnamed-chunk-11

Comparing thr 4 elements would be easier in one single view, which we can achieve by splitting the graphic in 2 rows x 2 columns:

par(mfrow=c(2,2))
boxplot(Al~H,data=kola, log="y", main="Al")
boxplot(Pb~H,data=kola, log="y",main="Pb")
boxplot(S~H, data=kola, log="y",main="S")
boxplot(Cu~H,data=kola, log="y",main="Cu")

plot of chunk unnamed-chunk-12

To save the plot you can use Export in the Plots window or do it by command:

bmp ("kolaplots.bmp",height=512*4/5, width=512)
par(mfrow=c(2,2))
boxplot(Al~H,data=kola, log="y", main="Al")
boxplot(Pb~H,data=kola, log="y",main="Pb")
boxplot(S~H, data=kola, log="y",main="S")
boxplot(Cu~H,data=kola, log="y",main="Cu")
dev.off() #do not forget this!

## RStudioGD 
##         2

Review what we have just done:
we first opened a graphic device as a file with bmp(), then made the plot as above and finally closed the graphic device with dev.off(). Note that only after dev.off we got our bmp file created in the directory.

3. Exploratory graphics: Element concentrations vs. distance to source point

“Monchegorsk is a center of nickel and copper production (a Norilsk Nickel plant is located here), and the area surrounding the town is severely polluted. It is one of the most polluted towns in Russian Federation”. Wikipedia Monchgorsk.
Let’s find out the coordinates of the highest [Cu]:

which.max(kola$Cu)

## [1] 669

kola[which.max(kola$Cu),]

##     ID   XCOO    YCOO ELEV H LITO   Al   K   Pb    S   Cu
## 669 86 743131 7550000  200 0   20 9090 300 33.8 2920 4080

This is close to the actual location of the plant

polplant <- c(748281.486,7546547.750)#UTM coordinates of the plant Monchegorsk

3.1. Distance from plant to sites

We use the coordinates to calculate the Euclidean distance from each sampling site to the plant and insert the new variable in the data frame:

dist2polplant <- sqrt((kola[,2]-polplant[1])^2 + (kola[,3]-polplant[2])^2) 
kola <- cbind(kola[,1:6],dist=dist2polplant,kola[,7:ncol(kola)])
head(kola,3)

##   ID   XCOO    YCOO ELEV H LITO   dist    Al    K   Pb   S    Cu
## 1  1 547960 7693790  135 M   20 248614  71.2 4360 1.50 783  5.63
## 2  2 770025 7679167  140 M    4 134390 245.0 4040 4.61 880 20.40
## 3  3 498651 7668151  255 M   31 277674 103.0 4060 2.43 809  6.65

A simple bulk plot of [Cu] vs distance to the plant:

plot(kola$dist,log10(kola$Cu),cex=0.5)

plot of chunk unnamed-chunk-17

…is a bit confusing. As pollution is wind-driven, we can conjecture that some directions (those of prevailing winds) will show a much tighter [Cu] vs. distance relationship. Therefore, we calculate the angle of the direction from the plant to every site.

3.2. Angle from plant to sites

The tangent of the direction angle is the ratio of the difference between the respective y and x coordinates: (ys-yp)/(xs-xp), We take the arctan to obtain the actual angle using atan2(), taking care of converting from radians (as atan2() output is in radians) to degrees. We subtract from 90 and convert eventual negative values to obtain the direction angle from North between the plant and each sampling site:

ang <- 90-(atan2(y=(kola[,3]-polplant[2]),x=(kola[,2]-polplant[1])))*180/pi
ang[ang<0] <- ang[ang<0] + 360
summary(ang)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.3   236.0   283.0   256.0   315.0   360.0

hist(ang)

plot of chunk unnamed-chunk-18

In order to study the [Cu] vs distance relationship by direction sectors, we create a new variable of angular intervals at 22.5 deg resolution

angc1 <- cut(ang,breaks=seq(0,360,by=22.5))
length(levels(angc1))

## [1] 16

cbind(1:16,levels(angc1)) #trick to inspect the result in vertical order

##       [,1] [,2]       
##  [1,] "1"  "(0,22.5]" 
##  [2,] "2"  "(22.5,45]"
##  [3,] "3"  "(45,67.5]"
##  [4,] "4"  "(67.5,90]"
##  [5,] "5"  "(90,112]" 
##  [6,] "6"  "(112,135]"
##  [7,] "7"  "(135,158]"
##  [8,] "8"  "(158,180]"
##  [9,] "9"  "(180,202]"
## [10,] "10" "(202,225]"
## [11,] "11" "(225,248]"
## [12,] "12" "(248,270]"
## [13,] "13" "(270,292]"
## [14,] "14" "(292,315]"
## [15,] "15" "(315,338]"
## [16,] "16" "(338,360]"

Interval “(22.5,45]” means higher than 22.5 and smaller than or equal to 45.
We create another set of intervals at 45 deg resolution, for which we just recode the previous variable abgc1 (note the interval named “0” includes both 0 to 22.5 and 338 to 360):

angc2 <- angc1
levels(angc2) <- as.character(c(0,45,45,90,90,135,135,180,180,225,225,270,270,315,315,0))
levels(angc2)

## [1] "0"   "45"  "90"  "135" "180" "225" "270" "315"

Let’s add the new variables to the data frame:

kola2 <- kola
kola2 <- cbind(kola[,1:7],ang=ang,angc1=angc1,angc2=angc2,kola[,-(1:7)])
head(kola2,3)

##   ID   XCOO    YCOO ELEV H LITO   dist     ang     angc1 angc2    Al    K
## 1  1 547960 7693790  135 M   20 248614 306.317 (292,315]   315  71.2 4360
## 2  2 770025 7679167  140 M    4 134390   9.311  (0,22.5]     0 245.0 4040
## 3  3 498651 7668151  255 M   31 277674 295.972 (292,315]   315 103.0 4060
##     Pb   S    Cu
## 1 1.50 783  5.63
## 2 4.61 880 20.40
## 3 2.43 809  6.65

And display the boxplots

boxplot(Cu~angc1,data=kola,las=2,cex.axis=0.7,log="y",notch=TRUE)

## Warning: some notches went outside hinges ('box'): maybe set notch=FALSE

plot of chunk unnamed-chunk-22

boxplot(Cu~angc2,data=kola, log="y",notch=TRUE)

plot of chunk unnamed-chunk-22

Both boxplots indicate a lower [Cu] for those sites located SW to NW from the plant. We can also create an aggregated data set, with mean concentrations for each direction

kola2.ang <- aggregate(kola2[,-(1:10)],by=list(kola2$angc2),FUN=median, na.rm=TRUE)
kola2.ang

##   Group.1   Al    K    Pb      S     Cu
## 1       0 1775 1300 4.345  948.0 19.100
## 2      45 1510 1400 3.050  861.0 15.100
## 3      90 3910 2900 5.600  873.0 17.300
## 4     135 3680 1500 4.450  952.0 16.600
## 5     180 2680 1450 4.600 1010.0 22.250
## 6     225 2080 1200 3.590  860.0  8.400
## 7     270 2075 1200 3.260  810.5  7.565
## 8     315 1695 1400 3.800  865.5 11.300

Although as distributions are closer to lognormal, the exponential of the mean (actually better the median) of the logarithms is more meaningful

mifun <- function(x,na.rm=TRUE) {exp(median(log(x),na.rm=na.rm))}
kola2.ang <- aggregate(kola2[,-(1:10)],by=list(kola2$angc2),FUN=mifun, na.rm=TRUE)
kola2.ang

##   Group.1   Al    K    Pb      S     Cu
## 1       0 1775 1300 4.345  948.0 19.100
## 2      45 1510 1400 3.047  861.0 15.100
## 3      90 3865 2900 5.596  873.0 17.299
## 4     135 3680 1500 4.450  952.0 16.600
## 5     180 2680 1449 4.599 1010.0 22.250
## 6     225 2080 1200 3.590  860.0  8.400
## 7     270 2075 1200 3.260  810.5  7.565
## 8     315 1695 1400 3.800  865.5 11.300

3.3. Concentrations vs distance at given directions

Our goal here is to compare plots such as the one we made in section 3.1 but for different directions according to the variables we just made. If we want to compare, better we set common limits:

xl <- range(kola2$dist)
yl <- range(log10(kola2$Cu))

and then select by angle:

sel <- kola2[kola2$angc2 == "90",]
dim(sel)

## [1] 46 15

plot(sel$dist,log10(sel$Cu),col=sel$H,pch=19, xlim=xl, ylim=yl)
legend("topright",col=1:3,pch=19,legend=levels(sel$H))

plot of chunk unnamed-chunk-26

better in one single display:

par(mfrow=c(2,2))
sel <- kola2[kola2$angc2 == "0",]
plot(sel$dist,log10(sel$Cu),xlim=xl, ylim=yl,type="n",main="Direction 0")
text(sel$dist,log10(sel$Cu),label=sel$H,cex=0.7,col=as.numeric(sel$H))

sel <- kola2[kola2$angc2 == "45",]
plot(sel$dist,log10(sel$Cu),type="n", xlim=xl, ylim=yl,main="Direction 45")
text(sel$dist,log10(sel$Cu),label=sel$H,cex=0.7,col=as.numeric(sel$H))

sel <- kola2[kola2$angc2 == "180",]
plot(sel$dist,log10(sel$Cu),type="n", xlim=xl, ylim=yl, main="Direction 180")
text(sel$dist,log10(sel$Cu),label=sel$H,cex=0.7,col=as.numeric(sel$H))

sel <- kola2[kola2$angc2 == "270",]
plot(sel$dist,log10(sel$Cu),type="n", xlim=xl, ylim=yl, main="Direction 270")
text(sel$dist,log10(sel$Cu),label=sel$H,cex=0.7,col=as.numeric(sel$H))

plot of chunk unnamed-chunk-27

Gridded (lattice) plots are actually meant for this kind of panel plots (we stated require(lattice) at the begining of the script for using this function)

kola2$angc2 <- factor(kola2$angc2, levels=c(0,45,90,135,180,225,270,315))
xyplot(log10(Cu)~dist | angc1,groups=H,data=kola2,auto.key=TRUE)

plot of chunk unnamed-chunk-28

xyplot(log10(Cu)~dist | angc2,groups=H,data=kola2,auto.key=TRUE)

plot of chunk unnamed-chunk-28

We have gained much more insight with the last plot: the main reason why we were seeing lower values for sites located W from the plant is just that there are many more sites at a longer distance in that direction. Also, there is a strong indication of the presence of an additional source point located at 315 deg from the Monchegorsk plant. Finally, we observe a much stronger relationship between [Cu] and distance to the plant for shallower horizons. We can further explore this later finding by dedicated plots

sel <- kola2[kola2$angc2 == "0",]
xyplot(log10(Cu)~dist | H, data=sel, main="Direction 0")

plot of chunk unnamed-chunk-29

sel <- kola2[kola2$angc2 == "90",]
xyplot(log10(Cu)~dist | H, data=sel, main="Direction 90")

plot of chunk unnamed-chunk-29

sel <- kola2[kola2$angc2 == "180",]
xyplot(log10(Cu)~dist | H, data=sel, main="Direction 180")

plot of chunk unnamed-chunk-29

sel <- kola2[kola2$angc2 == "315",]
xyplot(log10(Cu)~dist | H, data=sel, main="Direction 315")

plot of chunk unnamed-chunk-29

4. Relationships among variables

An initial display of the realtionships among variables is presented by function pairs()

pairs(kola2[,11:15], log="xy")

plot of chunk unnamed-chunk-30

which can actually be enriched by displaying each horizon with a different color and adding transparency to provide information on density (package scales required at the begining of the script for function alpha())

pairs(kola2[,11:15], log="xy", pch=20,
      col=alpha(as.numeric(kola2$H),alpha=0.25))

plot of chunk unnamed-chunk-31