Reading and exploring a simple data set in csv format
Goal: Read and inspect a simple data set of the concentrations of 5 elements in soil samples at different horizons.
Packages to be used:
require(lattice)
require(scales)
We state the working directory where this script is developed for future reference (you can check the directory with getwd()) and define the name of the data directory
rwd <- "/media/alobo/KINGSTON/Rictja/RLab2"
dirdata <- "/media/alobo/KINGSTON/Rictja/RLab2/RLab2Data" #the actual path depends on your system!
…and make sure you read them right!
Select “Files” at the bottom rigth section of the RStudio GUI, then select the data directory and click on file kola.csv: the file will open in the editor. Give it a look taking care not to modify anything and close it to avoid problems. Navigate up to go back to the working directory. This data set is in text (ascii) csv format: samples ordered by rows, variables by columns, values separated by “,”. Let’s thus read it with function read.csv():
kola <- read.csv(file.path(dirdata,"mikola.csv"),
sep=",", header=TRUE,
stringsAsFactors=FALSE)
#units of [element] are mg/kg
Stating that the names of the column variables appear in the first line (“header”) and that “,” is the character separator between values, is strictly not required as those are the values assumed by default, but we state them explicitely for clarity. I routinely set “stringsAsFactors=FALSE” at reading and decide later on my own in the code whether (and how) converting to factor is appropriate.
Let’s have a look to the object “kola” that we have just created:
class(kola)
## [1] "data.frame"
dim(kola)
## [1] 1816 11
nrow(kola)
## [1] 1816
head(kola)
## ID XCOO YCOO ELEV H LITO Al K Pb S Cu
## 1 1 547960 7693790 135 M 20 71.2 4360 1.50 783 5.63
## 2 2 770025 7679167 140 M 4 245.0 4040 4.61 880 20.40
## 3 3 498651 7668151 255 M 31 103.0 4060 2.43 809 6.65
## 4 4 795152 7569386 240 M 20 307.0 4770 3.83 894 41.70
## 5 5 437050 7855900 80 M 10 253.0 4220 2.65 714 3.85
## 6 6 752106 7626023 140 M 20 238.0 5560 3.18 953 28.90
tail(kola)
## ID XCOO YCOO ELEV H LITO Al K Pb S Cu
## 1811 786 672583 7430000 160 C 1 9510 1000 1.5 24 17.8
## 1812 901 625409 7700000 210 C 52 6090 400 1.4 27 15.2
## 1813 902 657495 7740000 90 C 21 21300 3000 7.5 66 38.3
## 1814 903 685682 7740000 170 C 9 22600 1800 20.2 48 36.7
## 1815 904 683207 7730000 180 C 4 6310 1400 2.8 68 19.8
## 1816 905 680047 7720000 200 C 21 8430 2000 3.9 46 19.5
The meaning of the variables is straigthforward from their names, except H, which is soil horizon (“M”,“0”,“C” stand for, respectively, “mosses”, “horizon 0” and “horizon C”), and LITO, which is the litologic class (unfortunately the meaning of its coding is not provided in the aforementioned data source).
A summary of the data should let us know that the values are within reasonable limits:
summary(kola)
## ID XCOO YCOO ELEV
## Min. : 1 Min. :372602 Min. :7370000 Min. : 15
## 1st Qu.:185 1st Qu.:498651 1st Qu.:7510000 1st Qu.:140
## Median :383 Median :600600 Median :7610000 Median :200
## Mean :386 Mean :602356 Mean :7611125 Mean :204
## 3rd Qu.:579 3rd Qu.:700690 3rd Qu.:7700000 3rd Qu.:260
## Max. :906 Max. :861309 Max. :7890000 Max. :540
##
## H LITO Al K
## Length:1816 Min. : 1.0 Min. : 34 Min. : 100
## Class :character 1st Qu.: 4.0 1st Qu.: 290 1st Qu.: 900
## Mode :character Median : 20.0 Median : 1975 Median : 1300
## Mean : 21.2 Mean : 5193 Mean : 2257
## 3rd Qu.: 31.0 3rd Qu.: 7065 3rd Qu.: 3800
## Max. :107.0 Max. :85900 Max. :11000
## NA's :74
## Pb S Cu
## Min. : 0.3 Min. : 2 Min. : 2
## 1st Qu.: 1.9 1st Qu.: 46 1st Qu.: 6
## Median : 3.7 Median : 860 Median : 11
## Mean : 10.2 Mean : 830 Mean : 27
## 3rd Qu.: 14.9 3rd Qu.:1370 3rd Qu.: 21
## Max. :1110.0 Max. :3830 Max. :4080
## NA's :1
Our attention is called by variable “LITO”
table(kola$LITO)
##
## 1 4 6 7 9 10 20 21 22 31 32 51 52 81 82 83 105 107
## 369 70 25 118 68 145 291 67 45 125 101 190 86 3 17 11 8 3
This is an integer actually representing a categorical variable (“LITO” 2 is not twice as “LITO” 1), thus should be represented by a character (or a factor). We thus transform:
kola$LITO <- as.character(kola$LITO)
head(kola,3)
## ID XCOO YCOO ELEV H LITO Al K Pb S Cu
## 1 1 547960 7693790 135 M 20 71.2 4360 1.50 783 5.63
## 2 2 770025 7679167 140 M 4 245.0 4040 4.61 880 20.40
## 3 3 498651 7668151 255 M 31 103.0 4060 2.43 809 6.65
Boxplots are a very useful tool to graphically represent a good abridged description of the distribution. The box limits represent the 1st (“Q1”) and 3rd (“Q3”) quartile (thus the limits of the 25% and 75% of the distribution, known as the Inter-Quartile Range, IQR), with a horizontal thicker line indicating the value of the median (50% of the distribution). The vertical lines (“whiskers”) expand -/+ 1.5 * IQR (or the actual min/max of the data in case 1.5IQR expand outside these values). Values outside the 1.5IQR are individually represented by “*“, and can be regarded as suspect of being outliers. It is instructive seeing the boxplot of a gaussian variable compared to its PDF: Wikipedia Graphic of Boxplot vs PDF.
Very often boxplots are used to inspect the distribution of a quantitative variable stratified by the levels of a categorical variable. In our example, we inspect the distributions of the element concentrations at different soil horizons. For Al:
boxplot(Al~H,data=kola, notch=TRUE)
Not too nice or useful plot. Note 2 problems:
To solve the ordering issue, we explicitely state the levels at converting to factor. To solve the flattening we customize axis limits:
kola$H <- factor(kola$H, levels= c("M","0","C"))
boxplot(Al~H,data=kola,ylim=c(0,20000))
Still not a very informative plot. It is apparent that there are quite a number of observations much higher than Q3. Let’s inspect the histogram:
hist(kola$Al)
The histogram indicates a distribution much closer to log-normal than to gaussian, as it is common for element concentrations. We can either log transform the variables or indicate log scale in the plot. We take the second option to keep more interpretable values in the axis:
boxplot(Al~H,data=kola, log="y", main="Al")
boxplot(Pb~H,data=kola, log="y",main="Pb")
boxplot(S~H, data=kola, log="y",main="S")
boxplot(Cu~H,data=kola, log="y",main="Cu")
Comparing thr 4 elements would be easier in one single view, which we can achieve by splitting the graphic in 2 rows x 2 columns:
par(mfrow=c(2,2))
boxplot(Al~H,data=kola, log="y", main="Al")
boxplot(Pb~H,data=kola, log="y",main="Pb")
boxplot(S~H, data=kola, log="y",main="S")
boxplot(Cu~H,data=kola, log="y",main="Cu")
To save the plot you can use Export in the Plots window or do it by command:
bmp ("kolaplots.bmp",height=512*4/5, width=512)
par(mfrow=c(2,2))
boxplot(Al~H,data=kola, log="y", main="Al")
boxplot(Pb~H,data=kola, log="y",main="Pb")
boxplot(S~H, data=kola, log="y",main="S")
boxplot(Cu~H,data=kola, log="y",main="Cu")
dev.off() #do not forget this!
## RStudioGD
## 2
Review what we have just done:
we first opened a graphic device as a file with bmp(), then made the plot as above and finally closed the graphic device with dev.off(). Note that only after dev.off we got our bmp file created in the directory.
“Monchegorsk is a center of nickel and copper production (a Norilsk Nickel plant is located here), and the area surrounding the town is severely polluted. It is one of the most polluted towns in Russian Federation”. Wikipedia Monchgorsk.
Let’s find out the coordinates of the highest [Cu]:
which.max(kola$Cu)
## [1] 669
kola[which.max(kola$Cu),]
## ID XCOO YCOO ELEV H LITO Al K Pb S Cu
## 669 86 743131 7550000 200 0 20 9090 300 33.8 2920 4080
This is close to the actual location of the plant
polplant <- c(748281.486,7546547.750)#UTM coordinates of the plant Monchegorsk
We use the coordinates to calculate the Euclidean distance from each sampling site to the plant and insert the new variable in the data frame:
dist2polplant <- sqrt((kola[,2]-polplant[1])^2 + (kola[,3]-polplant[2])^2)
kola <- cbind(kola[,1:6],dist=dist2polplant,kola[,7:ncol(kola)])
head(kola,3)
## ID XCOO YCOO ELEV H LITO dist Al K Pb S Cu
## 1 1 547960 7693790 135 M 20 248614 71.2 4360 1.50 783 5.63
## 2 2 770025 7679167 140 M 4 134390 245.0 4040 4.61 880 20.40
## 3 3 498651 7668151 255 M 31 277674 103.0 4060 2.43 809 6.65
A simple bulk plot of [Cu] vs distance to the plant:
plot(kola$dist,log10(kola$Cu),cex=0.5)
…is a bit confusing. As pollution is wind-driven, we can conjecture that some directions (those of prevailing winds) will show a much tighter [Cu] vs. distance relationship. Therefore, we calculate the angle of the direction from the plant to every site.
The tangent of the direction angle is the ratio of the difference between the respective y and x coordinates: (ys-yp)/(xs-xp), We take the arctan to obtain the actual angle using atan2(), taking care of converting from radians (as atan2() output is in radians) to degrees. We subtract from 90 and convert eventual negative values to obtain the direction angle from North between the plant and each sampling site:
ang <- 90-(atan2(y=(kola[,3]-polplant[2]),x=(kola[,2]-polplant[1])))*180/pi
ang[ang<0] <- ang[ang<0] + 360
summary(ang)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3 236.0 283.0 256.0 315.0 360.0
hist(ang)
In order to study the [Cu] vs distance relationship by direction sectors, we create a new variable of angular intervals at 22.5 deg resolution
angc1 <- cut(ang,breaks=seq(0,360,by=22.5))
length(levels(angc1))
## [1] 16
cbind(1:16,levels(angc1)) #trick to inspect the result in vertical order
## [,1] [,2]
## [1,] "1" "(0,22.5]"
## [2,] "2" "(22.5,45]"
## [3,] "3" "(45,67.5]"
## [4,] "4" "(67.5,90]"
## [5,] "5" "(90,112]"
## [6,] "6" "(112,135]"
## [7,] "7" "(135,158]"
## [8,] "8" "(158,180]"
## [9,] "9" "(180,202]"
## [10,] "10" "(202,225]"
## [11,] "11" "(225,248]"
## [12,] "12" "(248,270]"
## [13,] "13" "(270,292]"
## [14,] "14" "(292,315]"
## [15,] "15" "(315,338]"
## [16,] "16" "(338,360]"
Interval “(22.5,45]” means higher than 22.5 and smaller than or equal to 45.
We create another set of intervals at 45 deg resolution, for which we just recode the previous variable abgc1 (note the interval named “0” includes both 0 to 22.5 and 338 to 360):
angc2 <- angc1
levels(angc2) <- as.character(c(0,45,45,90,90,135,135,180,180,225,225,270,270,315,315,0))
levels(angc2)
## [1] "0" "45" "90" "135" "180" "225" "270" "315"
Let’s add the new variables to the data frame:
kola2 <- kola
kola2 <- cbind(kola[,1:7],ang=ang,angc1=angc1,angc2=angc2,kola[,-(1:7)])
head(kola2,3)
## ID XCOO YCOO ELEV H LITO dist ang angc1 angc2 Al K
## 1 1 547960 7693790 135 M 20 248614 306.317 (292,315] 315 71.2 4360
## 2 2 770025 7679167 140 M 4 134390 9.311 (0,22.5] 0 245.0 4040
## 3 3 498651 7668151 255 M 31 277674 295.972 (292,315] 315 103.0 4060
## Pb S Cu
## 1 1.50 783 5.63
## 2 4.61 880 20.40
## 3 2.43 809 6.65
And display the boxplots
boxplot(Cu~angc1,data=kola,las=2,cex.axis=0.7,log="y",notch=TRUE)
## Warning: some notches went outside hinges ('box'): maybe set notch=FALSE
boxplot(Cu~angc2,data=kola, log="y",notch=TRUE)
Both boxplots indicate a lower [Cu] for those sites located SW to NW from the plant. We can also create an aggregated data set, with mean concentrations for each direction
kola2.ang <- aggregate(kola2[,-(1:10)],by=list(kola2$angc2),FUN=median, na.rm=TRUE)
kola2.ang
## Group.1 Al K Pb S Cu
## 1 0 1775 1300 4.345 948.0 19.100
## 2 45 1510 1400 3.050 861.0 15.100
## 3 90 3910 2900 5.600 873.0 17.300
## 4 135 3680 1500 4.450 952.0 16.600
## 5 180 2680 1450 4.600 1010.0 22.250
## 6 225 2080 1200 3.590 860.0 8.400
## 7 270 2075 1200 3.260 810.5 7.565
## 8 315 1695 1400 3.800 865.5 11.300
Although as distributions are closer to lognormal, the exponential of the mean (actually better the median) of the logarithms is more meaningful
mifun <- function(x,na.rm=TRUE) {exp(median(log(x),na.rm=na.rm))}
kola2.ang <- aggregate(kola2[,-(1:10)],by=list(kola2$angc2),FUN=mifun, na.rm=TRUE)
kola2.ang
## Group.1 Al K Pb S Cu
## 1 0 1775 1300 4.345 948.0 19.100
## 2 45 1510 1400 3.047 861.0 15.100
## 3 90 3865 2900 5.596 873.0 17.299
## 4 135 3680 1500 4.450 952.0 16.600
## 5 180 2680 1449 4.599 1010.0 22.250
## 6 225 2080 1200 3.590 860.0 8.400
## 7 270 2075 1200 3.260 810.5 7.565
## 8 315 1695 1400 3.800 865.5 11.300
Our goal here is to compare plots such as the one we made in section 3.1 but for different directions according to the variables we just made. If we want to compare, better we set common limits:
xl <- range(kola2$dist)
yl <- range(log10(kola2$Cu))
and then select by angle:
sel <- kola2[kola2$angc2 == "90",]
dim(sel)
## [1] 46 15
plot(sel$dist,log10(sel$Cu),col=sel$H,pch=19, xlim=xl, ylim=yl)
legend("topright",col=1:3,pch=19,legend=levels(sel$H))
better in one single display:
par(mfrow=c(2,2))
sel <- kola2[kola2$angc2 == "0",]
plot(sel$dist,log10(sel$Cu),xlim=xl, ylim=yl,type="n",main="Direction 0")
text(sel$dist,log10(sel$Cu),label=sel$H,cex=0.7,col=as.numeric(sel$H))
sel <- kola2[kola2$angc2 == "45",]
plot(sel$dist,log10(sel$Cu),type="n", xlim=xl, ylim=yl,main="Direction 45")
text(sel$dist,log10(sel$Cu),label=sel$H,cex=0.7,col=as.numeric(sel$H))
sel <- kola2[kola2$angc2 == "180",]
plot(sel$dist,log10(sel$Cu),type="n", xlim=xl, ylim=yl, main="Direction 180")
text(sel$dist,log10(sel$Cu),label=sel$H,cex=0.7,col=as.numeric(sel$H))
sel <- kola2[kola2$angc2 == "270",]
plot(sel$dist,log10(sel$Cu),type="n", xlim=xl, ylim=yl, main="Direction 270")
text(sel$dist,log10(sel$Cu),label=sel$H,cex=0.7,col=as.numeric(sel$H))
Gridded (lattice) plots are actually meant for this kind of panel plots (we stated require(lattice) at the begining of the script for using this function)
kola2$angc2 <- factor(kola2$angc2, levels=c(0,45,90,135,180,225,270,315))
xyplot(log10(Cu)~dist | angc1,groups=H,data=kola2,auto.key=TRUE)
xyplot(log10(Cu)~dist | angc2,groups=H,data=kola2,auto.key=TRUE)
We have gained much more insight with the last plot: the main reason why we were seeing lower values for sites located W from the plant is just that there are many more sites at a longer distance in that direction. Also, there is a strong indication of the presence of an additional source point located at 315 deg from the Monchegorsk plant. Finally, we observe a much stronger relationship between [Cu] and distance to the plant for shallower horizons. We can further explore this later finding by dedicated plots
sel <- kola2[kola2$angc2 == "0",]
xyplot(log10(Cu)~dist | H, data=sel, main="Direction 0")
sel <- kola2[kola2$angc2 == "90",]
xyplot(log10(Cu)~dist | H, data=sel, main="Direction 90")
sel <- kola2[kola2$angc2 == "180",]
xyplot(log10(Cu)~dist | H, data=sel, main="Direction 180")
sel <- kola2[kola2$angc2 == "315",]
xyplot(log10(Cu)~dist | H, data=sel, main="Direction 315")
An initial display of the realtionships among variables is presented by function pairs()
pairs(kola2[,11:15], log="xy")
which can actually be enriched by displaying each horizon with a different color and adding transparency to provide information on density (package scales required at the begining of the script for function alpha())
pairs(kola2[,11:15], log="xy", pch=20,
col=alpha(as.numeric(kola2$H),alpha=0.25))