Exercise R Lab 1

Anything after a hash sign “#” is considered “not R code” and it is not processed by R. Thus, the hash sign “#” is used to include comments and sometimes to prevent the execution (we often say “comment out”) of particular command lines that we want to keep in the script rather than to delete (i.e., to indicate an alternative way of doing a task).

In the R script of this page, you can observe than we often use the hash sign followed by an apostrophe in this way “#’”. This is a particular way of commenting that I will explain in future sessions (it has to do with the way of producing html pages such as the one you are reading out of R scripts).

Important

I assume here that you have installed both R and RStudio Desktop on your computer.
It is very important having a well organized directory in R. Please create an specific directory for the course, (i.e. Rcourse or RtalleR or whatever) and a subdirectory named RLab1. Furthermore, within RLab1 create a subdirectory named RLab1Data and copy this file in it:
https://dl.dropboxusercontent.com/u/3180464/taller/RLab1/mikola.csv
Avoid using the Desktop for the directory of the course. Better look for a rational location in your directory tree.
In general, avoid as much as possible having empty spaces embeded within file or directory names (better use “_" if you absolutely want to separate words in a name).

The RStudio GUI

We will be always using R through RStudio in this talleR. RStudio automatically starts R for you, please do not start both RStudio and then R on your own.
The RStudio GUI was briefly explained during the session and we will progressively learn more features along the sessions. RStudio is a pretty straigthforward and minimalist GUI, but if you need a guide Oscar Torres-Reyna from Princeton University has authored an excellent one: Introduction to RStudio
At the begining of the seesion, we started up RStudio, created a new project in your RLab1 directory (which we will call “R working directory” hereafter), and in few minutes learned:

that we must open a new R script and save it as RLab1.Rproj
that we must basically work by writing and executing commands in the script, leaving the console just to try some commands or perform intranscendental operations. I remind you that the fact that you organize your data analysis as a set of log scripts that eventually call other scripts and user functions warrantees that you keep your research documented, and thus auditable, which is essential fo reproducible research
how to quit (saving the session or not) RStudio
that after the first time for a given project, it is always better to start up RStudio by double clicking on the .Rproj file.

It is much shorter and more clear explaining the above issues “in vivo” than writing how to do it here. If you missed the session, please ask any of those who actually attended.

From now on in this page, I assume you have started RStudio and are working on your script named RLab1.R in your RLab1.Rproj project.

1. Scalars:

Numeric

a <- 12

The symbol “<-” stands for “assign”. Nowadays, “<-” has the same effect as “=”

a = 12
a

## [1] 12

Note that executing an object name as a command, has printed itself to the console. To check the type of object:

class(a)

## [1] "numeric"

You can also use simple mathematical operators:

b = a^2+3
x <- sqrt(a) + b

Use ? or help() to get help. In the case of the arithmetic operators, you need “”:

?"+"
?sqrt
help(class)

Character

a <- "DIOGENES"
a

## [1] "DIOGENES"

class(a)

## [1] "character"

In future sessions, we will always include the path of the working directory in which the script was designed to work at the beginning. You can get the path of your current directory with

getwd()

## [1] "/media/alobo/KINGSTON/Rictja/RLab1"

and then save it as a character variable

rwd <- "/Volumes/KINGSTON/Rictja/RLab1/.RData"

so that you keep documented the context in which you wrote your script.

2. Vectors

a <- 1:10
a

##  [1]  1  2  3  4  5  6  7  8  9 10

a <- seq(from=1,to=10,by=1)
#a <- seq(from=1,to=10,by=0.1)
a <- c(1,2,3,4,5,6,7,8,9,10)

2.1 Common R functions to gather information from vector objects

length(a)

## [1] 10

summary(a)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    3.25    5.50    5.50    7.75   10.00

sum(a)/length(a)

## [1] 5.5

mean(a)

## [1] 5.5

sd(a)

## [1] 3.028

Note that functions use () and have named arguments
Names of the arguments can be omitted but then arguments must be in order We always use “=” (not “<-”) for named arguments in the function call

seq(from=1,to=3,by=.5)

## [1] 1.0 1.5 2.0 2.5 3.0

seq(1,3,.5)

## [1] 1.0 1.5 2.0 2.5 3.0

seq(by=0.5,from=1,to=3)

## [1] 1.0 1.5 2.0 2.5 3.0

seq(0.5,1,3) #wrong order

## [1] 0.5

?seq

In particular for beginners, it is better to name the arguments in the call, except for very simple functions

Character vectors

x <- c("A","B")
x

## [1] "A" "B"

class(x)

## [1] "character"

x <- as.character(a)
x

##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"

vectors must have all elements of the same type

x <- c(a,"A") #numeric "upgraded" to character
x

##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "A"

class(x)

## [1] "character"

Forcing a character to numeric renders a NA (“Not Available”, code for missing data)

x <- as.numeric(x)

## Warning: NAs introduced by coercion

##  [1]  1  2  3  4  5  6  7  8  9 10 NA

2.2. Subscripting

We use “[]”, not “()”: this indicates to us that the object is data and not a function

x[3]

## [1] 3

x[c(2,5,1)]

## [1] 2 5 1

ind <- c(3,1,4)
ind

## [1] 3 1 4

x[ind]

## [1] 3 1 4

x[3] == 6

## [1] FALSE

x[x>5]

## [1]  6  7  8  9 10 NA

x[which(x>5)]

## [1]  6  7  8  9 10

NA particularities

x[11]

## [1] NA

length(x)

## [1] 11

x[length(x)]

## [1] NA

x[11] == NA

## [1] NA

is.na(x[11])

## [1] TRUE

any(is.na(x))

## [1] TRUE

which(is.na(x))

## [1] 11

summary(x)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00    3.25    5.50    5.50    7.75   10.00       1

x +1

##  [1]  2  3  4  5  6  7  8  9 10 11 NA

mean(x)

## [1] NA

mean(x,na.rm=TRUE)

## [1] 5.5

x <- x[!is.na(x)]
x

##  [1]  1  2  3  4  5  6  7  8  9 10

shorter vector

x2 <- x[3:7]
x2 <- x[-c(1,2)]

Quick plots with vectors

a <- 1:10
b <- sqrt(a)
plot(a,b)

plot of chunk unnamed-chunk-18

plot(a,b,xlab="a", ylab="b = sqrt(a)", main="My first plot in R")

plot of chunk unnamed-chunk-18

plot(a,b,xlab="a", ylab="b = sqrt(a)", 
     xlim=c(0,15), ylim= c(0,10), 
     main="My first plot in R")
b2 <- b +3
points(a,b2,pch=19)
lines(a,b2)
legend("topright",pch=c(1,19), legend=c("b","b2"))

plot of chunk unnamed-chunk-18

dev.off()

## null device 
##           1

To check symbols and colors

plot(1:20,1:20,pch=1:20,col=1:20)

plot of chunk unnamed-chunk-19

3. Scripts and Custom functions

Let’s use the quadratic equation as an example:

a <- 3
b <- -5
c <- 1

The roots are:

x1 <- (-b + sqrt(b^2 - 4*a*c))/(2*a)
x2 <- (-b - sqrt(b^2 - 4*a*c))/(2*a)

Quick plot:

x <- seq(from=-10, to=10, by=0.1)
plot(x,a*x^2 + b*x + c,type="l",ylim=c(-10,10))
abline(a=0,b=0)
abline(h=0)
abline(v=x1,col="red")
abline(v=x2,col="green")

plot of chunk unnamed-chunk-22

Now let’s create your first script (actually your second one, as your first one is this one that you are writing now).
Copy all lines since the last heading to another script file and save it as “quad_log.R” Then let’s clear the graphics

dev.off()

## null device 
##           1

…and the variables

rm(x,x1,x2,a,b,c)

Check that these variable do not exist any more in your workspace, i.e.

## Error: object 'a' not found

and “source” the script file:

source("quad_log.R")

plot of chunk unnamed-chunk-26

Note that you get the plot and that the variables are back to your workspace:

x1

## [1] 1.434

x2

## [1] 0.2324

And now let’s create your first function: copy the follwing lines to another script file and save it as “miquad.R” (and remove the first “#” of each line)

# miquad <- function(v=c(3,-5,1))
#     # quadratic function for a vector of 3 real parameters
# {
#     v[1]*x^2 + v[2]*x + v[3]
# }

As before:

dev.off()

## null device 
##           1

rm(x,x1,x2,a,b,c)
source("miquad.R")

But now the source() has got different consequences:

miquad

## function(v=x, p=c(3,-5,1))
## # quadratic function for a vector of 3 real parameters
## {
##     p[1]*v^2 + p[2]*v + p[3]
## }

miquad()

## Error: object 'x' not found

Omitting the “()” implies that R just prints out the contents of the object, which in this case is a function. Adding the “()” has exectuted the function, but we get en error because the function requires an argument (v) that defaults to x, and optionally accepts a second argument (p). Let’s create x again and run miquad() with it:

x <- seq(from=-10, to=10, by=0.1)
xq <- miquad(v=x)
plot(x,miquad(x),type="l")

plot of chunk unnamed-chunk-31

plot(x,miquad(x,p=c(2,9,11)),type="l")

plot of chunk unnamed-chunk-31

## Error: object 'p' not found

## Error: object 'v' not found

The function just returns to the user workspace whatever it is in its last line (you can use return() to make it clear if you want), but internal (to the function) variables such as p or v in this case just vanish after the function completes.

4. Distributions

#set.seed(10)
a <- rnorm(n=1000,m=2,sd=2.5)
plot(1:1000,a)

plot of chunk unnamed-chunk-32

hist(a)

plot of chunk unnamed-chunk-32

mean(a)

## [1] 2.073

sd(a)

## [1] 2.496

b <- runif(n=10000,min=min(a), max=max(a))
hist(b,nclass=50)

plot of chunk unnamed-chunk-32

Robust statistics

b <- a
b[200] <- 1000*a[200] #make an outlier
summary(a)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -6.85    0.44    2.10    2.07    3.76    9.14

summary(b)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    -6.8     0.4     2.1     3.2     3.8  1180.0

hist(a)

plot of chunk unnamed-chunk-33

hist(b)

plot of chunk unnamed-chunk-33

In the presence of outliers, median and mad are better estimators of the “true” mean and sd

mean(a)

## [1] 2.073

median(a)

## [1] 2.096

mean(b)

## [1] 3.252

median(b)

## [1] 2.098

(mean(b)-mean(a))*100/mean(a)

## [1] 56.82

(median(b)-mean(a))*100/mean(a)

## [1] 1.164

(sd(b)-sd(a))*100/sd(a)

## [1] 1395

(mad(b)-sd(a))*100/sd(a)

## [1] -1.121

5. Compare distributions

Assume we have two populations:

a <- rnorm(n=100000,m=2,sd=2.5)
b <- rnorm(n=100000,m=2.5,sd=2.5)

and make 3 samples:

a1 <- sample(a,size=1000)
a2 <- sample(a,size=1000)
b1 <- sample(b,size=1000)

summary(a)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -9.450   0.313   2.000   2.000   3.700  13.200

summary(a1)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -7.16    0.49    2.18    2.15    3.92   10.00

summary(a2)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -6.800   0.339   2.000   2.030   3.870   8.920

summary(b1)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -6.130   0.783   2.630   2.560   4.430  10.100

Tests: H0 The difference between the 2 means is 0

t.test(a1,a2) #we cannot reject H0

## 
##  Welch Two Sample t-test
## 
## data:  a1 and a2
## t = 1.113, df = 1998, p-value = 0.2657
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.09547  0.34620
## sample estimates:
## mean of x mean of y 
##     2.153     2.028

t.test(a1,b1) #we can reject H0

## 
##  Welch Two Sample t-test
## 
## data:  a1 and b1
## t = -3.561, df = 1995, p-value = 0.000378
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.6384 -0.1849
## sample estimates:
## mean of x mean of y 
##     2.153     2.565

hist(a1)
hist(a2,col="red",add=TRUE)

plot of chunk unnamed-chunk-37

Not very useful as bars overlap Try this

hist(a1,col=alpha("blue", 0.5))

## Error: could not find function "alpha"

plot of chunk unnamed-chunk-38

You get an error because alpha() is not in the “base” or “core” of R Load the package “scales”

require("scales")

## Loading required package: scales

You get again an error if you have not installed package “scales”. Install with Tools/Install Packages…

hist(a1,col=alpha("blue", 0.5))
hist(b1,col=alpha("green", 0.5),add=TRUE)

plot of chunk unnamed-chunk-40

3. Matrices

Set of vectors of the same length. As vectors, all elements must be of the same type

A <- cbind(a1,a2, b1) 
A[10,3]

##    b1 
## 2.522

A[10,]

##    a1    a2    b1 
## 6.195 2.838 2.522

#A[,2]
class(A)

## [1] "matrix"

dim(A)

## [1] 1000    3

nrow(A)

## [1] 1000

ncol(A)

## [1] 3

length(A)

## [1] 3000

rownames(A)

## NULL

colnames(A)

## [1] "a1" "a2" "b1"

colnames(A) <- c("samp1","samp2","samp3")
head(A,3)

##      samp1  samp2 samp3
## [1,] 3.135  4.735 3.436
## [2,] 5.499  5.401 4.147
## [3,] 1.989 -2.738 0.270

tail(A)

##           samp1   samp2  samp3
## [995,]  -0.6071  4.9520 1.3484
## [996,]   2.2766 -0.2093 5.3785
## [997,]   0.6017 -1.5982 0.6143
## [998,]  -0.2809  2.1062 4.5710
## [999,]   1.4591 -0.2311 4.5802
## [1000,]  1.9465 -0.7089 1.4442

B <-  A^2 #Note you must press ^ space, not just ^; also: B <- A**2

Note the effect of the ^2 in the distribution

hist(A[,1])

plot of chunk unnamed-chunk-42

hist(B[,1])

plot of chunk unnamed-chunk-42

Let’s keep the normality:

B <- A*2

We cannot refer to rows or cols that do not exist:

B[,4] <- 0.5*B[,3] + 10 + rnorm(n=nrow(B), m=0,sd=1)

## Error: subscript out of bounds

Correct way:

B <- cbind(B, 0.5*B[,3] + 10 + rnorm(n=nrow(B), m=0,sd=1))
head(B,3)

##       samp1  samp2 samp3      
## [1,]  6.270  9.470 6.872 13.65
## [2,] 10.998 10.801 8.293 15.61
## [3,]  3.978 -5.476 0.540 10.19

For rows:

B <- rbind(c(1:4),B)

Removing rows or cols:

B <- B[-1,] #remove first row

Removing last column:

B <- cbind(B,1) #add column of 1: R will repeat for us
dim(B)

## [1] 1000    5

B <- B[,-ncol(B)] #remove last column
dim(B)

## [1] 1000    4

summary(B)

##      samp1             samp2             samp3              V4       
##  Min.   :-14.322   Min.   :-13.591   Min.   :-12.26   Min.   : 2.92  
##  1st Qu.:  0.981   1st Qu.:  0.678   1st Qu.:  1.57   1st Qu.:10.56  
##  Median :  4.354   Median :  4.008   Median :  5.26   Median :12.70  
##  Mean   :  4.307   Mean   :  4.056   Mean   :  5.13   Mean   :12.58  
##  3rd Qu.:  7.843   3rd Qu.:  7.730   3rd Qu.:  8.86   3rd Qu.:14.50  
##  Max.   : 20.009   Max.   : 17.840   Max.   : 20.13   Max.   :21.56

Quick Plots

plot(x=B[,3],y=B[,4],pch=19)

plot of chunk unnamed-chunk-49

plot(x=B[,3],y=B[,4],pch=19,col=alpha("black",0.25),cex=0.7)

plot of chunk unnamed-chunk-49

Effect of outliers: Before including outliers:

summary(B)

##      samp1             samp2             samp3              V4       
##  Min.   :-14.322   Min.   :-13.591   Min.   :-12.26   Min.   : 2.92  
##  1st Qu.:  0.981   1st Qu.:  0.678   1st Qu.:  1.57   1st Qu.:10.56  
##  Median :  4.354   Median :  4.008   Median :  5.26   Median :12.70  
##  Mean   :  4.307   Mean   :  4.056   Mean   :  5.13   Mean   :12.58  
##  3rd Qu.:  7.843   3rd Qu.:  7.730   3rd Qu.:  8.86   3rd Qu.:14.50  
##  Max.   : 20.009   Max.   : 17.840   Max.   : 20.13   Max.   :21.56

t.test(B[,1],B[,3])

## 
##  Welch Two Sample t-test
## 
## data:  B[, 1] and B[, 3]
## t = -3.561, df = 1995, p-value = 0.000378
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.2767 -0.3699
## sample estimates:
## mean of x mean of y 
##     4.307     5.130

Include outliers:

Boutl <- B
Boutl[c(10,100,1000),1] <- runif(n=3,min=50,max=500) 
Boutl[c(10,100,1000),1]

## [1] 168.6 284.1 406.4

summary(Boutl)

##      samp1           samp2             samp3              V4       
##  Min.   :-14.3   Min.   :-13.591   Min.   :-12.26   Min.   : 2.92  
##  1st Qu.:  1.0   1st Qu.:  0.678   1st Qu.:  1.57   1st Qu.:10.56  
##  Median :  4.4   Median :  4.008   Median :  5.26   Median :12.70  
##  Mean   :  5.1   Mean   :  4.056   Mean   :  5.13   Mean   :12.58  
##  3rd Qu.:  7.9   3rd Qu.:  7.730   3rd Qu.:  8.86   3rd Qu.:14.50  
##  Max.   :406.4   Max.   : 17.840   Max.   : 20.13   Max.   :21.56

t.test(Boutl[,1],Boutl[,3])

## 
##  Welch Two Sample t-test
## 
## data:  Boutl[, 1] and Boutl[, 3]
## t = 0.012, df = 1187, p-value = 0.9904
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.103  1.116
## sample estimates:
## mean of x mean of y 
##     5.137     5.130

boxplot(Boutl[,1:2],notch=TRUE)

plot of chunk unnamed-chunk-51

boxplot(Boutl[,1:2],notch=TRUE,ylim=c(0,30))

plot of chunk unnamed-chunk-51

Statistical detection of outliers

plot(quantile(Boutl[,1],probs=seq(0,1,0.1)),seq(0,1,0.1),
     xlab="percentile",
     ylab="probability")

plot of chunk unnamed-chunk-52

Assume median and mad are good estimates of undisturbed mean and sd

mean(B[,1])

## [1] 4.307

mean(Boutl[,1])

## [1] 5.137

median(Boutl[,1])

## [1] 4.368

sd(B[,1])

## [1] 5.071

sd(Boutl[,1])

## [1] 17.09

mad(Boutl[,1])

## [1] 5.112

Then exclude very extreme values only, i.e. 4sd

pnorm(4,m=0,sd=1) #prob of being within m-4sd and m+4sd in the Gaussian

## [1] 1

See http://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule

minval1 <- median(Boutl[,1])-4*mad(Boutl[,1])
maxval1 <- median(Boutl[,1])+4*mad(Boutl[,1])

Review candidate values to be NA: accept only if theoretically justified

Boutl[Boutl[,1]<minval1,1]

## numeric(0)

Boutl[Boutl[,1]>maxval1,1]

## [1] 168.6 284.1 406.4

set to NA

Boutl[Boutl[,1]<minval1,1] <- NA
Boutl[Boutl[,1]>maxval1,1] <- NA
summary(Boutl[,1])

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## -14.300   0.979   4.340   4.290   7.820  20.000       3

Test now?

t.test(Boutl[,1],Boutl[,3])

## 
##  Welch Two Sample t-test
## 
## data:  Boutl[, 1] and Boutl[, 3]
## t = -3.63, df = 1992, p-value = 0.0002905
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.2929 -0.3859
## sample estimates:
## mean of x mean of y 
##      4.29      5.13

would this be correct? No! We must apply the same rule to the other distribution

minval3 <- median(Boutl[,3])-4*mad(Boutl[,3])
maxval3 <- median(Boutl[,3])+4*mad(Boutl[,3])
Boutl[Boutl[,3]<minval3,3]

## numeric(0)

Boutl[Boutl[,3]>maxval3,3]

## numeric(0)

Boutl[Boutl[,3]<minval3,3] <- NA
Boutl[Boutl[,3]>maxval3,3] <- NA
summary(Boutl[,3])

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -12.30    1.57    5.26    5.13    8.86   20.10

Test now

t.test(Boutl[,1],Boutl[,3])

## 
##  Welch Two Sample t-test
## 
## data:  Boutl[, 1] and Boutl[, 3]
## t = -3.63, df = 1992, p-value = 0.0002905
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.2929 -0.3859
## sample estimates:
## mean of x mean of y 
##      4.29      5.13

#There are specific tools for outlier detection in R:
# packages "extremevalues", "outliers"
# http://www.dbs.ifi.lmu.de/~zimek/publications/KDD2010/kdd10-outlier-tutorial.pdf

4. Arrays

multi-dimensional matrices We skip this by now.

5. Data frames

Basic class for experimental data in R: as a matrix, but each column vector (“variable”) can be of different type. Let’s read an actual example from a data file:

dirdata <- "/Volumes/KINGSTON/Rictja/RLab1/RLab1Data" #NOTE: the actual path depends on your system
#For example, this could work:
dirdata <- "./RLab1Data"
kola <- read.csv(file.path(dirdata,"mikola.csv"),stringsAsFactors=FALSE)
#mg/kg
class(kola)

## [1] "data.frame"

dim(kola)

## [1] 1816   11

nrow(kola)

## [1] 1816

head(kola)

##   ID   XCOO    YCOO ELEV H LITO    Al    K   Pb   S    Cu
## 1  1 547960 7693790  135 M   20  71.2 4360 1.50 783  5.63
## 2  2 770025 7679167  140 M    4 245.0 4040 4.61 880 20.40
## 3  3 498651 7668151  255 M   31 103.0 4060 2.43 809  6.65
## 4  4 795152 7569386  240 M   20 307.0 4770 3.83 894 41.70
## 5  5 437050 7855900   80 M   10 253.0 4220 2.65 714  3.85
## 6  6 752106 7626023  140 M   20 238.0 5560 3.18 953 28.90

tail(kola)

##       ID   XCOO    YCOO ELEV H LITO    Al    K   Pb  S   Cu
## 1811 786 672583 7430000  160 C    1  9510 1000  1.5 24 17.8
## 1812 901 625409 7700000  210 C   52  6090  400  1.4 27 15.2
## 1813 902 657495 7740000   90 C   21 21300 3000  7.5 66 38.3
## 1814 903 685682 7740000  170 C    9 22600 1800 20.2 48 36.7
## 1815 904 683207 7730000  180 C    4  6310 1400  2.8 68 19.8
## 1816 905 680047 7720000  200 C   21  8430 2000  3.9 46 19.5

summary(kola)

##        ID           XCOO             YCOO              ELEV    
##  Min.   :  1   Min.   :372602   Min.   :7370000   Min.   : 15  
##  1st Qu.:185   1st Qu.:498651   1st Qu.:7510000   1st Qu.:140  
##  Median :383   Median :600600   Median :7610000   Median :200  
##  Mean   :386   Mean   :602356   Mean   :7611125   Mean   :204  
##  3rd Qu.:579   3rd Qu.:700690   3rd Qu.:7700000   3rd Qu.:260  
##  Max.   :906   Max.   :861309   Max.   :7890000   Max.   :540  
##                                                                
##       H                  LITO             Al              K        
##  Length:1816        Min.   :  1.0   Min.   :   34   Min.   :  100  
##  Class :character   1st Qu.:  4.0   1st Qu.:  290   1st Qu.:  900  
##  Mode  :character   Median : 20.0   Median : 1975   Median : 1300  
##                     Mean   : 21.2   Mean   : 5193   Mean   : 2257  
##                     3rd Qu.: 31.0   3rd Qu.: 7065   3rd Qu.: 3800  
##                     Max.   :107.0   Max.   :85900   Max.   :11000  
##                     NA's   :74                                     
##        Pb               S              Cu      
##  Min.   :   0.3   Min.   :   2   Min.   :   2  
##  1st Qu.:   1.9   1st Qu.:  46   1st Qu.:   6  
##  Median :   3.7   Median : 860   Median :  11  
##  Mean   :  10.2   Mean   : 830   Mean   :  27  
##  3rd Qu.:  14.9   3rd Qu.:1370   3rd Qu.:  21  
##  Max.   :1110.0   Max.   :3830   Max.   :4080  
##  NA's   :1

table(kola$LITO)

## 
##   1   4   6   7   9  10  20  21  22  31  32  51  52  81  82  83 105 107 
## 369  70  25 118  68 145 291  67  45 125 101 190  86   3  17  11   8   3

kola$LITO <- as.character(kola$LITO)
head(kola,3)

##   ID   XCOO    YCOO ELEV H LITO    Al    K   Pb   S    Cu
## 1  1 547960 7693790  135 M   20  71.2 4360 1.50 783  5.63
## 2  2 770025 7679167  140 M    4 245.0 4040 4.61 880 20.40
## 3  3 498651 7668151  255 M   31 103.0 4060 2.43 809  6.65

boxplot(Al~H,data=kola)

plot of chunk unnamed-chunk-61

Observe 2 problems: order of H is not correct and ylim must be adjusted

kola$H <- factor(kola$H, levels= c("M","0","C"))
boxplot(Al~H,data=kola,ylim=c(0,20000))