In this sesssion we’ll be covering:
The most common use of R is as a stastistical package, thus it has a lot of built-in functions to perform basic stats. There are even some packages that people have created which do a lot of the work for you. Today I will discuss some of the test that we might use most often with environmental data, but R has dozens of functions available to perform analysis on parametric, non-parametric, continuous or categorical variables. Here is a website that has a nice summary of many of the most popular statistical tests in R.
Let’s load our familiar chicago_air ozone data so that we can run some basic statistics on it. We will give the ozone column its own variable name so that we can access it quickly.
library(region5air)
data(chicago_air)
ozone <- chicago_air$ozone
min(ozone, na.rm=T)
## [1] 0.004
max(ozone, na.rm=T)
## [1] 0.081
range(ozone, na.rm=T)
## [1] 0.004 0.081
hist(ozone)
boxplot(ozone)
mean(ozone, na.rm=T)
## [1] 0.03567257
median(ozone, na.rm=T)
## [1] 0.034
var(ozone,na.rm=T) #variance
## [1] 0.0001878067
sd(ozone,na.rm=T) #standard deviation
## [1] 0.01370426
IQR(ozone,na.rm=T) #interquartile range
## [1] 0.02
summary(ozone)
install.packages("Hmisc")
library(Hmisc)
describe(ozone)
install.packages("psych")
library(psych)
describeBy(ozone)
install.packages("pastecs")
library(pastecs)
oz.stats <- stat.desc(ozone, norm=T) #If you set norm = TRUE, the function will describe the shape of the data and do various test for normality. This data will automatically output to a data frame.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00400 0.02500 0.03400 0.03567 0.04500 0.08100 26
## ozone
## n missing unique Info Mean .05 .10 .25 .50
## 339 26 63 1 0.03567 0.0160 0.0200 0.0250 0.0340
## .75 .90 .95
## 0.0450 0.0550 0.0591
##
## lowest : 0.004 0.008 0.010 0.011 0.013
## highest: 0.068 0.069 0.074 0.078 0.081
## vars n mean sd median trimmed mad min max range skew kurtosis se
## 1 1 339 0.04 0.01 0.03 0.03 0.01 0 0.08 0.08 0.53 -0.14 0
## nbr.val nbr.null nbr.na min max
## 3.390000e+02 0.000000e+00 2.600000e+01 4.000000e-03 8.100000e-02
## range sum median mean SE.mean
## 7.700000e-02 1.209300e+01 3.400000e-02 3.567257e-02 7.443131e-04
## CI.mean.0.95 var std.dev coef.var skewness
## 1.464069e-03 1.878067e-04 1.370426e-02 3.841680e-01 5.320220e-01
## skew.2SE kurtosis kurt.2SE normtest.W normtest.p
## 2.008322e+00 -1.382295e-01 -2.616528e-01 9.745048e-01 1.059749e-05
write.csv(oz.stats, "oz_stats.csv") #You can save this dataframe to a csv file for later input into a table.
First, let’s plot our dataset. We will look at ozone values by month.
library(ggplot2)
p <- ggplot(chicago_air,aes(factor(chicago_air$month), chicago_air$ozone)) ##must change month to a factor so that boxplot knows how to group the data
p + geom_boxplot()
Now let’s say we want to compare ozone months in March and August and see if there is a significant difference in concentrations. First, we will subset the data and use the base boxplot function to view the data.
March.data <- chicago_air[chicago_air$month==3,]
August.data <- chicago_air[chicago_air$month==8,]
combo <- cbind(March.data, August.data)
combo <- combo[,c(2,8)] ##create a dataset of just the ozone data from March and August for easy plotting
boxplot(combo)
Now, let’s check for normality before doing any statistical tests.
hist(March.data$ozone)
hist(August.data$ozone)
d.mar <- density(March.data$ozone)
d.aug <- density(August.data$ozone,na.rm=T)
plot(d.mar, main="Kernel Density Plot of ozone data")
par(new=TRUE)
plot(d.aug, xlab="", main="", xaxt="n", yaxt="n")
Now we can do some comparisons between these 2 months of readings using the student’s t test.
t.test(March.data$ozone,August.data$ozone)
##
## Welch Two Sample t-test
##
## data: March.data$ozone and August.data$ozone
## t = -3.7499, df = 43.053, p-value = 0.0005233
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.012861130 -0.003865751
## sample estimates:
## mean of x mean of y
## 0.03890323 0.04726667
t.test results show that these means are in fact different.pairwise.t.test to look at all the months.# this takes the form pairwise.t.test(x,g), where x = the vector of values and g= the grouping vector or factor.
pairwise.t.test(chicago_air$ozone,chicago_air$month)
##
## Pairwise comparisons using t tests with pooled SD
##
## data: chicago_air$ozone and chicago_air$month
##
## 1 2 3 4 5 6 7 8 9
## 2 0.15321 - - - - - - - -
## 3 1.7e-08 0.01310 - - - - - - -
## 4 1.8e-15 2.6e-07 0.26779 - - - - - -
## 5 < 2e-16 2.4e-11 0.00174 1.00000 - - - - -
## 6 < 2e-16 3.1e-12 0.00041 1.00000 1.00000 - - - -
## 7 1.00000 1.00000 0.39405 0.00536 0.00014 5.7e-05 - - -
## 8 < 2e-16 4.0e-10 0.00857 1.00000 1.00000 1.00000 0.00041 - -
## 9 5.7e-11 0.00039 1.00000 1.00000 0.05957 0.01853 0.09282 0.19596 -
## 10 1.00000 0.90415 1.1e-06 3.4e-13 < 2e-16 < 2e-16 1.00000 < 2e-16 5.7e-09
## 11 1.00000 0.00645 4.1e-11 < 2e-16 < 2e-16 < 2e-16 1.00000 < 2e-16 9.1e-14
## 12 1.00000 0.00526 2.2e-11 < 2e-16 < 2e-16 < 2e-16 1.00000 < 2e-16 4.3e-14
## 10 11
## 2 - -
## 3 - -
## 4 - -
## 5 - -
## 6 - -
## 7 - -
## 8 - -
## 9 - -
## 10 - -
## 11 1.00000 -
## 12 1.00000 1.00000
##
## P value adjustment method: holm
If we want to look at how all the variables in our dataset relate to each other, correlation analysis is very useful.
chi_air.cor <- cor(chicago_air[,2:5],
use= "complete.obs",
method="pearson") #subset to only numeric data, can choose between spearman, kendall, and pearson correlation coefficients
chi_air.cor #print data to console to see the correlation matrix
## ozone temp solar month
## ozone 1.0000000 0.6035925 0.5926064 0.3397222
## temp 0.6035925 1.0000000 0.4923545 0.8138164
## solar 0.5926064 0.4923545 1.0000000 0.4385685
## month 0.3397222 0.8138164 0.4385685 1.0000000
cor.test(chicago_air$solar, chicago_air$temp, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: chicago_air$solar and chicago_air$temp
## t = 9.4859, df = 254, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4148735 0.5966489
## sample estimates:
## cor
## 0.5114604
A package called corrplot has some really nice plotting options for correlation matrices.
install.packages("corrplot")
library(corrplot)
corrplot(chi_air.cor, order='FPC') #you can order the plots based on alphabetical order, first principal component order, hierarchical clustering order, and more
corrplot(chi_air.cor,
method='number',
order='FPC')
pairs(chicago_air[,2:5], lower.panel=panel.smooth) # simply does pairwise plotting for all numeric variables
library(psych)
pairs.panels(chicago_air[,2:3], lm = TRUE) # shows the bivariate scatterplots, histogram, and Pearson correlation for each pair.
pairs.panels(chicago_air[,2:5], lm = TRUE)
| test | function |
|---|---|
| Chi Square Test | chisq.test() |
| Fisher’s Test | fisher.test() |
| Analysis of Variance | aov() |
EnvStats package has a comprehensive list of basic and more advanced statistical tests for Environmental Data.library(EnvStats)
?EnvStats
Now let’s try some exercises to test our understanding of basic statistical analysis in R.