Basic Statistics

The most common use of R is as a stastistical package, thus it has a lot of built-in functions to perform basic stats. There are even some packages that people have created which do a lot of the work for you. Today I will discuss some of the test that we might use most often with environmental data, but R has dozens of functions available to perform analysis on parametric, non-parametric, continuous or categorical variables. Here is a website that has a nice summary of many of the most popular statistical tests in R.

Let’s load our familiar chicago_air ozone data so that we can run some basic statistics on it. We will give the ozone column its own variable name so that we can access it quickly.

library(region5air)
data(chicago_air)
ozone <- chicago_air$ozone

Descriptive and exploratory statistics

min(ozone, na.rm=T)

## [1] 0.004

max(ozone, na.rm=T)

## [1] 0.081

range(ozone, na.rm=T)

## [1] 0.004 0.081

hist(ozone)

boxplot(ozone)

Measures of central tendency

mean(ozone, na.rm=T)

## [1] 0.03567257

median(ozone, na.rm=T)

## [1] 0.034

Measures of dispersion

var(ozone,na.rm=T)  #variance

## [1] 0.0001878067

sd(ozone,na.rm=T)   #standard deviation

## [1] 0.01370426

IQR(ozone,na.rm=T)  #interquartile range

## [1] 0.02

Functions/Packages that have nice summary stats:

summary(ozone)

install.packages("Hmisc")
library(Hmisc)
describe(ozone)

install.packages("psych")
library(psych)
describeBy(ozone)

install.packages("pastecs")
library(pastecs)
oz.stats <- stat.desc(ozone, norm=T) #If you set norm = TRUE, the function will describe the shape of the data and do various test for normality.  This data will automatically output to a data frame.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## 0.00400 0.02500 0.03400 0.03567 0.04500 0.08100      26

## ozone 
##       n missing  unique    Info    Mean     .05     .10     .25     .50 
##     339      26      63       1 0.03567  0.0160  0.0200  0.0250  0.0340 
##     .75     .90     .95 
##  0.0450  0.0550  0.0591 
## 
## lowest : 0.004 0.008 0.010 0.011 0.013
## highest: 0.068 0.069 0.074 0.078 0.081

##   vars   n mean   sd median trimmed  mad min  max range skew kurtosis se
## 1    1 339 0.04 0.01   0.03    0.03 0.01   0 0.08  0.08 0.53    -0.14  0

##       nbr.val      nbr.null        nbr.na           min           max 
##  3.390000e+02  0.000000e+00  2.600000e+01  4.000000e-03  8.100000e-02 
##         range           sum        median          mean       SE.mean 
##  7.700000e-02  1.209300e+01  3.400000e-02  3.567257e-02  7.443131e-04 
##  CI.mean.0.95           var       std.dev      coef.var      skewness 
##  1.464069e-03  1.878067e-04  1.370426e-02  3.841680e-01  5.320220e-01 
##      skew.2SE      kurtosis      kurt.2SE    normtest.W    normtest.p 
##  2.008322e+00 -1.382295e-01 -2.616528e-01  9.745048e-01  1.059749e-05

write.csv(oz.stats, "oz_stats.csv")  #You can save this dataframe to a csv file for later input into a table.

Statistical tests

First, let’s plot our dataset. We will look at ozone values by month.

library(ggplot2)
p <- ggplot(chicago_air,aes(factor(chicago_air$month), chicago_air$ozone)) ##must change month to a factor so that boxplot knows how to group the data
p + geom_boxplot()

Now let’s say we want to compare ozone months in March and August and see if there is a significant difference in concentrations. First, we will subset the data and use the base boxplot function to view the data.

March.data <- chicago_air[chicago_air$month==3,]
August.data <- chicago_air[chicago_air$month==8,]

combo <- cbind(March.data, August.data)
combo <- combo[,c(2,8)]  ##create a dataset of just the ozone data from March and August for easy plotting
boxplot(combo)

Now, let’s check for normality before doing any statistical tests.

hist(March.data$ozone)

hist(August.data$ozone)

d.mar <- density(March.data$ozone)
d.aug <- density(August.data$ozone,na.rm=T)
plot(d.mar, main="Kernel Density Plot of ozone data")
par(new=TRUE)
plot(d.aug, xlab="", main="", xaxt="n", yaxt="n")

Now we can do some comparisons between these 2 months of readings using the student’s t test.

t.test(March.data$ozone,August.data$ozone)

## 
##  Welch Two Sample t-test
## 
## data:  March.data$ozone and August.data$ozone
## t = -3.7499, df = 43.053, p-value = 0.0005233
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.012861130 -0.003865751
## sample estimates:
##  mean of x  mean of y 
## 0.03890323 0.04726667

t.test results show that these means are in fact different.
We may be curious as to how each of the months compares with each other. We can do a pairwise.t.test to look at all the months.

# this takes the form pairwise.t.test(x,g), where x = the vector of values and g= the grouping vector or factor.
pairwise.t.test(chicago_air$ozone,chicago_air$month)

## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  chicago_air$ozone and chicago_air$month 
## 
##    1       2       3       4       5       6       7       8       9      
## 2  0.15321 -       -       -       -       -       -       -       -      
## 3  1.7e-08 0.01310 -       -       -       -       -       -       -      
## 4  1.8e-15 2.6e-07 0.26779 -       -       -       -       -       -      
## 5  < 2e-16 2.4e-11 0.00174 1.00000 -       -       -       -       -      
## 6  < 2e-16 3.1e-12 0.00041 1.00000 1.00000 -       -       -       -      
## 7  1.00000 1.00000 0.39405 0.00536 0.00014 5.7e-05 -       -       -      
## 8  < 2e-16 4.0e-10 0.00857 1.00000 1.00000 1.00000 0.00041 -       -      
## 9  5.7e-11 0.00039 1.00000 1.00000 0.05957 0.01853 0.09282 0.19596 -      
## 10 1.00000 0.90415 1.1e-06 3.4e-13 < 2e-16 < 2e-16 1.00000 < 2e-16 5.7e-09
## 11 1.00000 0.00645 4.1e-11 < 2e-16 < 2e-16 < 2e-16 1.00000 < 2e-16 9.1e-14
## 12 1.00000 0.00526 2.2e-11 < 2e-16 < 2e-16 < 2e-16 1.00000 < 2e-16 4.3e-14
##    10      11     
## 2  -       -      
## 3  -       -      
## 4  -       -      
## 5  -       -      
## 6  -       -      
## 7  -       -      
## 8  -       -      
## 9  -       -      
## 10 -       -      
## 11 1.00000 -      
## 12 1.00000 1.00000
## 
## P value adjustment method: holm

Correlation analysis

If we want to look at how all the variables in our dataset relate to each other, correlation analysis is very useful.

To run a correlation test, all of the values you give it must be numeric; therefore, dates or character values won’t work.
We need to subset our data to just the numeric columns that we want to test for correlation.
You will want to save the results to a variable so you can use it for plotting later.

chi_air.cor <- cor(chicago_air[,2:5], 
                   use= "complete.obs", 
                   method="pearson") #subset to only numeric data, can choose between spearman, kendall, and pearson correlation coefficients
chi_air.cor  #print data to console to see the correlation matrix

##           ozone      temp     solar     month
## ozone 1.0000000 0.6035925 0.5926064 0.3397222
## temp  0.6035925 1.0000000 0.4923545 0.8138164
## solar 0.5926064 0.4923545 1.0000000 0.4385685
## month 0.3397222 0.8138164 0.4385685 1.0000000

cor.test(chicago_air$solar, chicago_air$temp, method = "pearson")

## 
##  Pearson's product-moment correlation
## 
## data:  chicago_air$solar and chicago_air$temp
## t = 9.4859, df = 254, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4148735 0.5966489
## sample estimates:
##       cor 
## 0.5114604

A package called corrplot has some really nice plotting options for correlation matrices.

install.packages("corrplot")
library(corrplot)
corrplot(chi_air.cor, order='FPC')  #you can order the plots based on alphabetical order, first principal component order, hierarchical clustering order, and more

corrplot(chi_air.cor, 
         method='number', 
         order='FPC')

Pairwise plots

pairs(chicago_air[,2:5], lower.panel=panel.smooth)  # simply does pairwise plotting for all numeric variables

library(psych)

pairs.panels(chicago_air[,2:3], lm = TRUE)  # shows the bivariate scatterplots, histogram, and Pearson correlation for each pair.
pairs.panels(chicago_air[,2:5], lm = TRUE)

Reference Table of a few popular tests for categorical data analysis in R

test	function
Chi Square Test	`chisq.test()`
Fisher’s Test	`fisher.test()`
Analysis of Variance	`aov()`

The `EnvStats` package has a comprehensive list of basic and more advanced statistical tests for Environmental Data.

library(EnvStats)
?EnvStats

Now let’s try some exercises to test our understanding of basic statistical analysis in R.

Basic Statistics

Wednesday, February 3, 2016

Topics

Basic Statistics

Descriptive and exploratory statistics

Measures of central tendency

Measures of dispersion

Functions/Packages that have nice summary stats:

Statistical tests

Correlation analysis

Pairwise plots

Reference Table of a few popular tests for categorical data analysis in R

The `EnvStats` package has a comprehensive list of basic and more advanced statistical tests for Environmental Data.

Exercise 6

Next session:

Other sessions:

Basic Statistics

Wednesday, February 3, 2016

Topics

Basic Statistics

Descriptive and exploratory statistics

Measures of central tendency

Measures of dispersion

Functions/Packages that have nice summary stats:

Statistical tests

Correlation analysis

Pairwise plots

Reference Table of a few popular tests for categorical data analysis in R

The EnvStats package has a comprehensive list of basic and more advanced statistical tests for Environmental Data.

Exercise 6

Next session:

Other sessions:

The `EnvStats` package has a comprehensive list of basic and more advanced statistical tests for Environmental Data.