A First R Session

Author

Andrew Dalby

Background

In this session I describe how to create a simple dataset with a single variable. The data are male heights taken from Statistical Methods in Biology (Bailey 1981). This page also aligns with the material in chapter one of Statistical Analysis with R (Schmuller 2017).

The first step is to create a variable that holds the data. The heights are recorded at 2 cm intervals, and each recorded height has a frequency. There are 117 values in total. The rep function repeats each height as many times as its frequency, and c combines the results into one vector.

height <- c(rep(1.58, 1),rep(1.60,3),rep(1.62,6),rep(1.64,8),rep(1.66,13),
            rep(1.68, 18),rep(1.70, 19), rep(1.72, 14),rep(1.74, 14),
            rep(1.76, 9), rep(1.78, 5),rep(1.80, 4), rep(1.82, 2),
            rep(1.84, 1))

This produces a single numeric vector containing the heights. You can view it by typing the name of the variable.

height

  [1] 1.58 1.60 1.60 1.60 1.62 1.62 1.62 1.62 1.62 1.62 1.64 1.64 1.64 1.64 1.64
 [16] 1.64 1.64 1.64 1.66 1.66 1.66 1.66 1.66 1.66 1.66 1.66 1.66 1.66 1.66 1.66
 [31] 1.66 1.68 1.68 1.68 1.68 1.68 1.68 1.68 1.68 1.68 1.68 1.68 1.68 1.68 1.68
 [46] 1.68 1.68 1.68 1.68 1.70 1.70 1.70 1.70 1.70 1.70 1.70 1.70 1.70 1.70 1.70
 [61] 1.70 1.70 1.70 1.70 1.70 1.70 1.70 1.70 1.72 1.72 1.72 1.72 1.72 1.72 1.72
 [76] 1.72 1.72 1.72 1.72 1.72 1.72 1.72 1.74 1.74 1.74 1.74 1.74 1.74 1.74 1.74
 [91] 1.74 1.74 1.74 1.74 1.74 1.74 1.76 1.76 1.76 1.76 1.76 1.76 1.76 1.76 1.76
[106] 1.78 1.78 1.78 1.78 1.78 1.80 1.80 1.80 1.80 1.82 1.82 1.84
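
Because rep accepts a vector for its times argument, the same data can be built from two shorter vectors, one of heights and one of frequencies. The following is a minimal sketch of that alternative. The names values, counts and height_alt are my own and are not part of the original session.

```r
# Distinct heights (m) and how often each one was observed.
values <- c(1.58, 1.60, 1.62, 1.64, 1.66, 1.68, 1.70,
            1.72, 1.74, 1.76, 1.78, 1.80, 1.82, 1.84)
counts <- c(1, 3, 6, 8, 13, 18, 19, 14, 14, 9, 5, 4, 2, 1)

# rep() repeats each element of values by the matching element of counts.
height_alt <- rep(values, times = counts)

length(height_alt)  # number of observations
sum(counts)         # should agree with the length above
```

Keeping the frequencies in their own vector makes it easy to check that they add up to the expected 117 observations.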

Summary Statistics

I can now calculate the summary statistics of height, including the mean, median, variance, and standard deviation.

mean(height)

[1] 1.702564

median(height)

[1] 1.7

var(height)

[1] 0.002755438

sd(height)

[1] 0.05249226

You can also create a five-number summary (six values, since R also reports the mean).

summary(height)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.580   1.660   1.700   1.703   1.740   1.840
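
The quartiles in this summary can also be requested directly. As a sketch, quantile with its default settings returns the minimum, the quartiles and the maximum, and IQR returns the distance between the first and third quartiles. The height vector is rebuilt here only so that the block runs on its own, and the name q is my own.

```r
# Rebuild the height data so that this block is self-contained.
height <- rep(c(1.58, 1.60, 1.62, 1.64, 1.66, 1.68, 1.70,
                1.72, 1.74, 1.76, 1.78, 1.80, 1.82, 1.84),
              times = c(1, 3, 6, 8, 13, 18, 19, 14, 14, 9, 5, 4, 2, 1))

# Minimum, first quartile, median, third quartile and maximum.
q <- quantile(height, probs = c(0, 0.25, 0.5, 0.75, 1))
q

# Interquartile range: third quartile minus first quartile.
IQR(height)
```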

If you want to examine the data graphically you can create a histogram.

hist(height)

I do not like the default histogram because its bins are too wide to show the true shape of the distribution. It is also better to plot density rather than frequency, so that the areas of the bars sum to one and the histogram can be compared directly with a probability density curve.

You should also take this opportunity to improve on the labels for the chart and give it a meaningful title.

hist(height, 
     breaks=c(1.57,1.59,1.61,1.63,1.65,1.67,1.69,1.71,1.73,1.75,1.77,1.79,
              1.81,1.83,1.85),
     freq=FALSE,
     xlab="Height (m)",
     main="Histogram of the Height of 117 Males")
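
Typing the breaks by hand is error-prone. As a sketch, the same breaks can be generated with seq, and the object returned by hist can be used to confirm that the bars have a total area of one. The names breaks and h are my own, and the height vector is rebuilt so that the block runs on its own.

```r
# Rebuild the height data so that this block is self-contained.
height <- rep(c(1.58, 1.60, 1.62, 1.64, 1.66, 1.68, 1.70,
                1.72, 1.74, 1.76, 1.78, 1.80, 1.82, 1.84),
              times = c(1, 3, 6, 8, 13, 18, 19, 14, 14, 9, 5, 4, 2, 1))

# Break points every 2 cm, placed midway between the recorded heights.
breaks <- seq(1.57, 1.85, by = 0.02)

# hist() draws the plot and also returns the computed bins.
h <- hist(height,
          breaks = breaks,
          freq = FALSE,
          xlab = "Height (m)",
          main = "Histogram of the Height of 117 Males")

h$counts                          # observations in each bin
sum(h$density * diff(h$breaks))   # total area of the bars
```

With these breaks each recorded height falls in its own bin, so the counts reproduce the frequencies used to build the data.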

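Another useful visualisation is to overlay the normal distribution curve on the histogram to check how well the data fit a normal distribution. The breaks below are extended by one bin on each side, which leaves room for the tails of the curve.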

hist(height, 
     breaks=c(1.55,1.57,1.59,1.61,1.63,1.65,1.67,1.69,1.71,1.73,1.75,1.77,
              1.79,1.81,1.83,1.85,1.87),
     freq=FALSE,
     xlab="Height (m)",
     main="Histogram of the Height of 117 Males")

m <- mean(height)
std <- sd(height)
curve(dnorm(x, mean=m, sd=std), 
      col="darkblue", lwd=2, add=TRUE)
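
The visual comparison can be supplemented with a rough numerical check. This sketch, which is my own addition rather than part of the original session, compares the proportion of heights within one standard deviation of the mean with the proportion expected under a normal distribution. The name within_one_sd is hypothetical.

```r
# Rebuild the height data so that this block is self-contained.
height <- rep(c(1.58, 1.60, 1.62, 1.64, 1.66, 1.68, 1.70,
                1.72, 1.74, 1.76, 1.78, 1.80, 1.82, 1.84),
              times = c(1, 3, 6, 8, 13, 18, 19, 14, 14, 9, 5, 4, 2, 1))

m <- mean(height)
std <- sd(height)

# TRUE for every height that lies within one standard deviation of the mean.
within_one_sd <- height > m - std & height < m + std

mean(within_one_sd)      # observed proportion
pnorm(1) - pnorm(-1)     # proportion expected under a normal distribution
```

For these data 78 of the 117 heights lie within one standard deviation of the mean, about 67%, which is close to the roughly 68% expected under a normal distribution.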

Calculating the Standard Deviation

R has a built-in function for calculating the standard deviation, but I am going to calculate it from the formula to illustrate some useful properties of vectors in R.

The variance and standard deviation are based on the second moment about the mean. The variance averages the squared distances of the data points from the centre of the distribution, the mean, and the standard deviation is its square root.

\[ s=\sqrt{\frac{\sum_{i=1}^{n}({x_{i}-\bar{x}})^{2}}{n-1}} \tag{1}\]

This can be re-written in a form that is easier to calculate as:

\[ s=\sqrt{\frac{\sum_{i=1}^{n}x_{i}^{2}-\frac{\left(\sum_{i=1}^{n}x_{i}\right)^{2}}{n}}{n-1}} \]

This depends on two quantities: the sum of the squares of x, and the square of the sum of x.
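
The step from equation (1) to this form is not shown above. As a sketch of the algebra, expand the square in the numerator and use the fact that the mean is the sum of x divided by n:

\[ \sum_{i=1}^{n}(x_{i}-\bar{x})^{2} = \sum_{i=1}^{n}x_{i}^{2} - 2\bar{x}\sum_{i=1}^{n}x_{i} + n\bar{x}^{2} = \sum_{i=1}^{n}x_{i}^{2} - \frac{\left(\sum_{i=1}^{n}x_{i}\right)^{2}}{n} \]

The middle term equals minus twice the squared sum divided by n, and the last term equals the squared sum divided by n, so together they leave a single negative term.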

Squaring the height vector squares each of its elements and produces a new vector of the same length.

squared <- height^2
squared

  [1] 2.4964 2.5600 2.5600 2.5600 2.6244 2.6244 2.6244 2.6244 2.6244 2.6244
 [11] 2.6896 2.6896 2.6896 2.6896 2.6896 2.6896 2.6896 2.6896 2.7556 2.7556
 [21] 2.7556 2.7556 2.7556 2.7556 2.7556 2.7556 2.7556 2.7556 2.7556 2.7556
 [31] 2.7556 2.8224 2.8224 2.8224 2.8224 2.8224 2.8224 2.8224 2.8224 2.8224
 [41] 2.8224 2.8224 2.8224 2.8224 2.8224 2.8224 2.8224 2.8224 2.8224 2.8900
 [51] 2.8900 2.8900 2.8900 2.8900 2.8900 2.8900 2.8900 2.8900 2.8900 2.8900
 [61] 2.8900 2.8900 2.8900 2.8900 2.8900 2.8900 2.8900 2.8900 2.9584 2.9584
 [71] 2.9584 2.9584 2.9584 2.9584 2.9584 2.9584 2.9584 2.9584 2.9584 2.9584
 [81] 2.9584 2.9584 3.0276 3.0276 3.0276 3.0276 3.0276 3.0276 3.0276 3.0276
 [91] 3.0276 3.0276 3.0276 3.0276 3.0276 3.0276 3.0976 3.0976 3.0976 3.0976
[101] 3.0976 3.0976 3.0976 3.0976 3.0976 3.1684 3.1684 3.1684 3.1684 3.1684
[111] 3.2400 3.2400 3.2400 3.2400 3.3124 3.3124 3.3856

You can then use the sum function to calculate the sum of x and the sum of the squares of x, and apply the formula.

sumh <- sum(height)
n <- length(height) # Number of observations, 117 here.

step1 <- (sumh*sumh)/n
step1

[1] 339.1508

sums <- sum(squared)
sums

[1] 339.4704

s <- sqrt((sums-step1)/(n-1))
s

[1] 0.05249226

sd(height)

[1] 0.05249226
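
Equation (1) can also be applied directly, because subtracting a single number from a vector subtracts it from every element. The following is a minimal sketch; the names deviations and s_def are my own.

```r
# Rebuild the height data so that this block is self-contained.
height <- rep(c(1.58, 1.60, 1.62, 1.64, 1.66, 1.68, 1.70,
                1.72, 1.74, 1.76, 1.78, 1.80, 1.82, 1.84),
              times = c(1, 3, 6, 8, 13, 18, 19, 14, 14, 9, 5, 4, 2, 1))

n <- length(height)

# The mean is subtracted from every element of the vector.
deviations <- height - mean(height)

# Equation (1): sum of squared deviations divided by n - 1, then square root.
s_def <- sqrt(sum(deviations^2) / (n - 1))
s_def

# Compare with the built-in function, allowing for rounding error.
all.equal(s_def, sd(height))
```

The re-written formula subtracts two large and nearly equal numbers (339.4704 and 339.1508 here), which can lose precision with other data, so the direct form is generally the safer choice on a computer.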

The ability to apply mathematical functions to whole vectors is a very powerful tool. You can use it to convert the units of the data, to take logarithms, and so on.
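
As a short sketch of both uses, the heights can be converted from metres to centimetres with a single multiplication, and log can be applied to the whole vector at once. The names height_cm and log_height are my own.

```r
# Rebuild the height data so that this block is self-contained.
height <- rep(c(1.58, 1.60, 1.62, 1.64, 1.66, 1.68, 1.70,
                1.72, 1.74, 1.76, 1.78, 1.80, 1.82, 1.84),
              times = c(1, 3, 6, 8, 13, 18, 19, 14, 14, 9, 5, 4, 2, 1))

# Multiplying a vector by a number multiplies every element.
height_cm <- height * 100
mean(height_cm)   # the mean scales by 100
sd(height_cm)     # the standard deviation scales by 100 as well

# log() returns the natural logarithm of every element.
log_height <- log(height)
length(log_height)
```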

You can go one step further by creating your own functions, which reduces the typing for any operation that you are going to apply often. For example, I could write a sumofsquares function to combine the squaring and summation steps.

sumofsquares <- function(x) {
  # x is the argument of the function: a numeric vector, here the heights.
  sumsq <- sum(x^2) # Square each element, then add up the squares.
  return(sumsq)     # Return the result to the caller.
}

sumofsquares(height)

[1] 339.4704
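
The same idea extends to the whole calculation. This sketch wraps the re-written formula in a function that takes any numeric vector; the name sd_from_sums is hypothetical, and sumofsquares is repeated so that the block runs on its own.

```r
# Sum of the squared elements of a numeric vector.
sumofsquares <- function(x) {
  sum(x^2)
}

# Standard deviation from the sum of squares and the squared sum.
sd_from_sums <- function(x) {
  n <- length(x)   # number of observations
  sqrt((sumofsquares(x) - sum(x)^2 / n) / (n - 1))
}

# Rebuild the height data so that this block is self-contained.
height <- rep(c(1.58, 1.60, 1.62, 1.64, 1.66, 1.68, 1.70,
                1.72, 1.74, 1.76, 1.78, 1.80, 1.82, 1.84),
              times = c(1, 3, 6, 8, 13, 18, 19, 14, 14, 9, 5, 4, 2, 1))

sd_from_sums(height)
all.equal(sd_from_sums(height), sd(height))
```

Using length(x) inside the function means that the number of observations never has to be typed in by hand.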

References

Bailey, N. T. J. 1981. Statistical Methods in Biology. 2nd ed. London: Hodder & Stoughton.

Schmuller, Joseph. 2017. Statistical Analysis with R. For Dummies. Hoboken, New Jersey: John Wiley & Sons.