The R Programming Language

Every programming language has similar elements. We need to input data, identify variables, perform calculations, create functions, control program flow, and output data and graphics. The R language is attractive in that it has a full online development environment (RStudio Cloud) and built-in functions for regression analysis, solving equations, and producing graphics. R is increasingly used by those involved in data analysis and statistics, particularly in the social sciences, biostatistics, and medicine. It is currently available in a free version. References on R tend to emphasize the features of R most suitable for a particular field, such as social sciences, data mining, ecology, or medicine. A good reference for scientists using numerical methods is available¹, as well as an Analytical Chemistry textbook². This manual will present the elements of R gradually, as needed, mostly through chemical data examples. This document itself was created with RMarkdown, which integrates document creation with R code.

RStudio Cloud

We will write our programs on an online platform (an IDE) called RStudio Cloud, which is platform independent: we access the program the same way on a PC, Mac, or Chromebook. The RStudio Cloud environment is divided into 4 main sections, with the most important elements listed below.

Top left: Script - where you write R code
Bottom left: Console - shows output
Top right: Environment - we will mostly ignore this
Bottom right: shows plots, help, and packages

A package is a set of programs that R can call, such as a statistical package. There are base packages that come with R, and there are many others that have to be added.
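
As a minimal sketch of the difference between base and add-on packages, the snippet below lists the base packages and then checks whether an add-on package is available before loading it. The choice of ggplot2 as the add-on is only an illustration.

```r
# Packages with priority "base" ship with every R installation
base_pkgs <- rownames(installed.packages(priority = "base"))
print(base_pkgs)

# An add-on package must be installed once before library() can load it.
# ggplot2 is used here only as an illustrative add-on.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
} else {
  message("ggplot2 is not installed; install.packages(\"ggplot2\") would add it")
}
```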

To start writing a program in RStudio Cloud:

From a Workspace (like a file folder that can contain related programs), select New Project, and from the menu choose:
File → New File → R Script

and start typing!

To run a program, highlight the code and select “Run”.

When you run a script program, results and error messages will appear on the console, and plots appear on the plot area.

The great escape. Sometimes a program doesn’t run and only the commands show up in the console. Click in the console and press the Esc key to get out of this.

Basics of Numerics and Graphs

In the rendered document, the shaded areas are R code. Output from the preceding R command appears on a white background, with each output line prefixed by ##.

#  A comment line

x <- 5.0    #  Set x equal to 5.0

y <- x^2    # y is equal to x squared. 

x <- c(1.0, 2.0, 3.0)   #   x is equal to a numbered list of values - a “vector” 

x <- seq(1,2,0.2)       # create an incremented sequence

xx <- matrix(c(2,4,6,8),2,2)   # a 2 x 2 matrix, filled by column: row 1 = 2 6, row 2 = 4 8

# length(x) returns the length of the vector x.

# print(x), or simply x, returns all the values of x.

length(x)

## [1] 6

print(x)

## [1] 1.0 1.2 1.4 1.6 1.8 2.0

print(xx)

##      [,1] [,2]
## [1,]    2    6
## [2,]    4    8
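
The matrix output above shows that R fills a matrix down its columns by default. As a small sketch of how that choice can be changed, and of how single elements are picked out with square brackets, consider the following. The variable names `v`, `by_col`, and `by_row` are illustrative only.

```r
v <- c(2, 4, 6, 8)

by_col <- matrix(v, nrow = 2, ncol = 2)                 # default: fill down the columns
by_row <- matrix(v, nrow = 2, ncol = 2, byrow = TRUE)   # fill across the rows instead

by_col[1, ]    # first row of the column-filled matrix: 2 6
by_row[1, ]    # first row of the row-filled matrix:    2 4

x <- seq(1, 2, 0.2)
x[3]           # third element of the sequence: 1.4
```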

Here is an example of a simple R program that plots \(y = x^2\)

x   <-  seq(0,10,0.5)     # a sequence from 0 to 10, in increments of 0.5

y  <-   x^2     #  Note that y is calculated for every x. This is called vectorized.

plot(x,y)     # Create a plot. We can add a lot of formatting.
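
To illustrate what "vectorized" means, the sketch below computes the same squares twice: once with a single vectorized expression and once with an explicit loop. The loop is shown only for comparison and the names `y_vec` and `y_loop` are illustrative.

```r
x <- seq(0, 10, 0.5)

# Vectorized: one expression squares every element of x
y_vec <- x^2

# Equivalent explicit loop, shown only for comparison
y_loop <- numeric(length(x))
for (i in seq_along(x)) {
  y_loop[i] <- x[i]^2
}

all.equal(y_vec, y_loop)   # TRUE: both approaches give the same values
```

The vectorized form is shorter and is the usual style in R.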

We will often want to customize the formatting of our plots. Here is an example:

plot(x,y,type = "b",main = "A Formatted Graph",col = "darkblue", xlab = "X Label", ylab = "Y Label")

grid (NULL,NULL, lty = 1, col = "lightgreen") # to add a grid

Later, you may want to learn to use ggplot2, which is very popular but works with a structure called a data frame. For beginners, I suggest using the simpler “plot” command.

mydat <- data.frame(x,y)     #  creating a data frame 

library(ggplot2)

ggplot(mydat, aes(x,y)) + geom_point() + xlab("New Label")

Here is a quick guide to ggplot

Data and Statistics

Every measurement has random error: Error that is inherent in the nature of the measurement.

Significant Figures: The number of significant figures is all the certain figures plus one uncertain figure. The degree of uncertainty in the last digit is ultimately determined by a statistical analysis.

The Mean (or average) of a set of n measurements x is defined:

\[\large\overline{x} = \frac{\sum_{i=1}^{n}x_i}{n}\] and the Standard Deviation is:

\[\large s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \overline{x})^2}{n-1}}\]
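
As a quick check of these two definitions, the sketch below evaluates them term by term and compares the results with R's built-in `mean()` and `sd()`. The measurement values are hypothetical, chosen only to make the arithmetic easy to follow.

```r
# Hypothetical measurements, chosen only to illustrate the formulas
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

xbar <- sum(x) / n                          # mean: sum of the values divided by n
s    <- sqrt(sum((x - xbar)^2) / (n - 1))   # standard deviation with n - 1 in the denominator

xbar                       # 5
all.equal(xbar, mean(x))   # TRUE
all.equal(s, sd(x))        # TRUE
```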

The mean and standard deviation can be related to the Gaussian distribution, which describes how likely we are to observe a particular value of x. For a finite number of measurements, the Gaussian distribution can be approximated as:

\[\large y = \frac{1}{s\sqrt{2\pi}}\, e^{-\frac{(x-\overline{x})^2}{2s^2}}\]

According to one reference, the average (mean) male weight is 172 pounds with a standard deviation of 29 pounds. Substituting these values into the Gaussian distribution formula:

s <- 29

xmean <-  172

xval <- seq(72,282,5)

xval

##  [1]  72  77  82  87  92  97 102 107 112 117 122 127 132 137 142 147 152 157 162
## [20] 167 172 177 182 187 192 197 202 207 212 217 222 227 232 237 242 247 252 257
## [39] 262 267 272 277 282

#  here we calculate a gaussian distribution

yp <- (1/(s*sqrt(2*pi)))*exp(-(xval-xmean)^2/(2*s^2))

yp

##  [1] 3.601635e-05 6.430517e-05 1.114505e-04 1.875030e-04 3.062134e-04
##  [6] 4.854339e-04 7.470093e-04 1.115865e-03 1.618034e-03 2.277473e-03
## [11] 3.111781e-03 4.127191e-03 5.313615e-03 6.640726e-03 8.056213e-03
## [16] 9.487162e-03 1.084505e-02 1.203419e-02 1.296259e-02 1.355367e-02
## [21] 1.375663e-02 1.355367e-02 1.296259e-02 1.203419e-02 1.084505e-02
## [26] 9.487162e-03 8.056213e-03 6.640726e-03 5.313615e-03 4.127191e-03
## [31] 3.111781e-03 2.277473e-03 1.618034e-03 1.115865e-03 7.470093e-04
## [36] 4.854339e-04 3.062134e-04 1.875030e-04 1.114505e-04 6.430517e-05
## [41] 3.601635e-05 1.958139e-05 1.033421e-05

plot(xval,yp)

#  R has a built-in command for the Gaussian distribution: dnorm(x, mean, sd).

dp  <- dnorm(xval,199.8,29)    #  density from dnorm; note the mean used here is 199.8, not the 172 used above

plot(xval,dp)
lines(xval,dp)
fdp <-   dnorm(xval,170.8,29)  #  a second distribution with mean 170.8, drawn on the same plot

lines(xval,fdp)

Each point represents the probability density at a particular value, and the area under the curve (the sum of all probabilities) is 1.
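
Two of the claims above can be checked numerically. The sketch below confirms that the explicit formula and `dnorm` give the same values for a mean of 172 and a standard deviation of 29, and that the area under the curve is 1. Integrating over eight standard deviations on either side of the mean is an assumption made here to keep the numerical integration well behaved.

```r
s     <- 29
xmean <- 172
xval  <- seq(72, 282, 5)

# Density from the explicit formula and from the built-in dnorm()
yp <- (1 / (s * sqrt(2 * pi))) * exp(-(xval - xmean)^2 / (2 * s^2))
dp <- dnorm(xval, mean = xmean, sd = s)
all.equal(yp, dp)     # TRUE: the two calculations agree

# Numerical integration of the density over a wide range around the mean
area <- integrate(dnorm, lower = xmean - 8 * s, upper = xmean + 8 * s,
                  mean = xmean, sd = s)
area$value            # very close to 1
```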

The distance from the mean of a particular measurement can be discussed in terms of “deviations from the mean” as multiples of the standard deviation. For instance:

68.3% of measurements lie within plus or minus one standard deviation

95.5% within plus or minus two standard deviations

99.7% within plus or minus three standard deviations.
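
These percentages can be reproduced with R's cumulative normal function, `pnorm`. The short sketch below uses the standard normal distribution (mean 0, standard deviation 1), so each value of `k` is a multiple of the standard deviation. The names `k` and `within_k` are illustrative.

```r
k <- 1:3                            # multiples of the standard deviation
within_k <- pnorm(k) - pnorm(-k)    # fraction of a standard normal within +/- k
round(100 * within_k, 2)            # 68.27 95.45 99.73
```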

The R command for the cumulative distribution, which approaches 1, is pnorm.

cdp <-  pnorm(xval,172,29)    #   men  weight distribution
fdp <-   pnorm(xval,160,29)   #   women weight distribution

plot(xval,cdp)
lines(xval,fdp)
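
Because `pnorm` returns the fraction of the distribution at or below a given value, differences of `pnorm` values give the fraction in a range. Here is a brief sketch using the mean of 172 and standard deviation of 29 from above. The cutoff weights of 143, 201, and 230 pounds are chosen for illustration because they are one and two standard deviations from the mean.

```r
# Fraction at or below 201 pounds (one standard deviation above the mean)
p_below <- pnorm(201, mean = 172, sd = 29)

# Fraction between 143 and 201 pounds (within one standard deviation)
p_between <- pnorm(201, 172, 29) - pnorm(143, 172, 29)

# Fraction above 230 pounds (two standard deviations above the mean)
p_above <- pnorm(230, 172, 29, lower.tail = FALSE)

round(c(p_below, p_between, p_above), 4)   # 0.8413 0.6827 0.0228
```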

Comparing two sets of data: the t test

Argon was discovered because the mass of chemically generated nitrogen was significantly different from that of nitrogen obtained from air. “Significant,” in statistics, is carefully defined in terms of probabilities. Rayleigh’s investigation provides an interesting historical example: the mass of nitrogen obtained from air was about 0.5% greater than the mass of chemically generated nitrogen (for instance, from the decomposition of pure NO). Was this slight difference attributable to experimental error? The two sets of masses are given below, along with the results of a t-test in R.

Rdata <- c(2.31017,2.30986,2.31010,2.31001,2.310024,2.31010,2.31028)
Tdata <- c(2.30143,2.29890,2.29816,2.30182,2.29869,2.29940,2.29848)

Xdata <- data.frame(Rdata,Tdata)
knitr::kable(Xdata, col.names = c('Air Data','Chem Data'), caption = "Gas Data for t.test/grams")

Gas Data for t.test/grams
Air Data	Chem Data
2.310170	2.30143
2.309860	2.29890
2.310100	2.29816
2.310010	2.30182
2.310024	2.29869
2.310100	2.29940
2.310280	2.29848

t.test(Rdata,Tdata)

## 
##  Welch Two Sample t-test
## 
## data:  Rdata and Tdata
## t = 18.876, df = 6.0976, p-value = 1.218e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.009164539 0.011882319
## sample estimates:
## mean of x mean of y 
##  2.310078  2.299554

The important value to look for is the p (probability) value: it tells us the probability that a difference at least this large would arise from random experimental error alone. The p value reported here indicates about a one in a million chance that these results would occur if the masses were actually the same. We can also look at the sizes of the standard deviations and see that they are much smaller than the difference between the two average values.

The p value is related to the ratio of the difference of the means to the standard deviations of the two data sets. The larger that ratio, the more significant the difference of the means is.
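
To make that ratio concrete, the sketch below computes the Welch t statistic by hand for the two data sets and compares it with the value from `t.test`. It assumes the usual Welch form, the difference of the means divided by the combined standard error. The names `mean_diff`, `se`, `t_manual`, and `t_builtin` are illustrative.

```r
Rdata <- c(2.31017,2.30986,2.31010,2.31001,2.310024,2.31010,2.31028)
Tdata <- c(2.30143,2.29890,2.29816,2.30182,2.29869,2.29940,2.29848)

# Standard deviations of each set, for comparison with the difference of the means
sd(Rdata)
sd(Tdata)
mean_diff <- mean(Rdata) - mean(Tdata)

# Welch t statistic: difference of the means over the combined standard error
se <- sqrt(var(Rdata) / length(Rdata) + var(Tdata) / length(Tdata))
t_manual <- mean_diff / se

# Compare with the statistic reported by t.test()
t_builtin <- unname(t.test(Rdata, Tdata)$statistic)
all.equal(t_manual, t_builtin)   # TRUE
```

The difference of the means is several times larger than either standard deviation, which is why the t statistic is large and the p value is so small.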

A nice visualization can be found on the Minitab web site.

Gaussian Project

Find a data set that is likely to follow a random (Gaussian) distribution. This could be baseball batting averages, heights, or some other attribute. It should contain more than 20 values.
Using R, create a vector with the c command, for example X <- c(1,2,5,8,9).
Find the mean and the standard deviation. If X is the vector, the commands are mean(X) and sd(X).
Use the dnorm command to graph the estimated Gaussian distribution over a reasonable range (created with seq()).
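
One possible outline of these steps is sketched below. The data are simulated placeholder values generated with `rnorm`, not real measurements, and the choice of four standard deviations on either side of the mean for the plotting range is an assumption. A real project would replace the `rnorm` line with values entered using `c()`.

```r
# Simulated placeholder data; a real project would type in measured values with c()
set.seed(1)
X <- rnorm(25, mean = 70, sd = 3)   # e.g. 25 hypothetical heights in inches

m <- mean(X)
s <- sd(X)

# Range for the curve: four standard deviations either side of the mean
xval <- seq(m - 4 * s, m + 4 * s, length.out = 200)
yval <- dnorm(xval, mean = m, sd = s)

plot(xval, yval, type = "l", main = "Estimated Gaussian Distribution",
     xlab = "Value", ylab = "Density")
rug(X)   # mark the individual data values along the x axis
```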

¹ Introduction to Scientific Programming and Simulation Using R, CRC Press (2014)
² Analytical Chemistry