Every programming language has similar elements. We need to input data, define variables, perform calculations, create functions, control program flow, and output data and graphics. The R language is attractive in that it has a full online development environment (RStudio Cloud) and built-in functions for regression analysis, solving equations, and producing graphics. R is increasingly used by those involved in data analysis and statistics, particularly in the social sciences, biostatistics, and medicine. It is currently available in a free version. References on R tend to emphasize different features of the language, depending on whether they are aimed at the social sciences, data mining, ecology, or medicine. A good reference for scientists using numerical methods is available [1], as well as an Analytical Chemistry textbook [2]. This manual will present the elements of R gradually, as needed, mostly through chemical data examples. This document itself is created with RMarkdown, which integrates document creation with R code.
RStudio Cloud
We will write our programs on an online platform (an IDE) called RStudio Cloud, which makes our work platform independent: we access the program the same way on a PC, Mac, or Chromebook. The RStudio Cloud environment is divided into 4 main sections, with the most important elements listed below.
A package is a set of programs that R can call, such as a statistical package. Base packages come with R; many others have to be added (installed) before they can be used.
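As a minimal sketch of how packages are used in practice: `library()` loads an installed package, and `install.packages()` adds a new one. The package name `ggplot2` below is only an illustration, and the install line is left commented out because it downloads from the internet.

```r
# stats is one of the base packages that ships with R
library(stats)

# search() lists the packages currently attached to the session
attached <- search()
print(attached)

# Check whether an add-on package is installed (ggplot2 is just an example)
has_ggplot2 <- requireNamespace("ggplot2", quietly = TRUE)
print(has_ggplot2)

# To add a missing package, run this once:
# install.packages("ggplot2")
```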
To start writing a program in RStudio Cloud:
from a Workspace (like a file folder that can contain related programs), select New Project, and from the menu choose:
File > New File > R Script
and start typing!
To run a program, highlight the code and select “Run”.
When you run a script program, results and error messages will appear on the console, and plots appear on the plot area.
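A first script to try could look like the sketch below. The mass and volume are made-up values for illustration only; the variable names are also just suggestions.

```r
# A first script: density of a sample (values are invented for illustration)
mass_g    <- 2.50                 # mass in grams
volume_mL <- 1.25                 # volume in millilitres
density   <- mass_g / volume_mL   # density in g/mL
print(density)
```

Highlighting these lines and selecting Run should show the commands, followed by the printed density, on the console.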
The great escape. Sometimes a program doesn’t run and only the commands show up on the console. Put the cursor on the console and press the Esc key to get out of this.
Shaded areas are R code. Output from a previous R command appears on a white background, on lines beginning with ##.
# A comment line
x <- 5.0 # Set x equal to 5.0
y <- x^2 # y is equal to x squared.
x <- c(1.0, 2.0, 3.0) # x is equal to a numbered list of values - a “vector”
x <- seq(1,2,0.2) # create an incremented sequence
xx <- matrix(c(2,4,6,8),2,2) # a 2 x 2 matrix, filled by column: row 1 = 2 6, row 2 = 4 8
# length(x) returns the length of the vector x.
# print(x), or simply x, returns all the values of x
length(x)
## [1] 6
print(x)
## [1] 1.0 1.2 1.4 1.6 1.8 2.0
print(xx)
##      [,1] [,2]
## [1,]    2    6
## [2,]    4    8
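Individual elements of a vector or matrix can be picked out with square brackets. The sketch below reuses the same x and xx as above; the names on the left-hand side are only illustrative. Note that R counts from 1, and a matrix takes a row index followed by a column index.

```r
x  <- seq(1, 2, 0.2)               # same vector as above
xx <- matrix(c(2, 4, 6, 8), 2, 2)  # same matrix as above

second <- x[2]      # second element of x
first3 <- x[1:3]    # first three elements of x
r1c2   <- xx[1, 2]  # row 1, column 2 of xx
row2   <- xx[2, ]   # all of row 2 of xx

print(second)
print(first3)
print(r1c2)
print(row2)
```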
Here is an example of a simple R program that plots \(y = x^2\)
x <- seq(0,10,0.5) # a sequence from 0 to 10, in increments of 0.5
y <- x^2 # Note that y is calculated for every x. This is called vectorization.
plot(x,y) # Create a plot. We can add a lot of formatting.
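To see what vectorized means, the sketch below computes the same y twice: once with an explicit loop over every element, and once with the single vectorized expression. The names y_loop and y_vec are just illustrative.

```r
x <- seq(0, 10, 0.5)

# Loop version: square each element one at a time
y_loop <- numeric(length(x))
for (i in seq_along(x)) {
  y_loop[i] <- x[i]^2
}

# Vectorized version: one expression acts on the whole vector
y_vec <- x^2

# Compare the two results
print(all.equal(y_loop, y_vec))
```

Both versions give the same values; the vectorized form is shorter and is the usual style in R.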
We will often want to customize the formatting of our plots. Here is an example:
plot(x,y,type = "b",main = "A Formatted Graph",col = "darkblue", xlab = "X Label", ylab = "Y Label")
grid(NULL,NULL, lty = 1, col = "lightgreen") # to add a grid
Later, you may want to learn to use ggplot, which is very popular but which relies on a structure called a data frame. For beginners, I suggest using the simpler “plot” command.
Every measurement has random error: Error that is inherent in the nature of the measurement.
Significant Figures: The number of significant figures is all the certain figures plus one uncertain figure. The degree of uncertainty in the last digit is ultimately determined by a statistical analysis.
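In R, rounding to a given number of significant figures can be sketched with the built-in signif() function, while round() works with decimal places instead. The measured value below is hypothetical.

```r
measured <- 0.0123456       # hypothetical measured value

sf3 <- signif(measured, 3)  # keep 3 significant figures
dp3 <- round(measured, 3)   # keep 3 decimal places

print(sf3)
print(dp3)
```

The two results differ because leading zeros are not significant figures, but they do count as decimal places.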
The Mean (or average) of a set of n measurements x is defined as:
\[\large\overline{x} = \frac{\sum_{i=1}^{n}x_i}{n}\] and the Standard Deviation is:
\[\large s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \overline{x})^2}{n-1}}\]
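The two formulas can be checked against R's built-in mean() and sd() functions. The five replicate measurements below are invented for illustration, and the variable names are only suggestions.

```r
# Hypothetical replicate measurements
m <- c(10.1, 10.3, 9.9, 10.2, 10.0)
n <- length(m)

mean_manual <- sum(m) / n                                # formula for the mean
sd_manual   <- sqrt(sum((m - mean_manual)^2) / (n - 1))  # formula for s

# Each pair should show the same number twice
print(c(mean_manual, mean(m)))
print(c(sd_manual, sd(m)))
```

Note that sd() divides by n - 1, matching the definition of s given here.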
The mean and standard deviation can be related to the Gaussian distribution, which describes the probability of observing a particular value of x. For a finite number of measurements, the Gaussian distribution can be approximated as:
\[\large y = \frac{1}{s\sqrt{2\pi}}e^{-\frac{(x-\overline{x})^2}{2s^2}}\]
According to one reference, the average (mean) male weight is 172 pounds with a standard deviation of 29 pounds. Substituting these values into the Gaussian distribution formula:
s <- 29
xmean <- 172
xval <- seq(72,282,5)
xval
## [1] 72 77 82 87 92 97 102 107 112 117 122 127 132 137 142 147 152 157 162
## [20] 167 172 177 182 187 192 197 202 207 212 217 222 227 232 237 242 247 252 257
## [39] 262 267 272 277 282
# here we calculate the Gaussian distribution from the formula, using exp() for e raised to a power
yp <- (1/(s*sqrt(2*pi)))*exp(-(xval-xmean)^2/(2*s^2))
yp
## [1] 3.601635e-05 6.430517e-05 1.114505e-04 1.875030e-04 3.062134e-04
## [6] 4.854339e-04 7.470093e-04 1.115865e-03 1.618034e-03 2.277473e-03
## [11] 3.111781e-03 4.127191e-03 5.313615e-03 6.640726e-03 8.056213e-03
## [16] 9.487162e-03 1.084505e-02 1.203419e-02 1.296259e-02 1.355367e-02
## [21] 1.375663e-02 1.355367e-02 1.296259e-02 1.203419e-02 1.084505e-02
## [26] 9.487162e-03 8.056213e-03 6.640726e-03 5.313615e-03 4.127191e-03
## [31] 3.111781e-03 2.277473e-03 1.618034e-03 1.115865e-03 7.470093e-04
## [36] 4.854339e-04 3.062134e-04 1.875030e-04 1.114505e-04 6.430517e-05
## [41] 3.601635e-05 1.958139e-05 1.033421e-05
plot(xval,yp)
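# R has a built-in command for the Gaussian distribution: dnorm(x, mean, sd).
dp <- dnorm(xval,199.8,29) # distribution with mean 199.8 and standard deviation 29 (note: a different mean from the 172 used above)
plot(xval,dp)
lines(xval,dp)
fdp <- dnorm(xval,170.8,29) # a second distribution with mean 170.8 and the same standard deviation
lines(xval,fdp)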
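As a quick check, and assuming the same mean (172) and standard deviation (29) are passed to both, the hand-written formula and dnorm should agree. The sketch below makes that comparison; y_formula and y_dnorm are illustrative names.

```r
s     <- 29
xmean <- 172
xval  <- seq(72, 282, 5)

# Gaussian distribution written out from the formula
y_formula <- (1 / (s * sqrt(2 * pi))) * exp(-(xval - xmean)^2 / (2 * s^2))

# The same distribution from the built-in function
y_dnorm <- dnorm(xval, mean = xmean, sd = s)

# Compare the two results
print(all.equal(y_formula, y_dnorm))
```

Naming the mean and sd arguments, as done here, makes it easier to see which distribution is being plotted.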
[1] Introduction to Scientific Programming and Simulation Using R, CRC Press (2014)