R includes an extensive set of built-in math functions, among them exp(), log(), log10(), sqrt(), abs(), min(), max(), pmin(), pmax(), sum(), prod(), cumsum(), cumprod(), round(), floor(), ceiling(), and factorial().
As an extended example of using these functions, suppose we have n independent events, and the ith event has probability \(p_i\) of occurring. What is the probability that exactly one of these events occurs?
Suppose first that n = 3 and our events are named A, B, and C. Then we break down the computation as follows:
P(exactly one event occurs) = P(A and not B and not C) + P(not A and B and not C) + P(not A and not B and C)
By independence, P(A and not B and not C) is \(p_A(1-p_B)(1-p_C)\), P(not A and B and not C) is \((1-p_A)p_B(1-p_C)\), and P(not A and not B and C) is \((1-p_A)(1-p_B)p_C\).
For general n, the desired probability is calculated as follows:
\[ \sum_{i=1}^n p_i (1-p_1)\cdots(1-p_{i-1})(1-p_{i+1})\cdots(1-p_n) \]
The ith term inside the sum is the probability that event i occurs and all the others do not.
Here’s code to compute this, with our probabilities \(p_i\) contained in the vector p:
exactlyone <- function(p) {
   notp <- 1 - p   # vector of "not occur" probabilities 1 - p_j
   tot <- 0.0
   for (i in 1:length(p))
      tot <- tot + p[i] * prod(notp[-i])   # event i occurs, all others do not
   return(tot)
}
How does it work? Well, the assignment notp <- 1 - p creates a vector of all the “not occur” probabilities \(1-p_j\), using recycling. The expression notp[-i] excludes the ith element of notp, so prod(notp[-i]) computes the product of all the “not occur” probabilities except the ith, which is exactly what we need.
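As a quick check with some hypothetical probabilities:

exactlyone(c(0.1, 0.2, 0.3))   # 0.1*0.8*0.7 + 0.9*0.2*0.7 + 0.9*0.8*0.3 = 0.398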
As mentioned, the functions cumsum() and cumprod() return cumulative sums and products.
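For instance, with a vector whose values can be read off from the outputs below:

x <- c(12, 5, 13)
cumsum(x)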
## [1] 12 17 30
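And the cumulative product of the same vector:

cumprod(x)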
## [1] 12 60 780
In x, the sum of the first element is 12, the sum of the first two elements is 17, and the sum of the first three elements is 30. The function cumprod() works the same way as cumsum(), but with the product instead of the sum.
There is quite a difference between min() and pmin(). The former simply combines all its arguments into one long vector and returns the minimum value in that vector. In contrast, if pmin() is applied to two or more vectors, it returns a vector of the pair-wise minima, hence the name pmin.
Here’s an example:
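One pair of vectors consistent with the outputs and the explanation below (the names x and y are arbitrary choices):

x <- c(1, 5, 6)
y <- c(2, 3, 2)
min(x, y)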
## [1] 1
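pmin(x, y)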
## [1] 1 3 2
In the first case, min() computed the smallest value in (1,5,6,2,3,2). But the call to pmin() computed the smaller of 1 and 2, yielding 1; then the smaller of 5 and 3, which is 3; then finally the minimum of 6 and 2, giving 2. Thus, the call returned the vector (1,3,2).
You can use more than two arguments in pmin(), like this:
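One call consistent with the explanation below:

pmin(c(5, 2), c(1, 3), c(6, 2))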
## [1] 1 2
The 1 in the output is the minimum of 1, 5, and 6, with a similar computation leading to the 2. The max() and pmax() functions act analogously to min() and pmin().
Function minimization/maximization can be done via nlm() and optim(). For example, let’s find the smallest value of \[f(x) = x^2 - \sin(x)\]
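The call below follows the discussion after the output: an anonymous function for f, with 8 as the initial guess.

nlm(function(x) return(x^2 - sin(x)), 8)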
## $minimum
## [1] -0.2324656
##
## $estimate
## [1] 0.4501831
##
## $gradient
## [1] 4.024558e-09
##
## $code
## [1] 1
##
## $iterations
## [1] 5
Here, the minimum value was found to be approximately −0.23, occurring at x = 0.45. A Newton-Raphson method (a technique from numerical analysis for approximating roots) is used, running five iterations in this case. The second argument specifies the initial guess, which we set to be 8. (This second argument was picked pretty arbitrarily here, but in some problems, you may need to experiment to find a value that will lead to convergence.)
R also has some calculus capabilities, including symbolic differentiation and numerical integration, as you can see in the following example.
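The two calls whose output follows can be reconstructed as below; first, the symbolic derivative of \(e^{x^2}\):

D(expression(exp(x^2)), "x")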
## exp(x^2) * (2 * x)
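And then the numerical integral of \(x^2\) over [0, 1]:

integrate(function(x) x^2, 0, 1)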
## 0.3333333 with absolute error < 3.7e-15
Here, R reported \[ \frac{d}{dx} e^{x^2} =2xe^{x^2}\]
and \[ \int_{0}^1 x^2 dx \simeq 0.33333\]
You can find R packages for differential equations (odesolve), for interfacing R with the Yacas symbolic math system (ryacas), and for other calculus operations.
R has functions available for most of the famous statistical distributions. Prefix the name with d for the density or probability mass function (pmf), p for the cumulative distribution function (cdf), q for quantiles, and r for random number generation; the rest of the name indicates the distribution, as summarized here:
df <- data.frame(Distribution  = c("Normal", "Chi-square", "Binomial"),
                 Density.pmf   = c("dnorm", "dchisq", "dbinom"),
                 cdf           = c("pnorm", "pchisq", "pbinom"),
                 Quantiles     = c("qnorm", "qchisq", "qbinom"),
                 randomNumbers = c("rnorm", "rchisq", "rbinom"))
df
As an example, let’s simulate 1,000 chi-square variates with 2 degrees of freedom and find their mean.
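A one-line way to do this (the exact mean will differ from run to run, since the variates are random; it should land near the true mean of 2):

mean(rchisq(1000, df = 2))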
## [1] 1.864855
The r in rchisq specifies that we wish to generate random numbers—in this case, from the chi-square distribution. As seen in this example, the first argument in the r-series functions is the number of random variates to generate.
These functions also have arguments specific to the given distribution families. In our example, we use the df argument for the chi-square family, indicating the number of degrees of freedom.
Let’s also compute the 95th percentile of the chi-square distribution with two degrees of freedom:
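qchisq(0.95, 2)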
## [1] 5.991465
Here, we used q to indicate quantile—in this case, the 0.95 quantile, or the 95th percentile. The first argument in the d, p, and q series can actually be a vector, so that we can evaluate the density/pmf, cdf, or quantile function at multiple points. Let’s find both the 50th and 95th percentiles of the chi-square distribution with 2 degrees of freedom:
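qchisq(c(0.5, 0.95), df = 2)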
## [1] 1.386294 5.991465
Ordinary numerical sorting of a vector can be done with the sort() function, as in this example:
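With a vector whose values can be read off from the outputs below:

x <- c(13, 5, 12, 5)
sort(x)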
## [1] 5 5 12 13
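x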
## [1] 13 5 12 5
Note that x itself did not change, in keeping with R’s functional language philosophy. If you want the indices of the sorted values in the original vector, use the order() function. Here’s an example:
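order(x)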
## [1] 2 4 3 1
This means that x[2] is the smallest value in x, x[4] is the second smallest, x[3] is the third smallest, and so on.
You can use order(), together with indexing, to sort data frames, like this:
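A minimal sketch: the data frame values below are hypothetical, chosen only so that the second column sorts in the order shown (the names y, r, and z match the explanation that follows):

y <- data.frame(V1 = c("def", "ab", "zzzz"), V2 = c(2, 5, 1))
r <- order(y[, 2])
r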
## [1] 3 1 2
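z <- y[r, ]   # the data frame sorted by its second column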
What happened here? We called order() on the second column of y, yielding a vector r that tells us where the rows should go if we want the frame sorted by that column. The 3 in this vector tells us that y[3,2] is the smallest number in y[,2]; the 1 tells us that y[1,2] is the second smallest; and the 2 tells us that y[2,2] is the third smallest. We then use indexing to produce the frame sorted by column 2, storing it in z. You can use order() to sort according to character variables as well as numeric ones; here is a sketch, continuing with the hypothetical y above:
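y[order(y$V1), ]   # rows ordered alphabetically by the character column V1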
A related function is rank(), which reports the rank of each element of a vector.
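rank(x)   # x is still c(13, 5, 12, 5) from the sorting example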
## [1] 4.0 1.5 3.0 1.5
This says that 13 had rank 4 in x; that is, it is the fourth smallest. The value 5 appears twice in x, with those two being the first and second smallest, so the rank 1.5 is assigned to both. Optionally, other methods of handling ties can be specified.
Multiplying a vector by a scalar works directly, as you saw earlier. Here’s another example:
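With a vector read off from the doubled output below (the name y is arbitrary):

y <- c(1, 3, 4, 10)
2 * y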
## [1] 2 6 8 20
If you wish to compute the inner product (or dot product) of two vectors, use crossprod(), like this:
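crossprod(1:3, c(5, 12, 13))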
## [,1]
## [1,] 68
The function computed 1 · 5 + 2 · 12 + 3 · 13 = 68. Note that the name crossprod() is a misnomer, as the function does not compute the vector cross product.
For matrix multiplication in the mathematical sense, the operator to use is %*%, not *. For instance, here we compute the matrix product:
\[ \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} \begin{pmatrix} 1 & -1 \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 3 & 1 \end{pmatrix} \]
Here’s the code:
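(The names a and b below are arbitrary; note that matrix() fills its entries column by column.)

a <- matrix(c(1, 3, 2, 4), ncol = 2)    # rows (1, 2) and (3, 4)
b <- matrix(c(1, 0, -1, 1), ncol = 2)   # rows (1, -1) and (0, 1)
a %*% b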
## [,1] [,2]
## [1,] 1 1
## [2,] 3 1
The function solve() will solve systems of linear equations and even find matrix inverses. For example, let’s solve this system:
\[x_1 - x_2 = 2\] \[x_1 + x_2 = 4\] Its matrix form is as follows:
\[ \begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 2 \\ 4 \end{pmatrix} \]
Here’s the code:
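(The names a and b are arbitrary; again, matrix() fills by column, so the rows of a come out as (1, -1) and (1, 1).)

a <- matrix(c(1, 1, -1, 1), nrow = 2, ncol = 2)
b <- c(2, 4)
solve(a, b)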
## [1] 3 1
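The second call computes the inverse:

solve(a)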
## [,1] [,2]
## [1,] 0.5 0.5
## [2,] -0.5 0.5
In that second call to solve(), the lack of a second argument signifies that we simply wish to compute the inverse of the matrix. Here are a few other linear algebra functions:

- t(): Matrix transpose
- qr(): QR decomposition
- chol(): Cholesky decomposition
- det(): Determinant
- eigen(): Eigenvalues/eigenvectors
- diag(): Extracts the diagonal of a square matrix (useful for obtaining variances from a covariance matrix and for constructing a diagonal matrix)
- sweep(): Numerical analysis sweep operations
Note the versatile nature of diag(): If its argument is a matrix, it returns a vector, and vice versa. Also, if the argument is a scalar, the function returns the identity matrix of the specified size.
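A quick illustration (the off-diagonal entries of the hypothetical matrix m below are arbitrary; only its diagonal appears in the output):

m <- matrix(c(1, 7, 2, 8), nrow = 2)
dm <- diag(m)   # matrix argument: extract the diagonal
dm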
## [1] 1 8
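diag(dm)   # vector argument: build a diagonal matrix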
## [,1] [,2]
## [1,] 1 0
## [2,] 0 8
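diag(3)   # scalar argument: 3-by-3 identity matrix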
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
The sweep() function is capable of fairly complex operations. As a simple example, let’s take a 3-by-3 matrix and add 1 to row 1, 4 to row 2, and 7 to row 3.
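The matrix entries below are read off from the output (before the additions, the rows are 1:3, 4:6, and 7:9):

m <- matrix(1:9, nrow = 3, byrow = TRUE)
sweep(m, 1, c(1, 4, 7), "+")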
## [,1] [,2] [,3]
## [1,] 2 3 4
## [2,] 8 9 10
## [3,] 14 15 16
The first two arguments to sweep() are like those of apply(): the array and the margin, which is 1 for rows in this case. The fourth argument is a function to be applied, and the third is an argument to that function (here, the vector of values handed to "+").
R includes some handy set operations, including these:

- union(x,y): Union of the sets x and y
- intersect(x,y): Intersection of the sets x and y
- setdiff(x,y): Set difference between x and y, consisting of all elements of x that are not in y
- setequal(x,y): Test for equality between x and y
- c %in% y: Membership, testing whether c is an element of the set y
- choose(n,k): Number of possible subsets of size k chosen from a set of size n
Here are some simple examples of using these functions:
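The calls below are interleaved with the outputs already shown. The sets x and y are inferred from those outputs (they are also the sets printed in the symmetric-difference example later); where a call's arguments are not pinned down by a TRUE/FALSE result, the values used are plausible choices consistent with it.

x <- c(1, 2, 5)
y <- c(5, 1, 8, 9)
union(x, y)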
## [1] 1 2 5 8 9
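intersect(x, y)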
## [1] 1 5
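setdiff(x, y)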
## [1] 2
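setdiff(y, x)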
## [1] 8 9
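setequal(x, y)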
## [1] FALSE
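setequal(x, c(1, 2, 5))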
## [1] TRUE
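2 %in% x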
## [1] TRUE
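2 %in% y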
## [1] FALSE
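choose(5, 2)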
## [1] 10
Recall that you can write your own binary operations. For instance, consider coding the symmetric difference between two sets—that is, all the elements belonging to exactly one of the two operand sets. Because the symmetric difference between sets x and y consists exactly of those elements in x but not y and vice versa, the code consists of easy calls to setdiff() and union(), as follows:
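A minimal implementation (the function name symdiff is a choice here, not fixed by the text):

symdiff <- function(a, b) {
   sdfab <- setdiff(a, b)   # in a but not b
   sdfba <- setdiff(b, a)   # in b but not a
   return(union(sdfab, sdfba))
}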
Let’s try it.
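x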
## [1] 1 2 5
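y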
## [1] 5 1 8 9
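symdiff(x, y)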
## [1] 2 8 9
Here’s another example: a binary operator for determining whether one set u is a subset of another set v. A bit of thought shows that this property is equivalent to the intersection of u and v being equal to u. Hence we have another easily coded function:
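The sketch below names the operator %subsetof%; that name and the test vectors are illustrative choices, not fixed by the text.

"%subsetof%" <- function(u, v) {
   return(setequal(intersect(u, v), u))
}
c(3, 8) %subsetof% 1:10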
## [1] TRUE
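c(3, 8) %subsetof% 5:10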
## [1] FALSE
The function combn() generates combinations. Let’s find the subsets of {1,2,3} of size 2.
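combn(1:3, 2)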
## [,1] [,2] [,3]
## [1,] 1 1 2
## [2,] 2 3 3
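class(combn(1:3, 2))   # the result is a matrix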
## [1] "matrix" "array"
The results are in the columns of the output. We see that the subsets of {1,2,3} of size 2 are (1,2), (1,3), and (2,3). The function also allows you to specify a function to be called by combn() on each combination. For example, we can find the sum of the numbers in each subset, like this:
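combn(1:3, 2, sum)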
## [1] 3 4 5
The first subset, {1,2}, has a sum of 3, and so on.
One of the most common uses of R is simulation. Let’s see what kinds of tools R has available for this application.
As mentioned, R has functions to generate variates from a number of different distributions. For example, rbinom() generates binomial or Bernoulli random variates.
Let’s say we want to find the probability of getting at least four heads out of five tosses of a coin (easy to find analytically, but a handy example). Here’s how we can do this:
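A sketch of the simulation (being random, the estimated proportion will vary slightly from run to run; the vector name x follows the explanation below):

x <- rbinom(100000, 5, 0.5)   # 100,000 binomial variates: 5 trials each, success prob 0.5
mean(x >= 4)                  # proportion of simulated tosses with at least 4 heads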
## [1] 0.18758
First, we generate 100,000 variates from a binomial distribution with five trials and a success probability of 0.5. We then determine which of them have the value 4 or 5, resulting in a Boolean vector of the same length as x. The TRUE and FALSE values in that vector are treated as 1s and 0s by mean(), giving us our estimated probability (since the average of a bunch of 1s and 0s is the proportion of 1s).

Other functions include rnorm() for the normal distribution, rexp() for the exponential, runif() for the uniform, rgamma() for the gamma, rpois() for the Poisson, and so on. Here is another simple example, which finds \(E[\max(X, Y)]\), the expected value of the maximum of independent N(0,1) random variables X and Y:
sum <- 0
nreps <- 100000
for (i in 1:nreps) {
   xy <- rnorm(2)   # generate 2 N(0,1)s
   sum <- sum + max(xy)
}
print(sum / nreps)
## [1] 0.5613667
We generated 100,000 pairs, found the maximum for each, and averaged those maxima to obtain our estimated expected value. The preceding code, with an explicit loop, may be clearer, but as before, if we are willing to use some more memory, we can do this more compactly.
emax <- function(nreps) {
   x <- rnorm(2 * nreps)                                    # 2*nreps independent N(0,1) values
   maxxy <- pmax(x[1:nreps], x[(nreps + 1):(2 * nreps)])    # pair-wise maxima
   return(mean(maxxy))
}
Here, we generated double nreps values. The first nreps values simulate X, and the remaining nreps values represent Y. The pmax() call then computes the pair-wise maxima that we need. Again, note the contrast here between max() and pmax(), the latter producing pair-wise maxima.
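A quick call (the exact result varies with the random draw, but should again be near 0.56):

emax(100000)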
According to the R documentation, all random-number generators use 32-bit integers for seed values. Thus, other than round-off error, the same initial seed should generate the same stream of numbers. By default, R will generate a different random number stream from run to run of a program. If you want the same stream each time—important in debugging, for instance—call set.seed(), like this:
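For example (the specific seed value below is arbitrary; any fixed integer gives a reproducible stream):

set.seed(8888)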