It’s possible to use R as a calculator, although that is a bit of overkill.
Suppose we have 6 exam scores and we wish to find the average.
Using ordinary arithmetic:
(78+73+92+85+75+98)/6
## [1] 83.5
We can use R functions to accomplish the same task. First we collect the scores into a vector using the combine function, c(). Then we wrap this inside the mean() function, which calculates the average.
mean(c(78,73,92,85,75,98))
## [1] 83.5
Similarly, we can use the R function sd() to calculate the sample standard deviation of the exam scores.
sd(c(78,73,92,85,75,98))
## [1] 9.974969
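To see what sd() is doing, the sketch below recomputes the sample standard deviation by hand from its definition: the square root of the sum of squared deviations from the mean, divided by n - 1. The names scores, n, deviations, and manual_sd are chosen here for illustration and are not part of the original example.

```r
# Exam scores from the example above
scores <- c(78, 73, 92, 85, 75, 98)

n <- length(scores)                  # number of observations
deviations <- scores - mean(scores)  # distance of each score from the mean

# Sample standard deviation: divide by n - 1, not n
manual_sd <- sqrt(sum(deviations^2) / (n - 1))

manual_sd    # should agree with sd(scores)
sd(scores)
```

Dividing by n - 1 rather than n is what makes this the sample (not population) standard deviation.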
When you want to use the same vector repeatedly, it’s a good idea to give it a descriptive name. In this case, we’ll name our vector “exams”. Use the assignment operator <- to do this.
exams <- c(78,73,92,85,75,98)
Now we can use R functions to calculate many different statistics for the exam scores, using the name of the vector instead of typing out the values again.
mean(exams) #Calculate the mean
## [1] 83.5
sd(exams) #Calculate the standard deviation
## [1] 9.974969
var(exams) #Calculate the variance
## [1] 99.5
median(exams) #Find the median
## [1] 81.5
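As a rough check on two of these results, the sketch below finds the median by hand and confirms that the variance is the square of the standard deviation. The helper names sorted, n, and manual_median are illustrative only.

```r
exams <- c(78, 73, 92, 85, 75, 98)

sorted <- sort(exams)   # 73 75 78 85 92 98
n <- length(sorted)

# With an even number of values, average the two middle ones
manual_median <- mean(sorted[c(n / 2, n / 2 + 1)])
manual_median           # 81.5, same as median(exams)

# The variance is the square of the standard deviation
sd(exams)^2             # agrees with var(exams), up to floating-point rounding
```

The index arithmetic here assumes an even number of values; with an odd count, the median is simply the single middle value.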
Let’s look at some data in the R package openintro. Load the package into working memory. You must do this each time you start a new R session in RStudio.
library(openintro)
Many of the packages used in R include sample data sets. When you load the openintro package, you gain access to this data. We’ll use the View() function to look at the ageAtMar data set, which contains the age at first marriage for a sample of 5,534 US women. Notice that there is a capital “V” for this command. Capitalization matters, because R is case-sensitive! The command below will open the dataset in a new tab in RStudio.
View(ageAtMar)
You’ll want to pay attention to how the data is labeled and coded. Specifically, note the name of the variable and how it is spelled. Also notice that because our data is now in a data frame (and not a simple vector that we defined ourselves like the exams example above) the way we call the functions is different. We must use dataset$variablename as the argument to each function.
mean(ageAtMar$age) #Calculate the mean
## [1] 23.44019
sd(ageAtMar$age) #Calculate the standard deviation
## [1] 4.721365
var(ageAtMar$age) #Calculate the variance
## [1] 22.29129
median(ageAtMar$age) #Find the median
## [1] 23
nrow(ageAtMar) #Count the number of observations to get n
## [1] 5534
min(ageAtMar$age) #Find the minimum value
## [1] 10
max(ageAtMar$age) #Find the maximum value
## [1] 43
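The dataset$variablename pattern used above can be tried out without any package. The sketch below builds a small made-up data frame (the name toy and its values are invented for illustration and have nothing to do with ageAtMar) and pulls out one column.

```r
# A tiny made-up data frame (illustrative values, not from ageAtMar)
toy <- data.frame(age = c(21, 25, 19, 30),
                  height = c(64, 70, 66, 68))

names(toy)      # lists the column names: "age" "height"
toy$age         # pulls out the age column as a plain vector
mean(toy$age)   # 23.75

# Spelling and case matter: a wrong name gives NULL, not the column
is.null(toy$Age)   # TRUE
```

Running names() on a data set first is a quick way to confirm how each variable is spelled before passing it to a function.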
To find the range we can subtract the minimum value from the maximum value in the data.
max(ageAtMar$age) - min(ageAtMar$age)
## [1] 33
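Base R also has a range() function, but it returns the minimum and maximum as a pair rather than their difference. A minimal sketch, using the exam scores from earlier so that it runs without the openintro package:

```r
exams <- c(78, 73, 92, 85, 75, 98)

range(exams)         # returns the minimum and maximum: 73 98
diff(range(exams))   # maximum minus minimum: 25
```

Wrapping range() in diff() gives the same number as subtracting min() from max().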
To find percentiles, we’ll use the quantile() function. This function needs you to tell it which data to look at, and the percentile you want as a decimal. For instance, we could also find the median of the data using the following command.
quantile(ageAtMar$age, .50)
## 50%
## 23
To find the first (lower) quartile, Q1, we use
quantile(ageAtMar$age, .25)
## 25%
## 20
And to find the third (upper) quartile, Q3, we use
quantile(ageAtMar$age, .75)
## 75%
## 26
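quantile() also accepts several percentiles at once if you pass them as a vector. The sketch below uses the exam scores from earlier (so it runs without openintro); the name quartiles is illustrative.

```r
exams <- c(78, 73, 92, 85, 75, 98)

# Request Q1, the median, and Q3 in one call
quartiles <- quantile(exams, c(.25, .50, .75))
quartiles

# summary() reports the minimum, quartiles, mean, and maximum together
summary(exams)
```

By default quantile() interpolates between neighboring data values, so on small data sets its quartiles may differ slightly from those found by a textbook hand method.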
If you save these numbers using the assignment operator <-, you can use them in subsequent calculations.
Q1 <- quantile(ageAtMar$age, .25)
Q3 <- quantile(ageAtMar$age, .75)
IQR <- Q3 - Q1
IQR #This will print the result to the console
## 75%
## 6
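The 75% label in that output is a name carried over from Q3; it does not mean the result is a percentile. The sketch below shows one way to drop the label with unname(), and also the built-in IQR() function from the stats package, which computes the same quantity directly. It uses the exam scores so that it runs without openintro, and the names q1, q3, and manual_iqr are illustrative.

```r
exams <- c(78, 73, 92, 85, 75, 98)

q1 <- quantile(exams, .25)
q3 <- quantile(exams, .75)

manual_iqr <- unname(q3 - q1)  # unname() drops the carried-over "75%" label
manual_iqr

# stats::IQR() is the built-in function; it returns an unlabeled number
stats::IQR(exams)
```

Choosing a name other than IQR for your own variable keeps it from being confused with the built-in function of the same name.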