Descriptive Statistics in R

Using R as a calculator

It’s possible to use R as a calculator, although that is a bit of overkill.

Suppose we have 6 exam scores and we wish to find the average.

Using arithmetic operators:

(78+73+92+85+75+98)/6

## [1] 83.5

Basic statistical functions in R

We can use R functions to accomplish the same task. First we form a vector of the scores using the combine function, c(). Then we wrap this inside the mean() function, which calculates the average.

mean(c(78,73,92,85,75,98))

## [1] 83.5
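As a quick check on what mean() is doing, the sketch below computes the same average from its definition: the sum of the values divided by how many values there are. The names scores, total and n are illustrative labels only.

```r
# Illustrative vector holding the six exam scores
scores <- c(78, 73, 92, 85, 75, 98)

total <- sum(scores)     # add up all the scores
n <- length(scores)      # count how many scores there are

total / n                # same value that mean(scores) returns
```

Here sum() and length() are base R functions, so nothing needs to be installed or loaded.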

Similarly, we can use the R function sd() to calculate the sample standard deviation of the exam scores.

sd(c(78,73,92,85,75,98))

## [1] 9.974969
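To see why sd() is called the sample standard deviation, it can help to rebuild it by hand. The sketch below divides the sum of squared deviations by n - 1 (not n) and then takes the square root. The names scores, deviations, manual_var and manual_sd are illustrative only.

```r
# Illustrative vector holding the six exam scores
scores <- c(78, 73, 92, 85, 75, 98)

n <- length(scores)                        # number of observations
deviations <- scores - mean(scores)        # distance of each score from the mean
manual_var <- sum(deviations^2) / (n - 1)  # sample variance divides by n - 1
manual_sd <- sqrt(manual_var)              # standard deviation is the square root

manual_sd    # should agree with sd(scores)
sd(scores)
```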

Saving a list of numbers

When you want to use the same vector repeatedly, it’s a good idea to give it a descriptive name. In this case, we’ll name our vector “exams”. Use the assignment operator <- to do this.

exams <- c(78,73,92,85,75,98)

Using Statistical Functions

Now we can use R functions to calculate many different statistics for the exam scores, using the name of the vector instead of typing out the values again.

mean(exams) #Calculate the mean

## [1] 83.5

sd(exams)   #Calculate the standard deviation

## [1] 9.974969

var(exams)  #Calculate the variance

## [1] 99.5

median(exams)  #Find the median

## [1] 81.5
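Once the vector has a name, several statistics can also be requested in one call. The sketch below uses the base R summary() function, which reports the minimum, quartiles, median, mean and maximum together, and then checks that the variance is the square of the standard deviation. The name exam_summary is illustrative only.

```r
# Same exam scores as above, repeated so this block runs on its own
exams <- c(78, 73, 92, 85, 75, 98)

exam_summary <- summary(exams)  # min, Q1, median, mean, Q3, max in one object
exam_summary

length(exams)                   # number of scores in the vector

# The variance is the standard deviation squared (up to rounding error)
all.equal(sd(exams)^2, var(exams))
```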

Larger Datasets

Let’s look at some data in the R package openintro. Load the package into working memory with the command below. You must do this each time you start a new R session, for example whenever you reopen RStudio.

library(openintro)  #Load the package; stops with an error if it is not installed

Looking at a pre-loaded data set

Many R packages include sample data sets. When you load the openintro package, you gain access to its data. We’ll use the View() function to look at the ageAtMar data set, which contains the age at first marriage for a sample of 5,534 US women. Notice the capital “V” in this command. R is case-sensitive, so capitalization matters! The command below will open the data set in a new tab in RStudio.

View(ageAtMar)

You’ll want to pay attention to how the data is labeled and coded. Specifically, note the name of the variable and how it is spelled. Also notice that our data is now stored in a data frame, not a simple vector that we defined ourselves like the exams example above, so the way we call the functions changes: we pass dataset$variablename as the argument to each function.
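The dataset$variablename pattern can be tried out on a tiny data frame before working with a packaged one. In the sketch below, the data frame marriages and its five values are made up purely for illustration and are not taken from openintro.

```r
# Hypothetical data frame standing in for a packaged data set
marriages <- data.frame(age = c(22, 25, 19, 30, 27))

names(marriages)      # lists the variable (column) names, here just "age"
marriages$age         # extracts the age column as a plain numeric vector
mean(marriages$age)   # functions take the extracted vector as their argument
```

The same pattern applies to the real data: ageAtMar is the data frame and age is the variable inside it.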

Find descriptive statistics for this data set.

mean(ageAtMar$age)   #Calculate the mean

## [1] 23.44019

sd(ageAtMar$age)     #Calculate the standard deviation

## [1] 4.721365

var(ageAtMar$age)    #Calculate the variance

## [1] 22.29129

median(ageAtMar$age) #Find the median

## [1] 23

nrow(ageAtMar)       #Count the number of observations to get n

## [1] 5534

min(ageAtMar$age)    #Find the minimum value

## [1] 10

max(ageAtMar$age)    #Find the maximum value

## [1] 43

To find the range we can subtract the minimum value from the maximum value in the data.

max(ageAtMar$age) - min(ageAtMar$age)

## [1] 33
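Base R also has a range() function, but note that it returns the minimum and maximum as a pair, not their difference. A minimal sketch, using the small exams vector so that it runs without the openintro package:

```r
# Exam scores from earlier, repeated so this block runs on its own
exams <- c(78, 73, 92, 85, 75, 98)

range(exams)               # returns two values: the minimum and the maximum
diff(range(exams))         # difference between them, i.e. the statistical range
max(exams) - min(exams)    # same result, written out explicitly
```

With the marriage data, diff(range(ageAtMar$age)) would be the equivalent call.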

Finding percentiles

To find percentiles, we’ll use the quantile() function. Its first argument is the data to look at, and its second is the percentile you want, written as a decimal between 0 and 1. For instance, we could also find the median of the data using the following command.

quantile(ageAtMar$age, .50)

## 50% 
##  23

To find the first (lower) quartile, Q1, we use

quantile(ageAtMar$age, .25)

## 25% 
##  20

And to find the third (upper) quartile, Q3, we use

quantile(ageAtMar$age, .75)

## 75% 
##  26
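The second argument to quantile() can also be a vector, so several percentiles can be requested in one call. The sketch below uses the small exams vector so that it runs without the openintro package; the name quartiles is illustrative only.

```r
# Exam scores from earlier, repeated so this block runs on its own
exams <- c(78, 73, 92, 85, 75, 98)

# Ask for Q1, the median, and Q3 in a single call
quartiles <- quantile(exams, c(.25, .50, .75))
quartiles

quartiles["50%"]   # individual results can be picked out by their label
```

By default quantile() interpolates between data points, so on small data sets its quartiles can differ slightly from those found with some textbook hand methods.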

If you save these numbers using the assignment operator <-, you can use them in subsequent calculations.

Q1 <- quantile(ageAtMar$age, .25)
Q3 <- quantile(ageAtMar$age, .75)
iqr <- Q3 - Q1  #Lowercase name avoids confusion with R's built-in IQR() function
iqr  #This will print the result to the console

## 75% 
##   6
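The “75%” label on the result is carried over from Q3 and is not part of the answer. The sketch below shows two ways to get a cleaner result: dropping the label with unname(), or calling the base R function IQR() directly. It uses the small exams vector so that it runs without the openintro package; the names q1, q3 and spread are illustrative only.

```r
# Exam scores from earlier, repeated so this block runs on its own
exams <- c(78, 73, 92, 85, 75, 98)

q1 <- quantile(exams, .25)
q3 <- quantile(exams, .75)

spread <- unname(q3 - q1)  # unname() drops the leftover "75%" label
spread

IQR(exams)                 # built-in function that computes Q3 - Q1 directly
```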