Calculating Summary Statistics

Setting Up Your R Script
Loading packages
Reading in data
Calculating summary statistics and saving as new dataframe
Creating new .csv of summary stats
Full code with limited annotation

Setting Up Your R Script

Do not forget to help your future self by annotating your code and providing a description before writing any code. It is also a good idea to create an R project and make sure you know where you put the data you plan to work with. If you do not have an R project established, you will have to set your working directory (using the setwd() function). This is an example of the first couple of lines I like to use when writing code:

# Title: Calculating summary stats using sample dataset
# Name: Joseph Brown
# Email: brownjk5@vcu.edu
# Date: 08/20/20
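
If you are not using an R project, the working-directory step described above could look something like the sketch below. The folder used here is R's temporary directory, chosen only so the example runs anywhere; in practice you would substitute the path to your own project folder.

```r
# Check where R is currently looking for files
old_wd <- getwd()
old_wd

# Point R at a different folder.
# Assumption: tempdir() stands in for your own project folder.
setwd(tempdir())
getwd()

# Return to the original working directory
setwd(old_wd)
```

Any relative file path you use later (for example in read.csv()) is interpreted relative to whatever getwd() reports.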

Loading packages

You should always start writing a script by installing and loading the packages you know you will be using. It is okay if you find out you need more packages later; it's never too late to install a new package! Remember that you only need to install a package once, but you need to load it with library() in every new R session.

To calculate summary statistics we will only need functions from the dplyr package and those that come with base R.

# Loading packages we need for our script
# install.packages("dplyr") - I already have this package installed
library(dplyr)

Keep in mind that dplyr is a package in the tidyverse suite of packages. You can either load tidyverse or just dplyr.
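
Because a package only needs to be installed once, one possible pattern is to install it only when it is missing. This is a minimal sketch, not the only way to do it; requireNamespace() is a base R function that checks whether a package is available without attaching it.

```r
# Install dplyr only if it is not already installed
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}

# Load the package for this session
library(dplyr)
```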

Reading in data

For this exercise we will be using the publicly available iris dataset. However, I have the dataset saved in my R project folder called “raw data” so that we can practice reading an existing .csv file into the R environment.

In tidyverse it is possible to import multiple file types (e.g. .xls, .json, etc.). But we will start by practicing with a .csv file because it is the most commonly used format. Below is an example of using the base R read.csv() function to locate and read in our data file:

# Reading in iris_data.csv dataset 
iris_data <- read.csv("R_projects_fall20/raw data/iris_data.csv")
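
If you do not have a copy of iris_data.csv, one way to make a comparable file is to write R's built-in iris dataset to disk and read it back. This is only a sketch: it assumes the course file was created from the built-in dataset, and the csv_path location in the temporary directory is hypothetical.

```r
# Assumption: the practice file matches R's built-in iris dataset
csv_path <- file.path(tempdir(), "iris_data.csv")  # hypothetical location

# write.csv() includes row names by default
write.csv(iris, csv_path)

# Reading the file back turns the unnamed row-name column into "X"
iris_data <- read.csv(csv_path)
names(iris_data)
```

This would also explain the extra X column that appears when the data are printed.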

We can view a sample of the first 6 lines by using the head() function:

# Checking data using head()
head(iris_data)

##   X Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 1          5.1         3.5          1.4         0.2  setosa
## 2 2          4.9         3.0          1.4         0.2  setosa
## 3 3          4.7         3.2          1.3         0.2  setosa
## 4 4          4.6         3.1          1.5         0.2  setosa
## 5 5          5.0         3.6          1.4         0.2  setosa
## 6 6          5.4         3.9          1.7         0.4  setosa

You can use the tail() function to see the last 6 lines of your data.

There are other data exploration techniques that can give you information about your data. For example, you can use the str() function to see how your data are structured and how R is reading your data types (e.g. factor, numeric, integer, etc.).
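
The exploration functions mentioned above can be tried as in the sketch below. It uses R's built-in iris dataset (an assumption, so that the example runs without a file), and summary() is one more base R function that can be handy here.

```r
# Assumption: using R's built-in iris data instead of iris_data.csv
tail(iris)      # last 6 rows
str(iris)       # structure: column names, types, and a preview of values
summary(iris)   # quick per-column overview
```

Note that in R 4.0.0 and later, read.csv() reads text columns as character rather than factor by default, so str(iris_data) may show Species as chr instead of Factor.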

Calculating summary statistics and saving as new dataframe

First we are going to direct the result to a new data frame, which will help us check our work easily in R. Here I am naming the new data frame iris_sepal.length_summary.

In the code chunk below you will notice an operator that may be new to you (%>%). This is called the “pipe”, and using it is called “piping”. It takes the output of one function and makes it the first argument of the next function. It is a commonly used operator in dplyr to streamline code.
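
As a small illustration of piping, the sketch below runs the same call with and without the pipe. It uses R's built-in iris dataset (an assumption, so that no file is needed), and the object names nested and piped are only for illustration.

```r
library(dplyr)

# Without the pipe: the data frame is written as the first argument of head()
nested <- head(iris, 3)

# With the pipe: iris is passed in as the first argument of head()
piped <- iris %>% head(3)

# Both approaches give the same result
identical(nested, piped)
```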

R programmers have made it fairly simple to calculate summary statistics because most of the functions are intuitively named. Here is a list of functions and what they do:

mean() = calculates the average of the variable you identify
median() = identifies the median value in your data
n() = counts the sample size (number of rows) in each group; this is a dplyr function used inside summarize()
var() = computes the variance of your data
sd() = computes the standard deviation of your data

You may notice that the above list doesn’t include standard error. Base R does not have a function for calculating standard error, so we have to write a line of code to calculate our own SE values.

Luckily, R computes standard deviation for us, and we know that SE = the SD divided by the square root of the sample size. So, we can calculate the SE of our sample using the following formula:

SE = sd()/sqrt(n())

The function sqrt() calculates the square root of any numerical input.
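
To see the formula at work outside of dplyr, here is a minimal sketch. The helper name standard_error is hypothetical (it is not a base R function), and length() is used in place of n() because n() only works inside dplyr functions such as summarize(). The values are the six Sepal.Length measurements shown in the head() output above.

```r
# Hypothetical helper: SE = SD divided by the square root of the sample size
standard_error <- function(x) {
  sd(x) / sqrt(length(x))
}

# First six Sepal.Length values from the head() output
sepal_lengths <- c(5.1, 4.9, 4.7, 4.6, 5.0, 5.4)

standard_error(sepal_lengths)
```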

Before using the above functions to compute your summary statistics, we need to tell R how to group our data (usually by factor levels) using the group_by() function that is in the dplyr package.

Next, we will compute our summary stats within the summarize() function in dplyr. Inside a single summarize() call, the summary variables we create first (e.g. SD and n) can be used by the ones that follow, so we can easily calculate SE. See the example code below:

# Calculating summary stats for Sepal.Length, grouping by Species
iris_sepal.length_summary <- iris_data %>% # Starting from iris_data and storing the result in a new data frame
  group_by(Species) %>% # Telling R how to group our data for calculations
  summarize(
    mean_Sepal.Length = mean(Sepal.Length), # Computes mean and stores it in a df as "mean_Sepal.Length"
    median_Sepal.Length = median(Sepal.Length),
    variance_Sepal.Length = var(Sepal.Length),
    sd_Sepal.Length = sd(Sepal.Length),
    n_Sepal.Length = n(),
    se_Sepal.Length = sd_Sepal.Length / sqrt(n_Sepal.Length) # Using the sd we computed above to write formula for SE calculation 
  )

You should find that when you run this code a new data frame will appear in the R Global Environment called iris_sepal.length_summary that has a row for each species and columns that identify all the summary stats we calculated.

It should look something like this:

New dataframe including all summary stats we computed with the above code
Species	mean_Sepal.Length	median_Sepal.Length	variance_Sepal.Length	sd_Sepal.Length	n_Sepal.Length	se_Sepal.Length
setosa	5.006	5.0	0.1242490	0.3524897	50	0.0498496
versicolor	5.936	5.9	0.2664327	0.5161711	50	0.0729976
virginica	6.588	6.5	0.4043429	0.6358796	50	0.0899270
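
One way to check your work is to recompute a single group by hand with base R and compare it to the table. The sketch below assumes that iris_data.csv holds the same values as R's built-in iris dataset; the object names are only for illustration.

```r
# Assumption: the built-in iris data match the values in iris_data.csv
setosa_lengths <- iris$Sepal.Length[iris$Species == "setosa"]

# Mean and standard error for the setosa group
setosa_mean <- mean(setosa_lengths)
setosa_se <- sd(setosa_lengths) / sqrt(length(setosa_lengths))

round(setosa_mean, 3)
round(setosa_se, 7)
```

If the assumption holds, these two values should agree with the setosa row of the summary data frame.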

Creating new .csv of summary stats

The final step is to save the newly created summary stats data frame to a new .csv file in your R project.

# Writing new .csv file and storing it in "created data" folder of my R project
write.csv(iris_sepal.length_summary, "R_projects_fall20/created data/iris_sepal.length_summary.csv")
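
By default, write.csv() also writes row names as an extra first column, which shows up as a column named X when the file is read back in. If you would rather leave it out, a sketch like the one below may help. It uses the built-in iris dataset, a shortened summary, and a hypothetical out_path in the temporary directory so that it runs on its own.

```r
library(dplyr)

# Shortened summary for illustration, built from R's built-in iris data
summary_df <- iris %>%
  group_by(Species) %>%
  summarize(mean_Sepal.Length = mean(Sepal.Length))

# Hypothetical output location; replace with a folder in your own project
out_path <- file.path(tempdir(), "iris_sepal.length_summary.csv")

# row.names = FALSE leaves out the extra row-number column
write.csv(summary_df, out_path, row.names = FALSE)

# Reading the file back shows only the columns we created
names(read.csv(out_path))
```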

Full code with limited annotation

# Title: Calculating summary stats using sample dataset
# Name: Joseph Brown
# Email: brownjk5@vcu.edu
# Date: 08/20/20

# Loading packages
library(dplyr)

# Reading in csv of data - This line of code is user dependent
iris_data <- read.csv("R_projects_fall20/raw data/iris_data.csv")

# Take a look at your data
head(iris_data)

# Calculating summary statistics and constructing new data frame
iris_sepal.length_summary <- iris_data %>%
  group_by(Species) %>%
  summarize(
    mean_Sepal.Length = mean(Sepal.Length),
    median_Sepal.Length = median(Sepal.Length),
    variance_Sepal.Length = var(Sepal.Length),
    sd_Sepal.Length = sd(Sepal.Length),
    n_Sepal.Length = n(),
    se_Sepal.Length = sd_Sepal.Length / sqrt(n_Sepal.Length)
  )