Programming in R

Section 1: Getting Started in R

This chapter will cover basic programming concepts, and setting up your R session which will include setting the working directory, installing packages, importing/exporting data, and viewing data.

Basic Programming Concepts

This section will provide a brief overview of the most fundamental programming concepts, which is a good place to start if you’ve never used or have forgotten how to use R.

What’s an Object?

Everything is an object in R. For simplicity’s sake, we’ll say that an object is a name given to something so we can reference it later. Highways are given numbers so they can be referenced; for example, interstate 15. In much the same way, objects are names used to reference a value or a set of values.

y <- 2 # This symbol "<-" is how you assign a value to an object
3 -> x # This way "->" works equally well

Data Types

R has 5 basic data types, which are known as classes, and to manipulate your data it is a good idea to have a basic understanding of these classes.

0.5, 2         ## 1. Numeric class
TRUE, FALSE    ## 2. Logical class
"hat", "dog"   ## 3. Character class
5L, 12L        ## 4. Integer class ("L" stores this as integer)
1+0i, 2+4i     ## 5. Complex class

These classes are the basic building blocks, and all values in R belong to one of these classes. Multiple values can be concatenated together and assigned to an object:

y <- c(1,2,3,4,5)
y

## [1] 1 2 3 4 5

In the above example, numeric values were concatenated together and assigned to the object y. Concatenating values together and assigning them to one object creates a data structure.

Data structures

R has many data structures. The most relevant data structures include:

Vectors: all values within a vector must be of the same class. In this example, all values are of the numeric class.

x <- c(3, 23, 56, 12, 54)
x

## [1]  3 23 56 12 54

Lists: contain values that can belong to any class. Lists can even other data structures, such as another list.

my_list <- list(0.5, TRUE, "hat")
my_list

## [[1]]
## [1] 0.5
## 
## [[2]]
## [1] TRUE
## 
## [[3]]
## [1] "hat"

Factors: look similar to vectors, except that they contain levels that group the data. In this example, a vector is first created and then converted to a factor.

y <- c(3, 3, 7, 7, 7, 3, 5, 5, 3, 7)
y <- factor(y)
y

##  [1] 3 3 7 7 7 3 5 5 3 7
## Levels: 3 5 7

Data frames: a special type of list where every element of the list has the same length. A data frame is usually what’s created when importing a data file into R.

df <- data.frame(id = letters[1:5], x = 1:5, y = 6:10)
df

##   id x  y
## 1  a 1  6
## 2  b 2  7
## 3  c 3  8
## 4  d 4  9
## 5  e 5 10

Setting up your R session

Setting the working directory

Before you start importing files and coding, you need to set the working directory so that R looks for and saves files in the correct folder (which can be whatever you choose). You’ll want to set the working directory to wherever your data files are located.

getwd() # Shows your current working directory
setwd("/Desktop/Code") # Sets the working directory to the listed file path
dir() # Shows files in the working directory

Installing Packages

Install any packages you want to use in your code, such as the tidyverse package. The tidyverse package is a collection of packages that includes ggplot2, dplyr, tidyr, readr, and several others. You only need to install a package once. Once the package is installed you should never have to install it again.

install.packages("tidyverse")
library() # Shows the installed packages

Loading libraries

Packages only need to be installed once, but they need to be loaded every time you start an R session. Loading a package puts it into the working memory; that way R knows you want to use those functions from those packages. Requiring the user to load packages is a way for R to save memory, which allows for more efficient and faster processing.

library(tidyr)
search() # Shows the loaded packages

Importing/Exporting data

Importing

Make sure that your working directory is correctly set before attempting to import files. If your data files are contained within your working directory folder, you simply need to type the name of the data file with quotes around it:

1.Import a csv file

mydata <- read.csv("data.csv")

# If the data is contained within a subfolder of the working directory, you need to specify the file path when importing:

mydata <- read.csv("/data_files/example.csv")

Import an excel file

# The readxl package is needed to use this function
# Library(readxl)
mydata <- read_excel("data.xlsx")

Import a table (txt file)

# Reads a file in table format and creates a data frame from it
mydata <- read.table("file.txt", sep="\t")

Exporting

Functions to export your data to a text or csv file:

write.table(mydata, file = "test.txt", sep = "\t")
write.csv(mydata, file = "mydata.csv")

Feeling Out Your Data

Listed below are a few useful functions for feeling out your data after it has been imported.

We’ll use R’s built-in datasets to give an example

data() # Shows R's built-in datasets
iris   # We'll use the iris dataset

The str() function shows the overall structure of the data, which consists of the data structure and the class of each variable, as well as the number of observations and variables. It also shows a preview of the data for each variable.

str(iris) # Shows the structure of the data

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Here are some other useful functions to feel out your data:

head(iris)         # First 6 rows of dataset
head(iris, n=3)    # First 3 rows of dataset
head(iris, n= -50) # All rows but the last 50

tail(iris)         # Last 6 rows
tail(iris, n=10)   # Last 10 rows
tail(iris, n= -10) # All rows but the first 10

iris[1:10, ]       # First 10 rows (brackets are used to subset)
iris[ ,1:2]        # First two columns 
iris[1:10,1:3]     # First 10 rows of data of the first 3 columns

Here are some basic plotting functions to start exploring variables and the relationships between variables. More on plotting outliers and assessing normality will be covered in Chapter 4.

# Dollar sign "$" is a way to subset a dataframe by the column name
plot(iris$Sepal.Length, iris$Sepal.Width)
plot(iris$Species)
boxplot(iris$Sepal.Length, iris$Sepal.Width)
hist(iris$Sepal.Length)

Section 2: Merging and Cleaning Data

The goal of this chapter is to help you compile your data into a single data file that can be manipulated in R, and to then clean that data and prepare it for data transformation.

Packages needed for this chapter:

library(tidyverse)

Merging

Oftentimes, you will probably need to stitch together rows and columns from different data files. This can be accomplished using several different functions in R, including (but not limited to) cbind(), rbind(), and a custom loop function if more complex merging is required.

Merging with Cbind and Rbind

Below are examples of the cbind() and rbind() functions. As you can see, the data is added sequentially based off the order that the data is referenced in the function. If you need to concatenante many files together, however, this is not a very efficient method. Another way to add data files together is by using a loop function.

example_data1 <- c(1:5)
example_data2 <- c(5:1)
example_data3 <- c(6:10)

cbind(example_data1, example_data2, example_data3)

##      example_data1 example_data2 example_data3
## [1,]             1             5             6
## [2,]             2             4             7
## [3,]             3             3             8
## [4,]             4             2             9
## [5,]             5             1            10

rbind(example_data1, example_data2, example_data3)

##               [,1] [,2] [,3] [,4] [,5]
## example_data1    1    2    3    4    5
## example_data2    5    4    3    2    1
## example_data3    6    7    8    9   10

Merging with Loop Function

Here is a custom loop function to compile all of the files within a directory into a single file; courtesy of Dr. Lohse. Keep in mind that this loop function is specific to the dataset that was used in this example (hence, custom loop functon); meaning you cannot simply plug and chug to use it with your data. You will need to specify exactly how the data should be merged together.

setwd("~/Desktop/Code/KINES_6770/data/eeg_data")
# Set the working directory to the location of the files

file_list <- list.files()
file_list
# Create an object that references the list of files

for (file in file_list){
  
  # if the merged dataset doesn't exist, create it
  if (!exists("MASTER")){
    MASTER <- read.table(file, header=TRUE, sep=",")
    # Take the subject ID from the file name and make it a variable:
    MASTER$subID <-factor(substr(file, start = 1, stop = 3))
    # Extract the block from the file name and make it a variable:
    MASTER$block <-factor(substr(file, start = 25, stop = 27))
    print(file)
  }
  
  # if the merged dataset does exist, append to it
  if (exists("MASTER")){
    # Create the temporary data file:
    temp_dataset <-read.table(file, header=TRUE, sep=",")
    # Take the subject ID from the file name and make it a variable:
    temp_dataset$subID <-factor(substr(file, start = 1, stop = 3))
    # Extract the block from the file name and make it a variable:
    temp_dataset$block <-factor(substr(file, start = 25, stop = 27))
    # Row bind the temporary data to the master data
    MASTER<-rbind(MASTER, temp_dataset) 
    # Remove or "empty" the temporary data set
    rm(temp_dataset) 
    print(file)
  }
  
}

Cleaning Data with tidyr

The “tidyr” package is a great package for cleaning data. Only two of the functions will be discussed in this notebook: spread and gather. If you find yourself needing more data cleaning functions, take a look at R for Data Science by Hadley Wickham.

What makes a dataset tidy?

Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.

Gathering

Problem: column names are not names of variables, but values of a variable.

table4a

## # A tibble: 3 x 3
##   country     `1999` `2000`
## * <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766

Solution: Gather those columns into a new pair of variables

table4a %>%
  gather(`1999`, `2000`, key = "year", value = "cases")

## # A tibble: 6 x 3
##   country     year   cases
##   <chr>       <chr>  <int>
## 1 Afghanistan 1999     745
## 2 Brazil      1999   37737
## 3 China       1999  212258
## 4 Afghanistan 2000    2666
## 5 Brazil      2000   80488
## 6 China       2000  213766

# This "%>%" is called a pipe and it is used extensively in the tidyverse package. When you see the pipe, think of the word "then". For example, call on table4a, then use the gather function to manipulate the table.

Spreading

Problem: an observation is scattered across multiple rows.

table2[1:8,]

## # A tibble: 8 x 4
##   country      year type           count
##   <chr>       <int> <chr>          <int>
## 1 Afghanistan  1999 cases            745
## 2 Afghanistan  1999 population  19987071
## 3 Afghanistan  2000 cases           2666
## 4 Afghanistan  2000 population  20595360
## 5 Brazil       1999 cases          37737
## 6 Brazil       1999 population 172006362
## 7 Brazil       2000 cases          80488
## 8 Brazil       2000 population 174504898

Solution: Create new variables and spread them across multiple columns

table2 %>%
    spread(key = type, value = count)

## # A tibble: 6 x 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

Cleaning Data: Converting and Recoding

Converting a Variable

It was discussed in Chapter 1 that there are distinct data types (classes) and data structures in R. You can only do certain manipulations to an R object depending on its class and structure. Sometimes you’ll need to convert the class or structure of an object so that you can manipulate it the way you want.

For example, here is the structure of the mtcars dataset:

str(mtcars)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

The mtcars is a dataframe and all of its variables are vectors. It makes sense to have the cylinder variable grouped by its engine size (as a factor), since there are only 3 possible engine sizes: 4, 6, and 8 liter; that way we can graph the data as a categorical variable rather than a continuous variable. This can be accomplished using the following function:

as.factor(mtcars$cyl)

##  [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
## Levels: 4 6 8

Here are some useful functions for checking and coverting the class and structure of objects:

is.numeric()
is.character()
is.logical()
is.data.frame()
is.factor()
# Check to see what the object is

as.integer()
as.data.frame()
as.factor()
# Change the object if you need to

Recoding a Variable

In Chapter 3 we’ll go over how to create new variables with the dplyer package. In contrast, this section will be about changing specific values within a given variable. For example, what if we had a numeric vector and wanted to recode all values below 5 to 0? Here’s an example of how that could be accomplished:

data <- 1:10 # Data is a vector containing the numbers 1-10
data[data < 5] <- 0 # Values < 5 will be assigned a new value, 0
data

##  [1]  0  0  0  0  5  6  7  8  9 10

Or, let’s say that all values below 10 should be assigned a missing value, NA.

data <- 1:20
data[data < 10] <- NA
data

##  [1] NA NA NA NA NA NA NA NA NA 10 11 12 13 14 15 16 17 18 19 20

What if we had a variable named “gender” that contained character values and we wanted to re-code those values as 0 for males and 1 for females:

gender <- c("MALE","FEMALE","FEMALE", "UNKNOWN", "MALE")
as.numeric(gender) # as.numeric doesn't work in this case

## Warning: NAs introduced by coercion

## [1] NA NA NA NA NA

The as.numeric() function doesn’t work in this case because R doesn’t know how to assign numerical values to characters. Instead, we could use an if/else statement:

ifelse(gender == "MALE", 0, ifelse(gender == "FEMALE", 1, 2))

## [1] 0 1 1 2 0

# if/else statement, where anything other than male or female= 2

Section 3: Data Transformation

Data might not be in the most workable form even after it has been cleaned. Often times, we’ll still need to create new variables, rename variables, reorder observations, and group and summarize the data. The dplyr package (a package within the tidyverse package) has several very useful functions for accomplishing these tasks.

Packages needed for this chapter:

library(dplyer)

Basics of dplyr

Listed below are a few of the more common functions with the dplyr package that can assist in solving a wide array of data-manipulation challenges.

filter() – Selects observations based upon pre-determined criteria
arrange() – Reorders the rows of your data
select() – Picks variables by their names
mutate()- Creates a new variable with functions of existing variables
group_by()- Changes the scope of each function from operating on the entire dataset to operating on it group-by-group
summarize() – Collapses many values down to a single summary

Filter

The filter() function allows you to subset observations based on their values.

mtcars %>% 
  filter(hp== 175)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Pontiac Firebird  19.2   8  400 175 3.08 3.845 17.05  0  0    3    2
## Ferrari Dino      19.7   6  145 175 3.62 2.770 15.50  0  1    5    6

# Returns all observations with a horsepower (hp) of exactly 175

More complex filters require boolean operators to make arguments:

& = and
= or
! = not

For example:

mtcars %>% 
  filter(mpg < 20 & cyl== 6)

##               mpg cyl  disp  hp drat   wt  qsec vs am gear carb
## Valiant      18.1   6 225.0 105 2.76 3.46 20.22  1  0    3    1
## Merc 280     19.2   6 167.6 123 3.92 3.44 18.30  1  0    4    4
## Merc 280C    17.8   6 167.6 123 3.92 3.44 18.90  1  0    4    4
## Ferrari Dino 19.7   6 145.0 175 3.62 2.77 15.50  0  1    5    6

# 6 cylinder cars with an mpg of less than 20

Arrange

The arrange() function changes the order of the variables in a data frame. By default, observations are listed in ascending order of the selected variable.

mtcars%>% 
  arrange(mpg)       # Ascending

mtcars%>% 
  arrange(desc(mpg)) # Descending

Select

If you have a big dataset, you probably don’t want to look at the whole thing all at once because it’s too much information to display on a screen. Or, you might want to compare and contrast variables but their columns are so far apart that it’s difficult to compare them. To get around this, you can use the select() function to select only the variables that you want to keep in the dataset.

mtcars[1:5,] %>% # Only the first 5 obs were selected to save space
  select(mpg, wt, carb)

##                    mpg    wt carb
## Mazda RX4         21.0 2.620    4
## Mazda RX4 Wag     21.0 2.875    4
## Datsun 710        22.8 2.320    1
## Hornet 4 Drive    21.4 3.215    1
## Hornet Sportabout 18.7 3.440    2

mtcars[1:5,] %>% 
  select(mpg:wt) # Select all columns between mpg and weight

##                    mpg cyl disp  hp drat    wt
## Mazda RX4         21.0   6  160 110 3.90 2.620
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875
## Datsun 710        22.8   4  108  93 3.85 2.320
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215
## Hornet Sportabout 18.7   8  360 175 3.15 3.440

Mutate

The mutate() function allows you to add new columns that are functions of existing columns, and it adds those new columns to the end of the dataset.

iris[1:5,] %>% 
  mutate(Sepal_Ratio= Sepal.Length/Sepal.Width)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal_Ratio
## 1          5.1         3.5          1.4         0.2  setosa    1.457143
## 2          4.9         3.0          1.4         0.2  setosa    1.633333
## 3          4.7         3.2          1.3         0.2  setosa    1.468750
## 4          4.6         3.1          1.5         0.2  setosa    1.483871
## 5          5.0         3.6          1.4         0.2  setosa    1.388889

# Sepal_Ratio is a new variable created by taking the Sepal_Length divided by the Sepal_Width for each observation

# If you only want to keep the new variables, you can use the transmute() function instead

Group_by

The group_by() function changes the unit of analysis from the complete dataset to indivdual groups.

mtcars %>%
  group_by(cyl)

## # A tibble: 32 x 11
## # Groups:   cyl [3]
##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
##  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
##  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
##  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
##  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
##  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
##  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
##  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
##  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
## 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
## # … with 22 more rows

It’s a great function, but it doesn’t really do anything by itself. Instead, we can pair group_by() with the other functions, such as the summarise() function, to do a lot of useful taks.

Summarize

The summaize() function (or summarise()) can summarize data based off of the group_by() function.

mtcars %>% 
  group_by(cyl) %>% 
  summarise(avg_mpg= mean(mpg))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 3 x 2
##     cyl avg_mpg
##   <dbl>   <dbl>
## 1     4    26.7
## 2     6    19.7
## 3     8    15.1

# Data was grouped by the cylinder variable, and then the average mpg was computed for each cylinder size

Putting it all together

These functions can be paired together to quickly and easily make all kinds of transformations to a dataset:

mtcars[1:5,] %>% 
  filter(hp > 100) %>% 
  select(mpg, cyl, hp, wt) %>% 
  arrange(mpg) %>% 
  mutate(hp_wt= hp/wt)

##    mpg cyl  hp    wt    hp_wt
## 1 18.7   8 175 3.440 50.87209
## 2 21.0   6 110 2.620 41.98473
## 3 21.0   6 110 2.875 38.26087
## 4 21.4   6 110 3.215 34.21462

Section 4: Visualizing Data

In this section, basic graphing functions will be covered, including types of plots, basic arguments, and aesthetics. Additionally, plots for detecting outliers or determining normality of data will be provided. At the end, applied examples are provided that can be modified for personal use.

Packages needed for this chapter:

library(tidyverse)
librar(ggpubr)

Basic Graph Visualizations

We’ll use the ggplot() function along with the mpg dataset from the tidyverse package for the graphing examples. Let’s try to graph the mpg dataset:

ggplot(data= mpg)

If you try graphing this, you will get a blank graph. That’s because we haven’t told the ggplot function what we want it to do with the data. We need to tell ggplot what type of graph to make. We’ll start with the geom_point() function.

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = hwy, y = cty))

Now we have told ggplot that we want to make a geom_point(), which is a scatter plot, and we’ve also specified the x and y values. There are a couple different ways to write this code, but to keep it consistent we’ll continue writing the code in this format. Here are a few graphs that can be plotted with ggplot():

geom_point(): Plots individual data points according to the x and y values assigned

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = hwy, y = cty))

geom_jitter(): Much like geom_point, but if there are overlapping data points, they will both be clearly plotted on the graph with slight variation in the x value (points are slightly shifted left or right).

ggplot(data = mpg) + 
  geom_jitter(mapping = aes(x = hwy, y = cty))

geom_line(): Plots a line going through each individual data point. Note that with this particular data frame, this code does not produce an interpretable graph. For an advanced example, please see Advanced Examples.

ggplot(data = mpg) + 
  geom_line(mapping = aes(x = hwy, y = cty))

geom_smooth(): Plots line of best fit for data points. “lm” and “loess” are the common arguments for method, which will plot a straight or a curved line, respectively. Below are examples of “lm” (top) and “loess” (bottom) methods.

ggplot(data = mpg) + 
  geom_smooth(method = "lm", mapping = aes(x = hwy, y = cty))

## `geom_smooth()` using formula 'y ~ x'

ggplot(data = mpg) + 
  geom_smooth(method = "loess", mapping = aes(x = hwy, y = cty))

## `geom_smooth()` using formula 'y ~ x'

geom_bar(): Creates a bar chart for specified independent variables.

ggplot(data = mpg) + 
  geom_bar(mapping = aes(x = drv))

geom_boxplot(): Creates a box plot for specified variables, displaying the mean and standard errors for each independent variable. Note that the black dots represent outliers in the data frame.

ggplot(data = mpg) + 
  geom_boxplot(mapping = aes(x = drv, y=cty))

Basic Arguments

The best way to learn the basic arguments of ggplot is to play with different values on the same plot and then see what changes. Listed below are some basic arguments of the ggplot() function:

alpha: adjusts the transparency of a specified object.
size: adjusts the size of data points or lines.
fill: adjusts the color inside of data points.
weight: adjusts the thickness of a line or data point.
linetype: adjusts the type of line displayed in plots with lines; enables categorical variables to be on same graph with different visual representation
shape: adjusts the type of shape displayed in plots with points; enables categorical variables to be on same graph with different visual representation
color: adjusts the color displayed in plots; enables categorical variables to be on same graph with different visual representation

Examples

Here are 3 comparison examples that highlight the functionality of some of these arguments:

ggplot(data = mpg) +
  geom_smooth(method = "loess", size = 1.5, 
              mapping =  aes(x = displ, y = hwy))+
  geom_point(size = 2, mapping =  aes(x = displ, y = hwy))

## `geom_smooth()` using formula 'y ~ x'

ggplot(data = mpg) +
  geom_smooth(method = "loess", size = 1.5, mapping =  aes(
      x = displ, y = hwy, color = drv), se = FALSE)+
  geom_point(size = 2, mapping =  aes(x = displ, y = hwy, 
      color = drv))

## `geom_smooth()` using formula 'y ~ x'

The two graphs above have very similar codes but different results. Both are displaying the relationship between highway miles and “displ,” a categorical variable. The main difference is the implementation of color = drv, which created separate lines for each name in that categorical variable. This would work with linetype for the geom_smooth code, and with shape for the geom_point code. Finally, the se = FALSE removed the shading around the lines, which indicated the standard error.

ggplot(data = mpg, mapping= aes(x = drv, y = cty, shape = drv,
      color = drv)) +
  geom_jitter()

ggplot(data = mpg) +
  geom_jitter(size = 2.3, mapping =  aes(
    x = drv, y = cty, shape = drv, color = drv))+
  coord_cartesian(ylim = c(10,25))

These two graphs display the same data, which is the city mileage for each drv data point. However, we made use of the coord_cartesian function which can zoom in on the axes. We also adjusted the size of the data points. The plot on the right, although it is missing some outlier data points, looks clearer.

ggplot(data = mpg) +
  geom_smooth(method = "lm", mapping =  aes(x = displ, y = cty, linetype = drv), color = "purple", se = FALSE)

## `geom_smooth()` using formula 'y ~ x'

ggplot(data = mpg) +
  geom_smooth(method = "lm", mapping =  aes(x = displ, y = cty), color = "purple", se = FALSE)+
  facet_wrap(~drv, nrow = 1, ncol = 3)

## `geom_smooth()` using formula 'y ~ x'

These graphs show the same data, but in different visual forms. The first code differentiates the different “drv” levels through linetypes, while the second code displays the different levels of “drv” in separate panels. The latter is accomplished through the facet_wrap function. The facet_grid function can also be used, but facet_wrap enables you to customize the layout of the panels (through the ncol and nrow arguments).

Aesthetics

In this section, a template is laid out for developing basic aesthetics in graphs. Please note that there are more adjustments that can be made to a graph or figure, but this provides a basic template that you can use to make graphs. Below, we display two graphs with identical data, but the second graph has been cleaned up by taking advantage of ggplot’s additional arguments.

ggplot(data = mpg) +
  geom_smooth(method = "loess", size = 1.5, mapping =  aes(x = displ, y = hwy, color = drv), se = FALSE)+
  geom_point(size = 2, mapping =  aes(x = displ, y = hwy, color = drv))

## `geom_smooth()` using formula 'y ~ x'

g1 <- ggplot(data = mpg) +
  geom_smooth(method = "loess", size = 1.5, mapping =  aes(x = displ, y = hwy, color = drv), se = FALSE)+
  geom_point(size = 2, mapping =  aes(x = displ, y = hwy, color = drv))
g2 <- g1 + scale_x_continuous(name = "Engine Displacement (litres)")+
  scale_y_continuous(name = "Highway Miles per Gallon", breaks = c(15,20,25,30,35,40,45))
g3 <- g2 + theme_bw() + 
  theme(axis.text=element_text(size=18, colour="black"), 
        axis.title=element_text(size=18,face="bold")) +
  theme(strip.text.x = element_text(size = 18))+
  theme(axis.title.y = element_text(margin = margin(t = 0, r = 20, b = 0, l = 15)))+
  theme(axis.title.x = element_text(margin = margin(t = 15, r = 0, b = 10, l = 0)))+
  theme(legend.position = c(0.80,0.9))+
  theme(legend.title = element_blank())+
  theme(legend.text = element_text(size = 16))+
  theme(plot.title = element_blank())
g4 <- g3+scale_color_discrete(name=element_blank(),
                                 labels = c("4-Wheel Drive",
                                            "Front-Wheel Drive",
                                            "Rear-Wheel Drive"))
plot(g4)

## `geom_smooth()` using formula 'y ~ x'

There are several things to note in the code:

scale_x_continous & scale_y_continuous allow the axes to be automatically scaled to our data, but we can further name the axes and set the breaks, which specifies which numbers should be displayed on an axis.
theme_bw() dictated that the general color scheme of the figure is black and white. However, we did input a color argument before, so this won’t apply to the different lines. The background is now white though instead of grey.
axis_test and axis_title allow us to specify the size and font of the text. This goes for the legend_text and legend_title as well.
margin allows us to adjust the spacing and position between the axis title and the figure itself. legend_position allows us to put the legend where we wish to on the figure.
scale_color_discrete allows us to name the different levels of the “drv” variable as it will appear in the legend.

Checking Normality

There are a variety of ways to check normality of data, but visualizations provide a quick, easy way to determine if assumptions of normality have been met or if outliers are present.

Distribution Graphs

hist(mpg$cty,probability=T, 
     main="Histogram of skewed data", xlab="Skewed data")
lines(density(mpg$cty),col=2)

Above is a histogram and density curve showing the distribution of the “cty” variable in the mpg dataset. This graph displays a slight right skew.

ggqqplot(diamonds$depth)

Above is q-q plot graph, which shows the correlation between your sample distribution and a theoretical normal distribution. The dots should generally hug that line of best fit created in this plot. This plot indicates this may not be a normal distribution since the ends are tailing off, and there are 6 outliers at each end of the plot.

Detecting Outliers and Determining Homoscedasticity

Another useful way to detect outliers is using Cook’s distance. This is applicable when determining if any observations will have a significant impact on a predicted y value in an analysis. For this example, we use the simple lm function to determine if “hwy” can predict “cty.” We then use this model to plot the cook’s distance for each observation.

mod <- lm(cty ~ hwy, data=mpg)  # Create a linear model
cooksd <- cooks.distance(mod)   # Cook's distance

plot(cooksd, pch="*", cex=2, 
     main="Influential Obs by Cooks distance")  # Plot Cook's distance
abline(h = 4*mean(cooksd, na.rm=T), col="red")  # Add cutoff line
text(x=1:length(cooksd)+1,
     y=cooksd, labels=ifelse(cooksd>4*mean(cooksd, na.rm=T),
     names(cooksd),""), col="red")

The red line in this code shows where most of the observations are located, and anything above this line has more influential effects on analysis (because they may be outliers). Generally, observations with a cook’s distance greater than 1 should be removed. However, this graph shows the greatest value to be 0.30 (look at y-axis). Therefore, data probably doesn’t need to be removed, unless the detected data points (as indicated in red text) are subjectively interpreted as outliers

plot(fitted(mod),(abs(resid(mod)/sd(resid(mod)))))

This plot checks if residuals in your data (for an analysis, like a linear model) are equally distributed (homoscedastic) or bunch together as specific values (heteroscedastic). Normal data is homoscedastic: the plot should look like a shotgun blast in this case, which is does. Heteroscedastic data will show residuals bunching near a specific value, forming a “cone” shape with the data in this graph.

Advanced Examples

For these example plots you will need access to the box folder for the KIN 6770 class. To follow along, dowload the “data_TECH_DI.csv” and “TRIM_T_DI.csv” files from the box folder.

Example 1: Mental Toughness

# Download the files and setwd() to the location of the files
setwd("~/Desktop/Code/Lab notebook/data") 

# Read in file
TECH<-read.csv("data_TECH_DI.csv", header = TRUE, sep=",",  
               na.strings=c("NA","NaN"," ",""))

# This creates a new dataframe, filtering out data from 2014 or earlier
Test<- subset(TECH, !Year < 2014)

# Creating the plot
g1 <- ggplot(data = subset(Test, !is.na(Test$SMTQ_total)), mapping = aes(x = Year, y = NAT_TECH, na.rm = TRUE))+
  geom_smooth(method = "lm", size = 2, se = FALSE, aes(color = SMTQ_total > 40))+
  geom_jitter(size = 2, width = 0.25, pch = 21, aes(fill = SMTQ_total > 40))+
  ggtitle(element_blank())+
  coord_cartesian(xlim=c(2013.95,2018.2))
g2 <- g1+scale_x_continuous (name="Years", breaks = pretty(TECH$Year, n=3))+
  scale_y_continuous(name = "National Technical Points (Rank)")
g3 <- g2 + theme_bw() + 
  theme(axis.text=element_text(size=26, colour="black"), 
        axis.title=element_text(size=26,face="bold")) +
  theme(strip.text.x = element_text(size = 18))+
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+
  theme(axis.title.y = element_text(margin = margin(t = 0, r = 20, b = 0, l = 15)))+
  theme(axis.title.x = element_text(margin = margin(t = 20, r = 0, b = 10, l = 0)))+
  theme(legend.position = c(0.3,0.92))+
  theme(legend.text = element_text(size = 22))
g4 <- g3+scale_color_manual(name="",
                            labels = c("Low Mental Toughness (<40)",
                                       "High Mental Toughness (>40)"),
                            values = c("FALSE" = "black",
                                       "TRUE" = "grey"))+
  scale_fill_manual(name="",
                    labels = c("Low Mental Toughness (<40)",
                               "High Mental Toughness (>40)"),
                    values = c("FALSE" = "black",
                               "TRUE" = "grey"))
plot(g4)

## `geom_smooth()` using formula 'y ~ x'

This code created distinct lines and dots by using a conditional argument. In this case, it was if the variable “SMTQ_total” was less than or greater than 40. The rest of the code includes aesthetics that have been covered, but slightly tweaked.

Example 2: Amount of Practice

setwd("~/Desktop/Code/Lab notebook/data") 

# Read in data file
TRIM_T<-read.csv("./TRIM_T_DI.csv", header = TRUE, sep=",",  
                 na.strings=c("NA","NaN"," ",""))

# Remove unspecified gender from data
TRIM_T <- subset(TRIM_T, !is.na(Gender))

# Create average hours variable grouping by age, gender, and country, and creating a variable for SD
AvgAge<-summarize(group_by(TRIM_T, AgeYear, Country, Gender),
                  AgeHours = mean(HOURS_TOTAL, na.rm=TRUE),
                  AgeHoursSD = sd(HOURS_TOTAL)/sqrt(sum(!is.na(HOURS_TOTAL))))

## `summarise()` regrouping output by 'AgeYear', 'Country' (override with `.groups` argument)

# Filtering out ages below 3 and above 18
AvgAge <- filter(AvgAge, AgeYear <19)
AvgAge <- filter(AvgAge, AgeYear >2)

# Rescaling hours variable
AvgAge <- mutate(AvgAge, Hours = AgeHours * 100,
                 SE = AgeHoursSD * 100)

# Making a variable so error bars won't overlap
dodge = position_dodge2(width = 0.4)

# Renaming country variables
levels(AvgAge$Country) <- c("Austria", "United States")

g1<-ggplot(AvgAge) +
  geom_line(color = "black", size = 1.5, mapping =  aes(x = AgeYear, y = Hours, linetype = Gender), position = dodge)+
  geom_errorbar(aes(x = AgeYear, ymin=Hours-SE, ymax=Hours+SE), width = 0.4, position = dodge)+
  facet_wrap(~Country, nrow = 1, ncol = 2)+
  coord_cartesian(xlim = c(2.5,18),ylim = c(50,800))

g2<-g1+scale_x_continuous(name = "Age", breaks = c(3,6,9,12,15,18)) +
  scale_y_continuous(name = "Average Engagement Hours/Year", breaks = c(100,200,300,400,500,600,700,800))

g3 <- g2 + theme_bw() + 
  theme(axis.text=element_text(size=18, colour="black"), 
        axis.title=element_text(size=23,face="bold"),
        strip.text = element_text(size = 30)) +
  theme(strip.text.x = element_text(size = 23))+
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+
  theme(axis.title.y = element_text(margin = margin(t = 0, r = 20, b = 0, l = 15)))+
  theme(axis.title.x = element_text(margin = margin(t = 15, r = 0, b = 10, l = 0)))+
  theme(legend.position = c(0.9,0.1))+
  theme(legend.title = element_blank())+
  theme(legend.text = element_text(size = 21))+
  theme(plot.title = element_blank())+
  theme(panel.spacing = unit(4, "lines"))

g4 <- g3+scale_linetype_discrete(name=element_blank(),
                                 labels = c("Female",
                                            "Male"))  
plot(g4)

This plot has some extra steps for formatting the data to plot. We want to create a graph that shows the average amount of practice for each age group when grouped by gender and country. First, we removed anyone with unspecified gender. Next, we used the summarize function to create an average hours variable for each grouping. Next, we filtered out ages we don’t want displayed in the graph, while adjusting the scaling of the hours variable. Finally, we created a “dodge” variable, which would make it so error bars don’t overlap in the graph code. The “levels” argument allows the variable name to be changed to whatever we want displayed on the graph.