Overview

Univariate Distribution
Bivariate Distribution
Plot Customization
dplyr Package

Univariate Distribution

Bar plot

Bar plots are used to plot discrete (integer) or categorical variables.

# Create a table
survived <- table(titanic$Survived)

# Input the table as an argument to barplot()
barplot(survived)

# To change the order of the bars, convert to factor and order the levels as you want them to appear on the plot
titanic$Survived_labels <- factor(titanic$Survived, levels = c("1", "0"), labels = c("Survived", "Died" )) # Using labels to get a nicer plot

# Plot 
survived <- table(titanic$Survived_labels)
barplot(survived)

Question 1

How would you change the code to get proportions instead of frequencies on the y-axis?

Histogram

Histograms are used to plot continuous (“numeric” in R) variables.

# With frequencies
hist(titanic$Age)

# With density
hist(titanic$Age, freq = F) # setting the argument freq to FALSE gives us the density

# With proportions
# Assign the histogram to an object
hist_age <- hist(titanic$Age, plot = F) # setting the plot argument to FALSE to avoid plotting
hist_age # Data is stored in this object

## $breaks
## [1]  0 10 20 30 40 50 60 70 80
## 
## $counts
## [1]  64 115 230 155  86  42  17   5
## 
## $density
## [1] 0.0089635854 0.0161064426 0.0322128852 0.0217086835 0.0120448179
## [6] 0.0058823529 0.0023809524 0.0007002801
## 
## $mids
## [1]  5 15 25 35 45 55 65 75
## 
## $xname
## [1] "titanic$Age"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"

# Change the density attribute of the histogram to proportions
hist_age$density <- hist_age$counts/sum(hist_age$counts)

# Plot the modified histogram object
plot(hist_age, freq = F) # Note that the label of the y-axis is misleading. The Plot Customization section shows how to change it.

Density Plot

Density plots are also used to plot continuous variables. The base R density() function returns a kernel estimate of the probability density. Because the curve is a probability density, the total area under it equals 1.

# Plotting the probability distribution of age
plot(density(titanic$Age, na.rm =T)) # setting the na.rm argument to TRUE

# Add vertical lines to indicate the average and median age
abline(v=mean(titanic$Age, na.rm =T), col="red") # setting the colour to red
abline(v=median(titanic$Age, na.rm =T), col="blue")
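The claim that the area under the curve equals 1 can be checked numerically. The sketch below uses a small made-up vector of ages (not the titanic data) and approximates the area with the trapezoidal rule. Note that the individual heights of the curve are densities, not probabilities.

```r
# Toy ages (made-up values, including one missing value)
ages <- c(22, 38, 26, 35, NA, 54, 2, 27, 14, 4, 58, 20)

# Kernel density estimate; na.rm = TRUE drops the missing value
dens <- density(ages, na.rm = TRUE)

# Trapezoidal rule: sum of (interval width * average height) over the grid
area <- sum(diff(dens$x) * (head(dens$y, -1) + tail(dens$y, -1)) / 2)
area # close to 1, up to numerical error
```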

Question 2

Is the age distribution skewed? If yes, what type of skew?

Box Plot

Box plots are used to plot continuous variables and include information about the quartiles and the outliers. They allow a visualization of the interquartile range. The figure below shows how to interpret a box plot.

Source: https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51

Outliers are defined as observations that fall more than 1.5 times the interquartile range below the lower quartile or above the upper quartile.

boxplot(titanic$Age)
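The outlier rule can be reproduced by hand. This is a minimal sketch on made-up ages, using quantile() for the quartiles. boxplot() computes its box edges (hinges) slightly differently, so the limits it uses may not match these exactly on every dataset.

```r
# Toy ages (made-up values)
ages <- c(2, 18, 22, 25, 28, 30, 34, 38, 45, 80)

# Lower and upper quartiles, and the interquartile range
q1 <- quantile(ages, 0.25, names = FALSE)
q3 <- quantile(ages, 0.75, names = FALSE)
iqr <- q3 - q1

# Limits beyond which an observation is flagged as an outlier
lower_limit <- q1 - 1.5 * iqr
upper_limit <- q3 + 1.5 * iqr

# Observations outside the limits
outliers <- ages[ages < lower_limit | ages > upper_limit]
outliers
```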

Bivariate Distribution

Box Plot

Box plots can also be used to visualize the relationship between a categorical and a continuous variable.

boxplot(Fare ~ Pclass, data = titanic) # using the formula notation where ~ means "is a function of"

Bar plot

Bar plots can be used to visualize a relationship between two categorical variables. They are basically a visual representation of a two-way contingency table.

# Create a contingency table
survived_sex <- table(titanic$Survived_labels, titanic$Sex)

# Plot
barplot(survived_sex , legend.text = T, args.legend = list(x = "topleft")) # Adding a legend to make the plot legible and changing its position to make the plot nicer

Scatterplot

Scatterplots are used to plot the bivariate distribution of two continuous or discrete variables.

plot(titanic$Age, titanic$Fare)

Question 3

Is there a relationship between age and fare?

Plot Customization

When presenting the results of a data analysis, graphs should be self-explanatory, in the sense that all the information needed to understand the graph is included in the graph itself, and legible.

Making a plot legible means including a legend when needed, including a meaningful title, changing the axis labels to use words instead of the variable names, changing the axis range to cover the range of the data, etc.

The arguments and the axis() function shown below work with any base R plot, including all the plot functions in this tutorial.

Axis Labels

plot(titanic$Age, titanic$Fare, xlab = "Age", ylab = "Fare") # Using the xlab and ylab arguments

Title

# Change the title
plot(titanic$Age, titanic$Fare, xlab = "Age", ylab = "Fare", main = "Bivariate Distribution of Age and Fare Among Passengers of the Titanic") # Using the main argument.

# The title should name the variables that are plotted and the observations on which they are measured.

plot(titanic$Age, titanic$Fare, xlab = "Age", ylab = "Fare", main = "Bivariate Distribution of Age and Fare\nAmong Passengers of the Titanic") # Use \n to create a new line when the title is too long

Axis Limits and Tick Intervals

# Base plots with arguments set to their default are often flawed
barplot(survived) # the y-axis does not cover the full range of the data (does not include the max frequency of fatalities)

survived # the max frequency of fatalities is 549

## 
## Survived     Died 
##      342      549

# Set the limits of the axis by using xlim or ylim
barplot(survived, xlab = "Survival", ylab= "Frequency", ylim = c(0,550), yaxt="n") # Delete the axis whose ticks you want to modify by setting xaxt or yaxt to "n"

# Create the new axis with the desired tick interval
axis(side=2, at = seq(from=0, to=550, by=50)) # The index of axis sides increases clockwise starting from the bottom (1 = bottom, 2 = left, 3 = top, 4 = right). We set the lower limit of the axis to 0, and the upper limit to 550. We set the tick interval to 50.
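The same arguments fix the misleading y-axis label of the proportion histogram from the Histogram section. The sketch below uses made-up ages instead of titanic$Age. The labels and title are only illustrative.

```r
# Toy ages (made-up values)
ages <- c(22, 38, 26, 35, 54, 2, 27, 14, 4, 58, 20, 39)

# Store the histogram without plotting it
hist_toy <- hist(ages, plot = FALSE)

# Replace the densities with proportions (counts divided by the total)
hist_toy$density <- hist_toy$counts / sum(hist_toy$counts)

# Plot with an accurate y-axis label
plot(hist_toy, freq = FALSE, xlab = "Age", ylab = "Proportion",
     main = "Distribution of Age (toy data)")
```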

dplyr Package

dplyr is a package that contains functions to manage data and the pipe operator (%>%). The pipe is used to chain functions in a single command and avoid creating an object to store the result at each step. The pipe passes the result of one function as the first argument of the next function.

# First, load the package
library(dplyr) # Load the package once per R session. You need to install it first (only once) using install.packages("dplyr")

Subset Columns

Let’s say we want a list of the passengers’ names and their class.

select(titanic, Name, Pclass) # dataframe is the first argument, then the columns

##                                                  Name Pclass
## 1                             Braund, Mr. Owen Harris      3
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer)      1
## 3                              Heikkinen, Miss. Laina      3
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel)      1
## 5                            Allen, Mr. William Henry      3
## 6                                    Moran, Mr. James      3

Filter Rows

Let’s say we are only interested in first class passengers.

filter(titanic, Pclass == 1)

##   PassengerId Survived Pclass
## 1           2        1      1
## 2           4        1      1
## 3           7        0      1
## 4          12        1      1
## 5          24        1      1
## 6          28        0      1
##                                                  Name    Sex Age
## 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38
## 2        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35
## 3                             McCarthy, Mr. Timothy J   male  54
## 4                            Bonnell, Miss. Elizabeth female  58
## 5                        Sloper, Mr. William Thompson   male  28
## 6                      Fortune, Mr. Charles Alexander   male  19
##   Sibling_spouse Parent_children     Fare Survived_dicho Family Sex_dicho
## 1              1               0  71.2833              1      1         1
## 2              1               0  53.1000              1      1         1
## 3              0               0  51.8625              0      0         0
## 4              0               0  26.5500              1      0         1
## 5              0               0  35.5000              1      0         0
## 6              3               2 263.0000              0      5         0
##   Survived_labels
## 1        Survived
## 2        Survived
## 3            Died
## 4        Survived
## 5        Survived
## 6            Died

The Pipe Operator

Oops! All the variables are displayed! We only want the name and the class.

We can use the pipe operator to combine select() and filter() into one command.

titanic %>% # passes the dataframe to subsequent functions
  select(Name, Pclass)  %>% # we don't need to input the dataframe again
  filter(Pclass == 1)

##                                                  Name Pclass
## 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer)      1
## 2        Futrelle, Mrs. Jacques Heath (Lily May Peel)      1
## 3                             McCarthy, Mr. Timothy J      1
## 4                            Bonnell, Miss. Elizabeth      1
## 5                        Sloper, Mr. William Thompson      1
## 6                      Fortune, Mr. Charles Alexander      1
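To see what the pipe does, it can help to compare it with the equivalent nested call. The sketch below uses a small made-up data frame (toy) rather than the titanic data and assumes dplyr is installed.

```r
library(dplyr)

# Toy data frame (made-up values)
toy <- data.frame(Name   = c("A", "B", "C", "D"),
                  Pclass = c(1, 3, 1, 2),
                  Fare   = c(71.3, 7.9, 53.1, 13.0))

# Without the pipe: functions are nested and read from the inside out
nested <- filter(select(toy, Name, Pclass), Pclass == 1)

# With the pipe: the same steps, read from top to bottom
piped <- toy %>%
  select(Name, Pclass) %>%
  filter(Pclass == 1)

identical(nested, piped) # both approaches give the same result
```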

Exercise 1

Use the pipe operator to return a list of the passengers’ ID and their fare, for passengers in second and third class.

Create a New Variable

Let’s say we want to create a new age variable with months as a unit.

# Use the mutate function
titanic <- titanic %>%  # Assign the new dataframe to an object
           mutate(Age_month = Age * 12) # Age_month is the name of the new variable. Here we created a new variable by applying an arithmetic operation to an existing variable, but we could also use built-in functions

Sort Data

titanic <- titanic %>%  
           arrange(Age_month) # Sorts in ascending order. To sort in descending order, wrap the variable in desc(): arrange(desc(Age_month))
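A short sketch of a descending sort, on a made-up data frame (toy) and assuming dplyr is installed. desc() is a function applied to the variable inside arrange().

```r
library(dplyr)

# Toy data frame (made-up values)
toy <- data.frame(Name      = c("A", "B", "C"),
                  Age_month = c(264, 24, 648))

# Sort from oldest to youngest
toy_desc <- toy %>%
  arrange(desc(Age_month))

toy_desc$Age_month
```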

Summarize Data

Sometimes we need to summarize data. Here, “summarizing” means aggregating the data at a higher level (group) using some function (count, mean, etc.).

Let’s say we want to summarize the data by passenger class.

# Count number of passengers in each class
titanic %>%
  group_by(Pclass) %>% # group by class
  summarize(n()) # Use the summarize() and n() functions to count

## # A tibble: 3 x 2
##   Pclass `n()`
##    <int> <int>
## 1      1   216
## 2      2   184
## 3      3   491

# Compute the average fare by class
titanic %>%
  group_by(Pclass) %>%
  summarize(mean(Fare))

## # A tibble: 3 x 2
##   Pclass `mean(Fare)`
##    <int>        <dbl>
## 1      1         84.2
## 2      2         20.7
## 3      3         13.7

# Create a new dataframe, with fare aggregated by class
titanic_fare_avg <- titanic %>% # Assign the new dataframe to a new name
                      group_by(Pclass) %>% # Note that group_by() can be combined with any other function, including mutate()
                      summarize(Fare_avg = mean(Fare)) # name the variable containing the average
titanic_fare_avg

## # A tibble: 3 x 2
##   Pclass Fare_avg
##    <int>    <dbl>
## 1      1     84.2
## 2      2     20.7
## 3      3     13.7
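As noted in the comment above, group_by() can also be combined with mutate(). In that case the data are not aggregated: each row keeps its place and receives the value computed for its group. This is a minimal sketch on a made-up data frame (toy), assuming dplyr is installed.

```r
library(dplyr)

# Toy data frame (made-up values)
toy <- data.frame(Pclass = c(1, 1, 2, 3, 3),
                  Fare   = c(80, 60, 20, 10, 14))

# Add the class average to every row instead of collapsing to one row per class
toy_with_avg <- toy %>%
  group_by(Pclass) %>%
  mutate(Fare_avg = mean(Fare)) %>%
  ungroup() # remove the grouping so later steps apply to the whole dataframe

toy_with_avg
```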

This work by Sarah Lachance is licensed under CC BY-NC-ND 4.0

Data Visualization & dplyr

Sarah Lachance

2021-07-30
