Week 11: Central Tendency and Descriptive Statistics in R

Readings
- Wickham Chapter 7- Exploratory Analysis
- Wickham Chapter 3 - Data Visualization
- Wickham Chapter 27 - R Markdown
- Wickham Chapter 28 - Graphics for Communication

Looking at Data

  • str() - Displays preview of internal structure of object. First few observations, variable name, Variable type
  • class() - What kind of object is it? (high-level)
  • typeof() - What is the object’s data type (low-level)
  • length() - How long is it?
  • attributes() - Does it have any metadata?

Housing Distributions

Data Creation:

# Creating vectors for the number of children, and the number of families with each type of children in two different cities
Numb.Kids <- c(1, 2, 3, 4, 5, 6, 7)
Numb.Fam1 <- c(100, 200, 300, 200, 100, 50, 50 ) # Number of families that correspond to Numb.Kids for City 1
Numb.Fam2 <- c(500, 200, 10, 10, 10, 300, 100) # Number of families that correspond to Numb.Kids for City 2

#Calculates total children from family size
Tot.Kids1 <- Numb.Kids * Numb.Fam1 
Tot.Kids2 <- Numb.Kids * Numb.Fam2

# Combines 3 vectors to make a dataframe for each City
City1 <- data.frame(Numb.Kids, Numb.Fam1, Tot.Kids1)
City2 <- data.frame(Numb.Kids, Numb.Fam2, Tot.Kids2)

# Or you could make 1 dataset that has all the vectors in it
Cities <- data.frame(Numb.Kids, Numb.Fam1, Numb.Fam2, Tot.Kids1, Tot.Kids2)

Note: grid_arrange is from the gridExtra() package.

# Create bar graph for City 1
City1Plot <- City1 %>%            # dataset variables come from
  ggplot(aes(x = Numb.Kids, 
             y = Numb.Fam1)) +    # variables used on x and y axis
  geom_col() +                    # type of graph: column
  labs(subtitle = "City 1") +     # Gives it a subtitle. Title was too big for combined graph
  ylim(0,500) +                   # sets y axis limits to 0 and 500
  theme(axis.title = element_blank()) # removes x and y axis labels

# Create bar graph for City2
City2Plot <- City2 %>% 
  ggplot(aes(x = Numb.Kids, y = Numb.Fam2)) +
  geom_col() + 
  labs(subtitle = "City 2") +
  theme(axis.title = element_blank()) 
    
# Combines them into one graph side by side
# Mostly for the ease of teaching, but it may be useful for your projects
grid.arrange(City1Plot, City2Plot, nrow=1, top = "Distribution of Family Size in Two Cities", bottom = "Number of Kids", left = "Number of Families")

Descriptive Statistics

The describe() and describeBy commands come from the psych package. describe(): summarizes multiple descriptive stats at once for numeric variables describeBy(): Summarizes descriptive statistics using a grouping variable

# library(psych) 
describe(City1, skew = FALSE)
##           vars n   mean     sd min max range     se
## Numb.Kids    1 7   4.00   2.16   1   7     6   0.82
## Numb.Fam1    2 7 142.86  93.22  50 300   250  35.23
## Tot.Kids1    3 7 478.57 282.63 100 900   800 106.82
describe(City2, skew = FALSE)
##           vars n   mean     sd min  max range     se
## Numb.Kids    1 7   4.00   2.16   1    7     6   0.82
## Numb.Fam2    2 7 161.43 186.14  10  500   490  70.35
## Tot.Kids2    3 7 502.86 629.15  30 1800  1770 237.79
# describeBy()      # Summarizes by Grouping Variable

Both have roughly the same number of families (1000 vs. 1130) and roughly the same number of kids (3350 vs. 3520).

They also both have similar average kids per family:

  • City 1: average number of kids per family = 3.35
  • City 2: Average number of kids per family = 3.1150442

If the distribution of the data is spread out, what does that imply? Why would this matter?

Distribution: Boxplots

Box plots of the number of families

Box1 <- ggplot(Cities, aes(x = Numb.Fam1) ) + 
  geom_boxplot(fill= "blue") +  
  labs(title = 'City 1')

Box2 <- ggplot(Cities, aes(x = Numb.Fam2)) + 
  geom_boxplot(fill = "purple")+  
  labs(title = 'City 2') 

grid.arrange(Box1, Box2, nrow=1)

Any issues with this? Yes, the x-axis has very different ranges.

Box1 <- ggplot(Cities, aes(x = Numb.Fam1) ) + 
  geom_boxplot(fill= "blue") +  # fills City 1 with blue
  xlim(0, 500) +    # sets 0 as min and 2000 as max for x axis
  coord_flip() +
  labs(title = 'City 1')        # gives it a label

Box2 <- ggplot(Cities, aes(x = Numb.Fam2)) + 
  geom_boxplot(fill = "purple")+  
  xlim(0, 500) + 
  coord_flip() +
  labs(title = 'City 2') 

grid.arrange(Box1, Box2, nrow=1)

grid.arrange(City1Plot, City2Plot, Box1, Box2, nrow=2, ncol = 2)


Bruce & Bruce Textbook Example

Content warning: Murder

This recreates and expands on the example in the book Practical Statistics for Data Scientists: 50 Essentail Concepts by Peter Bruce & Andrew Bruce (one of the assigned readings)

getwd() # check my working directory to make sure that R can find my file without the full file path
## [1] "C:/Users/aleaw/OneDrive - University of Illinois at Chicago/Week 11"
state <- read_csv("Bruce&Bruce_murder.csv") # bring the CSV into R as a tibble
## 
## -- Column specification --------------------------------------------------------
## cols(
##   State = col_character(),
##   Population = col_double(),
##   Murder.Rate = col_double(),
##   Abbreviation = col_character()
## )
names(state)      # tells me names of variables
## [1] "State"        "Population"   "Murder.Rate"  "Abbreviation"
str(state)        # shows preview of dataset structure
## spec_tbl_df [50 x 4] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ State       : chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" ...
##  $ Population  : num [1:50] 4779736 710231 6392017 2915918 37253956 ...
##  $ Murder.Rate : num [1:50] 5.7 5.6 4.7 5.6 4.4 2.8 2.4 5.8 5.8 5.7 ...
##  $ Abbreviation: chr [1:50] "AL" "AK" "AZ" "AR" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   State = col_character(),
##   ..   Population = col_double(),
##   ..   Murder.Rate = col_double(),
##   ..   Abbreviation = col_character()
##   .. )
summary(state)    # calculates statistics for all numeric variables
##     State             Population        Murder.Rate     Abbreviation      
##  Length:50          Min.   :  563626   Min.   : 0.900   Length:50         
##  Class :character   1st Qu.: 1833004   1st Qu.: 2.425   Class :character  
##  Mode  :character   Median : 4436370   Median : 4.000   Mode  :character  
##                     Mean   : 6162876   Mean   : 4.066                     
##                     3rd Qu.: 6680312   3rd Qu.: 5.550                     
##                     Max.   :37253956   Max.   :10.300

Calculate the mean and median population for the 50 states using the mean() and median() commands.

Remember what the mean actually is: The sum of all of populations for all 50 states, divided by the number of states. We can do this manually with this code below:

# adds up total population
totalpopulation <- sum(state$Population)

# counts how many observations there are within the Population variable (50 for the 50 states)
numberofstates <- length(state$Population) 

# calculate the average population across the 50 states
totalpopulation / numberofstates
## [1] 6162876

Could also look like this:

sum(state$Population)/length(state$Population)

Or we could just use the built in functions that do these steps for us:

mean(state$Population)

Calculating Murder Rates

The data set includes the State, the Population, and the Murder Rate (in units of murders per 100,000 people). If we wanted to calculate the average number of murders within the US, it may be tempting to calculate it the same was as we calculated the average population:

sum(state$Murder.Rate)/length(state$Murder.Rate)
## [1] 4.066

Put this number back into the context of the variable. What does it mean?

BUT remember, the Murder.Rate variable was already adjusted for population size (murders per 100,000 people per year).

We could calculate a weighted mean. This means we would would have to consider the population size while calculating the mean murder rate. There is even a command for that: weighted.mean() from the stats() package.

# install.packages("stats")
# library(stats)

# weighted.mean(thing you want the mean of, thing to use for weights)
weighted.mean(state$Murder.Rate, state$Population)
## [1] 4.445834

Now the average murder rate in the United States, when considering population size, is 4.4 per 100,000 people.

Create a variable for the number of murders within each state:

# Creates a new variable named "Count" within the State dataset
state <- state %>% 
  mutate(Count = Population/100000*Murder.Rate)

Frequency Table and Histograms

breaks <- seq(from=min(state[["Population"]]),
                to=max(state[["Population"]]), length=11)
pop_freq <- cut(state[["Population"]], breaks=breaks,
                right=TRUE, include.lowest = TRUE)
table(pop_freq)
breaks <- seq(from = min(state$Population), to = max(state$Population), length = 11)

pop_freq <- cut(state$Population, breaks = breaks,
                right = TRUE, include.lowest = TRUE)
table(pop_freq)
## pop_freq
## [5.64e+05,4.23e+06]  (4.23e+06,7.9e+06]  (7.9e+06,1.16e+07] (1.16e+07,1.52e+07] 
##                  24                  14                   6                   2 
## (1.52e+07,1.89e+07] (1.89e+07,2.26e+07] (2.26e+07,2.62e+07] (2.62e+07,2.99e+07] 
##                   1                   1                   1                   0 
## (2.99e+07,3.36e+07] (3.36e+07,3.73e+07] 
##                   0                   1

Reminder: A histogram is a wayto visualize the frequency table, with bins on the x-axis and data count on the y-axis.

  • Empty bins should be included in the graph! Otherwise you skew the axis
  • Bins are equal width
  • Number of bins and bin size is up to the user
hist(state$Population) # defaulted to 8 bins

hist(state$Population, breaks = breaks) # now has 10 bins,

hist(state$Murder.Rate, freq = FALSE) # now has 10 bins,
lines(density(state$Murder.Rate), lwd = 3, col="blue")

Key Ideas

  • A frequency histogram plots frequency counts on the y-axis and variable values on the x-axis; it gives a sense of the distribution of the data at a glance.

  • A frequency table is a tabular version of the frequency counts found in a histogram.

  • A boxplot—with the top and bottom of the box at the 75th and 25th percentiles, respectively—also gives a quick sense of the distribution of the data; it is often used in side-by-side displays to compare distributions.

  • A density plot is a smoothed version of a histogram; it requires a function to estimate a plot based on the data (multiple estimates are possible, of course).

Standard Deviation

Remember, standard deviation is easier to interpret than variance because it is on the same scale as the original data. Standard Deviation is sensitive to outliers

sd(state$Population)
## [1] 6848235
IQR(state$Population)
## [1] 4847308
#    ?quantile # what are the potential options for the quantile command?

quantile(____): quantile(x, probs = seq(0, 1, 0.25), na.rm = FALSE, names = TRUE, type = 7, ...)

The default quantiles are to calculate the 0% (min), 25%, 50% (median), 75%, and 100% (max)

quantile(state$Population)
##       0%      25%      50%      75%     100% 
##   563626  1833004  4436370  6680312 37253956
# change the qunatiles to calculate the bottom 5% and top 95% instead of 25% and 75%
quantile(state$Population, c(0.05, .25, .5, .95, .95))
##       5%      25%      50%      95%      95% 
##   689529  1833004  4436370 19118546 19118546

boxplot() is from the graphics() package. Top and bottom of the box are the 75th and 25th percnetiles. Median is horizontal line. The dashed lines are the whiskers and indicate the range for the majority of the data. For this command, the whiskers extend to the farthest point beyond the box up until the 1.5 times the interquartile range (IQR). The IQR of the population was 4.847308^{6} so the whiskers go until + or - 1.5(4.847308^{6}), or +/- 7.270962^{6}.

boxplot(state$Population/1000000, 
        ylab = "Population (millions")

Boxplots are useful summaries, but hide the shape of the distribution. If there was a bimodal distribution, it would not be visible with a boxplot. Another graphing alternative to the boxplot is a violin plot, where the shape (of the density of points) is drawn. Use geom_violin() for violin plots.

state %>% ggplot(aes(x=Population)) + 
  geom_boxplot() # boxplot

state %>% ggplot(aes(x=Population)) + 
  geom_density() # smoothed distribution

state %>% ggplot(aes(y=Population)) +
  geom_boxplot()

p <- state %>% ggplot(aes(y=Population)) +
  geom_boxplot() 

p    # view the basic boxplot

ggplot(state, aes( x = Population, y = Population)) +
  geom_boxplot() +
  #coord_flip() +
  geom_jitter(shape = 16, position = position_jitter(0.2)) +
  stat_summary(fun.y = mean, geom = "point", shape=23, size = 4) 
## Warning: `fun.y` is deprecated. Use `fun` instead.
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?

Correlation

Correlation is the extent that two continuous (interval or ratio level) variables are related. A relationship exists when knowing the value of one variavble is useful for predicting the value of a second variable.

Correlation Coefficient: symmetric, scale-invariant measure of association between two variables

  • Ranges from -1 to +1.
  • Strength of Correlation: 0 means no correlation, -1 and +1 imply perfect correlation (either one increases as the other decreases [-1] or they move in the same direction[+1])
  • assumes there is a normal distribution (dangerous assumption)
  • cor() can be used to compute correlation between two or more vectors
corcoef <- cor(state$Population, state$Murder.Rate)
corcoef
## [1] 0.1820693
coef_text <- paste("Correlation Coefficient:", round(cor(state$Population, state$Murder.Rate), digits = 3))
coef_text
## [1] "Correlation Coefficient: 0.182"

Interpretation?

Link with tons of examples of ways to customize scatter plots.

ggplot(state, aes(Population, Murder.Rate, label= Abbreviation)) +
  geom_point() +
  geom_smooth(method=lm, se=FALSE) + 
 annotate(x = 20000000, y = 8.5, geom = "text", label = coef_text) +
  theme(panel.background = element_blank())
## `geom_smooth()` using formula 'y ~ x'

mtcars Examples

# Factor
#install.packages(dataset)
library(datasets)
data("mtcars")
mtcars
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
str(mtcars) # see the structure of the data
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
cor(mtcars$mpg , mtcars$hp)
## [1] -0.7761684
cor(mtcars$disp , mtcars$hp)
## [1] 0.7909486
# One Factor & One Numeric
mtcars %>% group_by(cyl) %>% summarise(avg = mean(mpg), 
                                       median = median(mpg), 
                                       std = sd(mpg))
## # A tibble: 3 x 4
##     cyl   avg median   std
##   <dbl> <dbl>  <dbl> <dbl>
## 1     4  26.7   26    4.51
## 2     6  19.7   19.7  1.45
## 3     8  15.1   15.2  2.56

Bar chart vs col chart

Some geoms plot the data directly (e.g., geom point takes the values of our X and Y variables and plots points corresponding to those values directly on the grid). Other geoms do some work behind the scenes. Each of the geom_ functions we can call in ggplot has an associated stat_ function that it uses by default and for which you can change.

For bar and column charts, we often want categorical data; so let’s use the mpg data and look at an example of a bar chart.

data(mpg)
#This dataset contains a subset of the fuel economy data that the EPA makes available on
#http://fueleconomy.gov. It contains only models which had a new release every year between 1999 
#and 2008 - this was used as a proxy for the popularity of the car.

ggplot(data = mpg, mapping = aes( x = class)) + geom_bar()

Note, we did not pass to our aes() argument a variable for Y. That is because the default stat_ function geom_bar() is stat_count().
It automatically counted the number of observations for each of category of car class. If there were NA values in that variable, an NA category would be included in the bar chart by default. If you wanted to drop the NA, you could do the following:

ggplot(data = remove_missing(mpg), mapping = aes( x = class)) + geom_bar()

What if we try to create this same plot with geom_col()

ggplot(data = mpg, mapping = aes( x = class)) + geom_col()

Now we get an error that we are required to provide a Y variable in the aesthetic mappings.

Why?

Well the default stat_ for geom_col is stat_identity: it leaves the data as is and therefore you need to supply both an X and Y value. Consequently, with geom_col, the height of the bars rendered on the plot correspond to the values in the data.

What if you wanted to plot the average highway miles for each car class? You might first attempt the following:

ggplot(data = mpg, mapping = aes( x = class, y = hwy)) + geom_col()

This is clearly not what you intended. ggplot is doing something reasonable here and is summing up all of the hwy miles per gallon for each class and plotting the total. You can see this by calculating those totals for yourself.

mpg %>% group_by(class) %>%
  summarize(hwy.tot = sum(hwy))
## # A tibble: 7 x 2
##   class      hwy.tot
##   <chr>        <int>
## 1 2seater        124
## 2 compact       1330
## 3 midsize       1119
## 4 minivan        246
## 5 pickup         557
## 6 subcompact     985
## 7 suv           1124

So what do we do if you want to show the average. Well, we can turn to the dplyr verbs we learned and preprocess our data. In other words, we can create a new dataset (or tibble) that contains the exact values we want to plot.

mpg.avg <- mpg %>%
  group_by(class) %>%
  summarize(avg.hwy = mean(hwy))

ggplot(data = mpg.avg, mapping = aes(x = class, y = avg.hwy)) + geom_col()

Remember - geom_bar() was designed to make it easy to create bar charts that show the counts of some different categories in your data. For geom_col() on the other hand, it shows the actual values of the variables in the data. > With geom_col(), you often wrangle or summarize your data prior to plotting.

Color vs fill (especially for bar/column charts)

These two aesthetic mappings can be confusing at first. Compare the following two plots:

ggplot(data = mpg, mapping = aes( x = class, color = class)) + geom_bar()

ggplot(data = mpg, mapping = aes( x = class, fill = class)) + geom_bar()

Here we can see that fill defines the color by which the geom will be filled. On the other hand, color defines the outline of the geom.

A legend shows up by default. This is redundant information in these plots. We can remove it with the following code:

ggplot(data = mpg, mapping = aes( x = class, fill = class)) + geom_bar() +
  guides(fill = FALSE) 

Note, that points, generally, only have color and no fill. There are shapes that you can use for your points that do allow you to pass both a color and fill and therefore have one color for the primary shape and one color for the outline.

Discrete vs continuous colors

R looks at the type of vector you are passing to color and if it is a discrete scale, like a factor, it will apply a discrete color scheme. If it is a continuous variable then it will apply a color gradient.

ggplot(data = fatal,
       mapping = aes(x = unemp, y = income,
                     color = state)) + 
  geom_point()

Maybe not a good idea to use a variable with 50 categories, but you get the point.

Now if you were to give it a numeric value, like year, you will get a color gradient (moving from dark to light)

ggplot(data = fatal,
       mapping = aes(x = unemp, y = income,
                     color = year)) + 
  geom_point()

Different Color Palettes

Faceting your data

It is much easier for your eyes and minds to make sense of data that has been faceted. Faceting is a great option to avoid plots with too much information on them. The subplots produced by faceting have the same axes and so it make it easier for us to compare across the faceted categories. Let’s look at an example. Let’s begin with a plot from above:

ggplot(data = fatal,
       mapping = aes(x = unemp, y = income,
                     color = factor(year)) ) + 
  scale_color_brewer(palette = 'Set1') +
  geom_point()

Very difficult to see what is going on here or to make sense of the different years.

Instead of having the year just take on a color value in a single plot, we could also facet by year.

ggplot(data = fatal,
       mapping = aes(x = unemp, y = income,
                     color = factor(year)) ) + 
  scale_color_brewer(palette = 'Set1') +
  geom_point() +
  facet_wrap(~year)

#obviously, we could drop the coloring by year here...though I kind of like it 
#but I would drop the legend. You can facet by more than one variable using facet_grid

Legends and Titles

Let’s start with looking at our legends. There are two ways to adjust the legend title and associated labels. The first is outside of your ggplot call. You can change the data frame itself so the variable name and categories/factors have the names you desire. The other way, which we will look at here, is to do the adjustments inside the ggplot call. The legend serves as a guide to the viewer of the figure to understand how fill, color, linetype, shape and other aesthetics have been applied to the plot.

Let’s work with our mpg data to do this.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) + geom_point()

#note displ is engine displacement in liters

To change the title of your legend, you can specify it using the labs function. We will use this again below when we want to make changes to our main title and axis labels. For now, let’s just change the legend title. Because the legend is associated with the mapping ‘color’, we indicate in the labs that we want the mapping color to have the title of “Type of Car”. In other words, this is saying, the label of the thing we mapped to the aesthetic color should “Type of Car”.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) + geom_point() +
  labs(color = "Type of Car")

You can also change the legend position of your plot using the theme() function.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) + geom_point() +
  labs(color = "Type of Car") + 
  theme(legend.position="bottom")

#options for the legend position are: left, top, right, bottom
#you can also pass legend.position a numeric vector (x and y coordinates)
#to put the legend wherever you want, even inside the plotting area. 

Let’s now take a look at titles.

To change our titles we can use the same labs function and indicate the main title for the plot, the subtitle for the plot, the x and y axis labels, a caption to indicate source.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) + geom_point() +
  labs(title = "Highway Miles per Gallon for Different Engine Displacements",
       subtitle = "Popular Cars from 1999 to 2009",
       x = "Engine Displacement in Liters",
       y = "Highway Miles per Gallon", 
       caption = "Data Source: mpg data",
       color = "Type of Car")

The mappings that we choose (such as color, fill, shape, and size) all have scales that allow us to adjust how they are presented. Scales for mappings may have labels, axis tick marks, specific colors, etc…and if we want to adjust them we need to use the appropriate scale_ function. Each scale function has two components to it (resulting in a huge number of different scale functions). For each scale you need to identify the particular mapping you are referring to (x, y, color, fill, shape, size) as well as the kind of mapping that is occurring (continuous, discrete, manual).

Because we mapped color to a discrete (and not continuous) variable, the appropriate scale_ function to use is scale_color_discrete(). This function will allow us to change the color values associated with each discrete category of class, change the title or name of the legend, and change the labels. For instance, let’s say we wanted to change the legend name to “Car Type” and also capitalize each of the car types, we could do the following.

#drop the labs call to simplify the code here

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) + geom_point() +
  scale_color_discrete(name = "Car Type",
                       labels = c("Two-Seater", "Compact", "Midsize", "Minivan", 
                                  "Pickup", "Subcompact", "SUV"))

#need to make sure the labels are in the exact order.  There is no warning if you reorder them and 
#then your plot is showing the wrong color for a given class.
#Below, I switch SUV and compact. Which you can see results in the wrong plot. 

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) + geom_point() +
  scale_color_discrete(name = "Car Type",
                       labels = c("Two-Seater", "SUV", "Midsize", "Minivan", 
                                  "Pickup", "Subcompact", "Compact"))

#One way to avoid this is to always set the breaks argument in your scale call. As long as
#you match your labels to where you set your breaks, you will always map the right
#color to the right category. As you can see below, I don't set the breaks alphabetically
#as they would normally be set, but the results are fine because the labels correspond
#to the correct break points.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) + geom_point() +
  scale_color_discrete(name = "Car Type",
                       breaks = c("2seater", "suv", "midsize", "minivan", "pickup",
                                  "subcompact", "compact"),
                       labels = c("Two-Seater", "SUV", "Midsize", "Minivan", 
                                  "Pickup", "Subcompact", "Compact"))

In the same way we adjusted our color mapping, we can adjust other mappings, in this case to our x or y axis or both. Let’s adjust the y-axis so we have tick marks for every 5 miles per gallon. Now let’s think again of our mapping and our type. Because we are dealing with your mapping to the y axis and that mapping is continuous variable, we are going to call scale_y_continuous(). Compare this to what we called our color aesthetic, there we needed scale_color_discrete().

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) + geom_point() +
  scale_color_discrete(name = "Car Type",
                       labels = c("Two-Seater", "Compact", "Midsize", "Minivan", 
                                  "Pickup", "Subcompact", "SUV")) +
  scale_y_continuous(breaks = seq(0,50, by = 5))

Notice here that the y axis is truncated at 15 and above 45, even though we set our breaks to run in intervals of 5 from zero to 50. This is because there are not cars with hwy values below 10 or above 45 and thus the graph is truncated to avoid a lot of empty space. If you want to show that space, you need to tell R what the limits of the y axis should be.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) + geom_point() +
  scale_color_discrete(name = "Car Type",
                       labels = c("Two-Seater", "Compact", "Midsize", "Minivan", 
                                  "Pickup", "Subcompact", "SUV")) +
  scale_y_continuous(limits = c(0,50), breaks = seq(0,50, by = 5))

Not that this is a good idea for this plot, but you can choose your own labels for the axis tick marks just as you chose your own labels for the legend. This time let’s adjust the x axis. The scale we need should make sense…it is scale_x_continuous()

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) + geom_point() +
  scale_color_discrete(name = "Car Type",
                       labels = c("Two-Seater", "Compact", "Midsize", "Minivan", 
                                  "Pickup", "Subcompact", "SUV")) +
  scale_y_continuous(limits = c(0,50), breaks = seq(0,50, by = 5)) +
  scale_x_continuous(breaks = c(2,4,6), labels = c("two", "four", "six"))

You may want to make additional adjustments now, including changing the font size, text color, or other aspects of legend and text to improve your figure. These changes are done through settings in the theme. Before we do that though, let me first mention that there are different themes that make adjustments to the entire look of your plot. So for instance if we wanted a Wall Street Journal theme to our plot we could do the following:

#bring in the ggthemes library
library(ggthemes)

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) + geom_point() +
  scale_color_discrete(name = "Car Type",
                       labels = c("Two-Seater", "Compact", "Midsize", "Minivan", 
                                  "Pickup", "Subcompact", "SUV")) +
  theme_wsj()

#Or the Economist

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) + geom_point() +
  scale_color_discrete(name = "Car Type",
                       labels = c("Two-Seater", "Compact", "Midsize", "Minivan", 
                                  "Pickup", "Subcompact", "SUV")) +
  theme_economist()

#there are a large number of other themes you can set

Whether you are using a specific theme, like the Economist or other, there may be features for different elements that you still want to adjust (such as font size, color, etc…). Those tweaks are also done in through themes.

#here I will use the default theme and bring back our labs so we have titles and
#labels that we can change. Let's make something real ugly to show how this works.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) + geom_point() +
  scale_color_discrete(name = "Car Type",
                       labels = c("Two-Seater", "Compact", "Midsize", "Minivan", 
                                  "Pickup", "Subcompact", "SUV")) +
  labs(title = "Highway Miles per Gallon for Different Engine Displacements",
       subtitle = "Popular Cars from 1999 to 2009",
       x = "Engine Displacement in Liters",
       y = "Highway Miles per Gallon", 
       caption = "Data Source: mpg data") + 
  theme(plot.title = element_text(size = 15, color = "purple"),
        axis.title.x = element_text(size = 15, color = "green"),
        axis.text.x = element_text(face = "italic"),
        axis.text.y = element_text(size = 10, color = "blue"),
        legend.title = element_text(size = 12, color = "gray"))

#side note, since I set the legend title in the scale_color_discrete()
#call, I didn't need to do it again in labs()

If you want to drop any of these elements, you can do it through the theme as well. Here below we remove the x axis title.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) + geom_point() +
  scale_color_discrete(name = "Car Type",
                       labels = c("Two-Seater", "Compact", "Midsize", "Minivan", 
                                  "Pickup", "Subcompact", "SUV")) +
  labs(title = "Highway Miles per Gallon for Different Engine Displacements",
       subtitle = "Popular Cars from 1999 to 2009",
       x = "Engine Displacement in Liters",
       y = "Highway Miles per Gallon", 
       caption = "Data Source: mpg data",
       color = "Type of Car") + 
  theme(plot.title = element_text(size = 15, color = "purple"),
        axis.title.x = element_blank(),
        axis.text.x = element_text(face = "italic"),
        axis.text.y = element_text(size = 10, color = "blue"),
        legend.title = element_text(size = 12, color = "gray"))

What are the differences between the scale_ functions, the guides() function, and the theme() function? When do you know to use one rather than the other?

Every aesthetic mapping has a scale. If you want to adjust how that scale is marked or graduated, then you use a scale_ function.

Many scales come with a legend or key to help the reader interpret the graph. These are called guides. You can make adjustments to them with the guides() function. Perhaps the most common use case is to make the legend disappear, as it is sometimes superfluous. Another is to adjust the arrangement of the key in legends and colorbars.

Graphs have other features not strictly connected to the logical structure of the data being displayed. These include things like their background color, the typeface used for labels, or the placement of the legend on the graph. To adjust these, use the theme() function.

Consistent with ggplot’s overall approach, adjusting some visible feature of the graph means first thinking about the relationship that the feature has with the underlying data. Roughly speaking, if the change you want to make will affect the substantive interpretation of any particular geom, then most likely you will either be mapping an aesthetic to a variable using that geom’s aes() function, or you will be specifying a change via some scale_ function. If the change you want to make does not affect the interpretation of a given geom_, then most likely you will either be setting a variable inside the geom_ function, or making a cosmetic change via the theme() function."

More advanced graphing examples

Scatterplots w/ Density Distribution

set.seed(1234)
x <- c(rnorm(500, mean = -1), rnorm(500, mean = 1.5))
y <- c(rnorm(500, mean = 1), rnorm(500, mean = 1.7))
group <- as.factor(rep(c(1,2), each=500))
df <- data.frame(x, y, group)
head(df)
##             x          y group
## 1 -2.20706575 -0.2053334     1
## 2 -0.72257076  1.3014667     1
## 3  0.08444118 -0.5391452     1
## 4 -3.34569770  1.6353707     1
## 5 -0.57087531  1.7029518     1
## 6 -0.49394411 -0.9058829     1
# scatter plot of x and y variables
# color by groups
scatterPlot <- ggplot(df,aes(x, y, color=group)) + 
  geom_point() + 
  scale_color_manual(values = c('#999999','#E69F00')) + 
  theme(legend.position=c(0,1), legend.justification=c(0,1))

scatterPlot

# Marginal density plot of x (top panel)
xdensity <- ggplot(df, aes(x, fill=group)) + 
  geom_density(alpha=.5) + 
  scale_fill_manual(values = c('#999999','#E69F00')) + 
  theme(legend.position = "none")
xdensity

# Marginal density plot of y (right panel)
ydensity <- ggplot(df, aes(y, fill=group)) + 
  geom_density(alpha=.5) + 
  scale_fill_manual(values = c('#999999','#E69F00')) + 
  theme(legend.position = "none")
ydensity

blankPlot <- ggplot()+geom_blank(aes(1,1))+
  theme(plot.background = element_blank(), 
   panel.grid.major = element_blank(),
   panel.grid.minor = element_blank(), 
   panel.border = element_blank(),
   panel.background = element_blank(),
   axis.title.x = element_blank(),
   axis.title.y = element_blank(),
   axis.text.x = element_blank(), 
   axis.text.y = element_blank(),
   axis.ticks = element_blank()
     )

#install.packages("gridExtra)
library("gridExtra")
grid.arrange(xdensity, blankPlot, scatterPlot, ydensity, 
        ncol=2, nrow=2, widths=c(4, 1.4), heights=c(1.4, 4))

# Load
# install.packages("ggExtra")
library("ggExtra")
## Warning: package 'ggExtra' was built under R version 4.0.5
# Create some data
set.seed(1234)
x <- c(rnorm(500, mean = -1), rnorm(500, mean = 1.5))
y <- c(rnorm(500, mean = 1), rnorm(500, mean = 1.7))
df3 <- data.frame(x, y)
# Scatter plot of x and y variables and color by groups
sp2 <- ggplot(df3,aes(x, y)) + geom_point()
# Marginal density plot
ggMarginal(sp2 + theme_gray())

# Marginal histogram plot
ggMarginal(sp2 + theme_gray(), type = "histogram",
           fill = "steelblue", col = "darkblue")

# Create data 
my_variable=c(rnorm(1000 , 0 , 2) , rnorm(1000 , 9 , 2))
 
# Layout to split the screen
layout(mat = matrix(c(1,2),2,1, byrow=TRUE),  height = c(1,8))
 
# Draw the boxplot and the histogram 
par(mar=c(0, 3.1, 1.1, 2.1))
boxplot(my_variable , horizontal=TRUE , ylim=c(-10,20), xaxt="n" , col=rgb(0.8,0.8,0,0.5) , frame=F)
par(mar=c(4, 3.1, 1.1, 2.1))
hist(my_variable , breaks=40 , col=rgb(0.2,0.8,0.5,0.5) , border=F , main="" , xlab="value of the variable", xlim=c(-10,20))

Graphing Bivariate Data

Bivariate data is used to find out if there is a relationship between two different variables. It is frequently represented with scatter plots where one variable is on the X axis and the other is on the Y axis. If the data seems to fit a line or curve then there may be a relationship, or correlation, between the two variables. Always be careful when examining relationships. Many variables may appear related when in fact their relationship happened by chance or a third variable is influencing both variables.

Scatterplots

Test of Association: Used to test whether two variables are related to one another or not

  • Depends on the nature of the variables you are studying (nominal, ordinal, interval) and number of categories
  • Depends on the nature of the question you’re answering (independence, agreement of coders, effect of an intervention, etc.)

Questions to ask yourself:

  • Could this pattern happen by chance?
  • How do you describe the relationship implied by the pattern?
  • How strong is the relationship implied by the pattern?
  • What other variables might affect the relationship?
  • Does the relationship change if you look at individual subgroups of the data?

“Variation creates uncertainty but covariation reduces it.” - pg 106