The “tidyverse” Package

“tidyverse” is a meta-package: loading it also loads 8 other packages, including “dplyr” and “ggplot2”.

The “dplyr” package is for data wrangling. It provides a consistent set of verbs that help you solve the most common data manipulation challenges:

  • mutate() adds new variables that are functions of existing variables.

  • select() picks variables (i.e. columns) based on their names.

  • filter() picks cases (i.e. rows) based on their values.

  • summarise() reduces multiple values down to a single summary.

  • arrange() changes the ordering of the rows.

The summarise() function is usually applied after grouping the rows with the group_by() function, since a summary is typically computed “by group”. Without grouping, it reduces the whole data frame to a single row.
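
The sections below demonstrate arrange(), mutate(), group_by() and summarise(), but not select() or filter(). The following is a minimal sketch of those two verbs on the built-in mtcars data; the object names small and efficient are made up for illustration.

```r
library(dplyr)

# select(): keep only three columns (variables) of mtcars
small <- select(mtcars, mpg, cyl, wt)

# filter(): keep only the rows (cases) with 4 cylinders and mpg above 30
efficient <- filter(small, cyl == 4, mpg > 30)

efficient
```

Conditions separated by commas in filter() are combined with a logical AND.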

The “ggplot2” package is for plotting. Watch the first 15 minutes of the first video in the video series at https://rpubs.com/scsustat/tidyverse to get a feel for how a nice plot can be created with this package. We will dive into this later.

Using arrange() to arrange observations in order

arrange(mtcars, cyl, -disp) # Arrange data in ascending order by cyl, then in descending order by disp
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
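
The minus sign in -disp only works for numeric columns. As a hedged alternative, dplyr also provides desc(), which works for any sortable column type. The sketch below checks that both give the same ordering on mtcars; the names a and b are placeholders.

```r
library(dplyr)

a <- arrange(mtcars, cyl, -disp)       # negate a numeric column
b <- arrange(mtcars, cyl, desc(disp))  # desc() also works for non-numeric columns

# Compare the resulting order of the disp column
identical(a$disp, b$disp)
```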

Using mutate() to change or create a new column

D = mtcars # Make a copy
mutate(D, 
       fake = mpg + disp, # A meaningless column called "fake" added to the D data frame
       mpg_cat = (mpg>20), # Add another column called "mpg_cat". The column indicates whether mpg > 20
       log_disp = log(disp), # Add still another column called "log_disp": A log-transformation to disp
       mpg = log(mpg) # Replace the mpg column by its log value
      )
##                          mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           3.044522   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       3.044522   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          3.126761   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      3.063391   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   2.928524   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             2.895912   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          2.660260   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           3.194583   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            3.126761   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            2.954910   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           2.879198   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          2.797281   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          2.850707   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         2.721295   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  2.341806   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 2.341806   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   2.687847   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            3.478158   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         3.414443   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      3.523415   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       3.068053   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    2.740840   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         2.721295   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          2.587764   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    2.954910   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           3.306887   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       3.258097   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        3.414443   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      2.760010   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        2.980619   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       2.708050   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          3.063391   4 121.0 109 4.11 2.780 18.60  1  1    4    2
##                      fake mpg_cat log_disp
## Mazda RX4           181.0    TRUE 5.075174
## Mazda RX4 Wag       181.0    TRUE 5.075174
## Datsun 710          130.8    TRUE 4.682131
## Hornet 4 Drive      279.4    TRUE 5.552960
## Hornet Sportabout   378.7   FALSE 5.886104
## Valiant             243.1   FALSE 5.416100
## Duster 360          374.3   FALSE 5.886104
## Merc 240D           171.1    TRUE 4.988390
## Merc 230            163.6    TRUE 4.947340
## Merc 280            186.8   FALSE 5.121580
## Merc 280C           185.4   FALSE 5.121580
## Merc 450SE          292.2   FALSE 5.619676
## Merc 450SL          293.1   FALSE 5.619676
## Merc 450SLC         291.0   FALSE 5.619676
## Cadillac Fleetwood  482.4   FALSE 6.156979
## Lincoln Continental 470.4   FALSE 6.131226
## Chrysler Imperial   454.7   FALSE 6.086775
## Fiat 128            111.1    TRUE 4.365643
## Honda Civic         106.1    TRUE 4.326778
## Toyota Corolla      105.0    TRUE 4.264087
## Toyota Corona       141.6    TRUE 4.788325
## Dodge Challenger    333.5   FALSE 5.762051
## AMC Javelin         319.2   FALSE 5.717028
## Camaro Z28          363.3   FALSE 5.857933
## Pontiac Firebird    419.2   FALSE 5.991465
## Fiat X1-9           106.3    TRUE 4.369448
## Porsche 914-2       146.3    TRUE 4.789989
## Lotus Europa        125.5    TRUE 4.554929
## Ford Pantera L      366.8   FALSE 5.860786
## Ferrari Dino        164.7   FALSE 4.976734
## Maserati Bora       316.0   FALSE 5.707110
## Volvo 142E          142.4    TRUE 4.795791
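
Note that mutate() returns a new data frame and leaves its input untouched, so the result has to be assigned if you want to keep it. A small sketch, where D2 is a made-up name:

```r
library(dplyr)

D <- mtcars                             # Make a copy
D2 <- mutate(D, log_disp = log(disp))   # Store the result under a new name

"log_disp" %in% names(D)    # Check the original copy
"log_disp" %in% names(D2)   # Check the stored result
```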

Using group_by() to group data

by_cyl <- mtcars %>% group_by(cyl)
# grouping doesn't change how the data looks (apart from listing how it's grouped):
by_cyl
## # A tibble: 32 × 11
## # Groups:   cyl [3]
##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
##  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
##  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
##  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
##  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
##  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
##  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
##  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
##  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
## 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
## # … with 22 more rows
class(mtcars)
## [1] "data.frame"
class(by_cyl) # The structure of the new data frame is richer
## [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"
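
The grouping information can be inspected and removed again. The following sketch uses the dplyr helpers group_vars(), n_groups() and ungroup() on the same grouped data:

```r
library(dplyr)

by_cyl <- group_by(mtcars, cyl)

group_vars(by_cyl)     # Name(s) of the grouping variable(s)
n_groups(by_cyl)       # Number of groups

flat <- ungroup(by_cyl)  # Drop the grouping; the result is a plain tibble
class(flat)
```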

Using summarize() to summarize grouped data

by_cyl <- mtcars %>% group_by(cyl)
# grouping doesn't change how the data looks (apart from listing how it's grouped):
by_cyl # Now, the mtcars data frame has been grouped and called by_cyl.
## # A tibble: 32 × 11
## # Groups:   cyl [3]
##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
##  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
##  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
##  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
##  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
##  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
##  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
##  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
##  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
## 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
## # … with 22 more rows
by_cyl %>% summarise(
  mpg_mean = mean(mpg), # Mean of mpg for each cyl group
  hp_mean = mean(hp) # Mean of hp for each cyl group
)
## # A tibble: 3 × 3
##     cyl mpg_mean hp_mean
##   <dbl>    <dbl>   <dbl>
## 1     4     26.7    82.6
## 2     6     19.7   122. 
## 3     8     15.1   209.
by_cyl %>% summarise(freq = n()) # Frequency distribution of "cyl"; the resulting data frame has two columns: cyl and freq
## # A tibble: 3 × 2
##     cyl  freq
##   <dbl> <int>
## 1     4    11
## 2     6     7
## 3     8    14
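
Two related patterns, sketched under the assumption that dplyr is loaded: summarise() on ungrouped data collapses everything to one row, and count() is a shortcut for grouping followed by summarise(n()). The names overall and freq_tab are illustrative.

```r
library(dplyr)

# Without group_by(), the summary is computed over all rows at once
overall <- summarise(mtcars, mpg_mean = mean(mpg), hp_mean = mean(hp))
overall

# count() groups by cyl and counts rows; name = sets the count column's name
freq_tab <- count(mtcars, cyl, name = "freq")
freq_tab
```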

An Introduction to the ggplot2 package

Here are articles that show why data visualization is important:

The ggplot2 Package

This package allows you to create different kinds of plots using a more generalized approach. It is based on the graphic concept that a graph is a combination of 3 independent components: data, aesthetics, and geometry, where

  • data is a data frame (maybe NULL),

  • aesthetics is used to set the x- and y-axes or to choose the shape, size, or color of the plotting character.

  • geometry refers to the type of graphics (such as scatter plot, line plot, histogram, boxplot, density curve, barplot, pie chart, violin plot).

We use examples to demonstrate the use of ggplot2 package when plotting data based on the following situations:

  • One continuous variable (histogram, boxplot, density curve)

  • One categorical variable (barplot, pie chart)

  • Two continuous variables (scatter plot, line plot)

  • One continuous variable and one categorical variable (side-by-side boxplots, overlaid density curves, overlaid histograms, violin plots)

  • Two categorical variables (side-by-side barplots, mosaic plots)

  • Three or more variables

A warning: For each aesthetic, you use the aes() function to associate the name of the aesthetic with a variable to display. The aes() function gathers together each of the aesthetic mappings used by a layer and passes them to the layer’s mapping argument. You can also set the aesthetic properties of your geom manually. To set an aesthetic manually, set the aesthetic by name as an argument of your geom function; i.e. it goes outside of aes()!
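
The difference between mapping and setting an aesthetic can be sketched as follows. The plot names are made up for illustration, and the third plot shows a common mistake rather than a recommended pattern.

```r
library(ggplot2)

# Mapping: the color depends on a variable, so it goes inside aes()
p_mapped <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point()

# Setting: a constant color is an argument of the geom, outside aes()
p_set <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = "blue")

# Mistake: a constant inside aes() is treated as a variable with one level,
# so the points get the default palette color and a legend labelled "blue"
p_wrong <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(color = "blue"))
```

Printing p_set and p_wrong side by side makes the difference visible.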

In addition to the 3 components of a graph, there are more than 30 theme elements that control the appearance of the graph. They can be roughly grouped into five categories: plot, axis, legend, panel and facet. Check out the theme() function for details.

head(iris) # Check the data
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
ggplot(iris) # Does not plot anything, since no x or y coordinate is specified

ggplot(iris, aes(x = Sepal.Length)) # Only plot the coordinate system

ggplot(iris, aes(x = Sepal.Length)) + 
  geom_histogram(bins = 6, 
                 color = "blue", 
                 fill = "red"
                ) # Add a layer

ggplot(iris, aes(x = Sepal.Length)) + geom_density(color = "blue") 

ggplot(iris, aes(x = Sepal.Length)) + geom_boxplot(color = "blue") 

ggplot(iris, aes(x = Species)) + geom_bar(color = "blue", fill = "red") 

ggplot(iris, aes(y = Species)) + geom_bar(color = "blue", fill = "red") # horizontal bars

p = ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(col = "blue", size = 2)  # The shape argument plays the role of pch in base plots

p

plotly::ggplotly(p) # Use the plotly package to make the plot interactive
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_line(col = "blue", linewidth = 2) # linewidth replaces size for lines as of ggplot2 3.4.0

ggplot(mtcars, aes(x = factor(cyl), y = wt, fill = factor(cyl))) +
  geom_boxplot() + 
  theme(legend.position = "none") +
  labs(x = "Cylinder", y = "Weight", title = "My Plots", subtitle = "Made by XYZ", caption = "Data courtesy")

ggplot(mtcars, aes(x = wt, color = factor(cyl))) +
  geom_density() +
  labs(x = "Weight", color = "Cylinder")  # Legend's title is set via color here

ggplot(mtcars, aes(x = factor(cyl), y = wt, fill = factor(cyl))) +
  geom_violin() +
  labs(x = "Cylinder", y = "Weight", fill = "Cylinder")  # Legend's title is set via fill here

More

mtcars
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
ggplot(mtcars, mapping=aes(x = hp, y = mpg)) +
  geom_point(color = "red")

## Want to do other plots?

ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_line(color = "red")

ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(color = "blue", fill = "red", bins=6)

ggplot(mtcars, aes(x = mpg)) +
  geom_density(color = "red")

ggplot(mtcars, aes(x = mpg, fill = factor(cyl))) +
  geom_histogram(bins = 6) # bad: the histograms are stacked by default, which makes the groups hard to compare

ggplot(mtcars, aes(x = mpg, colour = factor(cyl))) +
  geom_density() +
  labs(colour = "Cylinder") # rename legend title

ggplot(mtcars, aes(x = mpg, color= as.factor(cyl))) +
  geom_density() +
  labs(title = "My Plot Title", x="Miles per Gallon", y= "Density", color="Cylinder") + # rename legend title
  theme(plot.title = element_text(hjust = 0.5)) # Center the title

ggplot(mtcars, aes(y = mpg)) +
  geom_boxplot(fill = "red")

# More to learn about the theme()
ggplot(mtcars, aes(x = as.factor(cyl), y = mpg, fill = as.factor(cyl))) +
  geom_boxplot() + 
  labs(title = "Mpg vs Cyl", subtitle = "Hello World!", x="Cyl", caption = "Data courtesy: xyz") +
  theme(plot.background = element_rect(fill = "gray"),
        panel.background = element_rect(fill = "yellow"),
        plot.margin = unit(c(t=1,r=2,b=3,l=4), "cm"),
        axis.title.x = element_text(hjust = 0.5, size = 20, color = "pink", angle = 45, vjust = -1),
        plot.title = element_text(hjust = 0.5, size = 28, color = "green"),
        plot.subtitle = element_text(hjust = 0.5, size = 15, color = "red"),
        plot.caption = element_text(hjust = 0, size = 10, color = "green"),
        legend.position = "none",
        panel.grid.major.y = element_blank(),
        panel.grid.minor.y = element_blank()
        ) 

myData = data.frame(
  Student = 1:24,
  Major = c("Accounting", "IS", "CS", "Statistics", "EE", "Math", "Accounting", "CS", "CS", "Statistics", "EE", "Math","Accounting", "IS", "IS", "Statistics", "Math", "Math", "Accounting", "IS", "Statistics", "Statistics", "EE", "CS"),
  Sex = c("Male", "Female", "Male", "Male", "Female", "Male", "Female", "Male", "Male", "Female", "Female", "Female", "Male", "Male", "Female", "Male", "Female", "Female", "Male", "Female", "Male", "Male", "Male", "Male"),
  Score = c(87, 92, 69, 90, 88, 79, 94, 86, 92, 84, 78, 67, 95, 91, 93, 83, 82, 94, 86, 74, 95, 55, 93, 85)
)

myData
##    Student      Major    Sex Score
## 1        1 Accounting   Male    87
## 2        2         IS Female    92
## 3        3         CS   Male    69
## 4        4 Statistics   Male    90
## 5        5         EE Female    88
## 6        6       Math   Male    79
## 7        7 Accounting Female    94
## 8        8         CS   Male    86
## 9        9         CS   Male    92
## 10      10 Statistics Female    84
## 11      11         EE Female    78
## 12      12       Math Female    67
## 13      13 Accounting   Male    95
## 14      14         IS   Male    91
## 15      15         IS Female    93
## 16      16 Statistics   Male    83
## 17      17       Math Female    82
## 18      18       Math Female    94
## 19      19 Accounting   Male    86
## 20      20         IS Female    74
## 21      21 Statistics   Male    95
## 22      22 Statistics   Male    55
## 23      23         EE   Male    93
## 24      24         CS   Male    85
ggplot(myData, aes(x = Score, color= Sex)) +
  geom_density() 

# Bar plots with 3 possible positions
ggplot(myData, aes(x = Major, fill = Sex)) +
  geom_bar(position = "stack") # Each bar consists of two pieces representing the counts of each sex

ggplot(myData, aes(x = Major, fill = Sex)) +
  geom_bar(position = "dodge") # Side-by-side

ggplot(myData, aes(x = Major, fill = Sex)) +
  geom_bar(position = "fill") # Each bar has the same height of 1 and consists of two sexes in proportion

# Name-value kind of data
popData = data.frame(City = c("St Cloud", "Sartell", "Waite Park", "St Joseph"), 
                        Population = c(66169, 18428, 7718, 7147))
popData
##         City Population
## 1   St Cloud      66169
## 2    Sartell      18428
## 3 Waite Park       7718
## 4  St Joseph       7147
ggplot(popData, aes(x = City, y = Population)) +
  geom_col() + # for Name-value kind of data 
  theme(axis.text.x = element_text(angle = 45)) 

# The Titanic Data

D=as.data.frame(Titanic)
D1=D %>% group_by(Class)
D2=D1 %>% summarise(Freq=sum(Freq))
D2
## # A tibble: 4 × 2
##   Class  Freq
##   <fct> <dbl>
## 1 1st     325
## 2 2nd     285
## 3 3rd     706
## 4 Crew    885
ggplot(D2, aes(x=Class, y = Freq)) +
  geom_col() +
  geom_text(aes(label=Freq), vjust=2.5, col ="red", size=13)

ggplot(D, aes(x=Class, y = Freq, fill = Survived)) +
  geom_col(position = "stack") +
  labs(fill = "Survival Status") # Change title of legend

ggplot(D, aes(x=Class, y = Freq, fill = Survived)) +
  geom_col(position = "dodge")

# Which one to use: geom_bar() or geom_col()? 
# Both functions create bar graphs. In general, for individual data, such as myData above, use geom_bar(); 
# for grouped data, such as D above, use geom_col(). 
# The function geom_bar() can also be applied to grouped data, if you use it this way: geom_bar(stat="identity").
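
The relationship between the two functions can be sketched on the mtcars data: geom_bar() counts the rows itself, while geom_col() expects the counts to be in the data already. The object names below are illustrative.

```r
library(dplyr)
library(ggplot2)

# Individual data: one row per car, geom_bar() does the counting
p_bar <- ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar()

# Grouped data: count first, then pass the counts as y to geom_col()
counts <- count(mtcars, cyl)   # columns: cyl and n
p_col <- ggplot(counts, aes(x = factor(cyl), y = n)) +
  geom_col()

# Same grouped data with geom_bar(), telling it not to count again
p_identity <- ggplot(counts, aes(x = factor(cyl), y = n)) +
  geom_bar(stat = "identity")
```

All three plots should show the same bar heights.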

A quick quiz 1: how many aesthetics are adopted in the following plot? How would you create such a plot?

## Warning: Using size for a discrete variable is not advised.

A quick quiz 2: how many aesthetics are adopted in the following plot? How would you create such a plot?

D = mtcars %>% mutate(cyl = factor(cyl), am = factor(am), gear = factor(gear))

p <- ggplot(data = D, aes(x=wt, y=mpg)) +
    geom_line(aes(group=factor(gear), color=gear)) +   # group is redundant here: a discrete color mapping already defines the groups
    labs(title = "Demonstrate the use of aesthetics", subtitle = 'Using the "mtcars" data') +
    theme(plot.title = element_text(hjust = 0.5)) +
    theme(plot.subtitle = element_text(hjust = 0.5))

p

Faceting

Sometimes, you need to split your plot into facets, subplots that each display one subset of the data.

# Prepare data
D = mtcars %>% mutate(am = factor(am, 
                                  levels = c(0,1), 
                                  labels = c("Automatic", "Manual")
                                 ), 
                      vs = factor(vs, 
                                  levels = c(0,1), 
                                  labels = c("V-shaped", "straight")
                                 )
                     )

# Use one categorical variable for faceting
ggplot(data = D, aes(x=wt, y=mpg)) +  # Use the prepared data D so that the facets carry the labels
  geom_line() +
  facet_grid(am~.)

# Or
ggplot(data = D, aes(x=wt, y=mpg)) +
  geom_line() +
  facet_grid(.~am)

# Use two categorical variables for faceting
ggplot(data = D, aes(x=wt, y=mpg)) +
  geom_line() +
  facet_grid(am~vs)
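
When there is only one faceting variable, facet_wrap() is a common alternative to facet_grid(). A minimal sketch on mtcars, where the choice of cyl and of a single row of panels is just for illustration:

```r
library(ggplot2)

# One panel per value of cyl, laid out in a single row
p_wrap <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  facet_wrap(~ cyl, nrow = 1)

p_wrap
```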

Dodged Barplots with Text Labels

# data
D=data.frame(state=rep(LETTERS[1:4], each = 2), year = rep(2020:2019, 4), prop=c(0.6, 0.3, 0.4, 0.7, 0.2,0.7, 0.1, 0.6))

D
##   state year prop
## 1     A 2020  0.6
## 2     A 2019  0.3
## 3     B 2020  0.4
## 4     B 2019  0.7
## 5     C 2020  0.2
## 6     C 2019  0.7
## 7     D 2020  0.1
## 8     D 2019  0.6
# Dodged Barplots with Text Labels
ggplot(D, aes(x=state, y=prop, fill=factor(year))) +
  geom_col(position = "dodge") +
  geom_text(aes(label=prop), 
            position = position_dodge(width = 0.9), # match the default bar width (0.9) so each label sits over its own bar
            color="white",
            vjust = 1.5, # texts inside bars
            hjust = 0.5  # centered texts
           ) +
  labs(fill = "Year")

A Comprehensive Example

This example will use the data frame “diamonds” from the ggplot2 package. The dataset contains the prices and other attributes of almost 54,000 diamonds. The variables are as follows:

price: price in US dollars ($326–$18,823)

carat: weight of the diamond (0.2–5.01)

cut: quality of the cut (Fair, Good, Very Good, Premium, Ideal)

color: diamond colour, from D (best) to J (worst)

clarity: a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))

x: length in mm (0–10.74)

y: width in mm (0–58.9)

z: depth in mm (0–31.8)

depth: total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43–79)

table: width of top of diamond relative to widest point (43–95)

Reference: https://stackoverflow.com/questions/19233365/how-to-create-a-marimekko-mosaic-plot-in-ggplot2

head(diamonds, n = 10)
## # A tibble: 10 × 10
##    carat cut       color clarity depth table price     x     y     z
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
##  2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
##  3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
##  4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
##  5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
##  6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
##  7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
##  8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
##  9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
## 10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
df <- diamonds %>%
  group_by(cut, clarity) %>% ## Must do grouping first in order to do the following operations
  summarise(count = n()) %>%
  group_by(cut) %>% # If not specified, the following will still be done by default for each category of cut
  mutate(cut.count = sum(count), 
         prop = count/sum(count)) %>%
  ungroup() # Remove grouping after we have added the two new columns
## `summarise()` has grouped output by 'cut'. You can override using the `.groups`
## argument.
ggplot(df, aes(x = cut, y = prop, width = cut.count, fill = clarity)) +
  geom_col(colour = "black") +
  facet_grid(.~cut, scales = "free_x", space = "free_x") +
  scale_fill_brewer(palette = "RdYlGn") +
  theme_void() 
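
The message printed by summarise() can be avoided by stating what should happen to the grouping. The sketch below is one possible variant of the data preparation, using the .groups argument of summarise(); df2 is a made-up name so that df is not overwritten.

```r
library(dplyr)
library(ggplot2)   # provides the diamonds data

df2 <- diamonds %>%
  group_by(cut, clarity) %>%
  summarise(count = n(), .groups = "drop") %>%  # drop all grouping explicitly
  group_by(cut) %>%                             # regroup by cut only
  mutate(cut.count = sum(count),                # diamonds per cut
         prop = count / cut.count) %>%          # share of each clarity within a cut
  ungroup()

head(df2)
```

Within each cut, the prop values should add up to 1.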

# A reordered barplot
df <- diamonds %>%
  group_by(cut) %>% ## Must do grouping first in order to do the following operations
  summarise(count = n())

ggplot(df, 
       aes(x = reorder(cut, -count), y = count)) +  # "-" means reorder "cut" by "count" from high to low
  geom_col() +
  theme_bw() +  # Apply the complete theme first; placed last, it would override the theme() tweak below
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) +
  ggtitle("Distribution of Diamonds by Cut") +xlab("Cut")

Other Nice Plots

Polar Coordinates

There are nice references:

D=USArrests
D$State = rownames(USArrests)
D=arrange(D, Murder)
angle <- 90-360 * (0:49) /50     

hjust <- ifelse( angle < -90, 1, 0)
angle <- ifelse(angle < -90, angle+180, angle)
 
ggplot(D, aes(x = reorder(State, Murder))) +
    geom_col(aes(y = Murder, fill = "Murder")) +
    geom_col(aes(y = -Assault/10, fill = "Assault")) + 
    coord_polar() +
    theme_void() +   # Void all themes
    theme(panel.grid.major.x  = element_line(colour = "lightgrey"),
          panel.grid.major.y  = element_blank()) + 
    geom_text(aes(y = 20, label = State), angle = angle-3, hjust = hjust, size = 3) +
    geom_text(aes(y = Murder, label = Murder), size = 2, vjust = 0.5) +
    geom_text(aes(y = -Assault/10, label = Assault), size = 2, vjust = 0.5) + 
    labs(title = "A Mirrored, Radial Barplot", 
         subtitle = "Murder & Assault Rates for States",
         fill = "Crime Type") + 
    theme(legend.position = c(1.1, 0.9))

Challenging problems:

The “BasketballAnalyzeR” package

We demonstrate the use of the “BasketballAnalyzeR” package for creating nice plots.

D = USArrests[1:10,] # Only use the first 10 rows of the USArrests data from the base package
D$group = rep(1:2, c(5,5)) # Add an arbitrary group just for demo

BasketballAnalyzeR::radialprofile(data=D, 
                                  std = TRUE,  # variables are standardized.
                                  title = rownames(D))
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

The ggradar Package

Install this package using code:

devtools::install_github("ricardo-bion/ggradar", dependencies = TRUE)

# Prepare data: 
# the first column must be a character and other columns numeric between 0 and 1
df = data.frame(year = as.character(c(2019, 2020)),
               intimidation = c(0.33, 0.29),
               destruction = c(0.25, 0.34),
               assault = c(0.12, 0.15),
               other = c(0.3, 0.22)
)

# plot
ggradar::ggradar(df)
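
If your own data are not already between 0 and 1, they need to be rescaled first. The following base-R sketch does a min-max rescaling on a few rows of USArrests; rescale01 is a hypothetical helper written for this example, not a function from ggradar.

```r
# Hypothetical helper: min-max rescaling of a numeric vector to [0, 1]
# (it assumes the vector is not constant, otherwise it divides by zero)
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))

raw <- USArrests[1:5, ]   # First 5 states, 4 numeric columns

scaled <- data.frame(
  state = rownames(raw),        # First column: character labels
  lapply(raw, rescale01),       # Other columns: rescaled to [0, 1]
  row.names = NULL,
  stringsAsFactors = FALSE
)

scaled
```

A data frame shaped like scaled could then be passed to ggradar::ggradar().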

Query Using the sqldf Package

Reference: https://dept.stat.lsa.umich.edu/~jerrick/courses/stat701/notes/sql.html

SQL is a database query language, that is, a language designed specifically for interacting with a database. You might find it a great tool for data wrangling. Better than tidyverse?

library(sqldf) # Load the package
## Loading required package: gsubfn
## Loading required package: proto
## Loading required package: RSQLite
sqldf('SELECT mpg FROM mtcars')
##     mpg
## 1  21.0
## 2  21.0
## 3  22.8
## 4  21.4
## 5  18.7
## 6  18.1
## 7  14.3
## 8  24.4
## 9  22.8
## 10 19.2
## 11 17.8
## 12 16.4
## 13 17.3
## 14 15.2
## 15 10.4
## 16 10.4
## 17 14.7
## 18 32.4
## 19 30.4
## 20 33.9
## 21 21.5
## 22 15.5
## 23 15.2
## 24 13.3
## 25 19.2
## 26 27.3
## 27 26.0
## 28 30.4
## 29 15.8
## 30 19.7
## 31 15.0
## 32 21.4
sqldf('SELECT mpg, disp FROM mtcars')
##     mpg  disp
## 1  21.0 160.0
## 2  21.0 160.0
## 3  22.8 108.0
## 4  21.4 258.0
## 5  18.7 360.0
## 6  18.1 225.0
## 7  14.3 360.0
## 8  24.4 146.7
## 9  22.8 140.8
## 10 19.2 167.6
## 11 17.8 167.6
## 12 16.4 275.8
## 13 17.3 275.8
## 14 15.2 275.8
## 15 10.4 472.0
## 16 10.4 460.0
## 17 14.7 440.0
## 18 32.4  78.7
## 19 30.4  75.7
## 20 33.9  71.1
## 21 21.5 120.1
## 22 15.5 318.0
## 23 15.2 304.0
## 24 13.3 350.0
## 25 19.2 400.0
## 26 27.3  79.0
## 27 26.0 120.3
## 28 30.4  95.1
## 29 15.8 351.0
## 30 19.7 145.0
## 31 15.0 301.0
## 32 21.4 121.0
sqldf('SELECT mpg, disp FROM mtcars WHERE am = 1 ORDER BY mpg ASC')
##     mpg  disp
## 1  15.0 301.0
## 2  15.8 351.0
## 3  19.7 145.0
## 4  21.0 160.0
## 5  21.0 160.0
## 6  21.4 121.0
## 7  22.8 108.0
## 8  26.0 120.3
## 9  27.3  79.0
## 10 30.4  75.7
## 11 30.4  95.1
## 12 32.4  78.7
## 13 33.9  71.1
sqldf('SELECT * FROM mtcars WHERE (mpg > 20 AND disp < 95) OR carb > 1000')
##    mpg cyl disp hp drat    wt  qsec vs am gear carb
## 1 32.4   4 78.7 66 4.08 2.200 19.47  1  1    4    1
## 2 30.4   4 75.7 52 4.93 1.615 18.52  1  1    4    2
## 3 33.9   4 71.1 65 4.22 1.835 19.90  1  1    4    1
## 4 27.3   4 79.0 66 4.08 1.935 18.90  1  1    4    1
sqldf('SELECT * FROM mtcars WHERE carb NOT IN (1,2)')
##     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## 1  21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## 2  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## 3  14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## 4  19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## 5  17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## 6  16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## 7  17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## 8  15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## 9  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## 10 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## 11 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## 12 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## 13 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## 14 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## 15 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
sqldf("SELECT am, AVG(mpg) AS Avgmpg FROM mtcars GROUP BY am")
##   am   Avgmpg
## 1  0 17.14737
## 2  1 24.39231
sqldf("SELECT am, COUNT(*) AS count FROM mtcars GROUP BY am")
##   am count
## 1  0    19
## 2  1    13
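
To compare the two approaches, here is a sketch of how three of the queries above could be written with dplyr verbs. The mapping in the comments (WHERE to filter(), ORDER BY to arrange(), and so on) is a rough correspondence, and the object names are illustrative.

```r
library(dplyr)

# SELECT mpg, disp FROM mtcars WHERE am = 1 ORDER BY mpg ASC
manual_sorted <- mtcars %>%
  filter(am == 1) %>%      # WHERE
  select(mpg, disp) %>%    # SELECT
  arrange(mpg)             # ORDER BY

# SELECT am, AVG(mpg) AS Avgmpg FROM mtcars GROUP BY am
avg_by_am <- mtcars %>%
  group_by(am) %>%                  # GROUP BY
  summarise(Avgmpg = mean(mpg))     # AVG(...) AS ...

# SELECT am, COUNT(*) AS count FROM mtcars GROUP BY am
count_by_am <- count(mtcars, am, name = "count")
```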

Another useful package for running SQL queries from R is RSQLite.