Visualizing the data

ggplot and the grammar of graphics (still part of the tidyverse):

aesthetic (aes) = which variable goes where; aesthetics are for the data
geometry (geom) = the shape that gets drawn (e.g. a boxplot); geoms are what change how the plot looks
+ strings the pieces of a ggplot together
themes = style, preference, gridlines, etc.; cosmetic choices. There is a default theme, and theme_classic() is a common alternative

Now make a box plot. It needs an x axis and a y axis in the aesthetics: cereals %>% ggplot(aes(x = , y = )) + geom_boxplot()

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.1.0     ✓ dplyr   1.0.5
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(here)
## here() starts at /data/biostat/a089861/A089861/R Trainings
cereals <- read_csv(here("Course #2", "cereals.csv"))
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   name = col_character(),
##   mfr = col_character(),
##   type = col_character(),
##   calories = col_double(),
##   protein = col_double(),
##   fat = col_double(),
##   sodium = col_double(),
##   fiber = col_double(),
##   carbo = col_double(),
##   sugars = col_double(),
##   potass = col_double(),
##   vitamins = col_double(),
##   shelf = col_double(),
##   weight = col_double(),
##   cups = col_double(),
##   rating = col_double()
## )
library(ggplot2)
cereals <- read_csv("cereals.csv")

Clean Data and Add category

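Clean the data: turn mfr and type into factors, and recode -1 (used in this file as the code for a missing value) as NA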

cereals <- cereals %>%
  mutate(
    mfr = factor(mfr),
    type = factor(type),
    # recode -1 as NA in every numeric column (this covers potass too)
    across(where(is.numeric), ~ na_if(.x, -1))
  )
head(cereals)
## # A tibble: 6 x 16
##   name       mfr   type  calories protein   fat sodium fiber carbo sugars potass
##   <chr>      <fct> <fct>    <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>  <dbl>  <dbl>
## 1 100% Bran  N     C           70       4     1    130  10     5        6    280
## 2 100% Natu… Q     C          120       3     5     15   2     8        8    135
## 3 All-Bran   K     C           70       4     1    260   9     7        5    320
## 4 All-Bran … K     C           50       4     0    140  14     8        0    330
## 5 Almond De… R     C          110       2     2    200   1    14        8     NA
## 6 Apple Cin… G     C          110       2     2    180   1.5  10.5     10     70
## # … with 5 more variables: vitamins <dbl>, shelf <dbl>, weight <dbl>,
## #   cups <dbl>, rating <dbl>
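
To see what na_if() does in isolation, here is a minimal sketch on a small made-up tibble. The object toy and its values are hypothetical, not taken from cereals.csv; the only assumption carried over from the cleaning step is that -1 stands for a missing value.

```r
library(dplyr)

# Hypothetical toy data: -1 stands in for "missing"
toy <- tibble(
  name   = c("A-flakes", "B-puffs", "C-bran"),
  potass = c(280, -1, 320),
  sugars = c(6, 8, -1)
)

# Recode -1 as NA in every numeric column; character columns are left alone
toy_clean <- toy %>%
  mutate(across(where(is.numeric), ~ na_if(.x, -1)))

toy_clean
```

Only values exactly equal to -1 are replaced, so legitimate zeros and positive values stay as they are.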

Summarize the data: mean, median, and sd of sugars, ignoring missing values (na.rm = TRUE)

cereals %>%
  summarize(
    mean_sugar = mean(sugars, na.rm = TRUE),
    median_sugar = median(sugars, na.rm = TRUE),
    sd_sugar = sd(sugars, na.rm = TRUE)
  )
## # A tibble: 1 x 3
##   mean_sugar median_sugar sd_sugar
##        <dbl>        <dbl>    <dbl>
## 1       7.03            7     4.38
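
A small sketch of why na.rm = TRUE is needed once missing values are coded as NA. The vector below is made up for illustration and is not from the cereals data.

```r
sugars <- c(6, 8, NA, 10)  # hypothetical values; NA is a missing entry

mean(sugars)                  # NA: a single missing value makes the result missing
mean(sugars, na.rm = TRUE)    # 8: the NA is dropped before averaging
median(sugars, na.rm = TRUE)  # 8
sd(sugars, na.rm = TRUE)      # 2
```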

Make a new variable, cal_per_cup: calories divided by cups

cereals <- cereals %>%
  mutate(cal_per_cup = calories/cups)

Basic box plot (default look)

cereals %>%
  ggplot(aes(x = mfr, y = cal_per_cup)) +
  geom_boxplot()

Make it pretty: fill the boxes by manufacturer and switch to theme_classic()

cereals %>%
  ggplot(aes(x = mfr , y = cal_per_cup, fill = mfr)) +
  geom_boxplot()+
  theme_classic()

FIND ONLINE:
Scales: https://ggplot2-book.org/scale-position.html

Themes: https://ggplot2-book.org/polishing.html

Add themes

coord_flip() flips the coordinates: x and y swap places, so the boxes run horizontally

cereals %>%
  ggplot(aes(x = mfr , y = cal_per_cup, fill = mfr)) +
  geom_boxplot()+
  theme_classic()+
  coord_flip()

Top 5 geometries: most visualizations come from these geoms

  1. Bar plot: make a bar plot based on manufacturer

geom_bar() needs only an x aesthetic; it counts the rows in each category of x and draws one bar per category

cereals %>%
  ggplot(aes(x = mfr)) +
  geom_bar()

add color

cereals %>%
  ggplot(aes(x = mfr, fill = mfr)) +
  geom_bar()

Another quantity to plot: calories per cup

Find the mean of cal_per_cup for each manufacturer

First summarize it into a data frame

cereals %>%
  group_by(mfr) %>%
  summarize(avg_cal = mean(cal_per_cup)) 
## # A tibble: 7 x 2
##   mfr   avg_cal
##   <fct>   <dbl>
## 1 A        100 
## 2 G        138.
## 3 K        145.
## 4 N        125.
## 5 P        195.
## 6 Q        125.
## 7 R        134.

Switch from geom_bar() to geom_col()

To put the average cal_per_cup on the bar plot we need y = avg_cal, but geom_bar() does not accept a y aesthetic (it counts rows itself); only geom_col() does

Now ggplot can plot it

cereals %>%
  group_by(mfr) %>%
  summarize(avg_cal = mean(cal_per_cup)) %>%
  ggplot(aes(x = mfr, fill = mfr, y = avg_cal))+
  geom_col()
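
As an alternative to summarizing first, ggplot2 can compute the mean while drawing, via stat_summary(). The sketch below compares both routes on a made-up tibble (toy and its values are hypothetical). It assumes ggplot2 3.3.0 or later, where the summary function is passed as fun.

```r
library(ggplot2)
library(dplyr)

# Hypothetical toy data: two manufacturers, two cereals each
toy <- tibble(
  mfr         = factor(c("G", "G", "K", "K")),
  cal_per_cup = c(100, 140, 110, 190)
)

# Route 1: summarize first, then draw the precomputed heights with geom_col()
p_col <- toy %>%
  group_by(mfr) %>%
  summarize(avg_cal = mean(cal_per_cup)) %>%
  ggplot(aes(x = mfr, y = avg_cal, fill = mfr)) +
  geom_col()

# Route 2: let ggplot2 compute the mean per manufacturer while drawing
p_stat <- toy %>%
  ggplot(aes(x = mfr, y = cal_per_cup, fill = mfr)) +
  stat_summary(fun = mean, geom = "bar")
```

Both plots should show the same bar heights. Route 1 keeps the summary table around for checking the numbers, while route 2 is shorter.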

How to use a bar plot to compare categorical variables
  1. Compare manufacturer and hot or cold cereal

Plot manufacturer and fill the bars by whether the cereal is hot or cold

Both variables are categorical and we are not supplying a y, so we use geom_bar() to see whether the two categorical variables depend on each other, e.g. which brands make hot cereals

cereals %>%
  ggplot(aes(x = mfr, fill = type)) +
  geom_bar()

Show hot and cold cereals as a proportion of each manufacturer's cereals: position = "fill" stretches every bar to the same height

cereals %>%
  ggplot(aes(x = mfr, fill = type)) +
  geom_bar(position = "fill")
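
The proportions that position = "fill" draws can also be computed as a table, which helps when reading exact values off the plot is hard. A minimal sketch on made-up data (toy is hypothetical; "C" and "H" are assumed to mean cold and hot, as in the notes):

```r
library(dplyr)

# Hypothetical toy data: one row per cereal
toy <- tibble(
  mfr  = c("G", "G", "G", "N", "N", "Q", "Q", "Q", "Q"),
  type = c("C", "C", "C", "C", "H", "C", "C", "C", "H")
)

props <- toy %>%
  count(mfr, type) %>%           # rows per manufacturer/type combination
  group_by(mfr) %>%
  mutate(prop = n / sum(n)) %>%  # share within each manufacturer
  ungroup()

props
```

Within each manufacturer the prop values add up to 1, which is why every bar has the same height in the filled plot.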

  2. Box plot: use color to compare a second category. With geom_boxplot(), compare mfr and calories per cup, filled by type

cereals %>%
  ggplot(aes(x = mfr, y = cal_per_cup, fill = type)) +
  geom_boxplot()

  3. Histogram: for quantitative data; shows the distribution of a numeric variable; you can choose how many bins

cereals %>%
  ggplot(aes(x = cal_per_cup)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Change the number of bins (the argument is bins, not bin):

cereals %>%
  ggplot(aes(x = cal_per_cup)) +
  geom_histogram(bins = 30)

Fewer bins:

cereals %>%
  ggplot(aes(x = cal_per_cup)) +
  geom_histogram(bins = 3)
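
The stat_bin() message suggests binwidth as an alternative way to control the bars. A short sketch of the two options, using a made-up data frame (toy and its values are hypothetical):

```r
library(ggplot2)

# Hypothetical calories-per-cup values
toy <- data.frame(
  cal_per_cup = c(60, 90, 110, 110, 120, 150, 160, 210, 240, 440)
)

# bins sets how many bars to draw ...
p_bins <- ggplot(toy, aes(x = cal_per_cup)) +
  geom_histogram(bins = 5)

# ... while binwidth sets how wide each bar is, in the units of x
p_width <- ggplot(toy, aes(x = cal_per_cup)) +
  geom_histogram(binwidth = 50)
```

binwidth is often easier to explain ("each bar covers 50 calories"), whereas bins is convenient for quick exploration.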

Density plot: changes the way the distribution looks, a smoother version of the histogram

cereals %>%
  ggplot(aes(x = cal_per_cup, fill = mfr)) +
  geom_density()
## Warning: Groups with fewer than two data points have been dropped.
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning -
## Inf

Alpha: how see-through do I want the fill to be? (0 = fully transparent, 1 = solid)

cereals %>%
  ggplot(aes(x = cal_per_cup, fill = mfr)) +
  geom_density(alpha = 0.5)
## Warning: Groups with fewer than two data points have been dropped.
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning -
## Inf
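
The "Groups with fewer than two data points have been dropped" warning presumably comes from a manufacturer with only one cereal, since a density curve needs at least two points. One way to avoid it is to filter out such groups before plotting. A sketch on made-up data (toy is hypothetical):

```r
library(ggplot2)
library(dplyr)

# Hypothetical toy data: manufacturer "A" has a single cereal
toy <- tibble(
  mfr         = c("A", "G", "G", "G", "K", "K", "K"),
  cal_per_cup = c(100, 110, 130, 150, 90, 140, 200)
)

# Keep only manufacturers with at least two rows,
# so every remaining group can get a density curve
toy_dense <- toy %>%
  group_by(mfr) %>%
  filter(n() >= 2) %>%
  ungroup()

p <- toy_dense %>%
  ggplot(aes(x = cal_per_cup, fill = mfr)) +
  geom_density(alpha = 0.5)
```

With real data that contains NA values, counting non-missing values instead, for example sum(!is.na(cal_per_cup)) >= 2, may be the safer condition.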

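  4. and 5. Scatter plots and line plots

Here use color instead of fill to change the color (in ggplot, color applies to 1-dimensional things such as points and lines; fill applies to 2-dimensional areas such as bars and boxes)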

cereals %>%
  ggplot(aes(x = sugars, y = calories)) +
  geom_point()
## Warning: Removed 1 rows containing missing values (geom_point).

Use color to change the dot color (points are 1D)

cereals %>%
  ggplot(aes(x = sugars, y = calories, color = mfr)) +
  geom_point()
## Warning: Removed 1 rows containing missing values (geom_point).

The shape of the points can be mapped to manufacturer

cereals %>%
  ggplot(aes(x = sugars, y = calories, shape = mfr)) +
  geom_point()
## Warning: The shape palette can deal with a maximum of 6 discrete values because
## more than 6 becomes difficult to discriminate; you have 7. Consider
## specifying shapes manually if you must have them.
## Warning: Removed 9 rows containing missing values (geom_point).
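
The warning above suggests specifying shapes manually when there are more than 6 categories. A minimal sketch with scale_shape_manual() on made-up data (toy is hypothetical, and the shape codes chosen here are arbitrary):

```r
library(ggplot2)

# Hypothetical toy data: one cereal for each of 7 manufacturers
toy <- data.frame(
  sugars   = c(1, 3, 5, 7, 9, 11, 13),
  calories = c(70, 90, 100, 110, 120, 130, 150),
  mfr      = factor(c("A", "G", "K", "N", "P", "Q", "R"))
)

p <- ggplot(toy, aes(x = sugars, y = calories, shape = mfr)) +
  geom_point(size = 3) +
  # One shape code per manufacturer, so no rows are dropped
  scale_shape_manual(values = c(16, 17, 15, 3, 4, 8, 18))
```

Seven shapes are still hard to tell apart, so mapping mfr to color, as in the previous plot, is often the more readable choice.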

Shape and color together:

cereals %>%
  ggplot(aes(x = sugars, y = calories, color = mfr, shape = type)) +
  geom_point()
## Warning: Removed 1 rows containing missing values (geom_point).

Change the size of every point in the graph (set size outside aes())

cereals %>%
  ggplot(aes(x = sugars, y = calories, color = mfr, shape = type )) +
  geom_point(size = 2)
## Warning: Removed 1 rows containing missing values (geom_point).

Change size inside the aesthetics to map it to a variable, for example cups, to compare serving sizes

cereals %>%
  ggplot(aes(x = sugars, y = calories, color = mfr, shape = type, size = cups)) +
  geom_point()
## Warning: Removed 1 rows containing missing values (geom_point).

Now: the data looks like it is rounded to whole numbers, so points land on top of each other. geom_jitter() randomly shakes the points a little, which makes plots of rounded or repeated values easier to read

cereals %>%
  ggplot(aes(x = sugars, y = calories)) +
  geom_jitter()
## Warning: Removed 1 rows containing missing values (geom_point).
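
How far the points are shaken can be controlled, and a seed makes the jitter reproducible. A sketch using position_jitter() on made-up data (toy and the chosen amounts are hypothetical):

```r
library(ggplot2)

# Hypothetical toy data with many repeated values
toy <- data.frame(
  sugars   = c(3, 3, 3, 6, 6, 9, 9, 9),
  calories = c(110, 110, 110, 120, 120, 100, 100, 100)
)

p <- ggplot(toy, aes(x = sugars, y = calories)) +
  geom_jitter(
    # move points at most 0.3 left/right and 2 up/down; seed fixes the random shifts
    position = position_jitter(width = 0.3, height = 2, seed = 42)
  )
```

Keeping the jitter smaller than half the gap between neighboring values means each point can still be traced back to its original value.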

ggplot cheat sheet: https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf

Multiple plots together: assign the plots to a and b

a <- cereals %>%
  ggplot(aes(x = sugars, y = calories)) +
  geom_point()
b <- cereals %>%
  ggplot(aes(x = sugars, y = calories)) +
  geom_jitter()

library(patchwork)

Compare a and b: with patchwork, a / b stacks a on top of b

a/b
## Warning: Removed 1 rows containing missing values (geom_point).

## Warning: Removed 1 rows containing missing values (geom_point).

facet_wrap(): split one plot into panels; use a categorical variable only

Sugars vs. calories by manufacturer

cereals %>%
  ggplot(aes(x = sugars, y = calories))+
  geom_jitter()+
  facet_wrap(~mfr)
## Warning: Removed 1 rows containing missing values (geom_point).
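
facet_wrap() also takes layout options. The sketch below shows two of them, ncol and scales, on made-up data (toy is hypothetical):

```r
library(ggplot2)

# Hypothetical toy data: three cereals for each of three manufacturers
toy <- data.frame(
  sugars   = c(2, 6, 10, 3, 7, 12, 1, 5, 9),
  calories = c(90, 110, 130, 100, 120, 150, 70, 100, 110),
  mfr      = factor(rep(c("G", "K", "N"), each = 3))
)

p <- ggplot(toy, aes(x = sugars, y = calories)) +
  geom_point() +
  # ncol sets the panel layout; scales = "free_y" gives each panel its own y axis
  facet_wrap(~mfr, ncol = 1, scales = "free_y")
```

Shared axes (the default) make panels directly comparable; free scales are mainly useful when the groups live on very different ranges.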

templates in R Markdown:

Conclusion

What did you learn about cereals? Write a few sentences summarizing your findings, knit your document, and admire your handiwork!