Exploration of Cereal Data

Setup

library(tidyverse)

## Warning in system("timedatectl", intern = TRUE): running command 'timedatectl'
## had status 1

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.1.0     ✓ dplyr   1.0.5
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

After installing the packages, load these:

library(here)

## here() starts at /data/biostat/a089861/A089861/R Trainings

library(knitr)
library(kableExtra)

## 
## Attaching package: 'kableExtra'

## The following object is masked from 'package:dplyr':
## 
##     group_rows

Next, add janitor for data clean-up.

Run install.packages("janitor") once (in the console), then load the packages:

library(gtsummary)
library(janitor)

## 
## Attaching package: 'janitor'

## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

The Data

library(here)

cereals <- read_csv(here("Course #1", "cereals.csv"))

## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   name = col_character(),
##   mfr = col_character(),
##   type = col_character(),
##   calories = col_double(),
##   protein = col_double(),
##   fat = col_double(),
##   sodium = col_double(),
##   fiber = col_double(),
##   carbo = col_double(),
##   sugars = col_double(),
##   potass = col_double(),
##   vitamins = col_double(),
##   shelf = col_double(),
##   weight = col_double(),
##   cups = col_double(),
##   rating = col_double()
## )

Documentation for dataset: https://www.kaggle.com/crawford/80-cereals/version/2

summary(cereals)

Showing the first 6 rows:

head(cereals)

## # A tibble: 6 x 16
##   name       mfr   type  calories protein   fat sodium fiber carbo sugars potass
##   <chr>      <chr> <chr>    <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>  <dbl>  <dbl>
## 1 100% Bran  N     C           70       4     1    130  10     5        6    280
## 2 100% Natu… Q     C          120       3     5     15   2     8        8    135
## 3 All-Bran   K     C           70       4     1    260   9     7        5    320
## 4 All-Bran … K     C           50       4     0    140  14     8        0    330
## 5 Almond De… R     C          110       2     2    200   1    14        8     -1
## 6 Apple Cin… G     C          110       2     2    180   1.5  10.5     10     70
## # … with 5 more variables: vitamins <dbl>, shelf <dbl>, weight <dbl>,
## #   cups <dbl>, rating <dbl>

[Briefly summarize the dataset here.] There are 77 rows. Each row is a cereal.

Categorical: manufacturer (mfr), type (hot/cold)

Quantitative: nutrient measures

[CHECKPOINT: Knit your Markdown file!]

Data Cleaning and Transformation

Clean the data.

Step 1: look for missing data. In this dataset NA is not used; missing values are coded as -1 (for example, in the potassium column).

cereals <- 
  cereals %>%
  drop_na()
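A minimal sketch of why drop_na() alone does not help here. The toy tibble and its values are made up for illustration; only the -1 coding comes from the cereal data.

```r
library(dplyr)
library(tidyr)

# Toy data (hypothetical values): -1 marks a missing potassium reading
toy <- tibble(
  name   = c("Cereal A", "Cereal B", "Cereal C"),
  potass = c(280, -1, 70)
)

# drop_na() removes nothing, because -1 is an ordinary number, not NA
nrow(drop_na(toy))

# Count the -1 values in every numeric column
sentinel_counts <- toy %>%
  summarize(across(where(is.numeric), ~ sum(.x == -1)))
sentinel_counts

# After converting -1 to NA, drop_na() removes the affected row
toy_clean <- toy %>%
  mutate(potass = na_if(potass, -1))
nrow(drop_na(toy_clean))
```

Counting the -1 values first shows which columns need recoding before any rows are dropped.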

When this doesn't work, R may be having trouble finding the right command, for example because two packages define a function with the same name. To fix it:

1. install.packages("conflicted")
2. library(conflicted)

## Warning: package 'conflicted' was built under R version 4.0.5
## Warning: namespace 'cachem' is not available and has been replaced
## by .GlobalEnv when processing object ''

3. conflict_prefer("select", "dplyr")

library(conflicted)
conflict_prefer("select", "dplyr")

## [conflicted] Will prefer dplyr::select over any other package

Try again to clean the data by handling the missing values. Do not just use drop_na()!

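In cereals, -1 means missing data, so change -1 to NA. Type ?replace_na in the console (a ? in front of a name opens its help page). na_if(vector, value to replace with NA) does the conversion, e.g. for potassium. It can be applied to a single column, or to every column that passes a check such as is.numeric.

##clean the data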

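# Convert the categorical columns to factors and recode -1 as NA
cereals <- cereals %>%
  mutate(
    mfr = factor(mfr),
    type = factor(type),
    potass = na_if(potass, -1)   # a single column
  ) %>%
  # every numeric column (carbo and sugars also use -1)
  mutate_if(is.numeric,
            ~na_if(.x, -1))

head(cereals)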

## # A tibble: 6 x 16
##   name       mfr   type  calories protein   fat sodium fiber carbo sugars potass
##   <chr>      <fct> <fct>    <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>  <dbl>  <dbl>
## 1 100% Bran  N     C           70       4     1    130  10     5        6    280
## 2 100% Natu… Q     C          120       3     5     15   2     8        8    135
## 3 All-Bran   K     C           70       4     1    260   9     7        5    320
## 4 All-Bran … K     C           50       4     0    140  14     8        0    330
## 5 Almond De… R     C          110       2     2    200   1    14        8     NA
## 6 Apple Cin… G     C          110       2     2    180   1.5  10.5     10     70
## # … with 5 more variables: vitamins <dbl>, shelf <dbl>, weight <dbl>,
## #   cups <dbl>, rating <dbl>

Run summary() to check that the NAs are accounted for.

The -1 values have now been converted to NA, and we can filter out the NAs if needed.

summary(cereals)

##      name           mfr    type      calories        protein     
##  Length:77          A: 1   C:74   Min.   : 50.0   Min.   :1.000  
##  Class :character   G:22   H: 3   1st Qu.:100.0   1st Qu.:2.000  
##  Mode  :character   K:23          Median :110.0   Median :3.000  
##                     N: 6          Mean   :106.9   Mean   :2.545  
##                     P: 9          3rd Qu.:110.0   3rd Qu.:3.000  
##                     Q: 8          Max.   :160.0   Max.   :6.000  
##                     R: 8                                         
##       fat            sodium          fiber            carbo     
##  Min.   :0.000   Min.   :  0.0   Min.   : 0.000   Min.   : 5.0  
##  1st Qu.:0.000   1st Qu.:130.0   1st Qu.: 1.000   1st Qu.:12.0  
##  Median :1.000   Median :180.0   Median : 2.000   Median :14.5  
##  Mean   :1.013   Mean   :159.7   Mean   : 2.152   Mean   :14.8  
##  3rd Qu.:2.000   3rd Qu.:210.0   3rd Qu.: 3.000   3rd Qu.:17.0  
##  Max.   :5.000   Max.   :320.0   Max.   :14.000   Max.   :23.0  
##                                                   NA's   :1     
##      sugars           potass          vitamins          shelf      
##  Min.   : 0.000   Min.   : 15.00   Min.   :  0.00   Min.   :1.000  
##  1st Qu.: 3.000   1st Qu.: 42.50   1st Qu.: 25.00   1st Qu.:1.000  
##  Median : 7.000   Median : 90.00   Median : 25.00   Median :2.000  
##  Mean   : 7.026   Mean   : 98.67   Mean   : 28.25   Mean   :2.208  
##  3rd Qu.:11.000   3rd Qu.:120.00   3rd Qu.: 25.00   3rd Qu.:3.000  
##  Max.   :15.000   Max.   :330.00   Max.   :100.00   Max.   :3.000  
##  NA's   :1        NA's   :2                                        
##      weight          cups           rating     
##  Min.   :0.50   Min.   :0.250   Min.   :18.04  
##  1st Qu.:1.00   1st Qu.:0.670   1st Qu.:33.17  
##  Median :1.00   Median :0.750   Median :40.40  
##  Mean   :1.03   Mean   :0.821   Mean   :42.67  
##  3rd Qu.:1.00   3rd Qu.:1.000   3rd Qu.:50.83  
##  Max.   :1.50   Max.   :1.500   Max.   :93.70  
##

Finding the mean, median, etc. with a pipe

#### Write code to show the mean, median, and sd of sugar content per serving of all cereals

Summaries of a column that contains NAs return NA, as shown below:

cereals %>% 
  summarize(
    mean_sug = mean(sugars),
    median_sug = median(sugars),
    sd_sug = sd(sugars)
  )

## # A tibble: 1 x 3
##   mean_sug median_sug sd_sug
##      <dbl>      <dbl>  <dbl>
## 1       NA         NA     NA

We need to exclude the NA values with na.rm = TRUE:

cereals %>% 
  summarize(
    mean_sug = mean(sugars, na.rm = TRUE),
    median_sug = median(sugars, na.rm = TRUE),
    sd_sug = sd(sugars, na.rm = TRUE)
  )

## # A tibble: 1 x 3
##   mean_sug median_sug sd_sug
##      <dbl>      <dbl>  <dbl>
## 1     7.03          7   4.38

#### Write code to show the total calories of all cereals
cereals %>%
  summarize(sum(calories))

## # A tibble: 1 x 1
##   `sum(calories)`
##             <dbl>
## 1            8230

Making a new variable to show calories per cup

#### Write code to create the variable "cal_per_cup" here

This didn't work (the assignment is missing):

cereals %>% mutate(cal_per_cup = calories/cups)

cereals %>% summarize(mean(cal_per_cup))

It did not work because the result of mutate() was not saved (assigned), so summarize() cannot find cal_per_cup. Sometimes we don't want to assign a result back to cereals (for example the mean, median, and sd summaries), but a new column that should be permanent has to overwrite the data.

Try saving the result with "cereals <-":

cereals <- cereals %>%
  mutate(cal_per_cup = calories/cups)

cereals %>%
  summarize(mean(cal_per_cup))

## # A tibble: 1 x 1
##   `mean(cal_per_cup)`
##                 <dbl>
## 1                143.
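A small sketch of the difference between printing a mutate() result and assigning it. The toy tibble is hypothetical and stands in for cereals.

```r
library(dplyr)

# Toy data (hypothetical values)
toy <- tibble(calories = c(100, 120), cups = c(0.5, 1))

# Without assignment the new column is only printed, not stored
toy %>% mutate(cal_per_cup = calories / cups)
"cal_per_cup" %in% names(toy)   # FALSE

# Assigning the result back keeps the column
toy <- toy %>% mutate(cal_per_cup = calories / cups)
"cal_per_cup" %in% names(toy)   # TRUE

# Now the summary can find the column
toy %>% summarize(mean_cal_per_cup = mean(cal_per_cup))
```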

Write code to include only Kellogg brand cereals, and only relevant columns

Rows for the Kellogg brand: use filter(). Columns we need: use select().

Adding cereals_k <- saves the result to the environment under a new name, cereals_k, so we can work with that data later. We are going to use filter(); we want the dplyr version, not stats::filter().

Going back to dplyr:

dplyr::filter(cereals)

## # A tibble: 77 x 17
##    name      mfr   type  calories protein   fat sodium fiber carbo sugars potass
##    <chr>     <fct> <fct>    <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>  <dbl>  <dbl>
##  1 100% Bran N     C           70       4     1    130  10     5        6    280
##  2 100% Nat… Q     C          120       3     5     15   2     8        8    135
##  3 All-Bran  K     C           70       4     1    260   9     7        5    320
##  4 All-Bran… K     C           50       4     0    140  14     8        0    330
##  5 Almond D… R     C          110       2     2    200   1    14        8     NA
##  6 Apple Ci… G     C          110       2     2    180   1.5  10.5     10     70
##  7 Apple Ja… K     C          110       2     0    125   1    11       14     30
##  8 Basic 4   G     C          130       3     2    210   2    18        8    100
##  9 Bran Chex R     C           90       2     1    200   4    15        6    125
## 10 Bran Fla… P     C           90       3     0    210   5    13        5    190
## # … with 67 more rows, and 6 more variables: vitamins <dbl>, shelf <dbl>,
## #   weight <dbl>, cups <dbl>, rating <dbl>, cal_per_cup <dbl>

then try again

conflict_prefer("filter", "dplyr")

## [conflicted] Will prefer dplyr::filter over any other package

cereals_k <- cereals %>%
  filter(mfr=="K") %>%
  select(name, mfr, cal_per_cup)
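As an alternative to conflict_prefer(), the package can be named explicitly on each call. This is only a sketch on a made-up tibble, not a replacement for the code above.

```r
library(dplyr)

# Toy data (hypothetical values)
toy <- tibble(
  name = c("Cereal A", "Cereal B", "Cereal C"),
  mfr  = c("K", "G", "K")
)

# The package::function form always calls the dplyr versions,
# regardless of which other packages are attached
toy_k <- toy %>%
  dplyr::filter(mfr == "K") %>%
  dplyr::select(name, mfr)

toy_k
```

Writing dplyr:: every time is more typing, so conflict_prefer() is handy when the same function is used many times in one document.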

Write code to sort the dataset by calories per cup

desc() = descending; arrange() sorts in ascending order by default

cereals_k %>%
  arrange(desc(cal_per_cup))

## # A tibble: 23 x 3
##    name                      mfr   cal_per_cup
##    <chr>                     <fct>       <dbl>
##  1 Mueslix Crispy Blend      K            239.
##  2 Cracklin' Oat Bran        K            220 
##  3 All-Bran                  K            212.
##  4 Nutri-Grain Almond-Raisin K            209.
##  5 Just Right Fruit & Nut    K            187.
##  6 Raisin Squares            K            180 
##  7 Fruitful Bran             K            179.
##  8 Nut&Honey Crunch          K            179.
##  9 Raisin Bran               K            160 
## 10 Frosted Flakes            K            147.
## # … with 13 more rows

You can also ask for just the top 3 with top_n(3, cal_per_cup):

cereals_k %>%
  top_n(3, cal_per_cup)

## # A tibble: 3 x 3
##   name                 mfr   cal_per_cup
##   <chr>                <fct>       <dbl>
## 1 All-Bran             K            212.
## 2 Cracklin' Oat Bran   K            220 
## 3 Mueslix Crispy Blend K            239.
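In dplyr 1.0.0 and later, top_n() is superseded by slice_max(), which also returns the rows sorted from largest to smallest. A sketch on a made-up tibble:

```r
library(dplyr)

# Toy data (hypothetical values)
toy <- tibble(
  name        = c("A", "B", "C", "D"),
  cal_per_cup = c(150, 239, 212, 220)
)

# Keep the 3 rows with the largest cal_per_cup, largest first
top3 <- toy %>%
  slice_max(cal_per_cup, n = 3)

top3
```

Unlike the top_n() output above, the rows come back in descending order, so no extra arrange() is needed.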

Use all the steps as one pipeline, starting with the cleaning step:

#### Combine all steps into one pipeline
cereals %>%
  mutate(
    mfr = factor(mfr),
    type = factor(type),
    potass = na_if(potass, -1)
  ) %>%
  mutate_if(is.numeric,
            ~na_if(.x, -1)) %>%
  mutate(cal_per_cup = calories/cups) %>%
  filter(mfr == "K") %>%
  select(name, mfr, cal_per_cup) %>%
  top_n(3, cal_per_cup)

## # A tibble: 3 x 3
##   name                 mfr   cal_per_cup
##   <chr>                <fct>       <dbl>
## 1 All-Bran             K            212.
## 2 Cracklin' Oat Bran   K            220 
## 3 Mueslix Crispy Blend K            239.

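Grouping and summarizing: to get calories per cup for General Mills we could take the dataset, filter for General Mills, summarize, and then repeat for every other manufacturer by copy-pasting the code. group_by(mfr) does this for all manufacturers at once.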

## Average calories per cup for each manufacturer at the same time
cereals %>%
  group_by(mfr) %>%
  summarize(mean(cal_per_cup))

## # A tibble: 7 x 2
##   mfr   `mean(cal_per_cup)`
##   <fct>               <dbl>
## 1 A                    100 
## 2 G                    138.
## 3 K                    145.
## 4 N                    125.
## 5 P                    195.
## 6 Q                    125.
## 7 R                    134.
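To see that group_by() gives the same answer as filtering one manufacturer at a time, here is a sketch on a made-up tibble. The column names n_cereals and avg_cal are chosen for illustration.

```r
library(dplyr)

# Toy data (hypothetical values)
toy <- tibble(
  mfr         = c("G", "G", "K", "K"),
  cal_per_cup = c(100, 140, 150, 170)
)

# One manufacturer at a time: filter, then summarize
g_only <- toy %>%
  filter(mfr == "G") %>%
  summarize(avg_cal = mean(cal_per_cup))

# All manufacturers at once, with the group size alongside the mean
by_mfr <- toy %>%
  group_by(mfr) %>%
  summarize(n_cereals = n(), avg_cal = mean(cal_per_cup))

g_only
by_mfr
```

Including n() is useful because a mean based on a single cereal (like manufacturer A in the real data) deserves less weight than one based on twenty.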

Top 3 calories per cup:

cereals %>%
  group_by(mfr) %>%
  top_n(3, cal_per_cup)

## # A tibble: 21 x 17
## # Groups:   mfr [7]
##    name      mfr   type  calories protein   fat sodium fiber carbo sugars potass
##    <chr>     <fct> <fct>    <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>  <dbl>  <dbl>
##  1 100% Bran N     C           70       4     1    130    10     5      6    280
##  2 All-Bran  K     C           70       4     1    260     9     7      5    320
##  3 Cap'n'Cr… Q     C          120       1     2    220     0    12     12     35
##  4 Clusters  G     C          110       3     2    140     2    13      7    105
##  5 Cracklin… K     C          110       3     3    140     4    10      7    160
##  6 Fruit & … P     C          120       3     2    160     5    12     10    200
##  7 Grape-Nu… P     C          110       3     0    170     3    17      3     90
##  8 Great Gr… P     C          120       3     3     75     3    13      4    100
##  9 Life      Q     C          100       4     2    150     2    12      6     95
## 10 Maypo     A     H          100       4     1      0     0    16      3     95
## # … with 11 more rows, and 6 more variables: vitamins <dbl>, shelf <dbl>,
## #   weight <dbl>, cups <dbl>, rating <dbl>, cal_per_cup <dbl>

But the result is not sorted at all! It only retains the top 3 from each manufacturer.
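One way to organize a grouped top-n result is to sort it afterwards by manufacturer and then by calories per cup. A sketch on a made-up tibble, using top 2 instead of top 3 to keep it short:

```r
library(dplyr)

# Toy data (hypothetical values)
toy <- tibble(
  name        = c("a1", "a2", "a3", "b1", "b2", "b3"),
  mfr         = c("K", "G", "K", "G", "K", "G"),
  cal_per_cup = c(110, 90, 200, 150, 160, 120)
)

top2_sorted <- toy %>%
  group_by(mfr) %>%
  top_n(2, cal_per_cup) %>%          # top 2 within each manufacturer
  ungroup() %>%
  arrange(mfr, desc(cal_per_cup))    # manufacturer first, then highest first

top2_sorted
```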

[CHECKPOINT: Knit your document!]

Make it pretty

cereals %>%
  ggplot(aes(x = mfr , y = cal_per_cup, fill = mfr)) +
  geom_boxplot()

Scales: https://ggplot2-book.org/scale-position.html

Themes: https://ggplot2-book.org/polishing.html

Add themes

cereals %>%
  ggplot(aes(x = mfr , y = cal_per_cup, fill = mfr)) +
  geom_boxplot()+
  theme_classic()

coord_flip() swaps the x and y coordinates, so the boxplots switch between vertical and horizontal:

cereals %>%
  ggplot(aes(x = mfr , y = cal_per_cup, fill = mfr)) +
  geom_boxplot()+
  theme_classic()+
  coord_flip()

Making a bar plot (number of cereals per manufacturer)

cereals %>%
  ggplot(aes(x = mfr)) +
  geom_bar()

add color

cereals %>%
  ggplot(aes(x = mfr, fill = mfr)) +
  geom_bar()

Finding the mean of cal_per_cup for each manufacturer

First, compute it as a data frame:

cereals %>%
  group_by(mfr) %>%
  summarize(avg_cal = mean(cal_per_cup))

## # A tibble: 7 x 2
##   mfr   avg_cal
##   <fct>   <dbl>
## 1 A        100 
## 2 G        138.
## 3 K        145.
## 4 N        125.
## 5 P        195.
## 6 Q        125.
## 7 R        134.

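To put the average cal_per_cup on the bar plot we need y = avg_cal. geom_bar() counts rows and, with its default settings, does not accept a y aesthetic together with x, so use geom_col() instead.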

Now ggplot can plot it out

cereals %>%
  group_by(mfr) %>%
  summarize(avg_cal = mean(cal_per_cup)) %>%
  ggplot(aes(x = mfr, fill = mfr, y = avg_cal))+
  geom_col()

How to use a bar plot to compare two categorical variables: manufacturer and hot or cold cereal. Both are categorical and we are not supplying a y, so we use geom_bar() to see whether the two categorical variables depend on each other.

cereals %>%
  ggplot(aes(x = mfr, fill = type)) +
  geom_bar(position = "fill")
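The proportions that position = "fill" draws can also be computed as a table. This is a sketch on a made-up tibble; the column name prop is chosen for illustration.

```r
library(dplyr)

# Toy data (hypothetical values): C = cold, H = hot
toy <- tibble(
  mfr  = c("K", "K", "K", "K", "Q", "Q"),
  type = c("C", "C", "C", "H", "C", "H")
)

type_props <- toy %>%
  count(mfr, type) %>%          # number of cereals per manufacturer and type
  group_by(mfr) %>%
  mutate(prop = n / sum(n)) %>% # share of each type within a manufacturer
  ungroup()

type_props
```

If the proportions of hot and cold look about the same for every manufacturer, the two variables are probably not related; large differences suggest they are.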

Conclusion

What did you learn about cereals? Write a few sentences summarizing your findings, knit your document, and admire your handiwork!
