library(tidyverse)
## Warning in system("timedatectl", intern = TRUE): running command 'timedatectl'
## had status 1
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.1.0 ✓ dplyr 1.0.5
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
After the tidyverse is installed and loaded, load these packages:
library(here)
## here() starts at /data/biostat/a089861/A089861/R Trainings
library(knitr)
library(kableExtra)
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
Next, add janitor for data clean-up. Run install.packages("janitor") once in the console, then load it along with gtsummary:
library(gtsummary)
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(here)
cereals <- read_csv(here("Course #1", "cereals.csv"))
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## name = col_character(),
## mfr = col_character(),
## type = col_character(),
## calories = col_double(),
## protein = col_double(),
## fat = col_double(),
## sodium = col_double(),
## fiber = col_double(),
## carbo = col_double(),
## sugars = col_double(),
## potass = col_double(),
## vitamins = col_double(),
## shelf = col_double(),
## weight = col_double(),
## cups = col_double(),
## rating = col_double()
## )
Documentation for dataset: https://www.kaggle.com/crawford/80-cereals/version/2
summary(cereals)
Show the first 6 rows:
head(cereals)
## # A tibble: 6 x 16
## name mfr type calories protein fat sodium fiber carbo sugars potass
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 100% Bran N C 70 4 1 130 10 5 6 280
## 2 100% Natu… Q C 120 3 5 15 2 8 8 135
## 3 All-Bran K C 70 4 1 260 9 7 5 320
## 4 All-Bran … K C 50 4 0 140 14 8 0 330
## 5 Almond De… R C 110 2 2 200 1 14 8 -1
## 6 Apple Cin… G C 110 2 2 180 1.5 10.5 10 70
## # … with 5 more variables: vitamins <dbl>, shelf <dbl>, weight <dbl>,
## # cups <dbl>, rating <dbl>
[Briefly summarize the dataset here.] There are 77 rows; each row is a cereal.
Categorical variables: manufacturer (mfr), type (hot/cold). Quantitative variables: nutrient measures.
[CHECKPOINT: Knit your Markdown file!]
Step 1: look for missing data. In this dataset NA is not used; a missing value is coded as -1, for example in the potassium column. Because there are no NAs yet, drop_na() removes nothing:
cereals <-
cereals %>%
drop_na()
If this does not behave as expected, R may be finding a function of the same name in a different package. The conflicted package helps with this:
1. install.packages("conflicted")
2. library(conflicted)
## Warning: package 'conflicted' was built under R version 4.0.5
## Warning: namespace 'cachem' is not available and has been replaced
## by .GlobalEnv when processing object ‘
library(conflicted)
conflict_prefer("select", "dplyr")
## [conflicted] Will prefer dplyr::select over any other package
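As an alternative to setting a preference, a function can be called with its package prefix. This is a minimal sketch on a made-up tibble (the `toy` data and the `high_sugar` name are hypothetical, not part of the cereals file):

```r
library(dplyr)

# Hypothetical toy data, only for illustration
toy <- tibble(name = c("a", "b", "c"), sugars = c(3, 8, 12))

# The pkg::function form always picks the dplyr version,
# no matter which other packages are loaded
high_sugar <- dplyr::filter(toy, sugars > 5)
high_sugar
```

The prefix is handy for a one-off call; conflict_prefer() saves typing when the same function is used many times.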
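Try again to clean the data by handling the missing values. Do not just use drop_na()! It only drops rows that contain NA, and here missing values are coded as -1.
In cereals, -1 means missing data, so change -1 to NA. Typing ?replace_na in the console opens a help page (a ? in front of a name means help). na_if(vector, value to replace with NA) converts a chosen value to NA, e.g. in potassium; it can be applied to a single column or variable. Check the values afterwards.
#### Clean the data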
cereals <- cereals %>%
  mutate(
    mfr = factor(mfr),
    type = factor(type),
    potass = na_if(potass, -1)
  ) %>%
  mutate_if(is.numeric,
            ~na_if(.x, -1))
head(cereals)
## # A tibble: 6 x 16
## name mfr type calories protein fat sodium fiber carbo sugars potass
## <chr> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 100% Bran N C 70 4 1 130 10 5 6 280
## 2 100% Natu… Q C 120 3 5 15 2 8 8 135
## 3 All-Bran K C 70 4 1 260 9 7 5 320
## 4 All-Bran … K C 50 4 0 140 14 8 0 330
## 5 Almond De… R C 110 2 2 200 1 14 8 NA
## 6 Apple Cin… G C 110 2 2 180 1.5 10.5 10 70
## # … with 5 more variables: vitamins <dbl>, shelf <dbl>, weight <dbl>,
## # cups <dbl>, rating <dbl>
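The same -1 to NA conversion can probably be written with across(), which dplyr 1.0.0 and later offer in place of the superseded mutate_if(). The sketch below uses a made-up tibble (`toy` and `toy_clean` are hypothetical names), not the real cereals data:

```r
library(dplyr)

# Hypothetical toy data where -1 marks a missing value
toy <- tibble(
  name   = c("a", "b", "c"),
  sugars = c(6, -1, 8),
  potass = c(280, 135, -1)
)

# Convert -1 to NA in every numeric column in one step;
# the character column is left alone
toy_clean <- toy %>%
  mutate(across(where(is.numeric), ~ na_if(.x, -1)))

toy_clean
```

Because every numeric column is covered, a separate na_if() line for a single column is not needed in this form.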
Run summary() to check that the NAs are accounted for.
The -1 values have now been converted to NA, so we can filter out the NAs if needed.
summary(cereals)
## name mfr type calories protein
## Length:77 A: 1 C:74 Min. : 50.0 Min. :1.000
## Class :character G:22 H: 3 1st Qu.:100.0 1st Qu.:2.000
## Mode :character K:23 Median :110.0 Median :3.000
## N: 6 Mean :106.9 Mean :2.545
## P: 9 3rd Qu.:110.0 3rd Qu.:3.000
## Q: 8 Max. :160.0 Max. :6.000
## R: 8
## fat sodium fiber carbo
## Min. :0.000 Min. : 0.0 Min. : 0.000 Min. : 5.0
## 1st Qu.:0.000 1st Qu.:130.0 1st Qu.: 1.000 1st Qu.:12.0
## Median :1.000 Median :180.0 Median : 2.000 Median :14.5
## Mean :1.013 Mean :159.7 Mean : 2.152 Mean :14.8
## 3rd Qu.:2.000 3rd Qu.:210.0 3rd Qu.: 3.000 3rd Qu.:17.0
## Max. :5.000 Max. :320.0 Max. :14.000 Max. :23.0
## NA's :1
## sugars potass vitamins shelf
## Min. : 0.000 Min. : 15.00 Min. : 0.00 Min. :1.000
## 1st Qu.: 3.000 1st Qu.: 42.50 1st Qu.: 25.00 1st Qu.:1.000
## Median : 7.000 Median : 90.00 Median : 25.00 Median :2.000
## Mean : 7.026 Mean : 98.67 Mean : 28.25 Mean :2.208
## 3rd Qu.:11.000 3rd Qu.:120.00 3rd Qu.: 25.00 3rd Qu.:3.000
## Max. :15.000 Max. :330.00 Max. :100.00 Max. :3.000
## NA's :1 NA's :2
## weight cups rating
## Min. :0.50 Min. :0.250 Min. :18.04
## 1st Qu.:1.00 1st Qu.:0.670 1st Qu.:33.17
## Median :1.00 Median :0.750 Median :40.40
## Mean :1.03 Mean :0.821 Mean :42.67
## 3rd Qu.:1.00 3rd Qu.:1.000 3rd Qu.:50.83
## Max. :1.50 Max. :1.500 Max. :93.70
##
Finding the mean, median, etc. with the pipe
#### Write code to show the mean and median and sd of sugar content per serving of all cereals
Summaries of a column that contains NAs return NA, as below:
cereals %>%
summarize(
mean_sug = mean(sugars),
median_sug = median(sugars),
sd_sug = sd(sugars)
)
## # A tibble: 1 x 3
## mean_sug median_sug sd_sug
## <dbl> <dbl> <dbl>
## 1 NA NA NA
We need to exclude the NA values with na.rm = TRUE:
cereals %>%
summarize(
mean_sug = mean(sugars, na.rm = TRUE),
median_sug = median(sugars, na.rm = TRUE),
sd_sug = sd(sugars, na.rm = TRUE)
)
## # A tibble: 1 x 3
## mean_sug median_sug sd_sug
## <dbl> <dbl> <dbl>
## 1 7.03 7 4.38
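Before dropping NAs with na.rm = TRUE, it can help to count how many values are missing. A minimal sketch on made-up numbers (`toy` and `sugar_summary` are hypothetical names):

```r
library(dplyr)

# Hypothetical toy data with one missing sugar value
toy <- tibble(sugars = c(6, NA, 8, 10))

sugar_summary <- toy %>%
  summarize(
    n_missing = sum(is.na(sugars)),          # how many values are NA
    mean_sug  = mean(sugars, na.rm = TRUE)   # mean of the non-missing values
  )
sugar_summary
```

Reporting the count next to the mean makes it clear how many rows the summary is actually based on.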
#### Write code to show the total calories of all cereals
cereals %>%
summarize(sum(calories))
## # A tibble: 1 x 1
## `sum(calories)`
## <dbl>
## 1 8230
Make a new variable that shows calories per cup.
#### Write code to create the variable "cal_per_cup" here
This did not work:
cereals %>% mutate(cal_per_cup = calories/cups)
cereals %>% summarize(mean(cal_per_cup))
The summary could not run because the result of mutate() was not saved or assigned, so cal_per_cup did not exist in cereals. Sometimes we do not want to assign a result back to cereals (for example a mean, median, or sd summary), but a new column that should be permanent has to overwrite the data. To save it, start the line with "cereals <-":
cereals <- cereals %>%
  mutate(cal_per_cup = calories/cups)
cereals %>%
summarize(mean(cal_per_cup))
## # A tibble: 1 x 1
## `mean(cal_per_cup)`
## <dbl>
## 1 143.
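The difference between printing a mutated table and saving it can be seen on a small made-up example (`toy` is a hypothetical tibble, not the cereals data):

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(calories = c(70, 120), cups = c(0.5, 1))

# Not assigned: the new column is printed, then thrown away
toy %>% mutate(cal_per_cup = calories / cups)
"cal_per_cup" %in% names(toy)

# Assigned: the new column is kept in toy
toy <- toy %>% mutate(cal_per_cup = calories / cups)
"cal_per_cup" %in% names(toy)
```

The first check is FALSE and the second is TRUE, which is why the summary only works after the assignment.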
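To keep only the rows for the Kellogg's brand, use filter(); to keep only the columns we need, use select().
Adding an assignment (<-) saves the result to the environment so we can work with that data later; here it is saved under a new name, cereals_k. We are going to use filter(), and we want the dplyr version rather than stats::filter().
Going back to dplyr: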
dplyr::filter(cereals)
## # A tibble: 77 x 17
## name mfr type calories protein fat sodium fiber carbo sugars potass
## <chr> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 100% Bran N C 70 4 1 130 10 5 6 280
## 2 100% Nat… Q C 120 3 5 15 2 8 8 135
## 3 All-Bran K C 70 4 1 260 9 7 5 320
## 4 All-Bran… K C 50 4 0 140 14 8 0 330
## 5 Almond D… R C 110 2 2 200 1 14 8 NA
## 6 Apple Ci… G C 110 2 2 180 1.5 10.5 10 70
## 7 Apple Ja… K C 110 2 0 125 1 11 14 30
## 8 Basic 4 G C 130 3 2 210 2 18 8 100
## 9 Bran Chex R C 90 2 1 200 4 15 6 125
## 10 Bran Fla… P C 90 3 0 210 5 13 5 190
## # … with 67 more rows, and 6 more variables: vitamins <dbl>, shelf <dbl>,
## # weight <dbl>, cups <dbl>, rating <dbl>, cal_per_cup <dbl>
Then set the preference for dplyr's filter() and try again:
conflict_prefer("filter", "dplyr")
## [conflicted] Will prefer dplyr::filter over any other package
cereals_k <- cereals %>%
filter(mfr=="K") %>%
select(name, mfr, cal_per_cup)
desc() sorts in descending order; arrange() sorts in ascending order by default.
cereals_k %>%
arrange(desc(cal_per_cup))
## # A tibble: 23 x 3
## name mfr cal_per_cup
## <chr> <fct> <dbl>
## 1 Mueslix Crispy Blend K 239.
## 2 Cracklin' Oat Bran K 220
## 3 All-Bran K 212.
## 4 Nutri-Grain Almond-Raisin K 209.
## 5 Just Right Fruit & Nut K 187.
## 6 Raisin Squares K 180
## 7 Fruitful Bran K 179.
## 8 Nut&Honey Crunch K 179.
## 9 Raisin Bran K 160
## 10 Frosted Flakes K 147.
## # … with 13 more rows
You can also ask for only the top 3 with top_n(3, cal_per_cup):
cereals_k %>%
top_n(3, cal_per_cup)
## # A tibble: 3 x 3
## name mfr cal_per_cup
## <chr> <fct> <dbl>
## 1 All-Bran K 212.
## 2 Cracklin' Oat Bran K 220
## 3 Mueslix Crispy Blend K 239.
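In dplyr 1.0.0 and later, top_n() is superseded by slice_max(), which also returns the rows sorted from highest to lowest. A minimal sketch on made-up values (`toy` and `top3` are hypothetical names):

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  name        = c("a", "b", "c", "d"),
  cal_per_cup = c(212, 220, 147, 239)
)

# Keep the 3 rows with the largest cal_per_cup, highest first
top3 <- toy %>% slice_max(cal_per_cup, n = 3)
top3
```

top_n() still works, so this is an option rather than a required change.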
Use all the steps as one pipeline, starting with the cleaning step.
#### Combine all steps into one pipeline
cereals %>%
  mutate(
    mfr = factor(mfr),
    type = factor(type),
    potass = na_if(potass, -1)
  ) %>%
  mutate_if(is.numeric,
            ~na_if(.x, -1)) %>%
  mutate(cal_per_cup = calories/cups) %>%
  filter(mfr == "K") %>%
  select(name, mfr, cal_per_cup) %>%
  top_n(3, cal_per_cup)
## # A tibble: 3 x 3
## name mfr cal_per_cup
## <chr> <fct> <dbl>
## 1 All-Bran K 212.
## 2 Cracklin' Oat Bran K 220
## 3 Mueslix Crispy Blend K 239.
Grouping and summarizing: to get calories per cup for General Mills, we could take the dataset, filter for General Mills, and so on for each manufacturer, copy-pasting the code each time. Instead, use group_by(mfr).
#### Average calories per cup for each manufacturer at the same time
cereals %>%
group_by(mfr)%>%
summarize(mean(cal_per_cup))
## # A tibble: 7 x 2
## mfr `mean(cal_per_cup)`
## <fct> <dbl>
## 1 A 100
## 2 G 138.
## 3 K 145.
## 4 N 125.
## 5 P 195.
## 6 Q 125.
## 7 R 134.
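A grouped summary can return several named columns at once, for instance the group size next to the mean. A minimal sketch on made-up data (`toy`, `by_mfr`, `n_cereals`, and `avg_cal` are hypothetical names):

```r
library(dplyr)

# Hypothetical toy data: two manufacturers
toy <- tibble(
  mfr         = c("G", "G", "K", "K", "K"),
  cal_per_cup = c(100, 140, 110, 150, 220)
)

by_mfr <- toy %>%
  group_by(mfr) %>%
  summarize(
    n_cereals = n(),               # rows in each group
    avg_cal   = mean(cal_per_cup)  # mean within each group
  )
by_mfr
```

Naming the columns avoids headers such as `mean(cal_per_cup)` and makes the result easier to reuse in later steps.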
Top 3 calories per cup:
cereals %>%
group_by(mfr) %>%
top_n(3, cal_per_cup)
## # A tibble: 21 x 17
## # Groups: mfr [7]
## name mfr type calories protein fat sodium fiber carbo sugars potass
## <chr> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 100% Bran N C 70 4 1 130 10 5 6 280
## 2 All-Bran K C 70 4 1 260 9 7 5 320
## 3 Cap'n'Cr… Q C 120 1 2 220 0 12 12 35
## 4 Clusters G C 110 3 2 140 2 13 7 105
## 5 Cracklin… K C 110 3 3 140 4 10 7 160
## 6 Fruit & … P C 120 3 2 160 5 12 10 200
## 7 Grape-Nu… P C 110 3 0 170 3 17 3 90
## 8 Great Gr… P C 120 3 3 75 3 13 4 100
## 9 Life Q C 100 4 2 150 2 12 6 95
## 10 Maypo A H 100 4 1 0 0 16 3 95
## # … with 11 more rows, and 6 more variables: vitamins <dbl>, shelf <dbl>,
## # weight <dbl>, cups <dbl>, rating <dbl>, cal_per_cup <dbl>
The result is not sorted at all, but it does retain the top 3 from each manufacturer.
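One way to get an organized result is to sort after taking the top rows per group and keep only the columns of interest. A minimal sketch on made-up data, using slice_max() in place of top_n() (`toy` and `top2_sorted` are hypothetical names):

```r
library(dplyr)

# Hypothetical toy data: three cereals for each of two manufacturers
toy <- tibble(
  mfr         = c("G", "G", "G", "K", "K", "K"),
  name        = c("g1", "g2", "g3", "k1", "k2", "k3"),
  cal_per_cup = c(100, 140, 120, 110, 220, 150)
)

top2_sorted <- toy %>%
  group_by(mfr) %>%
  slice_max(cal_per_cup, n = 2) %>%     # top 2 within each manufacturer
  ungroup() %>%
  arrange(mfr, desc(cal_per_cup)) %>%   # by manufacturer, highest first
  select(mfr, name, cal_per_cup)
top2_sorted
```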
[CHECKPOINT: Knit your document!]
Make it pretty with a plot:
cereals %>%
ggplot(aes(x = mfr , y = cal_per_cup, fill = mfr)) +
geom_boxplot()
Scales: https://ggplot2-book.org/scale-position.html
Themes: https://ggplot2-book.org/polishing.html
Add a theme:
cereals %>%
ggplot(aes(x = mfr , y = cal_per_cup, fill = mfr)) +
geom_boxplot()+
theme_classic()
coord_flip() swaps the x and y axes:
cereals %>%
ggplot(aes(x = mfr , y = cal_per_cup, fill = mfr)) +
geom_boxplot()+
theme_classic()+
coord_flip()
Make a bar plot (geom_bar() counts the cereals for each manufacturer):
cereals %>%
ggplot(aes(x = mfr)) +
geom_bar()
Add color:
cereals %>%
ggplot(aes(x = mfr, fill = mfr)) +
geom_bar()
Find the mean of cal_per_cup for each manufacturer.
First, compute it as a data frame:
cereals %>%
group_by(mfr) %>%
summarize(avg_cal = mean(cal_per_cup))
## # A tibble: 7 x 2
## mfr avg_cal
## <fct> <dbl>
## 1 A 100
## 2 G 138.
## 3 K 145.
## 4 N 125.
## 5 P 195.
## 6 Q 125.
## 7 R 134.
To show the average on the plot, map y = avg_cal. geom_bar() does not accept a y aesthetic here because it counts the rows itself; only geom_col() does.
Now ggplot can plot it:
cereals %>%
group_by(mfr) %>%
summarize(avg_cal = mean(cal_per_cup)) %>%
ggplot(aes(x = mfr, fill = mfr, y = avg_cal))+
geom_col()
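The difference between the two geoms can be sketched side by side on made-up data (`toy`, `p_count`, and `p_mean` are hypothetical names):

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data
toy <- tibble(
  mfr         = c("G", "G", "K", "K"),
  cal_per_cup = c(100, 140, 110, 150)
)

# geom_bar(): only x is mapped; bar height is the number of rows
p_count <- ggplot(toy, aes(x = mfr)) +
  geom_bar()

# geom_col(): y is mapped to a value we computed ourselves
p_mean <- toy %>%
  group_by(mfr) %>%
  summarize(avg_cal = mean(cal_per_cup)) %>%
  ggplot(aes(x = mfr, y = avg_cal)) +
  geom_col()
```

Printing `p_count` or `p_mean` draws the plot; storing them first makes it easy to add themes later.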
How to use a bar plot to compare two categorical variables: manufacturer and type (hot or cold cereal) are both categorical, and we are not supplying a y, so we use geom_bar() to see whether the two variables depend on each other.
cereals %>%
ggplot(aes(x = mfr, fill = type)) +
geom_bar(position = "fill")
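The proportions that position = "fill" draws can also be computed as a table, which is useful for checking the plot. A minimal sketch on made-up data (`toy`, `props`, and `prop` are hypothetical names):

```r
library(dplyr)

# Hypothetical toy data: manufacturer and cereal type
toy <- tibble(
  mfr  = c("G", "G", "G", "K", "K"),
  type = c("C", "C", "H", "C", "C")
)

props <- toy %>%
  count(mfr, type) %>%           # rows per manufacturer/type combination
  group_by(mfr) %>%
  mutate(prop = n / sum(n)) %>%  # share of each type within a manufacturer
  ungroup()
props
```

Within each manufacturer the proportions add up to 1, which matches the full-height bars in the filled plot.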
What did you learn about cereals? Write a few sentences summarizing your findings, knit your document, and admire your handiwork!
```