For question 1, you’ll be creating a data set two different ways:
Creating individual vectors, then combining them together
Creating the data set without creating the vectors previously
Create a vector called names that has the following five values: “Budlight”, “Fiddlehead”, “Blue Moon”, “Miller Lite”, “Modelo”. Have the vector appear below the code chunk in the knitted document
names <- c("Budlight", "Fiddlehead", "Blue Moon", "Miller Lite", "Modelo")
names
## [1] "Budlight" "Fiddlehead" "Blue Moon" "Miller Lite" "Modelo"
Create a vector called light that indicates whether the beer is a light beer, with the following values: TRUE, FALSE, FALSE, TRUE, FALSE. Create a table for light using table()
light <- c(TRUE, FALSE, FALSE, TRUE, FALSE)
table(light)
## light
## FALSE TRUE
## 3 2
Next, you’ll create two vectors, one named ABV and another named calories:
ABV: 4.2, 6.2, 5.4, 4.2, 4.4
calories: 110, missing, 168, 96, 144 - where missing indicates the calories aren’t known
Have both ABV and calories appear below the code chunk in the knitted document
# Vector of ABV
ABV <- c(4.2, 6.2, 5.4, 4.2, 4.4)
# Vector of calories
calories <- c(110, NA, 168, 96, 144)
# Shown in the knitted document
ABV
## [1] 4.2 6.2 5.4 4.2 4.4
calories
## [1] 110 NA 168 96 144
Calculate the mean of the ABV and the median of the known calories
# Mean of ABV
mean(ABV)
## [1] 4.88
# Median of calories (na.rm = TRUE drops the missing value before the median is computed)
median(calories, na.rm = TRUE)
## [1] 127
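To see why na.rm is needed, the short sketch below compares the median with and without it. The vector name calories_demo is made up for illustration so that it does not overwrite calories; the values are the same as above.

```r
# Same values as the calories vector, under a hypothetical name
calories_demo <- c(110, NA, 168, 96, 144)

# Without na.rm, a single NA makes the result NA
median_with_na <- median(calories_demo)

# With na.rm = TRUE, the NA is removed first, leaving 96, 110, 144, 168
median_known <- median(calories_demo, na.rm = TRUE)

# Number of known (non-missing) values
n_known <- sum(!is.na(calories_demo))

median_with_na
median_known
n_known
```

With four known values, the median is the average of the two middle values, (110 + 144) / 2.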
Using the vectors created in parts 1a, 1b, and 1c, form a data set named beers with columns: name (not names), light, ABV, calories, in that order. Have the data set appear in the knitted document
beers <- data.frame(name = names, light, ABV, calories)
beers
## name light ABV calories
## 1 Budlight TRUE 4.2 110
## 2 Fiddlehead FALSE 6.2 NA
## 3 Blue Moon FALSE 5.4 168
## 4 Miller Lite TRUE 4.2 96
## 5 Modelo FALSE 4.4 144
Create the same beers data set from Part 1e without adding the vectors to the global environment (i.e., don’t create a vector called names, just a column called name in the beers data set). After, calculate the average ABV using the ABV column in the beers data set
# Leave this at the top of the code chunk
rm(names, light, ABV, calories)
# Create the data frame below
beers <-
  data.frame(
    name = c("Budlight", "Fiddlehead", "Blue Moon", "Miller Lite", "Modelo"),
    light = c(TRUE, FALSE, FALSE, TRUE, FALSE),
    ABV = c(4.2, 6.2, 5.4, 4.2, 4.4),
    calories = c(110, NA, 168, 96, 144)
  )
# Calculate the average of the ABV column
mean(beers$ABV)
## [1] 4.88
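Because the vectors no longer exist in the global environment, the column has to be reached through the data frame. As a rough sketch, here are three equivalent ways to do that in base R. The drinks data frame and its values are invented for illustration and are not part of the assignment.

```r
# Hypothetical toy data frame, built without creating separate vectors first
drinks <- data.frame(
  name = c("A", "B", "C"),
  ABV  = c(4.0, 5.5, 6.5)
)

# Three equivalent ways to extract the ABV column and average it
m_dollar  <- mean(drinks$ABV)        # $ extraction
m_bracket <- mean(drinks[["ABV"]])   # [[ ]] extraction by name
m_with    <- with(drinks, mean(ABV)) # evaluate the expression inside the data frame

c(m_dollar, m_bracket, m_with)
```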
Question 2 involves the “Beer Book.csv” file, which has 340 rows and 5 columns.
Import the “Beer Book.csv” file and name it beer. Display the first 10 rows in the knitted document
beer <- read.csv("Beer Book.csv")
# head() returns 6 rows by default; head(beer, n = 10) would display the 10 rows requested
head(beer)
## Beer Brewery Style
## 1 Cream on the Inside, Green on the Outside Other Half DIPA
## 2 Spontaneous Dot - Elderberry House of Fermentology Wild Ale
## 3 French Toast Brown Madison Brown ale
## 4 Vliet Threes Brewing Pilsner
## 5 Tmave pivo Simple roots Dark pilsner
## 6 Wet Dreams Foley brothers Pale ale
## ABV RateBeer.Score
## 1 9.4 4.04
## 2 5.0 NA
## 3 6.5 NA
## 4 5.1 3.43
## 5 5.2 3.75
## 6 5.5 3.55
Load the tidyverse and skimr packages, then skim the beer data set. Which column, if any, has the most missing values, and how many?
# Loading the packages
pacman::p_load(tidyverse, skimr)
# Skimming the data
skim(beer)
Name | beer |
Number of rows | 340 |
Number of columns | 5 |
_______________________ | |
Column type frequency: | |
character | 3 |
numeric | 2 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
Beer | 0 | 1 | 0 | 58 | 1 | 334 | 0 |
Brewery | 0 | 1 | 4 | 32 | 0 | 113 | 0 |
Style | 0 | 1 | 3 | 15 | 0 | 55 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
ABV | 1 | 1.00 | 6.51 | 1.58 | 0.00 | 5.22 | 6.20 | 7.8 | 11.80 | ▁▁▇▃▁ |
RateBeer.Score | 300 | 0.12 | 3.67 | 0.26 | 3.07 | 3.50 | 3.66 | 3.8 | 4.25 | ▂▃▇▃▂ |
The column with the most missing values is RateBeer.Score, with 300 missing values (ABV has 1, and the character columns have none).
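The n_missing column in the skim output can also be reproduced in base R, which is a quick way to double-check the answer. This is a minimal sketch on a made-up data frame (toy and its columns are hypothetical); the same two lines should work on beer.

```r
# Hypothetical toy data with missing values scattered across columns
toy <- data.frame(
  style = c("IPA", "Lager", NA, "Sour"),
  abv   = c(6.5, NA, 5.0, 4.8),
  score = c(NA, NA, 3.9, NA)
)

# is.na() gives a TRUE/FALSE matrix; colSums() counts the TRUEs per column
missing_per_column <- colSums(is.na(toy))

# Name of the column with the largest count (first one if there is a tie)
most_missing <- names(which.max(missing_per_column))

missing_per_column
most_missing
```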
You can ignore the code chunk below. It will make some changes to the beer data set for the remaining questions
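# Find the 5 most frequent styles (ties broken arbitrarily by with_ties = FALSE)
top_styles <-
  beer |>
  count(Style) |>
  slice_max(n, n = 5, with_ties = FALSE) |>
  pull(Style)
# Keep those styles and relabel every other style as "Other"
beer <-
  beer |>
  mutate(Style = if_else(Style %in% top_styles, Style, "Other"))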
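For anyone curious about what the chunk above does, it keeps the most frequent styles and lumps the rest into "Other". The sketch below shows the same idea in base R on an invented vector, keeping the top 2 rather than the top 5. The names styles_demo, top_demo, and lumped are hypothetical, and the example avoids ties, which the two approaches may handle differently.

```r
# Hypothetical toy vector of styles
styles_demo <- c("IPA", "IPA", "IPA", "DIPA", "DIPA", "Sour", "Stout")

# Count each style and sort from most to least frequent
counts <- sort(table(styles_demo), decreasing = TRUE)

# Names of the 2 most frequent styles
top_demo <- head(names(counts), 2)

# Keep the top styles; everything else becomes "Other"
lumped <- ifelse(styles_demo %in% top_demo, styles_demo, "Other")

table(lumped)
```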
Calculate the correlation between ABV and RateBeer.Score using the cor() function. Include use = "complete" to have it ignore missing values
cor(
  x = beer$ABV,
  y = beer$RateBeer.Score,
  use = "complete"
)
## [1] 0.2716892
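The use argument matters because cor() returns NA by default when either vector has a missing value. The sketch below, on two invented vectors (x_demo and y_demo are hypothetical), shows the default behavior and that "complete" is accepted as a partial match for the full option name "complete.obs".

```r
# Hypothetical vectors; only the first three pairs are complete
x_demo <- c(1, 2, 3, 4, NA)
y_demo <- c(2, 4, 5, NA, 10)

# Default (use = "everything"): any NA makes the result NA
r_default <- cor(x_demo, y_demo)

# Full option name: only rows where both values are known are used
r_complete <- cor(x_demo, y_demo, use = "complete.obs")

# Abbreviated form, as used in the question
r_short <- cor(x_demo, y_demo, use = "complete")

# Same as dropping the incomplete pairs by hand
r_manual <- cor(x_demo[1:3], y_demo[1:3])

c(r_default, r_complete, r_short, r_manual)
```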
Using the beer data set and the select(), table(), prop.table(), and round() functions along with pipes, calculate the proportion of beers of each Style, rounded to 3 decimal places.
If you need help with any of the functions listed, the help menu in the bottom right corner is very helpful!
beer |>
  select(Style) |>
  table() |>
  prop.table() |>
  round(digits = 3)
## Style
## DIPA IPA Lager Other Pale Ale Sour
## 0.191 0.271 0.044 0.403 0.041 0.050
What are the 3 most common styles in the data set?
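One way to read the answer off the table is to sort the proportions. The sketch below copies the rounded proportions from the output above into a named vector (style_props is a hypothetical name) and sorts it; in practice, sort(decreasing = TRUE) could likely be added to the end of the pipe instead.

```r
# Rounded proportions copied from the printed table above
style_props <- c(
  DIPA = 0.191, IPA = 0.271, Lager = 0.044,
  Other = 0.403, `Pale Ale` = 0.041, Sour = 0.050
)

# Sort from largest to smallest and keep the first three
top3 <- head(sort(style_props, decreasing = TRUE), 3)

names(top3)
```

Note that "Other" is the lumped category created earlier rather than a single style, so whether it counts as one of the three is a judgment call.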
Create side-by-side boxplots to compare ABV by Style using either boxplot() or ggplot() and geom_boxplot(). Color in the boxes with an orange color
boxplot(
  ABV ~ Style,
  data = beer,
  col = "orange"
)
Alternatively, using ggplot() and geom_boxplot():
ggplot(
  data = beer,
  mapping = aes(
    y = ABV,
    x = Style
  )
) +
  geom_boxplot(
    fill = "orange"
  )