DS 2870: Homework 1

Question 1: Creating a data set

For question 1, you’ll be creating a data set in two different ways:

1. Creating individual vectors, then combining them together
2. Creating the data set directly, without creating the vectors first

Part 1a: Beer names

Create a vector called names that has the following five values: “Budlight”, “Fiddlehead”, “Blue Moon”, “Miller Lite”, “Modelo”. Have the vector appear below the code chunk in the knitted document

names <- c("Budlight", "Fiddlehead", "Blue Moon", "Miller Lite", "Modelo")

names

## [1] "Budlight"    "Fiddlehead"  "Blue Moon"   "Miller Lite" "Modelo"

Part 1b: Light

Create a vector called light that indicates if the beer is a light beer with the following values: TRUE, FALSE, FALSE, TRUE, FALSE. Create a table for light using table()

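# TRUE/FALSE are reserved words; T and F are ordinary variables that can be overwritten
light <- c(TRUE, FALSE, FALSE, TRUE, FALSE)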

table(light)

## light
## FALSE  TRUE 
##     3     2
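
Because logical values are treated as 1 (TRUE) and 0 (FALSE) in arithmetic, the counts in the table can also be recovered with sum() and mean(). The chunk below is a minimal sketch that recreates the light vector so it runs on its own; the names n_light and prop_light are introduced here for illustration only.

```r
# Recreate the vector from Part 1b so this chunk is self-contained
light <- c(TRUE, FALSE, FALSE, TRUE, FALSE)

# TRUE counts as 1 and FALSE as 0, so the sum is the number of light beers
n_light <- sum(light)
n_light
# 2

# The mean is the proportion of light beers
prop_light <- mean(light)
prop_light
# 0.4
```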

Part 1c: ABV and Calories

Next, you’ll create two vectors, one named ABV and another named calories:

ABV: 4.2, 6.2, 5.4, 4.2, 4.4
calories: 110, missing, 168, 96, 144 - where missing indicates the calories aren’t known

Have both ABV and calories appear below the code chunk in the knitted document

# Vector of ABV
ABV <- c(4.2, 6.2, 5.4, 4.2, 4.4)

# Vector of calories
calories <- c(110, NA, 168, 96, 144)

# Shown in the knitted document
ABV

## [1] 4.2 6.2 5.4 4.2 4.4

calories

## [1] 110  NA 168  96 144

Part 1d: Summary Stats

Calculate the mean of the ABV and the median of the known calories

# Mean of ABV
mean(ABV)

## [1] 4.88

# Median of calories (need na.rm = TRUE since there is a missing value)
median(calories, na.rm = TRUE)

## [1] 127
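
To see why na.rm = TRUE matters, the sketch below recreates the calories vector from Part 1c, shows what median() returns when the missing value is left in, and then works out the median of the four known values by hand. The names known and by_hand are introduced here for illustration only.

```r
# Recreate the vector from Part 1c so this chunk is self-contained
calories <- c(110, NA, 168, 96, 144)

# With the NA left in, the median is unknown
median(calories)
# NA

# Keep only the known values and sort them: 96 110 144 168
known <- sort(calories[!is.na(calories)])

# With an even number of values, the median is the average of the middle two
by_hand <- (known[2] + known[3]) / 2
by_hand
# 127

# Same result as letting median() drop the missing value
median(calories, na.rm = TRUE)
# 127
```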

Part 1e: Forming the beer data set

Using the vectors created in parts 1a, 1b, and 1c, form a data set named beers with columns: name (not names), light, ABV, calories, in that order. Have the data set appear in the knitted document

beers <- 
  data.frame(name = names, light, ABV, calories)


beers

##          name light ABV calories
## 1    Budlight  TRUE 4.2      110
## 2  Fiddlehead FALSE 6.2       NA
## 3   Blue Moon FALSE 5.4      168
## 4 Miller Lite  TRUE 4.2       96
## 5      Modelo FALSE 4.4      144

Part 1f: Forming the data set with an empty global environment

Create the same beers data set from Part 1e without adding the vectors to the global environment (i.e., don’t create a vector called names, just a column called name in the beers data set). After, calculate the average ABV using the ABV column in the beers data set

# Leave this at the top of the code chunk
rm(names, light, ABV, calories)

# Create the data frame below
beers <- 
  data.frame(
    name = c("Budlight", "Fiddlehead", "Blue Moon", "Miller Lite", "Modelo"),
    light = c(TRUE, FALSE, FALSE, TRUE, FALSE),
    ABV = c(4.2, 6.2, 5.4, 4.2, 4.4),
    calories = c(110, NA, 168, 96, 144)
  )


# Calculate the average of the ABV column
mean(beers$ABV)

## [1] 4.88
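
The $ operator is one of several ways to reach a column that lives only inside the data frame. As a hedged sketch, the chunk below rebuilds beers and shows two equivalent alternatives, [[ ]] and with(); the names via_brackets and via_with are introduced here for illustration only.

```r
# Rebuild the data frame from Part 1f so this chunk is self-contained
beers <-
  data.frame(
    name = c("Budlight", "Fiddlehead", "Blue Moon", "Miller Lite", "Modelo"),
    light = c(TRUE, FALSE, FALSE, TRUE, FALSE),
    ABV = c(4.2, 6.2, 5.4, 4.2, 4.4),
    calories = c(110, NA, 168, 96, 144)
  )

# Extract the column by name with double brackets
via_brackets <- mean(beers[["ABV"]])

# Evaluate the expression using the data frame's columns as variables
via_with <- with(beers, mean(ABV))

# Both match mean(beers$ABV)
c(via_brackets, via_with)
```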

Question 2: Beers data set

Question 2 involves the “Beer Book.csv” file, which has 340 rows and 5 columns:

Beer: The name of the beer
Brewery: The name of the brewery that makes the beer
Style: The type (style) of the beer
ABV: The alcohol content percentage (alcohol by volume)
RateBeer.Score: The average user score for people who reviewed the beer on the RateBeer.com website

Part 2a: Import the data set

Import the “Beer Book.csv” file and name it beer. Display the first 10 rows in the knitted document

beer <- read.csv("Beer Book.csv")

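# head() returns 6 rows by default (as in the output below);
# use head(beer, n = 10) to display the first 10 rows
head(beer)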

##                                        Beer               Brewery        Style
## 1 Cream on the Inside, Green on the Outside            Other Half         DIPA
## 2              Spontaneous Dot - Elderberry House of Fermentology     Wild Ale
## 3                        French Toast Brown               Madison    Brown ale
## 4                                     Vliet        Threes Brewing      Pilsner
## 5                                Tmave pivo          Simple roots Dark pilsner
## 6                                Wet Dreams        Foley brothers     Pale ale
##   ABV RateBeer.Score
## 1 9.4           4.04
## 2 5.0             NA
## 3 6.5             NA
## 4 5.1           3.43
## 5 5.2           3.75
## 6 5.5           3.55

Part 2b: Load packages

Load the tidyverse and skimr packages, then skim the beer data set. Which column, if any, has the most missing values, and how many?

# Loading the packages
pacman::p_load(tidyverse, skimr)

# Skimming the data
skim(beer)

Data summary
Name	beer
Number of rows	340
Number of columns	5
_______________________
Column type frequency:
character	3
numeric	2
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	empty	n_unique
Beer	1	0	58	1	334
Brewery	1	4	32	0	113
Style	1	3	15	0	55

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
ABV	1	1.00	6.51	1.58	0.00	5.22	6.20	7.8	11.80	▁▁▇▃▁
RateBeer.Score	300	0.12	3.67	0.26	3.07	3.50	3.66	3.8	4.25	▂▃▇▃▂

The column with the most missing values is RateBeer.Score, with 300 missing values (a complete rate of 0.12)

You can ignore the code chunk below. It will make some changes to the beer data set for the remaining questions

beer <- 
  beer |> 
  mutate(
    Style = if_else(Style %in% c(beer |> count(Style) |> slice_max(n, n = 5, with_ties = FALSE) |> pull(Style)),
                    Style, "Other")
  )

Part 2c: Correlation between ABV and Score

Calculate the correlation between ABV and RateBeer.Score using the cor() function. Include use = "complete" (short for "complete.obs") to have it ignore rows with missing values

cor(
  x = beer$ABV,
  y = beer$RateBeer.Score,
  use = "complete"
)

## [1] 0.2716892
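
As a small illustration of what use = "complete" does, the sketch below reuses the ABV and calories vectors from Question 1 rather than the beer data set, since those can be recreated without the csv file. It compares cor() with the default settings, cor() with use = "complete.obs", and a manual version that drops the incomplete pairs first. The names keep, r_complete, and r_manual are introduced here for illustration only.

```r
# Vectors from Question 1, recreated so this chunk is self-contained
ABV <- c(4.2, 6.2, 5.4, 4.2, 4.4)
calories <- c(110, NA, 168, 96, 144)

# By default, a single missing value makes the correlation NA
cor(ABV, calories)

# "complete.obs" drops any pair where either value is missing
r_complete <- cor(ABV, calories, use = "complete.obs")

# The same thing done by hand: keep only the pairs with both values known
keep <- complete.cases(ABV, calories)
r_manual <- cor(ABV[keep], calories[keep])

# TRUE: the two approaches agree
all.equal(r_complete, r_manual)
```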

Part 2d: Calculating the proportion of styles

Using the beer data set and the select(), table(), prop.table(), and round() functions along with pipes, calculate the proportion of beers of each Style, rounded to 3 decimal places.

If you need help with any of the functions listed, the help menu in the bottom right corner is very helpful!

beer |> 
  select(Style) |> 
  table() |> 
  prop.table() |> 
  round(digits = 3)

## Style
##     DIPA      IPA    Lager    Other Pale Ale     Sour 
##    0.191    0.271    0.044    0.403    0.041    0.050

What are the 3 most common styles in the data set?
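
One way to read off the most common categories is to sort the proportions from largest to smallest. The chunk below is a minimal sketch on a made-up style vector (hypothetical values, not the beer data set) using only base R, so it skips select(); the names style and props are introduced here for illustration only.

```r
# Hypothetical styles for illustration only (not the beer data set)
style <- c("IPA", "IPA", "IPA", "DIPA", "DIPA", "Sour", "Lager", "IPA")

# Count, convert to proportions, round, then sort from largest to smallest
props <-
  style |>
  table() |>
  prop.table() |>
  round(digits = 3) |>
  sort(decreasing = TRUE)

props

# The first three entries are the three most common styles
props[1:3]
```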

Part 2e: Creating boxplots

Create side-by-side boxplots to compare ABV by Style using either boxplot() or ggplot() and geom_boxplot(). Color in the boxes with an orange color

boxplot(
  ABV ~ Style,
  data = beer,
  col = "orange"
)

Alternatively, using ggplot() and geom_boxplot():

ggplot(
  data = beer,
  mapping = aes(
    y = ABV,
    x = Style
  )
) +
  geom_boxplot(
    fill = "orange"
  )
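
To inspect the numbers behind each box without drawing anything, boxplot() accepts plot = FALSE and returns the summary statistics instead. The sketch below uses a small made-up data frame (hypothetical values, not the beer data set); the names toy and box_stats are introduced here for illustration only.

```r
# Hypothetical data for illustration only (not the beer data set)
toy <-
  data.frame(
    Style = c("IPA", "IPA", "IPA", "Lager", "Lager", "Lager"),
    ABV = c(6.0, 6.5, 7.0, 4.2, 4.5, 5.0)
  )

# plot = FALSE skips the drawing and returns the statistics behind the boxes
box_stats <- boxplot(ABV ~ Style, data = toy, plot = FALSE)

# One column per style; the rows are lower whisker, lower hinge, median,
# upper hinge, and upper whisker
box_stats$names
box_stats$stats
```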