Load the necessary packages:
library(tidyverse)
library(fosdata)
library(HistData)
This exercise is about the hot_dogs dataset from fosdata.
(a). How many observations of how many variables are there? What types are the variables?
str(hot_dogs)
## 'data.frame': 54 obs. of 3 variables:
## $ type : Factor w/ 3 levels "Beef","Meat",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ calories: int 186 181 176 149 184 190 158 139 175 148 ...
## $ sodium : int 495 477 425 322 482 587 370 322 479 375 ...
There are 54 observations of 3 variables. The variables and their types are: type (Factor), calories (int), sodium (int).
(b). What are the three kinds of hot dogs in this data set?
?hot_dogs
The documentation for the hot_dogs dataset lists the three kinds of hot dogs: beef, meat and poultry.
Alternatively, we can list the unique values of the type variable.
unique(hot_dogs$type)
## [1] Beef Meat Poultry
## Levels: Beef Meat Poultry
(c). What is the highest sodium content of any hot dog in this data set?
hot_dogs |>
slice_max(sodium, n = 1)
## type calories sodium
## 1 Beef 190 645
The highest sodium content of any hot dog is 645 mg.
(d). What is the mean calorie content for Beef hot dogs?
hot_dogs |>
group_by(type) |>
summarize(mean(calories))
## # A tibble: 3 × 2
## type `mean(calories)`
## <fct> <dbl>
## 1 Beef 157.
## 2 Meat 159.
## 3 Poultry 119.
The mean calorie content for Beef hot dogs is about 157 kcal.
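Since the question asks only about Beef, the group can also be selected directly instead of summarizing all three types. The following is a minimal base R sketch; the data frame dogs is a hypothetical stand-in with made-up values, not the real hot_dogs data.

```r
# Hypothetical stand-in for hot_dogs; the values are made up for illustration.
dogs <- data.frame(
  type = factor(c("Beef", "Beef", "Meat", "Poultry")),
  calories = c(180L, 160L, 150L, 110L)
)

# Keep only the calories of the Beef rows, then average them.
beef_mean <- mean(dogs$calories[dogs$type == "Beef"])
beef_mean
```

With the real data, the same pattern would presumably read mean(hot_dogs$calories[hot_dogs$type == "Beef"]).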
This exercise is about the data set DrinksWages from the package HistData.
(a). How many observations of how many variables are there? What types are the variables?
?DrinksWages
str(DrinksWages)
## 'data.frame': 70 obs. of 6 variables:
## $ class : Factor w/ 3 levels "A","B","C": 1 1 1 1 1 1 1 1 1 1 ...
## $ trade : Factor w/ 70 levels "baker","barman",..: 38 10 25 55 36 44 68 34 14 11 ...
## $ sober : int 1 1 2 1 2 9 8 3 0 12 ...
## $ drinks: int 1 10 1 5 0 8 2 5 7 23 ...
## $ wage : num 24 18.4 21.5 21.2 19 ...
## $ n : int 2 11 3 6 2 17 10 8 7 35 ...
There are 70 observations of 6 variables. The variables and their types are: class (Factor), trade (Factor), sober (int), drinks (int), wage (num), n (int).
(b). The variable wage contains the average wage for each profession. Which profession has the lowest wage?
DrinksWages |>
slice_min(wage, n = 1)
## class trade sober drinks wage n
## 1 A factory worker 1 3 12 4
The profession with the lowest wage is factory worker with a wage of 12 shillings per week.
(c). The variable n contains the number of workers surveyed for each profession. Sum this to find the total number of workers surveyed.
DrinksWages |>
summarize(sum(n))
## sum(n)
## 1 604
A total of 604 workers were surveyed.
(d). Compute the mean wage for all workers surveyed by multiplying wage * n for each profession, summing, and dividing by the total number of workers surveyed.
DrinksWages |>
summarize(sum(wage * n) / sum(n))
## sum(wage * n)/sum(n)
## 1 24.59782
The mean wage for all workers surveyed is 24.6 shillings per week.
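The calculation above is a weighted mean, with the number of workers in each profession acting as the weight. As a minimal sketch, base R's weighted.mean() gives the same result as the manual formula; the vectors below are made-up toy values, not taken from DrinksWages.

```r
# Toy values (assumed for illustration): average wage per group and group size.
wage <- c(10, 20, 30)
n <- c(1, 2, 1)

# Manual version: multiply, sum, and divide by the total count.
manual <- sum(wage * n) / sum(n)

# Built-in version: the second argument supplies the weights.
builtin <- weighted.mean(wage, n)

manual
builtin
```

Applied to the real data, this would presumably be weighted.mean(DrinksWages$wage, DrinksWages$n).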
This exercise is about the bechdel dataset from the package fosdata.
Let’s first explore the dataset a bit.
?bechdel
str(bechdel)
## 'data.frame': 1794 obs. of 15 variables:
## $ year : int 2013 2012 2013 2013 2013 2013 2013 2013 2013 2013 ...
## $ imdb : chr "tt1711425" "tt1343727" "tt2024544" "tt1272878" ...
## $ title : chr "21 & Over" "Dredd 3D" "12 Years a Slave" "2 Guns" ...
## $ test : Factor w/ 10 levels "dubious","dubious-disagree",..: 5 10 6 5 3 3 5 10 9 5 ...
## $ clean_test : Factor w/ 5 levels "dubious","men",..: 3 5 3 3 2 2 3 5 5 3 ...
## $ binary : Factor w/ 2 levels "FAIL","PASS": 1 2 1 1 1 1 1 2 2 1 ...
## $ budget : num 1.30e+07 4.50e+07 2.00e+07 6.10e+07 4.00e+07 2.25e+08 9.20e+07 1.20e+07 1.30e+07 1.30e+08 ...
## $ domgross : num 25682380 13414714 53107035 75612460 95020213 ...
## $ intgross : num 4.22e+07 4.09e+07 1.59e+08 1.32e+08 9.50e+07 ...
## $ code : Factor w/ 85 levels "1970PASS","1971FAIL",..: 84 83 84 84 84 84 84 85 85 84 ...
## $ budget_2013 : num 13000000 45658735 20000000 61000000 40000000 ...
## $ domgross_2013: num 25682380 13611086 53107035 75612460 95020213 ...
## $ intgross_2013: num 4.22e+07 4.15e+07 1.59e+08 1.32e+08 9.50e+07 ...
## $ period_code : int 1 1 1 1 1 1 1 1 1 1 ...
## $ decade_code : int 1 1 1 1 1 1 1 1 1 1 ...
(a). How many movies in the data set pass the Bechdel test?
bechdel |>
group_by(binary) |>
summarize(n = n())
## # A tibble: 2 × 2
## binary n
## <fct> <int>
## 1 FAIL 991
## 2 PASS 803
803 movies pass the Bechdel test.
(b). What percentage of movies in the data set pass the Bechdel test?
bechdel |>
group_by(binary) |>
summarize(n = n()) |>
mutate("% of total" = n / sum(n) * 100)
## # A tibble: 2 × 3
## binary n `% of total`
## <fct> <int> <dbl>
## 1 FAIL 991 55.2
## 2 PASS 803 44.8
44.8% of movies in the data set pass the Bechdel test.
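A shorter route to the same percentage relies on the fact that R treats TRUE as 1 and FALSE as 0, so the mean of a logical vector is the proportion of TRUE values. A minimal sketch on a made-up factor (the name result and its values are assumptions, not the real binary column):

```r
# Made-up PASS/FAIL outcomes standing in for bechdel$binary.
result <- factor(c("PASS", "FAIL", "FAIL", "PASS", "FAIL"))

# The comparison gives a logical vector; its mean is the share of PASS.
pass_pct <- mean(result == "PASS") * 100
pass_pct
```

On the real data this would presumably be mean(bechdel$binary == "PASS") * 100.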
(c). Create a table of the number of movies in the data set by year (use table).
Read the documentation for the table() function:
?table
table_bechdel <- table("year" = bechdel$year)
(d). Which year has the most movies in the data set? Give the R code to determine this, so don’t provide an answer by simply checking the data yourself.
bechdel |>
group_by(year) |>
summarise(n = n()) |>
slice_max(n, n = 1)
## # A tibble: 1 × 2
## year n
## <int> <int>
## 1 2010 129
The year with the most movies in the data set is 2010.
I wasn’t able to work out how to apply max() to the table created with table(), so I used a different approach.
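For reference, a table can be queried for its largest entry with which.max(), which returns the position of the maximum and carries the matching name along. This is a minimal sketch on a made-up vector of years (the vector and variable names are assumptions), not the real bechdel data.

```r
# Made-up years standing in for bechdel$year.
years <- c(2009, 2010, 2010, 2011, 2010, 2009)
year_counts <- table(year = years)

# which.max() gives the position of the largest count; its name is the year.
top_year <- names(which.max(year_counts))

# max() works on a table too, but returns the count rather than the year.
top_count <- max(year_counts)

# which.max() reports only the first maximum; this variant keeps any ties.
tied_years <- names(year_counts)[year_counts == max(year_counts)]

top_year
top_count
tied_years
```

Note that the names of a table are character strings, so the year comes back as text and may need as.integer() before further arithmetic.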
(e). How many different values are there in the clean_test variable?
n_distinct(bechdel$clean_test)
## [1] 5
There are 5 unique values in the clean_test variable.
unique(bechdel$clean_test)
## [1] notalk ok men nowomen dubious
## Levels: dubious men notalk nowomen ok
The values are: dubious, men, notalk, nowomen, ok.
(f). Create a data frame that contains only those observations that pass the Bechdel test.
bechdel_pass <- bechdel |>
filter(binary == "PASS")
(g). Create a data frame that contains all of the observations that do not have missing values in the domgross variable.
First check domgross column for NA and null values:
sum(is.na(bechdel$domgross))
## [1] 17
sum(is.null(bechdel$domgross))
## [1] 0
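The column contains 17 NA values. Note that is.null() tests whether the object as a whole is NULL rather than checking individual elements, so the 0 above only confirms that the column exists; missing entries in a column are always represented as NA. Filter out the NA values and store the result in a new data frame: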
bechdel_domgross <- bechdel |>
filter(!is.na(domgross))
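The difference between is.na() and is.null() is easy to see on a small example. A minimal sketch, using a made-up vector and data frame (all names here are hypothetical):

```r
# Made-up vector with one missing value.
x <- c(10, NA, 30)
df <- data.frame(x = x)

# is.na() is element-wise: one logical per entry.
n_missing <- sum(is.na(df$x))

# is.null() returns a single logical for the whole object.
col_is_null <- is.null(df$x)

# Asking for a column that does not exist returns NULL.
absent_is_null <- is.null(df$no_such_column)

# Keep only the rows with a non-missing x, as filter(!is.na(x)) would.
df_complete <- df[!is.na(df$x), , drop = FALSE]

n_missing
col_is_null
absent_is_null
nrow(df_complete)
```

In other words, is.null() is useful for checking that a column exists, while is.na() is the tool for counting or removing missing entries.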