BADS: Advanced Statistical Methods

Load the necessary packages:

library(tidyverse)
library(fosdata)
library(HistData)

Exercise 1.1

This exercise is about the hot_dogs dataset from fosdata.

(a). How many observations of how many variables are there? What types are the variables?

str(hot_dogs)

## 'data.frame':    54 obs. of  3 variables:
##  $ type    : Factor w/ 3 levels "Beef","Meat",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ calories: int  186 181 176 149 184 190 158 139 175 148 ...
##  $ sodium  : int  495 477 425 322 482 587 370 322 479 375 ...

There are 54 observations of 3 variables. The variables and their types are: type (Factor), calories (int), sodium (int).

(b). What are the three kinds of hot dogs in this data set?

?hot_dogs

The documentation for the hot_dogs dataset lists the three kinds of hot dogs: Beef, Meat, and Poultry.

Alternatively, we can print a unique list of the values in ‘type’.

unique(hot_dogs$type)

## [1] Beef    Meat    Poultry
## Levels: Beef Meat Poultry

(c). What is the highest sodium content of any hot dog in this data set?

hot_dogs |>
  slice_max(sodium, n = 1)

##   type calories sodium
## 1 Beef      190    645

The highest sodium content of any hot dog is 645 mg.

(d). What is the mean calorie content for Beef hot dogs?

hot_dogs |>
  group_by(type) |>
  summarize(mean(calories))

## # A tibble: 3 × 2
##   type    `mean(calories)`
##   <fct>              <dbl>
## 1 Beef                157.
## 2 Meat                159.
## 3 Poultry             119.

The mean calorie content for Beef hot dogs is about 157 kcal.
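
Since the question asks only about Beef, the mean can also be computed for that single group. The sketch below uses base R on a small made-up data frame (the values in `toy` are hypothetical and only mimic the column names of hot_dogs), so it illustrates the idea rather than reproducing the result above:

```r
# Hypothetical toy data with the same column names as hot_dogs
toy <- data.frame(
  type     = factor(c("Beef", "Beef", "Meat", "Poultry")),
  calories = c(150, 170, 160, 120)
)

# Logical indexing keeps only the Beef rows before averaging
beef_mean <- mean(toy$calories[toy$type == "Beef"])
print(beef_mean)

# tapply() gives the mean for every type at once, like group_by() + summarize()
type_means <- tapply(toy$calories, toy$type, mean)
print(type_means)
```

The same indexing pattern should work on the real data by replacing `toy` with `hot_dogs`.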

Exercise 1.2

This exercise is about the data set DrinksWages from the package HistData.

(a). How many observations of how many variables are there? What types are the variables?

?DrinksWages
str(DrinksWages)

## 'data.frame':    70 obs. of  6 variables:
##  $ class : Factor w/ 3 levels "A","B","C": 1 1 1 1 1 1 1 1 1 1 ...
##  $ trade : Factor w/ 70 levels "baker","barman",..: 38 10 25 55 36 44 68 34 14 11 ...
##  $ sober : int  1 1 2 1 2 9 8 3 0 12 ...
##  $ drinks: int  1 10 1 5 0 8 2 5 7 23 ...
##  $ wage  : num  24 18.4 21.5 21.2 19 ...
##  $ n     : int  2 11 3 6 2 17 10 8 7 35 ...

There are 70 observations of 6 variables. The variables and their types are: class (Factor), trade (Factor), sober (int), drinks (int), wage (num), n (int).

(b). The variable wage contains the average wage for each profession. Which profession has the lowest wage?

DrinksWages |>
  slice_min(wage, n = 1)

##   class          trade sober drinks wage n
## 1     A factory worker     1      3   12 4

The profession with the lowest wage is factory worker with a wage of 12 shillings per week.

(c). The variable n contains the number of workers surveyed for each profession. Sum this to find the total number of workers surveyed.

DrinksWages |>
  summarize(sum(n))

##   sum(n)
## 1    604

A total of 604 workers were surveyed.

(d). Compute the mean wage for all workers surveyed by multiplying wage * n for each profession, summing, and dividing by the total number of workers surveyed.

DrinksWages |>
  summarize(sum(wage * n) / sum(n))

##   sum(wage * n)/sum(n)
## 1             24.59782

The mean wage for all workers surveyed is 24.6 shillings per week.
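
This calculation is a weighted mean, with the number of workers per profession as the weights. As a cross-check, base R's weighted.mean() computes the same quantity. The sketch below uses a small made-up data frame (`toy` and its values are hypothetical), not the DrinksWages data:

```r
# Hypothetical toy data: three trades with an average wage and a head count
toy <- data.frame(
  wage = c(20, 30, 25),
  n    = c(2, 1, 1)
)

# Manual weighted mean: sum of wage * n divided by the total head count
manual <- sum(toy$wage * toy$n) / sum(toy$n)

# Same quantity with base R's weighted.mean()
builtin <- weighted.mean(toy$wage, w = toy$n)

print(manual)
print(builtin)
```

On the real data, `weighted.mean(DrinksWages$wage, w = DrinksWages$n)` should give the same value as the manual computation.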

Exercise 1.3

This exercise is about the bechdel dataset from the package fosdata.

Let’s first explore the dataset a bit.

?bechdel
str(bechdel)

## 'data.frame':    1794 obs. of  15 variables:
##  $ year         : int  2013 2012 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ imdb         : chr  "tt1711425" "tt1343727" "tt2024544" "tt1272878" ...
##  $ title        : chr  "21 &amp; Over" "Dredd 3D" "12 Years a Slave" "2 Guns" ...
##  $ test         : Factor w/ 10 levels "dubious","dubious-disagree",..: 5 10 6 5 3 3 5 10 9 5 ...
##  $ clean_test   : Factor w/ 5 levels "dubious","men",..: 3 5 3 3 2 2 3 5 5 3 ...
##  $ binary       : Factor w/ 2 levels "FAIL","PASS": 1 2 1 1 1 1 1 2 2 1 ...
##  $ budget       : num  1.30e+07 4.50e+07 2.00e+07 6.10e+07 4.00e+07 2.25e+08 9.20e+07 1.20e+07 1.30e+07 1.30e+08 ...
##  $ domgross     : num  25682380 13414714 53107035 75612460 95020213 ...
##  $ intgross     : num  4.22e+07 4.09e+07 1.59e+08 1.32e+08 9.50e+07 ...
##  $ code         : Factor w/ 85 levels "1970PASS","1971FAIL",..: 84 83 84 84 84 84 84 85 85 84 ...
##  $ budget_2013  : num  13000000 45658735 20000000 61000000 40000000 ...
##  $ domgross_2013: num  25682380 13611086 53107035 75612460 95020213 ...
##  $ intgross_2013: num  4.22e+07 4.15e+07 1.59e+08 1.32e+08 9.50e+07 ...
##  $ period_code  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ decade_code  : int  1 1 1 1 1 1 1 1 1 1 ...

(a). How many movies in the data set pass the Bechdel test?

bechdel |>
  group_by(binary) |>
  summarize(n = n())

## # A tibble: 2 × 2
##   binary     n
##   <fct>  <int>
## 1 FAIL     991
## 2 PASS     803

803 movies pass the Bechdel test.

(b). What percentage of movies in the data set pass the Bechdel test?

bechdel |>
  group_by(binary) |>
  summarize(n = n()) |>
  mutate("% of total" = n / sum(n) * 100)

## # A tibble: 2 × 3
##   binary     n `% of total`
##   <fct>  <int>        <dbl>
## 1 FAIL     991         55.2
## 2 PASS     803         44.8

44.8% of movies in the data set pass the Bechdel test.
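
A shorter route to the same percentage is to take the mean of a logical comparison, since R counts TRUE as 1 and FALSE as 0. The sketch below uses a small made-up factor (`binary` here is hypothetical toy data, not the bechdel column):

```r
# Hypothetical toy factor with the same two levels as bechdel$binary
binary <- factor(c("PASS", "FAIL", "FAIL", "PASS", "FAIL"),
                 levels = c("FAIL", "PASS"))

# A comparison yields TRUE/FALSE; mean() treats TRUE as 1 and FALSE as 0
pct_pass <- mean(binary == "PASS") * 100
print(pct_pass)

# prop.table() converts a table of counts into proportions
pct_by_level <- prop.table(table(binary)) * 100
print(pct_by_level)
```

Applied to the real data, `mean(bechdel$binary == "PASS") * 100` should agree with the PASS row of the table above.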

(c). Create a table of number of movies in the data set by year (use table)

Read the documentation for the table() function:

?table

## Help on topic 'table' was found in the following packages:
## 
##   Package               Library
##   base                  /Library/Frameworks/R.framework/Resources/library
##   vctrs                 /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library
## 
## 
## Using the first match ...

table_bechdel <- table("year" = bechdel$year)

(d). Which year has the most movies in the data set? Give the R-code to determine this, so don’t provide an answer by simply checking the data yourself.

bechdel |>
  group_by(year) |>
  summarise(n = n()) |>
  slice_max(n, n = 1)

## # A tibble: 1 × 2
##    year     n
##   <int> <int>
## 1  2010   129

The year with the most movies in the data set is 2010, with 129 movies.

I wasn’t able to figure out how to apply max() to the table created with table(), so I used a different approach.
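
For reference, a table() result can be queried directly with which.max(), which returns the position of the largest count. The sketch below uses a small made-up vector of years (`years` is hypothetical, not the real bechdel data):

```r
# Hypothetical toy vector of release years (not the real bechdel data)
years <- c(2009, 2010, 2010, 2011, 2010, 2011)

year_counts <- table(year = years)

# which.max() returns the position of the first largest count;
# the names of a one-dimensional table are its category labels
top_year <- names(year_counts)[which.max(year_counts)]
top_count <- max(year_counts)

print(top_year)   # a character string, since table names are characters
print(top_count)
```

Note that which.max() reports only the first maximum, whereas slice_max() keeps all tied rows by default, so the two approaches can differ when several years share the top count.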

(e). How many different values are there in the clean_test variable?

n_distinct(bechdel$clean_test)

## [1] 5

There are 5 unique values in the clean_test variable.

unique(bechdel$clean_test)

## [1] notalk  ok      men     nowomen dubious
## Levels: dubious men notalk nowomen ok

The values are: dubious, men, notalk, nowomen, ok.

(f). Create a data frame that contains only those observations that pass the Bechdel test.

bechdel_pass <- bechdel |>
  filter(binary == "PASS")

(g). Create a data frame that contains all of the observations that do not have missing values in the domgross variable.

First check domgross column for NA and null values:

sum(is.na(bechdel$domgross))

## [1] 17

sum(is.null(bechdel$domgross))

## [1] 0

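The column contains 17 NA values. Note that is.null() tests whether the whole object is NULL rather than checking individual elements, so the 0 above only confirms that the column exists; a numeric vector cannot contain NULL elements. Filter out the NA values and store the result in a new data frame: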

bechdel_domgross <- bechdel |>
  filter(!is.na(domgross))
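
The difference between is.na() and is.null() is easy to see on a small example. The sketch below uses a made-up vector (`x` is hypothetical toy data) and base R subsetting in place of filter():

```r
# Hypothetical toy vector with two missing values
x <- c(10.5, NA, 3.2, NA)

is.na(x)        # one logical per element
sum(is.na(x))   # number of missing elements

is.null(x)      # a single FALSE: the object x exists and is not NULL
is.null(NULL)   # TRUE only for the NULL object itself

# Keep only the complete elements, analogous to filter(!is.na(domgross))
x_complete <- x[!is.na(x)]
print(x_complete)
```

For a data frame, the row count after filtering should equal the original row count minus the number of NA values found by sum(is.na(...)).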

Assignment-01

By Max Pronk

Date: 2023-09-09
