Module 6: Reviewing the Tidyverse

Workshop 6: Recap!

Open this Project on RStudio.Cloud!

Welcome to Workshop 6: Review!

In this workshop, we will review the material from all previous workshops through a series of review problems. Answers will be posted as well.

These review problems are optional, but I strongly suggest that you work on them together with your group.

Task 0. Load your packages

library(tidyverse)
library(viridis)
library(nycflights13)
library(fivethirtyeight)
library(infer)

Task 1. Make a data.frame.

Please convert the data presented in the following paragraph into a data.frame containing 7 rows and 2 columns, with descriptive column names.

The United States generates its energy supply from several different sources. 60.3% comes from fossil fuels, 19.7% comes from nuclear power, 8.4% comes from wind, 7.3% comes from hydropower, 2.3% comes from solar, 1.4% comes from biomass, and 0.4% comes from geothermal energy.

Finally, use the summarize() function to calculate how much energy comes from any one source on average.
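
As a reminder of the pattern, here is a minimal sketch that builds a small data.frame by hand and averages one of its columns. The snacks data and its column names are hypothetical stand-ins, not the energy figures from the task.

```r
library(dplyr)

# Hypothetical toy data: 3 rows, 2 columns (NOT the energy data above)
snacks <- data.frame(
  snack   = c("chips", "fruit", "cookies"),
  percent = c(50, 30, 20)
)

# summarize() collapses the data.frame to one row holding the mean
avg <- snacks %>%
  summarize(mean_percent = mean(percent))

avg
```

The same two steps (build the data.frame, then summarize) should carry over to the 7-row energy table.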

Task 2. Reverse-engineer the following visualization.

You will need to build the data.frame that this visual displays, apply the correct ggplot functions, and add the color scheme and labels.

Task 3. Summarizing by Group

Load the purchases.csv dataset into R. It is located in the Files tab of this Workshop 6 project.

This dataset includes a series of Dunkin' Donuts purchases, where each row is a purchase. Please calculate the average cost of a purchase at each location, as well as how much the cost of those purchases varied (i.e., the standard deviation). The result must be a single data.frame.
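
The group-then-summarize pattern might look like the sketch below. It uses a hypothetical toy table, and the column names location and cost are assumptions; check the actual names in purchases.csv before adapting it.

```r
library(dplyr)

# Hypothetical toy purchases; real column names may differ
purchases_toy <- tibble::tibble(
  location = c("A", "A", "A", "B", "B", "B"),
  cost     = c(2, 4, 6, 5, 5, 8)
)

# One row per location, with the mean and standard deviation side by side
purchase_stats <- purchases_toy %>%
  group_by(location) %>%
  summarize(
    mean_cost = mean(cost),
    sd_cost   = sd(cost)
  )

purchase_stats
```

Computing both statistics inside a single summarize() call is what keeps the result in one data.frame.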

Task 4. Visualize

The following dataset contains the temperature in degrees Fahrenheit at JFK airport in New York City for every hour in the year 2013, where each row is an hour. Please visualize the distribution of temperatures by month using several different visualization functions.

geom_histogram()
geom_jitter()
geom_boxplot()
geom_violin()

weather <- read_csv("weather_JFK.csv")
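
To illustrate two of these geoms, the sketch below plots simulated temperatures for two months. The data are randomly generated, and the column names month and temp are assumptions; the real weather_JFK.csv may name its columns differently.

```r
library(ggplot2)

# Simulated toy data: 50 hourly readings for each of two months
set.seed(1)
weather_toy <- data.frame(
  month = factor(rep(c(1, 7), each = 50)),
  temp  = c(rnorm(50, mean = 35, sd = 8), rnorm(50, mean = 78, sd = 6))
)

# Boxplot: month must be discrete (a factor) to get one box per month
p_box <- ggplot(weather_toy, aes(x = month, y = temp)) +
  geom_boxplot() +
  labs(x = "Month", y = "Temperature (F)")

# Histogram: one panel per month via facet_wrap()
p_hist <- ggplot(weather_toy, aes(x = temp)) +
  geom_histogram(bins = 20) +
  facet_wrap(~month) +
  labs(x = "Temperature (F)", y = "Hours")

p_box
p_hist
```

Swapping geom_boxplot() for geom_violin() or geom_jitter() in the first plot should work with the same aesthetics.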

Task 5. Filtering and Summarizing

The fukushima.csv dataset contains a list of Japanese municipalities over time, where each row represents a municipality (muni) in a specific year (year). The by_exclusion_zone variable categorizes these cities into “Exclusion Zone”, “Outside Exclusion Zone”, and “Other”, based on whether they were included in the Exclusion Zone around the Fukushima Daiichi reactor. The tsunami variable indicates whether they were hit by the tsunami (1) or not (0).

Please filter the data to the year 2011 and to just the cities in the Exclusion Zone. Then, select just the names of those cities and whether they were hit by the tsunami. Arrange those cities based on whether they were hit by the tsunami or not.
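
The filter, select, and arrange steps can be chained as in the sketch below. The variable names come from the description above, but the rows are hypothetical toy values, not the real fukushima.csv.

```r
library(dplyr)

# Hypothetical toy rows using the variable names described above
fukushima_toy <- tibble::tibble(
  muni = c("Town A", "Town B", "Town C", "Town A", "Town D"),
  year = c(2011, 2011, 2011, 2010, 2011),
  by_exclusion_zone = c("Exclusion Zone", "Exclusion Zone", "Other",
                        "Exclusion Zone", "Outside Exclusion Zone"),
  tsunami = c(1, 0, 0, 1, 1)
)

zone_2011 <- fukushima_toy %>%
  # Keep rows that satisfy both conditions
  filter(year == 2011, by_exclusion_zone == "Exclusion Zone") %>%
  # Keep only the two columns of interest
  select(muni, tsunami) %>%
  # Tsunami-hit municipalities (1) first
  arrange(desc(tsunami))

zone_2011
```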

Task 6. T-tests

The mydiamonds.csv dataset contains 1,000 diamonds (rows), recording the price and the cut of each diamond. Zoom in on just ideal and premium diamonds, and use a t-test to find out whether ideal cut diamonds cost more on average than premium diamonds.

Which category did you use as your treatment vs. control? How much more do they cost on average? What is your t-statistic, and how extreme is it? What does that tell you?

mydiamonds <- read_csv("mydiamonds.csv")
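
One possible shape for this analysis is sketched below with the infer package's t_test() function. The prices are hypothetical toy values, and the labels "Ideal" and "Premium" and the column names cut and price are assumptions; check how they are actually spelled in mydiamonds.csv.

```r
library(dplyr)
library(infer)

# Hypothetical toy diamonds (NOT the real dataset)
diamonds_toy <- tibble::tibble(
  cut   = c(rep("Ideal", 4), rep("Premium", 4), rep("Good", 2)),
  price = c(3000, 3500, 4000, 4500, 4200, 4800, 5100, 5600, 2500, 3900)
)

# Keep only the two groups being compared; factor() drops unused labels
two_cuts <- diamonds_toy %>%
  filter(cut %in% c("Ideal", "Premium")) %>%
  mutate(cut = factor(cut))

# Group means, to see the size of the difference
means <- two_cuts %>%
  group_by(cut) %>%
  summarize(mean_price = mean(price))

# order = c("Ideal", "Premium") means the difference is Ideal minus Premium
tt <- two_cuts %>%
  t_test(formula = price ~ cut, order = c("Ideal", "Premium"))

means
tt
```

Here "Ideal" plays the role of treatment because it comes first in order, so a negative statistic would mean ideal diamonds cost less on average in the sample.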

Task 7. Correlation

The fivethirtyeight package includes a drinks dataset, tallying the average number of servings of alcohol per person for beer, spirits, and wine. Below, please use a correlation test to examine the following question:

Do countries that tend to serve more wine also serve more beer? How do you know?

mydrinks <- fivethirtyeight::drinks

Now visualize this trend using geom_jitter(), with wine on the x-axis and beer on the y-axis. How strong is the trend?
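
A minimal sketch of both steps follows, using base R's cor.test() as one way to run a correlation test. The toy values and the column names wine and beer are hypothetical; the real drinks dataset has its own column names, which you should check with glimpse() or names().

```r
library(ggplot2)

# Hypothetical toy data: servings per person in six made-up countries
drinks_toy <- data.frame(
  wine = c(10, 50, 120, 200, 280, 30),
  beer = c(40, 90, 150, 210, 300, 20)
)

# Pearson correlation test: estimate is r, p.value tests r = 0
ct <- cor.test(drinks_toy$wine, drinks_toy$beer)
ct$estimate
ct$p.value

# Scatterplot with a little random jitter to reduce overplotting
p <- ggplot(drinks_toy, aes(x = wine, y = beer)) +
  geom_jitter() +
  labs(x = "Wine servings", y = "Beer servings")

p
```

A correlation near +1 or -1 indicates a strong trend, while a value near 0 indicates a weak one.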

Task 8. Cross-tabulation

Researchers at fivethirtyeight investigated the Bechdel test, which evaluates whether a movie features at least two women who talk to each other about something other than a man. Please load in the bechdeltest.csv file, adapted from their dataset. Here, each row is a blockbuster movie, which was released in a given decade (decade) and either passed or failed the Bechdel test (binary).

Please cross-tabulate the number of films which passed vs. failed the test each decade. Then, visualize this using geom_col() and facet_wrap().

mybechdel <- read_csv("bechdeltest.csv")
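
The cross-tabulation and plot could be sketched as follows. The seven toy films and the "PASS"/"FAIL" labels are hypothetical; the real file may code the binary variable differently.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy films (NOT the real dataset)
bechdel_toy <- tibble::tibble(
  decade = c(1990, 1990, 1990, 2000, 2000, 2000, 2000),
  binary = c("FAIL", "FAIL", "PASS", "PASS", "FAIL", "PASS", "PASS")
)

# count() tallies rows for each decade-by-outcome combination into column n
tab <- bechdel_toy %>%
  count(decade, binary)

# One panel per decade, one bar per outcome
p <- ggplot(tab, aes(x = binary, y = n, fill = binary)) +
  geom_col() +
  facet_wrap(~decade) +
  labs(x = "Bechdel test result", y = "Number of films")

tab
p
```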

Task 9. Chi-squared

Continuing with the same bechdeltest.csv dataset from Task 8, please use a chi-squared test from the infer package to see how closely related decade and binary (passing or failing the Bechdel test) are. How extreme is your statistic?

mybechdel <- read_csv("bechdeltest.csv")
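
One way to run this test is infer's chisq_test() function, sketched below on hypothetical toy data. The counts and labels are made up for illustration, and both variables are stored as factors here so the test treats them as categories.

```r
library(infer)

# Hypothetical toy data: 20 films in each of two decades
bechdel_toy <- data.frame(
  decade = factor(rep(c("1990", "2000"), each = 20)),
  binary = factor(c(rep("FAIL", 14), rep("PASS", 6),
                    rep("FAIL", 8),  rep("PASS", 12)))
)

# Chi-squared test of independence between outcome and decade
chi <- chisq_test(bechdel_toy, formula = binary ~ decade)

chi
```

The larger the statistic (and the smaller the p-value), the less plausible it is that decade and binary are unrelated.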

Finally, please check your answers! How do yours compare?