# Loading package(s)
library(tidyverse)
library(fivethirtyeight)
library(ggplot2)
library(knitr)
library(dplyr)
L03 Tidy Data
Data Science 1 with R (STAT 301-1)
Overview
The goal of this lab is to learn what it means for a dataset to be “tidy” and how to tidy messy datasets using the tidyr package, a core member of the tidyverse.
Load packages
Always load all necessary packages at the beginning of your document.
Datasets
All datasets for the Exercises are either defined inline, provided within the core tidyverse packages (table1, table2, table4a, table4b), or provided in the fivethirtyeight package (drinks).
The Case Study dataset is stored in the data folder and called users_top7_2020.csv.
# Load dataset(s)
users_top7_2020 <- read_csv("data/users_top7_2020.csv")
Exercises
Exercise 1
Recreate the plot below showing the change in cases over time using table2. What do you need to do first?
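A possible first step, sketched under an assumption: the target plot is not reproduced here, so the design below (year on the x-axis, cases on the y-axis, one line per country) is a guess. In table2 each country-year pair spans two rows, one for cases and one for population, so the table is widened before plotting.

```r
library(tidyr)
library(ggplot2)

# table2 ships with tidyr; `type` holds the variable names and `count` the values
table2_wide <- table2 |>
  pivot_wider(names_from = type, values_from = count)

# Assumed plot design: cases over time, one line per country
cases_plot <- ggplot(table2_wide, aes(x = year, y = cases, color = country)) +
  geom_line() +
  geom_point() +
  labs(x = "Year", y = "Cases", color = "Country")
cases_plot
```

After widening, cases and population are separate columns, so cases can be mapped directly to an aesthetic.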
Exercise 2
Tidy the simple tibble of M&M data below and drop the NA values.
Do you need to make it wider or longer? What are the variables in your tidy version? What argument drops the NA values?
# simple tibble of M&M data
mm_data <- tribble(
  ~mm_type, ~blue, ~orange, ~green, ~yellow, ~brown, ~red, ~cyan_blue,
  "plain",  6,     18,      12,     6,       7,      7,    NA,
  "peanut", NA,    11,      9,      1,       12,     8,    15
)
# columns to pivot: blue, orange, green, yellow, brown, red, cyan_blue
mm_data |>
  pivot_longer(
    cols = !mm_type,           # every column except mm_type
    names_to = "color",
    values_to = "count",
    values_drop_na = TRUE      # drop rows whose count is NA
  )
Exercise 3
Use table4a and only the pivot_longer function to recreate the following:
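A minimal sketch of the pivot, assuming the target output has one row per country and year; the expected table is not reproduced here, so the new column names `year` and `cases` are assumptions.

```r
library(tidyr)

# table4a ships with tidyr: one row per country, one column per year
table4a_long <- table4a |>
  pivot_longer(
    cols = c(`1999`, `2000`),  # non-syntactic column names need backticks
    names_to = "year",         # assumed name for the former column headers
    values_to = "cases"        # assumed name for the cell values
  )
table4a_long
```

Note that `year` comes out as a character column, because it is built from column names.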
Exercise 4
Use the drinks dataset and only the pivot_longer function to recreate the following:
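One way this could look, assuming the target stacks the three `*_servings` columns into a single pair of columns; the expected output is not reproduced here, so the new names `type` and `servings` are assumptions.

```r
library(tidyr)
library(fivethirtyeight)

# drinks has one row per country with beer_servings, spirit_servings,
# and wine_servings stored as separate columns
drinks_long <- drinks |>
  pivot_longer(
    cols = ends_with("_servings"),
    names_to = "type",         # assumed name: which beverage the row refers to
    values_to = "servings"     # assumed name: servings for that beverage
  )
drinks_long
```

Each country now appears three times, once per beverage type, and the columns that were not pivoted are repeated on each row.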
Exercise 5
What happens if you use pivot_wider() on this table so that we have a dataset with 3 columns (respondent_name, age, height) and why?
Fix the issue by adding a new column.
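A sketch of what goes wrong and one way to fix it, using the same five rows as the table in this exercise. The names `people_raw`, `people_listcols`, and `people_fixed` are introduced here for illustration only.

```r
library(tidyr)
library(tibble)

# The table without any identifier column
people_raw <- tribble(
  ~respondent_name,  ~key,     ~value,
  "Phillip Woods",   "age",        45,
  "Phillip Woods",   "height",    186,
  "Phillip Woods",   "age",        50,
  "Jessica Cordero", "age",        37,
  "Jessica Cordero", "height",    156
)

# "Phillip Woods" has two "age" rows, so name + key does not identify a
# single value: pivot_wider() warns and returns list-columns
people_listcols <- suppressWarnings(
  people_raw |> pivot_wider(names_from = key, values_from = value)
)

# Adding an id per observation makes every (name, id, key) combination unique
people_fixed <- people_raw |>
  add_column(id = c(1, 1, 2, 3, 3)) |>
  pivot_wider(names_from = key, values_from = value)
people_fixed
```

With the `id` column, the second Phillip Woods observation gets its own row, and its missing height is filled with NA.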
# dataset/table
people <- tribble(
  ~respondent_name,  ~key,     ~value, ~id,
  #-----------------|---------|-------|----
  "Phillip Woods",   "age",        45,   1,
  "Phillip Woods",   "height",    186,   1,
  "Phillip Woods",   "age",        50,   2,
  "Jessica Cordero", "age",        37,   3,
  "Jessica Cordero", "height",    156,   3
)
Case Study
Tinder is interested in knowing more about its most active users and has tasked you with exploring its 7 most active users during 2020. The dataset containing these users is stored in the data folder and called users_top7_2020.csv. [1]
Each column name consists of a prefix (“matches”, “likes”, or “passes”) followed by a number. The number is the month, and each cell holds the total number of matches (matches), the total number of times the user swiped right (likes), or the total number of times the user swiped left (passes) during that month. For example, matches_1 is the total number of matches during January 2020.
Use an appropriate graph to visualize the matches, likes, and passes over time for each user. What insights and conclusions, if any, can you draw from this graph?
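A sketch of the tidying step and one candidate graph. The real file is not reproduced here, so the code uses a hypothetical stand-in: the `user_id` column name and every value in `users_demo` are invented for illustration, and only the prefix_month naming scheme comes from the description above.

```r
library(tidyr)
library(tibble)
library(ggplot2)

# Hypothetical stand-in for users_top7_2020 (made-up ids and values)
users_demo <- tribble(
  ~user_id, ~matches_1, ~matches_2, ~likes_1, ~likes_2, ~passes_1, ~passes_2,
  "a",      3,          5,          40,       55,       120,       90,
  "b",      1,          2,          10,       25,       60,        75
)

# Split each column name at "_" into the metric and the month
users_tidy <- users_demo |>
  pivot_longer(
    cols = -user_id,
    names_to = c("metric", "month"),
    names_sep = "_",
    names_transform = list(month = as.integer),
    values_to = "total"
  )

# One panel per user, one line per metric
activity_plot <- ggplot(users_tidy, aes(x = month, y = total, color = metric)) +
  geom_line() +
  facet_wrap(~ user_id) +
  labs(x = "Month (2020)", y = "Monthly total", color = NULL)
activity_plot
```

Because likes and passes can be much larger than matches, a free y-axis per panel or a log scale may be worth trying on the real data.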
Graduate Challenge
Graduate students are required to complete this challenge. It is optional for undergraduate students.
Use pivot_wider to undo the data tidying you applied in the case study exercise above. That is, take the tidied data and turn it back into the original untidy format we had when we read in the users_top7_2020 dataset.
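A sketch of the reverse step on a tiny hypothetical tidy table. The column names `user_id`, `metric`, `month`, and `total` and all values are assumptions made for illustration; substitute whatever names your tidied data uses.

```r
library(tidyr)
library(tibble)

# Hypothetical tidy data: one row per user, metric, and month
users_tidy_demo <- tribble(
  ~user_id, ~metric,   ~month, ~total,
  "a",      "matches", 1L,     3,
  "a",      "matches", 2L,     5,
  "a",      "likes",   1L,     40,
  "a",      "likes",   2L,     55
)

# Glue metric and month back together into column names such as matches_1
users_wide_demo <- users_tidy_demo |>
  pivot_wider(
    names_from = c(metric, month),
    names_sep = "_",
    values_from = total
  )
users_wide_demo
```

By default the new columns are ordered by first appearance, so the tidy rows need to be in the original column order if the original layout is to be reproduced exactly.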
Footnotes
[1] This dataset was sourced from Swipestats.io.