L03 Tidy Data

Data Science 1 with R (STAT 301-1)

Author

Sonia Bhatia

Github Repo Link

To link to your github repository, appropriately edit the example link below. Meaning replace https://your-github-repo-url with your github repo url. Suggest verifying the link works before submitting.

https://github.com/stat301-1-2023-fall/L03-tidy-data-soniab101

Overview

The goal of this lab is to learn what it means to be a “tidy” dataset and how to tidy messy datasets utilizing the tidyr package – a core member of the tidyverse.

Useful resources:

Load packages

You should always begin by loading all necessary packages towards the beginning of your document.

# Loading package(s)
library(tidyverse)
library(fivethirtyeight)
library(ggplot2)
library(knitr)
library(dplyr)

Datasets

All datasets for the Exercises are either defined inline, provided within the core tidyverse packages (table1, table2, table4a, table4b), or provided in the fivethirtyeight package (drinks).

The Case Study dataset is stored in the data folder and called users_top7_2020.csv.

# Load dataset(s)

users_top7_2020 <- read_csv("data/users_top7_2020.csv")

Exercises

Exercise 1

Recreate the plot below showing the change in cases over time using table2. What do you need to do first?

Solution

Code

table2 |> 
  pivot_wider(
    names_from=type,
    values_from=count
  )

Exercise 2

Tidy the simple tibble of M&M data below and drop the NA values.

Do you need to make it wider or longer? What are the variables in your tidy version? What argument drops the NA values?

Code

# simple tibble of M&M data
mm_data <- tribble(
  ~mm_type, ~blue, ~orange, ~green, ~yellow, ~brown, ~red, ~cyan_blue,
  "plain",  6,     18,      12,     6,       7,      7,    NA,
  "peanut", NA,    11,      9,      1,       12,       8,    15
)

# blue,orange,green,yellow,brown,red,cyan_blue

mm_data |> 
  pivot_longer(
    cols=c(!mm_type),
    names_to= "color",
    values_to="count"
  )

Solution

Code

mm_data |> 
  pivot_longer(
    cols=c(!mm_type),
    names_to= "color",
    values_to="count"
  )

Exercise 3

Use table4a and only the pivot_longer function to recreate the following:

Solution

Code

table4a |> 
  pivot_longer(
    cols=(!country),
    names_to="year",
    values_to="cases",
    names_transform = list("year" = as.integer),
    values_transform = list("cases" = as.integer)
  )

Exercise 4

Use the drinks dataset and only the pivot_longer function to recreate the following:

Solution

Code

drinks |> 
  pivot_longer(
    # cols = c("beer_servings","spirit_servings","wine_servings")
    # cols = -c(country, total_liters_of_pure_alochol)
    cols = ends_with("servings"),
    names_to = "type",
    values_to = "servings",
    names_pattern = "(.*)_servings"
  )

Exercise 5

What happens if you use pivot_wider() on this table so that we have a dataset with 3 columns (respondent_name, age, height) and why?

Fix the issue by adding a new column.

Code

# dataset/table
people <- tribble(
  ~respondent_name,  ~key,    ~value, ~id,
  #-----------------|--------|------
  "Phillip Woods",   "age",       45, 1,
  "Phillip Woods",   "height",   186, 1,
  "Phillip Woods",   "age",       50, 2,
  "Jessica Cordero", "age",       37, 3,
  "Jessica Cordero", "height",   156, 3
)

Solution

The row entries were not unique; there are two values for age for Phillip Woods. A solution would be to create a unique id for each respondent.

Code

people |> 
  pivot_wider(
    names_from = key,
    values_from = value
  )

Case Study

Tinder is interested in knowing more about their most active users and have have tasked you with exploring their 7 most active users during 2020. The dataset containing the top 7 active users during 2020 is stored in the data folder and called users_top7_2020.csv.¹

The column names contain prefixes “matches”, “likes”, and “passes” followed by a number; the suffix number represents the month; and the cell represents either the total number of matches (matches), total number of times the user swiped right (likes), or total number of times the user swiped left (passes) during that month (ie: matches_1 = total number of matches during January 2020).

Use an appropriate graph to visualize the matches, likes, and passes over time for each user. What insights and conclusions can you gain from this graph, if any.

Solution

Code

# need to change months to dbl type (parse number), scalle x axis
users_tidy <- users_top7_2020 |> 
  pivot_longer(cols = -user_id,
               names_to = c("type", "month"),
               names_sep = "_", 
               values_to = "count") |> 
  mutate(month = as.numeric(month))

users_tidy

# A tibble: 252 × 4
   user_id                          type    month count
   <chr>                            <chr>   <dbl> <dbl>
 1 16a22fdad3de24d72d8509711c454ffb matches     1   224
 2 16a22fdad3de24d72d8509711c454ffb matches     2   360
 3 16a22fdad3de24d72d8509711c454ffb matches     3   408
 4 16a22fdad3de24d72d8509711c454ffb matches     4   592
 5 16a22fdad3de24d72d8509711c454ffb matches     5   640
 6 16a22fdad3de24d72d8509711c454ffb matches     6   276
 7 16a22fdad3de24d72d8509711c454ffb matches     7   228
 8 16a22fdad3de24d72d8509711c454ffb matches     8   280
 9 16a22fdad3de24d72d8509711c454ffb matches     9   344
10 16a22fdad3de24d72d8509711c454ffb matches    10   276
# ℹ 242 more rows

Code

ggplot(users_tidy, aes(x=month, y=count, color = type)) + geom_line() + facet_wrap(~user_id)

Graduate Challenge

Important

Graduate students are required to complete this challenge. It is optional for undergraduate students.

Use pivot_wider to undo the data tidying you applied in the case study exercise above. That is, take the tidied data and turn it back into the original untidy format we had when we read in the users_top7_2020 dataset.

Solution

YOUR SOLUTION HERE

Footnotes

This dataset was sourced from Swipestats.io.↩︎