Data Wrangling Lab 6

Download needed packages

suppressWarnings({
library(knitr)
library(dplyr)
library(here)
library(tidyverse)
library(repurrrsive)
})

Question 1

Import the Month-XX.csv files into your current R session. Do so using a for loop or purrr. Rather than have 11 separate data frames (one for each month), combine these so that you have one data frame containing all the data. Your final data frame should have 698,159 rows and 10 columns.

df <- dir("/Users/mishr/OneDrive/Desktop/UC - Data Wrangling/Lab data/data",full.names = T) %>% map_df(read_csv)
glimpse(df)

## Rows: 698,159
## Columns: 10
## $ Account_ID            <dbl> 5, 16, 28, 40, 62, 64, 69, 69, 70, 79, 88, 90, 9…
## $ Transaction_Timestamp <dttm> 2009-01-08 00:16:41, 2009-01-20 22:40:08, 2009-…
## $ Factor_A              <dbl> 2, 2, 2, 2, 2, 7, 2, 2, 2, 7, 8, 10, 10, 2, 2, 2…
## $ Factor_B              <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 18, 6, 6, 6, 6, 6,…
## $ Factor_C              <chr> "VI", "VI", "VI", "VI", "VI", "MC", "VI", "VI", …
## $ Factor_D              <dbl> 20, 20, 21, 20, 20, 20, 20, 20, 20, 20, 20, 20, …
## $ Factor_E              <chr> "A", "H", "NULL", "H", "B", "NULL", "H", "H", "B…
## $ Response              <dbl> 1020, 1020, 1020, 1020, 1020, 1020, 1020, 1020, …
## $ Transaction_Status    <chr> "Approved", "Approved", "Approved", "Approved", …
## $ Month                 <chr> "Jan", "Jan", "Jan", "Jan", "Jan", "Jan", "Jan",…

Question 2

Use one of the map() function variants to check the current class of each column i.e. (class(df$Account_ID)).

df %>%
  map(class)

## $Account_ID
## [1] "numeric"
## 
## $Transaction_Timestamp
## [1] "POSIXct" "POSIXt" 
## 
## $Factor_A
## [1] "numeric"
## 
## $Factor_B
## [1] "numeric"
## 
## $Factor_C
## [1] "character"
## 
## $Factor_D
## [1] "numeric"
## 
## $Factor_E
## [1] "character"
## 
## $Response
## [1] "numeric"
## 
## $Transaction_Status
## [1] "character"
## 
## $Month
## [1] "character"

Question 3

Use one of the map() function variants to assess how many unique values exists in each column?

df %>%
  map_int(n_distinct)

##            Account_ID Transaction_Timestamp              Factor_A 
##                475413                686538                     7 
##              Factor_B              Factor_C              Factor_D 
##                     6                     4                    15 
##              Factor_E              Response    Transaction_Status 
##                    63                    42                     2 
##                 Month 
##                    11

Question 4

The “Factor_D” variable contains 15 unique values (i.e. 10, 15, 20, 21, . . . , 85, 90). There is at least one single observation where Factor_D = 26 (possibly more). Assume these observations were improperly recorded and, in fact, the value should be 25. Using ifelse() (or dplyr’s if_else()) inside mutate(), recode any values where Factor_D == 26 to be 25. After completing this, how many unique values exist in this column? How many observations are there for each level of Factor_D?

unique(df$Factor_D)

##  [1] 20 21 30 25 35 55 50 90 10 15 85 70 40 26 31

new_df <- df %>%
  mutate(Factor_D, column_values = ifelse(Factor_D == 26, 25, Factor_D)) %>%
  count(column_values)
  
#Column Factor_D after update
print(new_df)

## # A tibble: 14 × 2
##    column_values      n
##            <dbl>  <int>
##  1            10   4595
##  2            15   1089
##  3            20 527882
##  4            21  68072
##  5            25  41021
##  6            30   7030
##  7            31    512
##  8            35  25298
##  9            40   2720
## 10            50   3709
## 11            55  15200
## 12            70     54
## 13            85      4
## 14            90    973

Question 5

Unfortunately, some of the “Factor_” variables have observations that contain the value “NULL” (they are recorded as a character string, not the actual NULL value. Use filter_at() to filter out any of these observations. We have not spent much time using filter_at() so you may need to do some research on it! How many rows does your data now have (hint: it should be less than 500,000)?

df %>%
  filter_at(vars(Factor_A:Factor_E), all_vars(. != "NULL")) %>%
  nrow()

## [1] 489537

Question 6

Using mutate_at() , convert all variables except for “Transaction_Timestamp” to factors. However, make sure the “Month” variable is an ordered factor. This may require you to do two separate mutate() statements

new_df <- df %>%
  mutate_at(-2, as.factor) %>%
  mutate(Month = factor((Month), levels = month.name))

glimpse(new_df)

## Rows: 698,159
## Columns: 10
## $ Account_ID            <fct> 5, 16, 28, 40, 62, 64, 69, 69, 70, 79, 88, 90, 9…
## $ Transaction_Timestamp <dttm> 2009-01-08 00:16:41, 2009-01-20 22:40:08, 2009-…
## $ Factor_A              <fct> 2, 2, 2, 2, 2, 7, 2, 2, 2, 7, 8, 10, 10, 2, 2, 2…
## $ Factor_B              <fct> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 18, 6, 6, 6, 6, 6,…
## $ Factor_C              <fct> VI, VI, VI, VI, VI, MC, VI, VI, VI, MC, AX, DI, …
## $ Factor_D              <fct> 20, 20, 21, 20, 20, 20, 20, 20, 20, 20, 20, 20, …
## $ Factor_E              <fct> A, H, NULL, H, B, NULL, H, H, B, NULL, NULL, NUL…
## $ Response              <fct> 1020, 1020, 1020, 1020, 1020, 1020, 1020, 1020, …
## $ Transaction_Status    <fct> Approved, Approved, Approved, Approved, Approved…
## $ Month                 <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …

levels(new_df$Month)

##  [1] "January"   "February"  "March"     "April"     "May"       "June"     
##  [7] "July"      "August"    "September" "October"   "November"  "December"

Question 7

Use the summarize_if() function to assess how many unique values there are for all the other variables in our data set? Hint: use n_distinct(). Now, if you group by “Transaction_Status”, are there approximately equal distributions of unique values across all the variables? In other words, for transactions that are approved versus declined, do we have near equal representation of observations across each variable?

#Using summarise_if() to check unique values for all variables

new_df %>%
  summarise_if(is.factor, n_distinct)

## # A tibble: 1 × 9
##   Account_ID Factor_A Factor_B Factor_C Factor_D Factor_E Response
##        <int>    <int>    <int>    <int>    <int>    <int>    <int>
## 1     475413        7        6        4       15       63       42
## # ℹ 2 more variables: Transaction_Status <int>, Month <int>

new_df %>%
  group_by(Transaction_Status) %>%
  summarise_if(is.factor, n_distinct)

## # A tibble: 2 × 9
##   Transaction_Status Account_ID Factor_A Factor_B Factor_C Factor_D Factor_E
##   <fct>                   <int>    <int>    <int>    <int>    <int>    <int>
## 1 Approved               462655        6        5        4       13       60
## 2 Declined                22013        7        6        4       15       58
## # ℹ 2 more variables: Response <int>, Month <int>

#When we group by Transaction_Status, the distribution for all the unique values is similar across all except the Account_ID and Response columns

Question 8

Create a function convert_to_qtr() that converts monthly values to quarters. This function should take a vector of character month values (“Jan”, “Feb”, . . . , “Dec”) and convert to “Q1”, “Q2”, “Q3”, or “Q4”.

#Creating function convert_to_qtr() 

convert_to_qtr <- function(Month) dplyr::case_when(
  Month == "Jan" ~ "Q1",
  Month == "Feb" ~ "Q1",
  Month == "Mar" ~ "Q1",
  Month == "Apr" ~ "Q2", 
  Month == "May" ~ "Q2", 
  Month == "Jun" ~ "Q2",
  Month == "Jul" ~ "Q3", 
  Month == "Aug" ~ "Q3", 
  Month == "Sep" ~ "Q3",
  Month == "Oct" ~ "Q4", 
  Month == "Nov" ~ "Q4", 
  Month == "Dec" ~ "Q4"
)

#Testing the convert_to_qtr()
example_months <- c("Jan", "Mar", "May", "May", "Aug", "Nov", "Nov", "Dec")
convert_to_qtr(example_months)

## [1] "Q1" "Q1" "Q2" "Q2" "Q3" "Q4" "Q4" "Q4"

#Using mutate to apply convert_to_qtr() on Month column and checking unique values of the new col "Qtr"


mutate(df, Qtr = convert_to_qtr(Month)) %>%
  group_by(Qtr) %>%
  count(Qtr) %>%
  summarize(total = sum(n))

## # A tibble: 4 × 2
##   Qtr    total
##   <chr>  <int>
## 1 Q1    152174
## 2 Q2    165778
## 3 Q3    205615
## 4 Q4    174592

Question 9

Take some time to understand the sw_people data set provided by the the repurrrsive package (?sw_people). Using a map_xxx() variant, extract the name of each Star Wars character in the sw_people data set. Hint: the first element of each list item is the characters name.

#extract the names
sw_people %>% map_chr(~.x$name)

##  [1] "Luke Skywalker"        "C-3PO"                 "R2-D2"                
##  [4] "Darth Vader"           "Leia Organa"           "Owen Lars"            
##  [7] "Beru Whitesun lars"    "R5-D4"                 "Biggs Darklighter"    
## [10] "Obi-Wan Kenobi"        "Anakin Skywalker"      "Wilhuff Tarkin"       
## [13] "Chewbacca"             "Han Solo"              "Greedo"               
## [16] "Jabba Desilijic Tiure" "Wedge Antilles"        "Jek Tono Porkins"     
## [19] "Yoda"                  "Palpatine"             "Boba Fett"            
## [22] "IG-88"                 "Bossk"                 "Lando Calrissian"     
## [25] "Lobot"                 "Ackbar"                "Mon Mothma"           
## [28] "Arvel Crynyd"          "Wicket Systri Warrick" "Nien Nunb"            
## [31] "Qui-Gon Jinn"          "Nute Gunray"           "Finis Valorum"        
## [34] "Jar Jar Binks"         "Roos Tarpals"          "Rugor Nass"           
## [37] "Ric Olié"              "Watto"                 "Sebulba"              
## [40] "Quarsh Panaka"         "Shmi Skywalker"        "Darth Maul"           
## [43] "Bib Fortuna"           "Ayla Secura"           "Dud Bolt"             
## [46] "Gasgano"               "Ben Quadinaros"        "Mace Windu"           
## [49] "Ki-Adi-Mundi"          "Kit Fisto"             "Eeth Koth"            
## [52] "Adi Gallia"            "Saesee Tiin"           "Yarael Poof"          
## [55] "Plo Koon"              "Mas Amedda"            "Gregar Typho"         
## [58] "Cordé"                 "Cliegg Lars"           "Poggle the Lesser"    
## [61] "Luminara Unduli"       "Barriss Offee"         "Dormé"                
## [64] "Dooku"                 "Bail Prestor Organa"   "Jango Fett"           
## [67] "Zam Wesell"            "Dexter Jettster"       "Lama Su"              
## [70] "Taun We"               "Jocasta Nu"            "Ratts Tyerell"        
## [73] "R4-P17"                "Wat Tambor"            "San Hill"             
## [76] "Shaak Ti"              "Grievous"              "Tarfful"              
## [79] "Raymus Antilles"       "Sly Moore"             "Tion Medon"           
## [82] "Finn"                  "Rey"                   "Poe Dameron"          
## [85] "BB8"                   "Captain Phasma"        "Padmé Amidala"

Question 10

Using the sw_people data set, find the number of films each Star Wars characters appears in. Be sure to use the most appropriate map_xxx() variant.

#Using map_int() to find the number of films each Star Wars characters appears in
sw_people%>%
  map_int(~length(.x$films))

##  [1] 5 6 7 4 5 3 3 1 1 6 3 2 5 4 1 3 3 1 5 5 3 1 1 2 1 2 1 1 1 1 1 3 1 2 1 1 1 2
## [39] 1 1 2 1 1 3 1 1 1 3 3 3 2 2 2 1 3 2 1 1 1 2 2 1 1 2 2 1 1 1 1 1 1 1 2 1 1 2
## [77] 1 1 2 2 1 1 1 1 1 1 3

Question 11

Create a plot that shows the number of films each Star Wars character has been in. To do so, fill in the blanks below and your plot should look similar to one below the following code chunk.

#Creating a plot to check the number of films each Star Wars character has been in

sw_people %>%
  map_chr(~.x$name) %>%
  set_names(sw_people, nm = .) %>%
  map_df(~ length(.x$films)) %>%
  pivot_longer(
    cols=everything(),
    names_to = "Character", 
    values_to = "Films",
  ) %>%
  ggplot(aes(Films, reorder(Character, Films))) +
  geom_point()