Analysis questions

Exercise 1

Import the Month-XX.csv files into your current R session. Do so using a for loop or purrr. Rather than have 11 separate data frames (one for each month), combine these so that you have one data frame containing all the data. Your final data frame should have 698,159 rows and 10 columns.

glimpse(df)
## Rows: 698,159
## Columns: 10
## $ Account_ID            <dbl> 5, 16, 28, 40, 62, 64, 69, 69, 70, 79, 88, 90, 9…
## $ Transaction_Timestamp <dttm> 2009-01-08 00:16:41, 2009-01-20 22:40:08, 2009-…
## $ Factor_A              <dbl> 2, 2, 2, 2, 2, 7, 2, 2, 2, 7, 8, 10, 10, 2, 2, 2…
## $ Factor_B              <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 18, 6, 6, 6, 6, 6,…
## $ Factor_C              <chr> "VI", "VI", "VI", "VI", "VI", "MC", "VI", "VI", …
## $ Factor_D              <dbl> 20, 20, 21, 20, 20, 20, 20, 20, 20, 20, 20, 20, …
## $ Factor_E              <chr> "A", "H", "NULL", "H", "B", "NULL", "H", "H", "B…
## $ Response              <dbl> 1020, 1020, 1020, 1020, 1020, 1020, 1020, 1020, …
## $ Transaction_Status    <chr> "Approved", "Approved", "Approved", "Approved", …
## $ Month                 <chr> "Jan", "Jan", "Jan", "Jan", "Jan", "Jan", "Jan",…
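
The import step itself is not shown above. A minimal sketch of one way to do it with purrr follows, run on three tiny stand-in files so it is self-contained. The folder name, the `Month-01.csv` style zero-padded file names, and the toy columns are assumptions for illustration, not the real data.

```r
library(purrr)
library(readr)
library(dplyr)

# Hypothetical setup: write three tiny Month-XX.csv files to a temp folder
# so the sketch can run without the real data.
data_dir <- file.path(tempdir(), "month-data-sketch")
dir.create(data_dir, showWarnings = FALSE)
walk(1:3, function(i) {
  write_csv(
    tibble(Account_ID = c(i, i + 10), Month = month.abb[i]),
    file.path(data_dir, sprintf("Month-%02d.csv", i))
  )
})

# List the monthly files, read each one, and stack the results row-wise
# into a single data frame.
files <- list.files(data_dir, pattern = "^Month-[0-9]{2}\\.csv$", full.names = TRUE)
df_small <- files %>%
  map(read_csv, col_types = cols()) %>%
  bind_rows()

dim(df_small)
```

With the real files, the same pattern should yield the single 698,159 x 10 data frame, provided every file has the same columns.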

Exercise 2

Use one of the map() function variants to check the current class of each column, e.g., class(df$Account_ID).

# Using map to apply the class function to each column of the dataframe
column_classes <- map(df, class)

# View the classes of each column
column_classes
## $Account_ID
## [1] "numeric"
## 
## $Transaction_Timestamp
## [1] "POSIXct" "POSIXt" 
## 
## $Factor_A
## [1] "numeric"
## 
## $Factor_B
## [1] "numeric"
## 
## $Factor_C
## [1] "character"
## 
## $Factor_D
## [1] "numeric"
## 
## $Factor_E
## [1] "character"
## 
## $Response
## [1] "numeric"
## 
## $Transaction_Status
## [1] "character"
## 
## $Month
## [1] "character"

Exercise 3

Use one of the map() function variants to assess how many unique values exist in each column.

# Use map to apply the function to each column of the dataframe
unique_values_count <- map_int(df, ~ length(unique(.)))

# View the number of unique values in each column
unique_values_count
##            Account_ID Transaction_Timestamp              Factor_A 
##                475413                686538                     7 
##              Factor_B              Factor_C              Factor_D 
##                     6                     4                    15 
##              Factor_E              Response    Transaction_Status 
##                    63                    42                     2 
##                 Month 
##                    11

Exercise 4

The “Factor_D” variable contains 15 unique values (i.e. 10, 15, 20, 21, …, 85, 90). There is at least one observation where Factor_D = 26 (possibly more). Assume these observations were improperly recorded and that the value should in fact be 25. Using ifelse() (or dplyr’s if_else()) inside mutate(), recode any values where Factor_D == 26 to be 25. After completing this, how many unique values exist in this column? How many observations are there for each level of Factor_D?

# Recode Factor_D values from 26 to 25
mutated_df <- df %>%
  mutate(Factor_D = if_else(Factor_D == 26, 25, Factor_D))

# Count the number of unique values in Factor_D
num_unique_values <- length(unique(mutated_df$Factor_D))

# View the number of unique values in Factor_D
num_unique_values
## [1] 14
# Count the number of observations for each level of Factor_D
# (use the recoded data frame, not the original df)
factor_d_counts <- mutated_df %>%
  group_by(Factor_D) %>%
  summarise(count = n())

# View the counts for each level of Factor_D
factor_d_counts
## # A tibble: 14 × 2
##    Factor_D  count
##       <dbl>  <int>
##  1       10   4595
##  2       15   1089
##  3       20 527882
##  4       21  68072
##  5       25  41021
##  6       30   7030
##  7       31    512
##  8       35  25298
##  9       40   2720
## 10       50   3709
## 11       55  15200
## 12       70     54
## 13       85      4
## 14       90    973
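
As a quick sanity check of the recoding logic, here is a small sketch on a toy column. The values in `toy` are made up for illustration.

```r
library(dplyr)

# Toy data: two of the five values are the mis-recorded 26.
toy <- tibble(Factor_D = c(20, 26, 25, 26, 30))

# Recode 26 to 25, then count the observations per level.
recoded_counts <- toy %>%
  mutate(Factor_D = if_else(Factor_D == 26, 25, Factor_D)) %>%
  count(Factor_D)

recoded_counts
```

The 26s fold into the 25 level, so the toy column ends up with three levels (20, 25, 30) and the 25 level holds three observations.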

Exercise 5

Unfortunately, some of the “Factor_” variables have observations that contain the value “NULL” (they are recorded as a character string, not the actual NULL value). Use filter_at() to filter out any of these observations. We have not spent much time using filter_at() so you may need to do some research on it! How many rows does your data now have (hint: it should be fewer than 500,000)?

# Filter out rows where any Factor_ variables contain "NULL"
factor_df <- df %>%
  filter_at(vars(starts_with("Factor_")), all_vars(. != "NULL"))

# Check the number of rows in the filtered data
num_rows <- nrow(factor_df)

# View the number of rows in the filtered data
num_rows
## [1] 489537
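
The scoped verbs such as filter_at() are superseded in current dplyr. A sketch of the same filter written with filter() and if_all() is shown below on made-up toy data. It assumes dplyr 1.0.4 or later.

```r
library(dplyr)

# Toy data: rows 2 and 3 each carry the string "NULL" in a Factor_ column.
toy <- tibble(
  Factor_A = c(2, 7, 2),
  Factor_C = c("VI", "NULL", "MC"),
  Factor_E = c("A", "H", "NULL"),
  Response = c(1020, 1020, 1020)
)

# Keep a row only if every Factor_ column differs from "NULL".
kept <- toy %>%
  filter(if_all(starts_with("Factor_"), ~ .x != "NULL"))

nrow(kept)
```

Only the first toy row survives, because each of the other two has "NULL" in at least one Factor_ column.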

Exercise 6

Using mutate_at(), convert all variables except for “Transaction_Timestamp” to factors. However, make sure the “Month” variable is an ordered factor. This may require you to do two separate mutate() statements.

# Convert all variables except for Transaction_Timestamp to factors and make Month an ordered factor
converted_df <- df %>%
  mutate_at(vars(-Transaction_Timestamp), factor) %>%
  # Levels must match the abbreviated values stored in Month ("Jan", "Feb", ...);
  # month.abb is base R's vector of three-letter month abbreviations
  mutate(Month = factor(Month, levels = month.abb, ordered = TRUE))

# Check the structure of the dataframe to ensure all features are changed to factors
glimpse(converted_df)
## Rows: 698,159
## Columns: 10
## $ Account_ID            <fct> 5, 16, 28, 40, 62, 64, 69, 69, 70, 79, 88, 90, 9…
## $ Transaction_Timestamp <dttm> 2009-01-08 00:16:41, 2009-01-20 22:40:08, 2009-…
## $ Factor_A              <fct> 2, 2, 2, 2, 2, 7, 2, 2, 2, 7, 8, 10, 10, 2, 2, 2…
## $ Factor_B              <fct> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 18, 6, 6, 6, 6, 6,…
## $ Factor_C              <fct> VI, VI, VI, VI, VI, MC, VI, VI, VI, MC, AX, DI, …
## $ Factor_D              <fct> 20, 20, 21, 20, 20, 20, 20, 20, 20, 20, 20, 20, …
## $ Factor_E              <fct> A, H, NULL, H, B, NULL, H, H, B, NULL, NULL, NUL…
## $ Response              <fct> 1020, 1020, 1020, 1020, 1020, 1020, 1020, 1020, …
## $ Transaction_Status    <fct> Approved, Approved, Approved, Approved, Approved…
## $ Month                 <ord> Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan…
# Ensure the Month variable is an ordered factor by viewing its levels
levels(converted_df$Month)
##  [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
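
When building a factor with explicit levels, it is worth confirming that the levels match the stored values exactly, since any value not found among the levels silently becomes NA. A small sketch on toy month values, using base R's month.abb and month.name constants:

```r
# Toy values in the same abbreviated form the Month column uses.
months <- c("Jan", "Feb", "Nov")

# Levels that match the data: every value is kept and the order is meaningful.
matched <- factor(months, levels = month.abb, ordered = TRUE)

# Levels that do not match the data (full month names): every value becomes NA.
mismatched <- factor(months, levels = month.name, ordered = TRUE)

is.na(mismatched)
matched[1] < matched[3]
```

A glimpse() of the converted column that shows only NA values is therefore a sign that the levels and the data disagree.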

Exercise 7

Use the summarize_if() function to assess how many unique values there are for all the other variables in our data set.

# Assess the number of unique values for all factor variables
# (converted_df from Exercise 6 holds the factor columns; df has none)
unique_values_summary <- converted_df %>%
  summarize_if(is.factor, n_distinct)

# View the summary of unique values for all factor variables
unique_values_summary

# Group by Transaction_Status and assess the distribution of unique values across the remaining factor variables
grouped_unique_values_summary <- converted_df %>%
  group_by(Transaction_Status) %>%
  summarize_if(is.factor, n_distinct)

# View the grouped summary of unique values
grouped_unique_values_summary
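
summarize_if() only summarizes columns for which the predicate is TRUE, so it returns no summary columns when the data frame has no factor columns. A sketch on made-up toy data, with the across() spelling alongside for comparison:

```r
library(dplyr)

# Toy data: two factor columns and one numeric column.
toy <- tibble(
  a = factor(c("x", "y", "x")),
  b = factor(c("p", "p", "p")),
  n = c(1, 2, 3)
)

# Scoped verb: summarize only the factor columns.
counts <- toy %>% summarize_if(is.factor, n_distinct)

# Equivalent spelling with across() and where().
counts_across <- toy %>% summarize(across(where(is.factor), n_distinct))

counts
```

Both calls return one row with a = 2 and b = 1. The numeric column n is skipped.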

Exercise 8

Create a function convert_to_qtr() that converts monthly values to quarters. This function should take a vector of character month values (“Jan”, “Feb”, . . . , “Dec”) and convert to “Q1”, “Q2”, “Q3”, or “Q4”. Do it such that:

• If the month input is Jan-Mar, then the function returns “Q1”
• If the month input is Apr-Jun, then the function returns “Q2”
• If the month input is Jul-Sep, then the function returns “Q3”
• If the month input is Oct-Dec, then the function returns “Q4”

Note that the lubridate package has a function, quarter(), that can accomplish this, but I want you to use case_when() within the function instead. Once you’ve created this function, you should be able to test it on the following vector and get the same results:

Now, use the function you created above in a mutate() statement to create a new variable called “Qtr” in the data frame. How many observations do you have in each quarter?

# Find unique months across all data files
unique(df$Month)
##  [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov"
# Define the convert_to_qtr function
convert_to_qtr <- function(months) {
  case_when(
    months %in% c("Jan", "Feb", "Mar") ~ "Q1",
    months %in% c("Apr", "May", "Jun") ~ "Q2",
    months %in% c("Jul", "Aug", "Sep") ~ "Q3",
    months %in% c("Oct", "Nov", "Dec") ~ "Q4"
  )
}

# Use the function to create a new Qtr variable in the dataframe
quarter_df <- df %>%
  mutate(Qtr = convert_to_qtr(Month))

# Count the number of observations in each quarter
quarter_counts <- quarter_df %>%
  group_by(Qtr) %>%
  summarise(count = n())

# View the counts for each quarter
quarter_counts
## # A tibble: 4 × 2
##   Qtr    count
##   <chr>  <int>
## 1 Q1    152174
## 2 Q2    165778
## 3 Q3    205615
## 4 Q4    174592
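
The assignment's test vector is not reproduced here, so the sketch below checks convert_to_qtr() against a hypothetical stand-in vector. The function is repeated so the block runs on its own.

```r
library(dplyr)

# Same definition as above, repeated so this block is self-contained.
convert_to_qtr <- function(months) {
  case_when(
    months %in% c("Jan", "Feb", "Mar") ~ "Q1",
    months %in% c("Apr", "May", "Jun") ~ "Q2",
    months %in% c("Jul", "Aug", "Sep") ~ "Q3",
    months %in% c("Oct", "Nov", "Dec") ~ "Q4"
  )
}

# Hypothetical test vector (not the one from the assignment).
example_months <- c("Jan", "Mar", "May", "May", "Aug", "Nov", "Nov", "Dec")
quarters <- convert_to_qtr(example_months)
quarters

# Any value that matches none of the conditions falls through to NA.
convert_to_qtr("Foo")
```

Because case_when() returns NA when no condition matches, a misspelled or differently formatted month shows up as NA in Qtr rather than raising an error.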

Exercise 9

Take some time to understand the sw_people data set provided by the repurrrsive package (?sw_people). Using a map_xxx() variant, extract the name of each Star Wars character in the sw_people data set. Hint: the first element of each list item is the character’s name.

# Using map_chr to extract the name of each Star Wars character in the sw_people dataset
character_names <- sw_people %>% map_chr(1)

# View character names
character_names
##  [1] "Luke Skywalker"        "C-3PO"                 "R2-D2"                
##  [4] "Darth Vader"           "Leia Organa"           "Owen Lars"            
##  [7] "Beru Whitesun lars"    "R5-D4"                 "Biggs Darklighter"    
## [10] "Obi-Wan Kenobi"        "Anakin Skywalker"      "Wilhuff Tarkin"       
## [13] "Chewbacca"             "Han Solo"              "Greedo"               
## [16] "Jabba Desilijic Tiure" "Wedge Antilles"        "Jek Tono Porkins"     
## [19] "Yoda"                  "Palpatine"             "Boba Fett"            
## [22] "IG-88"                 "Bossk"                 "Lando Calrissian"     
## [25] "Lobot"                 "Ackbar"                "Mon Mothma"           
## [28] "Arvel Crynyd"          "Wicket Systri Warrick" "Nien Nunb"            
## [31] "Qui-Gon Jinn"          "Nute Gunray"           "Finis Valorum"        
## [34] "Jar Jar Binks"         "Roos Tarpals"          "Rugor Nass"           
## [37] "Ric Olié"              "Watto"                 "Sebulba"              
## [40] "Quarsh Panaka"         "Shmi Skywalker"        "Darth Maul"           
## [43] "Bib Fortuna"           "Ayla Secura"           "Dud Bolt"             
## [46] "Gasgano"               "Ben Quadinaros"        "Mace Windu"           
## [49] "Ki-Adi-Mundi"          "Kit Fisto"             "Eeth Koth"            
## [52] "Adi Gallia"            "Saesee Tiin"           "Yarael Poof"          
## [55] "Plo Koon"              "Mas Amedda"            "Gregar Typho"         
## [58] "Cordé"                 "Cliegg Lars"           "Poggle the Lesser"    
## [61] "Luminara Unduli"       "Barriss Offee"         "Dormé"                
## [64] "Dooku"                 "Bail Prestor Organa"   "Jango Fett"           
## [67] "Zam Wesell"            "Dexter Jettster"       "Lama Su"              
## [70] "Taun We"               "Jocasta Nu"            "Ratts Tyerell"        
## [73] "R4-P17"                "Wat Tambor"            "San Hill"             
## [76] "Shaak Ti"              "Grievous"              "Tarfful"              
## [79] "Raymus Antilles"       "Sly Moore"             "Tion Medon"           
## [82] "Finn"                  "Rey"                   "Poe Dameron"          
## [85] "BB8"                   "Captain Phasma"        "Padmé Amidala"
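
map_chr() accepts either a position or a name as its extractor. A sketch on a small hand-built list shaped like sw_people follows. The `people` list and its film labels are stand-ins, not the repurrrsive data.

```r
library(purrr)

# Toy list with the same shape as sw_people: name first, then other fields.
people <- list(
  list(name = "Luke Skywalker", films = c("Film 1", "Film 2")),
  list(name = "R5-D4", films = "Film 1")
)

# Extract by position (the first element of each item) ...
by_position <- map_chr(people, 1)

# ... or by name, which keeps working if the element order ever changes.
by_name <- map_chr(people, "name")

by_name
```

Both calls return the same character vector here. Extracting by name is the more robust choice when the field name is known.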

Exercise 10

Using the sw_people data set, find the number of films each Star Wars character appears in. Be sure to use the most appropriate map_xxx() variant.

# Extracting names and number of films each character appears in
character_names <- sw_people %>% map_chr("name")
num_films <- sw_people %>% map_int(~ length(.$films))

# Using map2_chr to print each character and the number of films they appear in
output <- map2_chr(character_names, num_films, ~ paste(.x, "appears in", .y, "films"))

# Printing the output
print(output)
##  [1] "Luke Skywalker appears in 5 films"       
##  [2] "C-3PO appears in 6 films"                
##  [3] "R2-D2 appears in 7 films"                
##  [4] "Darth Vader appears in 4 films"          
##  [5] "Leia Organa appears in 5 films"          
##  [6] "Owen Lars appears in 3 films"            
##  [7] "Beru Whitesun lars appears in 3 films"   
##  [8] "R5-D4 appears in 1 films"                
##  [9] "Biggs Darklighter appears in 1 films"    
## [10] "Obi-Wan Kenobi appears in 6 films"       
## [11] "Anakin Skywalker appears in 3 films"     
## [12] "Wilhuff Tarkin appears in 2 films"       
## [13] "Chewbacca appears in 5 films"            
## [14] "Han Solo appears in 4 films"             
## [15] "Greedo appears in 1 films"               
## [16] "Jabba Desilijic Tiure appears in 3 films"
## [17] "Wedge Antilles appears in 3 films"       
## [18] "Jek Tono Porkins appears in 1 films"     
## [19] "Yoda appears in 5 films"                 
## [20] "Palpatine appears in 5 films"            
## [21] "Boba Fett appears in 3 films"            
## [22] "IG-88 appears in 1 films"                
## [23] "Bossk appears in 1 films"                
## [24] "Lando Calrissian appears in 2 films"     
## [25] "Lobot appears in 1 films"                
## [26] "Ackbar appears in 2 films"               
## [27] "Mon Mothma appears in 1 films"           
## [28] "Arvel Crynyd appears in 1 films"         
## [29] "Wicket Systri Warrick appears in 1 films"
## [30] "Nien Nunb appears in 1 films"            
## [31] "Qui-Gon Jinn appears in 1 films"         
## [32] "Nute Gunray appears in 3 films"          
## [33] "Finis Valorum appears in 1 films"        
## [34] "Jar Jar Binks appears in 2 films"        
## [35] "Roos Tarpals appears in 1 films"         
## [36] "Rugor Nass appears in 1 films"           
## [37] "Ric Olié appears in 1 films"             
## [38] "Watto appears in 2 films"                
## [39] "Sebulba appears in 1 films"              
## [40] "Quarsh Panaka appears in 1 films"        
## [41] "Shmi Skywalker appears in 2 films"       
## [42] "Darth Maul appears in 1 films"           
## [43] "Bib Fortuna appears in 1 films"          
## [44] "Ayla Secura appears in 3 films"          
## [45] "Dud Bolt appears in 1 films"             
## [46] "Gasgano appears in 1 films"              
## [47] "Ben Quadinaros appears in 1 films"       
## [48] "Mace Windu appears in 3 films"           
## [49] "Ki-Adi-Mundi appears in 3 films"         
## [50] "Kit Fisto appears in 3 films"            
## [51] "Eeth Koth appears in 2 films"            
## [52] "Adi Gallia appears in 2 films"           
## [53] "Saesee Tiin appears in 2 films"          
## [54] "Yarael Poof appears in 1 films"          
## [55] "Plo Koon appears in 3 films"             
## [56] "Mas Amedda appears in 2 films"           
## [57] "Gregar Typho appears in 1 films"         
## [58] "Cordé appears in 1 films"                
## [59] "Cliegg Lars appears in 1 films"          
## [60] "Poggle the Lesser appears in 2 films"    
## [61] "Luminara Unduli appears in 2 films"      
## [62] "Barriss Offee appears in 1 films"        
## [63] "Dormé appears in 1 films"                
## [64] "Dooku appears in 2 films"                
## [65] "Bail Prestor Organa appears in 2 films"  
## [66] "Jango Fett appears in 1 films"           
## [67] "Zam Wesell appears in 1 films"           
## [68] "Dexter Jettster appears in 1 films"      
## [69] "Lama Su appears in 1 films"              
## [70] "Taun We appears in 1 films"              
## [71] "Jocasta Nu appears in 1 films"           
## [72] "Ratts Tyerell appears in 1 films"        
## [73] "R4-P17 appears in 2 films"               
## [74] "Wat Tambor appears in 1 films"           
## [75] "San Hill appears in 1 films"             
## [76] "Shaak Ti appears in 2 films"             
## [77] "Grievous appears in 1 films"             
## [78] "Tarfful appears in 1 films"              
## [79] "Raymus Antilles appears in 2 films"      
## [80] "Sly Moore appears in 2 films"            
## [81] "Tion Medon appears in 1 films"           
## [82] "Finn appears in 1 films"                 
## [83] "Rey appears in 1 films"                  
## [84] "Poe Dameron appears in 1 films"          
## [85] "BB8 appears in 1 films"                  
## [86] "Captain Phasma appears in 1 films"       
## [87] "Padmé Amidala appears in 3 films"
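
The printed sentences above read "1 films" for single-film characters. A sketch of one way to pick the singular or plural noun follows, using a hand-built stand-in list rather than sw_people itself.

```r
library(purrr)

# Toy list shaped like sw_people (stand-in data, not the repurrrsive set).
people <- list(
  list(name = "Luke Skywalker", films = c("Film 1", "Film 2", "Film 3")),
  list(name = "R5-D4", films = "Film 1")
)

# map_int() guarantees an integer vector: one film count per character.
film_counts <- map_int(people, ~ length(.x$films))
names(film_counts) <- map_chr(people, "name")

# Choose "film" or "films" depending on the count.
labels <- paste(
  names(film_counts), "appears in", film_counts,
  ifelse(film_counts == 1, "film", "films")
)

labels
```

The second label reads "R5-D4 appears in 1 film", while multi-film characters keep the plural.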

Exercise 11

Using the sw_people data set, create a plot that shows the number of films each Star Wars character appears in.

# Extracting the name and number of films for each character
character_films <- map_df(sw_people, ~tibble(Name = .x$name, Films = length(.x$films)))

# Creating the plot
ggplot(data = character_films, aes(x = Films, y = reorder(Name, Films))) +
  geom_point() +  # Adding points to the plot
  labs(title = "Number of Films Each Star Wars Character Has Been In", 
       x = "Number of Films", 
       y = "Character")  # Adding labels to the plot

# Save the plot with specific dimensions
ggsave("star_wars_character_films.png", width = 10, height = 15, units = "in")