Instructions

This assignment reviews the Describing and Visualizing Data content. To complete it, use the describe_visualize_data.Rmd file I reviewed in the lecture for this class session: copy the relevant code from that file and update it to answer the questions in this assignment. In each section, execute the relevant code first and then respond to the questions. Submit two files to this assignment's Submissions folder on D2L:

  1. this completed R Markdown script, and
  2. an HTML rendered version of it.

To start:

First, if you have not done so already, create a folder on your computer named gsb_804 to hold all of the materials for this course.

Second, inside of gsb_804, you will create a folder to host assignments. You can name that folder assignments.

Third, inside of assignments, you will create folders for each assignment. You can name the folder for this first assignment: 01_describe_visualize_data.

Fourth, create three additional folders in 01_describe_visualize_data named scripts, data, and plots. Store this script in the scripts folder and the data for this assignment in the data folder. The plots folder will hold the plot you save in Task 9.

Fifth, go to the File menu in RStudio, select New Project…, choose Existing Directory, and select your /gsb_804/assignments/01_describe_visualize_data folder as the top-level directory for this R Project.

Global Settings

The first code chunk sets the global settings for the remaining code chunks in the document. Do not change anything in this code chunk.

Activate Packages

In this code chunk, we load the packages we need for this assignment:

  1. here;
  2. tidyverse;
  3. scales;
  4. janitor;
  5. skimr;
  6. ggthemes;
  7. infer;
  8. corrr;
  9. rstatix.

Make sure you installed these packages when you reviewed the analytical lecture.

We will use functions from these packages to examine the data. Do not change anything in this code chunk.

## here for project workflow
library(here)

## tidyverse for data manipulation and plotting;
## loads the core tidyverse packages (e.g., dplyr, ggplot2, readr) at once
library(tidyverse)

## scales for formatting variable scales
library(scales)

## janitor for variable names and tables
library(janitor)

## skimr to summarize data
library(skimr)

## ggthemes for plots
library(ggthemes)

## infer for inferential frequentist statistics
library(infer)

## corrr for correlations
library(corrr)

## rstatix to compute statistical tests
## and effect sizes
library(rstatix)

Task 1: Import Data

For this task, you import the data of interest.

Task 1.1

Use the read_csv() and here() functions to import the credit_card_customers.csv data file. Save the data as an object named customers_raw.

Question 1.1: After you load the data, look at your Global Environment window. (1) How many observations are there in the data? (2) How many variables are there in the data?

Response 1.1: (1) 400 observations (2) 12 variables

### import data file
## save as object
## use read_csv() to import the csv data file
customers_raw <- read_csv(
  ## use here() to locate file in our project directory;
  here(
    # folder
    "data", 
    # file
    "credit_card_customers.csv"
  )
)
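
The observation and variable counts can also be confirmed with code rather than the Global Environment window. The following is a minimal sketch, assuming a small made-up file written to a temporary folder; toy_path and toy_raw are hypothetical names, not part of the assignment data.

```r
library(readr)

# hypothetical toy data standing in for credit_card_customers.csv
toy_path <- file.path(tempdir(), "toy_customers.csv")
write_csv(
  data.frame(
    ID = 1:3,
    Income = c(14.9, 106.0, 104.6),
    Married = c("Yes", "Yes", "No")
  ),
  toy_path
)

# import the file; show_col_types = FALSE silences the column-type message
toy_raw <- read_csv(toy_path, show_col_types = FALSE)

# number of observations (rows) and variables (columns)
nrow(toy_raw)
ncol(toy_raw)
```

Applying nrow() and ncol() to customers_raw in the same way should agree with the counts shown in the Global Environment window.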

Task 2: Inspect Data

For this task, you will inspect the data.

Task 2.1

Use the glimpse() function to preview the customers_raw data table.

Question 2.1: Answer these questions: (1) Which variable is listed third? (2) What type of variable (e.g., numeric, factor, character, logical) is Ethnicity? (3) What is the first value of Balance?

Response 2.1: (1) Limit (2) Character (3) 333

### examine data with functions
## using glimpse() from tibble
glimpse(customers_raw)
## Rows: 400
## Columns: 12
## $ ID        <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ Income    <dbl> 14.891, 106.025, 104.593, 148.924, 55.882, 80.180, 20.996, 7…
## $ Limit     <dbl> 3606, 6645, 7075, 9504, 4897, 8047, 3388, 7114, 3300, 6819, …
## $ Rating    <dbl> 283, 483, 514, 681, 357, 569, 259, 512, 266, 491, 589, 138, …
## $ Cards     <dbl> 2, 3, 4, 3, 2, 4, 2, 2, 5, 3, 4, 3, 1, 1, 2, 3, 3, 3, 1, 2, …
## $ Age       <dbl> 34, 82, 71, 36, 68, 77, 37, 87, 66, 41, 30, 64, 57, 49, 75, …
## $ Education <dbl> 11, 15, 11, 11, 16, 10, 12, 9, 13, 19, 14, 16, 7, 9, 13, 15,…
## $ Married   <chr> "Yes", "Yes", "No", "No", "Yes", "No", "No", "No", "No", "Ye…
## $ Ethnicity <chr> "Caucasian", "Asian", "Asian", "Asian", "Caucasian", "Caucas…
## $ Gender    <chr> "Male", "Female", "Male", "Female", "Male", "Male", "Female"…
## $ Student   <chr> "No", "Yes", "No", "No", "No", "No", "No", "No", "No", "Yes"…
## $ Balance   <dbl> 333, 903, 580, 964, 331, 1151, 203, 872, 279, 1350, 1407, 0,…
## list column names
names(customers_raw)
##  [1] "ID"        "Income"    "Limit"     "Rating"    "Cards"     "Age"      
##  [7] "Education" "Married"   "Ethnicity" "Gender"    "Student"   "Balance"
### print parts of data to Console
## simply type the name of object;
## tibbles give a preview of data in tabular form
customers_raw

Task 3: Clean Data

For this task, you will clean the data.

Task 3.1

Create a new data object named customers_work from customers_raw using one piped command.

In the piped command, you will first pipe customers_raw to the mutate() function. Inside the mutate() function, you will use the across() function to convert character columns to factor columns. Make sure to correctly reference the required columns. You will pipe the result to the clean_names() function to convert variable names to snake case.

Apply glimpse() to customers_work to preview the working data table.

Question 3.1: How many factor columns (indicated by fct) did you create?

Response 3.1: 2 columns (ethnicity and gender)

### clean data
## note the pipe (%>%) operator
## save as new data object
customers_work <- customers_raw %>%
  ## mutate variables
  mutate(
    ## across variables
    across(
      ## choose variables
      .cols = Gender:Ethnicity,
      ## functions
      .fns = as_factor
    )
  ) %>%
  ## convert names to snake case
  clean_names()

### inspect clean data
## glimpse the data
glimpse(customers_work)
## Rows: 400
## Columns: 12
## $ id        <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ income    <dbl> 14.891, 106.025, 104.593, 148.924, 55.882, 80.180, 20.996, 7…
## $ limit     <dbl> 3606, 6645, 7075, 9504, 4897, 8047, 3388, 7114, 3300, 6819, …
## $ rating    <dbl> 283, 483, 514, 681, 357, 569, 259, 512, 266, 491, 589, 138, …
## $ cards     <dbl> 2, 3, 4, 3, 2, 4, 2, 2, 5, 3, 4, 3, 1, 1, 2, 3, 3, 3, 1, 2, …
## $ age       <dbl> 34, 82, 71, 36, 68, 77, 37, 87, 66, 41, 30, 64, 57, 49, 75, …
## $ education <dbl> 11, 15, 11, 11, 16, 10, 12, 9, 13, 19, 14, 16, 7, 9, 13, 15,…
## $ married   <chr> "Yes", "Yes", "No", "No", "Yes", "No", "No", "No", "No", "Ye…
## $ ethnicity <fct> Caucasian, Asian, Asian, Asian, Caucasian, Caucasian, Africa…
## $ gender    <fct> Male, Female, Male, Female, Male, Male, Female, Male, Female…
## $ student   <chr> "No", "Yes", "No", "No", "No", "No", "No", "No", "No", "Yes"…
## $ balance   <dbl> 333, 903, 580, 964, 331, 1151, 203, 872, 279, 1350, 1407, 0,…
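
A column range such as Gender:Ethnicity converts only the columns between those two names. As a hedged alternative, the sketch below selects columns by type with where(is.character), using a made-up table; toy_raw and toy_work are hypothetical names, and this is not necessarily the selection the assignment expects.

```r
library(dplyr)
library(forcats)
library(janitor)

# hypothetical toy data with two character columns
toy_raw <- tibble(
  ID = 1:3,
  Married = c("Yes", "No", "Yes"),
  Student = c("No", "No", "Yes")
)

toy_work <- toy_raw %>%
  # convert every character column to a factor
  mutate(across(.cols = where(is.character), .fns = as_factor)) %>%
  # convert names to snake case
  clean_names()

# count the factor columns that were created
n_factor_cols <- sum(sapply(toy_work, is.factor))
n_factor_cols
```

Selecting by type avoids depending on column order, at the cost of converting every character column whether or not you wanted it converted.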

Task 4: Sample Data

For this task, you will sample data from customers_work.

Task 4.1

Create a sample of the working data named customers_work_samp from customers_work. Set the random seed to 547 and randomly sample 400 individuals from customers_work. This will simply re-order the rows of the data. Print a preview of customers_work_samp by typing its name as a line of code.

Question 4.1: What is the age of the individual with id = 73?

Response 4.1: 47 years

### create a reproducible random sample of the working data
## set the random seed of computer
set.seed(547)

### create new object of sampled data
## save as new data
customers_work_samp <- customers_work %>%
  ## randomly sample
  slice_sample(n = 400)

## print preview of updated data
customers_work_samp
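
The sketch below illustrates, on a made-up table, why setting the seed matters and why sampling as many rows as the table contains only re-orders it. The names toy, samp_a, and samp_b are hypothetical.

```r
library(dplyr)

# hypothetical toy data
toy <- tibble(id = 1:10)

# sample all rows twice with the same seed
set.seed(547)
samp_a <- toy %>% slice_sample(n = nrow(toy))

set.seed(547)
samp_b <- toy %>% slice_sample(n = nrow(toy))

# same seed, same row order
identical(samp_a, samp_b)

# same set of rows as the original, only re-ordered
setequal(samp_a$id, toy$id)
```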

Task 4.2

Update customers_work_samp by:

  1. renaming id to initial_id,
  2. creating a new column named sample_id to represent the new ordering of rows,
  3. updating income by multiplying it by 1000,
  4. calculating a new variable named income_card_ratio by taking the ratio of income to cards, and
  5. relocating sample_id to be the first column.

Print a preview of the updated customers_work_samp by typing its name as a line of code. Make sure your window is wide enough to display all of the columns before you print customers_work_samp.

Question 4.2: Answer these questions: (1) What is the income of the individual with sample_id = 6? (2) What is the income_card_ratio of the individual with sample_id = 9?

Response 4.2: (1) 23283 (2) 6408

### rename, update, and create variables
## overwrite data
customers_work_samp <- customers_work_samp %>%
  ## rename variable
  rename(
    # new = old
    initial_id = id
  ) %>%
  ## update and calculate variables
  mutate(
    # sample id
    sample_id = 1:n(),
    # update income
    income = income * 1000,
    # income per card
    income_card_ratio = income / cards
  ) %>%
  ## move variable
  relocate(
    # place as first variable
    sample_id
  )
 
## print preview of updated data
customers_work_samp
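
One detail worth checking in this step is that mutate() evaluates its arguments in order, so income_card_ratio is computed from the already rescaled income. A minimal sketch with made-up values (toy and toy_updated are hypothetical names):

```r
library(dplyr)

# hypothetical toy data; income is recorded in thousands
toy <- tibble(
  id = c(7, 3, 9),
  income = c(20, 90, 45),
  cards = c(2, 3, 5)
)

toy_updated <- toy %>%
  # new = old
  rename(initial_id = id) %>%
  mutate(
    # new row order
    sample_id = row_number(),
    # rescale income first
    income = income * 1000,
    # uses the rescaled income from the line above
    income_card_ratio = income / cards
  ) %>%
  # place sample_id first
  relocate(sample_id)

toy_updated
```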

Task 5: Query Data

For this task, you will query customers_work_samp in more detail.

Task 5.1

Use the slice_head() function to view the first 15 rows of customers_work_samp.

Question 5.1: What is the credit rating (rating) of the credit card holder with sample_id = 13?

Response 5.1: 287

### view top set of rows
## call data
customers_work_samp %>%
  ## slice the top rows
  slice_head(n = 15)

Task 5.2

Call customers_work_samp and apply the slice_max() function to find the 10 individuals with the highest credit card limit (limit) values.

Question 5.2: Answer these questions: (1) What is the highest credit card limit value in the data table? (2) What is the initial_id of the individual with the highest credit card limit?

Response 5.2: (1) 13913 (2) 324

### select particular rows by condition
## call data
customers_work_samp %>%
  ## slice for maximum value
  slice_max(limit, n = 10)

Task 5.3

Use a piped command to:

  1. call customers_work_samp,
  2. select sample_id, income, limit, and married, and
  3. filter the rows by income greater than 60000, limit greater than 5000, and a Yes response to married.

The result should print a preview of the first 10 rows that meet these conditions.

Question 5.3: Answer these questions: (1) How many credit card customers meet these conditions? (2) Is the person with sample_id = 5 listed?

Response 5.3: (1) 10 (2) No

### select columns and filter rows
## call data
customers_work_samp %>%
  # select variables
  select(sample_id, income, limit, married) %>%
  # filter rows; the tibble header (# A tibble: n × 4) reports how many rows match
  filter(income > 60000, limit > 5000, married == "Yes")
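
A preview of 10 rows cannot show more than 10 matches, so the number of customers meeting the conditions is safer to confirm by counting rows directly. The sketch below uses made-up data in which 12 rows meet the conditions; toy, matches, n_matches, and n_preview are hypothetical names.

```r
library(dplyr)

# hypothetical toy data: 12 of the 15 rows meet all three conditions
toy <- tibble(
  sample_id = 1:15,
  income = c(rep(70000, 12), rep(30000, 3)),
  limit = 6000,
  married = "Yes"
)

matches <- toy %>%
  filter(income > 60000, limit > 5000, married == "Yes")

# total number of rows meeting the conditions
n_matches <- nrow(matches)

# a 10-row preview is capped at 10 rows regardless of the total
n_preview <- nrow(slice_head(matches, n = 10))
```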

Task 6: Describe Data

For this task, you will describe the data.

Task 6.1

In one piped command, call customers_work_samp and apply:

  1. the tabyl() function to ethnicity and gender,
  2. the adorn_percentages() function with the denominator set to all,
  3. the adorn_pct_formatting() function,
  4. the adorn_ns() function, and
  5. the adorn_title() function.

Question 6.1: Answer these questions: (1) What is the percentage of female Asians in the data table? (2) How many male African Americans are there in the data table?

Response 6.1: (1) 13.8% (2) 49

### table of percentages
## call data
customers_work_samp %>%
  ## table
  tabyl(ethnicity, gender) %>%
  ## add percentages
  adorn_percentages(
    # use total count
    denominator = "all"
  ) %>%
  ## percent format
  adorn_pct_formatting() %>%
  ## add counts
  adorn_ns() %>%
  ## column variable label
  adorn_title()
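
With the denominator set to all, each cell is divided by the grand total of the table. The sketch below reproduces that arithmetic with dplyr, using the ethnicity counts by gender reported in the Task 6.2 summary; the spelled-out label "African American" and the names counts, cell_pct, and female_asian_pct are assumptions for illustration.

```r
library(dplyr)

# counts copied from the top_counts column of the Task 6.2 summary
counts <- tibble(
  ethnicity = rep(c("Caucasian", "Asian", "African American"), times = 2),
  gender = rep(c("Female", "Male"), each = 3),
  n = c(102, 55, 50, 97, 47, 49)
)

cell_pct <- counts %>%
  # denominator = "all": divide each cell by the grand total
  mutate(pct = 100 * n / sum(n))

# percentage of all customers who are female and Asian
female_asian_pct <- cell_pct %>%
  filter(ethnicity == "Asian", gender == "Female") %>%
  pull(pct)

female_asian_pct
```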

Task 6.2

In one piped command, call customers_work_samp and apply:

  1. the group_by() function to gender,
  2. the select() function to remove sample_id and initial_id, and
  3. the skim_without_charts() function.

Question 6.2: Answer these questions: (1) What is the standard deviation (sd) of credit card balance (balance) for men? (2) What is the third quartile (p75) of credit rating (rating) for women?

Response 6.2: (1) 462.12 (2) 439.5

### summarize data by group
## call data
customers_work_samp %>%
  # group by gender
  group_by(gender) %>%
  # remove sample_id and initial_id
  select(-c(sample_id, initial_id)) %>%
  # summarize without charts
  skim_without_charts()
Data summary

Name                    Piped data
Number of rows          400
Number of columns       12
_______________________
Column type frequency:
  character             2
  factor                1
  numeric               8
________________________
Group variables         gender

Variable type: character

skim_variable  gender  n_missing  complete_rate  min  max  empty  n_unique  whitespace
married        Male            0              1    2    3      0         2           0
married        Female          0              1    2    3      0         2           0
student        Male            0              1    2    3      0         2           0
student        Female          0              1    2    3      0         2           0

Variable type: factor

skim_variable  gender  n_missing  complete_rate  ordered  n_unique  top_counts
ethnicity      Male            0              1  FALSE           3  Cau: 97, Afr: 49, Asi: 47
ethnicity      Female          0              1  FALSE           3  Cau: 102, Asi: 55, Afr: 50

Variable type: numeric

skim_variable      gender  n_missing  complete_rate      mean        sd        p0       p25    p50      p75    p100
income             Male            0              1  45610.32  35638.22  10354.00  20088.00  33437  58063.0  182728
income             Female          0              1  44853.93  34955.47  10363.00  21917.00  32164  57269.5  186634
limit              Male            0              1   4713.17   2360.21    855.00   2998.00   4534   5884.0   13913
limit              Female          0              1   4756.52   2264.16    855.00   3194.00   4768   5852.0   13414
rating             Male            0              1    353.52    157.80     93.00    235.00    340    431.0     982
rating             Female          0              1    356.27    152.17    117.00    251.50    355    439.5     949
cards              Male            0              1      2.99      1.46      1.00      2.00      3      4.0       9
cards              Female          0              1      2.93      1.28      1.00      2.00      3      4.0       7
age                Male            0              1     55.60     16.99     24.00     42.00     55     69.0      98
age                Female          0              1     55.73     17.53     23.00     41.00     57     70.0      91
education          Male            0              1     13.47      3.10      6.00     11.00     14     16.0      20
education          Female          0              1     13.43      3.16      5.00     11.00     14     16.0      20
balance            Male            0              1    509.80    462.12      0.00      5.00    463    815.0    1999
balance            Female          0              1    529.54    458.46      0.00     92.00    456    887.0    1809
income_card_ratio  Male            0              1  19943.30  20008.10   1349.12   6497.00  12987  24824.0  149316
income_card_ratio  Female          0              1  19156.20  19338.68   2168.40   7078.92  12319  25110.5  158889
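
The sd and p75 columns can be reproduced for a single variable with group_by() and summarise(). The sketch below uses made-up ratings and assumes R's default quantile interpolation, which is why a quartile such as 439.5 can fall between observed values; toy and toy_summary are hypothetical names.

```r
library(dplyr)

# hypothetical toy data
toy <- tibble(
  gender = rep(c("Male", "Female"), each = 4),
  rating = c(100, 200, 300, 400, 150, 250, 350, 450)
)

toy_summary <- toy %>%
  # group by gender
  group_by(gender) %>%
  # standard deviation and third quartile per group
  summarise(
    sd = sd(rating),
    p75 = quantile(rating, probs = 0.75, names = FALSE),
    .groups = "drop"
  )

toy_summary
```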

Task 7: Visualize Data

For this task, you will visualize the data.

Task 7.1

Use ggplot() to call customers_work_samp. Then, add the following layers:

  1. a geom_bar() layer setting cards to the x-axis,
  2. a labs() layer to add appropriate labels for the axes, and
  3. a scale_x_continuous() layer setting breaks to seq(0, 10, 1).

Question 7.1: What is the most frequent number of credit cards held by individuals in this data table?

Response 7.1: 2 credit cards

### plot single discrete variable
## choose data and mapping
ggplot(
  # data
  data = customers_work_samp
) +
  ## bar geometry; counts the rows at each value of x
  geom_bar(
    # mapping  
    mapping = aes(
      # variable 
      x = cards
    )
  ) +
  ## label axes
  labs(x = "Number of Credit Cards", y = "Count") +
  ## change format of x-axis
  scale_x_continuous(
    # axis breaks
    breaks = seq(0, 10, 1)
  )

Task 7.2

Use ggplot() to call customers_work_samp and map balance to the x-axis. Then, add the following layers:

  1. a geom_density() layer setting fill to darkred, color to purple, and alpha to 0.3,
  2. a scale_x_continuous() layer setting labels to dollar format and the number of breaks to 6,
  3. a labs() layer to add appropriate labels for the axes,
  4. a ggtitle() layer to add a Distribution of Balance title, and
  5. a theme_hc() layer.

Question 7.2: Is credit card balance normally distributed (i.e., symmetric and bell-shaped)?

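Response 7.2: No. Balance is right-skewed rather than symmetric and bell-shaped: many customers have a balance at or near $0, and a long right tail extends to about $2,000.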

### histogram of single continuous variable
## choose data and mapping
ggplot(
  # data
  data = customers_work_samp, 
  # mapping
  mapping = aes(
    # x-axis
    x = rating
  )
) +
  ## histogram geometry with 10 bins
  geom_histogram(bins = 10, fill = "skyblue") +
  ## text geometry showing the count in each bar
  stat_bin(
    ## geometry
    geom = "text",
    # add label 
    aes(
      # label each bin with its count
      label = after_stat(count), group = 1
    ),
    # number of bins (matches the histogram)
    bins = 10,
    # color of text
    color = "black",
    # size of text
    size = 5,
    # position label in middle of bars
    position = position_stack(vjust = 0.5)
  ) +
  ## adjust x-axis scale
  scale_x_continuous(n.breaks = 10) +
  ## label axes
  labs(x = "Rating", y = "Count") 

### density plot of single continuous variable
## choose data and mapping
ggplot(
  # data
  data = customers_work_samp, 
  # mapping
  mapping = aes(x = balance)
) +
  ## density geometry
  geom_density(fill = "darkred", color = "purple", alpha = 0.3) +
  ## scale x-axis
  scale_x_continuous(
    # convert labels to dollars
    labels = dollar_format(),
    # number of breaks
    n.breaks = 6
  ) +
  ## label axes
  labs(x = "Balance", y = "Density") +
  ## title
  ggtitle(
    # main title
    "Distribution of Balance"
  ) +
  ## alter theme
  theme_hc()
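
A quick numeric check can complement a density plot when judging symmetry: in a symmetric distribution the mean and median are close, while a mean well above the median points to right skew. The sketch below uses made-up values, not the assignment data; toy_balance, toy_mean, and toy_median are hypothetical names.

```r
# hypothetical right-skewed toy values with several zeros
toy_balance <- c(0, 0, 0, 5, 90, 300, 460, 800, 1200, 1999)

# center of the distribution measured two ways
toy_mean <- mean(toy_balance)
toy_median <- median(toy_balance)

# a mean well above the median suggests a long right tail
toy_mean > toy_median
```

This is only a rough heuristic. It does not replace looking at the plot.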

Task 7.3

Create a plot object named limit_balance_plot. Use ggplot() to call customers_work_samp and map limit to the x-axis and balance to the y-axis. Then, add the following layers:

  1. a geom_point() layer setting alpha to 0.3 and color to red,
  2. a geom_smooth() layer setting method to loess, se to FALSE, and color to blue,
  3. a scale_x_continuous() layer setting labels to dollar format and n.breaks to 10,
  4. a scale_y_continuous() layer setting labels to dollar format and n.breaks to 8,
  5. a labs() layer to add appropriate labels for the axes,
  6. a ggtitle() layer to add a Relationship Between Credit Limit and Balance title, and
  7. a theme_few() layer.

Print the plot to display it in the Plots window.

Question 7.3: Is the relationship between credit limit and balance linear?

Response 7.3: It is not perfectly linear but more linear than curvilinear.

# Create the plot object
limit_balance_plot <- ggplot(customers_work_samp, aes(x = limit, y = balance)) +
  geom_point(alpha = 0.3, color = "red") +
  geom_smooth(method = "loess", se = FALSE, color = "blue") +
  scale_x_continuous(labels = dollar_format(), n.breaks = 10) +
  scale_y_continuous(labels = dollar_format(), n.breaks = 8) +
  labs(x = "Credit Card Limit", y = "Credit Card Balance") +
  ggtitle("Relationship Between Credit Limit and Balance") +
  theme_few()
# Print the plot to display it
limit_balance_plot

Task 8: Pearson’s Correlation Coefficient

For this task, you will perform a Pearson’s linear correlation test.

Task 8.1

Create the following single piped command:

First, pipe customers_work_samp into select() to choose all numeric variables with where(is.numeric) while excluding any identifying variables. Pipe the result to correlate().

Question 8.1: Answer these questions: (1) What two variables have the smallest linear correlation in absolute value? (2) What two variables have the largest linear correlation in absolute value?

Response 8.1: (1) age and balance (2) limit and rating

### compute Pearson's linear correlation coefficient
## call data
customers_work_samp %>%
  ## select variables
  select(
    # continuous variables
    where(is.numeric),
    # remove ID variables
    -contains("id")
  ) %>%
  ## correlation
  correlate()
# Store the correlation matrix
cor_matrix <- customers_work_samp %>%
  select(
    where(is.numeric),
    -contains("id")
  ) %>%
  correlate()

# Convert to long format and remove diagonal/duplicates
cor_long <- cor_matrix %>%
  stretch() %>%                    # converts to long format
  filter(!is.na(r)) %>%           # remove NA values
  filter(r != 1)                  # remove perfect correlations (diagonal)

# Find smallest absolute correlation
smallest_cor <- cor_long %>%
  mutate(abs_r = abs(r)) %>%
  slice_min(abs_r, n = 1)

# Find largest absolute correlation  
largest_cor <- cor_long %>%
  mutate(abs_r = abs(r)) %>%
  slice_max(abs_r, n = 1)

print("Smallest absolute correlation:")
## [1] "Smallest absolute correlation:"
print(smallest_cor)
## # A tibble: 2 × 4
##   x       y             r   abs_r
##   <chr>   <chr>     <dbl>   <dbl>
## 1 age     balance 0.00184 0.00184
## 2 balance age     0.00184 0.00184
print("Largest absolute correlation:")  
## [1] "Largest absolute correlation:"
print(largest_cor)
## # A tibble: 2 × 4
##   x      y          r abs_r
##   <chr>  <chr>  <dbl> <dbl>
## 1 limit  rating 0.997 0.997
## 2 rating limit  0.997 0.997
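
Each pair appears twice in the output above because the long format keeps both (x, y) and (y, x). As a hedged alternative, the base R sketch below keeps each pair once by using only the upper triangle of the correlation matrix. The data are made up, and toy, cor_mat, abs_mat, largest, and smallest are hypothetical names.

```r
# hypothetical toy data
toy <- data.frame(
  a = c(1, 2, 3, 4, 5),
  b = c(2, 4, 6, 8, 11),
  d = c(5, 1, 4, 2, 3)
)

# Pearson correlation matrix
cor_mat <- cor(toy)

# keep each pair once by blanking the diagonal and lower triangle
cor_mat[!upper.tri(cor_mat)] <- NA

# locate the largest and smallest absolute correlations
abs_mat <- abs(cor_mat)
largest <- which(abs_mat == max(abs_mat, na.rm = TRUE), arr.ind = TRUE)
smallest <- which(abs_mat == min(abs_mat, na.rm = TRUE), arr.ind = TRUE)

# variable names of each pair
largest_pair <- c(rownames(cor_mat)[largest[1, "row"]], colnames(cor_mat)[largest[1, "col"]])
smallest_pair <- c(rownames(cor_mat)[smallest[1, "row"]], colnames(cor_mat)[smallest[1, "col"]])
```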

Task 8.2

Compute a correlation test using cor_test(). Specify customers_work_samp as the data input and cards and balance as the continuous variables. Bind the name cor_test_res to the object. Print cor_test_res to view the result.

Question 8.2: Answer these questions: (1) What is the correlation value? (2) What is the empirical t-value? (3) What is the frequentist probability value?

Response 8.2: (1) 0.086 (2) 1.73 (3) 0.0842

### Pearson's linear correlation coefficient
## create object
cor_test_res <- cor_test(
  # data
  customers_work_samp,
  # continuous variables 
  cards, balance
)

## print result
cor_test_res
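
The t-value and p-value of a Pearson correlation test follow from the correlation and the sample size: t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom. The sketch below applies that formula to values taken from this assignment (n = 400 rows and the r-squared of 0.0074747 reported in Response 8.4). The names n, r, t_value, and p_value are chosen here for illustration.

```r
# values taken from the assignment
n <- 400
r <- sqrt(0.0074747)

# t statistic for a Pearson correlation
t_value <- r * sqrt(n - 2) / sqrt(1 - r^2)

# two-sided p-value from the t distribution with n - 2 degrees of freedom
p_value <- 2 * pt(-abs(t_value), df = n - 2)

t_value
p_value
```

The results should be close to the statistic and p columns printed by cor_test(), up to rounding.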

Task 8.3

Perform two tasks.

First, calculate the observed correlation using the infer functions and bind the name, corr_res, to it. Call customers_work_samp and pipe it to specify() and set the formula input to balance ~ cards. Pipe the result to calculate() and set stat to “correlation”. Print corr_res to view the result.

Second, produce a visualization using the infer functions. Call customers_work_samp and pipe it to specify() and set the formula input to balance ~ cards. Pipe the result to hypothesize() and set null to “independence”. Pipe the result to generate() and set reps to 2000 and type to “permute”. Pipe the result to calculate() and set stat to “correlation”. Pipe the result to visualize(). Because visualize() returns a ggplot object, add the remaining layers with + rather than a pipe: add shade_p_value() with corr_res as the observed result and direction set to “two-sided”, then add labs() with appropriate axes labels.

Question 8.3: What does the visualization highlight?

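Response 8.3: The visualization highlights the null distribution: the correlation values we would expect to see between balance and cards if there were truly no relationship between these variables. It also marks the observed correlation and shades the regions of the null distribution that are at least as extreme in either direction, which correspond to the two-sided p-value.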

## observed correlation
## save
corr_res <- customers_work_samp %>% 
  ## specify relationship
  specify(
    # formula
    balance ~ cards
  ) %>% 
  ## calculate observed statistic
  calculate(
    # statistic
    stat = "correlation"
  )

## print
corr_res
### visualize the null distribution 
### and observed statistic
## call data
customers_work_samp %>% 
  ## specify relationship
  specify(
    # formula
    balance ~ cards
  ) %>% 
  ## null hypothesis
  hypothesize(
    # null
    null = "independence"
  ) %>%
  generate(
    # repetitions
    reps = 2000,
    # type
    type = "permute"
  ) %>%
  calculate(
    # statistic
    stat = "correlation"
  ) %>%
  ## visualize null distribution
  visualize() +
  ## add observed statistic
  shade_p_value(
    # observed result
    corr_res,
    # direction
    direction = "two-sided"
  ) +
  ## labels
  labs(
    # x-axis
    x = "Correlation",
    # y-axis
    y = "Count"
  )
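
The permutation logic behind this visualization can be sketched in base R. This is a simplified illustration on made-up data, not the infer implementation: the seed, the toy variables, and the p-value rule (the share of shuffled correlations at least as extreme in absolute value) are assumptions chosen for the example.

```r
# arbitrary seed so the sketch is reproducible
set.seed(1)

# hypothetical toy data with a weak positive relationship
cards <- sample(1:6, size = 100, replace = TRUE)
balance <- 500 + 20 * cards + rnorm(100, sd = 400)

# observed correlation
obs_r <- cor(cards, balance)

# null distribution: shuffle balance to break any link with cards
null_r <- replicate(2000, cor(cards, sample(balance)))

# two-sided p-value: share of shuffled correlations at least as extreme
p_value <- mean(abs(null_r) >= abs(obs_r))
```

Shuffling one variable keeps both marginal distributions intact while removing any association, which is what the independence null hypothesis describes.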

Task 8.4

Perform these tasks.

First, create a plot to visualize the linear correlation by calling ggplot(), setting the data input to customers_work_samp, and mapping cards to the x-axis and balance to the y-axis. Then, add the following layers:

  1. a geom_point() layer with alpha set to 0.3 and color set to “red”,
  2. a geom_smooth() layer with method set to “loess”, se set to FALSE, and color set to “blue”,
  3. a scale_x_continuous() layer setting n.breaks to 8,
  4. a scale_y_continuous() layer setting labels to dollar format and n.breaks to 10, and
  5. a labs() layer setting the x-axis and y-axis labels to “Cards” and “Balance”, respectively.

Second, calculate the effect size of the linear correlation.

Question 8.4: Answer these questions: (1) Examining the plot, do cards and balance display much of a relationship? (2) What does the r-squared calculation imply about the size of the relationship between cards and balance?

Response 8.4: (1) No. Although the blue loess line is fairly straight, the points scatter widely around it, so cards and balance display little relationship. (2) The r-squared of 0.0074747 means that cards accounts for less than 1% of the variance in balance, which is a very weak relationship.

### visualize correlation
## call function
ggplot(
  ## data
  customers_work_samp, 
  ## mapping
  aes(x = cards, y = balance)
) +
  ## points
  geom_point(alpha = 0.3, color = "red") +
  ## loess line
  geom_smooth(method = "loess", se = FALSE, color = "blue", span = 0.9) +
  ## scale x-axis
  scale_x_continuous(
    # breaks
    n.breaks = 8
  ) +
  ## scale y-axis
  scale_y_continuous(
    # convert labels to dollars
    labels = dollar_format(),
    # breaks
    n.breaks = 10
  ) +
  ## label axes
  labs(x = "Cards", y = "Balance")

## r-squared: effect size calculation
corr_res^2

Task 9: Save Objects

For this task, you will save objects you created.

Task 9.1

Save the data object customers_work_samp as a data file named customers_work_samp.csv to the data folder of the project directory. Use write_csv() and here() to accomplish this task.

### save working data
## use write_csv() to export as a csv data file
write_csv(
  ## name of object
  customers_work_samp,
  ## use here() to export data to project directory;
  here(
    # folder
    "data", 
    # file
    "customers_work_samp.csv"
  )
)

Task 9.2

Save the plot object limit_balance_plot as a file named limit_balance.png to the plots folder of the project directory. Use ggsave() and here() to accomplish this task.

### save a single plot to a file
## call function
ggsave(
  ## file path
  here(
    # folder
    "plots", 
    # file
    "limit_balance.png"
  ), 
  ## plot object
  plot = limit_balance_plot,
  ## dimensions
  units = "in", width = 8, height = 5
)

Task 10: Conceptual Questions

For your last task, you will respond to conceptual questions.

Question 10.1: What are some common tasks for querying data?

Response 10.1: Some common tasks for querying data include filtering, grouping and summarizing, transforming variables, and joining data.

Question 10.2: Answer these two questions: (1) What information does a frequentist probability value from a null hypothesis test provide? (2) What information does an effect size calculation provide?

Response 10.2: (1) A frequentist probability value (p-value) tells us the probability of observing a result at least as extreme as ours if the null hypothesis were true. A low p-value (< 0.05) suggests evidence against the null hypothesis, while a high p-value (≥ 0.05) suggests insufficient evidence to reject it. (2) An effect size calculation tells us the magnitude of the relationship or difference: how large or meaningful it is.

Question 10.3: Why is it important to write scripts for analytics?

Response 10.3: Writing scripts for analytics matters for several reasons. Scripts can be rerun exactly, so an analysis can be replicated. They serve as a complete record of the analytical steps, which makes the work more transparent. They also reduce the transcription errors and forgotten steps that occur with point-and-click interfaces, which makes them especially useful for larger data sets and more complex analyses.