This assignment reviews the Describing and Visualizing Data content. To complete it, use the describe_visualize_data.Rmd file reviewed as part of the lecture for this class session: copy the relevant code from that file, update it to answer the questions in this assignment, and respond to the questions in each section after executing the relevant code. Submit this assignment to its Submissions folder on D2L. You will submit two files:
To start:
First, create a folder on your computer to save all relevant files for this course. If you did not do so already, you will want to create a folder named gsb_804 that contains all of the materials for this course.
Second, inside of gsb_804, you will create a folder to host assignments. You can name that folder assignments.
Third, inside of assignments, you will create folders for each assignment. You can name the folder for this first assignment: 01_describe_visualize_data.
Fourth, create two additional folders in 01_describe_visualize_data named scripts and data. Store this script in the scripts folder and the data for this assignment in the data folder. Create a plots folder as well.
Fifth, go to the File menu in RStudio, select New Project…, choose Existing Directory, and navigate to your /gsb_804/assignments/01_describe_visualize_data folder to select it as the top-level directory for this R Project.
The first code chunk sets the global settings for the remaining code chunks in the document. Do not change anything in this code chunk.
In this code chunk, we load the packages we need for this assignment:
Make sure you installed these packages when you reviewed the analytical lecture.
We will use functions from these packages to examine the data. Do not change anything in this code chunk.
## here for project workflow
library(here)
## tidyverse for data manipulation and plotting;
## loads its core packages simultaneously
library(tidyverse)
## scales for formatting variable scales
library(scales)
## janitor for variable names and tables
library(janitor)
## skimr to summarize data
library(skimr)
## ggthemes for plots
library(ggthemes)
## infer for inferential frequentist statistics
library(infer)
## corrr for correlations
library(corrr)
## rstatix to compute statistical tests
## and effect sizes
library(rstatix)
For this task, you import the data of interest.
Use the read_csv() and here() functions to import the credit_card_customers.csv data file. Save the data as an object named customers_raw.
Question 1.1: After you load the data, look at your Global Environment window. (1) How many observations are there in the data? (2) How many variables are there in the data?
Response 1.1: (1) 400 observations (2) 12 variables
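The import chunk itself is not shown above; a minimal sketch, assuming credit_card_customers.csv sits in the project's data folder:
### import data
## use read_csv() and here() to locate and read the file
customers_raw <- read_csv(
  here("data", "credit_card_customers.csv")
)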
For this task, you will inspect the data.
Use the glimpse() function to preview the customers_raw data table.
Question 2.1: Answer these questions: (1) Which variable is listed third? (2) What type of variable (e.g., numeric, factor, character, logical) is Ethnicity? (3) What is the first value of Balance?
Response 2.1: (1) Limit (2) Character (3) 333
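The inspection chunk is not shown above; a minimal reconstruction applies glimpse() to the raw data to produce the preview below:
### inspect raw data
## glimpse the data
glimpse(customers_raw)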
## Rows: 400
## Columns: 12
## $ ID <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ Income <dbl> 14.891, 106.025, 104.593, 148.924, 55.882, 80.180, 20.996, 7…
## $ Limit <dbl> 3606, 6645, 7075, 9504, 4897, 8047, 3388, 7114, 3300, 6819, …
## $ Rating <dbl> 283, 483, 514, 681, 357, 569, 259, 512, 266, 491, 589, 138, …
## $ Cards <dbl> 2, 3, 4, 3, 2, 4, 2, 2, 5, 3, 4, 3, 1, 1, 2, 3, 3, 3, 1, 2, …
## $ Age <dbl> 34, 82, 71, 36, 68, 77, 37, 87, 66, 41, 30, 64, 57, 49, 75, …
## $ Education <dbl> 11, 15, 11, 11, 16, 10, 12, 9, 13, 19, 14, 16, 7, 9, 13, 15,…
## $ Married <chr> "Yes", "Yes", "No", "No", "Yes", "No", "No", "No", "No", "Ye…
## $ Ethnicity <chr> "Caucasian", "Asian", "Asian", "Asian", "Caucasian", "Caucas…
## $ Gender <chr> "Male", "Female", "Male", "Female", "Male", "Male", "Female"…
## $ Student <chr> "No", "Yes", "No", "No", "No", "No", "No", "No", "No", "Yes"…
## $ Balance <dbl> 333, 903, 580, 964, 331, 1151, 203, 872, 279, 1350, 1407, 0,…
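The variable names printed next presumably come from a names() call (the chunk itself is not shown):
## list the variable names
names(customers_raw)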
## [1] "ID" "Income" "Limit" "Rating" "Cards" "Age"
## [7] "Education" "Married" "Ethnicity" "Gender" "Student" "Balance"
### print parts of data to Console
## simply type the name of object;
## tibbles give a preview of data in tabular form
customers_raw
For this task, you will clean the data.
Create a new data object named customers_work from customers_raw using one piped command.
In the piped command, you will first pipe customers_raw to the mutate() function. Inside the mutate() function, you will use the across() function to convert character columns to factor columns. Make sure to correctly reference the required columns. You will pipe the result to the clean_names() function to convert variable names to snake case.
Apply glimpse() to customers_work to preview the working data table.
Question 3.1: How many factor columns (indicated by fct) did you create?
Response 3.1: 2 columns (ethnicity and gender)
### clean data
## note the pipe (%>%) operator
## save as new data object
customers_work <- customers_raw %>%
## mutate variables
mutate(
## across variables
across(
## choose variables
.cols = Gender:Ethnicity,
## functions
.fns = as_factor
)
) %>%
## convert names to snake case
clean_names()
### inspect clean data
## glimpse the data
glimpse(customers_work)
## Rows: 400
## Columns: 12
## $ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ income <dbl> 14.891, 106.025, 104.593, 148.924, 55.882, 80.180, 20.996, 7…
## $ limit <dbl> 3606, 6645, 7075, 9504, 4897, 8047, 3388, 7114, 3300, 6819, …
## $ rating <dbl> 283, 483, 514, 681, 357, 569, 259, 512, 266, 491, 589, 138, …
## $ cards <dbl> 2, 3, 4, 3, 2, 4, 2, 2, 5, 3, 4, 3, 1, 1, 2, 3, 3, 3, 1, 2, …
## $ age <dbl> 34, 82, 71, 36, 68, 77, 37, 87, 66, 41, 30, 64, 57, 49, 75, …
## $ education <dbl> 11, 15, 11, 11, 16, 10, 12, 9, 13, 19, 14, 16, 7, 9, 13, 15,…
## $ married <chr> "Yes", "Yes", "No", "No", "Yes", "No", "No", "No", "No", "Ye…
## $ ethnicity <fct> Caucasian, Asian, Asian, Asian, Caucasian, Caucasian, Africa…
## $ gender <fct> Male, Female, Male, Female, Male, Male, Female, Male, Female…
## $ student <chr> "No", "Yes", "No", "No", "No", "No", "No", "No", "No", "Yes"…
## $ balance <dbl> 333, 903, 580, 964, 331, 1151, 203, 872, 279, 1350, 1407, 0,…
For this task, you will sample data from customers_work.
Create a sample of the working data named customers_work_samp from customers_work. Set the random seed to 547 and randomly sample 400 individuals from customers_work. Because the sample size equals the number of rows, this simply re-orders the rows of the data. Print a preview of customers_work_samp by typing its name as a line of code.
Question 4.1: What is the age of the individual with id = 73?
Response 4.1: 47 years
### create a reproducible random sample of the working data
## set the random seed of computer
set.seed(547)
### create new object of sampled data
## save as new data
customers_work_samp <- customers_work %>%
## randomly sample
slice_sample(n = 400)
## print preview of updated data
customers_work_samp
Update customers_work_samp in one piped command by: renaming id to initial_id; creating a sample_id variable that numbers the rows from 1 to n(); converting income from thousands of dollars to dollars; computing an income_card_ratio variable as income divided by cards; and relocating sample_id to the first column position.
Print a preview of the updated customers_work_samp by typing its name as a line of code. Make sure your window is wide enough to display all of the columns before you print customers_work_samp.
Question 4.2: Answer these questions: (1) What is the income of the individual with sample_id = 6? (2) What is the income_card_ratio of the individual with sample_id = 9?
Response 4.2: (1) 23283 (2) 6408
### update the sampled data
## overwrite data
customers_work_samp <- customers_work_samp %>%
## rename variable
rename(
# new = old
initial_id = id
) %>%
## update and calculate variables
mutate(
# sample id
sample_id = 1:n(),
# convert income from thousands of dollars to dollars
income = income * 1000,
# income per card
income_card_ratio = income / cards
) %>%
## move variable
relocate(
# place as first variable
sample_id
)
## print preview of updated data
customers_work_samp
For this task, you will query customers_work_samp in more detail.
Use the slice_head() function to view the first 15 rows of customers_work_samp.
Question 5.1: What is the credit rating (rating) of the credit card holder with sample_id = 13?
Response 5.1: 287
### view top set of rows
## call data
customers_work_samp %>%
## slice the top rows
slice_head(n = 15)
Call customers_work_samp and apply the slice_max() function to find the 10 individuals with the highest credit card limit (limit) values.
Question 5.2: Answer these questions: (1) What is the highest credit card limit value in the data table? (2) What is the initial_id of the individual with the highest credit card limit?
Response 5.2: (1) 13913 (2) 324
### select particular rows by condition
## call data
customers_work_samp %>%
## slice for maximum value
slice_max(limit, n = 10)
Use a piped command to: select sample_id, income, limit, and married; filter for individuals with income greater than 60000, a limit greater than 5000, and married equal to "Yes"; and slice the first 10 rows.
The result should print a preview of the first 10 rows that meet these conditions.
Question 5.3: Answer these questions: (1) How many credit card customers meet these conditions? (2) Is the person with sample_id = 5 listed?
Response 5.3: (1) 10 (2) No
customers_work_samp %>%
# select variables
select(sample_id, income, limit, married) %>%
# filter rows
filter(income > 60000, limit > 5000, married == "Yes") %>%
# preview first 10 rows
slice_head(n = 10)
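Because slice_head() only previews the first 10 rows, the total number of customers meeting the conditions can be confirmed with a count; a sketch (not required by the assignment):
### count rows meeting the conditions
customers_work_samp %>%
# filter rows
filter(income > 60000, limit > 5000, married == "Yes") %>%
# count the remaining rows
count()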
For this task, you will describe the data.
In one piped command, call customers_work_samp and apply: tabyl() with ethnicity and gender; adorn_percentages() with denominator set to "all"; adorn_pct_formatting(); adorn_ns(); and adorn_title().
Question 6.1: Answer these questions: (1) What is the percentage of female Asians in the data table? (2) How many male African Americans are there in the data table?
Response 6.1: (1) 13.8% (2) 49
### table of percentages
## call data
customers_work_samp %>%
## table
tabyl(ethnicity, gender) %>%
## add percentages
adorn_percentages(
# use total count
denominator = "all"
) %>%
## percent format
adorn_pct_formatting() %>%
## add counts
adorn_ns() %>%
## column variable label
adorn_title()
In one piped command, call customers_work_samp and apply: group_by() with gender; select() to remove sample_id and initial_id; and skim_without_charts() to summarize the remaining variables by gender.
Question 6.2: Answer these questions: (1) What is the standard deviation (sd) of credit card balance (balance) for men? (2) What is the third quartile (p75) of credit rating (rating) for women?
Response 6.2: (1) 462 (2) 440
### summarize data by group
## call data
customers_work_samp %>%
# group by gender
group_by(gender) %>%
# remove sample_id and initial_id
select(-c(sample_id, initial_id)) %>%
# summarize without charts
skim_without_charts()
Name | Piped data |
---|---|
Number of rows | 400 |
Number of columns | 12 |
Column type frequency: | |
character | 2 |
factor | 1 |
numeric | 8 |
Group variables | gender |
Variable type: character
skim_variable | gender | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|---|
married | Male | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
married | Female | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
student | Male | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
student | Female | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
Variable type: factor
skim_variable | gender | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|---|
ethnicity | Male | 0 | 1 | FALSE | 3 | Cau: 97, Afr: 49, Asi: 47 |
ethnicity | Female | 0 | 1 | FALSE | 3 | Cau: 102, Asi: 55, Afr: 50 |
Variable type: numeric
skim_variable | gender | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
---|---|---|---|---|---|---|---|---|---|---|
income | Male | 0 | 1 | 45610.32 | 35638.22 | 10354.00 | 20088.00 | 33437 | 58063.0 | 182728 |
income | Female | 0 | 1 | 44853.93 | 34955.47 | 10363.00 | 21917.00 | 32164 | 57269.5 | 186634 |
limit | Male | 0 | 1 | 4713.17 | 2360.21 | 855.00 | 2998.00 | 4534 | 5884.0 | 13913 |
limit | Female | 0 | 1 | 4756.52 | 2264.16 | 855.00 | 3194.00 | 4768 | 5852.0 | 13414 |
rating | Male | 0 | 1 | 353.52 | 157.80 | 93.00 | 235.00 | 340 | 431.0 | 982 |
rating | Female | 0 | 1 | 356.27 | 152.17 | 117.00 | 251.50 | 355 | 439.5 | 949 |
cards | Male | 0 | 1 | 2.99 | 1.46 | 1.00 | 2.00 | 3 | 4.0 | 9 |
cards | Female | 0 | 1 | 2.93 | 1.28 | 1.00 | 2.00 | 3 | 4.0 | 7 |
age | Male | 0 | 1 | 55.60 | 16.99 | 24.00 | 42.00 | 55 | 69.0 | 98 |
age | Female | 0 | 1 | 55.73 | 17.53 | 23.00 | 41.00 | 57 | 70.0 | 91 |
education | Male | 0 | 1 | 13.47 | 3.10 | 6.00 | 11.00 | 14 | 16.0 | 20 |
education | Female | 0 | 1 | 13.43 | 3.16 | 5.00 | 11.00 | 14 | 16.0 | 20 |
balance | Male | 0 | 1 | 509.80 | 462.12 | 0.00 | 5.00 | 463 | 815.0 | 1999 |
balance | Female | 0 | 1 | 529.54 | 458.46 | 0.00 | 92.00 | 456 | 887.0 | 1809 |
income_card_ratio | Male | 0 | 1 | 19943.30 | 20008.10 | 1349.12 | 6497.00 | 12987 | 24824.0 | 149316 |
income_card_ratio | Female | 0 | 1 | 19156.20 | 19338.68 | 2168.40 | 7078.92 | 12319 | 25110.5 | 158889 |
For this task, you will visualize the data.
Use ggplot() to call customers_work_samp. Then, add the following layers: geom_bar() with cards mapped to the x-axis; labs() to label the axes; and scale_x_continuous() to set axis breaks from 0 to 10 in increments of 1.
Question 7.1: What is the most frequent number of credit cards held by individuals in this data table?
Response 7.1: 2 credit cards
### plot single discrete variable
## choose data and mapping
ggplot(
# data
data = customers_work_samp
) +
## bar geometry for counts
geom_bar(
# mapping
mapping = aes(
# variable
x = cards
)
) +
## label axes
labs(x = "Number of Credit Cards", y = "Count") +
## change format of x-axis
scale_x_continuous(
# axis breaks
breaks = seq(0, 10, 1)
)
Use ggplot() to call customers_work_samp and map balance to the x-axis. Then, add the following layers: geom_histogram() with 10 bins; stat_bin() with a text geometry to label the bars with their counts; scale_x_continuous() to adjust the x-axis breaks; and labs() to label the axes.
Question 7.2: Is credit card balance normally distributed (i.e., symmetric and bell-shaped)?
Response 7.2: It is mostly symmetric and bell-shaped.
### histogram of single continuous variable
## choose data and mapping
ggplot(
# data
data = customers_work_samp,
# mapping
mapping = aes(
# x-axis
x = balance
)
) +
## histogram geometry
geom_histogram(bins = 10, fill = "skyblue") +
## text geometry above bars
stat_bin(
## geometry
geom = "text",
# add label
aes(
# label bars with their counts
label = after_stat(count), group = 1
),
# number of bins
bins = 10,
# text color
color = "black",
# size of text
size = 5,
# position label in middle of bars
position = position_stack(vjust = 0.5)
) +
## adjust x-axis scale
scale_x_continuous(n.breaks = 10) +
## label axes
labs(x = "Rating", y = "Count")
### density plot of single continuous variable
## choose data and mapping
ggplot(
# data
data = customers_work_samp,
# mapping
mapping = aes(x = income)
) +
## density geometry
geom_density(fill = "darkblue", color = "white", alpha = 0.5) +
## scale x-axis
scale_x_continuous(
# convert labels to dollars
labels = dollar_format(),
# breaks
n.breaks = 8
) +
## label axes
labs(x = "Income", y = "Density") +
## title
ggtitle(
# main title
"Distribution of Income",
# subtitle
subtitle = "Credit Card Customers"
) +
## alter theme
theme_fivethirtyeight()
Create a plot object named limit_balance_plot. Use ggplot() to call customers_work_samp and map limit to the x-axis and balance to the y-axis. Then, add the following layers: geom_point() for the scatter; geom_smooth() with a loess line; scale_x_continuous() and scale_y_continuous() with dollar-formatted labels; labs() to label the axes; ggtitle() to add a title; and a theme layer.
Print the plot to display it in the Plots window.
Question 7.3: Is the relationship between credit limit and balance linear?
Response 7.3: It is not perfectly linear but more linear than curvilinear.
# Create the plot object
limit_balance_plot <- ggplot(customers_work_samp, aes(x = limit, y = balance)) +
geom_point(alpha = 0.3, color = "red") +
geom_smooth(method = "loess", se = FALSE, color = "blue") +
scale_x_continuous(labels = dollar_format(), n.breaks = 10) +
scale_y_continuous(labels = dollar_format(), n.breaks = 8) +
labs(x = "Credit Card Limit", y = "Credit Card Balance") +
ggtitle("Relationship Between Credit Limit and Balance") +
theme_few()
# Print the plot to display it
limit_balance_plot
For this task, you will perform a Pearson’s linear correlation test.
Create the following single piped command: pipe customers_work_samp into select() to choose all numeric variables with where(is.numeric) while excluding any identifying variables, then pipe the result to correlate().
Question 8.1: Answer these questions: (1) What two variables have the smallest linear correlation in absolute value? (2) What two variables have the largest linear correlation in absolute value?
Response 8.1: (1) age and balance (2) limit and rating
### compute Pearson's linear correlation coefficient
## call data
customers_work_samp %>%
## select variables
select(
# continuous variables
where(is.numeric),
# remove ID variables
-contains("id")
) %>%
## correlation
correlate()
# Store the correlation matrix
cor_matrix <- customers_work_samp %>%
select(
where(is.numeric),
-contains("id")
) %>%
correlate()
# Convert to long format and remove diagonal entries
cor_long <- cor_matrix %>%
stretch() %>% # converts to long format
filter(!is.na(r)) %>% # diagonal entries are NA; drop them
filter(r != 1) # drop any perfect correlations
# Find smallest absolute correlation
smallest_cor <- cor_long %>%
mutate(abs_r = abs(r)) %>%
slice_min(abs_r, n = 1)
# Find largest absolute correlation
largest_cor <- cor_long %>%
mutate(abs_r = abs(r)) %>%
slice_max(abs_r, n = 1)
print("Smallest absolute correlation:")
## [1] "Smallest absolute correlation:"
## # A tibble: 2 × 4
## x y r abs_r
## <chr> <chr> <dbl> <dbl>
## 1 age balance 0.00184 0.00184
## 2 balance age 0.00184 0.00184
## [1] "Largest absolute correlation:"
## # A tibble: 2 × 4
## x y r abs_r
## <chr> <chr> <dbl> <dbl>
## 1 limit rating 0.997 0.997
## 2 rating limit 0.997 0.997
Compute a correlation test using cor_test(). Specify customers_work_samp as the data input and cards and balance as the continuous variables. Bind the name cor_test_res to the object. Print cor_test_res to view the result.
Question 8.2: Answer these questions: (1) What is the correlation value? (2) What is the empirical t-value? (3) What is the frequentist probability value?
Response 8.2: (1) 0.086 (2) 1.73 (3) 0.0842
### Pearson's linear correlation coefficient
## create object
cor_test_res <- cor_test(
# data
customers_work_samp,
# continuous variables
cards, balance
)
## print result
cor_test_res
Perform two tasks.
First, calculate the observed correlation using the infer functions and bind the name corr_res to it. Pipe customers_work_samp to specify() with the formula input set to balance ~ cards. Pipe the result to calculate() with stat set to "correlation". Print corr_res to view the result.
Second, produce a visualization using the infer functions. Pipe customers_work_samp to specify() with the formula set to balance ~ cards. Pipe the result to hypothesize() with null set to "independence". Pipe the result to generate() with reps set to 2000 and type set to "permute". Pipe the result to calculate() with stat set to "correlation". Pipe the result to visualize(). Pipe the result to shade_p_value() with corr_res as the observed result and direction set to "two-sided". Finish with labs() to set appropriate axis labels.
Question 8.3: What does the visualization highlight?
Response 8.3: The visualization highlights the null distribution: the correlation values we would expect to see between balance and cards if there were truly no relationship between these variables.
### calculate the observed correlation
## save as object
corr_res <- customers_work_samp %>%
## specify relationship
specify(
# formula
balance ~ cards
) %>%
## calculate observed statistic
calculate(
# statistic
stat = "correlation"
)
## print
corr_res
### visualize the null distribution
### and observed statistic
## call data
customers_work_samp %>%
## specify relationship
specify(
# formula
balance ~ cards
) %>%
## null hypothesis
hypothesize(
# null
null = "independence"
) %>%
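## generate permutation resamples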
generate(
# repetitions
reps = 2000,
# type
type = "permute"
) %>%
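## calculate the null distribution of statistics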
calculate(
# statistic
stat = "correlation"
) %>%
## visualize null distribution
visualize() +
## add observed statistic
shade_p_value(
# observed result
corr_res,
# direction
direction = "two-sided"
) +
## labels
labs(
# x-axis
x = "Correlation",
# y-axis
y = "Count"
)
Perform these tasks.
First, create a plot to visualize the linear correlation by: calling ggplot() with customers_work_samp and mapping cards to the x-axis and balance to the y-axis; adding geom_point() for the scatter; adding geom_smooth() with a loess line; adjusting the x-axis and y-axis breaks; and labeling the axes.
Second, calculate the effect size of the linear correlation (a sketch of this calculation follows the plot code below).
Question 8.4: Answer these questions: (1) Examining the plot, do cards and balance display much of a relationship? (2) What does the r-squared calculation imply about the size of the relationship between cards and balance?
Response 8.4: (1) Since the blue line is relatively straight, it indicates a roughly linear relationship between cards and balance, especially for those with two or more cards. (2) Since the r-squared of 0.0074747 is well below 0.10, cards and balance share less than 1% of their variance, which represents a very weak relationship.
### visualize correlation
## call function
ggplot(
## data
customers_work_samp,
## mapping
aes(x = cards, y = balance)
) +
## points
geom_point(alpha = 0.3, color = "red") +
## loess line
geom_smooth(method = "loess", se = FALSE, color = "blue", span = 0.9) +
## scale x-axis
scale_x_continuous(
# breaks
n.breaks = 8
) +
## scale y-axis
scale_y_continuous(
# breaks
n.breaks = 10
) +
## label axes
labs(x = "Cards", y = "Balance")
For this task, you will save objects you created.
Save the data object customers_work_samp as a data file named customers_work_samp.csv to the data folder of the project directory. Use write_csv() and here() to accomplish this task.
Save the plot object limit_balance_plot as a file named limit_balance.png to the plots folder of the project directory. Use ggsave() and here() to accomplish this task.
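A minimal sketch of these two save commands, assuming the data and plots folders created during project setup:
### save objects
## save the sampled data to the data folder
write_csv(
  customers_work_samp,
  here("data", "customers_work_samp.csv")
)
## save the plot to the plots folder
ggsave(
  here("plots", "limit_balance.png"),
  plot = limit_balance_plot
)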
For your last task, you will respond to conceptual questions.
Question 10.1: What are some common tasks for querying data?
Response 10.1: Some common tasks for querying data include filtering rows, grouping and summarizing, transforming variables, and joining data tables.
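As an illustration (a sketch using the assignment data, not a required part of the assignment), a single pipe can filter, group, and summarize:
### example query: filter, group, and summarize
customers_work_samp %>%
## filter rows to higher incomes
filter(income > 50000) %>%
## group by student status
group_by(student) %>%
## summarize mean balance per group
summarize(mean_balance = mean(balance))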
Question 10.2: Answer these two questions: (1) What information does a frequentist probability value from a null hypothesis test provide? (2) What information does an effect size calculation provide?
Response 10.2: (1) A frequentist probability value tells us the probability of observing a result at least as extreme as ours if the null hypothesis were true. A low p-value (e.g., < 0.05) suggests strong evidence against the null hypothesis, while a high p-value (e.g., ≥ 0.05) suggests insufficient evidence to reject it. (2) An effect size calculation conveys the magnitude of the relationship or difference: how large or meaningful the observed relationship or difference is.
Question 10.3: Why is it important to write scripts for analytics?
Response 10.3: There are many good reasons to write scripts for analytics. Unlike point-and-click workflows, scripts can be replicated exactly, which supports scientific rigor. They also serve as a complete record of the analytical process, making the work more transparent and easier to audit. Finally, scripts reduce the human transcription errors that occur with point-and-click interfaces, which makes them especially useful for larger data sets and more complex analyses, where steps are easily forgotten or mistyped.