RPUBS Link https://rpubs.com/enewington/716981
#install.packages('dplyr', dependencies=TRUE)
#install.packages('tidyverse')
#install.packages("rlang")
#install.packages("magrittr")
library(magrittr)
library(plyr)
library(rlang)
library(readr)
library(foreign)
library(rvest)
library(knitr)
library(dplyr)
library(tidyverse)
#read data
customer_data <- read.csv('Assignment 1 Dataset - Master.csv', stringsAsFactors = TRUE)
#view the first 6 rows of data
head(customer_data)
#remove the last two variables
customer_data2 <- customer_data[ , -c(22,23)]
#view the first 6 rows of data again
head(customer_data2)
##Steps taken to import the data
*To import the data, the base R function of ‘read.csv’ was used, with the ‘stringasfactors’ call used in order to ensure characters are treated as factors, this makes ordering categorical variables easier
*Upon viewing the first 6 rows using the ‘head’ call, we can see the last two variables are not needed so columns 22 and 23 have been dropped. We then view the first 6 rows again just to ensure the columns for naive bayes classifier varaibles were removed.
The dataset was sourced from https://leapsapp.analyttica.com/cases/11
The dataset is used from the view of a business manager of a consumer credit card portfolio, as the fictitious bank has been facing severe customer attrition problem in the recent months.The business manager wants to leverage the power of data analytics to understand the primary reasons of attrition and wants to have an ability to understand the customers who are likely to close their accounts with the bank in near future, so that she can focus her efforts, well in advance to retain those customers.
VARIABLE DESCRIPTIONS Clientnum: Client number. Unique identifier for the customer holding the account
Attrition_Flag: Internal event (customer activity) variable - if the account is closed then 1 else 0
Customer_Age: Demographic variable - Customer’s Age in Years
Gender: Demographic variable - M=Male, F=Female
Dependent_count: Demographic variable - Number of dependents
Education_Level: Demographic variable - Educational Qualification of the account holder (example: high school, college graduate, etc.)
Marital_Status: Demographic variable - Married, Single, Unknown
Income_Category: Demographic variable - Annual Income Category of the account holder (< $40K, $40K - 60K, $60K - $80K, $80K-$120K, > $120K, Unknown)
Card_Category: Product Variable - Type of Card (Blue, Silver, Gold, Platinum)
Months_on_book: Months on book (Time of Relationship)
Total_Relationship_Count: Total no. of products held by the customer
Months_Inactive_12_mon: No. of months inactive in the last 12 months
Contacts_Count_12_mon: No. of Contacts in the last 12 months
Credit_Limit: Credit Limit on the Credit Card
Total_Revolving_Bal: Total Revolving Balance on the Credit Card
Avg_Open_To_Buy: Open to Buy Credit Line (Average of last 12 months)
Total_Amt_Chng_Q4_Q1: Change in Transaction Amount (Q4 over Q1)
Total_Trans_Amt: Total Transaction Amount (Last 12 months)
Total_Trans_Ct: Total Transaction Count (Last 12 months)
Total_Ct_Chng_Q4_Q1: Change in Transaction Count (Q4 over Q1)
Avg_Utilization_Ratio: Average Card Utilization Ratio
(Predict Customer Attrition Using Naïve Bayes Classification, 2021)
The following R chunks have been used to inspect the data:
str() has been used to understand the structure,class and number of observations and variables
#inspect structure of the dataframe
str(customer_data2)
'data.frame': 10127 obs. of 21 variables:
$ CLIENTNUM : int 768805383 818770008 713982108 769911858 709106358 713061558 810347208 818906208 710930508 719661558 ...
$ Attrition_Flag : Factor w/ 2 levels "Attrited Customer",..: 2 2 2 2 2 2 2 2 2 2 ...
$ Customer_Age : int 45 49 51 40 40 44 51 32 37 48 ...
$ Gender : Factor w/ 2 levels "F","M": 2 1 2 1 2 2 2 2 2 2 ...
$ Dependent_count : int 3 5 3 4 3 2 4 0 3 2 ...
$ Education_Level : Factor w/ 7 levels "College","Doctorate",..: 4 3 3 4 6 3 7 4 6 3 ...
$ Marital_Status : Factor w/ 4 levels "Divorced","Married",..: 2 3 2 4 2 2 2 4 3 3 ...
$ Income_Category : Factor w/ 6 levels "$120K +","$40K - $60K",..: 3 5 4 5 3 2 1 3 3 4 ...
$ Card_Category : Factor w/ 4 levels "Blue","Gold",..: 1 1 1 1 1 1 2 4 1 1 ...
$ Months_on_book : int 39 44 36 34 21 36 46 27 36 36 ...
$ Total_Relationship_Count: int 5 6 4 3 5 3 6 2 5 6 ...
$ Months_Inactive_12_mon : int 1 1 1 4 1 1 1 2 2 3 ...
$ Contacts_Count_12_mon : int 3 2 0 1 0 2 3 2 0 3 ...
$ Credit_Limit : num 12691 8256 3418 3313 4716 ...
$ Total_Revolving_Bal : int 777 864 0 2517 0 1247 2264 1396 2517 1677 ...
$ Avg_Open_To_Buy : num 11914 7392 3418 796 4716 ...
$ Total_Amt_Chng_Q4_Q1 : num 1.33 1.54 2.59 1.4 2.17 ...
$ Total_Trans_Amt : int 1144 1291 1887 1171 816 1088 1330 1538 1350 1441 ...
$ Total_Trans_Ct : int 42 33 20 20 28 24 31 36 24 32 ...
$ Total_Ct_Chng_Q4_Q1 : num 1.62 3.71 2.33 2.33 2.5 ...
$ Avg_Utilization_Ratio : num 0.061 0.105 0 0.76 0 0.311 0.066 0.048 0.113 0.144 ...
dim() has been used to as a quick way to identify the number of observations and variables. There are 10,127 observations (rows) and 21 variables (columns).
#inspect dimensions of the dataframe
dim(customer_data2)
[1] 10127 21
The education, income and card category variables have been converted to a factor and reordered. Other variables have then been renamed for ease of use
#rename variables
customer_data3 <- as_tibble(customer_data2)
customer_data3 %<>%
rename(
age = Customer_Age,
Dependents = Dependent_count,
months_as_customer = Months_on_book,
total_products_held = Total_Relationship_Count,
months_inactive_prev_year = Months_Inactive_12_mon,
contacts_in_prev_year = Contacts_Count_12_mon,
avg_available_credit = Avg_Open_To_Buy,
transaction_amt_change_Q4Q1 = Total_Amt_Chng_Q4_Q1,
transaction_count_change_Q4Q1 = Total_Ct_Chng_Q4_Q1,
gender = Gender,
dependants =Dependent_count,
marital_status = Marital_Status,
credit_limit = Credit_Limit,
revolving_balance = Total_Revolving_Bal,
total_trans_amount = Total_Trans_Amt,
total_trans_count = Total_Trans_Ct,
avg_utilisation_ratio =Avg_Utilization_Ratio,
education = Education_Level,
income = Income_Category,
card_category = Card_Category
)
#order variables that have a hierarchy
education <- as.factor(education)
education <- factor(education,
levels=c('Unknown',
'Uneducated',
'High School',
'College',
'Graduate',
'Post-Graduate',
'Doctorate'),
ordered=TRUE)
summary(education)
Unknown Uneducated High School College Graduate Post-Graduate Doctorate
1519 1487 2013 1013 3128 516 451
income <- as.factor(income)
income <- factor(income,
levels=c('Unknown',
'Less than $40K',
'$40K - $60K',
'$60K - $80K',
'$80K - $120K',
'$120K +'),
ordered=TRUE)
summary(income)
Unknown Less than $40K $40K - $60K $60K - $80K $80K - $120K $120K +
1112 3561 1790 1402 1535 727
card_category <- as.factor(card_category)
card_category <- factor(card_category,
levels=c('Blue',
'Silver',
'Gold',
'Platinum'),
ordered=TRUE)
summary(card_category)
Blue Silver Gold Platinum
9436 555 116 20
head(customer_data3)
This dataset did not require additional cleaning, as it already adheres to the three tidy data principles.
each variable forms a column: all variables are clearly defined into logical columns, there are no variables that should be split to form independent columns
each observation forms a row: all observations are contained in a unique row
each type of observational unit forms a table: the dataset is contained within a single table and each cell only contains one response
The dplyr package has been used to group a qualitative variable, such as ‘card category’ and then apply summary statistics for a numeric variable within each category, such as ‘age’.
The groupings performed are Card Category by Age and Credit Limit, and Education Level by credit limit.
The summary statistics performed are mean, median, minimum, maximum and standard deviation.
A limitation with this function I have found is that the ordering of the factors ‘card category’ and ‘education level’ have not persisted. When the ‘arrange’ function was applied, the ordering remained alphabetical and not the ordered the factors were given.
summarydataage <- customer_data3 %>%
group_by(card_category) %>%
summarise(meanage =mean(age),
medianage = median(age),
minage = min(age),
maxage = max(age),
stdevage = sd(age)
)
summarydataage
summarydatacredit <- customer_data3 %>%
group_by(card_category) %>%
summarise(mean.creditlim =mean(credit_limit),
median.creditlim = median(credit_limit),
min.creditlim = min(credit_limit),
max.creditlim = max(credit_limit),
stdev.creditlim = sd(credit_limit)
)
summarydatacredit
summarydataeducation <- customer_data3 %>%
group_by(education) %>%
summarise(mean.creditlim =mean(credit_limit),
median.creditlim = median(credit_limit),
min.creditlim = min(credit_limit),
max.creditlim = max(credit_limit),
stdev.creditlim = sd(credit_limit)
)
summarydataeducation
arrange(summarydataeducation,education)
This list has been created to contain a numeric value (using as.numeric()) for each response to the categorical variable of education, meaning that each categorical value is represented by a number from 1-n. The list also includes marital status in it’s original form,
The numerical version of education has been renamed to ‘educationum’ to clearly distinguish between the original variable and the new numeric versions.
#Step 1: Create a list using list() function.
customer_list <- list(educationnum = list(as.numeric(education)),
marital_status = list(marital_status))
# check the class of list1
class(customer_list)
[1] "list"
# Step 2: View structure of the lists. We can see the items have been named correctly.
str(customer_list)
List of 2
$ educationnum :List of 1
..$ : num [1:10127] 3 5 5 3 2 5 1 3 2 5 ...
$ marital_status:List of 1
..$ : Factor w/ 4 levels "Divorced","Married",..: 2 3 2 4 2 2 2 4 3 3 ...
The right join function has been used to join customer_list2 and the original dataframe using marital status as the key. The numeric version of education has been maintained. The Rpubs link printed within the header of this document will provide an interactive version to scroll across to see the educationnum column.
#convert the list back into a dataframe
customer_list2 <- as.data.frame(customer_list)
str(customer_list2)
'data.frame': 10127 obs. of 2 variables:
$ c.3..5..5..3..2..5..1..3..2..5..2..1..4..5..5..1..6..1..3..5.. : num 3 5 5 3 2 5 1 3 2 5 ...
$ structure.c.2L..3L..2L..4L..2L..2L..2L..4L..3L..3L..4L..2L..3L..: Factor w/ 4 levels "Divorced","Married",..: 2 3 2 4 2 2 2 4 3 3 ...
class(customer_list2)
[1] "data.frame"
#name the columns
colnames(customer_list2) <- c("educationnum", "marital_status")
head(customer_list2)
##create the join
customerjoin <- customer_data3 %>%
right_join(customer_list2, by = 'marital_status')
head(customerjoin)
The ‘subset_df’ is a new dataframe that has taken rows 1 to 10 of the customer_data3 dataset and as kept all of the columns.
The data.matrix fuction has been used to convert the dataframe to a matrix and we can see that in the printed matrix and the structure of the matrix, that the categorical variables have been coerced into numeric variables, as all elements of the matrix must be the same type (with the column names remaining as characters).
#create the subset
subset_df <- customer_data3[1:10, ]
#print the subset
subset_df
# create a numeric matrix
customer_matrix <- data.matrix(subset_df)
customer_matrix
CLIENTNUM Attrition_Flag age gender dependants education marital_status income card_category
[1,] 768805383 2 45 2 3 4 2 3 1
[2,] 818770008 2 49 1 5 3 3 5 1
[3,] 713982108 2 51 2 3 3 2 4 1
[4,] 769911858 2 40 1 4 4 4 5 1
[5,] 709106358 2 40 2 3 6 2 3 1
[6,] 713061558 2 44 2 2 3 2 2 1
[7,] 810347208 2 51 2 4 7 2 1 2
[8,] 818906208 2 32 2 0 4 4 3 4
[9,] 710930508 2 37 2 3 6 3 3 1
[10,] 719661558 2 48 2 2 3 3 4 1
months_as_customer total_products_held months_inactive_prev_year contacts_in_prev_year credit_limit
[1,] 39 5 1 3 12691
[2,] 44 6 1 2 8256
[3,] 36 4 1 0 3418
[4,] 34 3 4 1 3313
[5,] 21 5 1 0 4716
[6,] 36 3 1 2 4010
[7,] 46 6 1 3 34516
[8,] 27 2 2 2 29081
[9,] 36 5 2 0 22352
[10,] 36 6 3 3 11656
revolving_balance avg_available_credit transaction_amt_change_Q4Q1 total_trans_amount total_trans_count
[1,] 777 11914 1.335 1144 42
[2,] 864 7392 1.541 1291 33
[3,] 0 3418 2.594 1887 20
[4,] 2517 796 1.405 1171 20
[5,] 0 4716 2.175 816 28
[6,] 1247 2763 1.376 1088 24
[7,] 2264 32252 1.975 1330 31
[8,] 1396 27685 2.204 1538 36
[9,] 2517 19835 3.355 1350 24
[10,] 1677 9979 1.524 1441 32
transaction_count_change_Q4Q1 avg_utilisation_ratio
[1,] 1.625 0.061
[2,] 3.714 0.105
[3,] 2.333 0.000
[4,] 2.333 0.760
[5,] 2.500 0.000
[6,] 0.846 0.311
[7,] 0.722 0.066
[8,] 0.714 0.048
[9,] 1.182 0.113
[10,] 0.882 0.144
str(customer_matrix)
num [1:10, 1:21] 7.69e+08 8.19e+08 7.14e+08 7.70e+08 7.09e+08 ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:21] "CLIENTNUM" "Attrition_Flag" "age" "gender" ...
To subset the dataframe with the first and last variables, I referred back to the original dim(customer_data3) answer provided earlier to recall that there are 21 variables. This subset is saying; take all the rows, but only columns 1 and 21. The second line is to check that this action has been performed correctly. To save the file as an RDS, the saveRDS function has been used to save the new ‘subset_df2’ object into my working directory.
#create a subset
subset_df2 <- customer_data3[,c(1,21)]
#print the subset
subset_df2
#save as RDS
saveRDS(subset_df2, file = "subset_df2.rds")
The first step was to create the vectors that will be used in the dataframe.
‘daysofweek’ has been created as a vector and then ordered in a separate variable called ‘ordered_daysofweek’ The subsequent code below checks that the variable is a factor and has been ordered correctly.
The second variable ‘stepsperday’ has been created with the specific class of integer and checked using class() and str()
The dataframe (erins_df) has then been created using data.frame() with the two variables that were just created, and then subsequently printed below.
A third numeric variable was then created (avg_heartrate) and combined into the previously created df using the column bind (cbind()) function. The new dataframe has then been printed to confirm it has been created correctly.
# Create first vector
daysofweek <- as.factor(c("Monday",
"Tuesday",
"Wednesday",
"Thursday",
"Friday",
"Saturday",
"Sunday")
)
ordered_daysofweek <- factor(daysofweek,
levels=c("Monday",
"Tuesday",
"Wednesday",
"Thursday",
"Friday",
"Saturday",
"Sunday"),
ordered=TRUE)
ordered_daysofweek
[1] Monday Tuesday Wednesday Thursday Friday Saturday Sunday
Levels: Monday < Tuesday < Wednesday < Thursday < Friday < Saturday < Sunday
class(daysofweek)
[1] "factor"
class(ordered_daysofweek)
[1] "ordered" "factor"
str(daysofweek)
Factor w/ 7 levels "Friday","Monday",..: 2 6 7 5 1 3 4
str(ordered_daysofweek)
Ord.factor w/ 7 levels "Monday"<"Tuesday"<..: 1 2 3 4 5 6 7
#create second vector
stepsperday <- as.integer(c(6532,10524,9536,5214,114562,2312,7845))
class(stepsperday)
[1] "integer"
str(stepsperday)
int [1:7] 6532 10524 9536 5214 114562 2312 7845
#create the dataframe
erins_df <- data.frame(ordered_daysofweek, stepsperday)
erins_df
#create another numeric variable
avg_heartrate <- as.numeric(c(85,65,75,66,85,99,55))
avg_heartrate
[1] 85 65 75 66 85 99 55
class(avg_heartrate)
[1] "numeric"
#add the new variable into the original dataset; cbind has been used to include the new variable as a new column into the erins_df dataframe
combined_df <- cbind(erins_df, avg_heartrate)
combined_df
A new dataframe (erins_df2) has been created using one new vector (participant) and one previously used vector (stepsperday).
The join has been created using the full join function in dplyr, by joining combined_df and the new dataframe (erins_df2). This join specifically states to join by the common variable ‘ordered_daysofweek’ and to return all observations from both tables and the left and right tables.
# Create the vectors, including a common variable from step 11
# Create first vector
daysofweek <- as.factor(c("Monday",
"Tuesday",
"Wednesday",
"Thursday",
"Friday",
"Saturday",
"Sunday")
)
ordered_daysofweek <- factor(daysofweek,
levels=c("Monday",
"Tuesday",
"Wednesday",
"Thursday",
"Friday",
"Saturday",
"Sunday"),
ordered=TRUE)
#create the new vector
particpants <- c("Erin","Kate","Ryan","Madelyn","Grace","Wayne","Brett")
#create a new data frame
erins_df2 <- data.frame(particpants,ordered_daysofweek)
erins_df2
##create the join
fulljoindf <- combined_df %>%
full_join(erins_df2, by = 'ordered_daysofweek')
fulljoindf
##References
DataNovia. (2021, 01 25). Rename Data Frame Columns in R . Retrieved from DataNovia: https://www.datanovia.com/en/lessons/rename-data-frame-columns-in-r/
Predict Customer Attrition Using Naïve Bayes Classification. (2021, 01 25). Retrieved from LEAPS: https://leaps.analyttica.com/sample_cases/11
Wickham, H., Francois, R., Henry, L., & Muller, K. (2021, 01 25). Group by one or more variables. Retrieved from dplyr part of the tiyverse: https://dplyr.tidyverse.org/reference/group_by.html