My primary goal for this assignment was to identify potential factors that influence a cardholder’s credit limit.
Upon viewing this data, my initial hypothesis was that, on average, the more dependents a cardholder has, the higher their credit limit is.
Below are two histograms showing the count of cardholders in each credit limit range, each restricted to a single dependent count. The first histogram includes only cardholders with one dependent. It shows that the majority of these cardholders have a credit limit below $6,000. It also shows that, for the most part, the number of cardholders declines as the credit limit increases, up to the top of the range at around $35,000.
data = read.csv("BankChurners.csv", header = TRUE)
data = data[ , -c(22, 23)]
hist(data$Credit_Limit[data$Dependent_count == 1], breaks = 15, main = "Credit Limit For 1 Dependent", xlab = "Credit Limit")
The second histogram includes only cardholders with five dependents. It follows the same pattern as the first; however, there is a greater count of cardholders in the $35,000 range.
hist(data$Credit_Limit[data$Dependent_count == 5], breaks = 15, main = "Credit Limit For 5 Dependents", xlab = "Credit Limit", ylim = c(0, 140))
To test my hypothesis, I used the aggregate function to compute the mean credit limit for each dependent count. The results show that the average credit limit rises with the number of dependents from zero through four, then dips slightly at five. This pattern is consistent with the hypothesis, although group means alone do not show how strong the relationship is.
credit_avg = aggregate(data$Credit_Limit ~ data$Dependent_count, FUN = mean)
credit_avg
##   data$Dependent_count data$Credit_Limit
## 1                    0          7160.764
## 2                    1          7905.123
## 3                    2          8717.175
## 4                    3          8976.507
## 5                    4          9454.955
## 6                    5          9110.453
I plotted the means as a line graph to better visualize the relationship between the number of dependents and the credit limit on the account. Here, we can see that the average credit limit increases as the number of dependents rises from zero to four, followed by a slight drop at five dependents.
plot(credit_avg, type = "l", main = "Average Credit Limit Over Dependent Count", xlab = "Dependent Count", ylab = "Average Credit Limit")
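To put numbers on the trend in the line graph, the change in average credit limit from one dependent count to the next can be computed directly from the table of means. The sketch below re-enters the reported means by hand so that it runs without the data file; the names `means` and `steps` are my own and are used only for illustration.

```r
# Mean credit limits copied from the aggregate output, for 0 to 5 dependents
means <- c(7160.764, 7905.123, 8717.175, 8976.507, 9454.955, 9110.453)
names(means) <- 0:5

# Change in the mean credit limit for each additional dependent
steps <- diff(means)
steps

# Gap between the lowest and highest group means
max(means) - min(means)
```

Every step is positive except the last one, and the full spread between the lowest and highest group means is about $2,300.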
To further test this theory, I ran a correlation analysis to measure the strength of the association between dependent count and credit limit. The Pearson correlation coefficient ranges from -1 to 1, with -1 indicating a perfect negative linear relationship, 0 indicating no linear relationship, and 1 indicating a perfect positive linear relationship. In this case, the coefficient is 0.068, indicating a weak positive relationship between the number of dependents and the credit limit. At first this seems to contradict the previous results, but the two are compatible. The p-value of 7.049e-12 and the 95 percent confidence interval of roughly 0.049 to 0.087, which excludes zero, indicate that a positive association is detectable in a sample of this size. The coefficient itself is small because the differences between the group means (about $2,300 between zero and four dependents) are minor compared with how widely individual credit limits vary, with the histograms showing limits that reach about $35,000. Taken together, the results suggest that dependent count has at most a weak relationship with credit limit and is a poor predictor of it on its own.
cor.test(data$Dependent_count, data$Credit_Limit)
##
## Pearson's product-moment correlation
##
## data: data$Dependent_count and data$Credit_Limit
## t = 6.8648, df = 10125, p-value = 7.049e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.04865232 0.08742548
## sample estimates:
## cor
## 0.0680646
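One way to see how a correlation this small can still come with such a tiny p-value is to rebuild the test statistic from the reported numbers. The sketch below assumes the standard relationship for Pearson's test, t = r * sqrt(df) / sqrt(1 - r^2), and copies `r` and `df` from the output above by hand; the variable names are my own.

```r
r  <- 0.0680646   # sample correlation from the cor.test output
df <- 10125       # degrees of freedom from the cor.test output (n - 2)

# Test statistic for the null hypothesis that the true correlation is 0
t_stat <- r * sqrt(df) / sqrt(1 - r^2)
t_stat

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df)
p_value

# Share of the variance in one variable linearly associated with the other
r_squared <- r^2
r_squared
```

The statistic comes out at about 6.86, in line with the reported t = 6.8648, because the small coefficient is multiplied by the square root of more than ten thousand degrees of freedom. At the same time, r^2 is below 0.005, so dependent count is linearly associated with less than half a percent of the variation in credit limit.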
Next, I examined the Customer_Age column to see whether age has any effect on the credit limit. First, I graphed the age of the cardholders to better visualize the distribution.
The histogram shows that most of the cardholders are between the ages of 40 and 50. Another interesting observation is that none of the cardholders are below the age of 25 or above 70. It may be that the data provided excluded clients in these age ranges.
hist(data$Customer_Age, main = "Age of Credit Card Holders", breaks = 15, xlab = "Age", ylim = c(0, 2500))
After reviewing the histogram, I divided the data set into two groups at age 45 and used a Welch two-sample t-test to compare them. Note that the "Over 45" group, as coded below, also includes cardholders who are exactly 45. The null hypothesis is that the mean credit limit is the same for both groups, and the alternative hypothesis is that the mean credit limit differs between the age groups. Because the p-value of 0.2441 is well above the common threshold of 0.05, we fail to reject the null hypothesis: the data do not provide evidence that the mean credit limit differs between the two age groups. The 95 percent confidence interval for the difference in means, which includes zero, points to the same conclusion.
data$Age_Group = ifelse(data$Customer_Age < 45, "Under 45", "Over 45")
t.test(Credit_Limit ~ Age_Group, data = data)
##
## Welch Two Sample t-test
##
## data: Credit_Limit by Age_Group
## t = 1.165, df = 9127.7, p-value = 0.2441
## alternative hypothesis: true difference in means between group Over 45 and group Under 45 is not equal to 0
## 95 percent confidence interval:
## -145.2398 570.7616
## sample estimates:
## mean in group Over 45 mean in group Under 45
## 8719.667 8506.907
boxplot(Credit_Limit ~ Age_Group, data = data, main = "Credit Limit by Age Group", xlab = "Age Group", ylab = "Credit Limit", col = c("skyblue", "salmon"))
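The 95 percent confidence interval in the t-test output can be reproduced approximately from the other reported values, which helps show why an interval that spans zero goes together with a non-significant p-value. This sketch copies the group means, t statistic, and degrees of freedom from the output by hand; the variable names are my own, and because the inputs are rounded, the result only matches the reported interval to within rounding.

```r
# Values copied from the Welch t-test output
mean_over  <- 8719.667   # mean credit limit, "Over 45" group
mean_under <- 8506.907   # mean credit limit, "Under 45" group
t_stat     <- 1.165
df         <- 9127.7

diff_means <- mean_over - mean_under       # difference in group means
std_error  <- diff_means / t_stat          # implied standard error, since t = difference / SE
margin     <- qt(0.975, df) * std_error    # half-width of the 95 percent interval

ci <- c(diff_means - margin, diff_means + margin)
ci

# Two-sided p-value for the same statistic
p_value <- 2 * pt(-abs(t_stat), df)
p_value
```

The difference of about $213 between the group means is only slightly larger than its standard error of about $183, so the interval extends well below zero and the p-value stays far above 0.05.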