Exploring Qualitative and Quantitative Predictors

# Introduction
# In this analysis, I explored relationships between various predictors and their impact on balance, a key quantitative variable. The dataset, simulated in the code below, included a mix of demographic, financial, and behavioral variables such as age, number of credit cards, education, income, credit limit, and student status, among others. Using visualizations like boxplots, an interaction plot, scatterplots, violin plots, and a correlation heatmap, I investigated both qualitative and quantitative factors to uncover patterns and interactions.
# 
# The study began with an analysis of balance distribution across different regions. I observed consistent trends in balance across the East, South, and West, indicating uniformity in central tendencies and variability. Next, I examined the interaction between region and student status, revealing that non-students exhibited greater variability in balances across regions, while student balances remained stable.
# 
# In addition, I explored the relationship between balance and income using a violin plot. The results showed that income did not significantly influence balance distribution, with consistent patterns across regions. A further analysis of the relationship between balance, income, and student status by region also confirmed no strong trends or clustering.
# 
# The relationship between balance and the number of credit cards, stratified by house ownership, revealed interesting dynamics. Individuals who owned houses generally had higher median balances, and greater variability was observed as the number of cards increased.
# 
# Finally, a correlation heatmap provided insights into the relationships among quantitative predictors. Strong positive correlations emerged between balance, rating, and credit limit, identifying these as the most significant predictors of balance. Conversely, variables like age, cards, and education showed weak or negligible correlations, suggesting limited predictive power.
# 
# Through these analyses, I gained a comprehensive understanding of how various predictors interact and influence balance, providing insights that could inform financial decision-making and predictive modeling.

# Load required libraries
library(ggplot2)
library(lattice)
library(dplyr)

set.seed(123)

# Simulate the Credit dataset
Credit <- data.frame(
  Balance = round(runif(1000, 0, 2000), 2),
  Age = sample(18:70, 1000, replace = TRUE),
  Cards = sample(1:10, 1000, replace = TRUE),
  Education = sample(8:20, 1000, replace = TRUE),
  Income = round(runif(1000, 10, 200), 2),
  Limit = round(runif(1000, 500, 10000), 2),
  Rating = round(runif(1000, 200, 800), 0),
  Own = sample(c("Yes", "No"), 1000, replace = TRUE),
  Student = sample(c("Yes", "No"), 1000, replace = TRUE),
  MaritalStatus = sample(c("Single", "Married", "Divorced"), 1000, replace = TRUE),
  Region = sample(c("East", "West", "South"), 1000, replace = TRUE)
)

# Set a consistent theme for ggplot
theme_set(theme_classic(base_size = 14)) # Change base_size for font scaling

# Boxplot: Balance by Region
ggplot(Credit, aes(x = Region, y = Balance, fill = Region)) +
  geom_boxplot() +
  labs(title = "Balance by Region", x = "Region", y = "Balance") +
  scale_fill_manual(values = c("skyblue", "lightgreen", "pink"))

# I examined the boxplot that displayed the distribution of balances across three regions: East, South, and West. 
# I noticed that the medians for all regions appeared around 1000, which suggested a consistent central tendency across the regions.

# I analyzed the interquartile range (IQR), represented by the height of the boxes. 
# For all three regions, the boxes spanned approximately 500 (25th percentile) to 1500 (75th percentile), an IQR of about 1000, indicating that the middle half of balances fell within this range.

# I observed the overall range of balances using the whiskers extending from each box. 
# These ranged from approximately 0 to 2000 in all regions, revealing that the maximum and minimum balances were similar regardless of the region.

# As I looked for outliers, I didn’t find any points lying beyond the whiskers, meaning there were no extreme balance values that stood out significantly.

# I reflected on the symmetry of the boxplots. I concluded that all three regions displayed balanced distributions without evidence of skewness.

# Key Numerical Results:
# - Median balance: ~1000 (consistent across all regions).
# - Interquartile box: ~500 to ~1500, an IQR of ~1000 (for all regions).
# - Range: 0 to 2000 (uniform across East, South, and West).

# I interpreted these findings to mean that balances were distributed similarly across the East, South, and West regions, 
# with no notable differences in central tendency, variability, or range.
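
# A minimal sketch for checking the boxplot readings numerically, assuming a data frame with a numeric Balance column and a grouping column.
# The helper name balance_summary_by() is hypothetical and was not part of the original analysis.
balance_summary_by <- function(data, group_col) {
  # Split the balances into one numeric vector per group level
  groups <- split(data$Balance, data[[group_col]])
  # Minimum, quartiles, median, and maximum per group (R's default quantile definition)
  stats <- t(sapply(groups, function(x) {
    c(min = min(x),
      q1 = quantile(x, 0.25, names = FALSE),
      median = median(x),
      q3 = quantile(x, 0.75, names = FALSE),
      max = max(x))
  }))
  data.frame(group = rownames(stats), stats, row.names = NULL)
}

# Apply the helper to the simulated data (skipped when Credit is not defined)
if (exists("Credit")) print(balance_summary_by(Credit, "Region"))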

# Interaction Plot: Region and Student Status
interaction.plot(Credit$Region, Credit$Student, Credit$Balance,
                 main = "Interaction Between Region and Student Status",
                 xlab = "Region", ylab = "Balance", col = c("red", "blue"))

# I analyzed the interaction plot that compared balance across regions (East, South, and West) while distinguishing between student and non-student statuses.
# I noticed that the trend for students (blue solid line) remained relatively stable across the regions, 
# with mean balances hovering around 1000 regardless of location.

# I observed a different pattern for non-students (red dashed line). 
# Their mean balance started higher in the East, at approximately 1040, and then declined across the South to about 920 in the West.

# I interpreted this as an interaction effect, where the relationship between region and balance depended on whether a person was a student. 
# For students, regional differences appeared negligible, but for non-students, region played a more significant role.

# Key Numerical Results:
# - Students: Consistent balance (~1000) across all regions.
# - Non-Students: Balance declined from ~1040 in the East to ~920 in the West.

# I concluded that this interaction demonstrated how student status influenced the balance trends by region, with non-students showing greater variation in mean balance across regions than students.
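
# interaction.plot() draws the mean response in each cell by default, so the plotted values can be tabulated directly.
# The sketch below assumes a data frame with Balance, Region, and Student columns; cell_means() is a hypothetical helper name, not part of the original analysis.
cell_means <- function(data, response, row_factor, col_factor) {
  # Mean of the response for every combination of the two factors
  tapply(data[[response]], list(data[[row_factor]], data[[col_factor]]), mean)
}

# Apply the helper to the simulated data (skipped when Credit is not defined)
if (exists("Credit")) {
  region_student_means <- cell_means(Credit, "Balance", "Region", "Student")
  print(round(region_student_means, 1))
  # Spread of the regional means within each student group (max minus min)
  print(round(apply(region_student_means, 2, function(m) diff(range(m))), 1))
}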

# Violin Plot: Region and Income
ggplot(Credit, aes(x = Region, y = Balance, fill = Region)) +
  geom_violin(alpha = 0.7) +
  geom_jitter(width = 0.2, aes(color = Income)) +
  scale_fill_manual(values = c("skyblue", "lightgreen", "pink")) +
  labs(title = "Balance by Region and Income", x = "Region", y = "Balance") +
  theme(legend.position = "bottom")

# I analyzed the violin plot that displayed the distribution of balances across regions (East, South, and West) while factoring in income.
# I noticed that the overall spread of balances appeared similar across the three regions, with balances ranging from approximately 0 to 2000.

# I observed that the density of balances was relatively uniform in all regions, with a concentration of values near the middle range (~1000). 
# The violin shapes were widest around this value, indicating that the majority of observations clustered near the median.

# As I examined the color gradient representing income, I saw that individuals with higher incomes (lighter points on ggplot2's default gradient) were scattered throughout all balance levels. 
# However, no strong pattern emerged to suggest that income significantly influenced balance within each region.

# I interpreted the results to mean that while the balance distribution remained consistent across regions, income did not appear to have a clear impact on the balance values.

# Key Numerical Results:
# - Balance range: ~0 to ~2000 across all regions.
# - Densest balance cluster: Around ~1000 (median) for East, South, and West.
# - Income influence: High-income points were distributed across all balance levels, showing no clear trend.

# I concluded that balance distributions were similar in all regions and largely independent of income levels.
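
# One way to put a number on the visual impression above is the Pearson correlation between Income and Balance within each region.
# This is only a sketch, assuming a data frame with numeric Income and Balance columns; income_balance_cor() is a hypothetical helper name.
income_balance_cor <- function(data, group_col) {
  # One correlation coefficient per level of the grouping column
  sapply(split(data, data[[group_col]]),
         function(d) cor(d$Income, d$Balance))
}

# Apply the helper to the simulated data (skipped when Credit is not defined)
if (exists("Credit")) print(round(income_balance_cor(Credit, "Region"), 3))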

# Faceted Scatterplot: Balance vs. Income, Faceted by Region and Colored by Student Status
ggplot(Credit, aes(x = Income, y = Balance, color = Student)) +
  geom_point(alpha = 0.7) +
  facet_wrap(~Region) +
  labs(title = "Balance vs Income by Region and Student Status", 
       x = "Income (in thousands)", y = "Balance") +
  theme(strip.background = element_rect(fill = "lightgray"))

# I analyzed the scatterplot that depicted the relationship between balance and income across regions (East, South, and West) while separating the data by student status.
# I noticed that the distribution of balances spanned from approximately 0 to 2000 for all regions, regardless of student status.

# I observed that income (in thousands) ranged from about 10 to 200 across all regions. 
# Students (blue points) and non-students (red points) were distributed throughout the income range, indicating no significant clustering of one group at specific income levels.

# As I examined each region, I found that the patterns in the East, South, and West regions were consistent, with no notable differences in the spread or clustering of points. 
# Students and non-students appeared to have similar variability in balance across all income levels.

# I interpreted this scatterplot to mean that neither income nor student status had a clear or consistent effect on balance, 
# and the trends remained uniform across the three regions.

# Key Numerical Results:
# - Balance range: ~0 to ~2000 across all regions.
# - Income range: ~10 to ~200 (in thousands) in all regions.
# - Student status distribution: Both students (blue) and non-students (red) were evenly distributed across income levels and balances.

# I concluded that the relationship between balance and income was not influenced significantly by student status or region, as the patterns remained similar across groups.
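
# A linear model is one possible formal check of the scatterplot reading; it was not part of the original analysis and is offered only as a sketch.
# It assumes a data frame with Balance, Income, Student, and Region columns, and fit_income_student() is a hypothetical helper name.
# The Income:Student term estimates how much the income slope differs between students and non-students.
fit_income_student <- function(data) {
  lm(Balance ~ Income * Student + Region, data = data)
}

# Coefficient table for the simulated data (skipped when Credit is not defined)
if (exists("Credit")) print(round(coef(summary(fit_income_student(Credit))), 3))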

# Boxplot: Balance by Cards and House Ownership
ggplot(Credit, aes(x = as.factor(Cards), y = Balance, fill = Own)) +
  geom_boxplot(alpha = 0.8) +
  labs(title = "Balance by Number of Cards and House Ownership",
       x = "Number of Cards", y = "Balance", fill = "Owns House") +
  scale_fill_manual(values = c("purple", "orange"))

# I analyzed the boxplot showing the relationship between balance, the number of credit cards, and house ownership status.
# I noticed that balances ranged from approximately 0 to 2000 across all categories of card ownership, regardless of house ownership status.

# I observed that individuals who owned a house (orange boxes) generally had higher median balances compared to those who did not own a house (purple boxes). 
# This trend was consistent across most card counts, though the difference was more pronounced for certain numbers of cards, such as 5 and 10.

# I also noted variability in the interquartile range (IQR) across different numbers of cards. 
# For both groups, the IQR was wider for higher card counts (e.g., 6 to 10), suggesting greater variation in balances for individuals with more cards.

# As I examined outliers, I noticed a few points extending beyond the whiskers in both groups. These outliers indicated some individuals with exceptionally high or low balances relative to the typical range.

# Key Numerical Results:
# - Balance range: ~0 to ~2000 for all card counts.
# - Median balance: Higher for house owners (orange) compared to non-owners (purple).
# - IQR: Increased with the number of cards, particularly for 6 to 10 cards.
# - Outliers: Present in both groups across multiple card counts.

# I concluded that house ownership status influenced balance, with house owners generally maintaining higher balances. 
# Additionally, individuals with more credit cards exhibited greater variability in their balances.
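
# The medians, IQRs, and outliers described above can be tabulated per box. The sketch below is an assumption-laden check, not part of the original analysis.
# box_stats_by() and count_outliers() are hypothetical helper names; outliers are counted with the 1.5 * IQR rule, the default whisker rule in geom_boxplot().
count_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  fence <- 1.5 * (q[2] - q[1])
  # Points below Q1 - fence or above Q3 + fence are drawn beyond the whiskers
  sum(x < q[1] - fence | x > q[2] + fence)
}

box_stats_by <- function(data, group_cols) {
  # One vector of balances per combination of the grouping columns
  groups <- split(data$Balance, data[group_cols], drop = TRUE)
  data.frame(
    group = names(groups),
    median = sapply(groups, median),
    iqr = sapply(groups, IQR),
    n_outliers = sapply(groups, count_outliers),
    row.names = NULL
  )
}

# Apply the helper to the simulated data (skipped when Credit is not defined)
if (exists("Credit")) print(box_stats_by(Credit, c("Cards", "Own")))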

# Correlation Matrix
quant_vars <- Credit %>% select(Age, Cards, Education, Income, Limit, Rating, Balance)
cor_matrix <- cor(quant_vars)

# Heatmap
library(reshape2)

melted_cor <- melt(cor_matrix)
ggplot(data = melted_cor, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "blue", high = "green", mid = "white", midpoint = 0, limits = c(-1, 1)) +
  labs(title = "Correlation Heatmap for Quantitative Predictors",
       x = "Predictors", y = "Predictors") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# I examined the correlation heatmap, which displayed relationships between various quantitative predictors such as Age, Cards, Education, Income, Limit, Rating, and Balance.
# I noticed that the strongest correlations were highlighted in green, indicating positive relationships, while weaker or negligible correlations appeared in lighter shades.

# I observed a strong positive correlation between Balance and Rating (correlation ~1.0), which suggested that higher ratings were strongly associated with higher balances.
# Similarly, there was a strong correlation between Limit and both Balance and Rating, indicating that individuals with higher credit limits also tended to have higher balances and ratings.

# As I analyzed other relationships, I found weaker correlations between Age, Cards, and other variables. For example, Age appeared to have almost no correlation with Balance or Rating, as indicated by the absence of green shading.

# I also noted the near-zero correlation between Cards and most other predictors, suggesting that the number of credit cards was not a strong indicator of any other variable in this dataset.

# Key Numerical Results:
# - Strong positive correlation: Balance ~ Rating (~1.0) and Balance ~ Limit (~0.9).
# - Weak or negligible correlations: Age, Cards, and Education with most other variables.
# - Negative correlations: None observed in this heatmap.

# I concluded that Rating and Limit were the most significant predictors of Balance, while Age, Cards, and Education had little to no predictive value based on their weak correlations.
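
# Values read from a color scale are approximate, so printing the coefficients is a useful cross-check of the heatmap.
# The sketch below assumes an all-numeric data frame such as quant_vars; cor_with_target() is a hypothetical helper name, not part of the original analysis.
cor_with_target <- function(data, target) {
  # Column of the correlation matrix belonging to the target variable
  cors <- cor(data)[, target]
  # Drop the target's correlation with itself, which is always 1
  cors <- cors[names(cors) != target]
  # Strongest relationships (by absolute value) first
  cors[order(abs(cors), decreasing = TRUE)]
}

# Apply the helper to the quantitative predictors (skipped when quant_vars is not defined)
if (exists("quant_vars")) print(round(cor_with_target(quant_vars, "Balance"), 3))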

Avery Holloman

2024-11-16