Credit Card Data

I will analyze dependents as a dichotomous variable (0 = no dependents, 1 = one or more dependents). Income will be my quadratic term, entered as income squared. Finally, I will add income * dependents as my interaction term. Note that the interaction, as fitted below, uses the original dependents count rather than the dichotomized variable.

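#Loading the packages used below
library(AER)    # provides the CreditCard data set
library(dplyr)  # provides %>% and mutate()

data("CreditCard")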

#Setting up a dichotomous variable
Credit <- CreditCard %>% mutate(dependents10 = ifelse(dependents == 0, 0, 1))

#Setting up a quadratic term
Credit <- Credit %>% mutate(income_quad = income^2)
#Fitting the model; dependents * income expands to dependents + income + dependents:income
lm <- lm(active ~ income_quad + dependents10 + dependents * income, data = Credit)

summary(lm)
## 
## Call:
## lm(formula = active ~ income_quad + dependents10 + dependents * 
##     income, data = Credit)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.129  -4.690  -1.320   3.651  38.591 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        2.85373    0.77230   3.695 0.000229 ***
## income_quad       -0.08704    0.03459  -2.516 0.011977 *  
## dependents10       1.41908    0.58640   2.420 0.015656 *  
## dependents        -0.34610    0.43724  -0.792 0.428767    
## income             1.45185    0.35503   4.089 4.59e-05 ***
## dependents:income  0.03149    0.08086   0.389 0.697022    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.176 on 1313 degrees of freedom
## Multiple R-squared:  0.04428,    Adjusted R-squared:  0.04064 
## F-statistic: 12.17 on 5 and 1313 DF,  p-value: 1.513e-11

Based on the summary, the model as a whole is significant at the p < .001 level (F(5, 1313) = 12.17); however, the multiple R^2 (0.044) and the adjusted R^2 (0.041) are very low, so the model explains only about 4% of the variance in active. The residual standard error is also rather high (6.18), suggesting that the model may be a poor fit. Most of the coefficients are significant at least at the p < .05 level. The exceptions are dependents and dependents:income, that is, the raw dependents count and its interaction with income. The dichotomized term (dependents10) and the quadratic term (income_quad) are both significant.
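The coefficient table has six rows even though the formula only names four pieces. A minimal sketch of why, using only base R: the `*` operator in a formula expands to both main effects plus their interaction. The object names `f` and `term_labels` are my own choices for illustration.

```r
# The same formula used in the model above; no data are needed to inspect its terms
f <- active ~ income_quad + dependents10 + dependents * income

# terms() expands the formula; "term.labels" lists every predictor term that will be fitted
term_labels <- attr(terms(f), "term.labels")
print(term_labels)
```

The labels returned here should line up with the rows of the coefficient table (everything except the intercept), which is why dependents and income appear on their own as well as in dependents:income.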

Residual Analysis

#Plotting the residuals
plot(fitted(lm), resid(lm))

While the residuals are scattered around 0, there appears to be a lot of funnelling near the lower end of the plot, and many of the residuals are highly concentrated. I believe these data violate the homoscedasticity assumption.
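The visual impression of funnelling can be checked with a formal test. The following is a sketch, not part of the original analysis: it refits the same model and applies the Breusch-Pagan test via `bptest()` from the lmtest package. The object is named `fit` here (my choice) so that the example does not reuse the name `lm`, and `transform()` is used in place of `mutate()` only to keep the example free of extra dependencies.

```r
library(AER)     # provides the CreditCard data set
library(lmtest)  # provides bptest()
data("CreditCard")

# Recreate the derived variables and refit the same specification
Credit <- transform(CreditCard,
                    dependents10 = ifelse(dependents == 0, 0, 1),
                    income_quad  = income^2)
fit <- lm(active ~ income_quad + dependents10 + dependents * income, data = Credit)

# Breusch-Pagan test: the null hypothesis is constant residual variance
bp <- bptest(fit)
print(bp)
```

A small p-value would be evidence against constant variance and would support the reading of the residual plot above; a large one would suggest the funnelling is less serious than it looks.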

qqnorm(resid(lm))
qqline(resid(lm))

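The Q-Q plot shows that the residuals are not normally distributed, so the normality assumption (rather than linearity, which a Q-Q plot does not assess) appears to be violated. The residual summary points the same way: the median is -1.32 while the maximum is 38.59, indicating a long right tail. I assume the inclusion of a quadratic term contributes to this, though I have not tested that.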
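As a numerical complement to the Q-Q plot, the residuals can be passed to the Shapiro-Wilk test in base R. This is a sketch under the same assumptions as the model above, refit here under the name `fit` (my choice) so the example runs on its own.

```r
library(AER)  # provides the CreditCard data set
data("CreditCard")

# Recreate the derived variables and refit the same specification
Credit <- transform(CreditCard,
                    dependents10 = ifelse(dependents == 0, 0, 1),
                    income_quad  = income^2)
fit <- lm(active ~ income_quad + dependents10 + dependents * income, data = Credit)

r <- resid(fit)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normally distributed
sw <- shapiro.test(r)
print(sw)

# A mean well above the median is a quick sign of right skew
print(c(mean = mean(r), median = median(r)))
```

With over a thousand observations, this test can flag even minor departures from normality, so it is best read alongside the Q-Q plot rather than in place of it.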

#Showing the four standard diagnostic plots in a 2x2 grid
par(mfrow = c(2,2))
plot(lm)

Conclusions

Based on all of this, the model does not appear to be a good fit for our data: it explains very little of the variance in active, and the residuals look both heteroscedastic and non-normal. The disparate scales of our predictors may contribute to this. Because of this, we would need a more robust multiple regression model, or we would need to perform further transformations on our variables to normalize the residuals. If I were to guess, the primary issue with this model is that a single power transformation was applied to the income variable, while the outcome variable was left untransformed.
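One way to follow up on the untransformed outcome is to refit the same predictors against a log-transformed version of active and repeat the residual checks. This is only a sketch of one possible transformation, not something tested in the analysis above; `log1p()` is my choice because it stays finite if active is 0, and `fit_log` is a name introduced for illustration.

```r
library(AER)  # provides the CreditCard data set
data("CreditCard")

# Recreate the derived variables used in the original model
Credit <- transform(CreditCard,
                    dependents10 = ifelse(dependents == 0, 0, 1),
                    income_quad  = income^2)

# Same predictors, but the outcome is log(1 + active) instead of active
fit_log <- lm(log1p(active) ~ income_quad + dependents10 + dependents * income,
              data = Credit)
summary(fit_log)

# Repeat the two residual checks for the transformed model
par(mfrow = c(1, 2))
plot(fitted(fit_log), resid(fit_log))
qqnorm(resid(fit_log))
qqline(resid(fit_log))
```

Because the outcome is on a different scale, the R^2 and residual standard error of this model are not directly comparable to those of the original; the residual plots are the more useful comparison.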