In this take-home lab exercise, we will apply linear regression models to analyze the Credit dataset.

Part 1

The summary of the Credit data set is as follows:

##        ID            Income           Limit           Rating     
##  Min.   :  1.0   Min.   : 10.35   Min.   :  855   Min.   : 93.0  
##  1st Qu.:100.8   1st Qu.: 21.01   1st Qu.: 3088   1st Qu.:247.2  
##  Median :200.5   Median : 33.12   Median : 4622   Median :344.0  
##  Mean   :200.5   Mean   : 45.22   Mean   : 4736   Mean   :354.9  
##  3rd Qu.:300.2   3rd Qu.: 57.47   3rd Qu.: 5873   3rd Qu.:437.2  
##  Max.   :400.0   Max.   :186.63   Max.   :13913   Max.   :982.0  
##      Cards            Age          Education        Gender    Student  
##  Min.   :1.000   Min.   :23.00   Min.   : 5.00    Male :193   No :360  
##  1st Qu.:2.000   1st Qu.:41.75   1st Qu.:11.00   Female:207   Yes: 40  
##  Median :3.000   Median :56.00   Median :14.00                         
##  Mean   :2.958   Mean   :55.67   Mean   :13.45                         
##  3rd Qu.:4.000   3rd Qu.:70.00   3rd Qu.:16.00                         
##  Max.   :9.000   Max.   :98.00   Max.   :20.00                         
##  Married              Ethnicity      Balance       
##  No :155   African American: 99   Min.   :   0.00  
##  Yes:245   Asian           :102   1st Qu.:  68.75  
##            Caucasian       :199   Median : 459.50  
##                                   Mean   : 520.01  
##                                   3rd Qu.: 863.00  
##                                   Max.   :1999.00

Part 2. Estimating the coefficients for simple linear regression.

Use the techniques/code you learned for linear regression in this week’s in-class lab to analyze the Credit data for predicting “Balance” by “Income.”

Question 1.1:

Produce a graphic/visualization suitable for briefing a decision-maker that illustrates the observed data and your fitted (and anything else you think would be helpful). Especially useful to conduct the diagnostic analysis including:

Answer 1.1:

First, we want to determine whether there is a relationship between Balance and Income. To answer this question, we fit a simple linear regression model to the data. In this model, Balance is the y variable (the response variable) and Income is the x variable (the explanatory variable).

The summary data for this simple linear regression model is as follows:

## 
## Call:
## lm(formula = Balance ~ Income, data = Credit)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -803.64 -348.99  -54.42  331.75 1100.25 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 246.5148    33.1993   7.425  6.9e-13 ***
## Income        6.0484     0.5794  10.440  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 407.9 on 398 degrees of freedom
## Multiple R-squared:  0.215,  Adjusted R-squared:  0.213 
## F-statistic:   109 on 1 and 398 DF,  p-value: < 2.2e-16
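The figures in this summary are tied together by simple identities, which gives a quick consistency check. The sketch below is a minimal Python illustration that uses only the rounded values printed above; the variable names are my own and are not part of the original analysis.

```python
# Values copied from the printed summary (rounded as shown there).
estimate = 6.0484    # Income slope
std_error = 0.5794   # standard error of the slope
resid_df = 398       # residual degrees of freedom

t_value = estimate / std_error              # t statistic for the slope
f_value = t_value ** 2                      # with a single predictor, F equals t squared
r_squared = f_value / (f_value + resid_df)  # R-squared recovered from F and its df

print(f"t = {t_value:.2f}, F = {f_value:.1f}, R^2 = {r_squared:.3f}")
```

Up to rounding, the three results agree with the t value, F-statistic, and Multiple R-squared reported in the output.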

The estimated coefficient for Income in the simple linear regression model is 6.048. In other words, there is a positive association between the independent variable (Income) and the dependent variable (Balance): for each one-unit increase in Income, the expected Balance increases by about 6.05 dollars. See coefficient table below:

## (Intercept)      Income 
##  246.514751    6.048363
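To make the slope interpretation concrete, here is a minimal Python sketch that evaluates the fitted line at two Income values one unit apart. The function name and the Income values 50 and 51 are illustrative assumptions; only the two coefficients come from the table above.

```python
# Coefficients copied from the table above.
INTERCEPT = 246.514751
SLOPE = 6.048363

def predicted_balance(income):
    """Fitted Balance from the simple model: intercept + slope * Income."""
    return INTERCEPT + SLOPE * income

low = predicted_balance(50.0)   # hypothetical Income value
high = predicted_balance(51.0)  # one unit higher

print(f"Predicted Balance at Income 50: {low:.2f}")
print(f"Change for one extra unit of Income: {high - low:.3f}")
```

The difference between the two predictions equals the slope, which is what "a one-unit increase in Income" means in this model.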

The confidence intervals for the estimate of the simple linear regression model are as follows:

##                  2.5 %     97.5 %
## (Intercept) 181.246749 311.782753
## Income        4.909394   7.187332
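These intervals follow the usual form: estimate plus or minus a critical t value times the standard error. The Python sketch below reproduces them approximately from the printed estimates and standard errors. Because the Python standard library has no Student t quantile, it uses a first-order Cornish-Fisher correction to the normal quantile. That approximation is my own choice and is not how R computes the interval.

```python
from statistics import NormalDist

def t_quantile_approx(p, df):
    """Approximate Student t quantile: normal quantile plus a first-order correction."""
    z = NormalDist().inv_cdf(p)
    return z + (z ** 3 + z) / (4 * df)

def conf_int(estimate, std_error, df, level=0.95):
    """Two-sided confidence interval: estimate +/- t_crit * std_error."""
    t_crit = t_quantile_approx(0.5 + level / 2, df)
    half_width = t_crit * std_error
    return estimate - half_width, estimate + half_width

# Estimates and standard errors copied from the regression summary.
income_ci = conf_int(6.048363, 0.5794, 398)
intercept_ci = conf_int(246.514751, 33.1993, 398)

print("Income:   ", income_ci)
print("Intercept:", intercept_ci)
```

Small differences from the table are expected, since the standard errors are used as rounded in the printed summary.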

Next we will portray four built-in diagnostic plots using the “plot” function and provide subsequent analysis of each plot:

Plot 1 (above): The residuals vs. the fitted values. The residuals vs. fitted plot is used to check the linear relationship assumption. A horizontal line without distinct patterns indicates a linear relationship. In this case, the line appears to be somewhat linear, which is good: it suggests that a linear relationship between the predictor (Income) and the outcome variable (Balance) is a reasonable assumption.

Plot 2 (above): The Normal Q-Q plot. This plot is used to examine whether the residuals are normally distributed. It is good if the residuals follow the straight dashed line. In this case, they do not follow the dashed line, so the normality assumption does not appear to hold. Since the residuals are not Gaussian, the errors are unlikely to be Gaussian either. For small sample sizes, this lack of normality would mean that we could not assume our estimator is Gaussian, so the standard confidence intervals and significance tests would not be valid. With 400 observations, those intervals and tests should be read as large-sample approximations rather than exact results.
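A rough numeric companion to the Q-Q plot can be built from the residual quartiles printed in the model summary. The Python sketch below compares them with the quartiles of a Gaussian that has mean zero and standard deviation equal to the residual standard error (407.9). Using the residual standard error as the reference scale is a simplifying assumption, and three quantiles are no substitute for the full plot.

```python
from statistics import NormalDist

RSE = 407.9  # residual standard error from the summary

# Residual quartiles copied from the summary output.
observed = {"1Q": -348.99, "Median": -54.42, "3Q": 331.75}
probs = {"1Q": 0.25, "Median": 0.50, "3Q": 0.75}

# Gaussian reference with the same scale as the residuals.
reference = NormalDist(mu=0.0, sigma=RSE)
expected = {name: reference.inv_cdf(p) for name, p in probs.items()}

for name in observed:
    print(f"{name:>6}: observed {observed[name]:8.2f}   Gaussian reference {expected[name]:8.2f}")
```

The observed quartiles sit further from zero than the Gaussian reference and the median is negative, which is in line with the departure from the dashed line described above.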

Plot 3 (above): Scale-Location (or Spread-Location). This plot shows whether the residuals are spread equally along the range of fitted values. It is used to check the homogeneity of variance of the residuals (homoscedasticity). A horizontal line with points that are equally and randomly spread is a good indication of homoscedasticity. If not, there may be a heteroscedasticity problem (non-constant variance in the residual errors). In our example above, the residuals do appear to be randomly and fairly equally spread, with the exception of the line of points extending from the center-left and upwards at a 30 degree angle.
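One possible reading of that line of points, which I have not checked against the plot itself, is that it corresponds to customers whose Balance is exactly zero (the data summary shows a minimum Balance of 0). For those rows the residual is simply the negative of the fitted value, so the plotted quantity grows steadily with the fitted value. The Python sketch below illustrates this under two assumptions: the Income values are chosen for illustration, and leverage is ignored when scaling the residual.

```python
import math

# Coefficients and residual standard error from the simple model output.
INTERCEPT, SLOPE, RSE = 246.514751, 6.048363, 407.9

def scale_location_point(income, balance=0.0):
    """Return (fitted value, sqrt(|residual| / RSE)) for one observation.

    Leverage is ignored, so this only approximates the standardized
    residual that the diagnostic plot uses.
    """
    fitted = INTERCEPT + SLOPE * income
    residual = balance - fitted
    return fitted, math.sqrt(abs(residual) / RSE)

# Minimum and maximum Income from the data summary, plus two illustrative values.
incomes = [10.35, 50.0, 100.0, 186.63]
points = [scale_location_point(x) for x in incomes]

for fitted, y in points:
    print(f"fitted = {fitted:8.2f}   sqrt(|scaled residual|) = {y:.3f}")
```

If this reading is right, the line is a feature of the many zero balances rather than evidence of a general variance problem.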

Plot 4 (above): Residuals vs Leverage. This plot is used to identify influential cases, that is, extreme values that might influence the regression results when included in or excluded from the analysis. The plot above highlights the 3 most extreme points (#122, #276, and #324). It is good that no outliers in this data set exceed 3 standard deviations. Additionally, no high-leverage points are depicted in the plot above.
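The statement about 3 standard deviations can be cross-checked roughly against the printed summary by dividing the most extreme residuals by the residual standard error. The Python sketch below does this. It ignores the leverage adjustment used for true standardized residuals, so the plotted values will be slightly larger in magnitude.

```python
# Values copied from the simple-model summary.
RSE = 407.9
residual_min, residual_max = -803.64, 1100.25

# Residuals expressed in units of the residual standard error.
scaled_min = residual_min / RSE
scaled_max = residual_max / RSE
within_three = max(abs(scaled_min), abs(scaled_max)) < 3

print(f"scaled min = {scaled_min:.2f}, scaled max = {scaled_max:.2f}, within 3 SD: {within_three}")
```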

Question 1.2:

Is the overall Credit card balance higher or lower when there is a higher income? (i.e. what is the trend)? Support your assertions using your analysis (a good visualization helps here).

Answer 1.2:

We discussed this earlier when reviewing the coefficient for this simple linear regression model. Furthermore, the scatterplot below also shows that income and balance are positively correlated. In general, as income increases, the balance increases as well. It is interesting to note that individuals with 50k income or less are more likely to have a zero balance when compared to individuals with income higher than 50k. See plot below.
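In a simple regression with one predictor, the squared correlation between the predictor and the response equals the Multiple R-squared, and the correlation takes the sign of the slope. The minimal Python sketch below uses the values from the simple-model summary to put a number on the trend described here.

```python
import math

# Values copied from the simple-model summary.
r_squared = 0.215
slope = 6.0484

# Correlation between Income and Balance: magnitude from R-squared, sign from the slope.
correlation = math.copysign(math.sqrt(r_squared), slope)

print(f"Correlation between Income and Balance: {correlation:.3f}")
```

A correlation of this size indicates a clear positive trend with considerable scatter around it.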

Part 3. Multiple Linear Regression.

Use the techniques/code you learned for linear regression in this week’s in-class lab to analyze the Credit data for predicting “Balance” by all other variables.

In the Credit dataset, we have 12 variables: ID, Income, Limit, Rating, Cards, Age, Education, Gender, Student, Married, Ethnicity, and Balance. ID is a label for each data point and will not be used in the multiple linear regression model. Income and Balance are continuous variables. Limit, Rating, Cards, Education, and Age are discrete variables. Gender, Student, and Married are binary categorical variables. Ethnicity is also a categorical variable, with three categories. In the subsequent multiple regression analysis, we will use as.factor for all categorical independent variables: Gender, Student, Married, and Ethnicity.
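When a factor enters a linear model, each level except a baseline gets its own 0/1 indicator column, which is why the model summary in this report lists coefficients for Asian and Caucasian but not for African American. The Python sketch below illustrates that coding scheme. The helper name and the three sample rows are hypothetical; the level order follows R's default alphabetical ordering.

```python
def dummy_code(values, levels):
    """Treatment coding: the first level is the baseline and gets no column."""
    baseline, *others = levels
    return [[1 if value == level else 0 for level in others] for value in values]

# Levels in alphabetical order, so "African American" is the baseline.
levels = ["African American", "Asian", "Caucasian"]

# Hypothetical rows, for illustration only.
sample = ["Asian", "Caucasian", "African American"]
rows = dummy_code(sample, levels)

for value, row in zip(sample, rows):
    print(f"{value:<17} -> Asian={row[0]}, Caucasian={row[1]}")
```

Each Ethnicity coefficient is therefore a difference in expected Balance relative to the baseline group, with the other predictors held fixed.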

Question 2.1:

Produce a graphic/visualization suitable for briefing a decision-maker that illustrates the observed data and your fitted (and anything else you think would be helpful). Especially useful to conduct the diagnostic analysis including:

Answer 2.1:

To answer question 2.1 above, we will show the four built-in diagnostic plots using the “plot” function for the multiple linear regression model built with all available variables (with the exception of ID) and provide subsequent analysis. Later in the analysis, once the final model is determined, we will repeat this same analysis with the new model.

Plot 1 (above): The residuals vs. the fitted values. The residuals vs. fitted plot is used to check the linear relationship assumption. A horizontal line without distinct patterns indicates a linear relationship. In this case, the line is clearly not flat, which suggests that the relationship is not well described as linear when all the explanatory variables are used.

Plot 2 (above): The Normal Q-Q plot. This is used to examine whether the residuals are normally distributed. It’s good if the points of the residuals follow the straight dashed line. In this case, they do not, so this does not depict normality.

Plot 3 (above): Scale-Location (or Spread-Location). This plot is used to check the homogeneity of variance of the residuals (homoscedasticity). A horizontal line with equally spread points is a good indication of homoscedasticity. This is definitely not the case in the example above, which implies that there is a heteroscedasticity problem when using all the explanatory variables.

Plot 4 (above): Residuals vs Leverage. This plot is used to identify influential cases, that is, extreme values that might influence the regression results when included in or excluded from the analysis. The plot above highlights the 3 most extreme points (#242, #96, and #346). No high-leverage points are depicted in the plot above. However, the largest residual in the summary for this model (318.20, reported under Answer 2.2.1) is about 3.2 times the residual standard error (98.79), so at least one observation lies slightly beyond 3 standard deviations and deserves a closer look.
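Whether a point counts as "high leverage" is usually judged against the average leverage, which equals the number of fitted coefficients divided by the number of observations. Common rules of thumb flag points above two or three times that average. The Python sketch below computes these benchmarks for the two models in this report. The function name is my own, and the cutoffs are conventions rather than anything taken from the original analysis.

```python
def leverage_cutoffs(n_obs, n_predictors):
    """Return (mean leverage, 2x mean, 3x mean) for a model with an intercept."""
    n_params = n_predictors + 1  # slope coefficients plus the intercept
    mean_leverage = n_params / n_obs
    return mean_leverage, 2 * mean_leverage, 3 * mean_leverage

# 400 observations; 1 predictor in the simple model, 11 coefficient columns in the full model.
simple_model = leverage_cutoffs(400, 1)
full_model = leverage_cutoffs(400, 11)

print("Simple model (mean, 2x, 3x):", simple_model)
print("Full model   (mean, 2x, 3x):", full_model)
```

Points whose leverage on the plot's horizontal axis falls well below these cutoffs support the reading that no single observation dominates the fit.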

Question 2.2.1:

Is at least one of the predictors X1, X2, . . . , Xp useful in predicting the response?

Answer 2.2.1:

## 
## Call:
## lm(formula = Balance ~ Income + Limit + Rating + Cards + Age + 
##     Education + as.factor(Gender) + as.factor(Student) + as.factor(Married) + 
##     as.factor(Ethnicity) - ID, data = Credit)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -161.64  -77.70  -13.49   53.98  318.20 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   -479.20787   35.77394 -13.395  < 2e-16 ***
## Income                          -7.80310    0.23423 -33.314  < 2e-16 ***
## Limit                            0.19091    0.03278   5.824 1.21e-08 ***
## Rating                           1.13653    0.49089   2.315   0.0211 *  
## Cards                           17.72448    4.34103   4.083 5.40e-05 ***
## Age                             -0.61391    0.29399  -2.088   0.0374 *  
## Education                       -1.09886    1.59795  -0.688   0.4921    
## as.factor(Gender)Female        -10.65325    9.91400  -1.075   0.2832    
## as.factor(Student)Yes          425.74736   16.72258  25.459  < 2e-16 ***
## as.factor(Married)Yes           -8.53390   10.36287  -0.824   0.4107    
## as.factor(Ethnicity)Asian       16.80418   14.11906   1.190   0.2347    
## as.factor(Ethnicity)Caucasian   10.10703   12.20992   0.828   0.4083    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 98.79 on 388 degrees of freedom
## Multiple R-squared:  0.9551, Adjusted R-squared:  0.9538 
## F-statistic: 750.3 on 11 and 388 DF,  p-value: < 2.2e-16

Based on the summary statistics above, when all explanatory variables (except ID) are used to predict Balance, at least one of the predictors is useful in predicting Balance: the overall F-statistic is 750.3 on 11 and 388 degrees of freedom, with a p-value below 2.2e-16. More specifically, Income, Limit, Cards, and as.factor(Student) appear to be highly significant based on the p-values above. Rating and Age, which are significant at the 0.05 level, may prove to be decent predictors as well. We will validate this initial observation with a forward/backward stepwise regression in question 2.2.2 below.
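The question of whether at least one predictor is useful is the one the overall F-test addresses, and the F-statistic can be recovered from R-squared and the degrees of freedom. The Python sketch below is a minimal check using the rounded values printed in the summary; the function name is my own.

```python
def overall_f(r_squared, n_predictors, resid_df):
    """Overall F-statistic: explained variation per predictor over residual variation per df."""
    return (r_squared / n_predictors) / ((1 - r_squared) / resid_df)

# Values copied from the full-model summary.
f_full = overall_f(0.9551, 11, 388)

print(f"Overall F for the full model: {f_full:.1f}")
```

The result matches the reported F-statistic up to the rounding of R-squared.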

Question 2.2.2:

Do all the predictors help to explain Balance, or is only a subset of the predictors useful?

Answer 2.2.2:

It appears that not all of the predictors help to explain Balance. This statement is based strictly on the “Pr(>|t|)” column: Income, Limit, Rating, Cards, Age, and StudentYes may be good predictors, but we will perform forward/backward stepwise regression to find the best explanatory variables to use as predictors in our multiple linear regression model for Balance:

## 
## Call:
## lm(formula = Balance ~ Income + Limit + Rating + Cards + Age + 
##     as.factor(Student), data = Credit)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -170.00  -77.85  -11.84   56.87  313.52 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -493.73419   24.82476 -19.889  < 2e-16 ***
## Income                  -7.79508    0.23342 -33.395  < 2e-16 ***
## Limit                    0.19369    0.03238   5.981 4.98e-09 ***
## Rating                   1.09119    0.48480   2.251   0.0250 *  
## Cards                   18.21190    4.31865   4.217 3.08e-05 ***
## Age                     -0.62406    0.29182  -2.139   0.0331 *  
## as.factor(Student)Yes  425.60994   16.50956  25.780  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 98.61 on 393 degrees of freedom
## Multiple R-squared:  0.9547, Adjusted R-squared:  0.954 
## F-statistic:  1380 on 6 and 393 DF,  p-value: < 2.2e-16

The forward/backward stepwise regression analysis above selected the same subset that I identified previously from the p-values (Income, Limit, Rating, Cards, Age, and Student) as the “best” set of predictors for Balance. This will be considered our best model for this lab.
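The report does not show which function or criterion drove the stepwise search. Assuming an AIC-style criterion, which is the default of R's step function, the two models can be compared from their printed summaries alone. The Python sketch below rebuilds the residual sum of squares from the residual standard error and computes AIC up to an additive constant. The function name is hypothetical and the inputs are rounded.

```python
import math

def step_aic(rse, resid_df, n_obs):
    """AIC up to an additive constant: n * log(RSS / n) + 2 * (number of coefficients)."""
    rss = rse ** 2 * resid_df    # residual sum of squares recovered from the RSE
    n_coef = n_obs - resid_df    # coefficients, including the intercept
    return n_obs * math.log(rss / n_obs) + 2 * n_coef

# Residual standard errors and degrees of freedom copied from the two summaries.
aic_full = step_aic(98.79, 388, 400)
aic_reduced = step_aic(98.61, 393, 400)

print(f"Full model:    {aic_full:.1f}")
print(f"Reduced model: {aic_reduced:.1f}")
print(f"Difference (full - reduced): {aic_full - aic_reduced:.1f}")
```

Under that assumption the reduced model has the lower value, which is consistent with the stepwise procedure dropping Education, Gender, Married, and Ethnicity.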

Question 2.2.3:

How well does the model fit the data?

Answer 2.2.3:

It is usually the case that the larger the R-squared value, the more precisely the predictor variables are able to predict the value of the response variable. Based on the R-squared value of .955, the model appears to fit the data well. The initial model that used all variables had a slightly higher R-squared value (.9551, compared with .9547 for the “best” model built using stepwise regression). This is expected, because R-squared cannot decrease when predictors are added to a model. The adjusted R-squared, which penalizes extra predictors, is slightly higher for the stepwise model (0.954 versus 0.9538).
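The adjusted R-squared values in the two summaries can be reproduced from the plain R-squared, the sample size, and the number of predictors. The Python sketch below is a minimal illustration using the rounded values printed in the output; the function name is my own.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R-squared: penalizes R-squared for the number of predictors used."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

# R-squared values copied from the two summaries; 400 observations in both.
adj_full = adjusted_r_squared(0.9551, 400, 11)    # all variables except ID
adj_reduced = adjusted_r_squared(0.9547, 400, 6)  # stepwise model

print(f"Adjusted R-squared, full model:     {adj_full:.4f}")
print(f"Adjusted R-squared, stepwise model: {adj_reduced:.4f}")
```

The stepwise model comes out marginally ahead once the penalty for extra predictors is applied, even though its plain R-squared is slightly lower.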

Next, we want to look at the plots of the best stepwise model to see whether there are any improvements over the initial plots we previously discussed for the first model we built using all variables.

I do not see much change in the plots above using the “best” model. The plots still do not meet the conditions required for a good model fit: the residuals are not normally distributed, and their mean is not zero across the range of fitted values. Since the residuals are not Gaussian, the errors are unlikely to be Gaussian either. For small sample sizes, this lack of normality would mean that we could not assume our estimator is Gaussian, so the standard confidence intervals and significance tests would not be valid with the data as is.

Question 2.2.4:

What does the coefficient for the Income variable suggest?

Answer 2.2.4:

##           (Intercept)                Income                 Limit 
##          -493.7341870            -7.7950824             0.1936914 
##                Rating                 Cards                   Age 
##             1.0911874            18.2118976            -0.6240560 
## as.factor(Student)Yes 
##           425.6099369

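The -7.8 coefficient for the Income variable suggests a negative association between Income and Balance once the other predictors are held fixed. More specifically, for each one-unit increase in Income, the expected credit card balance decreases by about 7.8 dollars, with Limit, Rating, Cards, Age, and Student status held constant. Likewise, a one-unit decrease in Income corresponds to an expected increase of about 7.8 dollars. This is interesting because when we initially predicted Balance using only Income, the coefficient was positive. A sign change like this can occur when Income is correlated with other predictors in the model, such as Limit and Rating.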
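To illustrate what "holding the other predictors fixed" means, the Python sketch below evaluates the stepwise model for one customer profile and then raises Income by one unit while leaving everything else unchanged. The coefficients come from the table above. The profile is hypothetical (roughly the medians from the data summary, with an Income of 50), and the function name is my own.

```python
# Coefficients copied from the stepwise model's coefficient table.
COEF = {
    "intercept": -493.7341870,
    "income": -7.7950824,
    "limit": 0.1936914,
    "rating": 1.0911874,
    "cards": 18.2118976,
    "age": -0.6240560,
    "student_yes": 425.6099369,
}

def predict_balance(income, limit, rating, cards, age, student):
    """Fitted Balance from the stepwise model; student is True or False."""
    return (COEF["intercept"]
            + COEF["income"] * income
            + COEF["limit"] * limit
            + COEF["rating"] * rating
            + COEF["cards"] * cards
            + COEF["age"] * age
            + COEF["student_yes"] * (1 if student else 0))

# Hypothetical profile, for illustration only.
profile = dict(income=50.0, limit=4622, rating=344, cards=3, age=56, student=False)

base = predict_balance(**profile)
bumped = predict_balance(**{**profile, "income": profile["income"] + 1})

print(f"Change in predicted Balance for +1 Income: {bumped - base:.4f}")
```

The change equals the Income coefficient regardless of which profile is chosen, because the model is linear with no interaction terms.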