In this take-home lab exercise, we apply linear regression models to analyze the Credit dataset. We first summarize the data, then fit a simple linear regression, and finally fit a multiple linear regression.
Part 1
The summary of the Credit data set is as follows:
##        ID            Income           Limit           Rating
##  Min.   :  1.0   Min.   : 10.35   Min.   :  855   Min.   : 93.0
##  1st Qu.:100.8   1st Qu.: 21.01   1st Qu.: 3088   1st Qu.:247.2
##  Median :200.5   Median : 33.12   Median : 4622   Median :344.0
##  Mean   :200.5   Mean   : 45.22   Mean   : 4736   Mean   :354.9
##  3rd Qu.:300.2   3rd Qu.: 57.47   3rd Qu.: 5873   3rd Qu.:437.2
##  Max.   :400.0   Max.   :186.63   Max.   :13913   Max.   :982.0
##      Cards            Age          Education        Gender    Student
##  Min.   :1.000   Min.   :23.00   Min.   : 5.00    Male :193   No :360
##  1st Qu.:2.000   1st Qu.:41.75   1st Qu.:11.00   Female:207   Yes: 40
##  Median :3.000   Median :56.00   Median :14.00
##  Mean   :2.958   Mean   :55.67   Mean   :13.45
##  3rd Qu.:4.000   3rd Qu.:70.00   3rd Qu.:16.00
##  Max.   :9.000   Max.   :98.00   Max.   :20.00
##  Married              Ethnicity      Balance
##  No :155   African American: 99   Min.   :   0.00
##  Yes:245   Asian           :102   1st Qu.:  68.75
##            Caucasian       :199   Median : 459.50
##                                   Mean   : 520.01
##                                   3rd Qu.: 863.00
##                                   Max.   :1999.00
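A summary like the one above could be produced with a few lines of R. This is a minimal sketch, assuming the data is the `Credit` data frame shipped with the ISLR package; the report does not state where the data was loaded from.

```r
# Assumption: the Credit data is the copy shipped with the ISLR package.
library(ISLR)

data(Credit)

# Dimensions: the report describes 400 observations and 12 variables
dim(Credit)

# Column types: categorical variables are stored as factors
str(Credit)

# Per-column summary (quartiles for numeric columns, counts for factors)
summary(Credit)
```

`str()` is included because it shows which columns are factors, which matters later when categorical predictors enter the regression.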
Part 2. Estimating the coefficients for simple linear regression.
Use the techniques/code you learned for linear regression in this week’s in-class lab to analyze the Credit data for predicting “Balance” by “Income.”
Question 1.1:
Produce a graphic/visualization suitable for briefing a decision-maker that illustrates the observed data and your fitted model (and anything else you think would be helpful). It is especially useful to conduct a diagnostic analysis, including:
Residuals vs Fitted.
Normal Q-Q.
Scale-Location (or Spread-Location).
Residuals vs Leverage.
Answer 1.1:
First, we want to determine whether there is a relationship between balance and income. To answer this question, we fit a simple linear regression model to the data. Balance is the y variable (the response variable), and Income is the x variable (the explanatory variable).
The summary data for this simple linear regression model is as follows:
##
## Call:
## lm(formula = Balance ~ Income, data = Credit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -803.64 -348.99 -54.42 331.75 1100.25
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 246.5148 33.1993 7.425 6.9e-13 ***
## Income 6.0484 0.5794 10.440 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 407.9 on 398 degrees of freedom
## Multiple R-squared: 0.215, Adjusted R-squared: 0.213
## F-statistic: 109 on 1 and 398 DF, p-value: < 2.2e-16
The estimated coefficient for Income in the simple linear regression model is 6.048. In other words, there is a positive association between the independent variable (Income) and the dependent variable (Balance): for each one-unit increase in Income, Balance is expected to increase by about 6.05 units. See the coefficient table below:
## (Intercept) Income
## 246.514751 6.048363
The 95% confidence intervals for the coefficient estimates of the simple linear regression model are as follows:
## 2.5 % 97.5 %
## (Intercept) 181.246749 311.782753
## Income 4.909394 7.187332
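The coefficient and interval tables above can be reproduced, and the Income interval checked by hand, with the sketch below. It assumes the ISLR copy of `Credit`; the object names (`fit_simple`, `manual_ci`, and so on) are illustrative and do not come from the original lab code.

```r
# Assumption: Credit comes from the ISLR package.
library(ISLR)

# Simple linear regression of Balance on Income
fit_simple <- lm(Balance ~ Income, data = Credit)

coef(fit_simple)                    # point estimates
confint(fit_simple, level = 0.95)   # 95% confidence intervals

# The Income interval by hand: estimate +/- t quantile * standard error
coef_table <- coef(summary(fit_simple))
est    <- coef_table["Income", "Estimate"]
se     <- coef_table["Income", "Std. Error"]
t_crit <- qt(0.975, df = df.residual(fit_simple))  # 398 residual degrees of freedom

manual_ci <- c(lower = est - t_crit * se, upper = est + t_crit * se)
manual_ci
```

With the rounded values from the summary, the hand calculation is roughly 6.0484 ± 1.966 × 0.5794, which agrees with the interval (4.909, 7.187) printed above.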
Next, we show the four built-in diagnostic plots produced by the “plot” function and analyze each one:
Plot 1 (above): Residuals vs. Fitted. This plot is used to check the linearity assumption. A roughly horizontal line without distinct patterns indicates a linear relationship. In this case, the line appears to be fairly flat, which is good, and means that we can assume a linear relationship between the predictor (Income) and the outcome variable (Balance).
Plot 2 (above): The Normal Q-Q plot. This plot is used to examine whether the residuals are normally distributed. It is good if the residuals follow the straight dashed line. In this case, they do not follow the dashed line, so the normality assumption does not appear to hold. Since the residuals are not Gaussian, the underlying errors are unlikely to be Gaussian either. Additionally, for small sample sizes this lack of normality means that we could not assume our estimator is Gaussian, so the standard confidence intervals and significance tests would not be valid.
Plot 3 (above): Scale-Location (or Spread-Location). This plot shows whether the residuals are spread equally along the range of the predictor and is used to check the homogeneity of variance of the residuals (homoscedasticity). A horizontal line with points that are equally and randomly spread is a good indication of homoscedasticity. Otherwise, there is a heteroscedasticity problem (non-constant variance in the residual errors). In our example above, the residuals do appear to be randomly and fairly equally spread, with the exception of the line of points extending from the center-left upwards at roughly a 30-degree angle.
Plot 4 (above): Residuals vs. Leverage. This plot is used to identify influential cases, that is, extreme values that might change the regression results when included in or excluded from the analysis. The plot above labels the three most extreme points (#122, #276, and #324). No standardized residual exceeds 3 in absolute value, which is good, and no high-leverage points are depicted in the plot above.
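The four diagnostic plots discussed above, together with numeric checks that mirror the Residuals vs. Leverage reading, could be generated roughly as follows. This is a sketch that assumes the ISLR copy of `Credit`; the 0.5 threshold is the first Cook's distance contour that `plot()` draws by default for `lm` objects, not a cutoff taken from the report.

```r
# Assumption: Credit comes from the ISLR package.
library(ISLR)

fit_simple <- lm(Balance ~ Income, data = Credit)

# Four built-in diagnostic plots in a 2 x 2 grid
old_par <- par(mfrow = c(2, 2))
plot(fit_simple)
par(old_par)

# Numeric companions to the Residuals vs. Leverage plot
std_res <- rstandard(fit_simple)        # standardized residuals
cooks   <- cooks.distance(fit_simple)   # influence of each observation

sum(abs(std_res) > 3)                    # residuals beyond 3 standard deviations
sum(cooks > 0.5)                         # points past the first Cook's contour
head(sort(cooks, decreasing = TRUE), 3)  # three most influential observations
```

Counting standardized residuals beyond 3 gives a number to set beside the visual judgment, which is useful when a briefing needs a single figure rather than a plot.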
Question 1.2:
Is the overall Credit card balance higher or lower when there is a higher income? (i.e. what is the trend)? Support your assertions using your analysis (a good visualization helps here).
Answer 1.2:
We discussed this earlier when reviewing the coefficient for this simple linear regression model. Furthermore, the scatterplot below also shows that income and balance are positively correlated. In general, as income increases, the balance increases as well. It is interesting to note that individuals with 50k income or less are more likely to have a zero balance when compared to individuals with income higher than 50k. See plot below.
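A scatterplot with the fitted line, plus a quick check of the zero-balance observation, might look like the sketch below. It assumes the ISLR copy of `Credit`; the cutoff of 50 follows the "50k" wording in the answer and treats Income as recorded in thousands of dollars, which is an assumption about units.

```r
# Assumption: Credit comes from the ISLR package.
library(ISLR)

fit_simple <- lm(Balance ~ Income, data = Credit)

# Observed data with the fitted regression line
plot(Balance ~ Income, data = Credit,
     pch = 19, col = "grey40",
     xlab = "Income", ylab = "Balance",
     main = "Balance vs. Income with fitted line")
abline(fit_simple, col = "red", lwd = 2)

# Strength and direction of the linear association
r <- cor(Credit$Income, Credit$Balance)
r

# Share of zero balances for Income <= 50 (TRUE) vs. Income > 50 (FALSE)
zero_share <- tapply(Credit$Balance == 0, Credit$Income <= 50, mean)
zero_share
```

In a simple linear regression the squared correlation equals the R-squared of the model, so `r^2` should match the 0.215 reported in the summary.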
Part 3. Multiple Linear Regression.
Use the techniques/code you learned for linear regression in this week’s in-class lab to analyze the Credit data for predicting “Balance” by all other variables.
In the Credit dataset, we have 12 variables: ID, Income, Limit, Rating, Cards, Age, Education, Gender, Student, Married, Ethnicity, and Balance. ID is a label for each data point and will not be used in the multiple linear regression model. Income and Balance are continuous variables. Limit, Rating, Cards, Education, and Age are discrete variables. Gender, Student, and Married are binary categorical variables. Ethnicity is also a categorical variable, with three categories. In the subsequent multiple linear regression analysis, we will use as.factor for all categorical independent variables: Gender, Student, Married, and Ethnicity.
Question 2.1:
Produce a graphic/visualization suitable for briefing a decision-maker that illustrates the observed data and your fitted model (and anything else you think would be helpful). It is especially useful to conduct a diagnostic analysis, including:
Residuals vs Fitted.
Normal Q-Q.
Scale-Location (or Spread-Location).
Residuals vs Leverage.
Answer 2.1:
To answer question 2.1 above, we show the four built-in diagnostic plots produced by the “plot” function for the multiple linear regression model built with all available variables (with the exception of ID) and analyze each one. Later in the analysis, once the final model is determined, we will repeat this analysis with the new model.
Plot 1 (above): Residuals vs. Fitted. This plot is used to check the linearity assumption. A roughly horizontal line without distinct patterns indicates a linear relationship. In this case, the line is clearly not horizontal, which shows that there is not a solid linear relationship when using all the explanatory variables.
Plot 2 (above): The Normal Q-Q plot. This is used to examine whether the residuals are normally distributed. It’s good if the points of the residuals follow the straight dashed line. In this case, they do not, so this does not depict normality.
Plot 3 (above): Scale-Location (or Spread-Location). This plot is used to check the homogeneity of variance of the residuals (homoscedasticity). A horizontal line with equally spread points is a good indication of homoscedasticity. That is definitely not the case in the plot above, which implies that there is a heteroscedasticity problem when using all the explanatory variables.
Plot 4 (above): Residuals vs. Leverage. This plot is used to identify influential cases, that is, extreme values that might change the regression results when included in or excluded from the analysis. The plot above labels the three most extreme points (#242, #96, and #346). Note that the model summary reports a largest residual of 318.20 against a residual standard error of 98.79, which is about 3.2 standard errors, so at least one observation lies slightly beyond 3 standard deviations and deserves a closer look. There are no high-leverage points depicted in the plot above.
Question 2.2.1:
Is at least one of the predictors X1, X2, . . . , Xp useful in predicting the response?
Answer 2.2.1:
##
## Call:
## lm(formula = Balance ~ Income + Limit + Rating + Cards + Age +
## Education + as.factor(Gender) + as.factor(Student) + as.factor(Married) +
## as.factor(Ethnicity) - ID, data = Credit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -161.64 -77.70 -13.49 53.98 318.20
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -479.20787 35.77394 -13.395 < 2e-16 ***
## Income -7.80310 0.23423 -33.314 < 2e-16 ***
## Limit 0.19091 0.03278 5.824 1.21e-08 ***
## Rating 1.13653 0.49089 2.315 0.0211 *
## Cards 17.72448 4.34103 4.083 5.40e-05 ***
## Age -0.61391 0.29399 -2.088 0.0374 *
## Education -1.09886 1.59795 -0.688 0.4921
## as.factor(Gender)Female -10.65325 9.91400 -1.075 0.2832
## as.factor(Student)Yes 425.74736 16.72258 25.459 < 2e-16 ***
## as.factor(Married)Yes -8.53390 10.36287 -0.824 0.4107
## as.factor(Ethnicity)Asian 16.80418 14.11906 1.190 0.2347
## as.factor(Ethnicity)Caucasian 10.10703 12.20992 0.828 0.4083
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 98.79 on 388 degrees of freedom
## Multiple R-squared: 0.9551, Adjusted R-squared: 0.9538
## F-statistic: 750.3 on 11 and 388 DF, p-value: < 2.2e-16
Based on the summary statistics above, when using all explanatory variables (except ID) to predict Balance, at least one of the predictors is useful: the overall F-statistic (750.3 on 11 and 388 DF, p-value < 2.2e-16) rejects the hypothesis that all coefficients are zero. More specifically, Income, Limit, Cards, and as.factor(Student) are highly significant based on the p-values above. Rating and Age, which are significant at the 0.05 level, may prove to be decent predictors as well. We will validate this initial observation with a forward/backward stepwise regression in question 2.2.2 below.
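The overall F-test behind this answer can be extracted directly from the fitted model. The sketch below refits the full model with the formula shown in the summary, assuming the ISLR copy of `Credit`; ID is simply left out of the formula, and the object names are illustrative.

```r
# Assumption: Credit comes from the ISLR package.
library(ISLR)

# Full model: every predictor except ID, categorical ones wrapped in as.factor
fit_full <- lm(Balance ~ Income + Limit + Rating + Cards + Age + Education +
                 as.factor(Gender) + as.factor(Student) + as.factor(Married) +
                 as.factor(Ethnicity),
               data = Credit)

# Overall F-test: H0 is that all slope coefficients are zero
f_stat  <- summary(fit_full)$fstatistic   # named vector: value, numdf, dendf
p_value <- pf(f_stat[["value"]], f_stat[["numdf"]], f_stat[["dendf"]],
              lower.tail = FALSE)
f_stat
p_value

# Terms whose individual t-tests have p-values below 0.05
coef_table <- coef(summary(fit_full))
rownames(coef_table)[coef_table[, "Pr(>|t|)"] < 0.05]
```

The F-test answers the "at least one predictor" question with a single test, whereas scanning eleven individual t-tests would inflate the chance of a false positive.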
Question 2.2.2:
Do all the predictors help to explain Balance, or is only a subset of the predictors useful?
Answer 2.2.2:
It appears that not all of the predictors help to explain Balance. This statement is based strictly on the “Pr(>|t|)” column: Income, Limit, Rating, Cards, Age, and StudentYes may be good predictors, but we will perform a forward/backward stepwise regression to find the best explanatory variables to use as predictors of Balance in our multiple linear regression model:
##
## Call:
## lm(formula = Balance ~ Income + Limit + Rating + Cards + Age +
## as.factor(Student), data = Credit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -170.00 -77.85 -11.84 56.87 313.52
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -493.73419 24.82476 -19.889 < 2e-16 ***
## Income -7.79508 0.23342 -33.395 < 2e-16 ***
## Limit 0.19369 0.03238 5.981 4.98e-09 ***
## Rating 1.09119 0.48480 2.251 0.0250 *
## Cards 18.21190 4.31865 4.217 3.08e-05 ***
## Age -0.62406 0.29182 -2.139 0.0331 *
## as.factor(Student)Yes 425.60994 16.50956 25.780 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 98.61 on 393 degrees of freedom
## Multiple R-squared: 0.9547, Adjusted R-squared: 0.954
## F-statistic: 1380 on 6 and 393 DF, p-value: < 2.2e-16
The forward/backward stepwise regression analysis above confirms that the subset I previously mentioned is the “best” set of predictors for Balance. This will be considered our best model for this lab.
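The exact stepwise call is not shown in the report. One plausible way to run a forward/backward search is base R's `step()` with `direction = "both"`, which selects terms by AIC. The sketch below assumes that approach and the ISLR copy of `Credit`.

```r
# Assumption: Credit comes from the ISLR package, and the stepwise search
# used AIC-based step(); the original call is not shown in the report.
library(ISLR)

fit_full <- lm(Balance ~ Income + Limit + Rating + Cards + Age + Education +
                 as.factor(Gender) + as.factor(Student) + as.factor(Married) +
                 as.factor(Ethnicity),
               data = Credit)

# Start from the full model; terms may be dropped and later re-added
fit_step <- step(fit_full, direction = "both", trace = 0)

formula(fit_step)   # terms that survived the search
summary(fit_step)

# The selected model should not have a worse AIC than the starting model
c(full = AIC(fit_full), step = AIC(fit_step))
```

Setting `trace = 0` suppresses the step-by-step log; dropping that argument prints each candidate move and its AIC, which helps when explaining why a term was removed.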
Question 2.2.3:
How well does the model fit the data?
Answer 2.2.3:
It is usually the case that the larger the R-squared value, the more precisely the predictor variables are able to predict the value of the response variable. Based on its R-squared value of 0.955, the model appears to fit the data well. Interestingly, the initial model that used all variables had a slightly higher R-squared value (0.9551, compared to 0.9547 for the “best” model built using stepwise regression). This is expected, because R-squared cannot decrease when predictors are added to a model. The adjusted R-squared, which penalizes extra predictors, is slightly higher for the stepwise model (0.954 compared to 0.9538).
Next, we want to look at the plots of the best stepwise model to see whether there are any improvements over the plots we previously discussed for the first model, which we built using all variables.
I do not see much change in the plots above using the “best” model. The plots still do not meet the conditions required for a good model fit: the distribution of the residuals is not normal, and the residuals are not centered on zero across the range of fitted values. Since the residuals are not Gaussian, the underlying errors are unlikely to be Gaussian either. Additionally, for small sample sizes this lack of normality means that we could not assume our estimator is Gaussian, so the standard confidence intervals and significance tests would not be valid with the data as is.
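The two models can be compared side by side on R-squared, adjusted R-squared, and residual standard error, and with a partial F-test for the dropped terms. This sketch assumes the ISLR copy of `Credit`; `fit_stats` and the other names are illustrative.

```r
# Assumption: Credit comes from the ISLR package.
library(ISLR)

fit_full <- lm(Balance ~ Income + Limit + Rating + Cards + Age + Education +
                 as.factor(Gender) + as.factor(Student) + as.factor(Married) +
                 as.factor(Ethnicity),
               data = Credit)
fit_best <- lm(Balance ~ Income + Limit + Rating + Cards + Age +
                 as.factor(Student),
               data = Credit)

# One column per model, one row per fit statistic
fit_stats <- sapply(list(full = fit_full, best = fit_best), function(m) {
  s <- summary(m)
  c(r2 = s$r.squared, adj_r2 = s$adj.r.squared, rse = s$sigma)
})
round(fit_stats, 4)

# Partial F-test: do the dropped terms explain anything beyond the best model?
anova(fit_best, fit_full)
```

The best model is nested inside the full model, so `anova()` tests whether the extra terms in the full model jointly improve the fit.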
Question 2.2.4:
What does the coefficient for the Income variable suggest?
Answer 2.2.4:
## (Intercept) Income Limit
## -493.7341870 -7.7950824 0.1936914
## Rating Cards Age
## 1.0911874 18.2118976 -0.6240560
## as.factor(Student)Yes
## 425.6099369
The coefficient of about -7.8 for the Income variable suggests a negative association between Income and Balance once the other predictors in the model are held fixed. More specifically, for each one-unit increase in Income, the credit card balance is expected to decrease by about 7.8 units, with Limit, Rating, Cards, Age, and Student status held constant; for each one-unit decrease in Income, the balance is expected to increase by about 7.8 units. This is interesting because when we initially predicted Balance using only Income, the coefficient was positive.
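A sign change like this usually means that Income is correlated with other predictors, so the simple regression's Income coefficient also picks up their effect. The sketch below, assuming the ISLR copy of `Credit`, checks the correlation of Income with Limit and Rating and tracks the Income coefficient as predictors are added. The intermediate model with only Income and Limit is an illustration and was not part of the original analysis.

```r
# Assumption: Credit comes from the ISLR package.
library(ISLR)

# How strongly does Income move together with Limit and Rating?
cor_income <- c(limit  = cor(Credit$Income, Credit$Limit),
                rating = cor(Credit$Income, Credit$Rating))
cor_income

# Income coefficient under three model specifications
b_simple <- coef(lm(Balance ~ Income, data = Credit))[["Income"]]
b_limit  <- coef(lm(Balance ~ Income + Limit, data = Credit))[["Income"]]
b_best   <- coef(lm(Balance ~ Income + Limit + Rating + Cards + Age +
                      as.factor(Student), data = Credit))[["Income"]]

c(simple = b_simple, with_limit = b_limit, best = b_best)
```

If Income and Limit are strongly positively correlated, the simple model credits Income with the balance that actually tracks the credit limit; conditioning on Limit removes that borrowed effect.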