As Table 1 shows, the dataset contains no missing values.
| Variable | Number of ‘NA’ values |
|---|---|
| Status | 0 |
| Duration | 0 |
| Credit history | 0 |
| Purpose | 0 |
| Amount | 0 |
| Saving account/bond | 0 |
| Employment duration | 0 |
| Installment rate | 0 |
| Personal status and sex | 0 |
| Other debtors | 0 |
| Present residence | 0 |
| Property | 0 |
| Age | 0 |
| Other Installment Plans | 0 |
| Housing | 0 |
| Number Credits | 0 |
| Job | 0 |
| People Liable | 0 |
| Telephone | 0 |
| Foreign Worker | 0 |
| Credit Risk | 0 |
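
A count like the one in Table 1 could be reproduced along the following lines. This is only a minimal sketch: the data frame `credit` and its column names are made-up stand-ins, not the actual German credit data.

```python
import pandas as pd

# Toy stand-in for the German credit data; column names are assumptions.
credit = pd.DataFrame({
    "status": ["A11", "A12", "A14", "A11"],
    "duration": [6, 48, 12, 42],
    "amount": [1169, 5951, 2096, 7882],
    "credit_risk": ["good", "bad", "good", "good"],
})

# Count missing values per column, as in Table 1.
na_counts = credit.isna().sum()
print(na_counts)

# True when no column contains a missing value.
no_missing = bool((na_counts == 0).all())
```
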
The next graphs compare each quantitative variable with the qualitative outcome variable. In all cases, the distributions show a relatively high number of outliers (especially in graphs 2.1 and 3.1). The red dots over the boxplots mark the mean of each distribution, and the line joining them shows the difference between the two groups for each variable.
The difference between the means helps us understand the influence of the quantitative variables on the outcome. Every variable shows a difference between groups, which suggests that it might be associated with the outcome variable; this must of course be checked with hypothesis tests that measure the significance of the difference between groups. Normality plots were used to check the symmetry of the distributions in each case. As the gap between means and medians in the boxplots suggests, mainly because of the outliers, the distributions are not normal, and this must be taken into account in the modeling process. The normality plots also show that both groups have a similar variance for the variables “Credit Amount” and “Age”, whereas the variance of “Duration” differs between groups.
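
The comparison of means, medians and variances by outcome group can also be tabulated directly. The sketch below uses synthetic right-skewed values; the column names `credit_risk` and `amount` and the simulated numbers are assumptions, not results from the report.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic, right-skewed "amount" values for two outcome groups (not the real data).
toy = pd.DataFrame({
    "credit_risk": ["good"] * 200 + ["bad"] * 200,
    "amount": np.concatenate([
        rng.lognormal(mean=7.8, sigma=0.7, size=200),
        rng.lognormal(mean=8.0, sigma=0.7, size=200),
    ]),
})

# Mean vs. median per group: a mean well above the median signals right skew.
summary = toy.groupby("credit_risk")["amount"].agg(["mean", "median", "var"])
summary["mean_minus_median"] = summary["mean"] - summary["median"]
print(summary)
```
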
Further studies, before modeling, could consist of carrying out a Shapiro-Wilk normality test on each variable and, depending on its result, applying parametric or non-parametric tests to look for significant differences between the groups in each case.
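
One possible way to organise that workflow is sketched below with SciPy. The samples are synthetic, the significance level of 0.05 is an assumption, and the choice of Welch's t-test and the Mann-Whitney U test as the parametric and non-parametric options is illustrative rather than something the report prescribes.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic skewed samples standing in for one numeric predictor in each outcome group.
good = rng.lognormal(mean=7.8, sigma=0.7, size=300)
bad = rng.lognormal(mean=8.1, sigma=0.7, size=300)

alpha = 0.05  # assumed significance level

# Shapiro-Wilk: the null hypothesis is that the sample is normally distributed.
p_norm_good = stats.shapiro(good).pvalue
p_norm_bad = stats.shapiro(bad).pvalue
both_normal = p_norm_good > alpha and p_norm_bad > alpha

if both_normal:
    # Welch's t-test does not assume equal variances.
    test_name = "Welch t-test"
    p_value = stats.ttest_ind(good, bad, equal_var=False).pvalue
else:
    # Mann-Whitney U is a rank-based alternative that does not assume normality.
    test_name = "Mann-Whitney U"
    p_value = stats.mannwhitneyu(good, bad, alternative="two-sided").pvalue

print(test_name, p_value)
```
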
Graph 4 shows Pearson’s correlation coefficients between the quantitative variables. There is a moderate correlation between the variables “Amount” and “Duration”; however, the value is not high enough to be a concern at this point.
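
A correlation matrix of this kind can be computed as in the following sketch. The data are simulated so that `amount` grows with `duration`, and the 0.8 cut-off for flagging a pair is an assumed value, not one taken from the report.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# Synthetic numeric predictors; "amount" is built to grow with "duration".
duration = rng.integers(6, 60, size=500).astype(float)
amount = 150.0 * duration + rng.normal(0.0, 2500.0, size=500)
age = rng.integers(19, 75, size=500).astype(float)
numeric = pd.DataFrame({"duration": duration, "amount": amount, "age": age})

corr = numeric.corr(method="pearson")
print(corr.round(2))

# List predictor pairs whose absolute correlation exceeds a chosen threshold.
threshold = 0.8  # assumed cut-off, not a value from the report
high_pairs = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
```
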
Status variable
Graph 5.1 shows the distribution of the variable “Status” by category, while graph 5.2 presents the distribution split by the response variable groups. The “Status” categories are reasonably well balanced; there are no signs of imbalance between groups.
Credit history
Graph 6.1 shows the distribution of the variable “Credit history” by category, while graph 6.2 presents the distribution split by the response variable groups. The “Credit history” categories are imbalanced, so it would make sense to remove categories A31, A32 and A33, or to merge them into a single category. Having few observations in a category could lead to overfitting during modeling.
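
Merging sparse categories into a single level can be sketched as follows. The helper `lump_rare_levels`, the 5% threshold and the level names `L1` to `L5` are all hypothetical and chosen only for illustration; they do not correspond to the actual category codes or counts.

```python
import pandas as pd

def lump_rare_levels(values: pd.Series, min_share: float = 0.05, other: str = "Other") -> pd.Series:
    """Replace levels whose relative frequency is below min_share with a single label."""
    shares = values.value_counts(normalize=True)
    rare = shares[shares < min_share].index
    # Keep values that are not rare; replace the rare ones with the catch-all label.
    return values.where(~values.isin(rare), other)

# Toy category levels; the counts are made up for illustration.
history = pd.Series(["L1"] * 60 + ["L2"] * 30 + ["L3"] * 4 + ["L4"] * 3 + ["L5"] * 3)
lumped = lump_rare_levels(history, min_share=0.05)
print(lumped.value_counts())
```

The same helper could be applied to any of the categorical predictors discussed in this section, with a threshold chosen to suit the sample size.
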
Purpose
Graph 7.1 shows the distribution of the variable “Purpose” by category, while graph 7.2 presents the distribution split by the response variable groups. The “Purpose” categories are imbalanced, so it would make sense to remove categories A45, A410, A44 and A48, or to merge them into a single category. Having few observations in a category could lead to overfitting during modeling.
Saving account Bond
Graph 8.1 shows the distribution of the variable “Saving account Bond” by category, while graph 8.2 presents the distribution split by the response variable groups. The “Saving account/bond” categories are imbalanced, so it would make sense to remove categories A62, A63 and A64, or to merge them into a single category. Having few observations in a category could lead to overfitting during modeling.
Employment duration
Graph 9.1 shows the distribution of the variable “Employment duration” by category, while graph 9.2 presents the distribution split by the response variable groups. The “Employment duration” categories are imbalanced, so it would make sense to remove the last category, “A71”, or to merge it with another category. Having few observations in a category could lead to overfitting during modeling.
Installment rate
Graph 10.1 shows the distribution of the variable “Installment rate” by category, while graph 10.2 presents the distribution split by the response variable groups. The “Installment rate” categories are well balanced.
Personal status and sex
Graph 11.1 shows the distribution of the variable “Personal status and sex” by category, while graph 11.2 presents the distribution split by the response variable groups. The “Personal status and sex” categories are imbalanced, so it would make sense to remove categories A94 and A91, or to merge them into a single category. Having few observations in a category could lead to overfitting during modeling.
Other debtors
Graph 12.1 shows the distribution of the variable “Other debtors” by category, while graph 12.2 presents the distribution split by the response variable groups. The “Other debtors” categories are extremely imbalanced: categories A103 and A104 have very few observations, so it is highly recommended to remove them to avoid overfitting issues during modeling.
Present residence
Graph 13.1 shows the distribution of the variable “Present residence” by category, while graph 13.2 presents the distribution split by the response variable groups. The “Present residence” categories are well balanced.
Property
Graph 14.1 shows the distribution of the variable “Property” by category, while graph 14.2 presents the distribution split by the response variable groups. The “Property” categories are well balanced.
Other installment plans
Graph 15.1 shows the distribution of the variable “Other installment plans” by category, while graph 15.2 presents the distribution split by the response variable groups. The “Other installment plans” categories are extremely imbalanced: categories A141 and A142 have very few observations, so it is highly recommended to remove them to avoid overfitting issues during modeling.
Housing
Graph 16.1 shows the distribution of the variable “Housing” by category, while graph 16.2 presents the distribution split by the response variable groups. The “Housing” categories are extremely imbalanced: categories A151 and A153 have very few observations, so it is highly recommended to remove them to avoid overfitting issues during modeling.
Number Credits
Graph 17.1 shows the distribution of the variable “Number Credits” by category, while graph 17.2 presents the distribution split by the response variable groups. The “Number Credits” categories are extremely imbalanced: categories 3 and 4 have very few observations, so it is highly recommended to remove them or, if possible, to merge them with another category to avoid overfitting issues during modeling.
Job
Graph 18.1 shows the distribution of the variable “Job” by category, while graph 18.2 presents the distribution split by the response variable groups. The “Job” categories are well balanced overall, but category A171 has very few observations, so it is highly recommended to remove it or, if possible, to merge it with another category to avoid overfitting issues during modeling.
People liable
Graph 19.1 shows the distribution of the variable “People liable” by category, while graph 19.2 presents the distribution split by the response variable groups. The “People liable” categories are imbalanced: category 2 has very few observations, so it is highly recommended to remove it or, if possible, to merge it with category 1 to avoid overfitting issues during modeling.
Telephone
Graph 20.1 shows the distribution of the variable “Telephone” by category, while graph 20.2 presents the distribution split by the response variable groups. The “Telephone” categories are well balanced.
Foreign worker
Graph 21.1 shows the distribution of the variable “Foreign worker” by category, while graph 21.2 presents the distribution split by the response variable groups. The “Foreign worker” categories are imbalanced: category A202 has very few observations, so it is highly recommended to remove it or, if possible, to merge it with category A201 to avoid overfitting issues during modeling.
For the analysis of categorical variables, both among predictors and between each predictor and the response, a series of chi-square tests was carried out to measure the significance of the association between variables. When contrasting the response variable with the categorical independent variables, we are interested in rejecting the null hypothesis of independence, so we want p-values below our significance level. At first glance, the variables “Installment rate”, “Present residence”, “Number credits”, “Job”, “People Liable” and “Telephone” have p-values above the significance level, so they do not appear to be associated with the response variable.
When comparing categorical predictors with each other, we would like them to be independent, which p-values above our significance level would be consistent with. There are many pairs of categorical predictors for which the test rejects the hypothesis of independence.
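
A chi-square test of independence on a pair of categorical variables might look like the sketch below. The counts are made up, the column names are assumptions, and the significance level of 0.05 is assumed because the report does not state the one it used.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data: made-up counts for one categorical predictor against the response.
toy = pd.DataFrame({
    "status": ["A11"] * 100 + ["A14"] * 100,
    "credit_risk": ["good"] * 50 + ["bad"] * 50 + ["good"] * 90 + ["bad"] * 10,
})

# Contingency table of observed counts.
table = pd.crosstab(toy["status"], toy["credit_risk"])

# The null hypothesis is that the two variables are independent.
chi2, p_value, dof, expected = chi2_contingency(table)

alpha = 0.05  # assumed significance level
reject_independence = p_value < alpha
print(f"chi2={chi2:.2f}, dof={dof}, p={p_value:.3g}, reject={reject_independence}")
```

The same call works for predictor-versus-predictor pairs; only the interpretation of a small p-value changes, as described above.
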
Further studies may include measuring the strength of these associations, for example with Cramér’s V.
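
As a rough illustration, Cramér's V can be derived from the chi-square statistic as $V = \sqrt{\chi^2 / (n \cdot (\min(r, c) - 1))}$ for an $r \times c$ table with $n$ observations. The helper name `cramers_v` and the example tables are hypothetical, and this sketch applies no bias correction.

```python
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(observed) -> float:
    """Cramér's V for an r x c table of counts: sqrt(chi2 / (n * (min(r, c) - 1)))."""
    observed = np.asarray(observed, dtype=float)
    # correction=False: use the plain Pearson chi-square statistic (no Yates correction).
    chi2 = chi2_contingency(observed, correction=False)[0]
    n = observed.sum()
    k = min(observed.shape) - 1
    return float(np.sqrt(chi2 / (n * k)))

# Made-up counts for illustration.
independent = [[20, 30], [40, 60]]   # rows are proportional, so V is 0
perfect = [[50, 0], [0, 50]]         # each row maps to one column, so V is 1
print(cramers_v(independent), cramers_v(perfect))
```
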
The “German credit data” does not contain any missing values in its columns. All numerical predictors seem to be associated with the categorical response, but the significance of these associations must be measured with non-parametric tests because the variables are not normally distributed. There is a moderate correlation between the variables “Amount” and “Duration”. It is important to note the presence of outliers in each of the quantitative predictors; these values affect the normality (Gaussian symmetry) of the data in each case. Depending on the model we select, non-normally distributed data can affect its accuracy, so it will be important to consider removing these points from the data set.
For the categorical variables, several have imbalanced categories that might lead to overfitting during modeling. As mentioned before, it is important to consider removing the sparse categories or merging them with others so that frequencies are better balanced. An overfitted model has inflated variance, which reduces accuracy on future predictions.
From a statistical point of view, we see significant evidence of association between several categorical predictors and the categorical response, although not for “Installment rate”, “Present Residence”, “Number Credits”, “Job”, “People Liable” and “Telephone”, whose p-values were above the significance level. Again, these findings must be contrasted with another statistical test in order to measure the strength of the associations.
Finally, it is important to keep all these ideas in mind while modeling. Measurements such as the Variance Inflation Factor (VIF) can help identify predictors that are strongly collinear with the others. Moreover, it is important to consider possible interactions between predictors in order to improve model accuracy.
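
For reference, the VIF of a predictor is $1 / (1 - R_j^2)$, where $R_j^2$ is obtained by regressing that predictor on all the others, so it measures collinearity among predictors. The sketch below computes it with NumPy only; the function name `variance_inflation_factors` is hypothetical and the data are simulated, not taken from the German credit data.

```python
import numpy as np

def variance_inflation_factors(X: np.ndarray) -> np.ndarray:
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j on the others."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = np.empty(p)
    for j in range(p):
        y = X[:, j]
        # Design matrix: intercept plus every column except column j.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        residuals = y - others @ coef
        r2 = 1.0 - residuals.var() / y.var()
        vifs[j] = 1.0 / (1.0 - r2)
    return vifs

rng = np.random.default_rng(3)
duration = rng.normal(20.0, 10.0, size=500)
amount = 150.0 * duration + rng.normal(0.0, 1500.0, size=500)  # built to correlate with duration
age = rng.normal(35.0, 10.0, size=500)                          # unrelated to the others
vifs = variance_inflation_factors(np.column_stack([duration, amount, age]))
print(dict(zip(["duration", "amount", "age"], vifs.round(2))))
```

Values close to 1 indicate little collinearity; which larger value should count as problematic is a judgment call that depends on the model.
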