Descriptive Statistics Report 4
Bivariate Analysis
Introduction
This is our first lab in which we consider two dimensions: instead of calculating univariate statistics by groups (or factors) of another variable, we will measure their joint relationship based on covariance and correlation coefficients.
*Please be very careful when choosing the measure of correlation! If the variables are measured on different scales, we have to recode one of them into the weaker scale.
It would be nice to add some additional plots in the background. Feel free to add your own sections and use external packages.
Data
This time we are going to use a typical credit scoring dataset with a predefined “default” variable and personal demographic and income data. Please take a closer look at the headers and descriptions of each variable.
Scatterplots
First, let’s visualize our quantitative relationships using scatterplots. For example, a scatterplot visualising the relationship between age and income:
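A minimal sketch of such a plot with ggplot2, assuming the data frame is called bank and has columns age and income (names suggested by the output later in the report):

```r
library(ggplot2)

# Basic scatterplot of income against age
ggplot(bank, aes(x = age, y = income)) +
  geom_point() +
  labs(x = "Age", y = "Income")
```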
Or a scatterplot visualising the relationship between age and income that also takes into account whether the person defaulted, with the point size changing with the default variable.
Or a scatterplot visualising the relationship between education level and income that takes the default status into account, with the point shape changing with the default variable.
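Sketches of these two variants under the same assumptions (the education column name ed is hypothetical, and default is assumed to be a 0/1 variable converted to a factor for plotting):

```r
# Age vs. income, with point size mapped to default status
ggplot(bank, aes(x = age, y = income, size = factor(default))) +
  geom_point(alpha = 0.5)

# Education level vs. income, with point shape mapped to default status
ggplot(bank, aes(x = ed, y = income, shape = factor(default))) +
  geom_point()
```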
You can also normalize the skewed distribution of incomes using log:
We can add an estimated linear regression line:
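For example, a sketch that also creates the log-income variable used later (the name logincome matches the variable that appears in the partial-correlation output further below):

```r
# Log-transform the skewed income variable
bank$logincome <- log(bank$income)

# Scatterplot of log(income) against age with a fitted linear regression line
ggplot(bank, aes(x = age, y = logincome)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
```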
INTERPRETATION:
We can observe a positive relationship between (log) income and age, which means that as age increases, so does income.
Scatterplots by groups
We can finally see if there are any differences between risk statuses:
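One way to do this is to colour the points by default status (a sketch under the same assumptions as above):

```r
# Scatterplot of log(income) against age, coloured by default status
ggplot(bank, aes(x = age, y = logincome, colour = factor(default))) +
  geom_point(alpha = 0.7) +
  labs(colour = "default")
```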
INTERPRETATION:
We can see that overall there are more people in our sample who have not defaulted.
INTERPRETATION:
Now we can see that not having defaulted is associated with higher income for people aged (about) 21 and older. For people in their early twenties, those with higher income are more likely to have defaulted.
We can also take a closer look at whether there are any differences between those two distributions by adding their estimated density plots:
We can also put those plots together:
Scatterplots with density curves
We can also take a closer look at whether there are any differences between those two distributions by adding their estimated density plots:
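One way to obtain such a plot is ggMarginal() from the ggExtra package; this is an assumption, as the original code is not shown, and separate geom_density() panels by group would work equally well:

```r
library(ggExtra)

p <- ggplot(bank, aes(x = age, y = logincome, colour = factor(default))) +
  geom_point(alpha = 0.7)

# Add estimated marginal density curves for each group along both axes
ggMarginal(p, type = "density", groupColour = TRUE, groupFill = TRUE)
```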
INTERPRETATION:
From the density plot of age we can see that in the age groups 20-31 and 51+ there are relatively more people who defaulted than who did not. From the density plot of the logarithm of income we see that there are more people with log income between 3.4 and 5.3 who did not default.
Correlation coefficients - Pearson’s linear correlation
OK, let’s move on to some calculations. In R we can use the cor() function, which takes three arguments: the two variables and the method, i.e. cor(x, y, method). For two quantitative variables, with all assumptions met, we can calculate the simple Pearson coefficient of linear correlation.
First, let’s take into account the age and log(income) variables:
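A sketch of the call (assuming the bank data frame and column names used throughout this report):

```r
# Pearson's linear correlation between age and log(income)
cor(bank$age, log(bank$income), method = "pearson")
```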
## [1] 0.574346
Pearson’s coefficient is equal to 0.574346.
INTERPRETATION:
The correlation between (log) income and age is positive and moderately strong.
Ok, what about the percentage of the explained variability?
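It is simply the squared Pearson coefficient (the coefficient of determination); a sketch:

```r
# Share of log(income) variability explained by age
cor(bank$age, log(bank$income))^2
```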
## [1] 0.3298734
The percentage of explained variability is around 33%.
INTERPRETATION:
So, as we can see, almost 33% of the total variability of log income is explained by differences in age. The rest (about 67%) is presumably explained by other factors.
Now let’s check the correlation between the variables employ (years spent with the current employer) and log(income):
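A sketch of the call (the column name employ is taken from the output further below):

```r
# Pearson's correlation between years with the current employer and log(income)
cor(bank$employ, log(bank$income))
```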
## [1] 0.7239081
Pearson’s coefficient is equal to 0.7239081.
INTERPRETATION:
The correlation between (log) income and years spent with the current employer is also positive, and stronger than the correlation between age and income.
## [1] 0.5240429
The percentage of explained variability is around 52%.
INTERPRETATION:
So, as we can see, almost 52% of the total variability of log income is explained by differences in years spent with the current employer. The rest (about 48%) is presumably explained by other factors.
Partial and semipartial correlation
The partial and semi-partial (also known as part) correlations are used to express the specific portion of variance explained by eliminating the effect of other variables when assessing the correlation between two variables.
Partial correlation holds constant one variable when computing the relations to others. Suppose we want to know the correlation between X and Y holding Z constant for both X and Y. That would be the partial correlation between X and Y controlling for Z.
Semipartial correlation holds Z constant for either X or Y, but not both, so if we wanted to control X for Z, we could compute the semipartial correlation between X and Y holding Z constant for X.
Suppose we want to know the correlation between the log of income and age controlling for years of employment. How highly correlated are these after controlling for tenure?
**There can be more than one control variable.
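The estimate/p.value/statistic output below matches the format of the ppcor package, so here is a minimal sketch under that assumption (logincome is the log-transformed income created earlier):

```r
library(ppcor)

# Matrix of pairwise partial correlations for the three variables
pcor(bank[, c("logincome", "age", "employ")])$estimate

# Partial correlation between log(income) and age, controlling for employ
pcor.test(bank$logincome, bank$age, bank$employ, method = "pearson")

# More than one control variable: pass them as a data frame
pcor.test(bank$logincome, bank$age, bank[, c("employ", "address")])
```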
PARTIAL CORRELATION BETWEEN INCOME AND AGE CONTROLLING EMPLOY:
## logincome age employ
## logincome 1.0000000 0.3194263 0.6018467
## age 0.3194263 1.0000000 0.2137607
## employ 0.6018467 0.2137607 1.0000000
## estimate p.value statistic n gp Method
## 1 0.3194263 4.805085e-18 8.899323 700 1 pearson
INTERPRETATION:
The estimate column shows the partial correlation coefficient. We should look at the age row and the logincome column (or the logincome row and the age column) to find the coefficient of correlation between those two while controlling for employ (years spent with the current employer). In our case it is equal to 0.3194263. Age and income are still positively correlated with each other, with moderate strength, after the effect of the employ variable is removed.
Now we can compute the correlation of income and age controlling for the employ and address variables:
PARTIAL CORRELATION BETWEEN INCOME AND AGE CONTROLLING EMPLOY AND ADDRESS:
## logincome age employ address
## logincome 1.00000000 0.2524574 0.60220579 0.04682541
## age 0.25245738 1.0000000 0.19794652 0.50197727
## employ 0.60220579 0.1979465 1.00000000 -0.02616105
## address 0.04682541 0.5019773 -0.02616105 1.00000000
## estimate p.value statistic n gp Method
## 1 0.2524574 1.307404e-11 6.883245 700 2 pearson
INTERPRETATION:
Now we can see that the partial correlation in this case is equal to 0.2524574, which is smaller than before. We can say that income and age are still positively correlated, but this correlation is slightly weaker once the effects of the employ and address variables are removed. This means that the address variable also has some influence on the relationship between age and income.
What is the difference between this and the semipartial coefficient? In the semipartial correlation coefficient we control the third variable for only one of the two variables in the relationship we examine.
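A sketch with spcor() and spcor.test() from the same ppcor package (again an assumption based on the output format; which variable of the pair gets residualized depends on the argument order, see ?spcor.test):

```r
library(ppcor)

# Matrix of semipartial correlations for the three variables
spcor(bank[, c("logincome", "age", "employ")])$estimate

# Semipartial correlation between log(income) and age, with employ
# partialled out of only one variable of the pair
spcor.test(bank$logincome, bank$age, bank$employ)
spcor.test(bank$age, bank$logincome, bank$employ)
```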
SEMIPARTIAL CORRELATION BETWEEN INCOME AND AGE CONTROLLING EMPLOY:
## logincome age employ
## logincome 1.0000000 0.2203711 0.4926793
## age 0.2695646 1.0000000 0.1749872
## employ 0.5078999 0.1474728 1.0000000
## [1] "Controlling employ variable for log(income): "
## estimate p.value statistic n gp Method
## 1 0.2695646 4.205961e-13 7.390282 700 1 pearson
## [1] "Controlling employ variable for age: "
## estimate p.value statistic n gp Method
## 1 0.2203711 3.899134e-09 5.964597 700 1 pearson
INTERPRETATION:
We calculated the semipartial correlation between the age and income variables controlling for employ. The coefficient with employ controlled for log(income) is equal to 0.2695646, and with employ controlled for age it is 0.2203711. We can see that the employ variable has more influence on the age variable.
SEMIPARTIAL CORRELATION BETWEEN INCOME AND AGE CONTROLLING EMPLOY AND ADDRESS:
## logincome age employ address
## logincome 1.00000000 0.1703829 0.49259951 0.03061228
## age 0.18045429 1.0000000 0.13967095 0.40142684
## employ 0.50820124 0.1360525 1.00000000 -0.01763125
## address 0.03754455 0.4648539 -0.02096006 1.00000000
## [1] "Controlling employ and address variable for log(income): "
## estimate p.value statistic n gp Method
## 1 0.1804543 1.599e-06 4.840171 700 2 pearson
## [1] "Controlling employ and address variable for age: "
## estimate p.value statistic n gp Method
## 1 0.1703829 5.995901e-06 4.561712 700 2 pearson
INTERPRETATION:
The semipartial correlation coefficient for age and income with employ and address controlled for income is 0.1804543, and with them controlled for age it equals 0.1703829. This means that the address and employ variables have more influence on the age variable (just as in the case when only employ was controlled).
We can also check the influence of the address variable on its own, to see whether it influences the income variable more:
SEMIPARTIAL CORRELATION BETWEEN INCOME AND AGE CONTROLLING ADDRESS:
## logincome age address
## logincome 1.00000000 0.4414656 0.03187095
## age 0.38080799 1.0000000 0.41503742
## address 0.03121644 0.4712663 1.00000000
## [1] "Controlling address variable for log(income): "
## estimate p.value statistic n gp Method
## 1 0.380808 1.525655e-25 10.87284 700 1 pearson
## [1] "Controlling address variable for age: "
## estimate p.value statistic n gp Method
## 1 0.4414656 1.054223e-34 12.98931 700 1 pearson
INTERPRETATION:
The semipartial correlation between the age and income variables controlling for address takes higher values than when employ was controlled, so address removes less of the shared variability; it has less influence on the age-income relationship than employ. The coefficient with address controlled for log(income) is equal to 0.380808 and with address controlled for age it is 0.4414656. As previously with the employ variable controlled, this suggests that the control variable has more influence on the age variable.
Rank correlation
For two different scales - like, for example, this pair of variables: income vs. education level - we cannot use Pearson’s coefficient. The only possibility is to rank the incomes as well… and lose some of the more detailed information about them.
First, let’s see boxplots of income by education levels.
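A sketch of the boxplots (the education column name ed is hypothetical):

```r
# Boxplots of income by education level
ggplot(bank, aes(x = factor(ed), y = income)) +
  geom_boxplot() +
  labs(x = "Education level", y = "Income")
```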
Now, let’s see Kendall’s coefficient of rank correlation (robust to ties) and how it differs from Spearman’s coefficient.
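A sketch of both calculations (again assuming the hypothetical education column ed):

```r
# Kendall's tau and Spearman's rho between education level and income
cor(bank$ed, bank$income, method = "kendall")
cor(bank$ed, bank$income, method = "spearman")
```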
## [1] "Correlation between education level and income:"
## [1] "Kendal's coefficient:"
## [1] 0.1577567
## [1] "Spearman's coefficient:"
## [1] 0.2020778
INTERPRETATION:
Both coefficient values tell us that the correlation is positive (as the education level gets higher, income increases). Kendall’s coefficient indicates a weaker correlation than Spearman’s because of the different way it is calculated. Also, Spearman’s method is more sensitive to outliers, and those are present, as we can see in the boxplot above.
Point-biserial correlation
Let’s try to verify if there is a significant relationship between incomes and risk status. First, let’s take a look at the boxplot:
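For example (same assumptions as before):

```r
# Boxplot of income by risk (default) status
ggplot(bank, aes(x = factor(default), y = income)) +
  geom_boxplot() +
  labs(x = "Default status", y = "Income")
```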
If you would like to compare one quantitative variable (income) and one dichotomous variable (default status, binary), then you can use the point-biserial coefficient:
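A sketch of the two tests whose output is printed below; note that the second one is Pearson’s product-moment correlation, which for a dichotomous variable is exactly the point-biserial coefficient:

```r
# Kendall's tau between income and the binary default variable
cor.test(bank$income, bank$default, method = "kendall")

# Point-biserial correlation: Pearson's correlation with a 0/1 variable
cor.test(bank$income, bank$default)
```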
## [1] "Kendal's coefficient: "
##
## Kendall's rank correlation tau
##
## data: bank$income and bank$default
## z = -4.032, p-value = 5.53e-05
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
## tau
## -0.1256213
## [1] "Spearman's coefficient"
##
## Pearson's product-moment correlation
##
## data: bank$income and bank$default
## t = -1.8797, df = 698, p-value = 0.06056
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.144313546 0.003149751
## sample estimates:
## cor
## -0.07096966
INTERPRETATION:
Both coefficient values indicate a negative but weak correlation between the default and income variables. That means higher income is associated with a lower likelihood of having defaulted, and vice versa.
Nonlinear correlation - eta coefficient
If you would like to check whether there are any nonlinearities between two variables, the only possibility (besides transformations and linear analysis) is to calculate the “eta” coefficient and compare it with Pearson’s linear coefficient.
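The report does not show how eta was obtained; one possible sketch is the correlation ratio computed from conditional means (the eta_coef() helper below is hypothetical, and the exact values depend on how a continuous predictor is binned):

```r
# Correlation ratio ("eta") of y given a grouping variable g
eta_coef <- function(g, y) {
  m_g <- tapply(y, g, mean)    # conditional means of y in each group
  n_g <- tapply(y, g, length)  # group sizes
  ss_between <- sum(n_g * (m_g - mean(y))^2, na.rm = TRUE)
  ss_total   <- sum((y - mean(y))^2)
  sqrt(ss_between / ss_total)
}

# default is already categorical; a continuous predictor such as age is binned first
eta_coef(bank$default, bank$income)
eta_coef(cut(bank$age, breaks = 10), bank$income)
```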
## [1] "Correlation between default and income variables: "
## [1] "eta: "
## [1] 0.1352258
## [1] "Pearson's coefficient: "
## [1] -0.1352258
## [1] "Correlation between age and income variables: "
## [1] "eta: "
## [1] 0.6230619
## [1] "Pearson's coefficient: "
## [1] 0.574346
## [1] "Correlation between employ and income variables: "
## [1] "eta: "
## [1] 0.7387405
## [1] "Pearson's coefficient: "
## [1] 0.7239081
INTERPRETATION:
The eta coefficient measures only the strength of a relationship, while Pearson’s coefficient gives us information about both the strength and the direction of the linear relationship. Since for the correlation between the default and income variables both values have the same magnitude, and Pearson’s coefficient is negative, the relationship is linear and negative. The other two correlations are positive but less linear. The correlation between the age and income variables is the least linear, because the values differ the most (we compare absolute values). The correlation between the employ and income variables is the strongest, and the correlation between the default and income variables is the weakest.
Correlation matrix
We can also prepare the correlation matrix for all quantitative variables stored in our data frame.
We can use the ggcorr() function:
Without grouping by default variable:
With grouping by default variable:
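A sketch with GGally (the “grouping” here is assumed to mean including or excluding the binary default column in the matrix, as the remark below suggests; column selection follows the variables used above):

```r
library(GGally)

num_vars <- c("age", "income", "employ", "address", "debtinc", "creddebt", "othdebt")

# Without the default variable
ggcorr(bank[, num_vars], label = TRUE, label_round = 2)

# With the binary default variable included
ggcorr(bank[, c(num_vars, "default")], label = TRUE, label_round = 2)
```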
As you can see, the default correlation matrix is not the best idea for all measurement scales (including the binary variable “default”).
That’s why we can now perform our bivariate analysis with ggpairs() with grouping.
Correlation matrix with scatterplots
Here is what we are about to calculate:
- the correlation matrix between the age, log_income, employ, address, debtinc, creddebt, and othdebt variables, grouped by whether the person has defaulted or not,
- the distribution of each variable by group,
- the scatterplot with the trend by group.
Without faceting by default variable:
With faceting by default variable:
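A sketch of the ggpairs() call (column names as listed above; logincome is the log-transformed income created earlier, and mapping the default status to colour is one way to separate the groups - whether the original also used facets is not shown):

```r
library(GGally)

# Pairwise correlations, per-variable distributions and scatterplots,
# grouped by default status
ggpairs(bank,
        columns = c("age", "logincome", "employ", "address",
                    "debtinc", "creddebt", "othdebt"),
        mapping = aes(colour = factor(default)))
```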
Qualitative data
In the case of two variables measured on nominal (or ordinal and nominal) scales, we are forced to organize a so-called contingency table with frequencies and calculate some kind of correlation coefficient based on it. This is the so-called contingency analysis.
Let’s consider one example based on our data: verify whether there is any significant association between education level and credit risk.
We will use the contingency table of the education level and default variables, Pearson’s chi-squared test and Cramér’s V coefficient:
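A sketch of these steps (the education column name ed is hypothetical; the last block of output below matches the format of vcd::assocstats()):

```r
# Contingency table of education level and default status
contingency_table <- table(bank$ed, bank$default)
contingency_table

# Pearson's chi-squared test of independence
chisq.test(contingency_table)

# Proportions of the whole sample
prop.table(contingency_table)

# Cramer's V and related association measures
library(vcd)
assocstats(contingency_table)
```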
##
## 0 1
## 1 293 79
## 2 139 59
## 3 57 30
## 4 24 14
## 5 4 1
##
## Pearson's Chi-squared test
##
## data: contingency_table
## X-squared = 11.492, df = 4, p-value = 0.02155
##
## 0 1
## 1 0.418571429 0.112857143
## 2 0.198571429 0.084285714
## 3 0.081428571 0.042857143
## 4 0.034285714 0.020000000
## 5 0.005714286 0.001428571
## [1] "Cramer's V: "
## [1] 0.1281313
## X^2 df P(> X^2)
## Likelihood Ratio 11.332 4 0.023077
## Pearson 11.492 4 0.021554
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.127
## Cramer's V : 0.128
INTERPRETATION:
From the chi-squared statistic (p-value of about 0.022) we can say that an association between those two variables exists; they are not independent of each other. But from Cramér’s V coefficient and the contingency coefficient we can tell that the association is weak, so the variables do not influence each other much.
Exercise 1. Contingency analysis.
Do you believe in the Afterlife? https://nationalpost.com/news/canada/millennials-do-you-believe-in-life-after-life A survey was conducted and a random sample of 1091 questionnaires is given in the form of the following contingency table:
## Believe
## Gender Yes No
## Female 435 375
## Male 147 134
Our task is to check whether there is a significant relationship between belief in the afterlife and gender. We can perform this procedure with the simple chi-squared statistic and a chosen qualitative correlation coefficient (a two-way 2x2 table).
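A sketch of the calculation (the object name dane matches the output below):

```r
# 2x2 contingency table from the survey
dane <- matrix(c(435, 375,
                 147, 134),
               nrow = 2, byrow = TRUE,
               dimnames = list(Gender  = c("Female", "Male"),
                               Believe = c("Yes", "No")))

# Chi-squared test (Yates' continuity correction is applied by default for 2x2 tables)
chisq.test(dane)

# Proportions of the whole sample
prop.table(dane)
```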
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: dane
## X-squared = 0.11103, df = 1, p-value = 0.739
## Believe
## Gender Yes No
## Female 0.3987168 0.3437214
## Male 0.1347388 0.1228231
As you can see, we can calculate the chi-squared statistic really quickly for two-way tables of any size. Now we can standardize this contingency measure to assess the strength of the relationship:
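One way to obtain a value like the one printed below is the phi coefficient, which for a 2x2 table equals Cramér’s V (computed here from the uncorrected chi-squared statistic; this is an assumption about how the original value was produced):

```r
# Phi / Cramer's V for a 2x2 table: sqrt(chi^2 / n)
chi2 <- unname(chisq.test(dane, correct = FALSE)$statistic)
sqrt(chi2 / sum(dane))
```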
## [1] 0.01218871
Exercise 2. Contingency analysis for the ‘Titanic’ data.
Let’s consider the titanic dataset, which contains a complete list of passengers and crew members on the RMS Titanic. It includes a variable indicating whether a person survived the sinking of the RMS Titanic on April 15, 1912. The data frame contains 2456 observations on 14 variables.
The website http://www.encyclopedia-titanica.org/ offers detailed information about passengers and crew members on the RMS Titanic. According to the website, 1317 passengers and 890 crew members were aboard.
Eight musicians and nine employees of the shipyard company are listed as passengers, but travelled with a free ticket, which is why they have NA values in fare. In addition to that, fare is truly missing for a few regular passengers.
Answer
The answer to Exercise 2 and some further analysis of the problems above are available in my repository on GitHub, in the file “bivariate analisys”.