Simpsons’ paradox

Simpson paradox

If one interprets data at aggregate level , results may be interpretted differently and when same data are evaluated at disaggregated level, one may reach at a different results. This is mainly due to some variable missing in the model. Therefore, care should always be observed while giving causal interpretation to a coefficient in econometric model.

This is data used from Introduction to Statistical Learning built in R-package ISLR. Using Credit data , we shall explore relationship between debt, income and credit limit. For more details , moderndive.com which is a book in econometric using R can be consulted. We shall select only variables of our interest from this credit data.

library(ISLR)
library(tidyverse)

## -- Attaching packages ------------------------------------------- tidyverse 1.2.1 --

## v ggplot2 3.2.1     v purrr   0.3.2
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   0.8.3     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0

## -- Conflicts ---------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

credit_ch6 <- Credit %>% as_tibble() %>% 
  select(ID, debt = Balance, credit_limit = Limit,
         income = Income, credit_rating = Rating, age = Age)

One can look at raw data for selected variables using View or glimpse

glimpse(credit_ch6)

## Observations: 400
## Variables: 6
## $ ID            <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16...
## $ debt          <int> 333, 903, 580, 964, 331, 1151, 203, 872, 279, 1350, 1...
## $ credit_limit  <int> 3606, 6645, 7075, 9504, 4897, 8047, 3388, 7114, 3300,...
## $ income        <dbl> 14.891, 106.025, 104.593, 148.924, 55.882, 80.180, 20...
## $ credit_rating <int> 283, 483, 514, 681, 357, 569, 259, 512, 266, 491, 589...
## $ age           <int> 34, 82, 71, 36, 68, 77, 37, 87, 66, 41, 30, 64, 57, 4...

Exploratory Data Analysis

Let’s have a look at 5 values selected randomly out of 400. Note down that due to random sampling you may have different values

credit_ch6 %>% sample_n(5)

## # A tibble: 5 x 6
##      ID  debt credit_limit income credit_rating   age
##   <int> <int>        <int>  <dbl>         <int> <int>
## 1   254   218         5182   85.4           402    60
## 2     6  1151         8047   80.2           569    77
## 3     4   964         9504  149.            681    36
## 4   167     0         2880   35.7           214    35
## 5   373   840         3782   19.8           293    46

To further explore data. lets have a summary statistics using skimr package

## 
## Attaching package: 'skimr'

## The following object is masked from 'package:stats':
## 
##     filter

## Skim summary statistics
##  n obs: 400 
##  n variables: 3 
## 
## -- Variable type:integer -----------------------------------------------------------
##      variable missing complete   n    mean      sd  p0     p25    p50     p75
##  credit_limit       0      400 400 4735.6  2308.2  855 3088    4622.5 5872.75
##          debt       0      400 400  520.01  459.76   0   68.75  459.5  863   
##   p100     hist
##  13913 <U+2585><U+2587><U+2587><U+2583><U+2582><U+2581><U+2581><U+2581>
##   1999 <U+2587><U+2583><U+2583><U+2583><U+2582><U+2581><U+2581><U+2581>
## 
## -- Variable type:numeric -----------------------------------------------------------
##  variable missing complete   n  mean    sd    p0   p25   p50   p75   p100
##    income       0      400 400 45.22 35.24 10.35 21.01 33.12 57.47 186.63
##      hist
##  <U+2587><U+2583><U+2582><U+2581><U+2581><U+2581><U+2581><U+2581>

If one looks at the summary statistics, 25% of the values of outcome variable debt are less than $68.5 while than $68.5 while $$68.75 while the outcome variables are credit_limit and income. credit_limit: the mean and median credit card limit are $4735.6 and $4622.50, respectively, while 75% of card holders had incomes of $57,470 or less.

As our outcome variable debt and explanatory variables income and credit_limit are all numerical, therefore, we shall find correlation between pairs of variables among these variables.

credit_ch6 %>% 
  select(debt, credit_limit, income) %>% 
  cor()

##                   debt credit_limit    income
## debt         1.0000000    0.8616973 0.4636565
## credit_limit 0.8616973    1.0000000 0.7920883
## income       0.4636565    0.7920883 1.0000000

Correlation between explanatory variables is 0.792 which is quite high and given credit_limit or income, one can guess the other. So there seems high collinearity. But at the moment we dont discuss it.

Lest visualize the relationship between outcome variable debt and explanatory variables

ggplot(credit_ch6, aes(x = credit_limit, y = debt)) +
  geom_point() +
  labs(x = "Credit limit (in $)", y = "Credit card debt (in $)", 
       title = "Debt and credit limit") +
  geom_smooth(method = "lm", se = FALSE)

ggplot(credit_ch6, aes(x = income, y = debt)) +
  geom_point() +
  labs(x = "Income (in $1000)", y = "Credit card debt (in $)", 
       title = "Debt and income") +
  geom_smooth(method = "lm", se = FALSE)

Observe there is a positive relationship between credit limit and credit card debt: as credit limit increases so also does credit card debt. This is consistent with the strongly positive correlation coefficient of 0.862 we computed earlier.

From the graph one can see positive relationship between income and debt. Now we run multiple regress line

#Regression model
 debt_model <- lm(debt ~ credit_limit + income, data = credit_ch6)
 
 summary(debt_model)

## 
## Call:
## lm(formula = debt ~ credit_limit + income, data = credit_ch6)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -232.79 -115.45  -48.20   53.36  549.77 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -385.17926   19.46480  -19.79   <2e-16 ***
## credit_limit    0.26432    0.00588   44.95   <2e-16 ***
## income         -7.66332    0.38507  -19.90   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 165.5 on 397 degrees of freedom
## Multiple R-squared:  0.8711, Adjusted R-squared:  0.8705 
## F-statistic:  1342 on 2 and 397 DF,  p-value: < 2.2e-16

Observe that regression coefficient of income is now negative and for every $1000 income increase there is debt reduction of $7.66. This is in contrast to earlier correlation and graphical analysis which shows positive association between income and debt.

So when credit_limit variable is included there is negative relationship between income and debt . Now lets look at disaggregated analysis between credit_limit and outcome variable debt.

ggplot(credit_ch6, aes(x = credit_limit)) +
  geom_histogram(color = "white") +
  geom_vline(xintercept = quantile(credit_ch6$credit_limit, probs = c(0.25, 0.5, 0.75)), linetype = "dashed", size = 1) +
  labs(x = "Credit limit", title = "Credit limit and 4 credit limit brackets.")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

25% (100) persons have credit limits between 0 and 3308 (low credit limit). 25% have between 3308 and 4622 (medium credit limit), 25% have credit limit between 4662 and 5873 (medium hight credit limit) and 25% have credit limit abot $5873 (hig credit limit).

Disaggregated data visualization

Lets have a plot between income and debt for the 4 credit limit separately and find out whether relationship is still positive.

credit_ch6 <- credit_ch6 %>%
  mutate(limit_bracket = cut_number(credit_limit, 4)) %>%
  mutate(limit_bracket = fct_recode(limit_bracket,
    "low" =  "[855,3.09e+03]",
    "med-low" = "(3.09e+03,4.62e+03]",
    "med-high" = "(4.62e+03,5.87e+03]",
    "high" = "(5.87e+03,1.39e+04]"
  ))
head(credit_ch6)

## # A tibble: 6 x 7
##      ID  debt credit_limit income credit_rating   age limit_bracket
##   <int> <int>        <int>  <dbl>         <int> <int> <fct>        
## 1     1   333         3606   14.9           283    34 med-low      
## 2     2   903         6645  106.            483    82 high         
## 3     3   580         7075  105.            514    71 high         
## 4     4   964         9504  149.            681    36 high         
## 5     5   331         4897   55.9           357    68 med-high     
## 6     6  1151         8047   80.2           569    77 high

model3_balance_vs_income_plot <- ggplot(credit_ch6, aes(x = income, y = debt)) +
  geom_point() +
  labs(x = "Income (in $1000)", y = "Credit card debt (in $)",
       title = "Two scatterplots of credit card debt vs income") +
  geom_smooth(method = "lm", se = FALSE) +
  scale_y_continuous(limits = c(0, NA))

model3_balance_vs_income_plot_colored <- ggplot(credit_ch6,
                                                aes(x = income, y = debt,
                                                    col = limit_bracket)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Income (in $1000)", y = "Credit card debt (in $)",
       color = "Credit limit\nbracket") +
  scale_y_continuous(limits = c(0, NA)) +
  theme(axis.title.y = element_blank())

  model3_balance_vs_income_plot

  model3_balance_vs_income_plot_colored

## Warning: Removed 18 rows containing missing values (geom_smooth).

Aggregate relationship between debt and income is positive but when we plot debt and income relationship for 4 separate categories, we find that relationship for med-low and med-high is negative, it is flat for low category and is only somewhat positive for high category. credit_limit is a confounding variable and without including credit-limit, results will be misleading.

Simpsons’ paradox

Zahid Asghar

12/25/2019

Simpson paradox

Exploratory Data Analysis

Disaggregated data visualization