If one interprets data at aggregate level , results may be interpretted differently and when same data are evaluated at disaggregated level, one may reach at a different results. This is mainly due to some variable missing in the model. Therefore, care should always be observed while giving causal interpretation to a coefficient in econometric model.
This is data used from Introduction to Statistical Learning built in R-package ISLR. Using Credit data , we shall explore relationship between debt, income and credit limit. For more details , moderndive.com which is a book in econometric using R can be consulted. We shall select only variables of our interest from this credit data.
library(ISLR)
library(tidyverse)
## -- Attaching packages ------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.2.1 v purrr 0.3.2
## v tibble 2.1.3 v dplyr 0.8.3
## v tidyr 0.8.3 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts ---------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
credit_ch6 <- Credit %>% as_tibble() %>%
select(ID, debt = Balance, credit_limit = Limit,
income = Income, credit_rating = Rating, age = Age)
One can look at raw data for selected variables using View or glimpse
glimpse(credit_ch6)
## Observations: 400
## Variables: 6
## $ ID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16...
## $ debt <int> 333, 903, 580, 964, 331, 1151, 203, 872, 279, 1350, 1...
## $ credit_limit <int> 3606, 6645, 7075, 9504, 4897, 8047, 3388, 7114, 3300,...
## $ income <dbl> 14.891, 106.025, 104.593, 148.924, 55.882, 80.180, 20...
## $ credit_rating <int> 283, 483, 514, 681, 357, 569, 259, 512, 266, 491, 589...
## $ age <int> 34, 82, 71, 36, 68, 77, 37, 87, 66, 41, 30, 64, 57, 4...
Let’s have a look at 5 values selected randomly out of 400. Note down that due to random sampling you may have different values
credit_ch6 %>% sample_n(5)
## # A tibble: 5 x 6
## ID debt credit_limit income credit_rating age
## <int> <int> <int> <dbl> <int> <int>
## 1 254 218 5182 85.4 402 60
## 2 6 1151 8047 80.2 569 77
## 3 4 964 9504 149. 681 36
## 4 167 0 2880 35.7 214 35
## 5 373 840 3782 19.8 293 46
To further explore data. lets have a summary statistics using skimr package
##
## Attaching package: 'skimr'
## The following object is masked from 'package:stats':
##
## filter
## Skim summary statistics
## n obs: 400
## n variables: 3
##
## -- Variable type:integer -----------------------------------------------------------
## variable missing complete n mean sd p0 p25 p50 p75
## credit_limit 0 400 400 4735.6 2308.2 855 3088 4622.5 5872.75
## debt 0 400 400 520.01 459.76 0 68.75 459.5 863
## p100 hist
## 13913 <U+2585><U+2587><U+2587><U+2583><U+2582><U+2581><U+2581><U+2581>
## 1999 <U+2587><U+2583><U+2583><U+2583><U+2582><U+2581><U+2581><U+2581>
##
## -- Variable type:numeric -----------------------------------------------------------
## variable missing complete n mean sd p0 p25 p50 p75 p100
## income 0 400 400 45.22 35.24 10.35 21.01 33.12 57.47 186.63
## hist
## <U+2587><U+2583><U+2582><U+2581><U+2581><U+2581><U+2581><U+2581>
If one looks at the summary statistics, 25% of the values of outcome variable debt are less than $68.5 while than $68.5 while $$68.75 while the outcome variables are credit_limit and income. credit_limit: the mean and median credit card limit are $4735.6 and $4622.50, respectively, while 75% of card holders had incomes of $57,470 or less.
As our outcome variable debt and explanatory variables income and credit_limit are all numerical, therefore, we shall find correlation between pairs of variables among these variables.
credit_ch6 %>%
select(debt, credit_limit, income) %>%
cor()
## debt credit_limit income
## debt 1.0000000 0.8616973 0.4636565
## credit_limit 0.8616973 1.0000000 0.7920883
## income 0.4636565 0.7920883 1.0000000
Correlation between explanatory variables is 0.792 which is quite high and given credit_limit or income, one can guess the other. So there seems high collinearity. But at the moment we dont discuss it.
Lest visualize the relationship between outcome variable debt and explanatory variables
ggplot(credit_ch6, aes(x = credit_limit, y = debt)) +
geom_point() +
labs(x = "Credit limit (in $)", y = "Credit card debt (in $)",
title = "Debt and credit limit") +
geom_smooth(method = "lm", se = FALSE)
ggplot(credit_ch6, aes(x = income, y = debt)) +
geom_point() +
labs(x = "Income (in $1000)", y = "Credit card debt (in $)",
title = "Debt and income") +
geom_smooth(method = "lm", se = FALSE)
Observe there is a positive relationship between credit limit and credit card debt: as credit limit increases so also does credit card debt. This is consistent with the strongly positive correlation coefficient of 0.862 we computed earlier.
From the graph one can see positive relationship between income and debt. Now we run multiple regress line
#Regression model
debt_model <- lm(debt ~ credit_limit + income, data = credit_ch6)
summary(debt_model)
##
## Call:
## lm(formula = debt ~ credit_limit + income, data = credit_ch6)
##
## Residuals:
## Min 1Q Median 3Q Max
## -232.79 -115.45 -48.20 53.36 549.77
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -385.17926 19.46480 -19.79 <2e-16 ***
## credit_limit 0.26432 0.00588 44.95 <2e-16 ***
## income -7.66332 0.38507 -19.90 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 165.5 on 397 degrees of freedom
## Multiple R-squared: 0.8711, Adjusted R-squared: 0.8705
## F-statistic: 1342 on 2 and 397 DF, p-value: < 2.2e-16
Observe that regression coefficient of income is now negative and for every $1000 income increase there is debt reduction of $7.66. This is in contrast to earlier correlation and graphical analysis which shows positive association between income and debt.
So when credit_limit variable is included there is negative relationship between income and debt . Now lets look at disaggregated analysis between credit_limit and outcome variable debt.
ggplot(credit_ch6, aes(x = credit_limit)) +
geom_histogram(color = "white") +
geom_vline(xintercept = quantile(credit_ch6$credit_limit, probs = c(0.25, 0.5, 0.75)), linetype = "dashed", size = 1) +
labs(x = "Credit limit", title = "Credit limit and 4 credit limit brackets.")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
25% (100) persons have credit limits between 0 and 3308 (low credit limit). 25% have between 3308 and 4622 (medium credit limit), 25% have credit limit between 4662 and 5873 (medium hight credit limit) and 25% have credit limit abot $5873 (hig credit limit).
Lets have a plot between income and debt for the 4 credit limit separately and find out whether relationship is still positive.
credit_ch6 <- credit_ch6 %>%
mutate(limit_bracket = cut_number(credit_limit, 4)) %>%
mutate(limit_bracket = fct_recode(limit_bracket,
"low" = "[855,3.09e+03]",
"med-low" = "(3.09e+03,4.62e+03]",
"med-high" = "(4.62e+03,5.87e+03]",
"high" = "(5.87e+03,1.39e+04]"
))
head(credit_ch6)
## # A tibble: 6 x 7
## ID debt credit_limit income credit_rating age limit_bracket
## <int> <int> <int> <dbl> <int> <int> <fct>
## 1 1 333 3606 14.9 283 34 med-low
## 2 2 903 6645 106. 483 82 high
## 3 3 580 7075 105. 514 71 high
## 4 4 964 9504 149. 681 36 high
## 5 5 331 4897 55.9 357 68 med-high
## 6 6 1151 8047 80.2 569 77 high
model3_balance_vs_income_plot <- ggplot(credit_ch6, aes(x = income, y = debt)) +
geom_point() +
labs(x = "Income (in $1000)", y = "Credit card debt (in $)",
title = "Two scatterplots of credit card debt vs income") +
geom_smooth(method = "lm", se = FALSE) +
scale_y_continuous(limits = c(0, NA))
model3_balance_vs_income_plot_colored <- ggplot(credit_ch6,
aes(x = income, y = debt,
col = limit_bracket)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(x = "Income (in $1000)", y = "Credit card debt (in $)",
color = "Credit limit\nbracket") +
scale_y_continuous(limits = c(0, NA)) +
theme(axis.title.y = element_blank())
model3_balance_vs_income_plot
model3_balance_vs_income_plot_colored
## Warning: Removed 18 rows containing missing values (geom_smooth).
Aggregate relationship between debt and income is positive but when we plot debt and income relationship for 4 separate categories, we find that relationship for med-low and med-high is negative, it is flat for low category and is only somewhat positive for high category. credit_limit is a confounding variable and without including credit-limit, results will be misleading.