Looking at the built in data set “economics” to see the relationship between unemployment and the savings rate. The obvious expectation is there will be a strong connection.
library(ggplot2)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ tibble 3.0.6 ✓ dplyr 1.0.4
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ✓ purrr 0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
lin = lm(economics$uempmed ~ economics$psavert)
p1 = economics %>%
ggplot(aes(uempmed, psavert, color = date))+ geom_point() + stat_smooth(method = "lm", col = "red") + xlab("Unemployment")+ ylab("Savings Rate")
print(p1)
## `geom_smooth()` using formula 'y ~ x'
** The summary below describes a pretty strong relationship, but the scatter plot suggest a portion of the data is behaving differently at significant Unemployment levels and in recent years. We can see this clearly in the residual plots below**
summary(lin)
##
## Call:
## lm(formula = economics$uempmed ~ economics$psavert)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.241 -2.279 -1.084 0.357 15.840
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.46786 0.49657 25.108 < 2e-16 ***
## economics$psavert -0.45045 0.05478 -8.223 1.34e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.887 on 572 degrees of freedom
## Multiple R-squared: 0.1057, Adjusted R-squared: 0.1042
## F-statistic: 67.62 on 1 and 572 DF, p-value: 1.34e-15
plot(fitted(lin),resid(lin))
qqnorm(resid(lin))
qqline(resid(lin))
One hypothesis is that there may have been some paradigm shift that happened circa year 2000,maybe cultural of governmental, that is causing the drift from the model, lets separate the millenniums to check.
p2 = economics %>%
filter(date > "2000-01-01") %>%
ggplot(aes(uempmed, psavert, color = date,))+ geom_point() + geom_smooth(method = "lm", col = "red") + xlab("Unemployment")+ ylab("Savings Rate")+ ggtitle( "After 2000") + theme(legend.position = "none")
p3 = economics %>%
filter(date <= "2000-01-01") %>%
ggplot(aes(uempmed, psavert, color = date))+ geom_point() + geom_smooth(method = "lm", col = "red") + xlab("Unemployment")+ ylab("Savings Rate")+ ggtitle( "Before 2000") + theme(legend.position = "none")
grid.arrange(p1,p2,p3, widths = c(1,1), layout_matrix = rbind(c(1,1),c(2,3)))
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
lin2 = lm(uempmed ~ psavert,economics %>%
filter(date > "2000-01-01"))
lin3 = lm(uempmed ~ psavert,economics %>%
filter(date <= "2000-01-01"))
Interestingly enough the correlation is reversed after the year 2000, and the Multiple R-Squared value is much higher
print(summary(lin2))
##
## Call:
## lm(formula = uempmed ~ psavert, data = economics %>% filter(date >
## "2000-01-01"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.1658 -2.8539 -0.3202 2.0787 10.1102
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.06015 1.03473 -0.058 0.954
## psavert 2.19564 0.17529 12.526 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.798 on 181 degrees of freedom
## Multiple R-squared: 0.4643, Adjusted R-squared: 0.4614
## F-statistic: 156.9 on 1 and 181 DF, p-value: < 2.2e-16
print(summary(lin3))
##
## Call:
## lm(formula = uempmed ~ psavert, data = economics %>% filter(date <=
## "2000-01-01"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8602 -1.1582 -0.0500 0.9909 5.4697
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.31156 0.31230 26.614 < 2e-16 ***
## psavert -0.14962 0.03058 -4.893 1.45e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.475 on 389 degrees of freedom
## Multiple R-squared: 0.05799, Adjusted R-squared: 0.05556
## F-statistic: 23.94 on 1 and 389 DF, p-value: 1.455e-06
The data makes it pretty clear that there is a correlation; but there appears to be another strong variable acting on the results. In this case we can say that the subset for after the year 2000 give us the best results. A possible assumption may be the unprecedented government spending in recent years is having an affect on the savings behavior