Pick any two quantitative variables from any data set that interests you. If you are at a loss, look at the R datasets and find one. Then conduct both correlation and simple regression analysis. Interpret the residuals. Did the assumptions hold?
I chose the LifeCycleSavings data set built into R. It reports the savings ratio and related economic indicators for 50 countries, averaged over the decade from 1960 to 1970. Further information can be found here. After looking at the correlations across all variables, I chose to focus on real per-capita disposable income (dpi) and the percentage of the population over the age of 75 (pop75). The R code below computes the correlation coefficient and fits a simple linear regression of dpi on pop75.
# Correlation Coefficient
r = cor(LifeCycleSavings$pop75, LifeCycleSavings$dpi)
cat("Correlation of dpi and pop75 =", round(r, digits=4))
## Correlation of dpi and pop75 = 0.787
# Let's plot it out
plot(LifeCycleSavings$dpi ~ LifeCycleSavings$pop75, xlab = "Population over 75 (%)", ylab = "Real Per-Capita Disposable Income (Dollars)", main = "pop75 vs. dpi")
# Regression Analysis
model = lm(dpi ~ pop75, data = LifeCycleSavings)
abline(model, col="red", lwd=3)
summary(model)
##
## Call:
## lm(formula = dpi ~ pop75, data = LifeCycleSavings)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1112.87  -388.13   -74.37   311.50  2208.22
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -278.55     179.44  -1.552    0.127
## pop75         604.15      68.36   8.838 1.23e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 617.7 on 48 degrees of freedom
## Multiple R-squared: 0.6194, Adjusted R-squared: 0.6114
## F-statistic: 78.11 on 1 and 48 DF, p-value: 1.23e-11
# Display and plot the residuals
model$residuals
##      Australia        Austria        Belgium        Bolivia         Brazil
##      874.32983     -877.74376     -289.34667     -541.24595      505.57601
##         Canada          Chile          China       Colombia     Costa Rica
##     1539.61273      131.85196      163.28924      -85.19738       61.06100
##        Denmark        Ecuador        Finland         France        Germany
##      400.78593     -152.61626      527.97242     -347.11587      711.78014
##         Greece      Guatamala       Honduras        Iceland          India
##     -723.45357       42.65020      160.58231      317.87934     -212.49286
##        Ireland          Italy          Japan          Korea     Luxembourg
##    -1112.87182     -433.87874      381.90921      -63.54560      474.47496
##          Malta         Norway    Netherlands    New Zealand      Nicaragua
##     -612.64210      292.36368       55.77465     -149.07373     -126.92916
##         Panama       Paraguay           Peru    Philippines       Portugal
##      122.13229     -135.24593      -94.69932     -246.08609     -863.75727
##   South Africa South Rhodesia          Spain         Sweden    Switzerland
##     -447.79451     -388.79417     -686.56017      835.21736      656.04496
##         Turkey        Tunisia United Kingdom  United States      Venezuela
##       15.72971     -202.59916     -602.01102     2208.21852      548.20585
##         Zambia        Jamaica        Uruguay          Libya       Malaysia
##       78.55521     -386.15466     -598.18840     -848.45402      122.50069
plot(LifeCycleSavings$pop75, model$residuals, xlab = "Population over 75 (%)", ylab = "Residuals", main = "Residuals Scatterplot")
abline(a = 0, b = 0, col = "red", lwd = 3)
# clean up
rm(model, r)
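As you can see, there is a strong positive correlation (r = 0.787) between the percentage of the population over the age of 75 and real per-capita disposable income, and the slope of the regression line is statistically significant (p = 1.23e-11). The model explains about 62% of the variation in dpi. However, the residuals appear to grow as the percentage of the population over 75 grows: the residual scatterplot fans out from left to right. This suggests that the constant-variance (homoscedasticity) assumption of linear regression does not hold here, so the standard errors and p-values reported above should be treated with caution.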
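The fanning pattern can also be checked with something more than a visual inspection. The block below is a minimal sketch using only base R. It refits the model (it was removed in the clean-up step) and then runs a few common residual checks. The names `res` and `spread_test` are my own. The rank correlation between the absolute residuals and pop75 is an informal stand-in for a formal heteroscedasticity test, not a replacement for one.

```r
# Refit the model from the analysis above; LifeCycleSavings ships with base R
model <- lm(dpi ~ pop75, data = LifeCycleSavings)
res <- residuals(model)

# Constant variance: if the residual spread grows with pop75, the absolute
# residuals should be positively rank-correlated with pop75.
# exact = FALSE avoids a warning about ties in the data.
spread_test <- cor.test(abs(res), LifeCycleSavings$pop75,
                        method = "spearman", exact = FALSE)
print(spread_test)

# Normality of the residuals: Shapiro-Wilk test plus a normal Q-Q plot
print(shapiro.test(res))
qqnorm(res)
qqline(res, col = "red")

# Built-in diagnostic plots: residuals vs. fitted (1) and scale-location (3)
plot(model, which = c(1, 3))
```

A clearly positive Spearman estimate with a small p-value would support the visual impression that the spread increases with pop75. An upward trend in the scale-location plot would point the same way.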