Discussion Question

Pick any two quantitative variables from any data set that interests you. If you are at a loss, look at the R datasets and find one. Then conduct both correlation and simple regression analysis. Interpret the residuals. Did the assumptions hold?

Answer

I chose the LifeCycleSavings data set built into R. It provides data on the savings ratio in 50 countries between 1960 and 1970. Further information is available on the data set's R help page (?LifeCycleSavings). After looking at the correlations across all variables, I chose to narrow in on real per-capita disposable income (dpi) and the percentage of the population over the age of 75 (pop75). The R code below shows the correlation coefficient and simple regression analysis, with dpi as the response and pop75 as the predictor.
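
The scan across all variables mentioned above is not shown in this answer. A minimal sketch of how it might be done, assuming a plain Pearson correlation matrix was used (the names cor_mat and ct are illustrative and not part of the original analysis):

```r
# Pairwise Pearson correlations for every column of LifeCycleSavings
cor_mat <- cor(LifeCycleSavings)
round(cor_mat, 3)

# The pair examined in this answer
cor_mat["pop75", "dpi"]

# cor.test() adds a significance test and confidence interval for that pair
ct <- cor.test(LifeCycleSavings$pop75, LifeCycleSavings$dpi)
ct
```

In a simple regression with one predictor, the p-value from cor.test() matches the p-value of the slope, so it should agree with the regression summary further down.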

# Correlation Coefficient
r = cor(LifeCycleSavings$pop75, LifeCycleSavings$dpi)
cat("Correlation of dpi and pop75 =", round(r, digits=4))
## Correlation of dpi and pop75 = 0.787
# Let's plot it out
plot(dpi ~ pop75, data = LifeCycleSavings,
     xlab = "Population over 75 (%)",
     ylab = "Real Per-Capita Disposable Income (Dollars)",
     main = "pop75 vs. dpi")

# Regression Analysis
model = lm(dpi ~ pop75, data = LifeCycleSavings)
abline(model, col="red", lwd=3)

summary(model)
## 
## Call:
## lm(formula = dpi ~ pop75, data = LifeCycleSavings)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1112.87  -388.13   -74.37   311.50  2208.22 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -278.55     179.44  -1.552    0.127    
## pop75         604.15      68.36   8.838 1.23e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 617.7 on 48 degrees of freedom
## Multiple R-squared:  0.6194, Adjusted R-squared:  0.6114 
## F-statistic: 78.11 on 1 and 48 DF,  p-value: 1.23e-11
# Display and plot the residuals
model$residuals
##      Australia        Austria        Belgium        Bolivia         Brazil 
##      874.32983     -877.74376     -289.34667     -541.24595      505.57601 
##         Canada          Chile          China       Colombia     Costa Rica 
##     1539.61273      131.85196      163.28924      -85.19738       61.06100 
##        Denmark        Ecuador        Finland         France        Germany 
##      400.78593     -152.61626      527.97242     -347.11587      711.78014 
##         Greece      Guatamala       Honduras        Iceland          India 
##     -723.45357       42.65020      160.58231      317.87934     -212.49286 
##        Ireland          Italy          Japan          Korea     Luxembourg 
##    -1112.87182     -433.87874      381.90921      -63.54560      474.47496 
##          Malta         Norway    Netherlands    New Zealand      Nicaragua 
##     -612.64210      292.36368       55.77465     -149.07373     -126.92916 
##         Panama       Paraguay           Peru    Philippines       Portugal 
##      122.13229     -135.24593      -94.69932     -246.08609     -863.75727 
##   South Africa South Rhodesia          Spain         Sweden    Switzerland 
##     -447.79451     -388.79417     -686.56017      835.21736      656.04496 
##         Turkey        Tunisia United Kingdom  United States      Venezuela 
##       15.72971     -202.59916     -602.01102     2208.21852      548.20585 
##         Zambia        Jamaica        Uruguay          Libya       Malaysia 
##       78.55521     -386.15466     -598.18840     -848.45402      122.50069
plot(LifeCycleSavings$pop75, model$residuals, xlab = "Population over 75 (%)", ylab = "Residuals", main = "Residuals Scatterplot")
abline(a = 0, b = 0, col = "red", lwd = 3)

# clean up
rm(model, r)

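As the output shows, there is a strong positive correlation (r = 0.787) between the percentage of the population over the age of 75 and real per-capita disposable income. The slope for pop75 is statistically significant (p = 1.23e-11), and the model explains about 62% of the variation in dpi (R-squared = 0.6194). However, the residual scatterplot fans out: the spread of the residuals appears to grow as pop75 increases, and the largest residuals belong to the United States (2208.22) and Canada (1539.61). This suggests that the constant-variance (homoscedasticity) assumption of linear regression may not hold, so the standard errors and p-values above should be interpreted with caution.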
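
The fanning-out pattern can be checked a little more formally. The following is a sketch and not part of the original analysis: it refits the same model, draws R's built-in diagnostic plots, and regresses the absolute residuals on pop75 as a rough Glejser-style check for non-constant variance. The names fit, res, and spread_fit are assumptions introduced here for illustration.

```r
# Refit the same model so this block runs on its own
fit <- lm(dpi ~ pop75, data = LifeCycleSavings)
res <- residuals(fit)

# Built-in diagnostic plots: residuals vs. fitted, normal Q-Q, scale-location
op <- par(mfrow = c(1, 3))
plot(fit, which = 1:3)
par(op)

# Rough check for non-constant variance (an auxiliary regression chosen
# here for illustration): does |residual| tend to increase with pop75?
spread_fit <- lm(abs(res) ~ pop75, data = LifeCycleSavings)
summary(spread_fit)$coefficients

# Shapiro-Wilk test of residual normality
shapiro.test(res)

# Countries with the largest absolute residuals
head(sort(abs(res), decreasing = TRUE), 3)
```

A clearly positive, significant slope in spread_fit would support the visual impression that the spread grows with pop75, and a small p-value from shapiro.test() would suggest the residuals are not normally distributed.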