Linear Model, Chap. 7

Tom Detzel
11.5.15

Urban Homeowners: Exercise 7.39

Three variables:

  • State
  • Percent home ownership (dependent)
  • Percent urban population (explanatory)
head(urbOwn,5)
       State OwnPct UrbPct
1    Alabama   69.7  59.04
2     Alaska   63.1  66.02
3    Arizona   66.0  89.81
4   Arkansas   67.0  56.16
5 California   55.9  94.95

Summary Stats - 1

Check for skew, normality.

  • Median and mean are close for both variables (67.60 vs. 67.07 for OwnPct, 74.20 vs. 73.98 for UrbPct).
     OwnPct          UrbPct     
 Min.   :53.30   Min.   :38.66  
 1st Qu.:65.60   1st Qu.:65.39  
 Median :67.60   Median :74.20  
 Mean   :67.07   Mean   :73.98  
 3rd Qu.:69.65   3rd Qu.:87.53  
 Max.   :73.40   Max.   :94.95  

Summary Stats - 2

Check for skew: both values are negative, so both variables are left-skewed, OwnPct (-1.23) more strongly than UrbPct (-0.44).

myStats[c(4,11,12,13)]
          sd  skew kurtosis   se
OwnPct  4.23 -1.23     1.51 0.59
UrbPct 14.69 -0.44    -0.57 2.06
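
As a quick sanity check on the se column, the standard error of the mean is the standard deviation divided by \( \sqrt{n} \). The sketch below assumes n = 51 (the 50 states plus the District of Columbia, which is consistent with the 49 degrees of freedom that cor.test() reports later in the deck); the sd values are copied from the table above, and the variable names are illustrative.

```r
# Standard deviations copied from the myStats output above
sds <- c(OwnPct = 4.23, UrbPct = 14.69)

# Assumption: n = 51 observations (not stated on this slide)
n <- 51

# Standard error of the mean: sd / sqrt(n)
se <- sds / sqrt(n)
print(round(se, 2))
```

Under that assumption the rounded results agree with the se column (0.59 and 2.06).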

Not Exactly Normal

Percent Home Ownership

[Figure: distribution of percent home ownership]

Percent Urban

[Figure: distribution of percent urban population]

Question 7.39, Part A

(a) For these data, \( { R }^{ 2 } \) = 0.28. What is the correlation? How can you tell if it is positive or negative?

[Figure: scatterplot of the data with regression line]

Eyeballing It

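  • Regression line slopes downward from left to right, so the correlation is negative
  • Slope alone does not measure strength; the scatter around the line does, and \( { R }^{ 2 } \) = 0.28 points to a weak-to-moderate correlation
  • Magnitude is \( \sqrt { { R }^{ 2 } } = \sqrt { 0.28 } \approx 0.5292 \); with the negative slope, the correlation is about -0.5292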
  • Check it using function cor.test()
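
The arithmetic behind these bullets can be sketched in a few lines. The variable names are illustrative, and the sign is supplied by hand from the downward-sloping line, since \( { R }^{ 2 } \) alone carries no sign.

```r
r_squared  <- 0.28   # value given in the exercise
slope_sign <- -1     # read off the plot: the fitted line slopes downward

# The correlation has magnitude sqrt(R^2) and the sign of the slope
r <- slope_sign * sqrt(r_squared)
print(round(r, 4))
```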

Question 7.39, Part A

(a) For these data, \( { R }^{ 2 } \) = 0.28. What is the correlation?

  • Use cor.test(y,x)
$statistic
        t 
-4.235275 

$parameter
df 
49 

$estimate
       cor 
-0.5176625 
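
The pieces of this output are tied together by the usual t statistic for a Pearson correlation, \( t = r \sqrt{df} / \sqrt{1 - r^2} \) with df = n - 2. A minimal check, using only the numbers printed above (variable names are illustrative):

```r
r  <- -0.5176625   # $estimate from cor.test()
df <- 49           # $parameter from cor.test(), i.e. n - 2

# t statistic for testing whether the correlation is zero
t_stat <- r * sqrt(df) / sqrt(1 - r^2)
print(t_stat)

# Squared correlation, for comparison with the R^2 quoted in the exercise
print(round(r^2, 3))
```

The result reproduces the reported t of about -4.235. Squaring the estimate gives roughly 0.268, a little below the 0.28 quoted in the exercise, so the data used here do not reproduce the textbook figure exactly.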

Question 7.39, Part B

(b) Examine the residual plot. What do you observe? Is a simple least squares fit appropriate for these data?

  • Residual = observed value - fitted value: the vertical distance from each point to the fit line
  • Recreate the model so we can plot the residuals
  • Use fortify() function in ggplot2 to create plottable data
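library(ggplot2)  # provides fortify()
ownFit <- lm(OwnPct ~ UrbPct, data = urbOwn)  # regress home ownership on percent urban
fOwn <- fortify(ownFit)  # data frame with .fitted and .resid columns added
head(fOwn[c(1, 2, 6, 7)], 3)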
  OwnPct UrbPct  .fitted     .resid
1   69.7  59.04 69.29798  0.4020211
2   63.1  66.02 68.25808 -5.1580759
3   66.0  89.81 64.71376  1.2862355
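
To make "Residuals = Data - Fit" concrete, the sketch below redoes the subtraction for the three rows shown and then backs out the fitted line implied by the first two rows. It uses only the printed (rounded) values, so the coefficients are approximate, and the variable names are illustrative.

```r
# Values copied from the fortify() output above
own_pct  <- c(69.7, 63.1, 66.0)
urb_pct  <- c(59.04, 66.02, 89.81)
fit_vals <- c(69.29798, 68.25808, 64.71376)

# Residual = observed - fitted
res <- own_pct - fit_vals
print(round(res, 4))

# Every fitted value lies on the regression line, so two of them determine it
slope     <- (fit_vals[2] - fit_vals[1]) / (urb_pct[2] - urb_pct[1])
intercept <- fit_vals[1] - slope * urb_pct[1]
print(round(c(intercept = intercept, slope = slope), 3))

# Check: the implied line should closely reproduce the third fitted value
print(intercept + slope * urb_pct[3])
```

The implied slope is about -0.149: each additional percentage point of urban population goes with roughly 0.15 points less home ownership.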

Question 7.39, Part B

(b) Examine the residual plot. What do you observe? Is a simple least squares fit appropriate for these data?

  • Plot the residuals and examine spread
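library(ggthemes)  # provides theme_fivethirtyeight()
ggplot(fOwn, aes(x = .fitted, y = .resid)) +
    geom_point(shape = 1) +
    geom_hline(yintercept = 0, linetype = "dashed") +  # reference line at zero residual
    theme_fivethirtyeight() +
    labs(x = "Fitted Values", y = "Residuals")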

Question 7.39, Part B

(b) Is a simple least squares fit appropriate for these data?

  • Residuals show no clear pattern around zero, so a linear model is reasonable.

[Figure: residuals vs. fitted values]

Question 7.39, Part B

(b) Is a simple least squares fit appropriate for these data?

  • Residuals appear approximately normal

[Figure: distribution of residuals]

Summary

(b) Is a simple least squares fit appropriate for these data?

  • Linearity: Do data show a linear trend? Yes
  • Nearly normal residuals: Yes
  • Constant variability: Borderline, some spread at lower end
  • Independent observations: Yes, though cities on state borders could make neighboring states' values related

Bottom Line:

Conditions are partially met, but with an \( { R }^{ 2 } \) of only 0.28, you're not getting much bang for the buck out of this model. It explains only 28 percent of the variability in the dependent variable (home ownership); other unidentified factors could be more important than the share of a state's population that is urban.