Regression in R

Chris Brunsdon

UQ July 2018

Overview

  • Data
  • Exploration
  • Multiple Linear Regression
  • Testing Assumptions
  • Outliers

You will find Chapters 5 and 8 of An Introduction to R for Spatial Analysis and Mapping by Chris Brunsdon and Lex Comber useful.

Data

  • You will use the georgia dataset
  • It is available in the GISTools and GWmodel packages
  • Some package installation may be needed
  • Installation only needs to be done once

Once all the packages are installed, you need to load them into the current session:
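A minimal sketch of the install-once, load-every-session pattern. The helper `missing_packages()` and the variables `needed` and `to_install` are hypothetical names used only for illustration; the install call is left commented out because it depends on the repository configured on your machine.

```r
# Packages named on the slide.
needed <- c("GISTools", "GWmodel")

# Hypothetical helper: return the packages from `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(needed)
if (length(to_install) > 0) {
  # One-off step: uncomment to install from your configured repository.
  # install.packages(to_install)
  message("Not installed: ", paste(to_install, collapse = ", "))
}

# Attach whichever of the packages are available to the current session.
for (p in setdiff(needed, to_install)) {
  library(p, character.only = TRUE)
}
```

The check runs every session but is cheap; only the commented `install.packages()` step is a one-off.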

The Data Variables

The data set contains a number of variables for the counties in Georgia from the 1990 census, including the percentage of the population in each county that

  • is rural (PctRural)
  • has a college degree (PctBach)
  • is elderly (PctEld)
  • is foreign born (PctFB)
  • is classed as being in poverty (PctPov)
  • is black (PctBlack)

and the median income of the county (MedInc), in thousands of dollars.

The Data Itself

MedInc  PctRural  PctBach  PctEld  PctFB  PctPov  PctBlack
32.152      75.6      8.2   11.43   0.64    19.9     20.76
27.657     100.0      6.4   11.77   1.58    26.0     26.86
29.342      61.7      6.6   11.11   0.27    24.1     15.42
29.610     100.0      9.4   13.17   0.11    24.8     51.67
36.414      42.7     13.3    8.64   1.43    17.5     42.39
41.783     100.0      6.4   11.37   0.34    15.1      3.49

Initial Explorations

Visually, it seems that there may be some collinearity between PctPov, PctBlack and PctEld.
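One way to make this kind of visual check is a scatterplot matrix backed up by a correlation matrix. The sketch below is only illustrative: `df_head` is a hypothetical name holding just the six rows printed in the data table of these slides, so its correlations should not be read as those of the full data set.

```r
# First six rows only, copied from the data table printed on the slide.
# The full data frame (called df in the slides) is not reproduced here.
df_head <- data.frame(
  MedInc   = c(32.152, 27.657, 29.342, 29.610, 36.414, 41.783),
  PctRural = c(75.6, 100.0, 61.7, 100.0, 42.7, 100.0),
  PctBach  = c(8.2, 6.4, 6.6, 9.4, 13.3, 6.4),
  PctEld   = c(11.43, 11.77, 11.11, 13.17, 8.64, 11.37),
  PctFB    = c(0.64, 1.58, 0.27, 0.11, 1.43, 0.34),
  PctPov   = c(19.9, 26.0, 24.1, 24.8, 17.5, 15.1),
  PctBlack = c(20.76, 26.86, 15.42, 51.67, 42.39, 3.49)
)

# Scatterplot matrix: one panel for each pair of variables.
pairs(df_head)

# Pairwise Pearson correlations for the three predictors named above.
cor_mat <- cor(df_head[, c("PctPov", "PctBlack", "PctEld")])
round(cor_mat, 2)
```

With the full data frame, the same two calls would be applied to `df` in place of `df_head`.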

Alternative View

Median Income

Regression Models

Our Model Assumptions

We are interested in predicting MedInc in Georgia.

We assume that it is associated with

  • rurality (PctRural)
  • whether people went to university (PctBach)
  • how old the population is (PctEld)
  • the percentage of people who are foreign born (PctFB)
  • the percentage of the population classed as being in poverty (PctPov)
  • the percentage of the population that is black (PctBlack)

Models

The equation for a regression is:

$$y_i = \beta_0 + \sum_{k=1}^{m} \beta_k x_{ik} + \epsilon_i \quad \text{for } i \in \{1, \dots, n\}$$

  • $y_i$ is MedInc here
  • The $x_{ik}$'s are PctPov etc.

For example:

m <- lm(MedInc~PctRural+PctBach+PctEld+PctFB+PctPov+PctBlack, 
        data = df) 
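To connect the equation to what `lm` estimates, the sketch below computes the least-squares coefficients by hand from the normal equations and compares them with `lm`. It is an illustration under stated assumptions: `df_head` is a hypothetical name holding only the six rows printed earlier in the slides, and only two predictors are used because six rows cannot support all six.

```r
# Six rows copied from the data table on the slide (not the full data set).
df_head <- data.frame(
  MedInc  = c(32.152, 27.657, 29.342, 29.610, 36.414, 41.783),
  PctBach = c(8.2, 6.4, 6.6, 9.4, 13.3, 6.4),
  PctPov  = c(19.9, 26.0, 24.1, 24.8, 17.5, 15.1)
)

# Reduced model, for illustration only.
fit <- lm(MedInc ~ PctBach + PctPov, data = df_head)

# Design matrix: a column of 1s (for beta_0) plus one column per predictor.
X <- model.matrix(fit)
y <- df_head$MedInc

# Solve the normal equations (X'X) beta = X'y for beta.
beta_hat <- solve(crossprod(X), crossprod(X, y))

# Side-by-side comparison of the two sets of estimates.
cbind(lm = coef(fit), by_hand = drop(beta_hat))
```

`lm` itself uses a QR decomposition rather than the normal equations, which is numerically more stable; the two agree here up to rounding error.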

Summary

## 
## Call:
## lm(formula = MedInc ~ PctRural + PctBach + PctEld + PctFB + PctPov + 
##     PctBlack, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.4203  -2.9897  -0.6163   2.2095  25.8201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 52.59895    3.10893  16.919  < 2e-16 ***
## PctRural     0.07377    0.02043   3.611 0.000414 ***
## PctBach      0.69726    0.11221   6.214 4.73e-09 ***
## PctEld      -0.78862    0.17979  -4.386 2.14e-05 ***
## PctFB       -1.29030    0.47388  -2.723 0.007229 ** 
## PctPov      -0.95400    0.10459  -9.121 4.19e-16 ***
## PctBlack     0.03313    0.03717   0.891 0.374140    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.155 on 152 degrees of freedom
## Multiple R-squared:  0.7685, Adjusted R-squared:  0.7593 
## F-statistic: 84.09 on 6 and 152 DF,  p-value: < 2.2e-16
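The columns of this table are linked by simple arithmetic, which can be checked from the printed values alone. In the sketch below the numbers are copied from the output above, and the object names (`coefs`, `resid_df`, `adj_r2`) are hypothetical. Because the inputs are rounded, the recomputed values agree with the printed ones only to about three decimal places.

```r
# Estimates, standard errors and t values as printed in the summary above.
coefs <- data.frame(
  term       = c("(Intercept)", "PctRural", "PctBach", "PctEld",
                 "PctFB", "PctPov", "PctBlack"),
  estimate   = c(52.59895, 0.07377, 0.69726, -0.78862,
                 -1.29030, -0.95400, 0.03313),
  std_error  = c(3.10893, 0.02043, 0.11221, 0.17979,
                 0.47388, 0.10459, 0.03717),
  t_reported = c(16.919, 3.611, 6.214, -4.386, -2.723, -9.121, 0.891)
)
resid_df <- 152  # residual degrees of freedom from the output

# t value = estimate / standard error.
coefs$t_value <- coefs$estimate / coefs$std_error

# Two-sided p-value from the t distribution on the residual df.
coefs$p_value <- 2 * pt(-abs(coefs$t_value), df = resid_df)

# Number of observations = residual df + number of estimated coefficients.
n <- resid_df + nrow(coefs)

# Adjusted R-squared from the multiple R-squared.
r2 <- 0.7685
adj_r2 <- 1 - (1 - r2) * (n - 1) / resid_df

coefs
n
adj_r2
```

The value of `n` recovered this way is the number of counties used in the fit.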

Outliers

Studentised Residuals

Which ones?

Name           Residual
Charlton      -2.248484
Chattahoochee -2.205354
Clarke        -2.918361
Fayette        2.984996
Forsyth        5.620108
Seminole       2.954353
Wilkinson     -2.307853
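A table like this can be produced with `rstudent()`, which returns the studentised residuals of a fitted model. The sketch below is an assumption about the approach rather than the code behind the slide: `flag_outliers()` is a hypothetical helper, the cut-off of 2 is assumed (every value in the table exceeds it in absolute size), and the model is fitted to only the six rows printed earlier, so it may flag nothing.

```r
# Six rows copied from the data table on the slide (not the full data set).
df_head <- data.frame(
  MedInc = c(32.152, 27.657, 29.342, 29.610, 36.414, 41.783),
  PctPov = c(19.9, 26.0, 24.1, 24.8, 17.5, 15.1)
)

# Small illustrative model; the slides use the full six-predictor model m.
fit <- lm(MedInc ~ PctPov, data = df_head)

# Hypothetical helper: studentised residuals larger in absolute value
# than `threshold` (an assumed cut-off of 2 by default).
flag_outliers <- function(model, threshold = 2) {
  r <- rstudent(model)
  out <- data.frame(Name = names(r), Residual = unname(r))
  out[abs(out$Residual) > threshold, , drop = FALSE]
}

flagged <- flag_outliers(fit)
flagged
```

With the full model, `flag_outliers(m)` would label rows by the row names of `df`, so county names appear only if they are stored there.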

Spatial Aspects

Assumptions - Normally Distributed Errors