NCG602A Assignment

1. The data for Georgia are in the GWmodel package. Use library() to load GWmodel and the data() function to load the data.

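library(GWmodel)        # provides the Georgia datasets
library(RColorBrewer)   # colour palettes used for the maps later on
library(classInt)       # class intervals used for the maps later on
data(Georgia)           # loads census data from Georgia
data(GeorgiaCounties)   # loads the Georgia counties data
ls()                    # lists the objects in the R environment to confirm the data were loaded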

## [1] "Gedu.counties" "Gedu.df"

ls() will tell you what the data frames are called. names() will list the column headings.

?Georgia              # The help page tells us Georgia is a dataset of 159 observations on 13 variables, covering geographical variables, educational attainment, and demographic information on age, ethnicity and place of birth
?Gedu.counties        # The help page tells us this object holds the county boundaries for mapping
?Gedu.df              # The help page describes this as a .csv file which contains census data from Georgia, USA
names(Gedu.counties)  # returns the column names of the attribute table in Gedu.counties

## [1] "AREA"      "PERIMETER" "G_UTM_"    "G_UTM_ID"  "AREANAME"  "AREAKEY"  
## [7] "X_COORD"   "Y_COORD"

names(Gedu.df)        # Gedu.df also contains geographical variables, along with statistics on rural v. urban demographics, population, educational attainment and ethnicity

##  [1] "AreaKey"  "Latitude" "Longitud" "TotPop90" "PctRural" "PctBach" 
##  [7] "PctEld"   "PctFB"    "PctPov"   "PctBlack" "ID"       "X"       
## [13] "Y"

2. Remember to merge the data frames before you can plot them - remember about the order of the data frames... match() will be useful.

sort(Gedu.df$AreaKey) == sort(Gedu.counties$AREAKEY) # Validates that the areakeys are identical across the dataframes

##   [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##  [15] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##  [29] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##  [43] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##  [57] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##  [71] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##  [85] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##  [99] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [113] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [127] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [141] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [155] TRUE TRUE TRUE TRUE TRUE
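The element-wise comparison above prints one value per county. A more compact check collapses it to a single logical and also guards against duplicate keys. This is a minimal sketch on toy keys: `keys_a` and `keys_b` are hypothetical stand-ins for `Gedu.df$AreaKey` and `Gedu.counties$AREAKEY`, not the real columns.

```r
# Toy stand-ins for the two key columns (hypothetical values)
keys_a <- c(13001, 13003, 13005, 13007)
keys_b <- factor(c("13005", "13001", "13007", "13003"))  # AREAKEY appears to be a factor in the summary output

# Convert the factor via character so the labels, not the internal codes, are compared
keys_b_num <- as.numeric(as.character(keys_b))

same_keys <- setequal(keys_a, keys_b_num)                            # same set of keys, order ignored
no_dupes  <- !anyDuplicated(keys_a) && !anyDuplicated(keys_b_num)   # each key occurs once
same_keys && no_dupes
```

If both conditions hold, every row in one table has exactly one partner in the other, which is what match() relies on.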

CountiesCopy <- Gedu.counties # Creates CountiesCopy, a duplicate of Gedu.counties
Gmatch <- match(Gedu.counties$AREAKEY,Gedu.df$AreaKey) # Creates Gmatch, a vector of areakey positions so Gedu.df can be indexed 
slot(CountiesCopy,"data") <- data.frame(slot(CountiesCopy,"data"),Gedu.df[Gmatch,])   # Places the dataframes side by side and places them in CountiesCopy
names(CountiesCopy) # Validates that the dataframes have been joined

##  [1] "AREA"      "PERIMETER" "G_UTM_"    "G_UTM_ID"  "AREANAME" 
##  [6] "AREAKEY"   "X_COORD"   "Y_COORD"   "AreaKey"   "Latitude" 
## [11] "Longitud"  "TotPop90"  "PctRural"  "PctBach"   "PctEld"   
## [16] "PctFB"     "PctPov"    "PctBlack"  "ID"        "X"        
## [21] "Y"
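The order of the arguments to match() matters: the polygon keys go first, so that the census rows are rearranged to follow the polygon order rather than the other way round. The sketch below shows this on two small, hypothetical tables (`polys` and `census`, with made-up values), not on the Georgia data.

```r
# Hypothetical toy tables: 'polys' plays the role of the polygon attribute table,
# 'census' the role of Gedu.df, deliberately stored in a different row order
polys  <- data.frame(AREAKEY = c(13005, 13001, 13003))
census <- data.frame(AreaKey = c(13001, 13003, 13005),
                     PctBach = c(8.2, 6.4, 6.6))

# For each polygon key, find the row of 'census' holding the same key
idx <- match(polys$AREAKEY, census$AreaKey)

# Reorder the census rows to follow the polygon order, then bind side by side
joined <- data.frame(polys, census[idx, ])

# Every row should now carry the same key in both key columns
stopifnot(!anyNA(idx), all(joined$AREAKEY == joined$AreaKey))
joined
```

An `NA` in `idx` would mean a polygon has no matching census row, so checking for it before joining is a cheap safeguard.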

summary(CountiesCopy) # Checks the data

## Object of class SpatialPolygonsDataFrame
## Coordinates:
##         min     max
## x  627305.9 1082188
## y 3368055.8 3879805
## Is projected: NA 
## proj4string : [NA]
## Data attributes:
##       AREA             PERIMETER          G_UTM_         G_UTM_ID     
##  Min.   :3.138e+08   Min.   : 87211   Min.   :  2.0   Min.   :  1.00  
##  1st Qu.:6.651e+08   1st Qu.:124139   1st Qu.: 42.5   1st Qu.: 41.50  
##  Median :8.981e+08   Median :152137   Median : 86.0   Median : 85.00  
##  Mean   :9.620e+08   Mean   :154314   Mean   : 87.1   Mean   : 86.04  
##  3rd Qu.:1.190e+09   3rd Qu.:174974   3rd Qu.:131.5   3rd Qu.:130.50  
##  Max.   :2.356e+09   Max.   :341307   Max.   :174.0   Max.   :173.00  
##                                                                       
##                 AREANAME      AREAKEY       X_COORD       
##  GA, Appling County :  1   13001  :  1   Min.   : 635964  
##  GA, Atkinson County:  1   13003  :  1   1st Qu.: 741144  
##  GA, Bacon County   :  1   13005  :  1   Median : 809737  
##  GA, Baker County   :  1   13007  :  1   Mean   : 820944  
##  GA, Baldwin County :  1   13009  :  1   3rd Qu.: 894584  
##  GA, Banks County   :  1   13011  :  1   Max.   :1059710  
##  (Other)            :153   (Other):153                    
##     Y_COORD           AreaKey         Latitude        Longitud     
##  Min.   :3401150   Min.   :13001   Min.   :30.72   Min.   :-85.50  
##  1st Qu.:3523380   1st Qu.:13082   1st Qu.:31.79   1st Qu.:-84.44  
##  Median :3636470   Median :13161   Median :32.75   Median :-83.69  
##  Mean   :3636238   Mean   :13161   Mean   :32.81   Mean   :-83.58  
##  3rd Qu.:3741655   3rd Qu.:13242   3rd Qu.:33.79   3rd Qu.:-82.85  
##  Max.   :3872640   Max.   :13321   Max.   :34.92   Max.   :-81.09  
##                                                                    
##     TotPop90         PctRural         PctBach          PctEld     
##  Min.   :  1915   Min.   :  2.50   Min.   : 4.20   Min.   : 1.46  
##  1st Qu.:  9220   1st Qu.: 54.70   1st Qu.: 7.60   1st Qu.: 9.81  
##  Median : 16934   Median : 72.30   Median : 9.40   Median :12.07  
##  Mean   : 40744   Mean   : 70.18   Mean   :10.95   Mean   :11.74  
##  3rd Qu.: 36058   3rd Qu.:100.00   3rd Qu.:12.00   3rd Qu.:13.70  
##  Max.   :648951   Max.   :100.00   Max.   :37.50   Max.   :22.96  
##                                                                   
##      PctFB           PctPov         PctBlack           ID        
##  Min.   :0.040   Min.   : 2.60   Min.   : 0.00   Min.   :  1.00  
##  1st Qu.:0.415   1st Qu.:14.05   1st Qu.:11.75   1st Qu.: 41.50  
##  Median :0.720   Median :18.60   Median :27.64   Median : 85.00  
##  Mean   :1.131   Mean   :19.34   Mean   :27.39   Mean   : 86.04  
##  3rd Qu.:1.265   3rd Qu.:24.65   3rd Qu.:40.06   3rd Qu.:130.50  
##  Max.   :6.740   Max.   :35.90   Max.   :79.64   Max.   :173.00  
##                                                                  
##        X                 Y          
##  Min.   : 635964   Min.   :3401148  
##  1st Qu.: 741144   1st Qu.:3523380  
##  Median : 809737   Median :3636468  
##  Mean   : 820944   Mean   :3636238  
##  3rd Qu.: 894584   3rd Qu.:3741656  
##  Max.   :1059706   Max.   :3872640  
##

summary(CountiesCopy[12:18]) # Checks the data we're interested in

## Object of class SpatialPolygonsDataFrame
## Coordinates:
##         min     max
## x  627305.9 1082188
## y 3368055.8 3879805
## Is projected: NA 
## proj4string : [NA]
## Data attributes:
##     TotPop90         PctRural         PctBach          PctEld     
##  Min.   :  1915   Min.   :  2.50   Min.   : 4.20   Min.   : 1.46  
##  1st Qu.:  9220   1st Qu.: 54.70   1st Qu.: 7.60   1st Qu.: 9.81  
##  Median : 16934   Median : 72.30   Median : 9.40   Median :12.07  
##  Mean   : 40744   Mean   : 70.18   Mean   :10.95   Mean   :11.74  
##  3rd Qu.: 36058   3rd Qu.:100.00   3rd Qu.:12.00   3rd Qu.:13.70  
##  Max.   :648951   Max.   :100.00   Max.   :37.50   Max.   :22.96  
##      PctFB           PctPov         PctBlack    
##  Min.   :0.040   Min.   : 2.60   Min.   : 0.00  
##  1st Qu.:0.415   1st Qu.:14.05   1st Qu.:11.75  
##  Median :0.720   Median :18.60   Median :27.64  
##  Mean   :1.131   Mean   :19.34   Mean   :27.39  
##  3rd Qu.:1.265   3rd Qu.:24.65   3rd Qu.:40.06  
##  Max.   :6.740   Max.   :35.90   Max.   :79.64

countiesmatrix <- as.matrix(cbind(CountiesCopy$TotPop90, CountiesCopy$PctRural, CountiesCopy$PctFB, CountiesCopy$PctEld, CountiesCopy$PctPov, CountiesCopy$PctBach, CountiesCopy$PctBlack)) # creates a matrix object so the relevant data can be more easily correlated below
colnames(countiesmatrix) <- c("Population", "Rural (%)", "Foreign-Born (%)", "Elderly (%)", "Poverty (%)", "Degree (%)", "Black (%)") # names each of the variables inside the matrix

3. hist(), summary(), boxplot() and plot() may be useful exploratory tools.

options(scipen = 999) # effectively disables scientific notation for the plots to follow
boxplot(CountiesCopy$TotPop90, main = "Boxplot of Total Population", ylab = "Population") # Generates a boxplot for the total population of Georgia

hist(CountiesCopy$TotPop90, main = "Histogram of Total Population", xlab = "Population", ylab = "Number of Counties") # Generates a histogram for the total population of each county in Georgia

spatialmap <- function(x) {                              # creates function spatialmap, which takes a numeric vector x and returns a colour code for each value
  VAR <- x                                               # creates variable VAR, which is made equal to x
  NC <- 8                                                # creates variable NC, the number of class intervals requested for the plot
  Colors <- brewer.pal(NC, "RdYlBu")                     # creates variable Colors, which holds eight colours from the Red-Yellow-Blue palette
  Classes <- classIntervals(VAR, NC, style = "pretty")   # creates variable Classes, the intervals at which the colours change; the "pretty" style is chosen because the data are unevenly distributed, and it gives rounded breaks, so the number of classes may differ from NC
  colCodes <- findColours(Classes, Colors)               # creates variable colCodes, which holds one colour per county, with the palette and the class table attached as attributes for use in the legend
  return(colCodes)                                       # returns the variable colCodes
}

colCodes <- spatialmap(countiesmatrix[,1])  # calls the function spatialmap on the total population variable from the countiesmatrix

plot(Gedu.counties, col=spatialmap(countiesmatrix[,1]))   # Plot Gedu.counties using the colCodes obtained from the spatialmap function
title("Spatial Plot of Georgia, U.S.")                    # Assigns a title to the plot
legend("topright",                                        # places the legend in the topright of the plot
       legend=names(attr(spatialmap(countiesmatrix[,1]), "table")), # determines the text for the legend
       fill=attr(spatialmap(countiesmatrix[,1]), "palette") ,       # uses the palette colours for the legend
       cex=0.7, bty="n" )                                           # determines how the text looks
axis(1)                                                             # inserts the x axis
axis(2)
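To see what happens to the values inside spatialmap(), it can help to reproduce the idea with base R only. The sketch below is an analogue, not classInt's or RColorBrewer's implementation: it uses pretty() for rounded breaks, cut() to assign classes and colorRampPalette() for colours, on a hypothetical vector `x`. It also shows that the number of classes produced need not equal the number requested.

```r
# Hypothetical skewed values standing in for one column of countiesmatrix
x <- c(1915, 9220, 16934, 36058, 120000, 648951)

n_requested <- 8
breaks    <- pretty(x, n = n_requested)             # rounded break points; their count is only approximate
classes   <- cut(x, breaks, include.lowest = TRUE)  # assign each value to an interval
n_classes <- nlevels(classes)

# Interpolate a palette to however many classes were actually produced
pal  <- colorRampPalette(c("red", "yellow", "blue"))(n_classes)
cols <- pal[as.integer(classes)]                    # one colour per observation

table(classes)                                      # number of observations per class
```

With strongly skewed data, most observations land in the first class, which is worth keeping in mind when reading the maps.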

When plotted on a histogram and box plot, the data are clearly not normally distributed: they are strongly right-skewed (positively skewed), with most counties concentrated at the lower end of the range and a long tail towards high values. 150 of the 159 counties have a population of between 1,915 and 100,000, meaning that there is extreme clustering towards the lower range of values. There are also a number of outliers, visible on both the box plot and the spatial plot, which corresponds to the summary of the variable above. The median county population is 16,934, and the maximum is more than 38 times this at 648,951.

The spatial plot confirms that many counties in Georgia have relatively low populations, with the exception of Fulton, which contains the capital city Atlanta, and the surrounding counties of Gwinnett, Rockdale, DeKalb and Douglas. Other counties which contain cities, such as Savannah in the south-east of the state, also score higher for overall population.
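The direction of skew can also be checked numerically rather than read off the plots. The sketch below uses one common moment-based definition of sample skewness; by convention a positive value indicates a long tail towards high values (right skew) and a negative value a long tail towards low values (left skew). The function name `skewness` and both toy vectors are assumptions for illustration, not part of the assignment data.

```r
# Sample skewness from the second and third central moments (one common definition)
skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Hypothetical toy vectors, not the Georgia data
low_heavy  <- c(1, 2, 2, 3, 3, 3, 4, 50)         # most values low, long tail towards high values
high_heavy <- c(1, 47, 48, 48, 49, 49, 50, 50)   # most values high, long tail towards low values

skewness(low_heavy)    # positive: right-skewed
skewness(high_heavy)   # negative: left-skewed
```

Applied to a column such as `CountiesCopy$TotPop90`, the sign of the result gives the direction of skew.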

hist(CountiesCopy$PctRural, main = "Histogram of Rural Population", ylab = "Number of Counties", xlab = "Population (%)") # Generates a histogram for the rural population in each county in Georgia

boxplot(CountiesCopy$PctRural, main = "Boxplot of Rural Population", ylab = "Population (%)") # Generates a boxplot for the rural population in each county in Georgia

colCodes <- spatialmap(countiesmatrix[,2])  # calls the function spatialmap on the rural population variable from the countiesmatrix

plot(Gedu.counties, col=spatialmap(countiesmatrix[,2]))   # Plot Gedu.counties using the colCodes obtained from the spatialmap function
title("Rural Population (%)")                                       # Assigns a title to the plot
legend("topright",                                        # places the legend in the topright of the plot
       legend=names(attr(spatialmap(countiesmatrix[,2]), "table")), # determines the text for the legend
       fill=attr(spatialmap(countiesmatrix[,2]), "palette") ,       # uses the palette colours for the legend
       cex=0.7, bty="n" )                                           # determines how the text looks
axis(1)                                                             # inserts the x axis
axis(2)                                                             # inserts the y axis

On the histogram and box plot, the data are strongly left-skewed (negatively skewed): most counties sit at the higher end of the range, with a tail towards low values. The median rural percentage across counties is 72.3, indicating that Georgia is a highly rural state. The spatial plot confirms this, showing that most counties score very highly for their rural population. Counties with a low rural percentage tend to have larger populations, as we would expect from the previous spatial plot; unsurprisingly, urban areas are more populous than rural ones.

hist(CountiesCopy$PctEld, main = "Histogram of Elderly Population", xlab = "Population (%)", ylab = "Number of Counties") # Generates a histogram for the Elderly population of Georgia

boxplot(CountiesCopy$PctEld, main = "Boxplot of Elderly Population", ylab = "Population (%)")                    # Generates a boxplot for the elderly population of Georgia

colCodes <- spatialmap(countiesmatrix[,4])  # calls the function spatialmap on the elderly population variable from the countiesmatrix

plot(Gedu.counties, col=spatialmap(countiesmatrix[,4]))   # Plot Gedu.counties using the colCodes obtained from the spatialmap function
title("Elderly Population (%)")                                       # Assigns a title to the plot
legend("topright",                                        # places the legend in the topright of the plot
       legend=names(attr(spatialmap(countiesmatrix[,4]), "table")), # determines the text for the legend
       fill=attr(spatialmap(countiesmatrix[,4]), "palette") ,       # uses the palette colours for the legend
       cex=0.7, bty="n" )                                           # determines how the text looks
axis(1)                                                             # inserts the x axis
axis(2)                                                             # inserts the y axis

The histogram and box plot both show that the percentage of elderly residents is distributed far more symmetrically than the previous two variables. The median elderly percentage across counties in Georgia is about 12%.

The outlier with the lowest elderly population is Chattahoochee County in the west of the state, while Towns County has the highest at about 23%.

hist(CountiesCopy$PctBach, main = "Histogram of Population with a Bachelor's Degree", xlab = "Population (%)", ylab = "Number of Counties") # Generates a histogram for the population of Georgia with a bachelor's degree

boxplot(CountiesCopy$PctBach, main = "Boxplot of Population with a Bachelor's Degree", ylab = "Population (%)")                    # Generates a boxplot for the population of Georgia with a bachelor's degree

colCodes <- spatialmap(countiesmatrix[,6])

plot(Gedu.counties, col=spatialmap(countiesmatrix[,6]))   # Plot Gedu.counties using the colCodes obtained from the spatialmap function
title("Degree (%)")                                       # Assigns a title to the plot
legend("topright",                                        # places the legend in the topright of the plot
       legend=names(attr(spatialmap(countiesmatrix[,6]), "table")), # determines the text for the legend
       fill=attr(spatialmap(countiesmatrix[,6]), "palette") ,       # uses the palette colours for the legend
       cex=0.7, bty="n" )                                           # determines how the text looks
axis(1)                                                             # inserts the x axis
axis(2)                                                             # inserts the y axis

As is clear from the above histogram and box plot, this variable is strongly right-skewed, with most counties at the lower end of the range. There are a number of outlier counties, particularly around Atlanta, as can be seen on the spatial plot. The mean rate of bachelor’s degree attainment across Georgia’s counties is about 11%.

Unsurprisingly, the counties with higher percentages of bachelor’s degree attainment tend to be those in which universities are located, such as Columbus State University in Muscogee County and Georgia State University in Atlanta. Neighbouring counties also exhibit higher percentages of bachelor’s degree attainment, presumably due to the higher density of student population in these counties.

hist(CountiesCopy$PctFB, main = "Histogram of Foreign-Born Population", xlab = "Population (%)", ylab = "Number of Counties")  # Generates a histogram for the foreign-born population of Georgia

boxplot(CountiesCopy$PctFB, main = "Boxplot of Foreign-Born Population", ylab = "Population (%)")                     # Generates a boxplot for the foreign-born population of Georgia

colCodes <- spatialmap(countiesmatrix[,3])  # calls the function spatialmap on the foreign-born population variable from the countiesmatrix

plot(Gedu.counties, col=spatialmap(countiesmatrix[,3]))   # Plot Gedu.counties using the colCodes obtained from the spatialmap function
title("Foreign-Born (%)")                                       # Assigns a title to the plot
legend("topright",                                        # places the legend in the topright of the plot
       legend=names(attr(spatialmap(countiesmatrix[,3]), "table")), # determines the text for the legend
       fill=attr(spatialmap(countiesmatrix[,3]), "palette") ,       # uses the palette colours for the legend
       cex=0.7, bty="n" )                                           # determines how the text looks
axis(1)                                                             # inserts the x axis
axis(2)                                                             # inserts the y axis

The data for the percentage of Georgia’s population born outside of the United States are highly right-skewed, with most counties at the lower end of the range, as is clear from both the histogram and box plot. Most counties have a foreign-born population of less than 1% (the median is 0.72%) and this, as well as the spatial plot, indicates that Georgia is a very homogeneous state, speaking solely from the perspective of its inhabitants’ places of birth.

Once again, the disparity between urban and rural values for this variable is clear, with more urban counties such as Fulton and Muscogee scoring higher for their foreign-born population, presumably due to the influx of international students attending university in these areas.

hist(CountiesCopy$PctPov, main = "Histogram of Population living in poverty", xlab = "Population (%)", ylab = "Number of counties") # Generates a histogram for the population of Georgia living in poverty

boxplot(CountiesCopy$PctPov, main = "Boxplot of Population living in poverty", ylab = "Population (%)")                    # Generates a boxplot for the population of Georgia living in poverty

colCodes <- spatialmap(countiesmatrix[,5])  # calls the function spatialmap on the poverty  variable from the countiesmatrix

plot(Gedu.counties, col=spatialmap(countiesmatrix[,5]))   # Plot Gedu.counties using the colCodes obtained from the spatialmap function
title("Population living in poverty (%)")                                       # Assigns a title to the plot
legend("topright",                                        # places the legend in the topright of the plot
       legend=names(attr(spatialmap(countiesmatrix[,5]), "table")), # determines the text for the legend
       fill=attr(spatialmap(countiesmatrix[,5]), "palette") ,       # uses the palette colours for the legend
       cex=0.7, bty="n" )                                           # determines how the text looks
axis(1)                                                             # inserts the x axis
axis(2)                                                             # inserts the y axis

Interpreting the above histogram and box plot visually, the data exhibit a slight right skew, with a tail towards the higher poverty rates (the mean of 19.34% sits just above the median of 18.6%). There are no outliers in the box plot. The spatial plot suggests that Georgia’s poorer counties are located in the south, and that there is a north/south divide in Georgia regarding rates of poverty.

Interpreting the spatial plot visually, there is also a slight overlap between poverty rates and rural areas, with more populous and urban counties tending towards the mid-range for the percentage of their population living in poverty.

hist(CountiesCopy$PctBlack, main = "Histogram of black population", xlab = "Population (%)", ylab = "Number of counties")  # Generates a histogram for the black population of Georgia

boxplot(CountiesCopy$PctBlack, main = "Boxplot of black population", ylab = "Population (%)")                     # Generates a boxplot for the black population of Georgia

colCodes <- spatialmap(countiesmatrix[,7])  # calls the function spatialmap on the black population variable from the countiesmatrix

plot(Gedu.counties, col=spatialmap(countiesmatrix[,7]))   # Plot Gedu.counties using the colCodes obtained from the spatialmap function
title("Black population (%)")          # Assigns a title to the plot
legend("topright",                                        # places the legend in the topright of the plot
       legend=names(attr(spatialmap(countiesmatrix[,7]), "table")), # determines the text for the legend
       fill=attr(spatialmap(countiesmatrix[,7]), "palette") ,       # uses the palette colours for the legend
       cex=0.7, bty="n" )                                           # determines how the text looks
axis(1)                                                             # inserts the x axis
axis(2)                                                             # inserts the y axis

The histogram and box plot demonstrate that the percentage of black people in each county is right-skewed, with most counties towards the lower end of the range. The median percentage of black population across counties is 27.6%. The spatial plot shows that the counties with the highest percentage of black residents are those in the centre of the state.

4. cor() will give you a correlation matrix, and cor.test() will carry out a test to determine whether any correlations are likely to be zero.

Normality was then checked more rigorously by plotting each of the variables on a Quantile-Quantile plot and reading these in tandem with the histograms above. None of the variables is normally distributed, so non-parametric tests of correlation, which make no assumptions about how the data are distributed, were used. This made the results obtained from the Pearson correlation unsuitable.
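The normality checks described here can be sketched as follows. The Q-Q plot is the method used in this report; the Shapiro-Wilk test is an additional formal check included here as an assumption, not something the analysis above relied on. The data are a hypothetical right-skewed sample (`skewed`), not the Georgia variables.

```r
set.seed(1)                                     # reproducible toy data
skewed <- rlnorm(159, meanlog = 9, sdlog = 1)   # hypothetical right-skewed sample, same size as the county data

qq <- qqnorm(skewed, plot.it = FALSE)           # theoretical and sample quantiles, computed without drawing
sw <- shapiro.test(skewed)                      # Shapiro-Wilk test; null hypothesis: the sample comes from a normal distribution

sw$p.value < 0.05                               # a small p-value is evidence against normality
```

To draw the plot itself, call `qqnorm(skewed)` followed by `qqline(skewed)`; points that bend away from the line indicate departure from normality.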

After combining the seven variables into a matrix, these correlation tests were carried out, the results of which appear below.

cor(countiesmatrix, method = "pearson")

##                   Population   Rural (%) Foreign-Born (%) Elderly (%)
## Population        1.00000000 -0.60419526        0.6114954  -0.3452795
## Rural (%)        -0.60419526  1.00000000       -0.5467835   0.3903173
## Foreign-Born (%)  0.61149536 -0.54678348        1.0000000  -0.4826286
## Elderly (%)      -0.34527951  0.39031726       -0.4826286   1.0000000
## Poverty (%)      -0.31198998  0.17420183       -0.3287093   0.5681833
## Degree (%)        0.71115683 -0.61885563        0.6719467  -0.4584957
## Black (%)        -0.01296442 -0.06905034       -0.1120165   0.2971667
##                  Poverty (%) Degree (%)   Black (%)
## Population        -0.3119900  0.7111568 -0.01296442
## Rural (%)          0.1742018 -0.6188556 -0.06905034
## Foreign-Born (%)  -0.3287093  0.6719467 -0.11201655
## Elderly (%)        0.5681833 -0.4584957  0.29716672
## Poverty (%)        1.0000000 -0.4016181  0.73563771
## Degree (%)        -0.4016181  1.0000000 -0.10929253
## Black (%)          0.7356377 -0.1092925  1.00000000

cor(countiesmatrix, method = "kendall")

##                  Population   Rural (%) Foreign-Born (%) Elderly (%)
## Population        1.0000000 -0.50096973        0.3756549  -0.3844347
## Rural (%)        -0.5009697  1.00000000       -0.2877349   0.1903480
## Foreign-Born (%)  0.3756549 -0.28773487        1.0000000  -0.3109432
## Elderly (%)      -0.3844347  0.19034802       -0.3109432   1.0000000
## Poverty (%)      -0.3877973  0.06571539       -0.2692739   0.4126899
## Degree (%)        0.4759326 -0.39701608        0.4088857  -0.2501520
## Black (%)        -0.1976351 -0.07604261       -0.1653498   0.2446817
##                  Poverty (%)  Degree (%)   Black (%)
## Population       -0.38779729  0.47593265 -0.19763507
## Rural (%)         0.06571539 -0.39701608 -0.07604261
## Foreign-Born (%) -0.26927392  0.40888569 -0.16534981
## Elderly (%)       0.41268988 -0.25015200  0.24468172
## Poverty (%)       1.00000000 -0.28781002  0.51655437
## Degree (%)       -0.28781002  1.00000000 -0.06767252
## Black (%)         0.51655437 -0.06767252  1.00000000

cor(countiesmatrix, method = "spearman")

##                  Population   Rural (%) Foreign-Born (%) Elderly (%)
## Population        1.0000000 -0.66404541        0.5175719  -0.5410559
## Rural (%)        -0.6640454  1.00000000       -0.3944235   0.2791255
## Foreign-Born (%)  0.5175719 -0.39442348        1.0000000  -0.4419493
## Elderly (%)      -0.5410559  0.27912549       -0.4419493   1.0000000
## Poverty (%)      -0.5468470  0.10218689       -0.3900342   0.5606699
## Degree (%)        0.6514782 -0.54034920        0.5677743  -0.3712539
## Black (%)        -0.2982561 -0.08782546       -0.2523544   0.3452952
##                  Poverty (%) Degree (%)   Black (%)
## Population        -0.5468470  0.6514782 -0.29825613
## Rural (%)          0.1021869 -0.5403492 -0.08782546
## Foreign-Born (%)  -0.3900342  0.5677743 -0.25235436
## Elderly (%)        0.5606699 -0.3712539  0.34529517
## Poverty (%)        1.0000000 -0.4109972  0.72273752
## Degree (%)        -0.4109972  1.0000000 -0.10732380
## Black (%)          0.7227375 -0.1073238  1.00000000

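The Kendall coefficients revealed no strong correlations: the largest off-diagonal value is 0.52, between poverty and black population. Note that cor() reports coefficients only, so significance has to be assessed separately with cor.test().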
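Because cor() returns coefficients without p-values, significance for every pair of variables has to come from cor.test(). The helper below is a sketch under assumptions: `cor_pvalues` is a hypothetical function name, and the matrix `toy` is simulated rather than taken from countiesmatrix. Setting `exact = FALSE` asks for an asymptotic p-value, which avoids the warning about ties.

```r
# Pairwise p-values for a rank correlation, one cor.test() call per pair of columns
cor_pvalues <- function(m, method = "kendall") {
  k <- ncol(m)
  p <- matrix(NA_real_, k, k, dimnames = list(colnames(m), colnames(m)))
  for (i in seq_len(k - 1)) {
    for (j in (i + 1):k) {
      # exact = FALSE requests an asymptotic p-value, so tied values cause no warning
      p[i, j] <- p[j, i] <- cor.test(m[, i], m[, j], method = method, exact = FALSE)$p.value
    }
  }
  p
}

# Hypothetical toy matrix, not the Georgia data
set.seed(42)
a   <- rnorm(100)
toy <- cbind(A = a, B = a + rnorm(100, sd = 0.5), C = rnorm(100))
round(cor_pvalues(toy), 4)
```

The same call with `method = "spearman"` gives the Spearman p-values. With 21 pairs of variables in the full matrix, some adjustment for multiple testing, for example with `p.adjust()`, may be worth considering.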

According to the Spearman coefficients, there is a strong positive correlation between the percentage of black people in a county and the percentage of its population living in poverty. The cor.test() function confirmed this correlation was 0.72 and returned a minuscule p-value (below 2.2e-16). We can therefore reject the null hypothesis that the true correlation is zero, and conclude that the correlation between the two variables is highly significant.

The relationship between the two variables can be seen visually on the scatter plot below.

plot(CountiesCopy$PctBlack, CountiesCopy$PctPov, main = "Poverty v. Black Population Scatter Plot", xlab = "Black Population (%)", ylab = "Population in poverty (%)")
abline(lm(CountiesCopy$PctPov ~ CountiesCopy$PctBlack))

cor.test(CountiesCopy$PctPov, CountiesCopy$PctBlack, method = "spearman")

## Warning in cor.test.default(CountiesCopy$PctPov, CountiesCopy$PctBlack, :
## Cannot compute exact p-value with ties

## 
##  Spearman's rank correlation rho
## 
## data:  CountiesCopy$PctPov and CountiesCopy$PctBlack
## S = 185740, p-value < 0.00000000000000022
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.7227375

5. You’ll need the RColorBrewer and classInt libraries when you’re plotting a map {don’t use spplot()}. You might need to write a function to plot maps.

6. The lm() function will fit a regression model. Use PctBach as the dependent variable, and PctEld, PctFB, PctPov and PctBlack as possible predictors.

7. Examine the unadjusted r-squared - start with one independent variable, and create several models with additional variables. What happens to the r-squared when you add a new variable? AIC() might be useful here too.

As PctBach is our dependent variable, it will be plotted on the y-axis, and each of the predictor, or independent, variables will appear on the x-axis. This is done in order to construct a model which can account for the relationship between the two variables. The models will then be compared in sequence to see which one most successfully accounts for the variance, or the uncertainty in the way the data fall on the scatter plot.
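The comparison of models in sequence can be sketched in a few lines. This is an illustration on simulated data, not the Georgia results: the data frame `toy` and the variables `y`, `x1`, `x2` and `x3` are hypothetical, with `x3` constructed as pure noise so that the effect of adding an irrelevant predictor is visible.

```r
# Hypothetical toy data: y depends on x1 and x2, while x3 is pure noise
set.seed(7)
n  <- 159
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 2 + 1.5 * x1 - 1.0 * x2 + rnorm(n)
toy <- data.frame(y, x1, x2, x3)

# Nested models, each adding one predictor to the previous one
m1 <- lm(y ~ x1, data = toy)
m2 <- lm(y ~ x1 + x2, data = toy)
m3 <- lm(y ~ x1 + x2 + x3, data = toy)

models <- list(m1, m2, m3)
comparison <- data.frame(
  model  = c("m1", "m2", "m3"),
  r2     = sapply(models, function(m) summary(m)$r.squared),      # unadjusted R-squared
  adj_r2 = sapply(models, function(m) summary(m)$adj.r.squared),  # penalised for extra predictors
  aic    = sapply(models, AIC)                                    # lower is better
)
comparison
```

The unadjusted r-squared can never fall when a predictor is added to a nested model, even a useless one, which is why adjusted r-squared and AIC are the more informative columns. Using `data = toy` with bare column names in the formula also gives shorter coefficient labels than the `CountiesCopy$...` form.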

plot(CountiesCopy$PctEld, CountiesCopy$PctBach, pch = 21, main = "Relationship between Bachelor's Attainment and Elderly Population in Georgia", xlab = "Elderly Population (%)", ylab = "Bachelor's Attainment (%)") # plots the relationship between Bachelor's Attainment and Elderly Population
abline(lm(CountiesCopy$PctBach ~ CountiesCopy$PctEld))     # fits a regression line to the plot

eldmodel <- lm(CountiesCopy$PctBach ~ CountiesCopy$PctEld) # constructs eldmodel, a linear regression of the elderly population and bachelor's degree attainment levels
summary(eldmodel)                                          # summarises the results

## 
## Call:
## lm(formula = CountiesCopy$PctBach ~ CountiesCopy$PctEld)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.881 -2.995 -0.664  1.537 23.415 
## 
## Coefficients:
##                     Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)          20.9010     1.5916  13.132 < 0.0000000000000002 ***
## CountiesCopy$PctEld  -0.8478     0.1312  -6.464        0.00000000122 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.079 on 157 degrees of freedom
## Multiple R-squared:  0.2102, Adjusted R-squared:  0.2052 
## F-statistic: 41.79 on 1 and 157 DF,  p-value: 0.000000001221

By fitting a regression line to the scatter plot, we can see that there is a lot of scatter within the data, and that many of the data points do not fall along the regression line. Therefore, there is a lot of variance that is unaccounted for in this model. The multiple r-squared statistic in the summary of the model above shows that the model accounts for 21.02% of the variance, leaving 78.98% unaccounted for.

However, we can see that there is a negative relationship between bachelor’s attainment and the elderly population of a county. One possible explanation is that the counties in which institutions of higher education are located (Fulton and its surrounding counties, as well as Muscogee County), which have a higher attainment rate for obvious reasons, also have a younger population, due to the influx of students residing there.
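The link between r-squared and unexplained variance can be made concrete by computing it by hand: r-squared is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses simulated data (`x`, `y` and `fit` are hypothetical, not the Georgia variables) and checks the manual value against the one reported by summary().

```r
# Hypothetical toy data, not the Georgia data
set.seed(3)
x   <- runif(50, 0, 20)
y   <- 20 - 0.8 * x + rnorm(50, sd = 5)
fit <- lm(y ~ x)

rss <- sum(residuals(fit)^2)    # variation left unexplained by the model
tss <- sum((y - mean(y))^2)     # total variation around the mean
r2_manual <- 1 - rss / tss

all.equal(r2_manual, summary(fit)$r.squared)   # the two values should agree
```

The unexplained share quoted in the text is simply `rss / tss`, that is, one minus r-squared.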

plot(CountiesCopy$PctFB, CountiesCopy$PctBach, pch = 21, main = "Relationship between Bachelor's Attainment and Foreign-Born Population in Georgia", xlab = "Foreign-Born Population (%)", ylab = "Bachelor's Attainment (%)") # plots the relationship between Bachelor's Attainment and the foreign-born population
abline(lm(CountiesCopy$PctBach ~ CountiesCopy$PctFB))     # fits a regression line to the plot

fbmodel <- lm(CountiesCopy$PctBach ~ CountiesCopy$PctFB)  # constructs fbmodel, a linear regression of the foreign-born population and bachelor's degree attainment levels
summary(fbmodel)                                          # summarises the results

## 
## Call:
## lm(formula = CountiesCopy$PctBach ~ CountiesCopy$PctFB)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.2634  -2.0289  -0.2558   1.6015  17.9208 
## 
## Coefficients:
##                    Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)          7.4447     0.4556   16.34 <0.0000000000000002 ***
## CountiesCopy$PctFB   3.0964     0.2724   11.37 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.233 on 157 degrees of freedom
## Multiple R-squared:  0.4515, Adjusted R-squared:  0.448 
## F-statistic: 129.2 on 1 and 157 DF,  p-value: < 0.00000000000000022

The fitted regression line shows that the relationship is not tightly linear: a number of the points are widely scattered and lie well away from the line. The model accounts for 45.15% of the variation (multiple r-squared of 0.4515), leaving 54.85% unexplained.

However, there does seem to be a positive relationship between the rate of bachelor’s degree attainment and foreign-born population of a county. Once again, we can probably attribute this to the international student population in counties in which institutions of higher education are located.

plot(CountiesCopy$PctPov, CountiesCopy$PctBach, pch = 21, main = "Relationship between Bachelor's Attainment and Population living in poverty in Georgia", xlab = "Population living in poverty (%)", ylab = "Bachelor's Attainment (%)") # plots the relationship between Bachelor's Attainment and percentage of the population living in poverty
abline(lm(CountiesCopy$PctBach ~ CountiesCopy$PctPov))     # fits a regression line to the plot

povmodel <- lm(CountiesCopy$PctBach ~ CountiesCopy$PctPov) # constructs povmodel, a linear regression model of the percentage of the population living in poverty in each county and the rate of bachelor degree attainment
summary(povmodel)                                          # summarises the results

## 
## Call:
## lm(formula = CountiesCopy$PctBach ~ CountiesCopy$PctPov)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.9836 -3.2577 -0.7594  1.3242 28.9689 
## 
## Coefficients:
##                     Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)         17.04816    1.18535  14.382 < 0.0000000000000002 ***
## CountiesCopy$PctPov -0.31545    0.05741  -5.495          0.000000155 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.234 on 157 degrees of freedom
## Multiple R-squared:  0.1613, Adjusted R-squared:  0.156 
## F-statistic: 30.19 on 1 and 157 DF,  p-value: 0.0000001547

When we fit a line to the scatter plot we see that bachelor’s attainment tends to fall as the rate of poverty in a county rises. However, very few observations fall on the regression line, and the model summary above shows that 84% of the variation is not accounted for by this model (multiple r-squared of 0.1613).

Nevertheless, the plot and the significant negative slope (-0.315, p < 0.001) indicate that a county’s poverty rate is negatively associated with its rate of bachelor’s degree attainment. This stands to reason, since a lack of wealth is a barrier to educational access.

plot(CountiesCopy$PctBlack, CountiesCopy$PctBach, pch = 21, main = "Relationship between Bachelor's Attainment and Black Population in Georgia", xlab = "Black Population (%)", ylab = "Bachelor's Attainment (%)") # plots the relationship between Bachelor's Attainment and the black population
abline(lm(CountiesCopy$PctBach ~ CountiesCopy$PctBlack))         # fits a regression line to the plot

blackmodel <- lm(CountiesCopy$PctBach ~ CountiesCopy$PctBlack)   # constructs blackmodel, a linear regression model of the percentage of the black population in each county and the rate of bachelor degree attainment
summary(blackmodel)                                              # summarises the results

## 
## Call:
## lm(formula = CountiesCopy$PctBach ~ CountiesCopy$PctBlack)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.817 -3.299 -1.628  1.357 26.511 
## 
## Coefficients:
##                       Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)           11.92840    0.84276  14.154 <0.0000000000000002 ***
## CountiesCopy$PctBlack -0.03582    0.02600  -1.378                0.17    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.681 on 157 degrees of freedom
## Multiple R-squared:  0.01194,    Adjusted R-squared:  0.005652 
## F-statistic: 1.898 on 1 and 157 DF,  p-value: 0.1703

By fitting a regression line to this plot, we can see that very few points fall along the line. The multiple r-squared shows that the model accounts for only 1.2% of the variance, leaving 98.8% unaccounted for.

Interpreting the plot visually, the line slopes slightly downwards, suggesting that attainment at third level is slightly lower in counties with a larger black population. However, the slope is not statistically significant (p = 0.17), so on its own this model provides little evidence of a relationship. A weak negative association would nonetheless be consistent with the correlation between poverty rates and black population rates in Georgia demonstrated above: the black population may be disproportionately affected by the lower levels of educational attainment in the state’s poorer counties.
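Whether the slight downward slope can be distinguished from zero can be checked from the reported estimate and standard error alone. The sketch below rebuilds an approximate 95% confidence interval from the values printed in the summary of blackmodel, so it inherits their rounding. With the fitted model available, confint(blackmodel) should give the interval directly.

```r
# Values copied from the summary(blackmodel) output above
estimate  <- -0.03582
std_error <- 0.02600
df_resid  <- 157

t_crit <- qt(0.975, df_resid)                   # two-sided 5% critical value
ci <- estimate + c(-1, 1) * t_crit * std_error  # approximate 95% confidence interval

print(round(ci, 4))
includes_zero <- ci[1] < 0 && ci[2] > 0         # TRUE means the slope is not significant at 5%
print(includes_zero)
```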

7. Examine the unadjusted r-squared - start with one independent variable, and create several models with additional variables.

From the above results, we can see that foreign-born population is the most successful explanatory variable at accounting for the variance that we see in rates of bachelor degree attainment in Georgia.

We can use the AIC() function to compare the fit of the models with one another. The results appear below.

AIC(eldmodel, fbmodel, blackmodel, povmodel) # compares the goodness of fit of the different models created so far

##            df       AIC
## eldmodel    3  971.9981
## fbmodel     3  914.0280
## blackmodel  3 1007.6113
## povmodel    3  981.5540

The model ‘fbmodel’, which regresses the rate of bachelor’s degree attainment on the rate of foreign-born residents, has the lowest AIC score and is therefore preferred in this instance.
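For reference, the AIC of a Gaussian linear model can be computed by hand. This is a sketch under the assumption that the models are ordinary least-squares fits: the parameter count k is the number of coefficients plus one for the residual variance, which is why each single-predictor model shows df = 3 in the table. The helper `gaussian_aic` is a hypothetical name introduced here, and the first check uses simulated stand-in data.

```r
# AIC = n * (log(2*pi) + 1 + log(RSS / n)) + 2 * k for a least-squares fit,
# where k counts the coefficients plus the residual variance.
gaussian_aic <- function(rss, n, k) {
  n * (log(2 * pi) + 1 + log(rss / n)) + 2 * k
}

# Check against R's AIC() on simulated data (hypothetical stand-in values)
set.seed(2)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)
m <- lm(y ~ x)
aic_manual <- gaussian_aic(sum(residuals(m)^2), n = 100, k = 3)
print(c(manual = aic_manual, builtin = AIC(m)))

# Rebuild the fbmodel AIC from its reported residual standard error
# (4.233 on 157 degrees of freedom, n = 159); rounding limits the precision
rss_fb <- 4.233^2 * 157
aic_fb <- gaussian_aic(rss_fb, n = 159, k = 3)
print(aic_fb)
```

Because the residual standard error is printed to only four significant figures, the rebuilt value should agree with the AIC table to within a few hundredths rather than exactly.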

What happens to the r-squared when you add a new variable? AIC() might be useful here too.

To determine what happens to the r-squared value when a new variable is added, a sequence of models was fitted, each containing a different combination of variables.

AIC() was then used to compare seven of the models once they had been fitted and their summaries output.

fbeldmodel <- lm(CountiesCopy$PctBach ~ CountiesCopy$PctFB + CountiesCopy$PctEld)   # creates a multiple linear regression model for the foreign born population and the elderly population
eldpovmodel <- lm(CountiesCopy$PctBach ~ CountiesCopy$PctPov + CountiesCopy$PctEld) # creates a multiple linear regression model for the percentage of the population living in poverty and the elderly population
blackpovmodel <- lm(CountiesCopy$PctBach ~ CountiesCopy$PctBlack + CountiesCopy$PctPov)  # creates a multiple linear regression model for the percentage of the black population and the percentage of the population living in poverty
allmodel <- lm(CountiesCopy$PctBach ~ CountiesCopy$PctPov + CountiesCopy$PctEld + CountiesCopy$PctBlack + CountiesCopy$PctFB) # creates a multiple linear regression model for all the variables

summary(fbeldmodel)    # outputs the summary of the foreign-born and elderly population model

## 
## Call:
## lm(formula = CountiesCopy$PctBach ~ CountiesCopy$PctFB + CountiesCopy$PctEld)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.6940  -1.9229  -0.1769   1.6759  16.8687 
## 
## Coefficients:
##                     Estimate Std. Error t value           Pr(>|t|)    
## (Intercept)          11.6828     1.6658   7.013 0.0000000000664371 ***
## CountiesCopy$PctFB    2.7073     0.3052   8.870 0.0000000000000016 ***
## CountiesCopy$PctEld  -0.3235     0.1225  -2.641             0.0091 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.154 on 156 degrees of freedom
## Multiple R-squared:  0.475,  Adjusted R-squared:  0.4683 
## F-statistic: 70.57 on 2 and 156 DF,  p-value: < 0.00000000000000022

summary(eldpovmodel)   # outputs the summary of the elderly and population living in poverty model

## 
## Call:
## lm(formula = CountiesCopy$PctBach ~ CountiesCopy$PctPov + CountiesCopy$PctEld)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3106 -3.1170 -0.4931  1.8395 25.4791 
## 
## Coefficients:
##                     Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)         21.49614    1.58531  13.560 < 0.0000000000000002 ***
## CountiesCopy$PctPov -0.16367    0.06664  -2.456               0.0151 *  
## CountiesCopy$PctEld -0.62888    0.15688  -4.009            0.0000944 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5 on 156 degrees of freedom
## Multiple R-squared:  0.2396, Adjusted R-squared:  0.2299 
## F-statistic: 24.58 on 2 and 156 DF,  p-value: 0.0000000005251

summary(blackpovmodel) # outputs the summary of the black population and poverty model

## 
## Call:
## lm(formula = CountiesCopy$PctBach ~ CountiesCopy$PctBlack + CountiesCopy$PctPov)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.6183 -3.0994 -0.7035  1.3927 30.9189 
## 
## Coefficients:
##                       Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)           17.93951    1.15681  15.508 < 0.0000000000000002 ***
## CountiesCopy$PctBlack  0.13297    0.03384   3.929             0.000128 ***
## CountiesCopy$PctPov   -0.54986    0.08110  -6.780       0.000000000234 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.009 on 156 degrees of freedom
## Multiple R-squared:  0.2368, Adjusted R-squared:  0.227 
## F-statistic:  24.2 on 2 and 156 DF,  p-value: 0.0000000006997

summary(allmodel)      # outputs the summary of the model that contains all variables

## 
## Call:
## lm(formula = CountiesCopy$PctBach ~ CountiesCopy$PctPov + CountiesCopy$PctEld + 
##     CountiesCopy$PctBlack + CountiesCopy$PctFB)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.1414  -2.0934  -0.2113   1.6709  19.9212 
## 
## Coefficients:
##                       Estimate Std. Error t value           Pr(>|t|)    
## (Intercept)           12.67113    1.63682   7.741 0.0000000000012175 ***
## CountiesCopy$PctPov   -0.28291    0.07810  -3.622           0.000396 ***
## CountiesCopy$PctEld   -0.10531    0.13784  -0.764           0.446047    
## CountiesCopy$PctBlack  0.07685    0.02803   2.742           0.006834 ** 
## CountiesCopy$PctFB     2.54521    0.29839   8.530 0.0000000000000129 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.013 on 154 degrees of freedom
## Multiple R-squared:  0.5163, Adjusted R-squared:  0.5037 
## F-statistic: 41.09 on 4 and 154 DF,  p-value: < 0.00000000000000022

AIC(eldmodel, fbmodel, blackmodel, fbeldmodel, eldpovmodel, blackpovmodel, allmodel) #compares the goodness of fit for all the models

##               df       AIC
## eldmodel       3  971.9981
## fbmodel        3  914.0280
## blackmodel     3 1007.6113
## fbeldmodel     4  909.0725
## eldpovmodel    4  967.9654
## blackpovmodel  4  968.5503
## allmodel       6  900.0487

Having fitted these models and explored their summaries, we can see that adding a new variable increases the r-squared value; in a least-squares model the unadjusted r-squared can never decrease when a predictor is added. For example, the linear regression model ‘blackpovmodel’ accounts for 24% of the variance, in contrast to ‘blackmodel’ and ‘povmodel’, which account for only 1.2% and 16% respectively.

However, diminishing returns do seem to set in. This can be seen most clearly in the summary of the allmodel regression, which accounts for only 52% of the variance, roughly 6.5 percentage points more than the foreign-born population model ‘fbmodel’ (45%). The allmodel regression nonetheless has the lowest AIC (900.05, against 914.03 for fbmodel), so the additional variables still improve the fit after the penalty for extra parameters.
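Because the unadjusted r-squared cannot fall when a predictor is added, the adjusted r-squared, which penalises extra predictors, is a fairer basis for comparing models of different sizes. The sketch below recomputes it from the reported values. `adj_r2` is a hypothetical helper, and the inputs are the rounded figures printed in the summaries above.

```r
# Adjusted R-squared: 1 - (1 - R2) * (n - 1) / (n - p - 1),
# with n observations and p predictors (excluding the intercept)
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n <- 159  # number of counties
adj <- c(
  fbmodel    = adj_r2(0.4515, n, 1),
  fbeldmodel = adj_r2(0.4750, n, 2),
  allmodel   = adj_r2(0.5163, n, 4)
)
print(round(adj, 4))
```

The results should match the "Adjusted R-squared" values in the corresponding summaries up to rounding.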

8. Interpret the coefficient estimates - which ones are significant in the model?

summary(allmodel)   # outputs the summary of the linear regression model that accounts for all variables

## 
## Call:
## lm(formula = CountiesCopy$PctBach ~ CountiesCopy$PctPov + CountiesCopy$PctEld + 
##     CountiesCopy$PctBlack + CountiesCopy$PctFB)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.1414  -2.0934  -0.2113   1.6709  19.9212 
## 
## Coefficients:
##                       Estimate Std. Error t value           Pr(>|t|)    
## (Intercept)           12.67113    1.63682   7.741 0.0000000000012175 ***
## CountiesCopy$PctPov   -0.28291    0.07810  -3.622           0.000396 ***
## CountiesCopy$PctEld   -0.10531    0.13784  -0.764           0.446047    
## CountiesCopy$PctBlack  0.07685    0.02803   2.742           0.006834 ** 
## CountiesCopy$PctFB     2.54521    0.29839   8.530 0.0000000000000129 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.013 on 154 degrees of freedom
## Multiple R-squared:  0.5163, Adjusted R-squared:  0.5037 
## F-statistic: 41.09 on 4 and 154 DF,  p-value: < 0.00000000000000022

After fitting a linear regression with all four predictors (allmodel above), the p-values show that the coefficient estimates for poverty, black population and foreign-born population are significant, whereas the estimate for the elderly population is not (p = 0.446). The model was therefore refitted with only the three significant predictors.

coeffmodel <- lm(CountiesCopy$PctBach ~ CountiesCopy$PctBlack + CountiesCopy$PctPov + CountiesCopy$PctFB) # refits the model using only the three significant predictors

summary(coeffmodel) # summarises the results

## 
## Call:
## lm(formula = CountiesCopy$PctBach ~ CountiesCopy$PctBlack + CountiesCopy$PctPov + 
##     CountiesCopy$PctFB)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.5233  -2.1449  -0.1925   1.7846  20.2514 
## 
## Coefficients:
##                       Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)           11.77059    1.13415  10.378 < 0.0000000000000002 ***
## CountiesCopy$PctBlack  0.08015    0.02766   2.898               0.0043 ** 
## CountiesCopy$PctPov   -0.30964    0.06973  -4.440             0.000017 ***
## CountiesCopy$PctFB     2.62551    0.27889   9.414 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.008 on 155 degrees of freedom
## Multiple R-squared:  0.5144, Adjusted R-squared:  0.505 
## F-statistic: 54.74 on 3 and 155 DF,  p-value: < 0.00000000000000022

All three variables are significant. Foreign birth is the most significant, followed by the poverty rate and finally the rate of black population in each county. In this model the black percentage of the population has a small but positive coefficient (0.08), whereas the poverty rate has a negative coefficient (-0.31). The coefficient for foreign birth is by far the largest (2.63) and is positive: each additional percentage point of foreign-born population is associated with an increase of about 2.6 percentage points in bachelor’s attainment, holding the other variables constant.

The t-value for the intercept is 10.378, and those for foreign birth, poverty and black population are 9.41, -4.44 and 2.90 respectively. The null hypothesis for each t-test is that the corresponding coefficient is zero. At the 5% level it is rejected when the absolute t-value exceeds a critical value of roughly 1.96. This is true in each instance, so we can reject the null hypothesis and conclude that each coefficient is significantly different from zero.

Equivalently, all these variables have p-values of less than 0.05, so we can conclude that the results are significant, with the rate of foreign birth significant to the greatest extent.
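The link between the t-values and the p-values can be checked directly. In this sketch the t-values are copied from the summary(coeffmodel) output, and each two-sided p-value is twice the tail area of a t distribution with 155 residual degrees of freedom. The vector name `t_values` is illustrative.

```r
# t-values copied from summary(coeffmodel); 155 residual degrees of freedom
t_values <- c(PctBlack = 2.898, PctPov = -4.440, PctFB = 9.414)
df_resid <- 155

t_crit <- qt(0.975, df_resid)                  # exact 5% critical value, slightly above 1.96
p_values <- 2 * pt(-abs(t_values), df_resid)   # two-sided p-values

print(t_crit)
print(signif(p_values, 2))
print(abs(t_values) > t_crit)                  # TRUE where the zero-coefficient null is rejected
```

The value 1.96 is the large-sample (normal) critical value; with 155 degrees of freedom the exact threshold is marginally higher, which does not change any of the conclusions here.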

9. Do the residuals exhibit significant spatial autocorrelation?

To determine whether the residuals exhibit spatial autocorrelation, we can carry out Moran’s I test, a statistical test for autocorrelation in spatial data. This was carried out below.

georgianeighbours <- poly2nb(Gedu.counties) # builds a list of neighbours based on the boundary data
georgialist <- nb2listw(georgianeighbours)  # adds spatial weights to the list
moran.test(countiesmatrix[,6], georgialist) # carries out the moran.test

## 
##  Moran I test under randomisation
## 
## data:  countiesmatrix[, 6]  
## weights: georgialist  
## 
## Moran I statistic standard deviate = 5.2975, p-value =
## 0.00000005869
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##       0.248098122      -0.006329114       0.002306653

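The null hypothesis of no spatial autocorrelation can be rejected: the Moran I statistic of 0.25 is well above its expected value of -0.006, and the p-value (0.00000005869) is far below 0.05. This indicates significant positive spatial autocorrelation. Note that the test as run was applied to column 6 of countiesmatrix; testing the residuals directly would mean passing residuals(coeffmodel) to moran.test() instead.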
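The figures in the moran.test output can be cross-checked by hand. Under the null hypothesis of no spatial autocorrelation, the expected value of Moran’s I is -1/(n - 1), and the standard deviate is the distance of the observed statistic from that expectation in standard-error units. The sketch below uses only the numbers printed above, with n = 159 counties.

```r
# Values copied from the moran.test output above
moran_i  <- 0.248098122
variance <- 0.002306653
n        <- 159                                # number of counties

expectation <- -1 / (n - 1)                    # expected I under no autocorrelation
z <- (moran_i - expectation) / sqrt(variance)  # standard deviate
p_value <- pnorm(z, lower.tail = FALSE)        # one-sided, alternative "greater"

print(c(expectation = expectation, z = z, p_value = p_value))
```

To test regression residuals rather than a single variable, spdep also provides lm.morantest(), which takes a fitted lm object and a listw weights object. With the objects defined in this document the call would be lm.morantest(coeffmodel, georgialist), assuming the rows of CountiesCopy are in the same order as the polygons in Gedu.counties.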

Useful to map the residuals?

In determining which counties exhibit the highest residuals, it may indeed be useful to represent them on a spatial plot, to see what factors might be influencing the distribution of the residuals.

CountiesCopy$coeffresiduals <- residuals(coeffmodel) # stores the residuals of coeffmodel
colCodes <- spatialmap(CountiesCopy$coeffresiduals)  # calls the function spatialmap on the residuals of coeffmodel

plot(Gedu.counties, col=colCodes)   # Plot Gedu.counties using the colCodes obtained from the spatialmap function
title("Residuals")          # Assigns a title to the plot
legend("topright",                                        # places the legend in the topright of the plot
       legend=names(attr(colCodes, "table")), # determines the text for the legend
       fill=attr(colCodes, "palette") ,       # uses the palette colours for the legend
       cex=0.7, bty="n" )                                           # determines how the text looks
axis(1)                                                             # inserts the x axis
axis(2)                                                             # inserts the y axis

summary(CountiesCopy$coeffresiduals)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -12.5233  -2.1449  -0.1925   0.0000   1.7846  20.2514

boxplot(CountiesCopy$coeffresiduals, main = "Residuals Boxplot")

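By plotting the residuals, both on the map and on the boxplot above, we can see that they are not normally distributed. While the mean residual is zero and the median is close to zero (-0.19), there are a number of outliers, and the largest positive residual (20.25) is considerably further from zero than the largest negative one (-12.52).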
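A quantile-quantile plot and a Shapiro-Wilk test are two common ways to back up a visual impression of non-normality. The sketch below uses simulated data with deliberately right-skewed errors (`toy`, `x` and `y` are hypothetical stand-ins). For the model in this document, residuals(coeffmodel) would take the place of `res`.

```r
# Simulated regression with right-skewed (exponential) errors; names are hypothetical
set.seed(3)
toy <- data.frame(x = runif(159))
toy$y <- 2 + 3 * toy$x + rexp(159)

m <- lm(y ~ x, data = toy)
res <- residuals(m)

# With an intercept, least-squares residuals average to zero by construction,
# so a zero mean says nothing about skewness; the median and the tails do.
print(summary(res))

qqnorm(res, main = "Normal Q-Q plot of residuals")  # points bend away from the line if skewed
qqline(res)

sw <- shapiro.test(res)   # null hypothesis: the residuals are normally distributed
print(sw$p.value)
```

A small p-value from shapiro.test() is evidence against normality. The test says nothing about which observations are responsible, so it complements the boxplot rather than replacing it.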

What do the residuals tell us?

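The residuals tell us that coeffmodel is in some respects insufficient: it leaves a good deal of variation unexplained, and its assumptions about the residuals (that they are normally distributed with constant variance) do not hold well enough for it to be considered a definitive prediction of where the data points can be expected to lie.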

In which counties are the predictions worst (standardised residuals might be useful here) - and why might this be? You’ll find the county names in the spatial polygons data frame. I found Wikipedia quite helpful as a start.

CountiesCopy$stdres <- stdres(coeffmodel)    # computes the standardised residuals of coeffmodel (stdres() is from the MASS package)
colCodes <- spatialmap(CountiesCopy$stdres)  # calls the function spatialmap on the standardised residuals

plot(Gedu.counties, col=colCodes)   # Plot Gedu.counties using the colCodes obtained from the spatialmap function
title("Standardised Residuals")          # Assigns a title to the plot
legend("topright",                                        # places the legend in the topright of the plot
       legend=names(attr(colCodes, "table")), # determines the text for the legend
       fill=attr(colCodes, "palette") ,       # uses the palette colours for the legend
       cex=0.7, bty="n" )                                           # determines how the text looks
axis(1)                                                             # inserts the x axis
axis(2)                                                             # inserts the y axis

summary(CountiesCopy$stdres)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -3.414150 -0.539635 -0.048370 -0.000219  0.448352  5.335207

boxplot(CountiesCopy$stdres, main = "Standardised Residuals Boxplot")

The two counties in Georgia with the most negative residuals, to the extent that they are outliers, are Seminole and Chattahoochee. The four counties with the largest positive residuals, which are also outliers, are Clarke, Oconee, Fulton and Douglas.

These particular residuals could be attributed to these counties being outliers for a number of the variables. For example, Fulton and Douglas are outliers for total population and for residents born outside the United States, as is Chattahoochee county.

Clarke, Fulton and Douglas also score quite highly for the percentage of people educated to bachelor’s degree level in comparison to the rest of the state. In conclusion, linear regression is highly sensitive to outliers, and the division between urban and rural demographics in Georgia is very pronounced. A more robust statistical model would therefore be required to account for the variance in educational attainment across counties. One candidate is a geographically weighted model, which could reduce the amount of autocorrelation and attenuate the influence of these outliers.
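Picking out the worst-predicted counties can also be done programmatically. This is a minimal sketch on simulated data in which one observation is made an outlier on purpose. The row names are invented, and base R’s rstandard() is used here in place of MASS::stdres(), on the understanding that both return the same kind of standardised residual for an lm fit. The cutoff of 3 is a common rule of thumb rather than something taken from the analysis above.

```r
# Simulated data with one deliberately extreme observation; all names are hypothetical
set.seed(4)
toy <- data.frame(x = runif(50, 0, 10))
toy$y <- 5 + 2 * toy$x + rnorm(50)
rownames(toy) <- sprintf("county_%02d", 1:50)
toy["county_07", "y"] <- toy["county_07", "y"] + 15   # inject an outlier

m <- lm(y ~ x, data = toy)
std_res <- rstandard(m)                  # standardised residuals, named by row

# Rank observations by the size of their standardised residual and flag the extreme ones
worst <- sort(abs(std_res), decreasing = TRUE)
flagged <- names(std_res)[abs(std_res) > 3]

print(head(worst, 5))
print(flagged)
```

Applied to the Georgia data, the same pattern would use the county names from the spatial polygons data frame as labels, provided they are in the same row order as the model data.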

NCG602A Assignment

C. Beausang

5/12/2017
