#In this spatial regression assignment, you will design a regression model based on either the Chicago Airbnb or Southern US datasets that are posted. Your regression model must have at least three variables (one dependent variable, and two independent variables). In your submission, please include:

#1) The research question you answer with your model (e.g. What is the impact of education on inequality?)

#2) A correctly interpreted Moran's I test on your original linear model (not the spatial version) - is there evidence of significant spatial autocorrelation in the residuals?

#3) A spatial regression model (error or lag, don't worry about GW regression!), and justification for why you chose it (I'd love to see you run the Lagrange Multiplier test! No need to get too technical, just tell me how you used p-values to determine the right test to use))

#4) A brief discussion of how you would interpret the results of your model - what variables are statistically significant? (e.g. Education had a negative and statistically significant impact on inequality)

#Submit everything as an Rpubs file! If you are unable to do so, let me know. No need to get too technical with your explanations. My goal is to make sure that you have a general conceptual understanding of the unit. 
#1) Research Question: How does the conjunction of lack of high school education and unemployment affect poverty rates?
#Load in Packages 
library(sf)     
## Warning: package 'sf' was built under R version 4.3.2
## Linking to GEOS 3.11.2, GDAL 3.7.2, PROJ 9.3.0; sf_use_s2() is TRUE
library(dplyr)   
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(spData) 
## Warning: package 'spData' was built under R version 4.3.2
## To access larger datasets in this package, install the spDataLarge
## package with: `install.packages('spDataLarge',
## repos='https://nowosad.github.io/drat/', type='source')`
library(ggplot2)
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 4.3.2
library(spdep)
## Warning: package 'spdep' was built under R version 4.3.3
library(spatialreg)
## Warning: package 'spatialreg' was built under R version 4.3.3
## Loading required package: Matrix
## 
## Attaching package: 'spatialreg'
## The following objects are masked from 'package:spdep':
## 
##     get.ClusterOption, get.coresOption, get.mcOption,
##     get.VerboseOption, get.ZeroPolicyOption, set.ClusterOption,
##     set.coresOption, set.mcOption, set.VerboseOption,
##     set.ZeroPolicyOption
library(GWmodel)
## Warning: package 'GWmodel' was built under R version 4.3.3
## Loading required package: robustbase
## Warning: package 'robustbase' was built under R version 4.3.3
## Loading required package: sp
## Warning: package 'sp' was built under R version 4.3.2
## Loading required package: Rcpp
## Welcome to GWmodel version 2.3-2.
library(tidyr)
## 
## Attaching package: 'tidyr'
## The following objects are masked from 'package:Matrix':
## 
##     expand, pack, unpack
#Load in Data
setwd("C:/Users/kajsa/OneDrive/Documents/Binghamton/Senior - Spring/DIDA 370/DIDA 370 - Topic 11/chicago_airbnb")
chi <- st_read("airbnb_Chicago 2015.shp")
## Reading layer `airbnb_Chicago 2015' from data source 
##   `C:\Users\kajsa\OneDrive\Documents\Binghamton\Senior - Spring\DIDA 370\DIDA 370 - Topic 11\chicago_airbnb\airbnb_Chicago 2015.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 77 features and 20 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -87.94011 ymin: 41.64454 xmax: -87.52414 ymax: 42.02304
## Geodetic CRS:  WGS 84
chi
## Simple feature collection with 77 features and 20 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -87.94011 ymin: 41.64454 xmax: -87.52414 ymax: 42.02304
## Geodetic CRS:  WGS 84
## First 10 features:
##          community    shape_area     shape_len AREAID response_r accept_r
## 1          DOUGLAS 46004621.1581 31027.0545098     35   98.77143 94.51429
## 2          OAKLAND 16913961.0408 19565.5061533     36   99.20000 90.10526
## 3      FULLER PARK 19916704.8692          <NA>     37   68.00000       NA
## 4  GRAND BOULEVARD 48492503.1554 28196.8371573     38   94.03704 83.61539
## 5          KENWOOD 29071741.9283 23325.1679062     39   92.54286 88.14286
## 6   LINCOLN SQUARE 71352328.2399 36624.6030848      4   90.00990 87.89691
## 7  WASHINGTON PARK 42373881.4842 28175.3160866     40  100.00000 97.33333
## 8        HYDE PARK 45105380.1732 29746.7082016     41   92.49505 89.93000
## 9         WOODLAWN  57815179.512 46936.9592443     42   91.09091 96.90000
## 10     ROGERS PARK 51259902.4506 34052.3975757      1   95.82500 88.81034
##    rev_rating  price_pp room_type num_spots poverty crowded dependency
## 1    87.77778  78.15789  1.789474        38    29.6     1.8       30.7
## 2    88.81250  53.77500  1.850000        20    39.7     1.3       40.4
## 3    91.75000  84.00000  1.833333         6    51.2     3.2       44.9
## 4    92.75000 119.53333  1.533333        30    29.3     3.3       39.5
## 5    90.65625  77.99145  1.615385        39    21.7     2.4       35.4
## 6    95.24176  80.18826  1.436364       110    10.9     3.4       25.5
## 7    94.33333 115.75000  1.750000         4    42.1     5.6       42.8
## 8    90.51685  79.31968  1.587156       109    18.4     1.5       26.2
## 9    93.38889  71.45833  1.958333        24    30.7     2.9       36.1
## 10   93.99020  58.67388  1.645669       127    23.6     7.7       27.5
##    without_hs unemployed income_pc harship_in num_crimes num_theft population
## 1        14.3       18.2     23791         47       5013      1241      18238
## 2        18.4       28.7     19252         78       1306       311       5918
## 3        26.6       33.9     10432         97       1764       383       2876
## 4        15.9       24.3     23472         57       6416      1428      21929
## 5        11.3       15.7     35911         26       2713       654      17841
## 6        13.4        8.2     37524         17       3670      1014      39493
## 7        25.4       28.6     13785         88       5433       823      11717
## 8         4.3        8.4     39056         14       3031       996      25681
## 9        16.5       23.4     18672         58       7585      1175      25983
## 10       18.2        8.7     23939         39       7181      1729      54991
##                          geometry
## 1  MULTIPOLYGON (((-87.60914 4...
## 2  MULTIPOLYGON (((-87.59215 4...
## 3  MULTIPOLYGON (((-87.6288 41...
## 4  MULTIPOLYGON (((-87.60671 4...
## 5  MULTIPOLYGON (((-87.59215 4...
## 6  MULTIPOLYGON (((-87.67441 4...
## 7  MULTIPOLYGON (((-87.60604 4...
## 8  MULTIPOLYGON (((-87.58038 4...
## 9  MULTIPOLYGON (((-87.57714 4...
## 10 MULTIPOLYGON (((-87.65456 4...
#Plot the Data
ggplot()+
  geom_sf(data = chi, aes(fill=as.numeric(poverty)))+
  scale_fill_steps(
    name = "Poverty",
    low = "lightsteelblue1",
    high = "tomato1",
    n.breaks = 5,
    show.limits = T)+
  theme_void()

#create our variables for the model:
model_2 <- chi %>% 
  select(c(poverty, unemployed, without_hs, AREAID, geometry)) %>%
  filter(!poverty %in% "NA")

#Run a basic model

summary(lm2 <- lm(poverty ~ unemployed + without_hs, data = model_2))
## 
## Call:
## lm(formula = poverty ~ unemployed + without_hs, data = model_2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.0642  -5.6940  -0.2384   4.8156  15.6825 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.12939    1.95078   0.579   0.5644    
## unemployed   1.13589    0.11044  10.285 6.51e-16 ***
## without_hs   0.15609    0.07047   2.215   0.0298 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.789 on 74 degrees of freedom
## Multiple R-squared:  0.6625, Adjusted R-squared:  0.6533 
## F-statistic: 72.62 on 2 and 74 DF,  p-value: < 2.2e-16
#2) Moran's I test on your original linear model - Is there evidence of significant spatial autocorrelation in the residuals?

#Let's test for spatial autocorrelation in the residuals - create the weights 
chi_list <- model_2 %>% 
  poly2nb(st_geometry(model_2)) %>% 
  nb2listw(zero.policy = TRUE) 

#Run the Moran's I test for regression residuals
lm.morantest(lm2, chi_list)
## 
##  Global Moran I for regression residuals
## 
## data:  
## model: lm(formula = poverty ~ unemployed + without_hs, data = model_2)
## weights: chi_list
## 
## Moran I statistic standard deviate = 5.692, p-value = 6.279e-09
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I      Expectation         Variance 
##      0.371038481     -0.028342012      0.004923201
#Because our Moran's I value is positive, and the p-value is less than 0.05, we can conclude that there is 
#significant positive spatial autocorrelation in the residuals. 
#3) A spatial regression model (error or lag, don't worry about GW regression!), and justification for why you chose it (I'd love to see you run the Lagrange Multiplier test! No need to get too technical, just tell me how you used p-values to determine the right test to use))

# Because there was significant dependency in the residuals based on the Moran's I test from the previous code chunk, 
# I have decided to move forward with a Spatial Error Regression Model

lm2_error <- errorsarlm(poverty ~ unemployed  + without_hs,
              data = model_2,
              listw = chi_list,
              zero.policy = TRUE, 
              na.action = na.omit)

summary(lm2_error)
## 
## Call:errorsarlm(formula = poverty ~ unemployed + without_hs, data = model_2, 
##     listw = chi_list, na.action = na.omit, zero.policy = TRUE)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -13.27860  -4.23530  -0.57197   2.76407  12.89043 
## 
## Type: error 
## Coefficients: (asymptotic standard errors) 
##              Estimate Std. Error z value  Pr(>|z|)
## (Intercept) -0.217800   2.658946 -0.0819    0.9347
## unemployed   1.042240   0.125797  8.2851 2.220e-16
## without_hs   0.307945   0.077923  3.9519 7.752e-05
## 
## Lambda: 0.65353, LR test value: 22.55, p-value: 2.0476e-06
## Asymptotic standard error: 0.10078
##     z-value: 6.4845, p-value: 8.9038e-11
## Wald statistic: 42.048, p-value: 8.9038e-11
## 
## Log likelihood: -243.9278 for error model
## ML residual variance (sigma squared): 29.57, (sigma: 5.4378)
## Number of observations: 77 
## Number of parameters estimated: 5 
## AIC: 497.86, (AIC for lm: 518.41)
# The error's model AIC is lower than the orginal linear model's AIC which suggests an improvement in the model!
seResiduals2 <- rep(0, length(model_2$poverty))
resIndex2 <- lm2_error$residuals %>% names() %>% as.integer();
seResiduals2[resIndex2] <- lm2_error$residuals

#perform the Moran Test
chi_list %>%
  moran.test(seResiduals2, ., zero.policy = TRUE) 
## 
##  Moran I test under randomisation
## 
## data:  seResiduals2  
## weights: .    
## 
## Moran I statistic standard deviate = -0.3115, p-value = 0.6223
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      -0.035587486      -0.013157895       0.005184602
#Now, we can see that our Moran's I test is insignificant as indicated by the p-value greater than 0.05 - this tells us that we effectively controlled for the spatial variation in the error term.
#4) Through our statistical testing, we can see that an increase of unemployment and individuals without hs degrees we see an increase in poverty rates.