The file usa48_usedcars.shp in directory ‘usedcars’ contains information on tax rates and delivery charges for new cars (tax_charge) in the 48 conterminous U.S. states for the period 1955-1959, as well as the average used car price for 1960 (price_1960). Use this dataset to build a spatial model linking used car prices to the tax and delivery costs. You will need the spdep library.
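
The packages used throughout the analysis need to be loaded first (assuming they are installed; in recent versions lagsarlm() lives in the spatialreg package rather than spdep):

library(sf)          # read and handle the shapefile
library(spdep)       # neighborhood functions, spatial weights, Moran tests
library(spatialreg)  # lagsarlm() for the spatial lag model
library(ggplot2)     # histograms and maps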

usedcars <- st_read("../../datafiles/usedcars/usa48_usedcars.shp")
## Reading layer `usa48_usedcars' from data source 
##   `C:\Users\sherl\OneDrive\Documents\GRAD SCHOOL II\Geog 6000 - SpatialStats\datafiles\usedcars\usa48_usedcars.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 48 features and 3 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -124.6813 ymin: 25.12993 xmax: -67.00742 ymax: 49.38323
## CRS:           NA
plot(usedcars)

It is also good to look at the numerical distributions of the data:

ggplot(usedcars) +
  geom_histogram(aes(x = tax_charge), binwidth = 25) +
  geom_vline(xintercept = mean(usedcars$tax_charge, na.rm = T), 
             col = "darkblue") +
  labs(x = "Tax Charge ($)")

ggplot(usedcars) +
  geom_histogram(aes(x = price_1960), binwidth = 25) +
  geom_vline(xintercept = mean(usedcars$price_1960, na.rm = T), 
             col = "darkblue") +
  labs(x = "Car Price ($)")

Both variables look roughly normally distributed.
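
If a more formal check is wanted, a Shapiro-Wilk test could be run on each variable (just a sketch; the null hypothesis is that the variable is normally distributed):

shapiro.test(usedcars$tax_charge)
shapiro.test(usedcars$price_1960)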

Neighborhood

Build a neighborhood function linking the 48 states. You will need to choose one of the set of neighborhood functions available in R. Explain your choice of neighborhood function.

# calculate state centroids (used to plot the neighborhood links)
usa_geom <- st_geometry(usedcars)
usa_cent <- st_centroid(usa_geom)

plot(usa_geom, reset = F)
plot(usa_cent, pch = 16, col = 2, add = T)

Queen’s Case

nb_qc <- poly2nb(usedcars)

plot(usa_geom, reset = F)
plot(nb_qc, usa_cent, lwd = 1.5, col = 2, add = T)

This looks pretty good. Neighbors are all surrounding states.

Rook’s Case

nb_rk <- poly2nb(usedcars, queen = F)

plot(usa_geom, reset = F)
plot(nb_rk, usa_cent, lwd = 1.5, col = 2, add = T)

This also looks good. It is slightly less complicated than the Queen’s case. For instance, Utah and New Mexico, which meet only at the Four Corners point, are no longer neighbors, and dropping that link seems reasonable.

Delaunay Triangulation

nb_dtri <- tri2nb(usa_cent)
plot(usa_geom, reset = F)
plot(nb_dtri, usa_cent, lwd = 1.5, col = 2, add = T)

This is too complicated. In some ways it is good: there is probably a neighborhood relationship between Texas and Florida, at least economically. However, I doubt there is much of a relationship between Maine and North Dakota.

Sphere of Influence

nb_sphere <- graph2nb(soi.graph(nb_dtri, usa_cent))

plot(usa_geom, reset = F)
plot(nb_sphere, usa_cent, lwd = 1.5, col = 2, add = T)

I like this representation. It regionalizes the neighborhoods: there are clear groupings for the Western USA, the Midwest, and the East Coast. I am going to use this neighborhood for the remaining analysis.
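
To back up this choice, the number of links per state under each candidate definition can be compared with card(); the sphere-of-influence graph is by construction a subgraph of the Delaunay triangulation, so it keeps the regional structure while dropping the longest cross-country links (a sketch using the objects created above):

# links per state for each neighborhood definition
summary(card(nb_qc))
summary(card(nb_rk))
summary(card(nb_dtri))
summary(card(nb_sphere))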

Spatial Weight Matrix

Build a spatial weight matrix. Use the summary() function to get information about the distribution of links per state and report this

I will use the row-standardized ("W") style. It is best not to make assumptions about neighborhood relationships unless there is direct evidence that one relationship is stronger than another.

spw_mtrx <- nb2listw(nb_sphere, style = "W")

summary(spw_mtrx)
## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 48 
## Number of nonzero links: 184 
## Percentage nonzero weights: 7.986111 
## Average number of links: 3.833333 
## Link number distribution:
## 
##  2  3  4  5  6 
##  7 10 18 10  3 
## 7 least connected regions:
## 7 17 24 32 37 41 45 with 2 links
## 3 most connected regions:
## 10 13 42 with 6 links
## 
## Weights style: W 
## Weights constants summary:
##    n   nn S0       S1       S2
## W 48 2304 48 26.50444 195.9239
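
Because style = "W" row-standardizes the weights, each of a state's neighbors receives weight \(1/k\), where \(k\) is that state's number of links, and every row of the implied weight matrix sums to one. A quick check (a sketch):

spw_mtrx$weights[1:3]           # each element should equal 1 / (number of links)
sapply(spw_mtrx$weights, sum)   # every row sum should be 1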

Use the moran.test() or moran.mc() function to test for spatial autocorrelation in the prices of used cars. Report the Moran’s \(I\) statistic, the z-score and whether or not you can reject the null hypothesis that there is no spatial autocorrelation
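
For reference, the global Moran’s \(I\) computed by these functions is

\[
I = \frac{n}{\sum_i \sum_j w_{ij}} \cdot \frac{\sum_i \sum_j w_{ij} (x_i - \bar{x})(x_j - \bar{x})}{\sum_i (x_i - \bar{x})^2},
\]

where \(x_i\) is the 1960 price in state \(i\) and \(w_{ij}\) are the spatial weights. Under the null hypothesis of no spatial autocorrelation its expectation is \(-1/(n-1) = -1/47 \approx -0.021\), which matches the Expectation reported below.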

moran.test(usedcars$price_1960,
           listw = spw_mtrx,
           alternative = "two.sided",
           randomisation = T)
## 
##  Moran I test under randomisation
## 
## data:  usedcars$price_1960  
## weights: spw_mtrx    
## 
## Moran I statistic standard deviate = 8.1079, p-value = 5.149e-16
## alternative hypothesis: two.sided
## sample estimates:
## Moran I statistic       Expectation          Variance 
##        0.82305501       -0.02127660        0.01084445

The Moran’s \(I\) statistic is 0.8230, with a z-score of 8.11 and a \(p\)-value of approximately 0. The \(p\)-value is extremely low, so we are able to reject the null hypothesis that there is no spatial autocorrelation. These values are evidence of spatial clustering in used car prices.
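
As a quick arithmetic check (a sketch, not required by the question), the standard deviate can be reproduced from the sample estimates reported above:

# z = (I - E[I]) / sqrt(Var[I])
(0.82305501 - (-0.02127660)) / sqrt(0.01084445)  # ~8.108, matching the reported deviate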

moran.mc(usedcars$price_1960,
         listw = spw_mtrx,
         nsim = 999,
         alternative = "greater")
## 
##  Monte-Carlo simulation of Moran I
## 
## data:  usedcars$price_1960 
## weights: spw_mtrx  
## number of simulations + 1: 1000 
## 
## statistic = 0.82306, observed rank = 1000, p-value = 0.001
## alternative hypothesis: greater

Running a Monte Carlo simulation returns similar results: the \(I\) statistic is again 0.823, now with a \(p\)-value of 0.001. The \(p\)-value is slightly higher in this scenario, but remains well below any conventional significance level.
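
Note that with 999 simulations the smallest attainable \(p\)-value is

\[
p_{\min} = \frac{1}{n_{sim} + 1} = \frac{1}{999 + 1} = 0.001,
\]

and because the observed statistic ranked above all 999 simulated values (observed rank 1000), the reported \(p\)-value is exactly this floor rather than a true zero.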

moran.plot(usedcars$price_1960,
           spw_mtrx,
           xlab = "1960 Price ($)",
           ylab = "Lagged Price ($)")

lm_cluster <- localmoran(usedcars$price_1960,
                         listw = spw_mtrx,
                         alternative = "two.sided")
usedcars$zscore <- lm_cluster[ ,4]
usedcars$pval <- lm_cluster[ , 5]

ggplot(usedcars) +
  geom_sf(aes(fill = zscore)) + 
  labs(fill = "Z-score",
       title = "Used Car 1960 Price")

usedcars$pval.bin <- as.factor(ifelse(usedcars$pval < 0.05, "Significant", "Not-significant"))
ggplot(usedcars) + 
  geom_sf(aes(fill = pval.bin)) +
  labs(fill = "Pval",
       title = "Used Car 1960 Price")

The Western states and the states along the Mississippi River show significant local spatial autocorrelation.

Simple Linear Model

Build a simple linear model using the lm() function between the used car prices (dependent or Y variable) and the tax and delivery cost (independent or X variable). Report the coefficients of the model and the \(R^2\). Check for autocorrelation in the residuals of the model using the moran.mc() function, and state whether or not this is present

lm1 <- lm(data = usedcars, formula = price_1960 ~ tax_charge)
summary(lm1)
## 
## Call:
## lm(formula = price_1960 ~ tax_charge, data = usedcars)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -116.701  -45.053   -1.461   43.400  107.807 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1435.7506    27.5796  52.058  < 2e-16 ***
## tax_charge     0.6872     0.1754   3.918 0.000294 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 57.01 on 46 degrees of freedom
## Multiple R-squared:  0.2503, Adjusted R-squared:  0.234 
## F-statistic: 15.35 on 1 and 46 DF,  p-value: 0.0002939

The model has an intercept of 1435.75 and a slope of 0.6872 (dollars of used car price per dollar of tax and delivery charge). The adjusted \(R^2\) value is 0.234.

moran.mc(residuals(lm1),
         listw = spw_mtrx,
         nsim = 999)
## 
##  Monte-Carlo simulation of Moran I
## 
## data:  residuals(lm1) 
## weights: spw_mtrx  
## number of simulations + 1: 1000 
## 
## statistic = 0.59707, observed rank = 1000, p-value = 0.001
## alternative hypothesis: greater

Using the Monte Carlo method, the Moran’s \(I\) of the residuals is 0.597 with a \(p\)-value of 0.001, meaning that spatial autocorrelation remains in the residuals of the linear model.

Use the Lagrange Multiplier test to identify whether to use a spatial error or spatial lag model. Remember that you may need to use the robust version of this test if non-robust results are both significant. Report the \(p\)-value of each test and give the model choice

lm2 <- lm.RStests(lm1, spw_mtrx, test = c("LMerr", "LMlag"))

summary(lm2)
##  Rao's score (a.k.a Lagrange multiplier) diagnostics for spatial
##  dependence
## data:  
## model: lm(formula = price_1960 ~ tax_charge, data = usedcars)
## test weights: spw_mtrx
##  
##       statistic parameter   p.value    
## RSerr    30.989         1 2.595e-08 ***
## RSlag    40.817         1 1.672e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The spatial error test statistic is 31.0 and the spatial lag test statistic is 40.8. Both tests have very low \(p\)-values (~0).

Because both results are significant, we need to re-run the tests using the robust Lagrange multiplier versions.

lm3 <- lm.RStests(lm1, spw_mtrx, test = c("RLMerr", "RLMlag"))

summary(lm3)
##  Rao's score (a.k.a Lagrange multiplier) diagnostics for spatial
##  dependence
## data:  
## model: lm(formula = price_1960 ~ tax_charge, data = usedcars)
## test weights: spw_mtrx
##  
##          statistic parameter  p.value   
## adjRSerr   0.11465         1 0.734911   
## adjRSlag   9.94206         1 0.001615 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The robust tests tell us that a spatial lag process is the more likely source of the autocorrelation, so I will fit a spatial lag model.

Now build a spatial model linking the car prices and the tax/delivery cost, using the model you chose in the previous section (either use the lagsarlm() or errorsarlm() function).
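
For reference, the spatial lag model fitted by lagsarlm() has the form

\[
y = \rho W y + X \beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I),
\]

where \(W\) is the row-standardized spatial weight matrix and \(\rho\) is the spatial autoregressive coefficient (reported as Rho in the summary below).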

slm <- lagsarlm(price_1960 ~ tax_charge, data = usedcars, spw_mtrx)
summary(slm)
## 
## Call:lagsarlm(formula = price_1960 ~ tax_charge, data = usedcars, 
##     listw = spw_mtrx)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -60.30197 -18.81993   0.38465  23.76249  51.56415 
## 
## Type: lag 
## Coefficients: (asymptotic standard errors) 
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) 307.518016 114.156220  2.6938 0.007064
## tax_charge    0.150429   0.096269  1.5626 0.118147
## 
## Rho: 0.78495, LR test value: 47.026, p-value: 7.0068e-12
## Asymptotic standard error: 0.07618
##     z-value: 10.304, p-value: < 2.22e-16
## Wald statistic: 106.17, p-value: < 2.22e-16
## 
## Log likelihood: -237.6531 for lag model
## ML residual variance (sigma squared): 917.43, (sigma: 30.289)
## Number of observations: 48 
## Number of parameters estimated: 4 
## AIC: 483.31, (AIC for lm: 528.33)
## LM test for residual autocorrelation
## test value: 1.7967, p-value: 0.18011

Report the following information:

If using a spatial lag model: a) coefficients (and their significance); b) the value of Rho (the spatial autoregressive coefficient); c) the AIC value and the AIC value of the linear model without the spatial weight matrix

  1. The coefficients are an intercept of $307.52 and a slope of $0.15 per tax/delivery dollar. However, the tax_charge coefficient is no longer significant (\(p\) = 0.118).

  2. Rho is 0.785, with a \(p\)-value of ~0.

  3. The AIC is 483.31, compared to 528.33 for the linear model without the spatial weight matrix (a quick cross-check with AIC() is sketched below).
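
The two AIC values in the summary can also be pulled out directly; AIC() works on both the lm and lagsarlm fits (a quick sketch):

AIC(lm1, slm)   # the spatial lag model should show the lower (better) AIC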

Test for remaining autocorrelation in the residuals of the model using the moran.test() or moran.mc() function. Given the value you obtain, state whether you think that the model adequately accounts for autocorrelation?

moran.mc(residuals(slm),
         listw = spw_mtrx,
         nsim = 999)
## 
##  Monte-Carlo simulation of Moran I
## 
## data:  residuals(slm) 
## weights: spw_mtrx  
## number of simulations + 1: 1000 
## 
## statistic = -0.074992, observed rank = 306, p-value = 0.694
## alternative hypothesis: greater

The Moran’s \(I\) for the residuals is -0.075, which indicates slight dispersion, but the \(p\)-value is 0.694, so we cannot reject the null hypothesis that the residuals show no spatial autocorrelation. The spatial lag model therefore appears to account for the autocorrelation adequately.

Is the coefficient relating tax and delivery to car price significant? If not, give one reason why this may not be the case

The coefficient relating tax and delivery charges to car price is no longer significant. One reason may be that the delivery charge is a flat rate while the tax is not a constant percentage; different locations have different tax rates.


The End