Data621 - Exercise 8.2 ELMR

library(dplyr)
library(ggplot2)
library(gtable)

This question is based on one of the data sets discussed in an unpublished manuscript by Powell, T. and Sheather, S. (2008) entitled “ A Theory of Extreme Competition ”. According to Powell and Sheather: This paper develops a model of competitive performance when populations compete … We present a theoretical framework … and empirical tests in chess and … national pageants. The findings show that performance in these domains is substantially predictable from a few observable features of population and economic geography. In this question we shall consider data from the Miss America pageant, which was founded in Atlantic City in 1921, and 81 pageants have been conducted through 2008. In particular we will develop a logistic regression model for the proportion of top ten finalists for each US state for the years 2000 to 2008. According to Powell and Sheather:

Eligibility for the Miss America pageant is limited to never-married female U.S. citizens between the ages of 17 and 24. To measure population size, we obtained data for this demographic segment for each U.S. state and the District of Columbia from the 2000 U.S. Census. As a measure of participation inducements, we obtained data on the number of qualifying pageants conducted in each state, on the assumption that qualifying pageants reflect state-level infrastructure and resource commitments. As a geographic measure, we used the latitude and longitude of each state capital and Washington DC, on the assumption that state locations convey information about the regional cultural geography of beauty pageants (in particular, beauty pageants are widely believed to receive greater cultural support south of the Mason-Dixon line). To measure search efficacy, we obtained data on the total land and water area (in square miles) for each state and the District of Columbia, on the assumption that search is more difficult over larger geographic areas.

They consider the following outcome variable and potential predictor variables:

Y = Number of times each US state (and the District of Columbia) has produced a top ten finalist for the years 2000–2008 x1 = log(population size) x2 = Log(average number of contestants in each state’s final qualifying pageant each year between 2002 and 2007) x3 = Log(geographic area of each state and the District of Columbia) x4 = Latitude of each state capitol and x5 = Longitude of each state capitol, and

The data can be found on the course web site in the file . MissAmericato2008.txt.

ms_america <- read.csv('https://raw.githubusercontent.com/hillt5/DATA_621/master/DataSets/MissAmericato2008.txt', sep = '\t', stringsAsFactors = FALSE)

head(ms_america)

##   abbreviation Top10 LogPopulation LogContestants LogTotalArea Latitude
## 1           AL     6       11.9249       3.895894      10.8670  32.3833
## 2           AK     0        9.8011       2.708050      13.4049  58.3667
## 3           AZ     0       12.0543       2.862201      11.6439  33.4333
## 4           AR     4       11.2702       3.766997      10.8814  34.7333
## 5           CA     5       14.0005       3.935740      12.0058  38.5167
## 6           CO     0       11.8820       2.944439      11.5530  39.7500
##   Longitude
## 1    86.367
## 2   134.583
## 3   112.017
## 4    92.233
## 5   121.500
## 6   104.867

hist(ms_america$Top10)

summary(ms_america$Top10)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   2.000   1.765   3.000   7.000

ms_america_binom <- ms_america
ms_america_binom <- ms_america_binom %>%
  mutate(TOP_10_YES_NO = case_when(Top10 != 0 ~ 1, Top10 == 0 ~0))

The initial response variable is inappropriate for consideration using a logistic regression model. .I added a variable for predicting whether a state ever placed in the top 10 for a binary response

Develop a logistic regression model that predicts y from x1, x2, x3, x4 and x5 such that each of the predictors is significant at least at the 5% level. Use marginal model plots to check the validity of the full model and the final model (if it is different from the full model).

Following the workflow for marginal model plots from MARR. The section of the book in question is 8.2.4 starting on page 286.

ms_america_model <- glm(data = ms_america_binom, formula = TOP_10_YES_NO ~.-abbreviation-Top10, family = 'binomial')

summary(ms_america_model)

## 
## Call:
## glm(formula = TOP_10_YES_NO ~ . - abbreviation - Top10, family = "binomial", 
##     data = ms_america_binom)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2560  -0.3566   0.3286   0.5079   1.9395  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)  
## (Intercept)    -11.66249    9.59490  -1.215   0.2242  
## LogPopulation    1.28304    0.64317   1.995   0.0461 *
## LogContestants   2.81809    1.65138   1.707   0.0879 .
## LogTotalArea    -1.31289    0.61719  -2.127   0.0334 *
## Latitude        -0.01528    0.10916  -0.140   0.8887  
## Longitude        0.03810    0.03677   1.036   0.3002  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 63.449  on 50  degrees of freedom
## Residual deviance: 35.981  on 45  degrees of freedom
## AIC: 47.981
## 
## Number of Fisher Scoring iterations: 5

The initial model shows population and total area are significant predictors. The correlation is positive for population and negative for total area, meaning smaller states are more likely to make the top 10. Notably, latitude is not significant at the p = 0.05 level, however its coefficient is negative indicating that more southerly states would have greater representation in the top 10.

I wasn’t able to use the ‘cars’ package from MARR, however I visualized the marginal plots using ggplot2.

ggplot(ms_america_binom, aes(LogPopulation, TOP_10_YES_NO)) +
  geom_point() +
  geom_smooth(method = 'loess', se = FALSE) +
  geom_smooth(aes(ms_america_binom$LogPopulation, ms_america_model$fitted.values),method = 'loess', se = FALSE, color =  'red')

## Warning: Use of `ms_america_binom$LogPopulation` is discouraged. Use
## `LogPopulation` instead.

## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

ggplot(ms_america_binom, aes(LogContestants, TOP_10_YES_NO)) +
  geom_point() +
  geom_smooth(method = 'loess', se = FALSE) +
  geom_smooth(aes(ms_america_binom$LogContestants, ms_america_model$fitted.values),method = 'loess', se = FALSE, color =  'red')

## Warning: Use of `ms_america_binom$LogContestants` is discouraged. Use
## `LogContestants` instead.

## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

ggplot(ms_america_binom, aes(LogTotalArea, TOP_10_YES_NO)) +
  geom_point() +
  geom_smooth(method = 'loess', se = FALSE) +
  geom_smooth(aes(ms_america_binom$LogTotalArea, ms_america_model$fitted.values),method = 'loess', se = FALSE, color =  'red')

## Warning: Use of `ms_america_binom$LogTotalArea` is discouraged. Use
## `LogTotalArea` instead.

## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

ggplot(ms_america_binom, aes(Latitude, TOP_10_YES_NO)) +
  geom_point() +
  geom_smooth(method = 'loess', se = FALSE) +
  geom_smooth(aes(ms_america_binom$Latitude, ms_america_model$fitted.values),method = 'loess', se = FALSE, color =  'red')

## Warning: Use of `ms_america_binom$Latitude` is discouraged. Use `Latitude`
## instead.

## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

ggplot(ms_america_binom, aes(Longitude, TOP_10_YES_NO)) +
  geom_point() +
  geom_smooth(method = 'loess', se = FALSE) +
  geom_smooth(aes(ms_america_binom$Longitude, ms_america_model$fitted.values),method = 'loess', se = FALSE, color =  'red')

## Warning: Use of `ms_america_binom$Longitude` is discouraged. Use `Longitude`
## instead.

## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

In these graphs, the red line is the LOESS curve of the predicted values, while blue is the curve fitted to actual values.

Next, let’s look at some interaction effects. There are three I’m concerned with: population vs contestants, population versus area, and longitude versus area.

plot(ms_america_binom$LogPopulation,ms_america_binom$LogContestants,pch=ms_america_binom$TOP_10_YES_NO+1,col=ms_america_binom$TOP_10_YES_NO+1,xlab="Log of Population",ylab="Log of Contestants")
abline(lsfit(ms_america_binom$LogPopulation[ms_america_binom$TOP_10_YES_NO==0],ms_america_binom$LogContestants[ms_america_binom$TOP_10_YES_NO==0]),lty=1,col=1)
abline(lsfit(ms_america_binom$LogPopulation[ms_america_binom$TOP_10_YES_NO==1],ms_america_binom$LogContestants[ms_america_binom$TOP_10_YES_NO==1]),lty=2,col=2)

plot(ms_america_binom$LogPopulation,ms_america_binom$LogTotalArea,pch=ms_america_binom$TOP_10_YES_NO+1,col=ms_america_binom$TOP_10_YES_NO+1,xlab="Log of Population",ylab="Log of Total Area")
abline(lsfit(ms_america_binom$LogPopulation[ms_america_binom$TOP_10_YES_NO==0],ms_america_binom$LogTotalArea[ms_america_binom$TOP_10_YES_NO==0]),lty=1,col=1)
abline(lsfit(ms_america_binom$LogPopulation[ms_america_binom$TOP_10_YES_NO==1],ms_america_binom$LogTotalArea[ms_america_binom$TOP_10_YES_NO==1]),lty=2,col=2)

plot(ms_america_binom$Longitude,ms_america_binom$LogTotalArea,pch=ms_america_binom$TOP_10_YES_NO+1,col=ms_america_binom$TOP_10_YES_NO+1,xlab="Longitude",ylab="Log of Total Area")
abline(lsfit(ms_america_binom$Longitude[ms_america_binom$TOP_10_YES_NO==0],ms_america_binom$LogTotalArea[ms_america_binom$TOP_10_YES_NO==0]),lty=1,col=1)
abline(lsfit(ms_america_binom$Longitude[ms_america_binom$TOP_10_YES_NO==1],ms_america_binom$LogTotalArea[ms_america_binom$TOP_10_YES_NO==1]),lty=2,col=2)

Population versus area doesn’t appear to have a significant effect on slope,while the other two may be different.

ms_america_model_final <- glm(data = ms_america_binom, formula = TOP_10_YES_NO ~LogTotalArea + LogTotalArea + LogPopulation:LogContestants, family = 'binomial')

summary(ms_america_model_final)

## 
## Call:
## glm(formula = TOP_10_YES_NO ~ LogTotalArea + LogTotalArea + LogPopulation:LogContestants, 
##     family = "binomial", data = ms_america_binom)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2162  -0.5135   0.3324   0.4738   2.1114  
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   1.01144    4.19742   0.241 0.809581    
## LogTotalArea                 -0.91410    0.39845  -2.294 0.021784 *  
## LogPopulation:LogContestants  0.26836    0.07752   3.462 0.000536 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 63.449  on 50  degrees of freedom
## Residual deviance: 38.295  on 48  degrees of freedom
## AIC: 44.295
## 
## Number of Fisher Scoring iterations: 5

ggplot(ms_america_binom, aes(LogPopulation, TOP_10_YES_NO)) +
  geom_point() +
  geom_smooth(method = 'loess', se = FALSE) +
  geom_smooth(aes(ms_america_binom$LogPopulation, ms_america_model_final$fitted.values),method = 'loess', se = FALSE, color =  'red')

## Warning: Use of `ms_america_binom$LogPopulation` is discouraged. Use
## `LogPopulation` instead.

## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

ggplot(ms_america_binom, aes(LogContestants, TOP_10_YES_NO)) +
  geom_point() +
  geom_smooth(method = 'loess', se = FALSE) +
  geom_smooth(aes(ms_america_binom$LogContestants, ms_america_model_final$fitted.values),method = 'loess', se = FALSE, color =  'red')

## Warning: Use of `ms_america_binom$LogContestants` is discouraged. Use
## `LogContestants` instead.

## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

ggplot(ms_america_binom, aes(LogTotalArea, TOP_10_YES_NO)) +
  geom_point() +
  geom_smooth(method = 'loess', se = FALSE) +
  geom_smooth(aes(ms_america_binom$LogTotalArea, ms_america_model_final$fitted.values),method = 'loess', se = FALSE, color =  'red')

## Warning: Use of `ms_america_binom$LogTotalArea` is discouraged. Use
## `LogTotalArea` instead.

## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

ggplot(ms_america_binom, aes(Latitude, TOP_10_YES_NO)) +
  geom_point() +
  geom_smooth(method = 'loess', se = FALSE) +
  geom_smooth(aes(ms_america_binom$Latitude, ms_america_model_final$fitted.values),method = 'loess', se = FALSE, color =  'red')

## Warning: Use of `ms_america_binom$Latitude` is discouraged. Use `Latitude`
## instead.

## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

ggplot(ms_america_binom, aes(Longitude, TOP_10_YES_NO)) +
  geom_point() +
  geom_smooth(method = 'loess', se = FALSE) +
  geom_smooth(aes(ms_america_binom$Longitude, ms_america_model_final$fitted.values),method = 'loess', se = FALSE, color =  'red')

## Warning: Use of `ms_america_binom$Longitude` is discouraged. Use `Longitude`
## instead.

## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

This model’s predicted values based on the marginal latitude have a closer LOESS curve when compared to the actual data. However, The curve for longitude still is difficult to interpret.

Identify any leverage points in the final model developed in (a). Decide if they are “bad” leverage points.

One leverage point is made obvious in the marginal plots for Longitude: there appears to be two western states cause some discrepancies between the two LOESS curves.

ms_america %>%
  arrange(desc(Longitude)) %>%
  head()

##   abbreviation Top10 LogPopulation LogContestants LogTotalArea Latitude
## 1           HI     3       10.6205       2.564949       9.2994  21.3333
## 2           AK     0        9.8011       2.708050      13.4049  58.3667
## 3           OR     1       11.6680       3.100092      11.4966  44.9167
## 4           WA     1       12.2114       2.995732      11.1747  46.9667
## 5           CA     5       14.0005       3.935740      12.0058  38.5167
## 6           NV     1       10.9462       2.549445      11.6133  39.1667
##   Longitude
## 1   157.917
## 2   134.583
## 3   123.017
## 4   122.900
## 5   121.500
## 6   119.767

Hawaii and Alaska are the two western states in question. These states are not part of the contiguous United States, so they are quite literally outliers.

ms_america_binom$hvalues <- influence(ms_america_model)$hat
ms_america_binom$stanresDeviance <- residuals(ms_america_model)/sqrt(1-ms_america_binom$hvalues)
plot(ms_america_binom$hvalues,ms_america_binom$stanresDeviance,ylab="Standardized Deviance Residuals",xlab="Leverage Values")
abline(v=2*7/length(ms_america_binom$TOP_10_YES_NO),lty=2)
text(ms_america_binom$hvalues,ms_america_binom$stanresDeviance,labels=ms_america_binom$abbreviation)

ms_america_binom$hvalues_f <- influence(ms_america_model_final)$hat
ms_america_binom$stanresDeviance_f <- residuals(ms_america_model_final)/sqrt(1-ms_america_binom$hvalues_f)
plot(ms_america_binom$hvalues_f,ms_america_binom$stanresDeviance_f,ylab="Standardized Deviance Residuals",xlab="Leverage Values")
abline(v=2*7/length(ms_america_binom$TOP_10_YES_NO),lty=2)
text(ms_america_binom$hvalues_f,ms_america_binom$stanresDeviance_f,labels=ms_america_binom$abbreviation)

It appears in the initial model, Hawaii was a high leverage point. However, in the final model considered this was not the case.

ms_america_no_hi <- ms_america_binom %>%
  filter(Longitude < 150)

lm_no_hi <- glm(data = ms_america_no_hi, formula = TOP_10_YES_NO ~LogPopulation + LogContestants + LogTotalArea + Latitude + Longitude, family = 'binomial')

summary(lm_no_hi)

## 
## Call:
## glm(formula = TOP_10_YES_NO ~ LogPopulation + LogContestants + 
##     LogTotalArea + Latitude + Longitude, family = "binomial", 
##     data = ms_america_no_hi)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2754  -0.3861   0.3190   0.5089   2.0454  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)  
## (Intercept)    -13.011958  10.063483  -1.293   0.1960  
## LogPopulation    1.239679   0.645059   1.922   0.0546 .
## LogContestants   2.970205   1.694941   1.752   0.0797 .
## LogTotalArea    -1.185150   0.659980  -1.796   0.0725 .
## Latitude         0.008202   0.122736   0.067   0.9467  
## Longitude        0.027987   0.043697   0.640   0.5219  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 62.687  on 49  degrees of freedom
## Residual deviance: 35.762  on 44  degrees of freedom
## AIC: 47.762
## 
## Number of Fisher Scoring iterations: 5

Removing Hawaii, which is also the most southern state as well as the most western, indicates that latitude (i.e., north or south of Mason-Dixon line) a non-significant variable.

Interpret the regression coefficients of the final model developed in (a).

Compared to the inital model, the final model uses total area and the interaction between contestants and population to predict whether a state made the top 10 in the 9 year period being studied. The interaction between contestants and population is positively correlated, meaning a state is more likely to place if it offers more contestants and a larger population. Total area is negatively correlated, meaning smaller states have better representation.

My interpretation of using logistic regression for this exercise digressed from the stated goal of predicting y, because y was a count and logistic regression can only predict values between 0 and 1. Another way of predicting y could have been to divide the Top10 column by 9 to obtain the annual probability of placing in the top 10 per year. Consideration of repeat placements could potentially reinforce the trends seen for a binary response.

Data621 - Exercise 8.2 ELMR

Thomas Hill

4/7/2021