The data file comes from http://www.stat.tamu.edu/~sheather/book/data_sets.php. The original file is tab-separated; I converted it to a comma-separated CSV file.
options(scipen=999)
library(dplyr)
library(tidyverse)
library(knitr)
library(kableExtra)
library(car)
library(ggrepel)
library(ggplot2)
#Load data
MissAm.df <- read.csv("https://raw.githubusercontent.com/akulapa/Data621-Week08-Discussion/master/MissAmericato2008.csv", header= TRUE, stringsAsFactors = F)
attach(MissAm.df)
Dataset structure: the variable Top10 is the number of times a state made the top 10 list from 2000 to 2008 inclusive (nine contests). The number of times a state did not make the top 10 list is therefore 9 - Top10.
str(MissAm.df)
## 'data.frame': 51 obs. of 7 variables:
## $ abbreviation : chr "AL" "AK" "AZ" "AR" ...
## $ Top10 : int 6 0 0 4 5 0 2 0 2 3 ...
## $ LogPopulation : num 11.9 9.8 12.1 11.3 14 ...
## $ LogContestants: num 3.9 2.71 2.86 3.77 3.94 ...
## $ LogTotalArea : num 10.9 13.4 11.6 10.9 12 ...
## $ Latitude : num 32.4 58.4 33.4 34.7 38.5 ...
## $ Longitude : num 86.4 134.6 112 92.2 121.5 ...
summary(MissAm.df)
## abbreviation Top10 LogPopulation LogContestants
## Length:51 Min. :0.000 Min. : 9.693 Min. :1.910
## Class :character 1st Qu.:0.000 1st Qu.:10.843 1st Qu.:2.857
## Mode :character Median :2.000 Median :11.757 Median :3.091
## Mean :1.765 Mean :11.682 Mean :3.136
## 3rd Qu.:3.000 3rd Qu.:12.354 3rd Qu.:3.450
## Max. :7.000 Max. :14.001 Max. :4.040
## LogTotalArea Latitude Longitude
## Min. : 4.22 Min. :21.33 Min. : 69.80
## 1st Qu.:10.49 1st Qu.:35.99 1st Qu.: 78.06
## Median :10.94 Median :39.75 Median : 89.67
## Mean :10.62 Mean :39.40 Mean : 93.14
## 3rd Qu.:11.34 3rd Qu.:42.77 3rd Qu.:102.78
## Max. :13.40 Max. :58.37 Max. :157.92
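#Map state abbreviations to full names using R's built-in state.abb and state.name
state.df <- data.frame(abbreviation = state.abb, State=state.name, stringsAsFactors = F)
#InTop10 is 1 if the state made the top 10 at least once, otherwise 0
#The inner join keeps only the 50 abbreviations in state.abb, so 1 of the 51 rows is dropped
MissAm.df <- MissAm.df %>%
mutate(InTop10 = ifelse(Top10>0, 1, 0)) %>%
inner_join(state.df, by = "abbreviation")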
MissAm.df %>%
select(State, Top10, InTop10, LogPopulation, LogContestants, LogTotalArea, Latitude, Longitude) %>%
kable("html",caption = "Miss America Contest - Number of Times State Produced Top 10 Finalist`(2000 - 2008)`") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = T, position = "left", font_size = 12) %>%
scroll_box(width = "100%", height = "250px")
| State | Top10 | InTop10 | LogPopulation | LogContestants | LogTotalArea | Latitude | Longitude |
|---|---|---|---|---|---|---|---|
| Alabama | 6 | 1 | 11.9249 | 3.895894 | 10.8670 | 32.3833 | 86.367 |
| Alaska | 0 | 0 | 9.8011 | 2.708050 | 13.4049 | 58.3667 | 134.583 |
| Arizona | 0 | 0 | 12.0543 | 2.862201 | 11.6439 | 33.4333 | 112.017 |
| Arkansas | 4 | 1 | 11.2702 | 3.766997 | 10.8814 | 34.7333 | 92.233 |
| California | 5 | 1 | 14.0005 | 3.935739 | 12.0058 | 38.5167 | 121.500 |
| Colorado | 0 | 0 | 11.8820 | 2.944439 | 11.5530 | 39.7500 | 104.867 |
| Connecticut | 2 | 1 | 11.5742 | 2.944439 | 8.6203 | 41.7333 | 72.650 |
| Delaware | 0 | 0 | 10.3397 | 2.852631 | 7.8196 | 39.1333 | 75.467 |
| Florida | 3 | 1 | 13.0882 | 3.725693 | 11.0937 | 30.3833 | 84.367 |
| Georgia | 4 | 1 | 12.5558 | 3.875359 | 10.9925 | 33.6500 | 84.433 |
| Hawaii | 3 | 1 | 10.6205 | 2.564949 | 9.2994 | 21.3333 | 157.917 |
| Idaho | 0 | 0 | 10.6565 | 2.970414 | 11.3334 | 43.5667 | 116.217 |
| Illinois | 2 | 1 | 13.0341 | 3.238678 | 10.9667 | 39.8333 | 89.667 |
| Indiana | 2 | 1 | 12.3026 | 3.126761 | 10.5028 | 39.7333 | 86.283 |
| Iowa | 1 | 1 | 11.6306 | 2.879199 | 10.9380 | 41.5333 | 93.650 |
| Kansas | 1 | 1 | 11.4371 | 2.978925 | 11.3178 | 39.0667 | 95.633 |
| Kentucky | 2 | 1 | 11.7568 | 3.433987 | 10.6068 | 38.2000 | 84.867 |
| Louisiana | 2 | 1 | 12.1042 | 3.465736 | 10.8559 | 30.5333 | 91.150 |
| Maine | 0 | 0 | 10.6066 | 2.484907 | 10.4740 | 44.3167 | 69.800 |
| Maryland | 3 | 1 | 12.1239 | 3.218876 | 9.4260 | 39.0000 | 76.083 |
| Massachusetts | 2 | 1 | 12.4428 | 2.917771 | 9.2644 | 42.3667 | 71.033 |
| Michigan | 3 | 1 | 12.8257 | 3.367296 | 11.4795 | 42.7833 | 84.600 |
| Minnesota | 0 | 0 | 12.1286 | 2.724579 | 11.3730 | 44.8833 | 93.217 |
| Mississippi | 3 | 1 | 11.6233 | 3.703768 | 10.7879 | 32.3167 | 90.083 |
| Missouri | 0 | 0 | 12.1716 | 3.526361 | 11.1520 | 38.5667 | 92.183 |
| Montana | 0 | 0 | 10.2920 | 2.852631 | 11.8985 | 46.6000 | 112.000 |
| Nebraska | 0 | 0 | 11.0494 | 2.871680 | 11.2561 | 40.8500 | 96.750 |
| Nevada | 1 | 1 | 10.9462 | 2.549445 | 11.6133 | 39.1667 | 119.767 |
| New Hampshire | 1 | 1 | 10.6631 | 2.785011 | 9.1431 | 43.2000 | 71.500 |
| New Jersey | 2 | 1 | 12.5266 | 3.242592 | 9.0735 | 40.2167 | 74.767 |
| New Mexico | 0 | 0 | 11.0420 | 3.091042 | 11.7084 | 35.6167 | 106.083 |
| New York | 3 | 1 | 13.4873 | 3.072693 | 10.9070 | 42.7500 | 73.800 |
| North Carolina | 2 | 1 | 12.5005 | 3.332205 | 10.8934 | 35.8667 | 78.783 |
| North Dakota | 0 | 0 | 10.1843 | 3.032546 | 11.1662 | 46.7667 | 100.750 |
| Ohio | 0 | 0 | 12.9287 | 3.212187 | 10.7105 | 40.0000 | 82.883 |
| Oklahoma | 5 | 1 | 11.6138 | 3.784190 | 11.1548 | 35.4000 | 97.600 |
| Oregon | 1 | 1 | 11.6680 | 3.100092 | 11.4966 | 44.9167 | 123.017 |
| Pennsylvania | 4 | 1 | 13.0212 | 3.178054 | 10.7376 | 40.2000 | 76.767 |
| Rhode Island | 1 | 1 | 10.7402 | 2.656757 | 7.3428 | 41.7333 | 71.433 |
| South Carolina | 1 | 1 | 11.8946 | 3.725693 | 10.3741 | 33.9500 | 81.117 |
| South Dakota | 0 | 0 | 10.2306 | 2.674149 | 11.2531 | 44.3833 | 100.283 |
| Tennessee | 1 | 1 | 12.1059 | 3.564827 | 10.6488 | 36.1167 | 86.683 |
| Texas | 7 | 1 | 13.4640 | 3.811097 | 12.5009 | 30.3000 | 97.700 |
| Utah | 2 | 1 | 11.5123 | 4.040123 | 11.3492 | 40.7667 | 111.967 |
| Vermont | 0 | 0 | 10.0402 | 2.335375 | 9.1710 | 44.2667 | 72.567 |
| Virginia | 3 | 1 | 12.4058 | 3.258097 | 10.6637 | 37.5000 | 77.333 |
| Washington | 1 | 1 | 12.2114 | 2.995732 | 11.1747 | 46.9667 | 122.900 |
| West Virginia | 2 | 1 | 10.9746 | 3.091042 | 10.0953 | 38.3667 | 81.600 |
| Wisconsin | 3 | 1 | 12.2358 | 3.212187 | 11.0898 | 43.1333 | 89.333 |
| Wyoming | 0 | 0 | 9.6926 | 1.909542 | 11.4908 | 41.1500 | 104.817 |
Coefficients of the full logistic regression model (a generalized linear model with a logit link).
#Build model
MissAm.glm <- glm(InTop10 ~ LogPopulation + LogContestants + LogTotalArea + Latitude + Longitude, family=binomial(link = "logit"), data = MissAm.df)
MissAm.glm
##
## Call: glm(formula = InTop10 ~ LogPopulation + LogContestants + LogTotalArea +
## Latitude + Longitude, family = binomial(link = "logit"),
## data = MissAm.df)
##
## Coefficients:
## (Intercept) LogPopulation LogContestants LogTotalArea
## -11.68988 1.28041 2.81720 -1.30420
## Latitude Longitude
## -0.01544 0.03781
##
## Degrees of Freedom: 49 Total (i.e. Null); 44 Residual
## Null Deviance: 62.69
## Residual Deviance: 35.97 AIC: 47.97
Summary of the model.
summary(MissAm.glm)
##
## Call:
## glm(formula = InTop10 ~ LogPopulation + LogContestants + LogTotalArea +
## Latitude + Longitude, family = binomial(link = "logit"),
## data = MissAm.df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2553 -0.3656 0.3466 0.5164 1.9389
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -11.68988 9.58609 -1.219 0.2227
## LogPopulation 1.28041 0.64357 1.990 0.0466 *
## LogContestants 2.81720 1.65071 1.707 0.0879 .
## LogTotalArea -1.30420 0.62712 -2.080 0.0376 *
## Latitude -0.01544 0.10896 -0.142 0.8873
## Longitude 0.03781 0.03688 1.025 0.3052
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 62.687 on 49 degrees of freedom
## Residual deviance: 35.972 on 44 degrees of freedom
## AIC: 47.972
##
## Number of Fisher Scoring iterations: 5
#Manually calculating Predicted and Fitted values
# man.pre<- MissAm.glm$coefficients[1] +
# MissAm.glm$coefficients[2] * MissAm.df$LogPopulation +
# MissAm.glm$coefficients[3] * MissAm.df$LogContestants +
# MissAm.glm$coefficients[4] * MissAm.df$LogTotalArea +
# MissAm.glm$coefficients[5] * MissAm.df$Latitude +
# MissAm.glm$coefficients[6] * MissAm.df$Longitude
#
# man.fitted <- 1/(1 + (1/exp(man.pre)))
#Get coefficients
MissAm.Coe <- round(MissAm.glm$coefficients,4)
The first part of the output, Call, shows the model formula: the response variable and the predictor variables.
Logistic regression equation
\[\ln \bigg(\frac{P}{1-P}\bigg) = \beta_0 + \beta_1{X_1} + \beta_2{X_2} + \beta_3{X_3} + \beta_4{X_4} + \beta_5{X_5}\]
\[\ln \bigg(\frac{P}{1-P}\bigg) = -11.6899 + 1.2804\,\text{LogPopulation} + 2.8172\,\text{LogContestants} - 1.3042\,\text{LogTotalArea} - 0.0154\,\text{Latitude} + 0.0378\,\text{Longitude}\]
Probability is
\[P = \frac{e^{\beta_0 + \beta_1{X_1} + \beta_2{X_2} + \beta_3{X_3} + \beta_4{X_4} + \beta_5{X_5}}}{1 + {e^{\beta_0 + \beta_1{X_1} + \beta_2{X_2} + \beta_3{X_3} + \beta_4{X_4} + \beta_5{X_5}}}}\]
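As a rough check of the probability formula, the sketch below plugs the rounded full-model coefficients and the Texas row from the data table into the linear predictor and converts the result to a probability. The names `beta`, `texas`, and `eta` are illustrative and not part of the original analysis; because the coefficients are rounded to four decimals, the result only approximates what `fitted()` would return.

```r
# Rounded coefficients of the full model, copied from the fitted equation
beta <- c(Intercept = -11.6899, LogPopulation = 1.2804, LogContestants = 2.8172,
          LogTotalArea = -1.3042, Latitude = -0.0154, Longitude = 0.0378)

# Predictor values for Texas, copied from the data table
# (the leading 1 multiplies the intercept)
texas <- c(1, 13.4640, 3.811097, 12.5009, 30.3000, 97.700)

# Linear predictor, i.e. the log odds
eta <- sum(beta * texas)

# Inverse logit written out, and the equivalent built-in plogis()
p_manual  <- exp(eta) / (1 + exp(eta))
p_builtin <- plogis(eta)

print(round(c(log_odds = eta, probability = p_manual), 3))
```

Both forms give the same probability; `plogis()` is simply the numerically safer way to write the inverse logit.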
For every one-unit increase in LogPopulation, \(\ln \bigg(\frac{P}{1-P}\bigg)\) increases by 1.2804 when all other predictor variables are held constant. In other words, as the log of the state population increases by one unit, the log odds (logit) of the state making the Top10 list of finalists increase by 1.2804.
For every one-unit increase in LogContestants, \(\ln \bigg(\frac{P}{1-P}\bigg)\) increases by 2.8172 when all other predictor variables are held constant. As the log of the number of contestants from a state increases by one unit, the log odds of the state making the Top10 list increase by 2.8172.
For every one-unit increase in LogTotalArea, \(\ln \bigg(\frac{P}{1-P}\bigg)\) decreases by 1.3042 when all other predictor variables are held constant. As the log of the total area of a state increases by one unit, the log odds of the state making the Top10 list decrease by 1.3042.
For every one-degree increase in Latitude, \(\ln \bigg(\frac{P}{1-P}\bigg)\) decreases by 0.0154 when all other predictor variables are held constant. Latitude is not log-transformed: as the latitude of the state capital increases by one degree, the log odds of the state making the Top10 list decrease by 0.0154.
For every one-degree increase in Longitude, \(\ln \bigg(\frac{P}{1-P}\bigg)\) increases by 0.0378 when all other predictor variables are held constant. Longitude is not log-transformed either: as the longitude of the state capital increases by one degree, the log odds of the state making the Top10 list increase by 0.0378.
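Log odds are hard to read directly, so a common companion step is to exponentiate each coefficient into an odds ratio. The sketch below does this for the rounded full-model coefficients; the vector name `beta` is illustrative, and the original analysis does not itself report odds ratios.

```r
# Rounded full-model slope coefficients, copied from the summary output
beta <- c(LogPopulation = 1.2804, LogContestants = 2.8172, LogTotalArea = -1.3042,
          Latitude = -0.0154, Longitude = 0.0378)

# exp(coefficient) is the multiplicative change in the odds of InTop10 = 1
# for a one-unit increase in that predictor, holding the others fixed
odds_ratio <- exp(beta)

print(round(odds_ratio, 3))
```

An odds ratio above 1 corresponds to a positive coefficient (higher odds), and a ratio below 1 to a negative coefficient (lower odds).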
The model also suggests:
LogPopulation and LogTotalArea are significant at the 5% level.
LogContestants contributes to the model and is significant at the 10% level.
Latitude and Longitude are not significant.
The null deviance is 62.687 and the residual deviance is 35.972, so the predictors reduce the deviance by 26.715 using 5 degrees of freedom, which suggests the variables are needed. The lower the deviance, the better the fit.
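The deviance comparison can be turned into a formal likelihood-ratio test against the intercept-only model. This is a minimal sketch using the deviances and degrees of freedom printed in the summary; the variable names are illustrative, and the chi-squared reference distribution is the usual large-sample approximation, which may be rough with only 50 observations.

```r
# Values copied from the summary of the full model
null_dev  <- 62.687; null_df  <- 49
resid_dev <- 35.972; resid_df <- 44

# Drop in deviance and the degrees of freedom it uses
drop_dev <- null_dev - resid_dev
drop_df  <- null_df - resid_df

# Upper-tail chi-squared probability for the drop in deviance
p_value <- pchisq(drop_dev, df = drop_df, lower.tail = FALSE)

print(c(drop_dev = round(drop_dev, 3), drop_df = drop_df))
print(signif(p_value, 3))
```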
The Akaike information criterion (AIC) measures the quality of a model relative to other models fitted to the same data; the lower the AIC, the better. Since this is the full model, let's use the step function to see whether AIC improves when variables are removed.
step(MissAm.glm, test="LRT")
## Start: AIC=47.97
## InTop10 ~ LogPopulation + LogContestants + LogTotalArea + Latitude +
## Longitude
##
## Df Deviance AIC LRT Pr(>Chi)
## - Latitude 1 35.992 45.992 0.0202 0.88703
## - Longitude 1 37.292 47.292 1.3194 0.25071
## <none> 35.972 47.972
## - LogContestants 1 39.712 49.712 3.7397 0.05313 .
## - LogPopulation 1 40.709 50.709 4.7364 0.02953 *
## - LogTotalArea 1 42.067 52.067 6.0949 0.01356 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=45.99
## InTop10 ~ LogPopulation + LogContestants + LogTotalArea + Longitude
##
## Df Deviance AIC LRT Pr(>Chi)
## - Longitude 1 37.981 45.981 1.9886 0.158485
## <none> 35.992 45.992
## - LogPopulation 1 40.992 48.992 4.9993 0.025358 *
## - LogContestants 1 41.302 49.302 5.3095 0.021209 *
## - LogTotalArea 1 44.344 52.344 8.3518 0.003853 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=45.98
## InTop10 ~ LogPopulation + LogContestants + LogTotalArea
##
## Df Deviance AIC LRT Pr(>Chi)
## <none> 37.981 45.981
## - LogPopulation 1 41.708 47.708 3.7274 0.05353 .
## - LogContestants 1 42.521 48.521 4.5404 0.03310 *
## - LogTotalArea 1 44.446 50.446 6.4649 0.01100 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call: glm(formula = InTop10 ~ LogPopulation + LogContestants + LogTotalArea,
## family = binomial(link = "logit"), data = MissAm.df)
##
## Coefficients:
## (Intercept) LogPopulation LogContestants LogTotalArea
## -9.4502 1.0702 2.5447 -0.9303
##
## Degrees of Freedom: 49 Total (i.e. Null); 46 Residual
## Null Deviance: 62.69
## Residual Deviance: 37.98 AIC: 45.98
The model without Latitude and Longitude has a lower AIC (45.98, versus 47.97 for the full model). The output of step suggests that Latitude and Longitude add no value to the model. Let's build a model without them.
MissAm.glm_v1 <- glm(InTop10 ~ LogPopulation + LogContestants + LogTotalArea, family=binomial(link = "logit"), data = MissAm.df)
summary(MissAm.glm_v1)
##
## Call:
## glm(formula = InTop10 ~ LogPopulation + LogContestants + LogTotalArea,
## family = binomial(link = "logit"), data = MissAm.df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3100 -0.4617 0.3657 0.4814 2.0845
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.4502 6.1757 -1.530 0.1260
## LogPopulation 1.0702 0.6007 1.782 0.0748 .
## LogContestants 2.5447 1.3779 1.847 0.0648 .
## LogTotalArea -0.9303 0.4293 -2.167 0.0302 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 62.687 on 49 degrees of freedom
## Residual deviance: 37.981 on 46 degrees of freedom
## AIC: 45.981
##
## Number of Fisher Scoring iterations: 5
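The AIC values in the two summaries can be reproduced from the residual deviances. The sketch below relies on the fact that, for a 0/1 response, the AIC reported by `glm()` equals the residual deviance plus twice the number of estimated coefficients; it also adds a likelihood-ratio test for dropping Latitude and Longitude together, which is not in the original output. All variable names are illustrative.

```r
# Residual deviances and coefficient counts (including the intercept),
# copied from the two summaries
dev_full    <- 35.972; k_full    <- 6
dev_reduced <- 37.981; k_reduced <- 4

# AIC = deviance + 2 * (number of coefficients) for a binary response
aic_full    <- dev_full + 2 * k_full
aic_reduced <- dev_reduced + 2 * k_reduced

# Likelihood-ratio test of the reduced model against the full model
lrt     <- dev_reduced - dev_full
p_value <- pchisq(lrt, df = k_full - k_reduced, lower.tail = FALSE)

print(c(aic_full = aic_full, aic_reduced = aic_reduced))
print(round(c(lrt = lrt, p_value = p_value), 3))
```

With the fitted model objects at hand, `anova(MissAm.glm_v1, MissAm.glm, test = "LRT")` should give the same comparison without copying numbers by hand.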
mmps(MissAm.glm,layout=c(2,3),key=T)
mmps(MissAm.glm_v1,layout=c(2,3),key=T)
There is not much difference between the two sets of marginal model plots. In both, the curves drawn from the model and from the data lie between 0 and 1 for all variables.
The leverage \(h_i\) measures the distance between the \(x\) values of the \(i^{th}\) data point and the mean of the \(x\) values over all \(n\) data points. A point whose leverage exceeds \(2\times \frac{number\ of\ predictors + 1}{number\ of\ observations}\) is considered a high-leverage point. Leverage values are also known as hat values; we obtain them with the function hatvalues.
Let’s get the hat values for the model without Latitude and Longitude. We also use standardized residuals, calculated by dividing each Pearson residual \(p_i\) by \(\sqrt{1 - h_i}\). Strictly speaking, this gives standardized Pearson residuals (standardized deviance residuals divide the deviance residual instead); the code stores them in a column named sdr.
\[\text{Standardized Pearson residual}\ (r_i) = \frac{p_i}{\sqrt{1 - h_i}}\]
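To make the leverage cutoff and the two kinds of standardized residual concrete, here is a small self-contained sketch. It uses R's built-in `mtcars` data and a one-predictor logistic model purely for illustration (an assumption of this example, not the Miss America data), and compares the hand calculation with the built-in `rstandard()`.

```r
# Toy logistic regression on built-in data (not the Miss America data)
fit <- glm(am ~ wt, family = binomial(link = "logit"), data = mtcars)

h       <- hatvalues(fit)                    # leverage (hat) values
pearson <- residuals(fit, type = "pearson")
dev_res <- residuals(fit, type = "deviance")

# Pearson residual divided by sqrt(1 - h), as in the formula above
std_pearson  <- pearson / sqrt(1 - h)
# Standardized deviance residual: same scaling applied to the deviance residual
std_deviance <- dev_res / sqrt(1 - h)

# Cutoff: 2 * (number of predictors + 1) / number of observations
cutoff <- 2 * (1 + 1) / nrow(mtcars)
high_leverage <- names(h)[h > cutoff]
print(high_leverage)

# Built-in equivalents for comparison
print(all.equal(std_pearson,  rstandard(fit, type = "pearson")))
print(all.equal(std_deviance, rstandard(fit, type = "deviance")))
```

The two standardized residuals are usually close but not identical, so it is worth stating which one a plot or table reports.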
High-leverage points can be identified with the influencePlot function from the car package or calculated manually.
#Leverage cutoff
#we have 3 predictors (4 coefficients with the intercept) and 50 observations
highLeverageHat = 2 * (3+1)/50
#Leverage values
MissAm.df$hatVal <- hatvalues(MissAm.glm_v1)
#standardized Pearson residuals (stored as sdr)
#Get pearson residuals
MissAm.df$pearsonResd <- residuals(MissAm.glm_v1,'pearson')
MissAm.df$sdr <- MissAm.df$pearsonResd / (sqrt(1 - MissAm.df$hatVal))
#Cook's distance
MissAm.df$cookd <- cooks.distance(MissAm.glm_v1)
#Outlier bounds for the standardized residuals
#data points falling outside 2 standard deviations
highLeverageSdrU <- mean(MissAm.df$sdr) + (2*sd(MissAm.df$sdr))
highLeverageSdrL <- mean(MissAm.df$sdr) - (2*sd(MissAm.df$sdr))
#Influence bounds based on Cook's distance
#data points falling outside 2 standard deviations
highLeverageCookdU <- mean(MissAm.df$cookd) + (2*sd(MissAm.df$cookd))
highLeverageCookdL <- mean(MissAm.df$cookd) - (2*sd(MissAm.df$cookd))
MissAm.df$Outlier <- ifelse((MissAm.df$hatVal > highLeverageHat | MissAm.df$sdr > highLeverageSdrU | MissAm.df$sdr < highLeverageSdrL | MissAm.df$cookd > highLeverageCookdU | MissAm.df$cookd < highLeverageCookdL),'Yes','No')
Identifying high-leverage data points using the influencePlot function.
influencePlot(MissAm.glm_v1, col="red",id.n=5)
## StudRes Hat CookD
## 8 -2.2315944 0.22191328 0.45381657
## 11 1.3549485 0.15762426 0.06839409
## 23 -1.2075931 0.17274282 0.05612712
## 25 -2.2347980 0.04654186 0.11263216
## 28 2.2400994 0.07957585 0.18272920
## 34 -0.7564691 0.15896707 0.01655727
## 35 -2.4581400 0.05002072 0.18585607
## 39 0.5455872 0.21222184 0.01187986
## 45 -0.7364479 0.17412150 0.01749629
Manual calculation to identify high-leverage data points and outliers.
ggplot(data=MissAm.df, aes(hatVal,sdr)) +
geom_point(aes(col=Outlier)) +
scale_color_manual(values=c("black", "red")) +
geom_vline(xintercept=highLeverageHat, color="blue") +
geom_hline(yintercept=c(highLeverageSdrU, highLeverageSdrL), color="blue") +
geom_text_repel(data=filter(MissAm.df, (Outlier == 'Yes')), aes(hatVal,sdr, label=State), size=3) +
labs(title = sprintf("High Leverage Data Points Using GGPlot - Manually")) + xlab("Leverage(Hat-Values)") +
ylab("Standardized Pearson Residuals") +
annotate("text", x = 0.04, y = -2.3, label = 'SDR - Lower Bound', colour="blue", size = 3) +
annotate("text", x = 0.04, y = 2.3, label = 'SDR - Upper Bound', colour="blue", size = 3) +
annotate("text", x = 0.18, y = 2.5, label = 'High Leverage Hat Value', colour="blue", size = 3)
MissAm.df %>%
select(State, Top10, InTop10, LogPopulation, LogContestants, LogTotalArea, Latitude, Longitude, pearsonResd, hatVal, sdr, cookd, Outlier) %>%
filter(Outlier == 'Yes') %>%
kable("html",caption = "Miss America Contest - High Leverage Data Points") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = T, position = "left", font_size = 12) %>%
scroll_box(width = "100%", height = "250px")
| State | Top10 | InTop10 | LogPopulation | LogContestants | LogTotalArea | Latitude | Longitude | pearsonResd | hatVal | sdr | cookd | Outlier |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Delaware | 0 | 0 | 10.3397 | 2.852631 | 7.8196 | 39.1333 | 75.467 | -2.2253921 | 0.2219133 | -2.5228564 | 0.4538166 | Yes |
| Minnesota | 0 | 0 | 12.1286 | 2.724579 | 11.3730 | 44.8833 | 93.217 | -0.9430982 | 0.1727428 | -1.0368994 | 0.0561271 | Yes |
| Missouri | 0 | 0 | 12.1716 | 3.526361 | 11.1520 | 38.5667 | 92.183 | -2.9664769 | 0.0465419 | -3.0380167 | 0.1126322 | Yes |
| Nevada | 1 | 1 | 10.9462 | 2.549445 | 11.6133 | 39.1667 | 119.767 | 2.7895320 | 0.0795758 | 2.9076180 | 0.1827292 | Yes |
| Ohio | 0 | 0 | 12.9287 | 3.212187 | 10.7105 | 40.0000 | 82.883 | -3.6623304 | 0.0500207 | -3.7575127 | 0.1858561 | Yes |
| Rhode Island | 1 | 1 | 10.7402 | 2.656757 | 7.3428 | 41.7333 | 71.433 | 0.3727730 | 0.2122218 | 0.4199934 | 0.0118799 | Yes |
| Vermont | 0 | 0 | 10.0402 | 2.335375 | 9.1710 | 44.2667 | 72.567 | -0.5235920 | 0.1741215 | -0.5761491 | 0.0174963 | Yes |
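The capital of Rhode Island is at a fairly high latitude (41.7333), and both LogTotalArea (7.3428) and LogContestants (2.656757) are low, yet its contestants made the Top10 list. It is flagged for high leverage (hat value 0.2122, above the 0.16 cutoff) rather than for a large residual.
For Delaware, LogTotalArea (7.8196) is far below the average of 10.62, which pushes the predicted probability up because the LogTotalArea coefficient is negative, yet its contestants never made the Top10 list in nine years. It has the largest hat value (0.2219) and the largest Cook's distance (0.4538) in the table, so this data point looks like an outlier.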
Ohio has a LogContestants value of 3.212187, above the median of 3.091, yet its contestants never made the Top10 list in nine years. Its standardized residual (-3.76) is the largest in magnitude, so this data point looks like an outlier.
Missouri has a LogContestants value of 3.526361, above the third quartile of 3.450, yet its contestants never made the Top10 list in nine years. Its standardized residual is -3.04, so this data point also looks like an outlier.
Vermont has a LogContestants value of 2.335375 alongside a low LogTotalArea (9.1710) and LogPopulation (10.0402), and its contestants never made the Top10 list in nine years. It is flagged because its hat value (0.1741) exceeds the 0.16 cutoff; its standardized residual (-0.58) is small.
Minnesota has a LogContestants value of 2.724580 with LogTotalArea 11.3730 and LogPopulation 12.1286, and its contestants never made the Top10 list in nine years. It is also flagged for leverage (hat value 0.1727); its standardized residual (-1.04) is within the bounds.
Nevada has a LogContestants value of 2.549445, which is very low (below the first quartile of 2.857), yet contestants from the state made the Top10 list. Its hat value (0.0796) is below the cutoff, so it does not seem to be a bad leverage point.
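The residual columns in the table can be reproduced by hand, which helps show why a state is flagged. The sketch below recomputes Ohio's Pearson and standardized residuals from the rounded reduced-model coefficients and the hat value in the table; the variable names are illustrative, and small differences from the table come from coefficient rounding.

```r
# Rounded reduced-model coefficients: intercept, LogPopulation,
# LogContestants, LogTotalArea (copied from the summary output)
beta <- c(-9.4502, 1.0702, 2.5447, -0.9303)

# Ohio's row from the table: 1 for the intercept, then the three predictors
ohio_x   <- c(1, 12.9287, 3.212187, 10.7105)
ohio_y   <- 0            # InTop10
ohio_hat <- 0.0500207    # hatVal column

# Fitted probability from the inverse logit of the linear predictor
p_hat <- plogis(sum(beta * ohio_x))

# Pearson residual: (observed - fitted) / sqrt(fitted variance)
pearson <- (ohio_y - p_hat) / sqrt(p_hat * (1 - p_hat))

# Standardized residual, as stored in the sdr column
sdr <- pearson / sqrt(1 - ohio_hat)

print(round(c(p_hat = p_hat, pearson = pearson, sdr = sdr), 3))
```

The model gives Ohio a high fitted probability of reaching the Top10, so observing 0 produces a large negative residual.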
MissAm.glm <- glm(InTop10 ~ LogPopulation + LogContestants + LogTotalArea, family=binomial(link = "logit"), data = MissAm.df)
MissAm.Coe <- round(MissAm.glm$coefficients,4)
summary(MissAm.glm)
##
## Call:
## glm(formula = InTop10 ~ LogPopulation + LogContestants + LogTotalArea,
## family = binomial(link = "logit"), data = MissAm.df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3100 -0.4617 0.3657 0.4814 2.0845
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.4502 6.1757 -1.530 0.1260
## LogPopulation 1.0702 0.6007 1.782 0.0748 .
## LogContestants 2.5447 1.3779 1.847 0.0648 .
## LogTotalArea -0.9303 0.4293 -2.167 0.0302 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 62.687 on 49 degrees of freedom
## Residual deviance: 37.981 on 46 degrees of freedom
## AIC: 45.981
##
## Number of Fisher Scoring iterations: 5
The intercept (-9.4502) is smaller in magnitude than in the full model (-11.6899) after removing the Latitude and Longitude variables.
For every one-unit increase in LogPopulation, \(\ln \bigg(\frac{P}{1-P}\bigg)\) increases by 1.0702 when all other predictor variables are held constant. In other words, as the log of the state population increases by one unit, the log odds (logit) of the state making the Top10 list of finalists increase by 1.0702.
For every one-unit increase in LogContestants, \(\ln \bigg(\frac{P}{1-P}\bigg)\) increases by 2.5447 when all other predictor variables are held constant. As the log of the number of contestants from a state increases by one unit, the log odds of the state making the Top10 list increase by 2.5447.
For every one-unit increase in LogTotalArea, \(\ln \bigg(\frac{P}{1-P}\bigg)\) decreases by 0.9303 when all other predictor variables are held constant. As the log of the total area of a state increases by one unit, the log odds of the state making the Top10 list decrease by 0.9303.
The model also suggests:
LogTotalArea is significant at the 5% level.
LogContestants and LogPopulation contribute to the model and are significant at the 10% level.
The null deviance is 62.687 and the residual deviance is 37.981, so the three predictors reduce the deviance by 24.706 using 3 degrees of freedom, which suggests the variables are needed. The lower the deviance, the better the fit.
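The significance levels quoted above come from Wald tests, where each estimate is divided by its standard error and compared with a standard normal distribution. As a quick check, the sketch below recomputes the z values and two-sided p-values from the rounded estimates and standard errors in the summary; the variable names are illustrative.

```r
# Estimates and standard errors copied from the reduced-model summary
est <- c(LogPopulation = 1.0702, LogContestants = 2.5447, LogTotalArea = -0.9303)
se  <- c(LogPopulation = 0.6007, LogContestants = 1.3779, LogTotalArea = 0.4293)

# Wald z statistic and two-sided p-value from the standard normal
z       <- est / se
p_value <- 2 * pnorm(-abs(z))

print(round(cbind(z, p_value), 4))

# Predictors significant at the 5% and 10% levels
significant_5pct  <- names(p_value)[p_value < 0.05]
significant_10pct <- names(p_value)[p_value < 0.10]
print(significant_5pct)
print(significant_10pct)
```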