Diego Correa
12/3/2020
FiveThirtyEight ran a poll, surveying 1,186, respondents from June 3 to 6 in 2014. The survey included the following questions: Are you a fan?, Which movies have you watched? Age Group, Sex, Household Income, and Education.
Research questions:
Does this data provide convincing evidence of an inconsistency between the observed and expected counts between being a Star Wars fans with respect to sexes?
Can a Star Wars fan be predicted from the other questions from the survey?
str(dfSW)
'data.frame': 836 obs. of 9 variables:
$ fan : chr "Yes" "No" "Yes" "Yes" ...
$ watched_any_SW_movie : num 1 1 1 1 1 1 1 1 1 1 ...
$ Sex : chr "Male" "Male" "Male" "Male" ...
$ Age : chr "18-29" "18-29" "18-29" "18-29" ...
$ household_income : chr NA "$0 - $24,999" "$100,000 - $149,999" "$100,000 - $149,999" ...
$ Education : chr "High school degree" "High school degree" "Some college or Associate degree" "Some college or Associate degree" ...
$ location : chr "South Atlantic" "West North Central" "West North Central" "West North Central" ...
$ num_of_movies_watched: num 6 3 6 6 6 6 6 6 1 6 ...
$ fan_bool : num 1 0 1 1 1 1 1 1 0 0 ...
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 3.000 6.000 4.661 6.000 6.000
Conditions:
dfSW %>%
specify(response = fan, success = 'Yes') %>%
generate(reps = 1000, type = 'bootstrap') %>%
calculate(stat = 'prop') %>%
get_ci(level = 0.95)
# A tibble: 1 x 2
lower_ci upper_ci
<dbl> <dbl>
1 0.628 0.694
Do these data provide convincing evidence of an inconsistency between the observed and expected counts between?
Null Hypothesis: There is no inconsistency between the observed and the expected counts. The observed counts follow the same distribution as the expected counts.
Alternative Hypothesis: There is an inconsistency between the observed and the expected counts. The observed counts do not follow the same distribution as the expected counts.
The p value is < 0.0001, therefore, we reject the null hypothesis. There is a bias in Star Wars fandom by Sex.
fan_by_sex
Female Male
Yes 187 247
No 135 103
chisq.test(fan_by_sex)
Pearson's Chi-squared test with Yates' continuity correction
data: fan_by_sex
X-squared = 10.911, df = 1, p-value = 0.000956
For our logistic regression, we assume a binomial distribution produced the outcome variable, Star Wars fandom. The logistic regression uses all of the categorical and numerical variables in the data set.
The data set is seperated into a training set, with 70% of the original, and a test set, with the remaining 30%.
train.rows <- sample(nrow(dfSW), nrow(dfSW) * .7)
dfSW_train <- dfSW[train.rows,]
dfSW_test <- dfSW[-train.rows,]
Call:
glm(formula = fan_bool ~ watched_any_SW_movie + Sex + Age + household_income +
Education + num_of_movies_watched, family = binomial(link = "logit"),
data = dfSW_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1861 -0.6753 0.4672 0.6054 2.2801
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -14.76305 882.74351 -0.017 0.987
watched_any_SW_movie 11.58099 882.74353 0.013 0.990
SexMale 0.25369 0.25108 1.010 0.312
Age18-29 -0.05495 0.36737 -0.150 0.881
Age30-44 0.28856 0.34203 0.844 0.399
Age45-60 0.23113 0.33213 0.696 0.486
household_income$100,000 - $149,999 0.30378 0.44176 0.688 0.492
household_income$150,000+ 0.04340 0.49078 0.088 0.930
household_income$25,000 - $49,999 -0.05670 0.40736 -0.139 0.889
household_income$50,000 - $99,999 0.07581 0.38193 0.198 0.843
EducationGraduate degree 0.22904 0.31847 0.719 0.472
EducationHigh school degree -0.69041 0.48465 -1.425 0.154
EducationLess than high school degree 12.50168 882.74345 0.014 0.989
EducationSome college or Associate degree 0.21231 0.29922 0.710 0.478
num_of_movies_watched 0.77140 0.07613 10.133 <2e-16
(Intercept)
watched_any_SW_movie
SexMale
Age18-29
Age30-44
Age45-60
household_income$100,000 - $149,999
household_income$150,000+
household_income$25,000 - $49,999
household_income$50,000 - $99,999
EducationGraduate degree
EducationHigh school degree
EducationLess than high school degree
EducationSome college or Associate degree
num_of_movies_watched ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 614.87 on 470 degrees of freedom
Residual deviance: 437.89 on 456 degrees of freedom
(114 observations deleted due to missingness)
AIC: 467.89
Number of Fisher Scoring iterations: 13
The proportions of Star Wars fans in our entire data set is:
No Yes
0.3397129 0.6602871
The prediction results of the logistic regression model on the training set are:
No Yes
FALSE 0.23566879 0.07430998
TRUE 0.12314225 0.56687898
Overall accuracy is 77.97%, 11.95% better than guessing that everyone is a Star Wars Fan in the training data set.
The proportions of Star Wars fans in our test set is:
No Yes
0.3346614 0.6653386
The prediction results of the logistic regression model on the test set are:
No Yes
FALSE 0.1871921 0.0591133
TRUE 0.1625616 0.5911330
The model's accuracy is 80.57%, 16.83% better than guessing that everyone is a Star Wars Fan in the test data set.
The Chi - Squared Test provides convincing evidence that there is a bias in Star Wars fandom with respect to different sexes.
The logistic regression model prediction was 80.57% accurate on new data.
Limitation:
The number of movies watched is not normally distributed.
There are newer Star Wars movies which is not represented in the model.
Importance:
Hickey, Walt. “America's Favorite 'Star Wars' Movies (And Least Favorite Characters).” FivethirtyEight, 22 July 2014. https://fivethirtyeight.com/features/americas-favorite-star-wars-movies-and-least-favorite-characters/.
https://github.com/fivethirtyeight/data/raw/master/star-wars-survey/StarWars.csv