Star Wars Fans

Diego Correa
12/3/2020

Introduction

FiveThirtyEight ran a poll, surveying 1,186, respondents from June 3 to 6 in 2014. The survey included the following questions: Are you a fan?, Which movies have you watched? Age Group, Sex, Household Income, and Education.

Research questions:

  • Does this data provide convincing evidence of an inconsistency between the observed and expected counts between being a Star Wars fans with respect to sexes?

  • Can a Star Wars fan be predicted from the other questions from the survey?

Looking at the Data

str(dfSW)
'data.frame':   836 obs. of  9 variables:
 $ fan                  : chr  "Yes" "No" "Yes" "Yes" ...
 $ watched_any_SW_movie : num  1 1 1 1 1 1 1 1 1 1 ...
 $ Sex                  : chr  "Male" "Male" "Male" "Male" ...
 $ Age                  : chr  "18-29" "18-29" "18-29" "18-29" ...
 $ household_income     : chr  NA "$0 - $24,999" "$100,000 - $149,999" "$100,000 - $149,999" ...
 $ Education            : chr  "High school degree" "High school degree" "Some college or Associate degree" "Some college or Associate degree" ...
 $ location             : chr  "South Atlantic" "West North Central" "West North Central" "West North Central" ...
 $ num_of_movies_watched: num  6 3 6 6 6 6 6 6 1 6 ...
 $ fan_bool             : num  1 0 1 1 1 1 1 1 0 0 ...

Exploratory Analysis - Categorical Variables

plot of chunk unnamed-chunk-4

plot of chunk unnamed-chunk-5

Exploratory Analysis - Categorical Variables - Continued

plot of chunk unnamed-chunk-6

plot of chunk unnamed-chunk-7

Exploratory Analysis - Numerical Variables

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   3.000   6.000   4.661   6.000   6.000 

plot of chunk unnamed-chunk-9

Inference

Conditions:

  1. Independence: The sample is random, and 836 < 10% of all Americans, therefore, we can assume one respondent's reponse is independent of another.
  2. Success-failure: 552 of people answered that there are a fan of Star Wars (success) and 284 answered that they were not (failures), both are greater than 10.
dfSW %>%
  specify(response = fan, success = 'Yes') %>%
  generate(reps = 1000, type = 'bootstrap') %>%
  calculate(stat = 'prop') %>%
  get_ci(level = 0.95)
# A tibble: 1 x 2
  lower_ci upper_ci
     <dbl>    <dbl>
1    0.628    0.694

Chi - Squared Test

Do these data provide convincing evidence of an inconsistency between the observed and expected counts between?

  • Null Hypothesis: There is no inconsistency between the observed and the expected counts. The observed counts follow the same distribution as the expected counts.

  • Alternative Hypothesis: There is an inconsistency between the observed and the expected counts. The observed counts do not follow the same distribution as the expected counts.

Chi - Squared Test - Continued

The p value is < 0.0001, therefore, we reject the null hypothesis. There is a bias in Star Wars fandom by Sex.

fan_by_sex
    Female Male
Yes    187  247
No     135  103
chisq.test(fan_by_sex)

    Pearson's Chi-squared test with Yates' continuity correction

data:  fan_by_sex
X-squared = 10.911, df = 1, p-value = 0.000956

Logistic Regression - Set Up Data

For our logistic regression, we assume a binomial distribution produced the outcome variable, Star Wars fandom. The logistic regression uses all of the categorical and numerical variables in the data set.

The data set is seperated into a training set, with 70% of the original, and a test set, with the remaining 30%.

train.rows <- sample(nrow(dfSW), nrow(dfSW) * .7)
dfSW_train <- dfSW[train.rows,]
dfSW_test <- dfSW[-train.rows,]

Logistic Regression - Create Model


Call:
glm(formula = fan_bool ~ watched_any_SW_movie + Sex + Age + household_income + 
    Education + num_of_movies_watched, family = binomial(link = "logit"), 
    data = dfSW_train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.1861  -0.6753   0.4672   0.6054   2.2801  

Coefficients:
                                           Estimate Std. Error z value Pr(>|z|)
(Intercept)                               -14.76305  882.74351  -0.017    0.987
watched_any_SW_movie                       11.58099  882.74353   0.013    0.990
SexMale                                     0.25369    0.25108   1.010    0.312
Age18-29                                   -0.05495    0.36737  -0.150    0.881
Age30-44                                    0.28856    0.34203   0.844    0.399
Age45-60                                    0.23113    0.33213   0.696    0.486
household_income$100,000 - $149,999         0.30378    0.44176   0.688    0.492
household_income$150,000+                   0.04340    0.49078   0.088    0.930
household_income$25,000 - $49,999          -0.05670    0.40736  -0.139    0.889
household_income$50,000 - $99,999           0.07581    0.38193   0.198    0.843
EducationGraduate degree                    0.22904    0.31847   0.719    0.472
EducationHigh school degree                -0.69041    0.48465  -1.425    0.154
EducationLess than high school degree      12.50168  882.74345   0.014    0.989
EducationSome college or Associate degree   0.21231    0.29922   0.710    0.478
num_of_movies_watched                       0.77140    0.07613  10.133   <2e-16

(Intercept)                                  
watched_any_SW_movie                         
SexMale                                      
Age18-29                                     
Age30-44                                     
Age45-60                                     
household_income$100,000 - $149,999          
household_income$150,000+                    
household_income$25,000 - $49,999            
household_income$50,000 - $99,999            
EducationGraduate degree                     
EducationHigh school degree                  
EducationLess than high school degree        
EducationSome college or Associate degree    
num_of_movies_watched                     ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 614.87  on 470  degrees of freedom
Residual deviance: 437.89  on 456  degrees of freedom
  (114 observations deleted due to missingness)
AIC: 467.89

Number of Fisher Scoring iterations: 13

Predicted Values

The proportions of Star Wars fans in our entire data set is:


       No       Yes 
0.3397129 0.6602871 

The prediction results of the logistic regression model on the training set are:


                No        Yes
  FALSE 0.23566879 0.07430998
  TRUE  0.12314225 0.56687898

Overall accuracy is 77.97%, 11.95% better than guessing that everyone is a Star Wars Fan in the training data set.

Model Performance - Training Data

plot of chunk unnamed-chunk-17

Validating with Test Data Set

The proportions of Star Wars fans in our test set is:


       No       Yes 
0.3346614 0.6653386 

The prediction results of the logistic regression model on the test set are:


               No       Yes
  FALSE 0.1871921 0.0591133
  TRUE  0.1625616 0.5911330

The model's accuracy is 80.57%, 16.83% better than guessing that everyone is a Star Wars Fan in the test data set.

Model Performance - Test Data

plot of chunk unnamed-chunk-20

Conclusion

  • The Chi - Squared Test provides convincing evidence that there is a bias in Star Wars fandom with respect to different sexes.

  • The logistic regression model prediction was 80.57% accurate on new data.

Limitation:

  • The number of movies watched is not normally distributed.

  • There are newer Star Wars movies which is not represented in the model.

Importance:

  • Similar data can be used for marketing.

References