Star Wars

  1. We are back to a survey collected by FiveThirtyEight. This time we are interested in a survey on Star Wars; the accompanying article is published here.
library(dplyr)
library(tidyr)
library(ggplot2)
library(readr)
starwars <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/star-wars-survey/StarWars.csv")

# The first data row holds the second half of each column header, so
# paste it onto the column names to get complete variable names:
line1 <- names(starwars)
line2 <- unlist(starwars[1,])
varnames <- paste(line1, line2)
# clean up some of the multibyte characters:
names(starwars) <- enc2native(stringi::stri_trans_general(varnames, "latin-ascii"))

starwars <- starwars[-1,]
head(starwars)
## # A tibble: 6 x 38
##   `RespondentID N~ `Have you seen ~ `Do you conside~ `Which of the f~
##              <dbl> <chr>            <chr>            <chr>           
## 1       3292879998 Yes              Yes              Star Wars: Epis~
## 2       3292879538 No               <NA>             <NA>            
## 3       3292765271 Yes              No               Star Wars: Epis~
## 4       3292763116 Yes              Yes              Star Wars: Epis~
## 5       3292731220 Yes              Yes              Star Wars: Epis~
## 6       3292719380 Yes              Yes              Star Wars: Epis~
## # ... with 34 more variables: `X5 Star Wars: Episode II Attack of the
## #   Clones` <chr>, `X6 Star Wars: Episode III Revenge of the Sith` <chr>,
## #   `X7 Star Wars: Episode IV A New Hope` <chr>, `X8 Star Wars: Episode V
## #   The Empire Strikes Back` <chr>, `X9 Star Wars: Episode VI Return of
## #   the Jedi` <chr>, `Please rank the Star Wars films in order of
## #   preference with 1 being your favorite film in the franchise and 6
## #   being your least favorite film. Star Wars: Episode I The Phantom
## #   Menace` <chr>, `X11 Star Wars: Episode II Attack of the Clones` <chr>,
## #   `X12 Star Wars: Episode III Revenge of the Sith` <chr>, `X13 Star
## #   Wars: Episode IV A New Hope` <chr>, `X14 Star Wars: Episode V The
## #   Empire Strikes Back` <chr>, `X15 Star Wars: Episode VI Return of the
## #   Jedi` <chr>, `Please state whether you view the following characters
## #   favorably, unfavorably, or are unfamiliar with him/her. Han
## #   Solo` <chr>, `X17 Luke Skywalker` <chr>, `X18 Princess Leia
## #   Organa` <chr>, `X19 Anakin Skywalker` <chr>, `X20 Obi Wan
## #   Kenobi` <chr>, `X21 Emperor Palpatine` <chr>, `X22 Darth Vader` <chr>,
## #   `X23 Lando Calrissian` <chr>, `X24 Boba Fett` <chr>, `X25
## #   C-3P0` <chr>, `X26 R2 D2` <chr>, `X27 Jar Jar Binks` <chr>, `X28 Padme
## #   Amidala` <chr>, `X29 Yoda` <chr>, `Which character shot first?
## #   Response` <chr>, `Are you familiar with the Expanded Universe?
## #   Response` <chr>, `Do you consider yourself to be a fan of the Expanded
## #   Universe?<U+FFFD><U+FFFD> Response` <chr>, `Do you consider yourself
## #   to be a fan of the Star Trek franchise? Response` <chr>, `Gender
## #   Response` <chr>, `Age Response` <chr>, `Household Income
## #   Response` <chr>, `Education Response` <chr>, `Location (Census Region)
## #   Response` <chr>
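
The header fix above can be illustrated on a small made-up data frame. This is a minimal sketch in base R, assuming the same layout as the survey file (the first data row continues the header); `toy`, `Q1`, `X2`, and `merged_names` are hypothetical names used only for illustration.

```r
# Toy data mimicking the survey layout: the first row continues the header.
toy <- data.frame(
  Q1 = c("Response", "Yes", "No"),
  X2 = c("Han Solo", "Very favorably", NA),
  stringsAsFactors = FALSE
)

# Paste each column name together with the entry in the first row ...
merged_names <- paste(names(toy), unlist(toy[1, ]))
names(toy) <- merged_names

# ... and drop that row, since it is not a real response.
toy <- toy[-1, ]

print(names(toy))
print(nrow(toy))
```

After this step every column name carries both halves of the header, and only genuine responses remain as rows.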

3.

How many people responded to the survey? How many people have seen at least one of the movies? Use the variable Have you seen any of the 6 films in the Star Wars franchise? Response to answer this question. Only consider responses of participants who have seen at least one of the Star Wars films for the remainder of the homework.

starwars%>%
  count()
## # A tibble: 1 x 1
##       n
##   <int>
## 1  1186
starwars%>%
  filter(`Have you seen any of the 6 films in the Star Wars franchise? Response`=="Yes")%>%
  count()
## # A tibble: 1 x 1
##       n
##   <int>
## 1   936
d<-starwars%>%
  filter(`Have you seen any of the 6 films in the Star Wars franchise? Response`=="Yes")

There are 1186 total responses to the survey and among these, 936 respondents have seen at least one of the movies.
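
Both numbers can also be read off a single frequency table. The following is a minimal sketch on made-up data, assuming dplyr is installed; `toy_seen` and the column `seen` are hypothetical stand-ins for the survey data and its long variable name.

```r
library(dplyr)

# Made-up answers standing in for the "Have you seen ..." variable.
toy_seen <- data.frame(
  seen = c("Yes", "No", "Yes", "Yes"),
  stringsAsFactors = FALSE
)

# One row per answer category with its frequency.
seen_counts <- toy_seen %>%
  count(seen)

# The total number of responses is the sum over all categories.
total_responses <- sum(seen_counts$n)

print(seen_counts)
print(total_responses)
```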

4.

Variables Gender Response and Age Response are two of the demographic variables collected. Use dplyr to provide a frequency breakdown for each variable. Does the result surprise you? Comment. Reorder the levels in the variable Age Response from youngest to oldest.

d%>%
  group_by(`Gender Response`)%>%
  tally()
## # A tibble: 3 x 2
##   `Gender Response`     n
##   <chr>             <int>
## 1 Female              397
## 2 Male                423
## 3 <NA>                116
d%>%
  group_by(`Age Response`)%>%
  tally()
## # A tibble: 5 x 2
##   `Age Response`     n
##   <chr>          <int>
## 1 > 60             193
## 2 18-29            180
## 3 30-44            207
## 4 45-60            240
## 5 <NA>             116
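# order the age groups from youngest to oldest; missing values stay NA
# and do not need a level of their own
d$`Age Response` <- factor(d$`Age Response`,
                           levels = c("18-29", "30-44", "45-60", "> 60"))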
d%>%
  group_by(`Gender Response`, `Age Response`)%>%
  tally()
## # A tibble: 9 x 3
## # Groups:   Gender Response [?]
##   `Gender Response` `Age Response`     n
##   <chr>             <fct>          <int>
## 1 Female            18-29             85
## 2 Female            30-44             93
## 3 Female            45-60            120
## 4 Female            > 60              99
## 5 Male              18-29             95
## 6 Male              30-44            114
## 7 Male              45-60            120
## 8 Male              > 60              94
## 9 <NA>              <NA>             116

The numbers of male and female respondents who have seen at least one of the Star Wars movies are almost the same (423 vs. 397). What is surprising is that the audience skews older: more than half of the respondents who reported their age (433 of 820) are 45 or older.
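
The reordering step can be sketched on a small made-up vector. This is only an illustration in base R; `age`, `age_f`, and `age_counts` are hypothetical names. It shows that missing values remain `NA` without being listed as a level, and that the frequency table then follows the level order rather than alphabetical order.

```r
# Made-up age groups, including one missing value.
age <- c("> 60", "18-29", NA, "45-60", "30-44", "18-29")

# List the levels from youngest to oldest; NA is not a level.
age_f <- factor(age, levels = c("18-29", "30-44", "45-60", "> 60"))

# Tabulate in level order and report missing values separately.
age_counts <- table(age_f, useNA = "ifany")

print(levels(age_f))
print(age_counts)
```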

5.

Variables 10 through 15 answer the question: “Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.” for each of the films. Bring the data set into a long form. Introduce a variable for the Star Wars episode and the corresponding ranking. Find the average rank for each of the films. Are average ranks different between men’s and women’s rankings? On how many responses are the averages based? Show these numbers together with the averages.

names(d)[10]<- "X10 Episode I  The Phantom Menace"

d.rank<-d%>%
  select(1, 10:15, 34)%>%
  gather(key = Episode, value = Rank, 2:7)

d.rank[, "Rank"]<-as.numeric(d.rank$Rank)

d.rank%>%
  group_by(Episode, `Gender Response`)%>%
  summarise(mean.rank=mean(Rank, na.rm=TRUE))
## # A tibble: 18 x 3
## # Groups:   Episode [?]
##    Episode                                      `Gender Respons~ mean.rank
##    <chr>                                        <chr>                <dbl>
##  1 X10 Episode I  The Phantom Menace            Female                3.43
##  2 X10 Episode I  The Phantom Menace            Male                  4.04
##  3 X10 Episode I  The Phantom Menace            <NA>                  3.19
##  4 X11 Star Wars: Episode II  Attack of the Cl~ Female                3.95
##  5 X11 Star Wars: Episode II  Attack of the Cl~ Male                  4.22
##  6 X11 Star Wars: Episode II  Attack of the Cl~ <NA>                  3.75
##  7 X12 Star Wars: Episode III  Revenge of the ~ Female                4.42
##  8 X12 Star Wars: Episode III  Revenge of the ~ Male                  4.27
##  9 X12 Star Wars: Episode III  Revenge of the ~ <NA>                  4.19
## 10 X13 Star Wars: Episode IV  A New Hope        Female                3.54
## 11 X13 Star Wars: Episode IV  A New Hope        Male                  3.00
## 12 X13 Star Wars: Episode IV  A New Hope        <NA>                  3.81
## 13 X14 Star Wars: Episode V The Empire Strikes~ Female                2.57
## 14 X14 Star Wars: Episode V The Empire Strikes~ Male                  2.46
## 15 X14 Star Wars: Episode V The Empire Strikes~ <NA>                  2.56
## 16 X15 Star Wars: Episode VI Return of the Jedi Female                3.08
## 17 X15 Star Wars: Episode VI Return of the Jedi Male                  3.00
## 18 X15 Star Wars: Episode VI Return of the Jedi <NA>                  3.5
d.rank%>%
  group_by(Episode, `Gender Response`)%>%
  summarise(mean.rank=mean(Rank, na.rm=TRUE))%>%
  filter(!is.na(`Gender Response`))%>%
  ggplot(aes(x = Episode, y = mean.rank, color=`Gender Response`))+
  geom_point()+
  coord_flip()

d.rank%>%
  group_by(Episode)%>%
  count()
## # A tibble: 6 x 2
## # Groups:   Episode [6]
##   Episode                                              n
##   <chr>                                            <int>
## 1 X10 Episode I  The Phantom Menace                  936
## 2 X11 Star Wars: Episode II  Attack of the Clones    936
## 3 X12 Star Wars: Episode III  Revenge of the Sith    936
## 4 X13 Star Wars: Episode IV  A New Hope              936
## 5 X14 Star Wars: Episode V The Empire Strikes Back   936
## 6 X15 Star Wars: Episode VI Return of the Jedi       936

The table shows the mean rank that respondents gave to each of the six episodes, split by gender. Because rank 1 marks the favorite film, a lower mean rank means a more preferred film. The plot suggests that the mean ranks differ between men and women. The gap is largest for Episode I, which men rank worse on average than women (4.04 vs. 3.43), and for Episode IV, which men rank better than women (3.00 vs. 3.54). Overall, Episode V is the most preferred film in both groups (2.46 for men, 2.57 for women), while Episode III is the least preferred (4.27 and 4.42). Note that the count of 936 per episode is the number of rows in the long data, including respondents who left the ranking blank; since the means are computed with na.rm=TRUE, they are based on fewer than 936 responses per film.
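
One way to report the number of responses together with the averages is to count the non-missing ranks inside the same `summarise()` call. The sketch below uses made-up toy data rather than the survey, and assumes dplyr is installed; `toy_rank`, `Gender`, `n.ranked`, and `n.rows` are hypothetical names.

```r
library(dplyr)

# Toy long-format data (made-up values): one row per respondent and episode.
toy_rank <- data.frame(
  Episode = rep(c("Episode I", "Episode V"), each = 4),
  Gender  = rep(c("Female", "Female", "Male", "Male"), times = 2),
  Rank    = c(3, NA, 5, 4, 1, 2, NA, 2),
  stringsAsFactors = FALSE
)

rank_summary <- toy_rank %>%
  group_by(Episode, Gender) %>%
  summarise(
    mean.rank = mean(Rank, na.rm = TRUE),  # average over non-missing ranks
    n.ranked  = sum(!is.na(Rank)),         # responses the average is based on
    n.rows    = n()                        # rows, including missing ranks
  ) %>%
  ungroup()

print(rank_summary)
```

In the toy data `n.rows` is 2 in every group, while `n.ranked` drops to 1 wherever a rank is missing, which is the distinction that matters when stating how many responses an average rests on.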

6.

R2 D2 or C-3P0? Which of these two characters is the more popular one? Use responses to variables 25 and 26 to answer this question. Note: first you need to define what you mean by “popularity” based on the available data.

d %>% 
  mutate(n.C3P0 = length(`X25 C-3P0`) - sum(is.na(`X25 C-3P0`))) %>%
  filter(`X25 C-3P0` == 'Somewhat favorably' | `X25 C-3P0` == 'Very favorably') %>%
  summarise(`popularity (%)` = 100* n() / n.C3P0[1])
## # A tibble: 1 x 1
##   `popularity (%)`
##              <dbl>
## 1             85.0
d %>% 
  mutate(n.R2D2 = length(`X26 R2 D2`) - sum(is.na(`X26 R2 D2`))) %>%
  filter(`X26 R2 D2` == 'Somewhat favorably' | `X26 R2 D2` == 'Very favorably') %>%
  summarise(`popularity (%)` = 100* n() / n.R2D2[1])
## # A tibble: 1 x 1
##   `popularity (%)`
##              <dbl>
## 1               90

Popularity is defined here as the percentage of respondents who rated a character ‘Somewhat favorably’ or ‘Very favorably’ among all respondents who answered the question for that character. By this measure, the popularity of C-3P0 is about 85% and that of R2 D2 is about 90%, so R2 D2 is slightly more popular among the respondents.
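
The popularity measure can be wrapped in a small helper so that the same definition is applied to every character. This is a minimal sketch in base R, not the code used above; the function name `popularity`, its `favorable` argument, and the toy answers are hypothetical.

```r
# Share (in %) of favorable answers among all non-missing answers.
popularity <- function(x,
                       favorable = c("Somewhat favorably", "Very favorably")) {
  answered <- x[!is.na(x)]           # drop respondents who skipped the item
  100 * mean(answered %in% favorable)
}

# Made-up answers for one character, including one skipped item.
toy_votes <- c("Very favorably", "Somewhat favorably",
               "Very unfavorably", NA, "Unfamiliar (N/A)")

pop <- popularity(toy_votes)
print(pop)
```

With this definition, every non-missing answer counts in the denominator, including answers that are neither favorable nor unfavorable. Excluding such answers would be a different, equally defensible, definition.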

  7. Popularity contest: which of the surveyed characters is the most popular? Use the popularity measure you defined in the previous question to evaluate responses for characters 16 through 29. Use an appropriate long form of the data to get to your answer. Visualize the result.
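# keep only the character-rating columns (16 through 29)
d.contest <- d %>%
  select(16:29)

# column 16 still carries the full question text in its name; shorten it
names(d.contest)[1] <- "X16 Han Solo"

# long form: one row per respondent and character, without unanswered items
d2 <- d.contest %>%
  gather(key = character, value = `rate of popularity`, names(d.contest)) %>%
  filter(!is.na(`rate of popularity`))

# number of favorable ratings per character
X <- d2 %>%
  filter(`rate of popularity` %in% c('Somewhat favorably', 'Very favorably')) %>%
  group_by(character) %>%
  summarise(popular.count = n())

# number of all ratings per character
Y <- d2 %>%
  group_by(character) %>%
  summarise(n = n())

# popularity = favorable ratings as a percentage of all ratings
X %>%
  full_join(Y, by = "character") %>%
  mutate(`popularity (%)` = 100 * popular.count / n) %>%
  ggplot(aes(x = character, y = `popularity (%)`)) +
  geom_point() +
  coord_flip()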

Among the respondents who have seen at least one of the Star Wars movies, the most popular character is Luke Skywalker.
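
The ranking is easier to read off the plot when the characters are sorted by popularity instead of alphabetically. The following sketch uses made-up values, not the survey results, and assumes ggplot2 is installed; `toy_pop`, `most_popular`, `p`, and the character labels are hypothetical.

```r
library(ggplot2)

# Made-up popularity values for three placeholder characters.
toy_pop <- data.frame(
  character  = c("Character A", "Character B", "Character C"),
  popularity = c(62, 91, 78),
  stringsAsFactors = FALSE
)

# Order the character labels by their popularity value.
toy_pop$character <- reorder(toy_pop$character, toy_pop$popularity)

# The character with the largest value is the most popular one.
most_popular <- as.character(toy_pop$character[which.max(toy_pop$popularity)])

# Same plot type as above, now with sorted labels.
p <- ggplot(toy_pop, aes(x = character, y = popularity)) +
  geom_point() +
  coord_flip() +
  ylab("popularity (%)")

print(most_popular)
```

Because `coord_flip()` puts the characters on the vertical axis, the highest factor level, and therefore the most popular character, ends up at the top of the plot.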