library(dplyr)
library(tidyr)
library(ggplot2)
library(readr)
starwars <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/star-wars-survey/StarWars.csv")
# the survey header spans two rows, so the first data row holds part of each name;
# combine both rows into proper variable names:
line1 <- names(starwars)
line2 <- unlist(starwars[1,])
varnames <- paste(line1, line2)
# clean up some of the multibyte characters:
names(starwars) <- enc2native(stringi::stri_trans_general(varnames, "latin-ascii"))
starwars <- starwars[-1,]
head(starwars)
## # A tibble: 6 x 38
## `RespondentID N~ `Have you seen ~ `Do you conside~ `Which of the f~
## <dbl> <chr> <chr> <chr>
## 1 3292879998 Yes Yes Star Wars: Epis~
## 2 3292879538 No <NA> <NA>
## 3 3292765271 Yes No Star Wars: Epis~
## 4 3292763116 Yes Yes Star Wars: Epis~
## 5 3292731220 Yes Yes Star Wars: Epis~
## 6 3292719380 Yes Yes Star Wars: Epis~
## # ... with 34 more variables: `X5 Star Wars: Episode II Attack of the
## # Clones` <chr>, `X6 Star Wars: Episode III Revenge of the Sith` <chr>,
## # `X7 Star Wars: Episode IV A New Hope` <chr>, `X8 Star Wars: Episode V
## # The Empire Strikes Back` <chr>, `X9 Star Wars: Episode VI Return of
## # the Jedi` <chr>, `Please rank the Star Wars films in order of
## # preference with 1 being your favorite film in the franchise and 6
## # being your least favorite film. Star Wars: Episode I The Phantom
## # Menace` <chr>, `X11 Star Wars: Episode II Attack of the Clones` <chr>,
## # `X12 Star Wars: Episode III Revenge of the Sith` <chr>, `X13 Star
## # Wars: Episode IV A New Hope` <chr>, `X14 Star Wars: Episode V The
## # Empire Strikes Back` <chr>, `X15 Star Wars: Episode VI Return of the
## # Jedi` <chr>, `Please state whether you view the following characters
## # favorably, unfavorably, or are unfamiliar with him/her. Han
## # Solo` <chr>, `X17 Luke Skywalker` <chr>, `X18 Princess Leia
## # Organa` <chr>, `X19 Anakin Skywalker` <chr>, `X20 Obi Wan
## # Kenobi` <chr>, `X21 Emperor Palpatine` <chr>, `X22 Darth Vader` <chr>,
## # `X23 Lando Calrissian` <chr>, `X24 Boba Fett` <chr>, `X25
## # C-3P0` <chr>, `X26 R2 D2` <chr>, `X27 Jar Jar Binks` <chr>, `X28 Padme
## # Amidala` <chr>, `X29 Yoda` <chr>, `Which character shot first?
## # Response` <chr>, `Are you familiar with the Expanded Universe?
## # Response` <chr>, `Do you consider yourself to be a fan of the Expanded
## # Universe?<U+FFFD><U+FFFD> Response` <chr>, `Do you consider yourself
## # to be a fan of the Star Trek franchise? Response` <chr>, `Gender
## # Response` <chr>, `Age Response` <chr>, `Household Income
## # Response` <chr>, `Education Response` <chr>, `Location (Census Region)
## # Response` <chr>
How many people responded to the survey? How many people have seen at least one of the movies? Use the variable `Have you seen any of the 6 films in the Star Wars franchise? Response` to answer this question. Only consider responses of participants who have seen at least one of the Star Wars films for the remainder of the homework.
starwars%>%
count()
## # A tibble: 1 x 1
## n
## <int>
## 1 1186
starwars%>%
filter(`Have you seen any of the 6 films in the Star Wars franchise? Response`=="Yes")%>%
count()
## # A tibble: 1 x 1
## n
## <int>
## 1 936
d<-starwars%>%
filter(`Have you seen any of the 6 films in the Star Wars franchise? Response`=="Yes")
There are 1186 total responses to the survey and among these, 936 respondents have seen at least one of the movies.
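One detail worth keeping in mind for the counts above: `filter()` keeps only rows where the condition is `TRUE`, so a missing answer is dropped along with every "No". A minimal sketch on made-up answers (the `toy` table and its `seen` column are hypothetical, not the survey data):

```r
library(dplyr)

# Hypothetical answers, including one missing value
toy <- data.frame(seen = c("Yes", "No", NA, "Yes"), stringsAsFactors = FALSE)

n_total <- nrow(toy)                      # all responses
n_seen <- toy %>%
  filter(seen == "Yes") %>%               # NA == "Yes" is NA, so that row is dropped
  nrow()

c(total = n_total, seen = n_seen)
```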
Variables `Gender Response` and `Age Response` are two of the demographic variables collected. Use dplyr to provide a frequency breakdown for each variable. Does the result surprise you? Comment. Reorder the levels in the variable `Age Response` from youngest to oldest.
d%>%
group_by(`Gender Response`)%>%
tally()
## # A tibble: 3 x 2
## `Gender Response` n
## <chr> <int>
## 1 Female 397
## 2 Male 423
## 3 <NA> 116
d%>%
group_by(`Age Response`)%>%
tally()
## # A tibble: 5 x 2
## `Age Response` n
## <chr> <int>
## 1 > 60 193
## 2 18-29 180
## 3 30-44 207
## 4 45-60 240
## 5 <NA> 116
# order the age groups from youngest to oldest; missing ages stay NA
d$`Age Response`<- factor(d$`Age Response`,
levels= c("18-29", "30-44", "45-60", "> 60"))
d%>%
group_by(`Gender Response`, `Age Response`)%>%
tally()
## # A tibble: 9 x 3
## # Groups: Gender Response [?]
## `Gender Response` `Age Response` n
## <chr> <fct> <int>
## 1 Female 18-29 85
## 2 Female 30-44 93
## 3 Female 45-60 120
## 4 Female > 60 99
## 5 Male 18-29 95
## 6 Male 30-44 114
## 7 Male 45-60 120
## 8 Male > 60 94
## 9 <NA> <NA> 116
The numbers of male and female respondents who have seen at least one of the Star Wars movies are almost the same (423 and 397). What is surprising is that the audience skews older: among the respondents who reported their age, more than half (433 of 820) are 45 or older.
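The share of respondents aged 45 or older can be checked directly from the age counts in the tally above. This is a small arithmetic sketch in base R; the vector name `age_counts` is introduced here for illustration, and the 116 respondents with a missing age are left out of the denominator.

```r
# Age counts copied from the tally above (missing ages excluded)
age_counts <- c("18-29" = 180, "30-44" = 207, "45-60" = 240, "> 60" = 193)

n_reported <- sum(age_counts)                       # respondents who reported an age
n_45_plus <- sum(age_counts[c("45-60", "> 60")])    # respondents aged 45 or older
share_45_plus <- n_45_plus / n_reported

round(100 * share_45_plus, 1)   # percent aged 45 or older
```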
Variables 10 through 15 answer the question: “Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.” for each of the films. Bring the data set into a long form. Introduce a variable for the Star Wars episode and the corresponding ranking. Find the average rank for each of the films. Are average ranks different between men’s and women’s rankings? On how many responses are the averages based? Show these numbers together with the averages.
names(d)[10]<- "X10 Episode I The Phantom Menace"
d.rank<-d%>%
select(1, 10:15, 34)%>%
gather(key = Episode, value = Rank, 2:7)
d.rank[, "Rank"]<-as.numeric(d.rank$Rank)
d.rank%>%
group_by(Episode, `Gender Response`)%>%
summarise(mean.rank=mean(Rank, na.rm=TRUE))
## # A tibble: 18 x 3
## # Groups: Episode [?]
## Episode `Gender Respons~ mean.rank
## <chr> <chr> <dbl>
## 1 X10 Episode I The Phantom Menace Female 3.43
## 2 X10 Episode I The Phantom Menace Male 4.04
## 3 X10 Episode I The Phantom Menace <NA> 3.19
## 4 X11 Star Wars: Episode II Attack of the Cl~ Female 3.95
## 5 X11 Star Wars: Episode II Attack of the Cl~ Male 4.22
## 6 X11 Star Wars: Episode II Attack of the Cl~ <NA> 3.75
## 7 X12 Star Wars: Episode III Revenge of the ~ Female 4.42
## 8 X12 Star Wars: Episode III Revenge of the ~ Male 4.27
## 9 X12 Star Wars: Episode III Revenge of the ~ <NA> 4.19
## 10 X13 Star Wars: Episode IV A New Hope Female 3.54
## 11 X13 Star Wars: Episode IV A New Hope Male 3.00
## 12 X13 Star Wars: Episode IV A New Hope <NA> 3.81
## 13 X14 Star Wars: Episode V The Empire Strikes~ Female 2.57
## 14 X14 Star Wars: Episode V The Empire Strikes~ Male 2.46
## 15 X14 Star Wars: Episode V The Empire Strikes~ <NA> 2.56
## 16 X15 Star Wars: Episode VI Return of the Jedi Female 3.08
## 17 X15 Star Wars: Episode VI Return of the Jedi Male 3.00
## 18 X15 Star Wars: Episode VI Return of the Jedi <NA> 3.5
d.rank%>%
group_by(Episode, `Gender Response`)%>%
summarise(mean.rank=mean(Rank, na.rm=TRUE))%>%
filter(!is.na(`Gender Response`))%>%
ggplot(aes(x = Episode, y = mean.rank, color=`Gender Response`))+
geom_point()+
coord_flip()
# note: count() tallies rows per episode, including rows with a missing Rank
d.rank%>%
group_by(Episode)%>%
count()
## # A tibble: 6 x 2
## # Groups: Episode [6]
## Episode n
## <chr> <int>
## 1 X10 Episode I The Phantom Menace 936
## 2 X11 Star Wars: Episode II Attack of the Clones 936
## 3 X12 Star Wars: Episode III Revenge of the Sith 936
## 4 X13 Star Wars: Episode IV A New Hope 936
## 5 X14 Star Wars: Episode V The Empire Strikes Back 936
## 6 X15 Star Wars: Episode VI Return of the Jedi 936
The table shows the mean rank that respondents gave to each of the six episodes; because 1 marks the favorite film, a lower mean rank means a more popular film. The plot shows that the mean rank of each episode differs between men and women. The gap is largest for Episode I, which women ranked more favorably than men (3.43 vs. 4.04), and for Episode IV, which men ranked more favorably than women (3.00 vs. 3.54). Overall, Episode V is the most popular film for both groups (2.57 for women, 2.46 for men), while Episode III has the worst mean rank (4.42 for women, 4.27 for men). Each episode has 936 rows in the long data set, one per respondent who has seen at least one film; since this count includes missing ranks, the averages are based on at most 936 responses per episode.
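To report the number of responses next to each average, the non-missing ranks can be counted in the same `summarise()` call. A minimal sketch on made-up data (the `toy_rank` table and its values are hypothetical, not survey results):

```r
library(dplyr)

# Hypothetical long-format rankings; NA marks a film the respondent did not rank
toy_rank <- data.frame(
  Episode = c("I", "I", "I", "V", "V", "V"),
  Rank = c(4, 5, NA, 1, 2, NA),
  stringsAsFactors = FALSE
)

rank_summary <- toy_rank %>%
  group_by(Episode) %>%
  summarise(
    mean.rank = mean(Rank, na.rm = TRUE),  # average over non-missing ranks only
    n.ranked = sum(!is.na(Rank)),          # responses the average is based on
    n.rows = n()                           # all rows, including missing ranks
  )

rank_summary
```

In this toy table `n.rows` is 3 for each episode while `n.ranked` is 2, which is the difference between counting rows and counting the responses that actually enter the mean.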
R2 D2 or C-3P0? Which of these two characters is the more popular one? Use responses to variables 25 and 26 to answer this question. Note: first you need to define what you mean by “popularity” based on the available data.
d %>%
mutate(n.C3P0 = length(`X25 C-3P0`) - sum(is.na(`X25 C-3P0`))) %>%
filter(`X25 C-3P0` == 'Somewhat favorably' | `X25 C-3P0` == 'Very favorably') %>%
summarise(`popularity (%)` = 100* n() / n.C3P0[1])
## # A tibble: 1 x 1
## `popularity (%)`
## <dbl>
## 1 85.0
d %>%
mutate(n.R2D2 = length(`X26 R2 D2`) - sum(is.na(`X26 R2 D2`))) %>%
filter(`X26 R2 D2` == 'Somewhat favorably' | `X26 R2 D2` == 'Very favorably') %>%
summarise(`popularity (%)` = 100* n() / n.R2D2[1])
## # A tibble: 1 x 1
## `popularity (%)`
## <dbl>
## 1 90
Popularity is defined here as the percentage of respondents who rated a character ‘Somewhat favorably’ or ‘Very favorably’, among those who answered the question for that character. The popularity of C-3P0 is about 85% and that of R2 D2 is about 90%, so R2 D2 is a little more popular among respondents.
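The same popularity measure can be written more compactly as the mean of a logical vector. This is a sketch in base R under the definition used here (favorable ratings divided by non-missing ratings); the `ratings` vector is made up for illustration.

```r
# Hypothetical ratings for one character; NA means the question was skipped
ratings <- c("Very favorably", "Somewhat favorably", "Unfamiliar",
             "Very unfavorably", NA, "Very favorably")

favorable <- c("Somewhat favorably", "Very favorably")
answered <- ratings[!is.na(ratings)]               # drop skipped questions

# percent of favorable ratings among those who answered
popularity <- 100 * mean(answered %in% favorable)
popularity
```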
d%>%
select(16:29)-> d.contest
names(d.contest)[1]<-"X16 Han Solo"
# long form: one row per respondent and character, dropping missing ratings
d2 <- d.contest %>%
gather(key = character, value = `rate of popularity`, everything()) %>%
filter(!is.na(`rate of popularity`))
# number of favorable ratings per character
X <- d2 %>%
filter(`rate of popularity` %in% c('Somewhat favorably','Very favorably')) %>%
group_by(character) %>%
summarise(popular.count = n())
# number of non-missing ratings per character
Y <- d2 %>% group_by(character) %>% summarise(n=n())
X %>% full_join(Y, by = "character") %>%
mutate(`popularity (%)` = 100 * popular.count / n) %>%
ggplot(aes(x = character, y = `popularity (%)`)) +
geom_point()+
coord_flip()
The most popular character among the respondents who have seen at least one of the Star Wars movies is Luke Skywalker.
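Reading the top character off a dot plot can be error-prone when several points are close together, so one option is to sort the popularity table as well. A minimal sketch on made-up ratings (the `toy` table, the character labels "A" and "B", and the `rating` column are hypothetical):

```r
library(dplyr)

# Hypothetical long-format ratings, one row per respondent and character
toy <- data.frame(
  character = c("A", "A", "A", "B", "B", "B"),
  rating = c("Very favorably", "Somewhat favorably", "Very unfavorably",
             "Very favorably", "Very unfavorably", NA),
  stringsAsFactors = FALSE
)

ranking <- toy %>%
  filter(!is.na(rating)) %>%                       # drop skipped questions
  group_by(character) %>%
  summarise(`popularity (%)` =
              100 * mean(rating %in% c("Somewhat favorably", "Very favorably"))) %>%
  arrange(desc(`popularity (%)`))                  # most popular character first

ranking
```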