library(classdata)
library(tidyverse)
## -- Attaching packages -------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.0.0 v purrr 0.2.5
## v tibble 1.4.2 v dplyr 0.7.6
## v tidyr 0.8.1 v stringr 1.3.1
## v readr 1.1.1 v forcats 0.3.0
## -- Conflicts ----------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(ggplot2)
library(dplyr)
FiveThirtyEight is a website founded by statistician and writer Nate Silver that publishes opinion poll analysis along with blogging on politics, economics, and sports. One of the featured articles considers flying etiquette. This article is based on data collected by FiveThirtyEight and publicly available on GitHub. Use the code below to read in the data from the survey:
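# stringsAsFactors = TRUE keeps text columns as factors (the default changed in R 4.0);
# the level manipulations further down rely on factors
fly <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/flying-etiquette-survey/flying-etiquette.csv", stringsAsFactors = TRUE)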
The following lines of code clean up the demographic information by reordering the levels of the corresponding factor variables. Run this code in your session.
fly$Age <- factor(fly$Age, levels=c("18-29", "30-44", "45-60", "> 60", ""))
fly$Household.Income <- factor(fly$Household.Income, levels = c("$0 - $24,999","$25,000 - $49,999", "$50,000 - $99,999", "$100,000 - $149,999", "150000", ""))
fly$Education <- factor(fly$Education, levels = c("Less than high school degree", "High school degree", "Some college or Associate degree", "Bachelor degree", "Graduate degree", ""))
Some people do not travel often by plane. Provide a (visual) breakdown of travel frequency (use variable How.often.do.you.travel.by.plane.). Reorder the levels in the variable by travel frequency from least frequent travel to most frequent. Draw a barchart of travel frequency and comment on it.
fly%>%
ggplot(aes(x = How.often.do.you.travel.by.plane.))+
geom_bar()+
coord_flip()
fly %>%
mutate(How.often.do.you.travel.by.plane.=reorder(
How.often.do.you.travel.by.plane.,
How.often.do.you.travel.by.plane.,
FUN=length)
)%>%
ggplot(aes(x = How.often.do.you.travel.by.plane.))+
geom_bar()+
coord_flip()
According to the barchart, most respondents travel by plane only once a year or less. Fewer than 200 respondents travel once a month or never travel. Respondents who travel every day or a few times a week form the smallest groups.
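The exercise asks for the levels to be ordered by how often people travel, which is not the same as ordering them by how many respondents chose each answer. The following is a minimal sketch of an explicit ordering. The vector freq is a hypothetical stand-in for the survey column, and the level labels are an assumption that should be checked against the levels actually present in the data.

```r
# Hypothetical stand-in for fly$How.often.do.you.travel.by.plane.
freq <- c("Never", "Once a year or less", "Every day",
          "Once a month or less", "Once a year or less")

# Assumed level labels, listed from least to most frequent travel;
# verify them against the labels in the survey data before use
travel_levels <- c("Never", "Once a year or less", "Once a month or less",
                   "A few times per month", "A few times per week",
                   "Every day")

# factor(..., levels = ) fixes the order without changing any value
freq_ordered <- factor(freq, levels = travel_levels)
table(freq_ordered)
```

A barchart drawn from a factor ordered this way shows the categories from least to most frequent travel, regardless of their counts.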
Exclude all respondents who never fly from the remainder of the analysis. How many records does the data set have now?
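# keep only respondents who fly at least occasionally
fly %>%
filter(How.often.do.you.travel.by.plane. != "Never") -> fly.flyers
# nrow() counts records (rows)
nrow(fly.flyers)
nrow() returns the number of records that remain after excluding the respondents who never fly. length() is not suitable here: applied to a data frame it returns the number of columns, which is 27 for this data set.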
In the demographic variables (Education, Age, and Household.Income), replace all occurrences of the empty string “” by a missing value NA. How many responses in each variable do not have any missing values? How many responses have no missing values in any of the three variables? (Hint: think of the function is.na)
# assign the missing value NA itself, not the character string 'NA'
fly$Education[fly$Education == ""] <- NA
fly$Age[fly$Age == ""] <- NA
fly$Household.Income[fly$Household.Income == ""] <- NA
Missing.Education<-is.na(fly$Education)
Missing.Age<-is.na(fly$Age)
Missing.Household.Income<-is.na(fly$Household.Income)
table(Missing.Education)
## Missing.Education
## FALSE TRUE
## 1001 39
table(Missing.Age)
## Missing.Age
## FALSE TRUE
## 1007 33
table(Missing.Household.Income)
## Missing.Household.Income
## FALSE TRUE
## 826 214
There are 39, 33, and 214 missing responses for Education, Age, and Household.Income, respectively. Correspondingly, 1001, 1007, and 826 responses for Education, Age, and Household.Income, respectively, have no missing value.
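The number of responses with no missing value in any of the three variables can be obtained with complete.cases. The sketch below uses a small hypothetical data frame, demo, in place of the survey data.

```r
# Hypothetical toy data standing in for the three demographic columns
demo <- data.frame(
  Education        = c("Bachelor degree", NA, "Graduate degree", "High school degree"),
  Age              = c("18-29", "30-44", NA, "45-60"),
  Household.Income = c(NA, "$0 - $24,999", "$25,000 - $49,999", "$50,000 - $99,999")
)

# Non-missing responses per variable
non_missing <- colSums(!is.na(demo))

# Rows with no missing value in any of the three variables
n_complete <- sum(complete.cases(demo))
```

For the survey data, the corresponding call would be sum(complete.cases(fly[, c("Education", "Age", "Household.Income")])).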
Run the command below and interpret the output. What potential purpose can you see for the chart? What might be a problem with the chart? Find at least one purpose and one problem.
library(ggplot2)
fly$Education = with(fly, factor(Education, levels = rev(levels(Education))))
ggplot(data = fly, aes(x = 1)) +
geom_bar(aes(fill=Education), position="fill") +
coord_flip() +
theme(legend.position="bottom") +
scale_fill_brewer() +
xlab("Ratio")
The potential purpose of this chart is to compare the shares of the education levels among the respondents. Because the levels of the factor ‘Education’ are reversed, the segments run from left to right from the lowest education level up to graduate degrees. One problem is that the stacked segments do not start from a common baseline, which makes their lengths hard to compare. For example, there are more respondents with a ‘Bachelor degree’ than with a ‘Graduate degree’; however, in the chart this difference is hard to see.
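Differences that are hard to read off a single stacked bar can be made explicit by tabulating counts and proportions. This is only a sketch on a hypothetical vector, edu, standing in for fly$Education.

```r
# Hypothetical education responses standing in for fly$Education
edu <- factor(c("Bachelor degree", "Graduate degree", "Bachelor degree",
                "High school degree", "Graduate degree", "Bachelor degree"))

counts <- table(edu)          # absolute frequencies per level
shares <- prop.table(counts)  # relative frequencies, summing to 1

round(shares, 2)
```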
Rename the variable In.general..is.itrude.to.bring.a.baby.on.a.plane. to baby.on.plane.. How many levels does the variable baby.on.plane have, and what are these levels? Rename the level labeled “” to “Not answered”.
colnames(fly)[colnames(fly)=="In.general..is.itrude.to.bring.a.baby.on.a.plane."]<- "baby.on.plane"
levels(fly$baby.on.plane)[1]<- "Not Answered"
levels(fly$baby.on.plane)
## [1] "Not Answered" "No, not at all rude" "Yes, somewhat rude"
## [4] "Yes, very rude"
The variable ‘baby.on.plane’ has four levels: “Not Answered”, “No, not at all rude”, “Yes, somewhat rude” and “Yes, very rude”.
Bring the levels of baby.on.plane in an order from least rude to most rude. Put the level “Not answered” last. Draw a barchart of variable baby.on.plane. Interpret the result. (This question is very similar to question 2, but preps the data for the next question)
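# factor(..., levels = ) reorders the levels and leaves the answers unchanged;
# assigning a vector to levels() would only rename them position by position
fly$baby.on.plane <- factor(fly$baby.on.plane,
levels = c("No, not at all rude", "Yes, somewhat rude",
"Yes, very rude", "Not Answered"))
fly%>%
ggplot(aes(x = baby.on.plane))+
geom_bar()
Most of the respondents believe that bringing a baby on a plane is not rude at all. The “Yes, somewhat rude” and “Not Answered” groups are of roughly similar size.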
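When reordering factor levels, it matters whether the levels are renamed or reordered. The sketch below, on a hypothetical toy factor x, shows that assigning to levels() replaces labels position by position and so changes the data, whereas factor(..., levels = ) keeps every value and only changes the order.

```r
# Hypothetical toy answers; "" stands for a missing answer
x <- factor(c("", "No", "Yes", "No"))
levels(x)  # sorted alphabetically, with "" first

# Renaming: labels are replaced position by position,
# so every original "No" is now labelled "Yes"
renamed <- x
levels(renamed) <- c("No", "Yes", "")
as.character(renamed)

# Reordering: values are kept, only the level order changes
reordered <- factor(x, levels = c("No", "Yes", ""))
as.character(reordered)
```

Comparing as.character() before and after the operation is a quick check that a reordering step has not altered any response.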
Investigate the relationship between gender and the variables Do.you.have.any.children.under.18. and baby.on.plane. How is the attitude towards babies on planes shaped by gender and own children under 18? Find a plot that summarises your findings (use ggplot2).
fly%>%
ggplot(aes(x = Do.you.have.any.children.under.18., fill = baby.on.plane))+
geom_bar(position = "fill")
library(ggmosaic)
##
## Attaching package: 'ggmosaic'
## The following object is masked _by_ '.GlobalEnv':
##
## fly
fly%>%
ggplot() +
geom_mosaic(aes(x = product(Do.you.have.any.children.under.18.),
fill = baby.on.plane,
weight = 1))
The plots suggest that respondents with children under 18 judge babies on planes more leniently than respondents without children, who more often consider it rude. The mosaic plot additionally shows the relative sizes of the groups through the widths of the bars. Gender is not included in either plot, so its role remains to be examined.
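One possible way to bring gender into the picture is to facet the proportional barchart by gender. The sketch below uses a hypothetical data frame, toy. The column name Gender is an assumption about the survey data, and children stands in for Do.you.have.any.children.under.18.

```r
library(ggplot2)

# Hypothetical toy data; Gender is an assumed column name,
# children stands in for Do.you.have.any.children.under.18.
toy <- data.frame(
  Gender        = c("Male", "Male", "Female", "Female", "Male", "Female"),
  children      = c("Yes", "No", "Yes", "No", "No", "Yes"),
  baby.on.plane = c("No, not at all rude", "Yes, somewhat rude",
                    "No, not at all rude", "Yes, very rude",
                    "No, not at all rude", "No, not at all rude")
)

p <- ggplot(toy, aes(x = children, fill = baby.on.plane)) +
  geom_bar(position = "fill") +  # proportions within each bar
  facet_wrap(~ Gender) +         # one panel per gender
  ylab("Proportion")
p
```

Each panel then shows, for one gender, how attitudes differ between respondents with and without children under 18.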