Week1 Assignment

The author of “How Americans Like Their Stake” brings out a question about how does a risk-taking behavior associate with steak in rare. Along with other variables, he found that the the risk-taking behavior is statistically insignificant to steak rareness. https://fivethirtyeight.com/features/how-americans-like-their-steak/

Import data as csv

linked phrase

a<-getURL("https://raw.githubusercontent.com/fivethirtyeight/data/master/steak-survey/steak-risk-survey.csv")
b<-data.frame(read.csv(text=a, header=T))
head(b)

b<-b[-1,-2]  # removed the first row because it is an empty row, and removed the 2nd row which is out of my analysis target.
head(b)

dim(b)

## [1] 550  14

Rename the column headers

data<-rename(b,c("RespondentID"="ID","Do.you.ever.smoke.cigarettes."="Cigarettes","Do.you.ever.drink.alcohol."="Alcohol","Do.you.ever.gamble."="Gamble","Have.you.ever.been.skydiving."="Skydiving", "Do.you.ever.drive.above.the.speed.limit."="Drive_limit", "Have.you.ever.cheated.on.your.significant.other."="Cheat","Do.you.eat.steak."="Steak","How.do.you.like.your.steak.prepared."="prepared","Gender"="Gender","VAge"="Age","Household.Income"="Income_range", "Education"="Education", "Location..Census.Region."= "Region"))

## The following `from` values were not present in `x`: VAge

head(data)

str(data)

## 'data.frame':    550 obs. of  14 variables:
##  $ ID          : num  3.24e+09 3.23e+09 3.23e+09 3.23e+09 3.23e+09 ...
##  $ Cigarettes  : chr  "" "No" "No" "Yes" ...
##  $ Alcohol     : chr  "" "Yes" "Yes" "Yes" ...
##  $ Gamble      : chr  "" "No" "Yes" "Yes" ...
##  $ Skydiving   : chr  "" "No" "No" "No" ...
##  $ Drive_limit : chr  "" "No" "Yes" "Yes" ...
##  $ Cheat       : chr  "" "No" "Yes" "Yes" ...
##  $ Steak       : chr  "" "Yes" "Yes" "Yes" ...
##  $ prepared    : chr  "" "Medium rare" "Rare" "Medium" ...
##  $ Gender      : chr  "" "Male" "Male" "Male" ...
##  $ Age         : chr  "" "> 60" "> 60" "> 60" ...
##  $ Income_range: chr  "" "$50,000 - $99,999" "$150,000+" "$50,000 - $99,999" ...
##  $ Education   : chr  "" "Some college or Associate degree" "Graduate degree" "Bachelor degree" ...
##  $ Region      : chr  "" "East North Central" "South Atlantic" "New England" ...

data$ID<-as.character(data$ID)   # change the ID to Character, since the number in ID is not meaningful numbers.
str(data)

## 'data.frame':    550 obs. of  14 variables:
##  $ ID          : chr  "3237565956" "3234982343" "3234973379" "3234972383" ...
##  $ Cigarettes  : chr  "" "No" "No" "Yes" ...
##  $ Alcohol     : chr  "" "Yes" "Yes" "Yes" ...
##  $ Gamble      : chr  "" "No" "Yes" "Yes" ...
##  $ Skydiving   : chr  "" "No" "No" "No" ...
##  $ Drive_limit : chr  "" "No" "Yes" "Yes" ...
##  $ Cheat       : chr  "" "No" "Yes" "Yes" ...
##  $ Steak       : chr  "" "Yes" "Yes" "Yes" ...
##  $ prepared    : chr  "" "Medium rare" "Rare" "Medium" ...
##  $ Gender      : chr  "" "Male" "Male" "Male" ...
##  $ Age         : chr  "" "> 60" "> 60" "> 60" ...
##  $ Income_range: chr  "" "$50,000 - $99,999" "$150,000+" "$50,000 - $99,999" ...
##  $ Education   : chr  "" "Some college or Associate degree" "Graduate degree" "Bachelor degree" ...
##  $ Region      : chr  "" "East North Central" "South Atlantic" "New England" ...

Data cleaning

# I have seen some problems in this data set. There  are empty/blank within the variable of Gender and Prepared. 
# My idea here is convert the blank column to NA, then drop the row that has NA from the data set.


summary(as.factor(data$Gender))  # There are 36 missing in Gender.

##        Female   Male 
##     36    268    246

summary(as.factor(data$prepared)) # There are 118 missing in Prepared.

##                  Medium Medium rare Medium Well        Rare        Well 
##         118         132         166          75          23          36

missing_Gender<-data$Gender ==""     # Make a new variable missing in gender.
data$missing_Gender<-ifelse(missing_Gender==TRUE, NA,"Keep")  # convert the empty column to NA
missing_Prepared<-data$prepared =="" # Make a new variable missing in prepared.
data$missing_Prepared<-ifelse(missing_Prepared==TRUE, NA, "Keep")  #convert the empty column to NA


data<-data[!is.na(data$missing_Gender),]
data<-data[!is.na(data$missing_Prepared),]

# now I can see the rows are dropped successfully, and the data observation is dropped from 550 to 412.

How many female like the steak in rare?

steak_rare_female<- subset(data, Gender=="Female" & prepared =="Rare")
steak_rare_female

count(steak_rare_female$Gender) # 12 female like the steak rare

count(data$Gender=="Female")# There are 200 female in this data set.

12/200      # about 6 % of female in this data like the steak in rare.

## [1] 0.06

Therefore, there are 12 female like the steak rare which is about 6 % of female in this data like the steak in rare.

How many male like the steak in rare?

## [1] 0.04716981

There are 10 male like the steak in rare which means about 4.7 % of the male like the steak in rare in this data set.

library(ggplot2)
ggplot(data=data)+
  geom_bar(mapping=aes(x=prepared,fill=Gender))+
  ggtitle("Female VS Male in steak")

ggplot(data=data)+
  geom_bar(mapping=aes(x=prepared,fill=prepared))

From above two plots, we can see that Female and Male have the similar pattern in make their steak prepared. The proportions are very close to each other.

How many steak lovers like the stake in rare?

I found that there are 22 steak lovers like eating steak in rare. This is exactly the sum of as above, 10 of male love steak in rare and 12 female love steak in rare.

Conclusion

The data explains that there are 12 females, and 10 males like the steak in rare. Also, we found that people who cook the steak in rare are all steak lovers with total number of 22