Data preparation

# load data from fivethirtyeight.com
library(tidyverse)
library(psych)

food <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/food-world-cup/food-world-cup-data.csv", encoding = "latin1", na.strings = c("", "N/A"))

options(scipen = 999)
dim(food)
# Add column for the number of cuisines a respondent has tried
food$tried <- rowSums( !is.na( food[,4:43] ) )

# Get rid of extra text in cuisine rating column names
countries <- sub("Please.rate.how.much.you.like.the.traditional.cuisine.of.", "", colnames(food[,4:43]))
colnames(food)[4:43] <- countries

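# Set factor levels for Household.Income in ascending order.
# factor() matches on the existing labels; levels<- would only rename the levels in their current (alphabetical) order.
food$Household.Income <- factor(food$Household.Income, levels = c("$0 - $24,999","$25,000 - $49,999","$50,000 - $99,999","$100,000 - $149,999","$150,000+"))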

# Keep rows where both Gender (col 44) and Household.Income (col 46) are present,
# and keep only the respondent ID, Gender, Household.Income, and tried columns
food2 <- food[complete.cases(food[, c(44,46)]), c(1,44,46,49)]
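
The `tried` column above relies on `rowSums(!is.na(...))` to count the non-missing ratings in each row. A minimal sketch of that idiom, using a made-up three-respondent data frame (the `toy` object and its values are hypothetical, not taken from the survey):

```r
# Hypothetical ratings for three respondents; NA means the cuisine was not rated.
toy <- data.frame(
  Italy  = c(5, NA, 3),
  Mexico = c(4, NA, NA),
  Japan  = c(NA, NA, 2)
)

# !is.na() yields a logical matrix; rowSums() counts the TRUE values in each row.
toy$tried <- rowSums(!is.na(toy[, c("Italy", "Mexico", "Japan")]))

toy$tried  # 2 0 2
```

A respondent who rated nothing gets a count of 0, which is consistent with the minimum of 0 reported by `describe()` for the real data.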

Research question

You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.

Is gender or household income associated with the number of cuisines a respondent has tried?

Cases

What are the cases, and how many are there?

Each case is a respondent to “The FiveThirtyEight International Food Association’s 2014 World Cup” survey. After removing respondents with a missing gender or household income, there are 954 observations in the data set.

Data collection

Describe the method of data collection.

FiveThirtyEight “polled 1,373 Americans through SurveyMonkey Audience and asked them to rate the national cuisines of the 32 teams that qualified for the [2014] World Cup, as well as eight additional nations with bad soccer but great food: China, Cuba, Ethiopia, India, Ireland, Thailand, Turkey and Vietnam.”

Type of study

What type of study is this (observational/experiment)?

This is an observational study: respondents were surveyed, and no treatment was assigned.

Data source

If you collected the data, state self-collected. If not, provide a citation/link.

The data were collected by FiveThirtyEight through SurveyMonkey Audience and are available on GitHub:

https://github.com/fivethirtyeight/data/tree/master/food-world-cup

Response

What is the response variable, and what type is it (numerical/categorical)?

The response variable is the number of cuisines a respondent says they have tried, which is numerical (a discrete count). It is calculated by counting the cuisines the respondent rated.

Explanatory

What is the explanatory variable, and what type is it (numerical/categorical)?

The explanatory variables are gender (categorical) and household income (categorical and ordinal, since it is recorded in five income brackets).
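
Household income is recorded as brackets, so one way to represent it in R is as an ordered factor. A minimal sketch, assuming the brackets are stored as text; `income_raw` is a hypothetical vector of invented responses, and `income_levels` simply restates the five bracket labels used in this document:

```r
# Income brackets in ascending order, as labelled in the survey data.
income_levels <- c("$0 - $24,999", "$25,000 - $49,999", "$50,000 - $99,999",
                   "$100,000 - $149,999", "$150,000+")

# Hypothetical responses stored as plain text.
income_raw <- c("$50,000 - $99,999", "$0 - $24,999", "$150,000+", "$50,000 - $99,999")

# Passing levels explicitly fixes the order; by default factor() sorts them alphabetically.
income <- factor(income_raw, levels = income_levels, ordered = TRUE)

table(income)          # counts appear in bracket order, including empty brackets
income[1] > income[2]  # TRUE: ordered factors support comparisons
```

Matching on the labels with `factor(..., levels = )` keeps each response attached to its own bracket, whatever order the levels were in beforehand.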

Relevant summary statistics

Provide summary statistics relevant to your research question. For example, if you’re comparing means across groups provide means, SDs, sample sizes of each group. This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.

table(food2$Gender)
## 
## Female   Male 
##    477    477
prop.table(table(food2$Gender))
## 
## Female   Male 
##    0.5    0.5
table(food2$Household.Income)
## 
##        $0 - $24,999   $25,000 - $49,999   $50,000 - $99,999 
##                 138                 161                 125 
## $100,000 - $149,999           $150,000+ 
##                 210                 320
prop.table(table(food2$Household.Income))
## 
##        $0 - $24,999   $25,000 - $49,999   $50,000 - $99,999 
##           0.1446541           0.1687631           0.1310273 
## $100,000 - $149,999           $150,000+ 
##           0.2201258           0.3354298
ggplot(food2, aes(x=Household.Income)) + geom_bar()

men <- subset(food2,Gender == "Male")
table(men$Household.Income)
## 
##        $0 - $24,999   $25,000 - $49,999   $50,000 - $99,999 
##                  70                  89                  68 
## $100,000 - $149,999           $150,000+ 
##                  91                 159
ggplot(men, aes(x=Household.Income)) + geom_bar()

women <- subset(food2,Gender == "Female")
table(women$Household.Income)
## 
##        $0 - $24,999   $25,000 - $49,999   $50,000 - $99,999 
##                  68                  72                  57 
## $100,000 - $149,999           $150,000+ 
##                 119                 161
ggplot(women, aes(x=Household.Income)) + geom_bar()

describe(food2$tried)
##    vars   n  mean  sd median trimmed mad min max range skew kurtosis   se
## X1    1 954 19.02 9.6     18   18.56 8.9   0  40    40  0.4    -0.41 0.31
ggplot(food2, aes(x=tried)) + geom_histogram(binwidth=5)
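
Because the research question compares `tried` across groups, per-group sample sizes, means, and standard deviations are also relevant. A minimal sketch of one way to compute them in base R; `group_summary()` is a hypothetical helper, and `toy` is an invented stand-in for `food2` that only shares its column names:

```r
# Hypothetical stand-in for food2 with the same column names; values are invented.
toy <- data.frame(
  Gender = c("Female", "Female", "Male", "Male", "Male"),
  tried  = c(10, 20, 5, 15, 25)
)

# For each group: sample size, mean, and standard deviation of `tried`.
group_summary <- function(df, group_col) {
  groups <- split(df$tried, df[[group_col]])
  data.frame(
    group = names(groups),
    n     = vapply(groups, length, integer(1)),
    mean  = vapply(groups, mean, numeric(1)),
    sd    = vapply(groups, sd, numeric(1)),
    row.names = NULL
  )
}

group_summary(toy, "Gender")
```

The same helper could be applied to the real data as `group_summary(food2, "Gender")` and `group_summary(food2, "Household.Income")`.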