Data Analysis for Thanksgiving Poll Data Set

We are often tasked with taking data in one form and transforming it for easier downstream analysis. We will spend several weeks in this course on tidying and transformation operations. Some of this work could be done in SQL or R (or Python or…). Here, you are asked to use R—you may use any base functions or packages as you like. Your task is to first choose one of the provided datasets on fivethirtyeight.com that you find interesting: https://data.fivethirtyeight.com/ You should first study the data and any other information on the GitHub site, and read the associated fivethirtyeight.com article.

Overview

Using a SurveyMonkey poll, FiveThirtyEight asked 1,058 respondents on Nov. 17, 2015 a series of questions about their Thanksgiving.

Link for article: https://fivethirtyeight.com/features/heres-what-your-part-of-america-eats-on-thanksgiving/

GitHub: https://github.com/fivethirtyeight/data/blob/master/thanksgiving-2015/thanksgiving-2015-poll-data.csv

Environment set up

#loading required libraries
library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------------------------------------------------------------------ tidyverse 1.3.0 --
## v ggplot2 3.2.1     v purrr   0.3.3
## v tibble  2.1.3     v dplyr   0.8.4
## v tidyr   1.0.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts --------------------------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(curl)
## 
## Attaching package: 'curl'
## The following object is masked from 'package:readr':
## 
##     parse_date

Data Acquisition

  1. Make sure that the original data file is accessible through your code—for example, stored in a GitHub repository or AWS S3 bucket and referenced in your code. If the code references data on your local machine, then your work is not reproducible!
# import the data directly from the GitHub repository

poll_df <- read.csv('https://raw.githubusercontent.com/keshaws/CUNY_MSDS_2020/master/DATA607/Week1/data/thanksgiving-2015-poll-data.csv')

dim(poll_df)
## [1] 1058   65

After ingesting the data and checking its dimensions, I found 1058 observations (data points) and 65 variables (features).

colnames(poll_df)
##  [1] "RespondentID"                                                                                                                                
##  [2] "Do.you.celebrate.Thanksgiving."                                                                                                              
##  [3] "What.is.typically.the.main.dish.at.your.Thanksgiving.dinner."                                                                                
##  [4] "What.is.typically.the.main.dish.at.your.Thanksgiving.dinner....Other..please.specify."                                                       
##  [5] "How.is.the.main.dish.typically.cooked."                                                                                                      
##  [6] "How.is.the.main.dish.typically.cooked....Other..please.specify."                                                                             
##  [7] "What.kind.of.stuffing.dressing.do.you.typically.have."                                                                                       
##  [8] "What.kind.of.stuffing.dressing.do.you.typically.have....Other..please.specify."                                                              
##  [9] "What.type.of.cranberry.saucedo.you.typically.have."                                                                                          
## [10] "What.type.of.cranberry.saucedo.you.typically.have....Other..please.specify."                                                                 
## [11] "Do.you.typically.have.gravy."                                                                                                                
## [12] "Which.of.these.side.dishes.aretypically.served.at.your.Thanksgiving.dinner..Please.select.all.that.apply....Brussel.sprouts"                 
## [13] "Which.of.these.side.dishes.aretypically.served.at.your.Thanksgiving.dinner..Please.select.all.that.apply....Carrots"                         
## [14] "Which.of.these.side.dishes.aretypically.served.at.your.Thanksgiving.dinner..Please.select.all.that.apply....Cauliflower"                     
## [15] "Which.of.these.side.dishes.aretypically.served.at.your.Thanksgiving.dinner..Please.select.all.that.apply....Corn"                            
## [16] "Which.of.these.side.dishes.aretypically.served.at.your.Thanksgiving.dinner..Please.select.all.that.apply....Cornbread"                       
## [17] "Which.of.these.side.dishes.aretypically.served.at.your.Thanksgiving.dinner..Please.select.all.that.apply....Fruit.salad"                     
## [18] "Which.of.these.side.dishes.aretypically.served.at.your.Thanksgiving.dinner..Please.select.all.that.apply....Green.beans.green.bean.casserole"
## [19] "Which.of.these.side.dishes.aretypically.served.at.your.Thanksgiving.dinner..Please.select.all.that.apply....Macaroni.and.cheese"             
## [20] "Which.of.these.side.dishes.aretypically.served.at.your.Thanksgiving.dinner..Please.select.all.that.apply....Mashed.potatoes"                 
## [21] "Which.of.these.side.dishes.aretypically.served.at.your.Thanksgiving.dinner..Please.select.all.that.apply....Rolls.biscuits"                  
## [22] "Which.of.these.side.dishes.aretypically.served.at.your.Thanksgiving.dinner..Please.select.all.that.apply....Squash"                          
## [23] "Which.of.these.side.dishes.aretypically.served.at.your.Thanksgiving.dinner..Please.select.all.that.apply....Vegetable.salad"                 
## [24] "Which.of.these.side.dishes.aretypically.served.at.your.Thanksgiving.dinner..Please.select.all.that.apply....Yams.sweet.potato.casserole"     
## [25] "Which.of.these.side.dishes.aretypically.served.at.your.Thanksgiving.dinner..Please.select.all.that.apply....Other..please.specify."          
## [26] "Which.of.these.side.dishes.aretypically.served.at.your.Thanksgiving.dinner..Please.select.all.that.apply....Other..please.specify..1"        
## [27] "Which.type.of.pie.is.typically.served.at.your.Thanksgiving.dinner..Please.select.all.that.apply....Apple"                                    
## [28] "Which.type.of.pie.is.typically.served.at.your.Thanksgiving.dinner..Please.select.all.that.apply....Buttermilk"                               
## [29] "Which.type.of.pie.is.typically.served.at.your.Thanksgiving.dinner..Please.select.all.that.apply....Cherry"                                   
## [30] "Which.type.of.pie.is.typically.served.at.your.Thanksgiving.dinner..Please.select.all.that.apply....Chocolate"                                
## [31] "Which.type.of.pie.is.typically.served.at.your.Thanksgiving.dinner..Please.select.all.that.apply....Coconut.cream"                            
## [32] "Which.type.of.pie.is.typically.served.at.your.Thanksgiving.dinner..Please.select.all.that.apply....Key.lime"                                 
## [33] "Which.type.of.pie.is.typically.served.at.your.Thanksgiving.dinner..Please.select.all.that.apply....Peach"                                    
## [34] "Which.type.of.pie.is.typically.served.at.your.Thanksgiving.dinner..Please.select.all.that.apply....Pecan"                                    
## [35] "Which.type.of.pie.is.typically.served.at.your.Thanksgiving.dinner..Please.select.all.that.apply....Pumpkin"                                  
## [36] "Which.type.of.pie.is.typically.served.at.your.Thanksgiving.dinner..Please.select.all.that.apply....Sweet.Potato"                             
## [37] "Which.type.of.pie.is.typically.served.at.your.Thanksgiving.dinner..Please.select.all.that.apply....None"                                     
## [38] "Which.type.of.pie.is.typically.served.at.your.Thanksgiving.dinner..Please.select.all.that.apply....Other..please.specify."                   
## [39] "Which.type.of.pie.is.typically.served.at.your.Thanksgiving.dinner..Please.select.all.that.apply....Other..please.specify..1"                 
## [40] "Which.of.these.desserts.do.you.typically.have.at.Thanksgiving.dinner..Please.select.all.that.apply......Apple.cobbler"                       
## [41] "Which.of.these.desserts.do.you.typically.have.at.Thanksgiving.dinner..Please.select.all.that.apply......Blondies"                            
## [42] "Which.of.these.desserts.do.you.typically.have.at.Thanksgiving.dinner..Please.select.all.that.apply......Brownies"                            
## [43] "Which.of.these.desserts.do.you.typically.have.at.Thanksgiving.dinner..Please.select.all.that.apply......Carrot.cake"                         
## [44] "Which.of.these.desserts.do.you.typically.have.at.Thanksgiving.dinner..Please.select.all.that.apply......Cheesecake"                          
## [45] "Which.of.these.desserts.do.you.typically.have.at.Thanksgiving.dinner..Please.select.all.that.apply......Cookies"                             
## [46] "Which.of.these.desserts.do.you.typically.have.at.Thanksgiving.dinner..Please.select.all.that.apply......Fudge"                               
## [47] "Which.of.these.desserts.do.you.typically.have.at.Thanksgiving.dinner..Please.select.all.that.apply......Ice.cream"                           
## [48] "Which.of.these.desserts.do.you.typically.have.at.Thanksgiving.dinner..Please.select.all.that.apply......Peach.cobbler"                       
## [49] "Which.of.these.desserts.do.you.typically.have.at.Thanksgiving.dinner..Please.select.all.that.apply......None"                                
## [50] "Which.of.these.desserts.do.you.typically.have.at.Thanksgiving.dinner..Please.select.all.that.apply......Other..please.specify."              
## [51] "Which.of.these.desserts.do.you.typically.have.at.Thanksgiving.dinner..Please.select.all.that.apply......Other..please.specify..1"            
## [52] "Do.you.typically.pray.before.or.after.the.Thanksgiving.meal."                                                                                
## [53] "How.far.will.you.travel.for.Thanksgiving."                                                                                                   
## [54] "Will.you.watch.any.of.the.following.programs.on.Thanksgiving..Please.select.all.that.apply....Macy.s.Parade"                                 
## [55] "What.s.the.age.cutoff.at.your..kids..table..at.Thanksgiving."                                                                                
## [56] "Have.you.ever.tried.to.meet.up.with.hometown.friends.on.Thanksgiving.night."                                                                 
## [57] "Have.you.ever.attended.a..Friendsgiving.."                                                                                                   
## [58] "Will.you.shop.any.Black.Friday.sales.on.Thanksgiving.Day."                                                                                   
## [59] "Do.you.work.in.retail."                                                                                                                      
## [60] "Will.you.employer.make.you.work.on.Black.Friday."                                                                                            
## [61] "How.would.you.describe.where.you.live."                                                                                                      
## [62] "Age"                                                                                                                                         
## [63] "What.is.your.gender."                                                                                                                        
## [64] "How.much.total.combined.money.did.all.members.of.your.HOUSEHOLD.earn.last.year."                                                             
## [65] "US.Region"

Subsetting dataset

Now selecting the features below to create a subset of the main dataset

poll_subset_df <- select(poll_df, 'RespondentID', 'Do.you.celebrate.Thanksgiving.', 'What.is.your.gender.', 'Age', 'How.would.you.describe.where.you.live.', 'US.Region')
head(poll_subset_df,10)
##    RespondentID Do.you.celebrate.Thanksgiving. What.is.your.gender.     Age
## 1    4337954960                            Yes                 Male 18 - 29
## 2    4337951949                            Yes               Female 18 - 29
## 3    4337935621                            Yes                 Male 18 - 29
## 4    4337933040                            Yes                 Male 30 - 44
## 5    4337931983                            Yes                 Male 30 - 44
## 6    4337929779                            Yes                 Male 18 - 29
## 7    4337924420                            Yes                 Male 18 - 29
## 8    4337916002                            Yes                 Male 18 - 29
## 9    4337914977                            Yes                 Male 30 - 44
## 10   4337899817                            Yes                 Male 30 - 44
##    How.would.you.describe.where.you.live.          US.Region
## 1                                Suburban    Middle Atlantic
## 2                                   Rural East South Central
## 3                                Suburban           Mountain
## 4                                   Urban            Pacific
## 5                                   Urban            Pacific
## 6                                   Urban            Pacific
## 7                                   Rural East North Central
## 8                                   Rural           Mountain
## 9                                   Urban    Middle Atlantic
## 10                               Suburban East South Central

Now, renaming the columns of the subset

poll_data_df <- rename(poll_subset_df,ID='RespondentID',celebrate='Do.you.celebrate.Thanksgiving.', gender = 'What.is.your.gender.', age_range='Age', living_region = 'How.would.you.describe.where.you.live.', us_region='US.Region')
head(poll_data_df,5)
##           ID celebrate gender age_range living_region          us_region
## 1 4337954960       Yes   Male   18 - 29      Suburban    Middle Atlantic
## 2 4337951949       Yes Female   18 - 29         Rural East South Central
## 3 4337935621       Yes   Male   18 - 29      Suburban           Mountain
## 4 4337933040       Yes   Male   30 - 44         Urban            Pacific
## 5 4337931983       Yes   Male   30 - 44         Urban            Pacific
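
The select and rename steps above could also be combined, since select() accepts new_name = old_name pairs. The following is a minimal sketch under that assumption; toy_poll and its values are hypothetical stand-ins for poll_df, and only three of the original column names are used.

```r
library(dplyr)

# Hypothetical stand-in for poll_df; the values are made up for illustration
toy_poll <- data.frame(
  RespondentID = c(1, 2, 3),
  Age = c("18 - 29", "30 - 44", "18 - 29"),
  US.Region = c("Pacific", "Mountain", ""),
  stringsAsFactors = FALSE
)

# select() keeps only the listed columns and renames them in the same call
toy_subset <- select(toy_poll, ID = RespondentID, age_range = Age, us_region = US.Region)

names(toy_subset)
```

Whether one call or two is clearer is a matter of taste; the two-step version above makes the intermediate subset easier to inspect.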

Finding the unique living regions in the subset

unique(poll_data_df$living_region)
## [1] Suburban Rural    Urban            
## Levels:  Rural Suburban Urban

Data Exploration

Now, using the group_by and summarize functions to get the count for each living region

poll_data_df %>%
 group_by(living_region) %>%
 summarize(count=n())
## # A tibble: 4 x 2
##   living_region count
##   <fct>         <int>
## 1 ""              110
## 2 "Rural"         216
## 3 "Suburban"      496
## 4 "Urban"         236

We can see from the table above that 110 data points have no living region value
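
The blank category appears because read.csv keeps empty text fields as empty strings rather than NA. One possible alternative, sketched here on a made-up inline CSV (csv_text and toy_poll are hypothetical, not the poll file), is to pass na.strings = "" so that blanks are read as NA and can be counted with is.na().

```r
# Hypothetical excerpt; the rows are made up for illustration
csv_text <- "ID,living_region,us_region
1,Suburban,Pacific
2,,Mountain
3,Rural,
4,,Pacific"

# na.strings = "" tells read.csv to treat empty fields as NA
toy_poll <- read.csv(text = csv_text, na.strings = "", stringsAsFactors = FALSE)

# Count the missing values in each column
missing_counts <- colSums(is.na(toy_poll))
print(missing_counts)
```

With real NA values, functions such as is.na() or complete.cases() can be used instead of comparing against "".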

Now, analyzing the US region values, it appears that 59 data points have missing values

poll_data_df %>%
 group_by(us_region) %>%
 summarize(count=n())
## # A tibble: 10 x 2
##    us_region            count
##    <fct>                <int>
##  1 ""                      59
##  2 "East North Central"   150
##  3 "East South Central"    60
##  4 "Middle Atlantic"      159
##  5 "Mountain"              47
##  6 "New England"           58
##  7 "Pacific"              146
##  8 "South Atlantic"       214
##  9 "West North Central"    74
## 10 "West South Central"    91

Data Cleansing

Now, cleansing the subset to remove the data points with a missing US region.

poll_data_clean <- poll_data_df %>%
      filter(us_region != "")

poll_data_clean %>%
 group_by(us_region) %>%
 summarize(count=n())
## # A tibble: 9 x 2
##   us_region          count
##   <fct>              <int>
## 1 East North Central   150
## 2 East South Central    60
## 3 Middle Atlantic      159
## 4 Mountain              47
## 5 New England           58
## 6 Pacific              146
## 7 South Atlantic       214
## 8 West North Central    74
## 9 West South Central    91
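
When the columns are factors, as the <fct> type in the output suggests, filtering removes the rows but may leave the empty "" level attached to the factor. A minimal sketch of dropping it with base R's droplevels() follows; toy_regions is a hypothetical stand-in, not the poll data.

```r
library(dplyr)

# Hypothetical factor column with one blank entry
toy_regions <- data.frame(us_region = factor(c("Pacific", "", "Mountain", "Pacific")))

# Remove the blank rows, then drop factor levels that no longer occur
toy_regions_clean <- toy_regions %>%
  filter(us_region != "") %>%
  droplevels()

levels(toy_regions_clean$us_region)
```

This matters mainly for plots and tables that list every factor level, including ones with zero rows.
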

poll_data_clean %>%
 group_by(celebrate) %>%
 summarize(count=n())
## # A tibble: 2 x 2
##   celebrate count
##   <fct>     <int>
## 1 No           68
## 2 Yes         931

poll_data_clean %>%
 group_by(us_region, celebrate) %>%
 summarize(count=n())
## # A tibble: 18 x 3
## # Groups:   us_region [9]
##    us_region          celebrate count
##    <fct>              <fct>     <int>
##  1 East North Central No            5
##  2 East North Central Yes         145
##  3 East South Central No            4
##  4 East South Central Yes          56
##  5 Middle Atlantic    No           14
##  6 Middle Atlantic    Yes         145
##  7 Mountain           No            6
##  8 Mountain           Yes          41
##  9 New England        No            3
## 10 New England        Yes          55
## 11 Pacific            No           16
## 12 Pacific            Yes         130
## 13 South Atlantic     No           11
## 14 South Atlantic     Yes         203
## 15 West North Central No            3
## 16 West North Central Yes          71
## 17 West South Central No            6
## 18 West South Central Yes          85
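
Because the regions differ a lot in size, raw counts alone can be misleading. As a rough sketch, the share of respondents who celebrate can be computed per region from the counts in the table above; region_counts and celebrate_share are names introduced here for illustration.

```r
library(dplyr)

regions <- c("East North Central", "East South Central", "Middle Atlantic",
             "Mountain", "New England", "Pacific", "South Atlantic",
             "West North Central", "West South Central")

# Counts copied from the us_region by celebrate table above
region_counts <- data.frame(
  us_region = rep(regions, each = 2),
  celebrate = rep(c("No", "Yes"), times = 9),
  count = c(5, 145, 4, 56, 14, 145, 6, 41, 3, 55, 16, 130, 11, 203, 3, 71, 6, 85),
  stringsAsFactors = FALSE
)

# Share of "Yes" answers within each region, sorted from highest to lowest
celebrate_share <- region_counts %>%
  group_by(us_region) %>%
  summarize(respondents = sum(count),
            yes = sum(count[celebrate == "Yes"])) %>%
  mutate(share_yes = yes / respondents) %>%
  arrange(desc(share_yes))

print(celebrate_share)
```

Sorting by share_yes rather than by count separates how popular the celebration is from how many people answered in each region.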

Data Visualization

Plotting the graph for living region with missing values

# bars are filled by age range, so the legend is labelled accordingly
ggplot(poll_data_df, mapping = aes(x = living_region))+
    geom_bar(aes(fill = age_range))+
    xlab('Living Region')+labs(fill='Age range')+geom_text(stat='count',aes(label = ..count..,y=..count..),vjust=-0.2)+
    ggtitle('Poll data')

Plotting the graph for US region with the cleaned data points

# bars are filled by the celebrate answer, so the legend is labelled accordingly
ggplot(poll_data_clean, mapping = aes(x = us_region))+
    geom_bar(aes(fill = celebrate))+
    xlab('US Region')+labs(fill='Celebrate')+geom_text(stat='count',aes(label = ..count..,y=..count..),vjust=-0.2)+
    theme(axis.text.x = element_text(angle = 90))+
    ggtitle('Poll data')
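
A stacked count chart mostly shows how many people responded in each region. One possible variant, sketched here on made-up data (toy_celebrate and share_plot are hypothetical names), uses position = "fill" so that every bar has the same height and the fill shows shares instead of counts.

```r
library(ggplot2)

# Hypothetical stand-in for poll_data_clean; the values are made up
toy_celebrate <- data.frame(
  us_region = c("Pacific", "Pacific", "Pacific", "Mountain", "Mountain"),
  celebrate = c("Yes", "Yes", "No", "Yes", "No"),
  stringsAsFactors = FALSE
)

# position = "fill" rescales each bar to a height of 1
share_plot <- ggplot(toy_celebrate, aes(x = us_region, fill = celebrate)) +
  geom_bar(position = "fill") +
  xlab("US Region") + ylab("Share of respondents") +
  labs(fill = "Celebrate") +
  ggtitle("Poll data")

print(share_plot)
```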

Conclusion

The main dataset contains 1058 data points and 65 features. After creating the subset and removing the rows with a missing “US Region”, 999 respondents remain, of whom 931 said yes and 68 said no to celebrating Thanksgiving. The living-region plot shows that suburban respondents form the largest group (496). Among the US regions, the “South Atlantic” region has the most respondents who celebrate (203) and the “Mountain” region the fewest (41). These are raw counts, so they also reflect how many people responded from each region; as a share of respondents, “East North Central” is highest (145 of 150) and “Mountain” is lowest (41 of 47).