Project 2B - Candy Hierarchy

We are looking at data collected by the University of British Columbia using this survey.

The data collected is an amalgation of questions most of which center around candy preferences where the survey responses offered were: “JOY”, “MEH, or”DESPAIR". Some inconsistent location data is also supplied for every response.

Our task is to select the information of interest from the document and transform it in order to facilitate analysis. Additionally, we will use the survey responses to develop a scoring method which will allow us to rank the candy by preference and develop a candy hierarchy.

Libraries

Load required libraries

library(tidyverse) #readr, dplyr, tidyr, stringr, tibble, ggplot2
library(knitr)
library(scales)
library(kableExtra)
library(readxl)

A helper function for displaying tables

showtable <- function(data, title) {
  kable(data, caption = title) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
}

Data Import

The data can is orginally found and the .xlsx file can be downloaded here.

Read the .csv file from the source link

url <- "https://www.scq.ubc.ca/wp-content/uploads/2017/10/candyhierarchy2017.csv"
untidy <- read_csv(url)

Data Evaluation

Raw Data

Here is a snapshot of the dataset. It is composed of 2460 rows and 120 columns. The data is un-tidy in multiple ways:

Since the survey does not specify any formatting for the participants response, you can expect all kinds of inconsistent manual entries, for example with country: USA, usa, us, Murica, United States etc.
Values of variables in column headings
Missing values

Data Transformation

First, we use regular expressions look at the columns that contain a question and remove it to leave only the candy names.

questions <- names(untidy)
questions <- str_replace(questions, "Q[:digit:]+[:punct:]? \\|? ?", replacement = "")
colnames(untidy) <- questions

We create two tables to avoid data duplication: demographics and responses. This study will focus on the responses table only and we will tidy the table by gathering the candy types from the column headers into a column of its own.

demographics <- untidy %>% select(1:6)
responses <- untidy %>% select(1,7:(ncol(untidy)-11))
tidy_responses <- responses %>% 
  gather(2:ncol(responses), key = Candy, value = Response)

The data is now presented in long format. Let’s take a look at 15 random oberservations from this table. We see a number of NA values which we will handle later.

showtable(sample_n(tidy_responses, 15), "Tidy Dataset")

Tidy Dataset
Internal ID	Candy	Response
90275176	Hard Candy	MEH
90273228	Hershey’s Kisses	NA
90277170	Red M&M’s	NA
90282201	Reggie Jackson Bar	NA
90273089	Kit Kat	JOY
90276469	Rolos	MEH
90273299	Whole Wheat anything	DESPAIR
90285090	Goo Goo Clusters	NA
90273923	Dots	MEH
90285560	Whole Wheat anything	DESPAIR
90277710	Peeps	NA
90283698	Kinder Happy Hippo	JOY
90277374	Reese’s Pieces	JOY
90273209	Swedish Fish	NA
90273419	Tic Tacs	NA

Data Analysis & Visualization

We start by reformatting the data in a way that can it can be used for this particular analysis. We are looking for the count of responses by type for each candy.

counted <- tidy_responses %>% 
  group_by(Candy) %>% 
  count(Response) %>% 
  spread(Response, n) 

showtable(head(counted, 10), "Tidy Dataset")

Tidy Dataset
Candy	DESPAIR	JOY	MEH	<NA>
100 Grand Bar	85	873	755	747
Abstained from M&M’ing.	693	218	607	942
Anonymous brown globs that come in black and orange wrappers (a.k.a. Mary Janes)	1089	176	461	734
Any full-sized candy bar	17	1559	212	672
Black Jacks	793	92	617	958
Blue M&M’s	120	1018	595	727
Bonkers (the board game)	546	193	715	1006
Bonkers (the candy)	495	116	857	992
Bottle Caps	560	465	670	765
Box’o’Raisins	1179	117	476	688

We drop the NA column and calculate the proportions instead of the counts, sorting by highest score. From this table we can see the preferred candies as rated by our score. However, we also see a few non specific candy types like the top score “Any full-sized candy bar”.

counted <- counted %>% select(Candy:MEH) %>% 
  mutate(total = DESPAIR + JOY + MEH) %>% 
  mutate(Dp = DESPAIR/total, Jp = JOY/total, Mp = MEH/total) %>% 
  mutate(Score = (Dp*-1+Mp*0+Jp*1)*100) %>%
  arrange(desc(Score))

showtable(head(counted, 10), "Tidy Dataset")

Tidy Dataset
Candy	DESPAIR	JOY	MEH	total	Dp	Jp	Mp	Score
Any full-sized candy bar	17	1559	212	1788	0.0095078	0.8719239	0.1185682	86.24161
Reese�s Peanut Butter Cups	95	1498	188	1781	0.0533408	0.8411005	0.1055587	78.77597
Kit Kat	52	1447	287	1786	0.0291153	0.8101904	0.1606943	78.10750
Cash, or other forms of legal tender	65	1439	276	1780	0.0365169	0.8084270	0.1550562	77.19101
Twix	71	1413	286	1770	0.0401130	0.7983051	0.1615819	75.81921
Snickers	79	1399	292	1770	0.0446328	0.7903955	0.1649718	74.57627
Tolberone something or other	84	1320	350	1754	0.0478905	0.7525656	0.1995439	70.46750
Lindt Truffle	98	1277	367	1742	0.0562572	0.7330654	0.2106774	67.68083
Peanut M&M�s	121	1291	377	1789	0.0676356	0.7216322	0.2107323	65.39966
Nestle Crunch	102	1188	472	1762	0.0578888	0.6742338	0.2678774	61.63451

To visualize the data in an interesting way, we creature two types of labels that will describe candies with positive score and candies with negative scores.

candylabel_p2 <- counted %>% filter(Score >= 0) %>% select(Candy) %>% pull(Candy)
candylabel_n1 <- counted %>% filter(Score < 0) %>% select(Candy) %>% pull(Candy)

candylabel_p1 <- rep("", times=length(candylabel_n1))
candylabel_n2 <- rep("", times=length(candylabel_p2))

candylabel_p <- c(candylabel_p2, candylabel_p1)
candylabel_n <- c(candylabel_n2, candylabel_n1)

Finally, we plot our Candy Hierarchy.

ggplot(data = counted, aes(x=reorder(Candy, Score), y=Score)) + 
  geom_bar(stat="identity", width = 0.8, position = position_dodge(width = 0.9), fill= 'lightblue') +
  geom_text(aes(label=candylabel_p), size = 2, vjust=0, hjust="left") +
  geom_text(aes(label=candylabel_n), size = 2, vjust=0, hjust="right") +
  labs(title="Candy Hierarchy", x = "", y = "") +
  theme(axis.text = element_blank(), axis.ticks = element_blank(), panel.background = element_blank()) +
  coord_flip() +
  expand_limits(y = c(-225, 150))

Not accounting non-specific candy types, the top three preferred candies are: Reese’s Penaut Butter Cups, Kit Kat and Twix. It is harder to identify the least preferred candies because of the non-sensical data concentrated at the lowest scores, but we do a few classics in the negative zone such as: Tic Tac, Lollipops, Licorice

Conclusion

While working with this dataset, we encountered a few forms of “un-tidyness”:
- Inconsistent value entry for non survey responses
- Question values in column headers
- Missing values

We used regular expressions and the tools from the tidyverse to clean up this data and organize it for use interesting displays. We revealed the preferred candies and identified a few mostly disliked ones.

To further and stregthen this analysis, we could look to add a column of data that would describe the candies into types that we could use group by color on the above hierarchy. We could also seek to eliminate non-sensical candy types. To improve the display, we could normalize the scores so that the bars are better stretched to occupy more of the visible scale. Lastly, we could try to incorporate the demographics data collected in a seperate table at the beginning to add more depth to this analysis.

data607-illien-project2b

Mael Illien

9/30/2019