We are looking at data collected by the University of British Columbia using this survey.
The data collected is an amalgation of questions most of which center around candy preferences where the survey responses offered were: “JOY”, “MEH, or”DESPAIR". Some inconsistent location data is also supplied for every response.
Our task is to select the information of interest from the document and transform it in order to facilitate analysis. Additionally, we will use the survey responses to develop a scoring method which will allow us to rank the candy by preference and develop a candy hierarchy.
Load required libraries
library(tidyverse) #readr, dplyr, tidyr, stringr, tibble, ggplot2
library(knitr)
library(scales)
library(kableExtra)
library(readxl)A helper function for displaying tables
The data can is orginally found and the .xlsx file can be downloaded here.
Read the .csv file from the source link
Raw Data
Here is a snapshot of the dataset. It is composed of 2460 rows and 120 columns. The data is un-tidy in multiple ways:
First, we use regular expressions look at the columns that contain a question and remove it to leave only the candy names.
questions <- names(untidy)
questions <- str_replace(questions, "Q[:digit:]+[:punct:]? \\|? ?", replacement = "")
colnames(untidy) <- questionsWe create two tables to avoid data duplication: demographics and responses. This study will focus on the responses table only and we will tidy the table by gathering the candy types from the column headers into a column of its own.
demographics <- untidy %>% select(1:6)
responses <- untidy %>% select(1,7:(ncol(untidy)-11))
tidy_responses <- responses %>%
gather(2:ncol(responses), key = Candy, value = Response)The data is now presented in long format. Let’s take a look at 15 random oberservations from this table. We see a number of NA values which we will handle later.
| Internal ID | Candy | Response |
|---|---|---|
| 90275176 | Hard Candy | MEH |
| 90273228 | Hershey’s Kisses | NA |
| 90277170 | Red M&M’s | NA |
| 90282201 | Reggie Jackson Bar | NA |
| 90273089 | Kit Kat | JOY |
| 90276469 | Rolos | MEH |
| 90273299 | Whole Wheat anything | DESPAIR |
| 90285090 | Goo Goo Clusters | NA |
| 90273923 | Dots | MEH |
| 90285560 | Whole Wheat anything | DESPAIR |
| 90277710 | Peeps | NA |
| 90283698 | Kinder Happy Hippo | JOY |
| 90277374 | Reese’s Pieces | JOY |
| 90273209 | Swedish Fish | NA |
| 90273419 | Tic Tacs | NA |
We start by reformatting the data in a way that can it can be used for this particular analysis. We are looking for the count of responses by type for each candy.
counted <- tidy_responses %>%
group_by(Candy) %>%
count(Response) %>%
spread(Response, n)
showtable(head(counted, 10), "Tidy Dataset")| Candy | DESPAIR | JOY | MEH | <NA> |
|---|---|---|---|---|
| 100 Grand Bar | 85 | 873 | 755 | 747 |
| Abstained from M&M’ing. | 693 | 218 | 607 | 942 |
| Anonymous brown globs that come in black and orange wrappers (a.k.a. Mary Janes) | 1089 | 176 | 461 | 734 |
| Any full-sized candy bar | 17 | 1559 | 212 | 672 |
| Black Jacks | 793 | 92 | 617 | 958 |
| Blue M&M’s | 120 | 1018 | 595 | 727 |
| Bonkers (the board game) | 546 | 193 | 715 | 1006 |
| Bonkers (the candy) | 495 | 116 | 857 | 992 |
| Bottle Caps | 560 | 465 | 670 | 765 |
| Box’o’Raisins | 1179 | 117 | 476 | 688 |
We drop the NA column and calculate the proportions instead of the counts, sorting by highest score. From this table we can see the preferred candies as rated by our score. However, we also see a few non specific candy types like the top score “Any full-sized candy bar”.
counted <- counted %>% select(Candy:MEH) %>%
mutate(total = DESPAIR + JOY + MEH) %>%
mutate(Dp = DESPAIR/total, Jp = JOY/total, Mp = MEH/total) %>%
mutate(Score = (Dp*-1+Mp*0+Jp*1)*100) %>%
arrange(desc(Score))
showtable(head(counted, 10), "Tidy Dataset") | Candy | DESPAIR | JOY | MEH | total | Dp | Jp | Mp | Score |
|---|---|---|---|---|---|---|---|---|
| Any full-sized candy bar | 17 | 1559 | 212 | 1788 | 0.0095078 | 0.8719239 | 0.1185682 | 86.24161 |
| Reese�s Peanut Butter Cups | 95 | 1498 | 188 | 1781 | 0.0533408 | 0.8411005 | 0.1055587 | 78.77597 |
| Kit Kat | 52 | 1447 | 287 | 1786 | 0.0291153 | 0.8101904 | 0.1606943 | 78.10750 |
| Cash, or other forms of legal tender | 65 | 1439 | 276 | 1780 | 0.0365169 | 0.8084270 | 0.1550562 | 77.19101 |
| Twix | 71 | 1413 | 286 | 1770 | 0.0401130 | 0.7983051 | 0.1615819 | 75.81921 |
| Snickers | 79 | 1399 | 292 | 1770 | 0.0446328 | 0.7903955 | 0.1649718 | 74.57627 |
| Tolberone something or other | 84 | 1320 | 350 | 1754 | 0.0478905 | 0.7525656 | 0.1995439 | 70.46750 |
| Lindt Truffle | 98 | 1277 | 367 | 1742 | 0.0562572 | 0.7330654 | 0.2106774 | 67.68083 |
| Peanut M&M�s | 121 | 1291 | 377 | 1789 | 0.0676356 | 0.7216322 | 0.2107323 | 65.39966 |
| Nestle Crunch | 102 | 1188 | 472 | 1762 | 0.0578888 | 0.6742338 | 0.2678774 | 61.63451 |
To visualize the data in an interesting way, we creature two types of labels that will describe candies with positive score and candies with negative scores.
candylabel_p2 <- counted %>% filter(Score >= 0) %>% select(Candy) %>% pull(Candy)
candylabel_n1 <- counted %>% filter(Score < 0) %>% select(Candy) %>% pull(Candy)
candylabel_p1 <- rep("", times=length(candylabel_n1))
candylabel_n2 <- rep("", times=length(candylabel_p2))
candylabel_p <- c(candylabel_p2, candylabel_p1)
candylabel_n <- c(candylabel_n2, candylabel_n1)Finally, we plot our Candy Hierarchy.
ggplot(data = counted, aes(x=reorder(Candy, Score), y=Score)) +
geom_bar(stat="identity", width = 0.8, position = position_dodge(width = 0.9), fill= 'lightblue') +
geom_text(aes(label=candylabel_p), size = 2, vjust=0, hjust="left") +
geom_text(aes(label=candylabel_n), size = 2, vjust=0, hjust="right") +
labs(title="Candy Hierarchy", x = "", y = "") +
theme(axis.text = element_blank(), axis.ticks = element_blank(), panel.background = element_blank()) +
coord_flip() +
expand_limits(y = c(-225, 150))Not accounting non-specific candy types, the top three preferred candies are: Reese’s Penaut Butter Cups, Kit Kat and Twix. It is harder to identify the least preferred candies because of the non-sensical data concentrated at the lowest scores, but we do a few classics in the negative zone such as: Tic Tac, Lollipops, Licorice
While working with this dataset, we encountered a few forms of “un-tidyness”:
- Inconsistent value entry for non survey responses
- Question values in column headers
- Missing values
We used regular expressions and the tools from the tidyverse to clean up this data and organize it for use interesting displays. We revealed the preferred candies and identified a few mostly disliked ones.
To further and stregthen this analysis, we could look to add a column of data that would describe the candies into types that we could use group by color on the above hierarchy. We could also seek to eliminate non-sensical candy types. To improve the display, we could normalize the scores so that the bars are better stretched to occupy more of the visible scale. Lastly, we could try to incorporate the demographics data collected in a seperate table at the beginning to add more depth to this analysis.