Project 2B - Candy Hierarchy

We are looking at data collected by the University of British Columbia using this survey.

The data collected is an amalgamation of questions, most of which center on candy preferences, where the survey responses offered were “JOY”, “MEH”, or “DESPAIR”. Some inconsistent location data is also supplied with every response.

Our task is to select the information of interest from the document and transform it in order to facilitate analysis. Additionally, we will use the survey responses to develop a scoring method which will allow us to rank the candy by preference and develop a candy hierarchy.

Data Import

The original data is available as an .xlsx file that can be downloaded here.

Read the .csv file from the source link
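The import code itself is not shown in this report. As a minimal sketch of the step, here is one way to read a CSV export of the survey into rows keyed by column header, using only the Python standard library. The function name `read_survey` and the header shape `Q6 | Kit Kat` are assumptions made for illustration; the two sample rows reuse IDs and responses from the tidy table shown later.

```python
import csv
import io

def read_survey(handle):
    """Read survey rows from an open text file into a list of dicts keyed by header."""
    return list(csv.DictReader(handle))

# Tiny in-memory stand-in for the real file (assumed header shape).
sample = io.StringIO(
    "Internal ID,Q6 | Kit Kat,Q6 | Rolos\n"
    "90273089,JOY,\n"
    "90276469,,MEH\n"
)
rows = read_survey(sample)
print(len(rows), "rows,", len(rows[0]), "columns")
```

For a file on disk, the same function can be given a handle opened with `open(path, newline="", encoding="utf-8")`, where `path` points at the downloaded file. Unanswered questions come back as empty strings.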

Data Evaluation

Raw Data

Here is a snapshot of the dataset. It is composed of 2460 rows and 120 columns. The data is un-tidy in multiple ways:

  • Since the survey does not enforce any format on the participants' responses, free-text fields contain all kinds of inconsistent manual entries, for example with country: USA, usa, us, Murica, United States etc.
  • Values of variables in column headings
  • Missing values

Data Transformation

First, we use regular expressions to find the column headers that contain a question and strip the question text, leaving only the candy names.
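The exact pattern used is not shown here. The sketch below illustrates the idea in Python, assuming headers of the form `Q6 | Kit Kat` (a question code, a pipe, then the candy name). Both that header shape and the helper name `candy_name` are assumptions.

```python
import re

# Assumed header shape: a question code, a pipe, then the candy name.
QUESTION_PREFIX = re.compile(r"^Q\d+\s*\|\s*")

def candy_name(header):
    """Return the header with any leading question prefix removed."""
    return QUESTION_PREFIX.sub("", header).strip()

headers = ["Internal ID", "Q6 | Kit Kat", "Q6 | Hershey's Kisses"]
cleaned = [candy_name(h) for h in headers]
print(cleaned)
```

Headers without a question prefix, such as the ID column, pass through unchanged.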

We create two tables to avoid data duplication: demographics and responses. This study focuses on the responses table only, which we tidy by gathering the candy names from the column headers into a column of their own.
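The gathering step can be sketched as follows. This is an illustrative Python version of the wide-to-long reshape, not the tidyverse code behind the report. The function name `to_long` and the toy rows are assumptions.

```python
def to_long(rows, id_column="Internal ID"):
    """Turn one wide row per respondent into one (id, candy, response) record per candy."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue
            # Empty cells become None, the stand-in here for a missing value (NA).
            long_rows.append((row[id_column], column, value or None))
    return long_rows

# Toy wide-format rows: one column per candy.
wide = [
    {"Internal ID": "90273089", "Kit Kat": "JOY", "Rolos": ""},
    {"Internal ID": "90276469", "Kit Kat": "", "Rolos": "MEH"},
]
long_rows = to_long(wide)
for record in long_rows:
    print(record)
```

Each respondent now contributes one record per candy, so two respondents and two candies yield four records.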

The data is now presented in long format. Let’s take a look at 15 random observations from this table. We see a number of NA values, which we will handle later.

Tidy Dataset
Internal ID | Candy | Response
90275176 | Hard Candy | MEH
90273228 | Hershey’s Kisses | NA
90277170 | Red M&M’s | NA
90282201 | Reggie Jackson Bar | NA
90273089 | Kit Kat | JOY
90276469 | Rolos | MEH
90273299 | Whole Wheat anything | DESPAIR
90285090 | Goo Goo Clusters | NA
90273923 | Dots | MEH
90285560 | Whole Wheat anything | DESPAIR
90277710 | Peeps | NA
90283698 | Kinder Happy Hippo | JOY
90277374 | Reese’s Pieces | JOY
90273209 | Swedish Fish | NA
90273419 | Tic Tacs | NA

Data Analysis & Visualization

We start by reshaping the data so that it can be used for this particular analysis. We are looking for the count of responses by type for each candy.

Tidy Dataset
Candy | DESPAIR | JOY | MEH | <NA>
100 Grand Bar | 85 | 873 | 755 | 747
Abstained from M&M’ing. | 693 | 218 | 607 | 942
Anonymous brown globs that come in black and orange wrappers (a.k.a. Mary Janes) | 1089 | 176 | 461 | 734
Any full-sized candy bar | 17 | 1559 | 212 | 672
Black Jacks | 793 | 92 | 617 | 958
Blue M&M’s | 120 | 1018 | 595 | 727
Bonkers (the board game) | 546 | 193 | 715 | 1006
Bonkers (the candy) | 495 | 116 | 857 | 992
Bottle Caps | 560 | 465 | 670 | 765
Box’o’Raisins | 1179 | 117 | 476 | 688
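A count table like the one above can be built from long-format records. The following is a minimal Python sketch, assuming each record is an (id, candy, response) tuple with `None` for a missing answer. The function name `count_responses` and the four toy records are illustrative only.

```python
from collections import Counter, defaultdict

def count_responses(long_rows):
    """Count responses of each type per candy; missing answers are tallied under 'NA'."""
    counts = defaultdict(Counter)
    for _respondent, candy, response in long_rows:
        counts[candy][response if response is not None else "NA"] += 1
    return counts

# Toy records, not the real survey data.
long_rows = [
    ("90273089", "Kit Kat", "JOY"),
    ("90276469", "Kit Kat", None),
    ("90276469", "Rolos", "MEH"),
    ("90273089", "Rolos", "MEH"),
]
counts = count_responses(long_rows)
for candy in sorted(counts):
    print(candy, dict(counts[candy]))
```

Response types that never occur for a candy simply read as zero, because `Counter` returns 0 for missing keys.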

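We drop the NA column and replace the counts with proportions: Dp, Jp and Mp are the shares of DESPAIR, JOY and MEH among the non-missing responses for each candy. The Score values in the table are consistent with 100 × (Jp - Dp), that is, the JOY percentage minus the DESPAIR percentage. Sorting by score, highest first, shows the preferred candies. However, we also see a few non-specific candy types, such as the top-scoring “Any full-sized candy bar”.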

Tidy Dataset
Candy | DESPAIR | JOY | MEH | total | Dp | Jp | Mp | Score
Any full-sized candy bar | 17 | 1559 | 212 | 1788 | 0.0095078 | 0.8719239 | 0.1185682 | 86.24161
Reese’s Peanut Butter Cups | 95 | 1498 | 188 | 1781 | 0.0533408 | 0.8411005 | 0.1055587 | 78.77597
Kit Kat | 52 | 1447 | 287 | 1786 | 0.0291153 | 0.8101904 | 0.1606943 | 78.10750
Cash, or other forms of legal tender | 65 | 1439 | 276 | 1780 | 0.0365169 | 0.8084270 | 0.1550562 | 77.19101
Twix | 71 | 1413 | 286 | 1770 | 0.0401130 | 0.7983051 | 0.1615819 | 75.81921
Snickers | 79 | 1399 | 292 | 1770 | 0.0446328 | 0.7903955 | 0.1649718 | 74.57627
Tolberone something or other | 84 | 1320 | 350 | 1754 | 0.0478905 | 0.7525656 | 0.1995439 | 70.46750
Lindt Truffle | 98 | 1277 | 367 | 1742 | 0.0562572 | 0.7330654 | 0.2106774 | 67.68083
Peanut M&M’s | 121 | 1291 | 377 | 1789 | 0.0676356 | 0.7216322 | 0.2107323 | 65.39966
Nestle Crunch | 102 | 1188 | 472 | 1762 | 0.0578888 | 0.6742338 | 0.2678774 | 61.63451
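The scoring formula is not spelled out in the text, but the Score column above is reproduced by 100 × (Jp - Dp), with proportions taken over the non-missing responses. The Python sketch below works under that assumption, using counts copied from the table. The function name `score_candy` is hypothetical.

```python
def score_candy(despair, joy, meh):
    """Return (Dp, Jp, Mp, score) from response counts, ignoring missing answers."""
    total = despair + joy + meh
    dp, jp, mp = despair / total, joy / total, meh / total
    # Assumed formula: JOY percentage minus DESPAIR percentage.
    return dp, jp, mp, 100 * (jp - dp)

# (DESPAIR, JOY, MEH) counts taken from the table above.
counts = {
    "Kit Kat": (52, 1447, 287),
    "Any full-sized candy bar": (17, 1559, 212),
    "Twix": (71, 1413, 286),
}
ranked = sorted(counts, key=lambda candy: score_candy(*counts[candy])[3], reverse=True)
for candy in ranked:
    print(f"{candy}: {score_candy(*counts[candy])[3]:.5f}")
```

Under this formula a candy scores 100 if every response is JOY, -100 if every response is DESPAIR, and MEH responses pull the score toward 0.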

To visualize the data in an interesting way, we create two types of labels: one for candies with positive scores and one for candies with negative scores.
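How the labels were built is not shown. One plausible reading, sketched here in Python, is to split the candies by the sign of their score so that each group can be labeled separately in the plot. The function name `split_labels`, the label format, and the Licorice score are all assumptions made for illustration.

```python
def split_labels(scores):
    """Return two label maps: one for positive scores and one for negative scores."""
    positive = {
        candy: f"{candy} (+{score:.1f})"
        for candy, score in scores.items() if score > 0
    }
    negative = {
        candy: f"{candy} ({score:.1f})"
        for candy, score in scores.items() if score < 0
    }
    # A score of exactly 0 gets no label in this sketch.
    return positive, negative

scores = {
    "Kit Kat": 78.10750,  # from the score table
    "Licorice": -20.0,    # hypothetical value, for illustration only
}
positive, negative = split_labels(scores)
print(positive)
print(negative)
```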

Finally, we plot our Candy Hierarchy.

Setting aside non-specific candy types, the top three preferred candies are Reese’s Peanut Butter Cups, Kit Kat and Twix. The least preferred candies are harder to identify because nonsensical entries are concentrated at the lowest scores, but we do see a few classics in the negative zone, such as Tic Tac, Lollipops and Licorice.

Conclusion

While working with this dataset, we encountered a few forms of “un-tidiness”:
- Inconsistent value entry for non-survey responses
- Question values in column headers
- Missing values

We used regular expressions and the tools from the tidyverse to clean up this data and organize it for use in interesting displays. We revealed the preferred candies and identified a few mostly disliked ones.

To extend and strengthen this analysis, we could add a column that classifies the candies into types, which we could then use to color the bars in the hierarchy above by group. We could also seek to eliminate nonsensical candy types. To improve the display, we could normalize the scores so that the bars stretch to occupy more of the visible scale. Lastly, we could incorporate the demographics data, set aside in a separate table at the beginning, to add more depth to this analysis.
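As a rough sketch of the normalization idea, one option is to divide every score by the largest absolute score. The longest bar then reaches the edge of the scale and the signs are preserved. This Python example is only one of several possible rescalings, and the function name `normalize` is hypothetical.

```python
def normalize(scores):
    """Scale scores so the largest magnitude becomes 100 while keeping each sign."""
    peak = max(abs(score) for score in scores.values())
    if peak == 0:
        return dict(scores)
    return {candy: 100 * score / peak for candy, score in scores.items()}

# Scores taken from the score table.
scores = {
    "Any full-sized candy bar": 86.24161,
    "Nestle Crunch": 61.63451,
}
normalized = normalize(scores)
for candy, value in normalized.items():
    print(f"{candy}: {value:.2f}")
```

Because every score is divided by the same positive number, the ranking of the candies does not change.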