Exploring FiveThirtyEight’s Halloween Candy Data Set

1. Importing packages and data

There’s a great dataset from FiveThirtyEight that includes all sorts of information about different kinds of candy. For example, is it chocolaty? Is there nougat? How does the cost compare to other candies? How many people prefer this candy over another?

We’ll run through a whirlwind tour of this dataset and wrap up by trying some modeling techniques out on it! Specifically, we’ll take a look at linear and logistic regression.

First things first, let’s get our packages and data loaded up and take a look at exactly what we’re dealing with.

# Load the packages used in this post, then the candy_rankings dataset
library(fivethirtyeight)
library(tidyverse)
library(broom)
library(corrplot)
data(candy_rankings)

# Look at the data
glimpse(candy_rankings)

## Observations: 85
## Variables: 13
## $ competitorname   <chr> "100 Grand", "3 Musketeers", "One dime", "One...
## $ chocolate        <lgl> TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, ...
## $ fruity           <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALS...
## $ caramel          <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE...
## $ peanutyalmondy   <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE...
## $ nougat           <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE...
## $ crispedricewafer <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS...
## $ hard             <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL...
## $ bar              <lgl> TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, ...
## $ pluribus         <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL...
## $ sugarpercent     <dbl> 0.732, 0.604, 0.011, 0.011, 0.906, 0.465, 0.6...
## $ pricepercent     <dbl> 0.860, 0.511, 0.116, 0.511, 0.511, 0.767, 0.7...
## $ winpercent       <dbl> 66.97173, 67.60294, 32.26109, 46.11650, 52.34...

# Look at the first few rows of the data
head(candy_rankings)

## # A tibble: 6 x 13
##   competitorname chocolate fruity caramel peanutyalmondy nougat
##   <chr>          <lgl>     <lgl>  <lgl>   <lgl>          <lgl> 
## 1 100 Grand      TRUE      FALSE  TRUE    FALSE          FALSE 
## 2 3 Musketeers   TRUE      FALSE  FALSE   FALSE          TRUE  
## 3 One dime       FALSE     FALSE  FALSE   FALSE          FALSE 
## 4 One quarter    FALSE     FALSE  FALSE   FALSE          FALSE 
## 5 Air Heads      FALSE     TRUE   FALSE   FALSE          FALSE 
## 6 Almond Joy     TRUE      FALSE  FALSE   TRUE           FALSE 
## # ... with 7 more variables: crispedricewafer <lgl>, hard <lgl>,
## #   bar <lgl>, pluribus <lgl>, sugarpercent <dbl>, pricepercent <dbl>,
## #   winpercent <dbl>

2. Explore the distributions of categorical variables

Let’s get started by taking a look at the distributions of each of these binary categorical variables. There are quite a few of them, so we’ll have to do some data wrangling to get them in shape for plotting. We’ll explore these by making a bar chart showing the breakdown of each column. This lets us get a sense of the proportion of TRUEs and FALSEs in each column.

# gather() the categorical variables to make them easier to plot
candy_rankings_long <- gather(candy_rankings, key = feature, value = value, chocolate:pluribus)

# Make a bar plot showing the distribution of each variable
ggplot(candy_rankings_long, aes(x=value)) + geom_bar() + facet_wrap(~feature)
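
To see what the reshaping step produces, here is a minimal base R sketch on a toy data frame. The values are the first six rows of chocolate and fruity from the head() output above; the names toy, toy_long and counts are made up for illustration, and the long table is built by hand rather than with gather().

```r
# Toy data: first six rows of two logical columns from head(candy_rankings)
toy <- data.frame(
  chocolate = c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE),
  fruity    = c(FALSE, FALSE, FALSE, FALSE, TRUE, FALSE)
)

# Long format: one row per (candy, feature) pair, similar to what gather() returns
toy_long <- data.frame(
  feature = rep(names(toy), each = nrow(toy)),
  value   = unlist(toy, use.names = FALSE)
)

# Counts of FALSE and TRUE per feature: these are the bar heights in each facet
counts <- table(toy_long$feature, toy_long$value)
print(counts)
```

Each facet of the bar chart is one row of this table, which is why the data has to be in long format before plotting.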

3. Taking a look at pricepercent

Next, we’ll look at the pricepercent variable. This variable records the percentile rank of the candy’s price against all the other candies in the dataset. Let’s see which is the most expensive and which is the least expensive by making a lollipop chart. One of the most interesting aspects of this chart is that a lot of the candies share the same ranking, so it looks like quite a few of them are the same price.

# Make a lollipop chart of pricepercent
ggplot(candy_rankings, aes(x = reorder(competitorname, pricepercent), y = pricepercent)) +
  geom_segment(aes(xend = reorder(competitorname, pricepercent), yend = 0)) +
  geom_point() +
  coord_flip()
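
The ties visible in the chart can also be counted directly. The sketch below uses only the first six pricepercent values shown in the glimpse() output, so it illustrates the idea rather than the full column; price, price_counts and tied are names invented for this example.

```r
# First six pricepercent values from the glimpse() output (not the full column)
price <- c(0.860, 0.511, 0.116, 0.511, 0.511, 0.767)

# How often does each distinct value occur?
price_counts <- table(price)
print(price_counts)

# Values shared by more than one candy
tied <- names(price_counts[price_counts > 1])
print(tied)
```

Running the same table() call on candy_rankings$pricepercent should show how many candies share each price rank.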

4. Exploring winpercent (part i)

Moving on, we’ll take a look at another numerical variable in the dataset: winpercent. This variable records the percentage of people who prefer this candy over another randomly chosen candy from the dataset. We’ll start with a histogram! The distribution of rankings looks pretty symmetrical, and seems to center on about 45%.

# Plot a histogram of winpercent
ggplot(candy_rankings, aes(winpercent)) + geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

5. Exploring winpercent (part ii)

Now that we’ve looked at the histogram, let’s make another lollipop chart to visualize the rankings. It looks like Reese’s Peanut Butter Cups are the all-time favorite out of this set of candies!

# Make a lollipop chart of winpercent
ggplot(candy_rankings, aes(x = reorder(competitorname, winpercent), y = winpercent)) +
  geom_segment(aes(xend = reorder(competitorname, winpercent), yend = 0)) +
  geom_point() +
  coord_flip()
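
Both lollipop charts rely on reorder() to sort the candies by a numeric column. Here is a small sketch of what that call does, using three candies and their winpercent values from the output in section 1; the vectors candy, win and candy_ordered are made up for this illustration.

```r
# Three candies and their winpercent values from the output in section 1
candy <- c("100 Grand", "3 Musketeers", "One dime")
win   <- c(66.97173, 67.60294, 32.26109)

# reorder() returns a factor whose levels are sorted by the second argument
candy_ordered <- reorder(candy, win)
print(levels(candy_ordered))
```

Because ggplot2 draws factor levels in order, and coord_flip() puts the x axis on the vertical, the candy with the highest value ends up at the top of the chart.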

6. Exploring the correlation structure

Now that we’ve explored the dataset one variable at a time, we’ll see how the variables relate to one another. This is important as we get ready to model the data because it gives us some intuition about which variables might be useful explanatory variables. We’ll use the corrplot package to plot the correlation matrix. Taking a look at this plot, it looks like chocolaty candies are almost never fruity. The plot also lets us check for multicollinearity, that is, predictors so strongly correlated with each other that a regression can’t separate their effects. It looks like we’re good though!

# Plot the correlation matrix using corrplot()
cor(candy_rankings[, 2:13]) %>%
  corrplot()
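
Correlating TRUE/FALSE columns works because R treats them as 1 and 0. As a rough sketch, here is the correlation between chocolate and fruity for just the first six candies shown in section 1. With so few rows the number itself is not meaningful for the whole dataset; only the sign and the mechanics matter here.

```r
# First six rows of two logical columns from head(candy_rankings)
chocolate <- c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE)
fruity    <- c(FALSE, FALSE, FALSE, FALSE, TRUE, FALSE)

# Pearson correlation of the 0/1 versions of the two columns
r <- cor(as.numeric(chocolate), as.numeric(fruity))
print(r)
```

A negative value means the two features tend not to occur together, which is the pattern the corrplot shows for chocolate and fruity.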

7. Fitting a linear model of winpercent

Let’s dive into the deep end of modeling by creating a linear model of winpercent using all the other variables except competitorname. Because competitorname is a categorical variable with a unique value in every row of the dataset, including it would give the model a separate coefficient for every candy. Such a model reproduces the data exactly and leaves no degrees of freedom to estimate the effect of anything else. Moreover, the names don’t carry any information about the attributes of the candy that our model could use.

Let’s fit this model, then we can dive into exploring it shortly. Maybe this will give us an idea of why people tend to prefer one candy over another!

# Fit a linear model of winpercent explained by all variables 
# except competitorname
win_mod <- lm(winpercent ~ . -competitorname, data = candy_rankings)
win_mod

## 
## Call:
## lm(formula = winpercent ~ . - competitorname, data = candy_rankings)
## 
## Coefficients:
##          (Intercept)         chocolateTRUE            fruityTRUE  
##              34.5340               19.7481                9.4223  
##          caramelTRUE    peanutyalmondyTRUE            nougatTRUE  
##               2.2245               10.0707                0.8043  
## crispedricewaferTRUE              hardTRUE               barTRUE  
##               8.9190               -6.1653                0.4415  
##         pluribusTRUE          sugarpercent          pricepercent  
##              -0.8545                9.0868               -5.9284
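
To make the coefficients concrete, here is a hand calculation of the model’s prediction for the first candy in the data, 100 Grand. The coefficients are copied from the output above and the feature values from section 1, so the result is only as precise as the printed rounding. The names coefs, x and pred are made up for this sketch.

```r
# Coefficients that apply to 100 Grand, copied from the printed model
coefs <- c(intercept = 34.5340, chocolate = 19.7481, caramel = 2.2245,
           crispedricewafer = 8.9190, bar = 0.4415,
           sugarpercent = 9.0868, pricepercent = -5.9284)

# 100 Grand: chocolate, caramel, crispedricewafer and bar are TRUE (coded as 1),
# sugarpercent is 0.732 and pricepercent is 0.860; all other features are FALSE
x <- c(1, 1, 1, 1, 1, 0.732, 0.860)

# The prediction is the sum of each coefficient times its feature value
pred <- sum(coefs * x)
print(round(pred, 2))
```

The result, about 67.42, agrees with the .fitted value that augment() reports for this candy in section 8.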

8. Evaluating the linear model

Let’s see how we did! We’ll take a look at the results of our linear model and run some basic diagnostics to make sure the output is reliable.

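Taking a look at the coefficients, we can get a sense of which features are associated with people choosing one candy over another. Chocolate has the largest and most significant coefficient: holding everything else fixed, a chocolaty candy is expected to score about 20 percentage points higher. It also looks like people who took this survey really like peanuts and almonds, and fruity candies get a smaller but still significant boost. Keep in mind that these are associations in survey data, so they don’t prove what causes a preference.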

# Take a look at the summary
summary(win_mod)

## 
## Call:
## lm(formula = winpercent ~ . - competitorname, data = candy_rankings)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -20.2244  -6.6247   0.1986   6.8420  23.8680 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           34.5340     4.3199   7.994 1.44e-11 ***
## chocolateTRUE         19.7481     3.8987   5.065 2.96e-06 ***
## fruityTRUE             9.4223     3.7630   2.504  0.01452 *  
## caramelTRUE            2.2245     3.6574   0.608  0.54493    
## peanutyalmondyTRUE    10.0707     3.6158   2.785  0.00681 ** 
## nougatTRUE             0.8043     5.7164   0.141  0.88849    
## crispedricewaferTRUE   8.9190     5.2679   1.693  0.09470 .  
## hardTRUE              -6.1653     3.4551  -1.784  0.07852 .  
## barTRUE                0.4415     5.0611   0.087  0.93072    
## pluribusTRUE          -0.8545     3.0401  -0.281  0.77945    
## sugarpercent           9.0868     4.6595   1.950  0.05500 .  
## pricepercent          -5.9284     5.5132  -1.075  0.28578    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.7 on 73 degrees of freedom
## Multiple R-squared:  0.5402, Adjusted R-squared:  0.4709 
## F-statistic: 7.797 on 11 and 73 DF,  p-value: 9.504e-09
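
A couple of the numbers in this summary can be reproduced by hand, which helps in reading the table. This sketch recomputes the t value for chocolate and the adjusted R-squared from the printed (rounded) values; the variable names are made up for the example.

```r
# t value = estimate / standard error (chocolateTRUE row)
t_chocolate <- 19.7481 / 3.8987
print(round(t_chocolate, 3))

# Adjusted R-squared from R-squared, n = 85 candies and p = 11 predictors
r2 <- 0.5402
n  <- 85
p  <- 11
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 4))
```

The denominator n - p - 1 equals 73, the residual degrees of freedom reported in the summary.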

# Add the fitted values and residuals to the data with augment()
augment(win_mod)

##    winpercent              competitorname chocolate fruity caramel
## 1    66.97173                   100 Grand      TRUE  FALSE    TRUE
## 2    67.60294                3 Musketeers      TRUE  FALSE   FALSE
## 3    32.26109                    One dime     FALSE  FALSE   FALSE
## 4    46.11650                 One quarter     FALSE  FALSE   FALSE
## 5    52.34146                   Air Heads     FALSE   TRUE   FALSE
## 6    50.34755                  Almond Joy      TRUE  FALSE   FALSE
## 7    56.91455                   Baby Ruth      TRUE  FALSE    TRUE
## 8    23.41782          Boston Baked Beans     FALSE  FALSE   FALSE
## 9    38.01096                  Candy Corn     FALSE  FALSE   FALSE
## 10   34.51768          Caramel Apple Pops     FALSE   TRUE    TRUE
## 11   38.97504             Charleston Chew      TRUE  FALSE   FALSE
## 12   36.01763  Chewey Lemonhead Fruit Mix     FALSE   TRUE   FALSE
## 13   24.52499                    Chiclets     FALSE   TRUE   FALSE
## 14   42.27208                        Dots     FALSE   TRUE   FALSE
## 15   39.46056                    Dum Dums     FALSE   TRUE   FALSE
## 16   43.08892                 Fruit Chews     FALSE   TRUE   FALSE
## 17   39.18550                     Fun Dip     FALSE   TRUE   FALSE
## 18   46.78335                  Gobstopper     FALSE   TRUE   FALSE
## 19   57.11974           Haribo Gold Bears     FALSE   TRUE   FALSE
## 20   34.15896           Haribo Happy Cola     FALSE  FALSE   FALSE
## 21   51.41243           Haribo Sour Bears     FALSE   TRUE   FALSE
## 22   42.17877          Haribo Twin Snakes     FALSE   TRUE   FALSE
## 23   55.37545            Hershey's Kisses      TRUE  FALSE   FALSE
## 24   62.28448           Hershey's Krackel      TRUE  FALSE   FALSE
## 25   56.49050    Hershey's Milk Chocolate      TRUE  FALSE   FALSE
## 26   59.23612      Hershey's Special Dark      TRUE  FALSE   FALSE
## 27   28.12744                  Jawbusters     FALSE   TRUE   FALSE
## 28   57.21925                Junior Mints      TRUE  FALSE   FALSE
## 29   76.76860                     Kit Kat      TRUE  FALSE   FALSE
## 30   41.38956                 Laffy Taffy     FALSE   TRUE   FALSE
## 31   39.14106                   Lemonhead     FALSE   TRUE   FALSE
## 32   52.91139 Lifesavers big ring gummies     FALSE   TRUE   FALSE
## 33   71.46505         Peanut butter M&M's      TRUE  FALSE   FALSE
## 34   66.57458                       M&M's      TRUE  FALSE   FALSE
## 35   46.41172                  Mike & Ike     FALSE   TRUE   FALSE
## 36   55.06407                   Milk Duds      TRUE  FALSE    TRUE
## 37   73.09956                   Milky Way      TRUE  FALSE    TRUE
## 38   60.80070          Milky Way Midnight      TRUE  FALSE    TRUE
## 39   64.35334    Milky Way Simply Caramel      TRUE  FALSE    TRUE
## 40   47.82975                      Mounds      TRUE  FALSE   FALSE
## 41   54.52645                 Mr Good Bar      TRUE  FALSE   FALSE
## 42   55.35405                       Nerds     FALSE   TRUE   FALSE
## 43   70.73564         Nestle Butterfinger      TRUE  FALSE   FALSE
## 44   66.47068               Nestle Crunch      TRUE  FALSE   FALSE
## 45   22.44534                   Nik L Nip     FALSE   TRUE   FALSE
## 46   39.44680                 Now & Later     FALSE   TRUE   FALSE
## 47   46.29660                      Payday     FALSE  FALSE   FALSE
## 48   69.48379                 Peanut M&Ms      TRUE  FALSE   FALSE
## 49   37.72234                Pixie Sticks     FALSE  FALSE   FALSE
## 50   41.26551                   Pop Rocks     FALSE   TRUE   FALSE
## 51   37.34852                   Red vines     FALSE   TRUE   FALSE
## 52   81.86626          Reese's Miniatures      TRUE  FALSE   FALSE
## 53   84.18029   Reese's Peanut Butter cup      TRUE  FALSE   FALSE
## 54   73.43499              Reese's pieces      TRUE  FALSE   FALSE
## 55   72.88790 Reese's stuffed with pieces      TRUE  FALSE   FALSE
## 56   35.29076                    Ring pop     FALSE   TRUE   FALSE
## 57   65.71629                        Rolo      TRUE  FALSE    TRUE
## 58   29.70369           Root Beer Barrels     FALSE  FALSE   FALSE
## 59   42.84914                       Runts     FALSE   TRUE   FALSE
## 60   34.72200                     Sixlets      TRUE  FALSE   FALSE
## 61   63.08514           Skittles original     FALSE   TRUE   FALSE
## 62   55.10370          Skittles wildberry     FALSE   TRUE   FALSE
## 63   37.88719             Nestle Smarties      TRUE  FALSE   FALSE
## 64   45.99583              Smarties candy     FALSE   TRUE   FALSE
## 65   76.67378                    Snickers      TRUE  FALSE    TRUE
## 66   59.52925            Snickers Crisper      TRUE  FALSE    TRUE
## 67   59.86400             Sour Patch Kids     FALSE   TRUE   FALSE
## 68   52.82595       Sour Patch Tricksters     FALSE   TRUE   FALSE
## 69   67.03763                   Starburst     FALSE   TRUE   FALSE
## 70   34.57899         Strawberry bon bons     FALSE   TRUE   FALSE
## 71   33.43755                Sugar Babies     FALSE  FALSE    TRUE
## 72   32.23100                 Sugar Daddy     FALSE  FALSE    TRUE
## 73   27.30386                Super Bubble     FALSE   TRUE   FALSE
## 74   54.86111                Swedish Fish     FALSE   TRUE   FALSE
## 75   48.98265                 Tootsie Pop      TRUE   TRUE   FALSE
## 76   43.06890        Tootsie Roll Juniors      TRUE  FALSE   FALSE
## 77   45.73675        Tootsie Roll Midgies      TRUE  FALSE   FALSE
## 78   49.65350     Tootsie Roll Snack Bars      TRUE  FALSE   FALSE
## 79   47.17323           Trolli Sour Bites     FALSE   TRUE   FALSE
## 80   81.64291                        Twix      TRUE  FALSE    TRUE
## 81   45.46628                   Twizzlers     FALSE   TRUE   FALSE
## 82   39.01190                    Warheads     FALSE   TRUE   FALSE
## 83   44.37552        Welch's Fruit Snacks     FALSE   TRUE   FALSE
## 84   41.90431  Werther's Original Caramel     FALSE  FALSE    TRUE
## 85   49.52411                    Whoppers      TRUE  FALSE   FALSE
##    peanutyalmondy nougat crispedricewafer  hard   bar pluribus
## 1           FALSE  FALSE             TRUE FALSE  TRUE    FALSE
## 2           FALSE   TRUE            FALSE FALSE  TRUE    FALSE
## 3           FALSE  FALSE            FALSE FALSE FALSE    FALSE
## 4           FALSE  FALSE            FALSE FALSE FALSE    FALSE
## 5           FALSE  FALSE            FALSE FALSE FALSE    FALSE
## 6            TRUE  FALSE            FALSE FALSE  TRUE    FALSE
## 7            TRUE   TRUE            FALSE FALSE  TRUE    FALSE
## 8            TRUE  FALSE            FALSE FALSE FALSE     TRUE
## 9           FALSE  FALSE            FALSE FALSE FALSE     TRUE
## 10          FALSE  FALSE            FALSE FALSE FALSE    FALSE
## 11          FALSE   TRUE            FALSE FALSE  TRUE    FALSE
## 12          FALSE  FALSE            FALSE FALSE FALSE     TRUE
## 13          FALSE  FALSE            FALSE FALSE FALSE     TRUE
## 14          FALSE  FALSE            FALSE FALSE FALSE     TRUE
## 15          FALSE  FALSE            FALSE  TRUE FALSE    FALSE
## 16          FALSE  FALSE            FALSE FALSE FALSE     TRUE
## 17          FALSE  FALSE            FALSE  TRUE FALSE    FALSE
## 18          FALSE  FALSE            FALSE  TRUE FALSE     TRUE
## 19          FALSE  FALSE            FALSE FALSE FALSE     TRUE
## 20          FALSE  FALSE            FALSE FALSE FALSE     TRUE
## 21          FALSE  FALSE            FALSE FALSE FALSE     TRUE
## 22          FALSE  FALSE            FALSE FALSE FALSE     TRUE
## 23          FALSE  FALSE            FALSE FALSE FALSE     TRUE
## 24          FALSE  FALSE             TRUE FALSE  TRUE    FALSE
## 25          FALSE  FALSE            FALSE FALSE  TRUE    FALSE
## 26          FALSE  FALSE            FALSE FALSE  TRUE    FALSE
## 27          FALSE  FALSE            FALSE  TRUE FALSE     TRUE
## 28          FALSE  FALSE            FALSE FALSE FALSE     TRUE
## 29          FALSE  FALSE             TRUE FALSE  TRUE    FALSE
## 30          FALSE  FALSE            FALSE FALSE FALSE    FALSE
## 31          FALSE  FALSE            FALSE  TRUE FALSE    FALSE
## 32          FALSE  FALSE            FALSE FALSE FALSE    FALSE
## 33           TRUE  FALSE            FALSE FALSE FALSE     TRUE
## 34          FALSE  FALSE            FALSE FALSE FALSE     TRUE
## 35          FALSE  FALSE            FALSE FALSE FALSE     TRUE
## 36          FALSE  FALSE            FALSE FALSE FALSE     TRUE
## 37          FALSE   TRUE            FALSE FALSE  TRUE    FALSE
## 38          FALSE   TRUE            FALSE FALSE  TRUE    FALSE
## 39          FALSE  FALSE            FALSE FALSE  TRUE    FALSE
## 40          FALSE  FALSE            FALSE FALSE  TRUE    FALSE
## 41           TRUE  FALSE            FALSE FALSE  TRUE    FALSE
## 42          FALSE  FALSE            FALSE  TRUE FALSE     TRUE
## 43           TRUE  FALSE            FALSE FALSE  TRUE    FALSE
## 44          FALSE  FALSE             TRUE FALSE  TRUE    FALSE
## 45          FALSE  FALSE            FALSE FALSE FALSE     TRUE
## 46          FALSE  FALSE            FALSE FALSE FALSE     TRUE
## 47           TRUE   TRUE            FALSE FALSE  TRUE    FALSE
## 48           TRUE  FALSE            FALSE FALSE FALSE     TRUE
## 49          FALSE  FALSE            FALSE FALSE FALSE     TRUE
## 50          FALSE  FALSE            FALSE  TRUE FALSE     TRUE
## 51          FALSE  FALSE            FALSE FALSE FALSE     TRUE
## 52           TRUE  FALSE            FALSE FALSE FALSE    FALSE
## 53           TRUE  FALSE            FALSE FALSE FALSE    FALSE
## 54           TRUE  FALSE            FALSE FALSE FALSE     TRUE
## 55           TRUE  FALSE            FALSE FALSE FALSE    FALSE
## 56          FALSE  FALSE            FALSE  TRUE FALSE    FALSE
## 57          FALSE  FALSE            FALSE FALSE FALSE     TRUE
## 58          FALSE  FALSE            FALSE  TRUE FALSE     TRUE
## 59          FALSE  FALSE            FALSE  TRUE FALSE     TRUE
## 60          FALSE  FALSE            FALSE FALSE FALSE     TRUE
## 61          FALSE  FALSE            FALSE FALSE FALSE     TRUE
## 62          FALSE  FALSE            FALSE FALSE FALSE     TRUE
## 63          FALSE  FALSE            FALSE FALSE FALSE     TRUE
## 64          FALSE  FALSE            FALSE  TRUE FALSE     TRUE
## 65           TRUE   TRUE            FALSE FALSE  TRUE    FALSE
## 66           TRUE  FALSE             TRUE FALSE  TRUE    FALSE
## 67          FALSE  FALSE            FALSE FALSE FALSE     TRUE
## 68          FALSE  FALSE            FALSE FALSE FALSE     TRUE
## 69          FALSE  FALSE            FALSE FALSE FALSE     TRUE
## 70          FALSE  FALSE            FALSE  TRUE FALSE     TRUE
## 71          FALSE  FALSE            FALSE FALSE FALSE     TRUE
## 72          FALSE  FALSE            FALSE FALSE FALSE    FALSE
## 73          FALSE  FALSE            FALSE FALSE FALSE    FALSE
## 74          FALSE  FALSE            FALSE FALSE FALSE     TRUE
## 75          FALSE  FALSE            FALSE  TRUE FALSE    FALSE
## 76          FALSE  FALSE            FALSE FALSE FALSE    FALSE
## 77          FALSE  FALSE            FALSE FALSE FALSE     TRUE
## 78          FALSE  FALSE            FALSE FALSE  TRUE    FALSE
## 79          FALSE  FALSE            FALSE FALSE FALSE     TRUE
## 80          FALSE  FALSE             TRUE FALSE  TRUE    FALSE
## 81          FALSE  FALSE            FALSE FALSE FALSE    FALSE
## 82          FALSE  FALSE            FALSE  TRUE FALSE    FALSE
## 83          FALSE  FALSE            FALSE FALSE FALSE     TRUE
## 84          FALSE  FALSE            FALSE  TRUE FALSE    FALSE
## 85          FALSE  FALSE             TRUE FALSE FALSE     TRUE
##    sugarpercent pricepercent  .fitted  .se.fit      .resid       .hat
## 1         0.732        0.860 67.42016 4.589589  -0.4484311 0.18388570
## 2         0.604        0.511 57.98693 4.929578   9.6160080 0.21213862
## 3         0.011        0.116 33.94624 4.180558  -1.6851569 0.15256987
## 4         0.011        0.511 31.60454 4.447541  14.5119648 0.17267932
## 5         0.906        0.511 49.15952 4.031590   3.1819498 0.14189038
## 6         0.465        0.767 64.47257 4.235507 -14.1250194 0.15660696
## 7         0.604        0.767 68.76444 4.898842 -11.8498901 0.20950149
## 8         0.313        0.511 43.56493 4.478461 -20.1471075 0.17508860
## 9         0.906        0.325 39.98537 4.296181  -1.9744058 0.16112595
## 10        0.604        0.325 49.74247 4.479715 -15.2247879 0.17518668
## 11        0.604        0.511 57.98693 4.929578 -19.0118910 0.21213862
## 12        0.732        0.511 46.72392 2.549422 -10.7062907 0.05673929
## 13        0.046        0.325 41.59307 2.882302 -17.0680866 0.07252360
## 14        0.732        0.511 46.72392 2.549422  -4.4518427 0.05673929
## 15        0.732        0.034 44.24092 3.822711  -4.7803640 0.12756847
## 16        0.127        0.034 44.05426 2.942445  -0.9653315 0.07558177
## 17        0.732        0.325 42.51577 3.448166  -3.3302619 0.10379508
## 18        0.906        0.453 42.48353 3.411564   4.2998140 0.10160322
## 19        0.465        0.465 44.57046 2.272905  12.5492824 0.04509860
## 20        0.465        0.465 35.14814 3.480918  -0.9891775 0.10577618
## 21        0.465        0.465 44.57046 2.272905   6.8419724 0.04509860
## 22        0.465        0.465 44.57046 2.272905  -2.3916856 0.04509860
## 23        0.127        0.093 54.03023 3.705246   1.3452269 0.11984898
## 24        0.430        0.918 62.10763 4.524033   0.1768533 0.17867008
## 25        0.430        0.918 53.18866 4.105998   3.3018431 0.14717627
## 26        0.430        0.918 53.18866 4.105998   6.0474641 0.14717627
## 27        0.093        0.511 34.75215 4.149031  -6.6247118 0.15027737
## 28        0.197        0.511 52.18825 3.256920   5.0310044 0.09260077
## 29        0.313        0.511 63.45732 4.609345  13.3112806 0.18547216
## 30        0.220        0.116 45.26770 3.194444  -3.8781414 0.08908217
## 31        0.046        0.104 37.59242 3.712883   1.5486405 0.12034358
## 32        0.267        0.279 44.72845 3.090762   8.1829388 0.08339335
## 33        0.825        0.651 67.13545 3.831854   4.3295998 0.12817937
## 34        0.825        0.651 57.06476 3.427442   9.5098233 0.10255116
## 35        0.872        0.325 49.09874 2.974721  -2.6870246 0.07724900
## 36        0.302        0.511 55.36684 4.464249  -0.3027649 0.17397917
## 37        0.604        0.651 59.38144 4.609916  13.7181175 0.18551813
## 38        0.732        0.441 61.78950 4.737852  -0.9887993 0.19595818
## 39        0.965        0.860 60.61840 5.058007   3.7349381 0.22333620
## 40        0.313        0.860 52.46935 4.079075  -4.6395973 0.14525253
## 41        0.313        0.918 62.19619 4.366107  -7.6697440 0.16641372
## 42        0.848        0.325 42.71533 3.348410  12.6387142 0.09787629
## 43        0.604        0.767 65.73563 4.284363   5.0000159 0.16024071
## 44        0.313        0.767 61.93966 4.472589   4.5310214 0.17462982
## 45        0.197        0.976 39.10581 4.526811 -16.6604714 0.17888959
## 46        0.220        0.325 43.17417 2.441040  -3.7273714 0.05201758
## 47        0.465        0.767 45.52883 6.222662   0.7677680 0.33802802
## 48        0.593        0.651 65.02732 3.640081   4.4564668 0.11567046
## 49        0.093        0.023 34.38820 3.918178   3.3341405 0.13401969
## 50        0.604        0.837 37.46284 4.114747   3.8026706 0.14780416
## 51        0.581        0.116 47.69352 2.732204 -10.3449980 0.06516681
## 52        0.034        0.279 63.00767 4.407513  18.8585861 0.16958503
## 53        0.720        0.651 67.03584 4.221377  17.1444500 0.15556384
## 54        0.406        0.651 63.32810 3.714337  10.1068935 0.12043784
## 55        0.988        0.651 69.47109 4.724808   3.4168090 0.19488066
## 56        0.732        0.965 38.72162 4.900162  -3.4308597 0.20961449
## 57        0.860        0.860 58.36825 4.489595   7.3480335 0.17596029
## 58        0.732        0.069 33.75661 5.105617  -4.0529148 0.22756042
## 59        0.872        0.279 43.20612 3.439858  -0.3569746 0.10329552
## 60        0.220        0.081 54.94644 3.660922 -20.2244364 0.11699880
## 61        0.941        0.220 50.34821 3.408857  12.7369349 0.10144202
## 62        0.941        0.220 50.34821 3.408857   4.7554899 0.10144202
## 63        0.267        0.976 50.06763 4.343769 -12.1804426 0.16471523
## 64        0.267        0.116 38.67495 3.507512   7.3208769 0.10739862
## 65        0.546        0.651 68.92510 4.930439   7.7486870 0.21221278
## 66        0.604        0.651 77.56677 5.821144 -18.0375152 0.29581285
## 67        0.069        0.116 43.04110 2.890821  16.8229004 0.07295292
## 68        0.069        0.116 43.04110 2.890821   9.7848494 0.07295292
## 69        0.151        0.220 43.16966 2.600091  23.8679655 0.05901703
## 70        0.569        0.058 41.76300 3.474481  -7.1840066 0.10538534
## 71        0.965        0.767 40.12563 4.837866  -6.6880828 0.20431867
## 72        0.418        0.325 38.63001 4.579474  -6.3990143 0.18307608
## 73        0.162        0.116 44.74067 3.223612 -17.4368011 0.09071640
## 74        0.604        0.755 44.11429 3.019560  10.7468184 0.07959532
## 75        0.604        0.325 61.10073 5.063631 -12.1180770 0.22383315
## 76        0.313        0.511 54.09681 3.761967 -11.0279125 0.12354649
## 77        0.174        0.011 54.94343 3.890275  -9.2066825 0.13211770
## 78        0.465        0.325 57.02221 4.480837  -7.3687098 0.17527446
## 79        0.313        0.255 44.43423 2.317720   2.7390035 0.04689453
## 80        0.546        0.906 65.45731 4.604338  16.1856004 0.18506943
## 81        0.220        0.116 45.26770 3.194444   0.1985836 0.08908217
## 82        0.093        0.116 37.94835 3.616019   1.0635450 0.11414625
## 83        0.313        0.313 44.09038 2.289832   0.2851384 0.04577280
## 84        0.186        0.267 30.70040 5.423620  11.2039093 0.25679046
## 85        0.872        0.848 65.24292 6.028675 -15.7188093 0.31728098
##      .sigma      .cooksd  .std.resid
## 1  10.77677 4.038839e-05 -0.04637889
## 2  10.70103 2.298946e-02  1.01220873
## 3  10.77477 4.388951e-04 -0.17103640
## 4  10.61163 3.865130e-02  1.49069723
## 5  10.76932 1.419297e-03  0.32093871
## 6  10.62340 3.195569e-02 -1.43705772
## 7  10.66185 3.424781e-02 -1.24527125
## 8  10.45505 7.597798e-02 -2.07257005
## 9  10.77393 6.493276e-04 -0.20141333
## 10 10.59429 4.342215e-02 -1.56629509
## 11 10.47713 8.986485e-02 -2.00124647
## 12 10.69834 5.317623e-03 -1.02996704
## 13 10.57259 1.786744e-02 -1.65589797
## 14 10.76338 9.194309e-04 -0.42827636
## 15 10.76004 2.786260e-03 -0.47818446
## 16 10.77628 5.995869e-05 -0.09380855
## 17 10.76895 1.042649e-03 -0.32868122
## 18 10.76366 1.693127e-03  0.42385347
## 19 10.67013 5.666344e-03  1.19988547
## 20 10.77622 9.415925e-05 -0.09773527
## 21 10.74529 1.684332e-03  0.65418747
## 22 10.77307 2.058132e-04 -0.22867832
## 23 10.77560 2.036714e-04  0.13397295
## 24 10.77690 6.026453e-06  0.01823284
## 25 10.76869 1.604912e-03  0.33406191
## 26 10.74926 5.383750e-03  0.61184839
## 27 10.74360 6.644959e-03 -0.67147302
## 28 10.75894 2.070839e-03  0.49346422
## 29 10.63583 3.603509e-02  1.37805655
## 30 10.76628 1.174624e-03 -0.37965070
## 31 10.77517 2.713421e-04  0.15427454
## 32 10.72975 4.835074e-03  0.79857915
## 33 10.76306 2.299738e-03  0.43324579
## 34 10.71180 8.376936e-03  0.93792425
## 35 10.77189 4.765256e-04 -0.26135424
## 36 10.77686 1.700378e-05 -0.03112506
## 37 10.62700 3.828526e-02  1.42021461
## 38 10.77615 2.155964e-04 -0.10303126
## 39 10.76535 3.757326e-03  0.39597435
## 40 10.76069 3.113348e-03 -0.46887982
## 41 10.73136 1.024870e-02 -0.78488450
## 42 10.66222 1.397559e-02  1.24328304
## 43 10.75773 4.132616e-03  0.50979332
## 44 10.76089 3.828531e-03  0.46598495
## 45 10.55685 5.357666e-02 -1.71785568
## 46 10.76748 5.850244e-04 -0.35768662
## 47 10.77635 3.307902e-04  0.08816788
## 48 10.76245 2.136951e-03  0.44277570
## 49 10.76865 1.445238e-03  0.33475730
## 50 10.76599 2.140939e-03  0.38487453
## 51 10.70291 5.805479e-03 -0.99968578
## 52 10.49733 6.362599e-02  1.93357807
## 53 10.55025 4.664888e-02  1.74317229
## 54 10.70183 1.156869e-02  1.00689593
## 55 10.76758 2.553340e-03  0.35578755
## 56 10.76733 2.873217e-03 -0.36056506
## 57 10.73462 1.017840e-02  0.75630544
## 58 10.76322 4.557448e-03 -0.43085921
## 59 10.77684 1.190903e-05 -0.03522190
## 60 10.47418 4.465095e-02 -2.01092535
## 61 10.65995 1.482773e-02  1.25542867
## 62 10.76070 2.066975e-03  0.46872960
## 63 10.66186 2.548061e-02 -1.24521963
## 64 10.73817 5.255680e-03  0.72399321
## 65 10.72770 1.493582e-02  0.81568752
## 66 10.47498 1.411931e-01 -2.00831961
## 67 10.57838 1.747672e-02  1.63248855
## 68 10.71017 5.912436e-03  0.94951847
## 69 10.37949 2.762255e-02  2.29892439
## 70 10.73969 4.943798e-03 -0.70965765
## 71 10.74064 1.050152e-02 -0.70053885
## 72 10.74458 8.171723e-03 -0.66148872
## 73 10.55927 2.426833e-02 -1.70850913
## 74 10.69577 7.894213e-03  1.04662396
## 75 10.65432 3.969179e-02 -1.28515743
## 76 10.68714 1.422913e-02 -1.10059881
## 77 10.71381 1.081595e-02 -0.92336396
## 78 10.73442 1.017892e-02 -0.75811816
## 79 10.77186 2.817374e-04  0.26213335
## 80 10.56775 5.310925e-02  1.67520776
## 81 10.77690 3.079912e-06  0.01944035
## 82 10.77611 1.196929e-04  0.10557839
## 83 10.77687 2.973271e-06  0.02727282
## 84 10.66754 4.245361e-02  1.21426678
## 85 10.54114 1.223541e-01 -1.77745580

# Plot the residuals vs the fitted values
ggplot(augment(win_mod), aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0)
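
What this plot checks can also be looked at numerically. The sketch below uses the built-in mtcars data as a stand-in (an assumption made so the example runs without the candy data) and base R’s fitted() and resid(), which return the same quantities as the .fitted and .resid columns from augment().

```r
# Stand-in model on a built-in dataset (not the candy data)
fit <- lm(mpg ~ wt + hp, data = mtcars)

fitted_vals <- fitted(fit)
resids      <- resid(fit)

# Fitted value plus residual gives back the observed response
print(all.equal(unname(fitted_vals + resids), mtcars$mpg))

# For least squares with an intercept, residuals average to zero and are
# uncorrelated with the fitted values
print(mean(resids))
print(cor(fitted_vals, resids))
```

These two properties hold by construction, so the plot is not a test of them. What to look for instead is a pattern, such as a curve or a funnel shape, which would suggest the linear model is missing something.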

9. Fit a logistic regression model of chocolate

Now let’s try out logistic regression! We’ll be trying to predict if a candy is chocolaty or not based on all the other features in the dataset. A logistic regression is a great choice for this particular modeling task because the variable we’re trying to predict is either TRUE or FALSE. The logistic regression model will output a probability that we can use to make our decision. This model outputs a warning because a few of the features (like crispedricewafer) are only ever TRUE when a candy is chocolaty. For those candies the model can push the predicted probability as close to 1 as it likes, so the affected coefficients and their standard errors blow up, a situation known as separation. This means that we can’t draw conclusions from those coefficients, but we can still use the model to make predictions just fine!

# Fit a glm() of chocolate
choc_mod <- glm(chocolate ~ . - competitorname, family = binomial, data = candy_rankings)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
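
The same warning can be reproduced on a tiny made-up dataset in which one predictor separates the two classes perfectly. This is a sketch with invented data (toy, x and y are not from the candy dataset), and it collects the warning messages rather than printing them.

```r
# Invented data: y is 0 whenever x <= 3 and 1 whenever x >= 4
toy <- data.frame(x = c(1, 2, 3, 4, 5, 6),
                  y = c(0, 0, 0, 1, 1, 1))

# Fit the model and collect any warnings it raises
msgs <- character(0)
sep_mod <- withCallingHandlers(
  glm(y ~ x, family = binomial, data = toy),
  warning = function(w) {
    msgs <<- c(msgs, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
print(msgs)

# The classifications are still right even though the coefficients are unstable
pred_class <- fitted(sep_mod) > 0.5
print(all(pred_class == (toy$y == 1)))
```

The slope in a fit like this has no finite best value, which is why its estimate and standard error are so large, just like nougat, crispedricewafer and bar in the candy model.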

10. Evaluate the logistic regression model

Let’s take a look at our logistic regression model! We’ll start by creating a data frame of predictions we can compare to the actual values. Then we’ll evaluate the model by making a confusion matrix and calculating the accuracy.

Looking at the summary, most of the coefficients aren’t statistically significant. In this case, that’s okay because we’re not trying to draw any conclusions about the relationships between the predictor variables and the response. We’re only trying to make accurate predictions and, taking a look at our confusion matrix, it seems like we did a pretty good job! Keep in mind that these predictions are for the same candies the model was fit on, so the accuracy is likely to be optimistic compared to what we’d see on new data.

# Print the summary
summary(choc_mod)

## 
## Call:
## glm(formula = chocolate ~ . - competitorname, family = binomial, 
##     data = candy_rankings)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.72224  -0.17612  -0.02787   0.01954   2.57898  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)   
## (Intercept)           -10.29370    4.12040  -2.498  0.01248 * 
## fruityTRUE             -6.75305    2.20462  -3.063  0.00219 **
## caramelTRUE            -1.85093    1.66750  -1.110  0.26700   
## peanutyalmondyTRUE     -4.11907    2.98175  -1.381  0.16715   
## nougatTRUE            -16.74818 3520.13323  -0.005  0.99620   
## crispedricewaferTRUE   14.98331 4725.35051   0.003  0.99747   
## hardTRUE                1.83504    1.80742   1.015  0.30997   
## barTRUE                19.06799 3520.13379   0.005  0.99568   
## pluribusTRUE            0.22804    1.45457   0.157  0.87542   
## sugarpercent            0.12168    2.07707   0.059  0.95329   
## pricepercent            1.76626    2.24816   0.786  0.43208   
## winpercent              0.23019    0.08593   2.679  0.00739 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 116.407  on 84  degrees of freedom
## Residual deviance:  25.802  on 73  degrees of freedom
## AIC: 49.802
## 
## Number of Fisher Scoring iterations: 19

# Make a dataframe of predictions
preds <- augment(choc_mod, data = candy_rankings, type.predict = "response") %>%
  mutate(prediction = .fitted > .5)

# Create the confusion matrix
conf_mat <- preds %>%
  select(chocolate, prediction) %>%
  table()

print(conf_mat)

##          prediction
## chocolate FALSE TRUE
##     FALSE    47    1
##     TRUE      2   35

# Calculate the accuracy
accuracy <- sum(diag(conf_mat))/sum(conf_mat)

cat("Accuracy of the model is:", accuracy)

## Accuracy of the model is: 0.9647059
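
Accuracy is only one way to summarize a confusion matrix. As a sketch, the matrix printed above can be rebuilt by hand to see how often the model is right for each class separately. The counts are copied from the output; the names cm, sensitivity and specificity are made up for this example.

```r
# Rebuild the printed confusion matrix (rows = actual, columns = predicted)
cm <- matrix(c(47, 2, 1, 35), nrow = 2,
             dimnames = list(chocolate  = c("FALSE", "TRUE"),
                             prediction = c("FALSE", "TRUE")))

# Overall accuracy: correct predictions on the diagonal over all candies
accuracy <- sum(diag(cm)) / sum(cm)

# Share of chocolaty candies predicted as chocolaty
sensitivity <- cm["TRUE", "TRUE"] / sum(cm["TRUE", ])

# Share of non-chocolaty candies predicted as non-chocolaty
specificity <- cm["FALSE", "FALSE"] / sum(cm["FALSE", ])

print(round(c(accuracy = accuracy,
              sensitivity = sensitivity,
              specificity = specificity), 3))
```

Here the model misses 2 of the 37 chocolaty candies and wrongly flags 1 of the 48 others, so both rates are high.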