We’ll run through a whirlwind tour of this dataset and wrap up by trying some modeling techniques out on it! Specifically, we’ll take a look at linear and logistic regression.
First things first, let’s get our packages and data loaded up and take a look at exactly what we’re dealing with.
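# Load the packages used in this post
library(fivethirtyeight)
library(tidyverse)
library(broom)
library(corrplot)
# Load the candy_rankings dataset from the fivethirtyeight package
data(candy_rankings)
# Look at the data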
glimpse(candy_rankings)
## Observations: 85
## Variables: 13
## $ competitorname <chr> "100 Grand", "3 Musketeers", "One dime", "One...
## $ chocolate <lgl> TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, ...
## $ fruity <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALS...
## $ caramel <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE...
## $ peanutyalmondy <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE...
## $ nougat <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE...
## $ crispedricewafer <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS...
## $ hard <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL...
## $ bar <lgl> TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, ...
## $ pluribus <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL...
## $ sugarpercent <dbl> 0.732, 0.604, 0.011, 0.011, 0.906, 0.465, 0.6...
## $ pricepercent <dbl> 0.860, 0.511, 0.116, 0.511, 0.511, 0.767, 0.7...
## $ winpercent <dbl> 66.97173, 67.60294, 32.26109, 46.11650, 52.34...
# Look at the first few rows of the data
head(candy_rankings)
## # A tibble: 6 x 13
## competitorname chocolate fruity caramel peanutyalmondy nougat
## <chr> <lgl> <lgl> <lgl> <lgl> <lgl>
## 1 100 Grand TRUE FALSE TRUE FALSE FALSE
## 2 3 Musketeers TRUE FALSE FALSE FALSE TRUE
## 3 One dime FALSE FALSE FALSE FALSE FALSE
## 4 One quarter FALSE FALSE FALSE FALSE FALSE
## 5 Air Heads FALSE TRUE FALSE FALSE FALSE
## 6 Almond Joy TRUE FALSE FALSE TRUE FALSE
## # ... with 7 more variables: crispedricewafer <lgl>, hard <lgl>,
## # bar <lgl>, pluribus <lgl>, sugarpercent <dbl>, pricepercent <dbl>,
## # winpercent <dbl>
Let’s get started by taking a look at the distributions of each of these binary categorical variables. There are quite a few of them, so we’ll have to do some data wrangling to get them in shape for plotting. We’ll explore these by making a bar chart showing the breakdown of each column. This lets us get a sense of the proportion of TRUEs and FALSEs in each column.
# gather() the categorical variables to make them easier to plot
candy_rankings_long <- gather(candy_rankings, key = feature, value = value, chocolate:pluribus)
# Make a bar plot showing the distribution of each variable
ggplot(candy_rankings_long, aes(x=value)) + geom_bar() + facet_wrap(~feature)
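The bar heights in each facet are just counts of TRUE and FALSE per column. As a rough numeric companion to the plot, the sketch below computes the share of TRUE values per feature in base R. It uses only the six rows printed by `head()` above and the five feature columns visible there, typed in by hand as a stand-in data frame (`candy_head` is a name made up for this example), so the proportions describe those six candies and not the full dataset.

```r
# Six rows copied from the head() output above; `candy_head` is an
# illustrative name, not an object from the original analysis.
candy_head <- data.frame(
  chocolate      = c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE),
  fruity         = c(FALSE, FALSE, FALSE, FALSE, TRUE, FALSE),
  caramel        = c(TRUE, FALSE, FALSE, FALSE, FALSE, FALSE),
  peanutyalmondy = c(FALSE, FALSE, FALSE, FALSE, FALSE, TRUE),
  nougat         = c(FALSE, TRUE, FALSE, FALSE, FALSE, FALSE)
)

# mean() treats TRUE as 1 and FALSE as 0, so each column mean
# is the proportion of TRUE values in that column
prop_true <- sapply(candy_head, mean)
print(round(prop_true, 2))
```

On the full data, the same idea would apply to the columns `chocolate` through `pluribus` of `candy_rankings`.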
Next, we’ll look at the pricepercent variable. This variable records the percentile rank of the candy’s price against all the other candies in the dataset. Let’s see which is the most expensive and which is the least expensive by making a lollipop chart. One of the most interesting aspects of this chart is that a lot of the candies share the same ranking, so it looks like quite a few of them are the same price.
# Make a lollipop chart of pricepercent
ggplot(candy_rankings, aes(x=reorder(competitorname, pricepercent), y=pricepercent)) +
  geom_segment(aes(xend=reorder(competitorname, pricepercent), yend=0)) +
  geom_point() + coord_flip()
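The repeated values visible in the chart can also be counted directly. A minimal sketch, assuming only the first six `pricepercent` values shown in the `glimpse()` output above (the vector name `price_head` is made up here), tabulates how often each value occurs and keeps the ones that appear more than once:

```r
# First six pricepercent values, copied from the glimpse() output
price_head <- c(0.860, 0.511, 0.116, 0.511, 0.511, 0.767)

# Count occurrences of each distinct value
tie_counts <- table(price_head)

# Keep only the values shared by more than one candy
shared <- tie_counts[tie_counts > 1]
print(shared)
```

On the full dataset the equivalent call would be `table(candy_rankings$pricepercent)`.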
Moving on, we’ll take a look at another numerical variable in the dataset: winpercent. This variable records the percentage of people who prefer this candy over another randomly chosen candy from the dataset. We’ll start with a histogram! The distribution of rankings looks pretty symmetrical, and seems to center on about 45%.
# Plot a histogram of winpercent
ggplot(candy_rankings, aes(winpercent)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
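The message printed by `stat_bin()` is a reminder that 30 bins is a default and not a deliberate choice. In ggplot2 the usual fix is to pass `binwidth` (or `bins`) to `geom_histogram()`. To show what a bin width does to the counts without any plotting, the sketch below bins ten `winpercent` values (the first ten rows of the model output printed later in this post) with base R's `cut()`. The 10-point width and the name `win_head` are arbitrary choices for illustration.

```r
# First ten winpercent values, copied from this post's printed output
win_head <- c(66.97173, 67.60294, 32.26109, 46.11650, 52.34146,
              50.34755, 56.91455, 23.41782, 38.01096, 34.51768)

# Bin edges every 10 percentage points from 20 to 70
bin_width <- 10
breaks <- seq(20, 70, by = bin_width)

# Assign each value to a left-closed bin such as [20,30)
bins <- cut(win_head, breaks = breaks, right = FALSE)

# Number of candies per bin
bin_counts <- table(bins)
print(bin_counts)
```

A wider bin smooths the shape and a narrower one shows more detail, so it is worth trying a few widths before reading too much into the histogram.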
Now that we’ve looked at the histogram, let’s make another lollipop chart to visualize the rankings. It looks like Reese’s Peanut Butter Cups are the all-time favorite out of this set of candies!
# Make a lollipop chart of winpercent
ggplot(candy_rankings, aes(x=reorder(competitorname, winpercent), y=winpercent)) +
  geom_segment(aes(xend=reorder(competitorname, winpercent), yend=0)) +
  geom_point() + coord_flip()
Now that we’ve explored the dataset one variable at a time, we’ll see how the variables interact with one another. This is important as we get ready to model the data because it gives us some intuition about which variables might be useful explanatory variables. We’ll use the corrplot package to plot the correlation matrix. Taking a look at this plot, it looks like chocolaty candies are almost never fruity. The plot also lets us check for possible multicollinearity, where strongly correlated predictors make regression coefficients unstable and hard to interpret. It looks like we’re good though!
# Plot the correlation matrix using corrplot()
cor(candy_rankings[, 2:13]) %>%
  corrplot()
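`cor()` can handle the logical columns because R treats TRUE as 1 and FALSE as 0, so each cell of the plot is an ordinary Pearson correlation between 0/1 indicators. A small sketch, using the first ten rows of three feature columns as they appear in this post's printed output (`candy_sub` is a name made up for the example), shows the negative chocolate/fruity relationship in miniature:

```r
# First ten rows of three features, copied from this post's output;
# `candy_sub` is an illustrative name only.
candy_sub <- data.frame(
  chocolate = c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE),
  fruity    = c(FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE),
  caramel   = c(TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE)
)

# Convert TRUE/FALSE to 1/0 explicitly, then compute pairwise correlations
cor_mat <- cor(sapply(candy_sub, as.numeric))
print(round(cor_mat, 2))
```

With only ten rows the exact values mean little, but the sign of the chocolate/fruity entry matches what the full correlation plot shows.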
Let’s dive into the deep end of modeling by creating a linear model of winpercent using all the other variables except competitorname. Because competitorname is a categorical variable with a unique value in every row of the dataset, including it would give the model a separate coefficient for every candy. Such a model reproduces each winpercent exactly and leaves no degrees of freedom to estimate the effect of any other feature, so it tells us nothing. The names also carry no information that our model could use, because they don’t relate to any of the attributes of the candy.
Let’s fit this model, then we can dive into exploring it shortly. Maybe this will give us an idea of why people tend to prefer one candy over another!
# Fit a linear model of winpercent explained by all variables
# except competitorname
win_mod <- lm(winpercent ~ . -competitorname, data = candy_rankings)
win_mod
##
## Call:
## lm(formula = winpercent ~ . - competitorname, data = candy_rankings)
##
## Coefficients:
## (Intercept) chocolateTRUE fruityTRUE
## 34.5340 19.7481 9.4223
## caramelTRUE peanutyalmondyTRUE nougatTRUE
## 2.2245 10.0707 0.8043
## crispedricewaferTRUE hardTRUE barTRUE
## 8.9190 -6.1653 0.4415
## pluribusTRUE sugarpercent pricepercent
## -0.8545 9.0868 -5.9284
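To see why `competitorname` is left out of the formula, it helps to fit a model that does include a unique label per row. The sketch below uses six `winpercent` and `sugarpercent` values from this post's printed output, with a made-up `id` column standing in for `competitorname` (`toy_names` and `saturated` are also names invented for the example). `lm()` does return a fit, but it spends one coefficient per row, reproduces every observation exactly, and has no residual degrees of freedom left.

```r
# Six rows of real values from this post; `id` is a hypothetical
# stand-in for competitorname (one unique label per row).
toy_names <- data.frame(
  id           = factor(c("a", "b", "c", "d", "e", "f")),
  sugarpercent = c(0.732, 0.604, 0.011, 0.011, 0.906, 0.465),
  winpercent   = c(66.97173, 67.60294, 32.26109, 46.11650, 52.34146, 50.34755)
)

# Include the unique label alongside a real predictor
saturated <- lm(winpercent ~ id + sugarpercent, data = toy_names)

# No degrees of freedom remain for estimating error
print(df.residual(saturated))

# Every observation is fit exactly (up to floating-point error)
print(max(abs(resid(saturated))))

# The sugarpercent effect cannot be separated from the id dummies
print(coef(saturated)["sugarpercent"])
```

The `sugarpercent` coefficient comes back as `NA` because the per-row dummies already explain everything, which is the practical meaning of a unique label carrying no usable information.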
Let’s see how we did! We’ll take a look at the results of our linear model and run some basic diagnostics to make sure the output is reliable.
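Taking a look at the coefficients, we can draw some tentative conclusions about which features go along with people choosing one candy over another. Because this is observational survey data, these are associations and not proof of cause. Chocolate has the largest and most clearly significant coefficient: a chocolate candy is predicted to score about 20 points higher in winpercent than an otherwise identical candy without chocolate. It also looks like people who took this survey really like peanuts and almonds (peanutyalmondy adds about 10 points), and fruity is significant at the 5% level as well.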
# Take a look at the summary
summary(win_mod)
##
## Call:
## lm(formula = winpercent ~ . - competitorname, data = candy_rankings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.2244 -6.6247 0.1986 6.8420 23.8680
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.5340 4.3199 7.994 1.44e-11 ***
## chocolateTRUE 19.7481 3.8987 5.065 2.96e-06 ***
## fruityTRUE 9.4223 3.7630 2.504 0.01452 *
## caramelTRUE 2.2245 3.6574 0.608 0.54493
## peanutyalmondyTRUE 10.0707 3.6158 2.785 0.00681 **
## nougatTRUE 0.8043 5.7164 0.141 0.88849
## crispedricewaferTRUE 8.9190 5.2679 1.693 0.09470 .
## hardTRUE -6.1653 3.4551 -1.784 0.07852 .
## barTRUE 0.4415 5.0611 0.087 0.93072
## pluribusTRUE -0.8545 3.0401 -0.281 0.77945
## sugarpercent 9.0868 4.6595 1.950 0.05500 .
## pricepercent -5.9284 5.5132 -1.075 0.28578
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.7 on 73 degrees of freedom
## Multiple R-squared: 0.5402, Adjusted R-squared: 0.4709
## F-statistic: 7.797 on 11 and 73 DF, p-value: 9.504e-09
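Each estimate in the table is the change in predicted `winpercent` when that feature is present (or, for the two percentile variables, when the percentile goes from 0 to 1), with the other features held fixed. As a worked check, the sketch below adds up the rounded coefficients for 100 Grand, which according to the data is a chocolate, caramel, crisped-rice bar with `sugarpercent` 0.732 and `pricepercent` 0.860. The names `coefs` and `x_100grand` are introduced only for this example.

```r
# Rounded estimates copied from the summary above; only the terms
# that are non-zero for 100 Grand are needed.
coefs <- c(intercept = 34.5340, chocolate = 19.7481, caramel = 2.2245,
           crispedricewafer = 8.9190, bar = 0.4415,
           sugarpercent = 9.0868, pricepercent = -5.9284)

# Feature values for 100 Grand (1 = TRUE for the logical features)
x_100grand <- c(intercept = 1, chocolate = 1, caramel = 1,
                crispedricewafer = 1, bar = 1,
                sugarpercent = 0.732, pricepercent = 0.860)

# Prediction = sum of coefficient * feature value
pred <- sum(coefs * x_100grand[names(coefs)])

# Residual = observed winpercent minus prediction
resid_100grand <- 66.97173 - pred

print(round(c(fitted = pred, residual = resid_100grand), 3))
```

The result agrees, up to rounding of the printed coefficients, with the `.fitted` value of 67.42016 and the `.resid` value of -0.4484311 that `augment(win_mod)` reports for row 1.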
# Augment the data with fitted values, residuals, and other diagnostics
augment(win_mod)
## winpercent competitorname chocolate fruity caramel
## 1 66.97173 100 Grand TRUE FALSE TRUE
## 2 67.60294 3 Musketeers TRUE FALSE FALSE
## 3 32.26109 One dime FALSE FALSE FALSE
## 4 46.11650 One quarter FALSE FALSE FALSE
## 5 52.34146 Air Heads FALSE TRUE FALSE
## 6 50.34755 Almond Joy TRUE FALSE FALSE
## 7 56.91455 Baby Ruth TRUE FALSE TRUE
## 8 23.41782 Boston Baked Beans FALSE FALSE FALSE
## 9 38.01096 Candy Corn FALSE FALSE FALSE
## 10 34.51768 Caramel Apple Pops FALSE TRUE TRUE
## 11 38.97504 Charleston Chew TRUE FALSE FALSE
## 12 36.01763 Chewey Lemonhead Fruit Mix FALSE TRUE FALSE
## 13 24.52499 Chiclets FALSE TRUE FALSE
## 14 42.27208 Dots FALSE TRUE FALSE
## 15 39.46056 Dum Dums FALSE TRUE FALSE
## 16 43.08892 Fruit Chews FALSE TRUE FALSE
## 17 39.18550 Fun Dip FALSE TRUE FALSE
## 18 46.78335 Gobstopper FALSE TRUE FALSE
## 19 57.11974 Haribo Gold Bears FALSE TRUE FALSE
## 20 34.15896 Haribo Happy Cola FALSE FALSE FALSE
## 21 51.41243 Haribo Sour Bears FALSE TRUE FALSE
## 22 42.17877 Haribo Twin Snakes FALSE TRUE FALSE
## 23 55.37545 Hershey's Kisses TRUE FALSE FALSE
## 24 62.28448 Hershey's Krackel TRUE FALSE FALSE
## 25 56.49050 Hershey's Milk Chocolate TRUE FALSE FALSE
## 26 59.23612 Hershey's Special Dark TRUE FALSE FALSE
## 27 28.12744 Jawbusters FALSE TRUE FALSE
## 28 57.21925 Junior Mints TRUE FALSE FALSE
## 29 76.76860 Kit Kat TRUE FALSE FALSE
## 30 41.38956 Laffy Taffy FALSE TRUE FALSE
## 31 39.14106 Lemonhead FALSE TRUE FALSE
## 32 52.91139 Lifesavers big ring gummies FALSE TRUE FALSE
## 33 71.46505 Peanut butter M&M's TRUE FALSE FALSE
## 34 66.57458 M&M's TRUE FALSE FALSE
## 35 46.41172 Mike & Ike FALSE TRUE FALSE
## 36 55.06407 Milk Duds TRUE FALSE TRUE
## 37 73.09956 Milky Way TRUE FALSE TRUE
## 38 60.80070 Milky Way Midnight TRUE FALSE TRUE
## 39 64.35334 Milky Way Simply Caramel TRUE FALSE TRUE
## 40 47.82975 Mounds TRUE FALSE FALSE
## 41 54.52645 Mr Good Bar TRUE FALSE FALSE
## 42 55.35405 Nerds FALSE TRUE FALSE
## 43 70.73564 Nestle Butterfinger TRUE FALSE FALSE
## 44 66.47068 Nestle Crunch TRUE FALSE FALSE
## 45 22.44534 Nik L Nip FALSE TRUE FALSE
## 46 39.44680 Now & Later FALSE TRUE FALSE
## 47 46.29660 Payday FALSE FALSE FALSE
## 48 69.48379 Peanut M&Ms TRUE FALSE FALSE
## 49 37.72234 Pixie Sticks FALSE FALSE FALSE
## 50 41.26551 Pop Rocks FALSE TRUE FALSE
## 51 37.34852 Red vines FALSE TRUE FALSE
## 52 81.86626 Reese's Miniatures TRUE FALSE FALSE
## 53 84.18029 Reese's Peanut Butter cup TRUE FALSE FALSE
## 54 73.43499 Reese's pieces TRUE FALSE FALSE
## 55 72.88790 Reese's stuffed with pieces TRUE FALSE FALSE
## 56 35.29076 Ring pop FALSE TRUE FALSE
## 57 65.71629 Rolo TRUE FALSE TRUE
## 58 29.70369 Root Beer Barrels FALSE FALSE FALSE
## 59 42.84914 Runts FALSE TRUE FALSE
## 60 34.72200 Sixlets TRUE FALSE FALSE
## 61 63.08514 Skittles original FALSE TRUE FALSE
## 62 55.10370 Skittles wildberry FALSE TRUE FALSE
## 63 37.88719 Nestle Smarties TRUE FALSE FALSE
## 64 45.99583 Smarties candy FALSE TRUE FALSE
## 65 76.67378 Snickers TRUE FALSE TRUE
## 66 59.52925 Snickers Crisper TRUE FALSE TRUE
## 67 59.86400 Sour Patch Kids FALSE TRUE FALSE
## 68 52.82595 Sour Patch Tricksters FALSE TRUE FALSE
## 69 67.03763 Starburst FALSE TRUE FALSE
## 70 34.57899 Strawberry bon bons FALSE TRUE FALSE
## 71 33.43755 Sugar Babies FALSE FALSE TRUE
## 72 32.23100 Sugar Daddy FALSE FALSE TRUE
## 73 27.30386 Super Bubble FALSE TRUE FALSE
## 74 54.86111 Swedish Fish FALSE TRUE FALSE
## 75 48.98265 Tootsie Pop TRUE TRUE FALSE
## 76 43.06890 Tootsie Roll Juniors TRUE FALSE FALSE
## 77 45.73675 Tootsie Roll Midgies TRUE FALSE FALSE
## 78 49.65350 Tootsie Roll Snack Bars TRUE FALSE FALSE
## 79 47.17323 Trolli Sour Bites FALSE TRUE FALSE
## 80 81.64291 Twix TRUE FALSE TRUE
## 81 45.46628 Twizzlers FALSE TRUE FALSE
## 82 39.01190 Warheads FALSE TRUE FALSE
## 83 44.37552 Welch's Fruit Snacks FALSE TRUE FALSE
## 84 41.90431 Werther's Original Caramel FALSE FALSE TRUE
## 85 49.52411 Whoppers TRUE FALSE FALSE
## peanutyalmondy nougat crispedricewafer hard bar pluribus
## 1 FALSE FALSE TRUE FALSE TRUE FALSE
## 2 FALSE TRUE FALSE FALSE TRUE FALSE
## 3 FALSE FALSE FALSE FALSE FALSE FALSE
## 4 FALSE FALSE FALSE FALSE FALSE FALSE
## 5 FALSE FALSE FALSE FALSE FALSE FALSE
## 6 TRUE FALSE FALSE FALSE TRUE FALSE
## 7 TRUE TRUE FALSE FALSE TRUE FALSE
## 8 TRUE FALSE FALSE FALSE FALSE TRUE
## 9 FALSE FALSE FALSE FALSE FALSE TRUE
## 10 FALSE FALSE FALSE FALSE FALSE FALSE
## 11 FALSE TRUE FALSE FALSE TRUE FALSE
## 12 FALSE FALSE FALSE FALSE FALSE TRUE
## 13 FALSE FALSE FALSE FALSE FALSE TRUE
## 14 FALSE FALSE FALSE FALSE FALSE TRUE
## 15 FALSE FALSE FALSE TRUE FALSE FALSE
## 16 FALSE FALSE FALSE FALSE FALSE TRUE
## 17 FALSE FALSE FALSE TRUE FALSE FALSE
## 18 FALSE FALSE FALSE TRUE FALSE TRUE
## 19 FALSE FALSE FALSE FALSE FALSE TRUE
## 20 FALSE FALSE FALSE FALSE FALSE TRUE
## 21 FALSE FALSE FALSE FALSE FALSE TRUE
## 22 FALSE FALSE FALSE FALSE FALSE TRUE
## 23 FALSE FALSE FALSE FALSE FALSE TRUE
## 24 FALSE FALSE TRUE FALSE TRUE FALSE
## 25 FALSE FALSE FALSE FALSE TRUE FALSE
## 26 FALSE FALSE FALSE FALSE TRUE FALSE
## 27 FALSE FALSE FALSE TRUE FALSE TRUE
## 28 FALSE FALSE FALSE FALSE FALSE TRUE
## 29 FALSE FALSE TRUE FALSE TRUE FALSE
## 30 FALSE FALSE FALSE FALSE FALSE FALSE
## 31 FALSE FALSE FALSE TRUE FALSE FALSE
## 32 FALSE FALSE FALSE FALSE FALSE FALSE
## 33 TRUE FALSE FALSE FALSE FALSE TRUE
## 34 FALSE FALSE FALSE FALSE FALSE TRUE
## 35 FALSE FALSE FALSE FALSE FALSE TRUE
## 36 FALSE FALSE FALSE FALSE FALSE TRUE
## 37 FALSE TRUE FALSE FALSE TRUE FALSE
## 38 FALSE TRUE FALSE FALSE TRUE FALSE
## 39 FALSE FALSE FALSE FALSE TRUE FALSE
## 40 FALSE FALSE FALSE FALSE TRUE FALSE
## 41 TRUE FALSE FALSE FALSE TRUE FALSE
## 42 FALSE FALSE FALSE TRUE FALSE TRUE
## 43 TRUE FALSE FALSE FALSE TRUE FALSE
## 44 FALSE FALSE TRUE FALSE TRUE FALSE
## 45 FALSE FALSE FALSE FALSE FALSE TRUE
## 46 FALSE FALSE FALSE FALSE FALSE TRUE
## 47 TRUE TRUE FALSE FALSE TRUE FALSE
## 48 TRUE FALSE FALSE FALSE FALSE TRUE
## 49 FALSE FALSE FALSE FALSE FALSE TRUE
## 50 FALSE FALSE FALSE TRUE FALSE TRUE
## 51 FALSE FALSE FALSE FALSE FALSE TRUE
## 52 TRUE FALSE FALSE FALSE FALSE FALSE
## 53 TRUE FALSE FALSE FALSE FALSE FALSE
## 54 TRUE FALSE FALSE FALSE FALSE TRUE
## 55 TRUE FALSE FALSE FALSE FALSE FALSE
## 56 FALSE FALSE FALSE TRUE FALSE FALSE
## 57 FALSE FALSE FALSE FALSE FALSE TRUE
## 58 FALSE FALSE FALSE TRUE FALSE TRUE
## 59 FALSE FALSE FALSE TRUE FALSE TRUE
## 60 FALSE FALSE FALSE FALSE FALSE TRUE
## 61 FALSE FALSE FALSE FALSE FALSE TRUE
## 62 FALSE FALSE FALSE FALSE FALSE TRUE
## 63 FALSE FALSE FALSE FALSE FALSE TRUE
## 64 FALSE FALSE FALSE TRUE FALSE TRUE
## 65 TRUE TRUE FALSE FALSE TRUE FALSE
## 66 TRUE FALSE TRUE FALSE TRUE FALSE
## 67 FALSE FALSE FALSE FALSE FALSE TRUE
## 68 FALSE FALSE FALSE FALSE FALSE TRUE
## 69 FALSE FALSE FALSE FALSE FALSE TRUE
## 70 FALSE FALSE FALSE TRUE FALSE TRUE
## 71 FALSE FALSE FALSE FALSE FALSE TRUE
## 72 FALSE FALSE FALSE FALSE FALSE FALSE
## 73 FALSE FALSE FALSE FALSE FALSE FALSE
## 74 FALSE FALSE FALSE FALSE FALSE TRUE
## 75 FALSE FALSE FALSE TRUE FALSE FALSE
## 76 FALSE FALSE FALSE FALSE FALSE FALSE
## 77 FALSE FALSE FALSE FALSE FALSE TRUE
## 78 FALSE FALSE FALSE FALSE TRUE FALSE
## 79 FALSE FALSE FALSE FALSE FALSE TRUE
## 80 FALSE FALSE TRUE FALSE TRUE FALSE
## 81 FALSE FALSE FALSE FALSE FALSE FALSE
## 82 FALSE FALSE FALSE TRUE FALSE FALSE
## 83 FALSE FALSE FALSE FALSE FALSE TRUE
## 84 FALSE FALSE FALSE TRUE FALSE FALSE
## 85 FALSE FALSE TRUE FALSE FALSE TRUE
## sugarpercent pricepercent .fitted .se.fit .resid .hat
## 1 0.732 0.860 67.42016 4.589589 -0.4484311 0.18388570
## 2 0.604 0.511 57.98693 4.929578 9.6160080 0.21213862
## 3 0.011 0.116 33.94624 4.180558 -1.6851569 0.15256987
## 4 0.011 0.511 31.60454 4.447541 14.5119648 0.17267932
## 5 0.906 0.511 49.15952 4.031590 3.1819498 0.14189038
## 6 0.465 0.767 64.47257 4.235507 -14.1250194 0.15660696
## 7 0.604 0.767 68.76444 4.898842 -11.8498901 0.20950149
## 8 0.313 0.511 43.56493 4.478461 -20.1471075 0.17508860
## 9 0.906 0.325 39.98537 4.296181 -1.9744058 0.16112595
## 10 0.604 0.325 49.74247 4.479715 -15.2247879 0.17518668
## 11 0.604 0.511 57.98693 4.929578 -19.0118910 0.21213862
## 12 0.732 0.511 46.72392 2.549422 -10.7062907 0.05673929
## 13 0.046 0.325 41.59307 2.882302 -17.0680866 0.07252360
## 14 0.732 0.511 46.72392 2.549422 -4.4518427 0.05673929
## 15 0.732 0.034 44.24092 3.822711 -4.7803640 0.12756847
## 16 0.127 0.034 44.05426 2.942445 -0.9653315 0.07558177
## 17 0.732 0.325 42.51577 3.448166 -3.3302619 0.10379508
## 18 0.906 0.453 42.48353 3.411564 4.2998140 0.10160322
## 19 0.465 0.465 44.57046 2.272905 12.5492824 0.04509860
## 20 0.465 0.465 35.14814 3.480918 -0.9891775 0.10577618
## 21 0.465 0.465 44.57046 2.272905 6.8419724 0.04509860
## 22 0.465 0.465 44.57046 2.272905 -2.3916856 0.04509860
## 23 0.127 0.093 54.03023 3.705246 1.3452269 0.11984898
## 24 0.430 0.918 62.10763 4.524033 0.1768533 0.17867008
## 25 0.430 0.918 53.18866 4.105998 3.3018431 0.14717627
## 26 0.430 0.918 53.18866 4.105998 6.0474641 0.14717627
## 27 0.093 0.511 34.75215 4.149031 -6.6247118 0.15027737
## 28 0.197 0.511 52.18825 3.256920 5.0310044 0.09260077
## 29 0.313 0.511 63.45732 4.609345 13.3112806 0.18547216
## 30 0.220 0.116 45.26770 3.194444 -3.8781414 0.08908217
## 31 0.046 0.104 37.59242 3.712883 1.5486405 0.12034358
## 32 0.267 0.279 44.72845 3.090762 8.1829388 0.08339335
## 33 0.825 0.651 67.13545 3.831854 4.3295998 0.12817937
## 34 0.825 0.651 57.06476 3.427442 9.5098233 0.10255116
## 35 0.872 0.325 49.09874 2.974721 -2.6870246 0.07724900
## 36 0.302 0.511 55.36684 4.464249 -0.3027649 0.17397917
## 37 0.604 0.651 59.38144 4.609916 13.7181175 0.18551813
## 38 0.732 0.441 61.78950 4.737852 -0.9887993 0.19595818
## 39 0.965 0.860 60.61840 5.058007 3.7349381 0.22333620
## 40 0.313 0.860 52.46935 4.079075 -4.6395973 0.14525253
## 41 0.313 0.918 62.19619 4.366107 -7.6697440 0.16641372
## 42 0.848 0.325 42.71533 3.348410 12.6387142 0.09787629
## 43 0.604 0.767 65.73563 4.284363 5.0000159 0.16024071
## 44 0.313 0.767 61.93966 4.472589 4.5310214 0.17462982
## 45 0.197 0.976 39.10581 4.526811 -16.6604714 0.17888959
## 46 0.220 0.325 43.17417 2.441040 -3.7273714 0.05201758
## 47 0.465 0.767 45.52883 6.222662 0.7677680 0.33802802
## 48 0.593 0.651 65.02732 3.640081 4.4564668 0.11567046
## 49 0.093 0.023 34.38820 3.918178 3.3341405 0.13401969
## 50 0.604 0.837 37.46284 4.114747 3.8026706 0.14780416
## 51 0.581 0.116 47.69352 2.732204 -10.3449980 0.06516681
## 52 0.034 0.279 63.00767 4.407513 18.8585861 0.16958503
## 53 0.720 0.651 67.03584 4.221377 17.1444500 0.15556384
## 54 0.406 0.651 63.32810 3.714337 10.1068935 0.12043784
## 55 0.988 0.651 69.47109 4.724808 3.4168090 0.19488066
## 56 0.732 0.965 38.72162 4.900162 -3.4308597 0.20961449
## 57 0.860 0.860 58.36825 4.489595 7.3480335 0.17596029
## 58 0.732 0.069 33.75661 5.105617 -4.0529148 0.22756042
## 59 0.872 0.279 43.20612 3.439858 -0.3569746 0.10329552
## 60 0.220 0.081 54.94644 3.660922 -20.2244364 0.11699880
## 61 0.941 0.220 50.34821 3.408857 12.7369349 0.10144202
## 62 0.941 0.220 50.34821 3.408857 4.7554899 0.10144202
## 63 0.267 0.976 50.06763 4.343769 -12.1804426 0.16471523
## 64 0.267 0.116 38.67495 3.507512 7.3208769 0.10739862
## 65 0.546 0.651 68.92510 4.930439 7.7486870 0.21221278
## 66 0.604 0.651 77.56677 5.821144 -18.0375152 0.29581285
## 67 0.069 0.116 43.04110 2.890821 16.8229004 0.07295292
## 68 0.069 0.116 43.04110 2.890821 9.7848494 0.07295292
## 69 0.151 0.220 43.16966 2.600091 23.8679655 0.05901703
## 70 0.569 0.058 41.76300 3.474481 -7.1840066 0.10538534
## 71 0.965 0.767 40.12563 4.837866 -6.6880828 0.20431867
## 72 0.418 0.325 38.63001 4.579474 -6.3990143 0.18307608
## 73 0.162 0.116 44.74067 3.223612 -17.4368011 0.09071640
## 74 0.604 0.755 44.11429 3.019560 10.7468184 0.07959532
## 75 0.604 0.325 61.10073 5.063631 -12.1180770 0.22383315
## 76 0.313 0.511 54.09681 3.761967 -11.0279125 0.12354649
## 77 0.174 0.011 54.94343 3.890275 -9.2066825 0.13211770
## 78 0.465 0.325 57.02221 4.480837 -7.3687098 0.17527446
## 79 0.313 0.255 44.43423 2.317720 2.7390035 0.04689453
## 80 0.546 0.906 65.45731 4.604338 16.1856004 0.18506943
## 81 0.220 0.116 45.26770 3.194444 0.1985836 0.08908217
## 82 0.093 0.116 37.94835 3.616019 1.0635450 0.11414625
## 83 0.313 0.313 44.09038 2.289832 0.2851384 0.04577280
## 84 0.186 0.267 30.70040 5.423620 11.2039093 0.25679046
## 85 0.872 0.848 65.24292 6.028675 -15.7188093 0.31728098
## .sigma .cooksd .std.resid
## 1 10.77677 4.038839e-05 -0.04637889
## 2 10.70103 2.298946e-02 1.01220873
## 3 10.77477 4.388951e-04 -0.17103640
## 4 10.61163 3.865130e-02 1.49069723
## 5 10.76932 1.419297e-03 0.32093871
## 6 10.62340 3.195569e-02 -1.43705772
## 7 10.66185 3.424781e-02 -1.24527125
## 8 10.45505 7.597798e-02 -2.07257005
## 9 10.77393 6.493276e-04 -0.20141333
## 10 10.59429 4.342215e-02 -1.56629509
## 11 10.47713 8.986485e-02 -2.00124647
## 12 10.69834 5.317623e-03 -1.02996704
## 13 10.57259 1.786744e-02 -1.65589797
## 14 10.76338 9.194309e-04 -0.42827636
## 15 10.76004 2.786260e-03 -0.47818446
## 16 10.77628 5.995869e-05 -0.09380855
## 17 10.76895 1.042649e-03 -0.32868122
## 18 10.76366 1.693127e-03 0.42385347
## 19 10.67013 5.666344e-03 1.19988547
## 20 10.77622 9.415925e-05 -0.09773527
## 21 10.74529 1.684332e-03 0.65418747
## 22 10.77307 2.058132e-04 -0.22867832
## 23 10.77560 2.036714e-04 0.13397295
## 24 10.77690 6.026453e-06 0.01823284
## 25 10.76869 1.604912e-03 0.33406191
## 26 10.74926 5.383750e-03 0.61184839
## 27 10.74360 6.644959e-03 -0.67147302
## 28 10.75894 2.070839e-03 0.49346422
## 29 10.63583 3.603509e-02 1.37805655
## 30 10.76628 1.174624e-03 -0.37965070
## 31 10.77517 2.713421e-04 0.15427454
## 32 10.72975 4.835074e-03 0.79857915
## 33 10.76306 2.299738e-03 0.43324579
## 34 10.71180 8.376936e-03 0.93792425
## 35 10.77189 4.765256e-04 -0.26135424
## 36 10.77686 1.700378e-05 -0.03112506
## 37 10.62700 3.828526e-02 1.42021461
## 38 10.77615 2.155964e-04 -0.10303126
## 39 10.76535 3.757326e-03 0.39597435
## 40 10.76069 3.113348e-03 -0.46887982
## 41 10.73136 1.024870e-02 -0.78488450
## 42 10.66222 1.397559e-02 1.24328304
## 43 10.75773 4.132616e-03 0.50979332
## 44 10.76089 3.828531e-03 0.46598495
## 45 10.55685 5.357666e-02 -1.71785568
## 46 10.76748 5.850244e-04 -0.35768662
## 47 10.77635 3.307902e-04 0.08816788
## 48 10.76245 2.136951e-03 0.44277570
## 49 10.76865 1.445238e-03 0.33475730
## 50 10.76599 2.140939e-03 0.38487453
## 51 10.70291 5.805479e-03 -0.99968578
## 52 10.49733 6.362599e-02 1.93357807
## 53 10.55025 4.664888e-02 1.74317229
## 54 10.70183 1.156869e-02 1.00689593
## 55 10.76758 2.553340e-03 0.35578755
## 56 10.76733 2.873217e-03 -0.36056506
## 57 10.73462 1.017840e-02 0.75630544
## 58 10.76322 4.557448e-03 -0.43085921
## 59 10.77684 1.190903e-05 -0.03522190
## 60 10.47418 4.465095e-02 -2.01092535
## 61 10.65995 1.482773e-02 1.25542867
## 62 10.76070 2.066975e-03 0.46872960
## 63 10.66186 2.548061e-02 -1.24521963
## 64 10.73817 5.255680e-03 0.72399321
## 65 10.72770 1.493582e-02 0.81568752
## 66 10.47498 1.411931e-01 -2.00831961
## 67 10.57838 1.747672e-02 1.63248855
## 68 10.71017 5.912436e-03 0.94951847
## 69 10.37949 2.762255e-02 2.29892439
## 70 10.73969 4.943798e-03 -0.70965765
## 71 10.74064 1.050152e-02 -0.70053885
## 72 10.74458 8.171723e-03 -0.66148872
## 73 10.55927 2.426833e-02 -1.70850913
## 74 10.69577 7.894213e-03 1.04662396
## 75 10.65432 3.969179e-02 -1.28515743
## 76 10.68714 1.422913e-02 -1.10059881
## 77 10.71381 1.081595e-02 -0.92336396
## 78 10.73442 1.017892e-02 -0.75811816
## 79 10.77186 2.817374e-04 0.26213335
## 80 10.56775 5.310925e-02 1.67520776
## 81 10.77690 3.079912e-06 0.01944035
## 82 10.77611 1.196929e-04 0.10557839
## 83 10.77687 2.973271e-06 0.02727282
## 84 10.66754 4.245361e-02 1.21426678
## 85 10.54114 1.223541e-01 -1.77745580
# Plot the residuals vs the fitted values
ggplot(augment(win_mod), aes(x=.fitted, y=.resid)) + geom_point() + geom_hline(yintercept=0)
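The residual plot covers one diagnostic, while the `.hat`, `.std.resid`, and `.cooksd` columns in the augmented output describe how much influence each candy has on the fit. Cook's distance for a row can be rebuilt from its standardized residual r and leverage h as D = (r² / p) × h / (1 − h), where p is the number of estimated coefficients (12 here). The sketch below redoes that arithmetic for row 66 (Snickers Crisper), which has the largest `.cooksd` in the output above. The 4/n cutoff is a common rule of thumb and an assumption of this example, not something the original analysis used.

```r
# Values copied from row 66 of the augment() output
std_resid <- -2.00831961   # .std.resid
hat       <- 0.29581285    # .hat (leverage)

p <- 12   # intercept + 11 predictors
n <- 85   # number of candies

# Cook's distance from standardized residual and leverage
cooks_d <- (std_resid^2 / p) * (hat / (1 - hat))

# Rule-of-thumb threshold for flagging influential points
rule_of_thumb <- 4 / n

print(c(cooks_d = cooks_d, threshold = rule_of_thumb))
print(cooks_d > rule_of_thumb)
```

A flagged row is not necessarily a problem, but it is a reasonable candidate for refitting the model without it and checking whether the coefficients move much.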
Now let’s try out logistic regression! We’ll be trying to predict if a candy is chocolaty or not based on all the other features in the dataset. A logistic regression is a great choice for this particular modeling task because the variable we’re trying to predict is either TRUE or FALSE. The logistic regression model will output a probability that we can use to make our decision. This model outputs a warning because a few of the features (like crispedricewafer) are only ever true when a candy is chocolate, a situation known as separation. The affected coefficients have no finite maximum likelihood estimate, so they come out very large with even larger standard errors. This means that we can’t draw conclusions from those coefficients, but we can still use the model to make predictions just fine!
# Fit a glm() of chocolate
choc_mod <- glm(chocolate ~ . - competitorname, family = binomial, data = candy_rankings)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
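The effect of a feature that is only ever TRUE for chocolate candies can be reproduced on a tiny made-up dataset. In the sketch below, `sep_toy`, `wafer`, and `sep_mod` are hypothetical names and the nine rows are invented for illustration: every row with `wafer` TRUE is chocolate, while rows with `wafer` FALSE are mixed. Warnings are suppressed so the example runs quietly; whether the exact warning from the text appears can depend on the data.

```r
# Invented toy data: wafer is TRUE only for chocolate rows
sep_toy <- data.frame(
  wafer     = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE),
  chocolate = c(TRUE, TRUE, TRUE, TRUE,  TRUE,  FALSE, FALSE, FALSE, FALSE)
)

# Fit the logistic regression; the wafer coefficient has no finite optimum
sep_mod <- suppressWarnings(
  glm(chocolate ~ wafer, family = binomial, data = sep_toy)
)

# Pull out the estimate and standard error for the separated feature
est <- coef(summary(sep_mod))["waferTRUE", c("Estimate", "Std. Error")]
print(est)
```

The estimate is large and its standard error is far larger, the same pattern as the `nougatTRUE`, `crispedricewaferTRUE`, and `barTRUE` rows in the summary of `choc_mod`.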
Let’s take a look at our logistic regression model! We’ll start with the summary, then create a data frame of predictions we can compare to the actual values. Finally, we’ll evaluate the model by making a confusion matrix and calculating the accuracy.
Looking at the summary, it looks like most of the coefficients aren’t statistically significant. In this case, that’s okay because we’re not trying to draw any conclusions about the relationships between the predictor variables and the response. We’re only trying to make accurate predictions and, taking a look at our confusion matrix, it seems like we did a pretty good job: only 3 of the 85 candies are misclassified. Keep in mind that these are the same candies the model was fit on, so accuracy on new candies would probably be somewhat lower.
# Print the summary
summary(choc_mod)
##
## Call:
## glm(formula = chocolate ~ . - competitorname, family = binomial,
## data = candy_rankings)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.72224 -0.17612 -0.02787 0.01954 2.57898
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.29370 4.12040 -2.498 0.01248 *
## fruityTRUE -6.75305 2.20462 -3.063 0.00219 **
## caramelTRUE -1.85093 1.66750 -1.110 0.26700
## peanutyalmondyTRUE -4.11907 2.98175 -1.381 0.16715
## nougatTRUE -16.74818 3520.13323 -0.005 0.99620
## crispedricewaferTRUE 14.98331 4725.35051 0.003 0.99747
## hardTRUE 1.83504 1.80742 1.015 0.30997
## barTRUE 19.06799 3520.13379 0.005 0.99568
## pluribusTRUE 0.22804 1.45457 0.157 0.87542
## sugarpercent 0.12168 2.07707 0.059 0.95329
## pricepercent 1.76626 2.24816 0.786 0.43208
## winpercent 0.23019 0.08593 2.679 0.00739 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 116.407 on 84 degrees of freedom
## Residual deviance: 25.802 on 73 degrees of freedom
## AIC: 49.802
##
## Number of Fisher Scoring iterations: 19
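The next step turns the model's output into TRUE/FALSE predictions. `type.predict = "response"` asks for fitted probabilities instead of log-odds, and the 0.5 cutoff then labels anything more likely than not as chocolate. The sketch below walks through the same two steps with base R's `predict()` on the built-in `mtcars` data, a stand-in chosen only so the example runs without extra packages (`demo_mod` and the other names are made up for the example).

```r
# Stand-in model: predict transmission type (am) from weight (wt)
demo_mod <- glm(am ~ wt, family = binomial, data = mtcars)

# Predictions on the log-odds (link) scale and the probability scale
log_odds <- predict(demo_mod, type = "link")
prob     <- predict(demo_mod, type = "response")

# The probability is the inverse-logit of the log-odds
print(all.equal(prob, plogis(log_odds)))

# Threshold at 0.5 to get a TRUE/FALSE prediction
demo_pred <- prob > 0.5

# Cross-tabulate actual vs predicted classes
print(table(actual = mtcars$am == 1, predicted = demo_pred))
```

A cutoff of 0.5 on the probability scale is the same as a cutoff of 0 on the log-odds scale; a different cutoff trades false positives against false negatives.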
# Make a dataframe of predictions
preds <- augment(choc_mod, data = candy_rankings, type.predict = "response") %>%
  mutate(prediction = .fitted > .5)
# Create the confusion matrix
conf_mat <- preds %>%
  select(chocolate, prediction) %>%
  table()
print(conf_mat)
## prediction
## chocolate FALSE TRUE
## FALSE 47 1
## TRUE 2 35
# Calculate the accuracy
accuracy <- sum(diag(conf_mat))/sum(conf_mat)
cat("Accuracy of the model is:", accuracy)
## Accuracy of the model is: 0.9647059
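Accuracy alone hides which kind of mistake the model makes. The sketch below rebuilds the confusion matrix from the counts printed above and adds sensitivity (the share of chocolate candies correctly identified) and specificity (the share of non-chocolate candies correctly identified). The object is rebuilt by hand so the example is self-contained; in the original session `conf_mat` already exists.

```r
# Confusion matrix rebuilt from the printed counts
# (rows = actual chocolate, columns = predicted)
conf_mat <- matrix(c(47, 2, 1, 35), nrow = 2,
                   dimnames = list(chocolate  = c("FALSE", "TRUE"),
                                   prediction = c("FALSE", "TRUE")))

# Share of all candies classified correctly
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# Share of chocolate candies predicted as chocolate
sensitivity <- conf_mat["TRUE", "TRUE"] / sum(conf_mat["TRUE", ])

# Share of non-chocolate candies predicted as non-chocolate
specificity <- conf_mat["FALSE", "FALSE"] / sum(conf_mat["FALSE", ])

print(round(c(accuracy = accuracy,
              sensitivity = sensitivity,
              specificity = specificity), 3))
```

With 2 missed chocolate candies and 1 false alarm, both error types are rare here, so the high accuracy is not driven by one class alone.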