1 k-Nearest Neighbors (kNN)

1.1 Classification with Nearest Neighbors

1.1.1 Recognizing a road sign with kNN

After several trips with a human behind the wheel, it is time for the self-driving car to attempt the test course alone.

As it begins to drive away, its camera captures the following image:

Can we apply a kNN classifier to help the car recognize this sign?

The dataset signs is loaded in your workspace along with the dataframe next_sign, which holds the observation you want to classify.

# Load the 'class' package
library(dplyr)
library(class)
next_sign <- readRDS("next_sing.rds")
signs <- read.csv("knn_traffic_signs.csv", stringsAsFactors=FALSE)
signs <- signs %>% 
  select(-id, -sample)
# Load the 'class' package
library(class)

# Create a vector of labels
sign_types <- signs$sign_type

# Classify the next sign observed
knn(train = signs[-1], test = next_sign, cl = sign_types)
[1] stop
Levels: pedestrian speed stop

We’ve trained your first nearest neighbor classifier!

next_sign

1.1.2 Exploring the traffic sign dataset

To better understand how the knn() function was able to classify the stop sign, it may help to examine the training dataset it used.

Each previously observed street sign was divided into a 4x4 grid, and the red, green, and blue level for each of the 16 center pixels is recorded as illustrated here.

# Examine the structure of the signs dataset
str(signs)
'data.frame':   206 obs. of  49 variables:
 $ sign_type: chr  "pedestrian" "pedestrian" "pedestrian" "pedestrian" ...
 $ r1       : int  155 142 57 22 169 75 136 118 149 13 ...
 $ g1       : int  228 217 54 35 179 67 149 105 225 34 ...
 $ b1       : int  251 242 50 41 170 60 157 69 241 28 ...
 $ r2       : int  135 166 187 171 231 131 200 244 34 5 ...
 $ g2       : int  188 204 201 178 254 89 203 245 45 21 ...
 $ b2       : int  101 44 68 26 27 53 107 67 1 11 ...
 $ r3       : int  156 142 51 19 97 214 150 132 155 123 ...
 $ g3       : int  227 217 51 27 107 144 167 123 226 154 ...
 $ b3       : int  245 242 45 29 99 75 134 12 238 140 ...
 $ r4       : int  145 147 59 19 123 156 171 138 147 21 ...
 $ g4       : int  211 219 62 27 147 169 218 123 222 46 ...
 $ b4       : int  228 242 65 29 152 190 252 85 242 41 ...
 $ r5       : int  166 164 156 42 221 67 171 254 170 36 ...
 $ g5       : int  233 228 171 37 236 50 158 254 191 60 ...
 $ b5       : int  245 229 50 3 117 36 108 92 113 26 ...
 $ r6       : int  212 84 254 217 205 37 157 241 26 75 ...
 $ g6       : int  254 116 255 228 225 36 186 240 37 108 ...
 $ b6       : int  52 17 36 19 80 42 11 108 12 44 ...
 $ r7       : int  212 217 211 221 235 44 26 254 34 13 ...
 $ g7       : int  254 254 226 235 254 42 35 254 45 27 ...
 $ b7       : int  11 26 70 20 60 44 10 99 19 25 ...
 $ r8       : int  188 155 78 181 90 192 180 108 221 133 ...
 $ g8       : int  229 203 73 183 110 131 211 106 249 163 ...
 $ b8       : int  117 128 64 73 9 73 236 27 184 126 ...
 $ r9       : int  170 213 220 237 216 123 129 135 226 83 ...
 $ g9       : int  216 253 234 234 236 74 109 123 246 125 ...
 $ b9       : int  120 51 59 44 66 22 73 40 59 19 ...
 $ r10      : int  211 217 254 251 229 36 161 254 30 13 ...
 $ g10      : int  254 255 255 254 255 34 190 254 40 27 ...
 $ b10      : int  3 21 51 2 12 37 10 115 34 25 ...
 $ r11      : int  212 217 253 235 235 44 161 254 34 9 ...
 $ g11      : int  254 255 255 243 254 42 190 254 44 23 ...
 $ b11      : int  19 21 44 12 60 44 6 99 35 18 ...
 $ r12      : int  172 158 66 19 163 197 187 138 241 85 ...
 $ g12      : int  235 225 68 27 168 114 215 123 255 128 ...
 $ b12      : int  244 237 68 29 152 21 236 85 54 21 ...
 $ r13      : int  172 164 69 20 124 171 141 118 205 83 ...
 $ g13      : int  235 227 65 29 117 102 142 105 229 125 ...
 $ b13      : int  244 237 59 34 91 26 140 75 46 19 ...
 $ r14      : int  172 182 76 64 188 197 189 131 226 85 ...
 $ g14      : int  228 228 84 61 205 114 171 124 246 128 ...
 $ b14      : int  235 143 22 4 78 21 140 5 59 21 ...
 $ r15      : int  177 171 82 211 125 123 214 106 235 85 ...
 $ g15      : int  235 228 93 222 147 74 221 94 252 128 ...
 $ b15      : int  244 196 17 78 20 22 201 53 67 21 ...
 $ r16      : int  22 164 58 19 160 180 188 101 237 83 ...
 $ g16      : int  52 227 60 27 183 107 211 91 254 125 ...
 $ b16      : int  53 237 60 29 187 26 227 59 53 19 ...
# Count the number of signs of each type
table(signs$sign_type)

pedestrian      speed       stop 
        65         70         71 
# Check r10's average red level by sign type
aggregate(r10 ~ sign_type, data = signs, mean)

As you might have expected, stop signs tend to have a higher average red value. This is how kNN identifies similar signs.

1.1.3 Classifying a collection of road signs

Now that the autonomous vehicle has successfully stopped on its own, your team feels confident allowing the car to continue the test course.

The test course includes 59 additional road signs divided into three types:

At the conclusion of the trial, we are asked to measure the car’s overall performance at recognizing these sign

test_signs <- readRDS("test_signs.rds")
# Use kNN to identify the test road signs
sign_types <- signs$sign_type
signs_pred <- knn(train = signs[-1], test = test_signs[-1], cl = sign_types)

# Create a confusion matrix of the predicted versus actual values
signs_actual <- test_signs$sign_type
table(signs_pred, signs_actual)
            signs_actual
signs_pred   pedestrian speed stop
  pedestrian         19     0    0
  speed               0    21    0
  stop                0     0   19
# Compute the accuracy
mean(signs_pred == signs_actual)
[1] 1

That self-driving car is really coming along! The confusion matrix lets you look for patterns in the classifier’s errors.

1.2 K in kNN

1.2.1 Understanding the impact of ‘k’

There is a complex relationship between k and classification accuracy. Bigger is not always better.

  • Smaller k increases the impact of noisy data
  • A smaller k may utilize more subtle patterns in the data

1.2.2 Testing other ‘k’ values

By default, the knn() function in the class package uses only the single nearest neighbor.

Setting a k parameter allows the algorithm to consider additional nearby neighbors. This enlarges the collection of neighbors which will vote on the predicted class.

Compare k values of 1, 7, and 15 to examine the impact on traffic sign classification accuracy.

# Compute the accuracy of the baseline model (default k = 1)
k_1 <- knn(train = signs[-1], test = test_signs[-1], cl = sign_types)
mean(k_1 == signs_actual)
[1] 1
# Modify the above to set k = 7
k_7 <- knn(train = signs[-1], test = test_signs[-1], cl = sign_types, k = 7)
mean(k_7 == signs_actual)
[1] 0.9661017
# Set k = 15 and compare to the above
k_15 <- knn(train = signs[-1], test = test_signs[-1], cl = sign_types, k = 15)
mean(k_15 == signs_actual)
[1] 0.9661017

1.2.3 Seeing how the neighbors voted

When multiple nearest neighbors hold a vote, it can sometimes be useful to examine whether the voters were unanimous or widely separated.

For example, knowing more about the voters’ confidence in the classification could allow an autonomous vehicle to use caution in the case there is any chance at all that a stop sign is ahead.

In this exercise, you will learn how to obtain the voting results from the knn() function.

# Use the prob parameter to get the proportion of votes for the winning class
sign_pred <- knn(train = signs[-1], test = test_signs[-1], cl = sign_types, k = 7, prob = TRUE)


# Get the "prob" attribute from the predicted classes
sign_prob <- attr(sign_pred, "prob")

# Examine the first several predictions
head(sign_pred)
[1] pedestrian pedestrian pedestrian stop       pedestrian pedestrian
Levels: pedestrian speed stop
# Examine the proportion of votes for the winning class
head(sign_prob)
[1] 0.5714286 0.7142857 0.8571429 0.4285714 1.0000000 0.8571429

Now you can get an idea of how certain your kNN learner is about its classifications.

1.3 Data preparation for kNN

Before applying kNN to a classification task, it is common practice to rescale the data using a technique like min-max normalization, to ensure all data elements may contribute equal shares to distance.

Rescaling reduces the influence of extreme values on kNN’s distance function.

2 Naive Bayes

2.1 Understanding Bayesian methods

2.1.1 Computing probabilities

The where9am data frame contains 91 days (thirteen weeks) worth of data in which Brett recorded his location at 9am each day as well as whether the daytype was a weekend or weekday.

library(readr)
locations <- read_csv("locations.csv")
Parsed with column specification:
cols(
  month = col_double(),
  day = col_double(),
  weekday = col_character(),
  daytype = col_character(),
  hour = col_double(),
  hourtype = col_character(),
  location = col_character()
)
where9am <- locations %>% 
  filter(hour == 9, hourtype == "morning")

Using the conditional probability formula below, you can compute the probability that Brett is working in the office, given that it is a weekday.

\(\displaystyle P(A|B)=\frac{P(A\ and\ B)}{P(B)}\)

Calculations like these are the basis of the Naive Bayes destination prediction model we’ll develop later.

A = office B = weekday

# Compute P(A) 
p_A <- nrow(subset(where9am, location == "office")) / nrow(where9am)

# Compute P(B)
p_B <- nrow(subset(where9am, daytype == "weekday")) / nrow(where9am)

# Compute the observed P(A and B)
p_AB <- nrow(subset(where9am, location == "office" & daytype == "weekday")) / nrow(where9am)

# Compute P(A | B) and print its value
p_A_given_B <- p_AB/p_B
p_A_given_B
[1] 0.6

We found that there is a 60% chance Brett is in the office at 9am given that it is a weekday. On the other hand, if Brett is never in the office on a weekend:

  • P(office and weekend) = 0.
  • P(office | weekend) = 0.
  • Brett’s location is dependent on the day of the week.

Because the events do not overlap, knowing that one occurred tells you much about the status of the other.

2.1.2 A simple Naive Bayes location model

The probability that Brett is at work or at home at 9am is highly dependent on whether it is the weekend or a weekday.

To see this finding in action, use the where9am data frame to build a Naive Bayes model on the same data.

We can then use this model to predict the future: where does the model think that Brett will be at 9am on Thursday and at 9am on Saturday?

temp_df <- data.frame(daytype=factor(c("weekday", "weekend")))
thursday9am <- subset(temp_df, daytype == "weekday")
saturday9am <- subset(temp_df, daytype == "weekend")
# Load the naivebayes package
library(naivebayes)

# Build the location prediction model
locmodel <- naive_bayes(location ~ daytype, data = where9am)
naive_bayes(): Feature daytype - zero probabilities are present. Consider Laplace smoothing.
# Predict Thursday's 9am location
predict(locmodel, thursday9am)
[1] office
Levels: appointment campus home office
# Predict Saturdays's 9am location
predict(locmodel, saturday9am)
[1] home
Levels: appointment campus home office

Not surprisingly, Brett is most likely at the office at 9am on a Thursday, but at home at the same time on a Saturday!

2.1.3 Examining “raw” probabilities

The naivebayes package offers several ways to peek inside a Naive Bayes model.

Typing the name of the model object provides the a priori (overall) and conditional probabilities of each of the model’s predictors. If one were so inclined, you might use these for calculating posterior (predicted) probabilities by hand.

Alternatively, R will compute the posterior probabilities for you if the type = “prob” parameter is supplied to the predict() function.

Using these methods, examine how the model’s predicted 9am location probability varies from day-to-day.

# The 'naivebayes' package is loaded into the workspace
# and the Naive Bayes 'locmodel' has been built

# Examine the location prediction model
locmodel

=============================== Naive Bayes =============================== 
 
 Call: 
naive_bayes.formula(formula = location ~ daytype, data = where9am)

--------------------------------------------------------------------------- 
 
Laplace smoothing: 0

--------------------------------------------------------------------------- 
 
 A priori probabilities: 

appointment      campus        home      office 
 0.01098901  0.10989011  0.45054945  0.42857143 

--------------------------------------------------------------------------- 
 
 Tables: 

--------------------------------------------------------------------------- 
 ::: daytype (Bernoulli) 
--------------------------------------------------------------------------- 
         
daytype   appointment    campus      home    office
  weekday   1.0000000 1.0000000 0.3658537 1.0000000
  weekend   0.0000000 0.0000000 0.6341463 0.0000000

---------------------------------------------------------------------------
# Obtain the predicted probabilities for Thursday at 9am
predict(locmodel, thursday9am , type = "prob")
     appointment    campus      home office
[1,]  0.01538462 0.1538462 0.2307692    0.6
# Obtain the predicted probabilities for Saturday at 9am
predict(locmodel, saturday9am , type = "prob")
      appointment       campus      home      office
[1,] 3.838772e-05 0.0003838772 0.9980806 0.001497121

Did you notice the predicted probability of Brett being at the office on a Saturday is zero?

2.1.4 Understanding independence

One event is independent of another if knowing one doesn’t give you information about how likely the other is. For example, knowing if it’s raining in New York doesn’t help you predict the weather in San Francisco. The weather events in the two cities are independent of each other.

2.2 Understanding NB’s “naivety”

The Naive Bayes algorithm got its name because it makes a “naive” assumption about event independence. The joint probability of independent events can be computed much more simply by multiplying their individual probabilities.

2.2.1 A more sophisticated location model

The locations dataset records Brett’s location every hour for 13 weeks. Each hour, the tracking information includes the daytype (weekend or weekday) as well as the hourtype (morning, afternoon, evening, or night).

Using this data, build a more sophisticated model to see how Brett’s predicted location not only varies by the day of week but also by the time of day. The dataset locations is already loaded in your workspace.

You can specify additional independent variables in your formula using the + sign (e.g. y ~ x + b).

weekday_afternoon <- locations[13, ]
weekday_evening <- locations[19, ]
# Build a NB model of location
locmodel <- naive_bayes(location ~ daytype + hourtype, data = locations)
naive_bayes(): Feature daytype - zero probabilities are present. Consider Laplace smoothing.naive_bayes(): Feature hourtype - zero probabilities are present. Consider Laplace smoothing.
# Predict Brett's location on a weekday afternoon
predict(locmodel, weekday_afternoon)
predict.naive_bayes(): more features in the newdata are provided as there are probability tables in the object. Calculation is performed based on features to be found in the tables.
[1] office
Levels: appointment campus home office restaurant store theater
# Predict Brett's location on a weekday evening
predict(locmodel, weekday_evening)
predict.naive_bayes(): more features in the newdata are provided as there are probability tables in the object. Calculation is performed based on features to be found in the tables.
[1] home
Levels: appointment campus home office restaurant store theater

Your Naive Bayes model forecasts that Brett will be at the office on a weekday afternoon and at home in the evening.

2.2.2 Preparing for unforeseen circumstances

While Brett was tracking his location over 13 weeks, he never went into the office during the weekend. Consequently, the joint probability of P(office and weekend) = 0.

Explore how this impacts the predicted probability that Brett may go to work on the weekend in the future. Additionally, you can see how using the Laplace correction will allow a small chance for these types of unforeseen circumstances.

locations <- locations %>% 
  select(daytype, hourtype, location) %>% 
  mutate(daytype = as.factor(daytype), hourtype = as.factor(hourtype), location = as.factor(location))
weekend_afternoon <- locations[85, ]
# Observe the predicted probabilities for a weekend afternoon
predict(locmodel, weekend_afternoon, type = "prob")
predict.naive_bayes(): more features in the newdata are provided as there are probability tables in the object. Calculation is performed based on features to be found in the tables.
     appointment       campus      home      office restaurant      store
[1,]  0.02462883 0.0004802622 0.8439145 0.003349521  0.1111338 0.01641922
         theater
[1,] 7.38865e-05
# Build a new model using the Laplace correction
locmodel2 <- naive_bayes(location ~ daytype + hourtype, data = locations, laplace = 1)

# Observe the new predicted probabilities for a weekend afternoon
predict(locmodel2, weekend_afternoon, type = "prob")
predict.naive_bayes(): more features in the newdata are provided as there are probability tables in the object. Calculation is performed based on features to be found in the tables.
     appointment      campus      home      office restaurant      store
[1,]  0.02013872 0.006187715 0.8308154 0.007929249  0.1098743 0.01871085
         theater
[1,] 0.006343697

Adding the Laplace correction allows for the small chance that Brett might go to the office on the weekend in the future.

The small probability added to every outcome ensures that they are all possible even if never previously observed.

3 Logistic Regression

3.1 Making binary predictions with regression

3.1.1 Building simple logistic regression models

The donors dataset contains 93,462 examples of people mailed in a fundraising solicitation for paralyzed military veterans. The donated column is 1 if the person made a donation in response to the mailing and 0 otherwise. This binary outcome will be the dependent variable for the logistic regression model.

The remaining columns are features of the prospective donors that may influence their donation behavior. These are the model’s independent variables.

donors <- read_csv("donors.csv")
Parsed with column specification:
cols(
  donated = col_double(),
  veteran = col_double(),
  bad_address = col_double(),
  age = col_double(),
  has_children = col_double(),
  wealth_rating = col_double(),
  interest_veterans = col_double(),
  interest_religion = col_double(),
  pet_owner = col_double(),
  catalog_shopper = col_double(),
  recency = col_character(),
  frequency = col_character(),
  money = col_character()
)

When building a regression model, it is often helpful to form a hypothesis about which independent variables will be predictive of the dependent variable. The bad_address column, which is set to 1 for an invalid mailing address and 0 otherwise, seems like it might reduce the chances of a donation. Similarly, one might suspect that religious interest (interest_religion) and interest in veterans affairs (interest_veterans) would be associated with greater charitable giving.

# Examine the dataset to identify potential independent variables
str(donors)
tibble [93,462 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ donated          : num [1:93462] 0 0 0 0 0 0 0 0 0 0 ...
 $ veteran          : num [1:93462] 0 0 0 0 0 0 0 0 0 0 ...
 $ bad_address      : num [1:93462] 0 0 0 0 0 0 0 0 0 0 ...
 $ age              : num [1:93462] 60 46 NA 70 78 NA 38 NA NA 65 ...
 $ has_children     : num [1:93462] 0 1 0 0 1 0 1 0 0 0 ...
 $ wealth_rating    : num [1:93462] 0 3 1 2 1 0 2 3 1 0 ...
 $ interest_veterans: num [1:93462] 0 0 0 0 0 0 0 0 0 0 ...
 $ interest_religion: num [1:93462] 0 0 0 0 1 0 0 0 0 0 ...
 $ pet_owner        : num [1:93462] 0 0 0 0 0 0 1 0 0 0 ...
 $ catalog_shopper  : num [1:93462] 0 0 0 0 1 0 0 0 0 0 ...
 $ recency          : chr [1:93462] "CURRENT" "CURRENT" "CURRENT" "CURRENT" ...
 $ frequency        : chr [1:93462] "FREQUENT" "FREQUENT" "FREQUENT" "FREQUENT" ...
 $ money            : chr [1:93462] "MEDIUM" "HIGH" "MEDIUM" "MEDIUM" ...
 - attr(*, "spec")=
  .. cols(
  ..   donated = col_double(),
  ..   veteran = col_double(),
  ..   bad_address = col_double(),
  ..   age = col_double(),
  ..   has_children = col_double(),
  ..   wealth_rating = col_double(),
  ..   interest_veterans = col_double(),
  ..   interest_religion = col_double(),
  ..   pet_owner = col_double(),
  ..   catalog_shopper = col_double(),
  ..   recency = col_character(),
  ..   frequency = col_character(),
  ..   money = col_character()
  .. )
# Explore the dependent variable
table(donors$donated)

    0     1 
88751  4711 
# Build the donation model
donation_model <- glm(donated ~ bad_address + interest_religion + interest_veterans, 
                      data = donors, family = "binomial")

# Summarize the model results
summary(donation_model)

Call:
glm(formula = donated ~ bad_address + interest_religion + interest_veterans, 
    family = "binomial", data = donors)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.3480  -0.3192  -0.3192  -0.3192   2.5678  

Coefficients:
                  Estimate Std. Error  z value Pr(>|z|)    
(Intercept)       -2.95139    0.01652 -178.664   <2e-16 ***
bad_address       -0.30780    0.14348   -2.145   0.0319 *  
interest_religion  0.06724    0.05069    1.327   0.1847    
interest_veterans  0.11009    0.04676    2.354   0.0186 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 37330  on 93461  degrees of freedom
Residual deviance: 37316  on 93458  degrees of freedom
AIC: 37324

Number of Fisher Scoring iterations: 5

With the model built, we can now use it to make predictions.

3.1.2 Making a binary prediction

As with many of R’s machine learning methods, we can apply the predict() function to the model object to forecast future behavior. By default, predict() outputs predictions in terms of log odds unless type = “response” is specified. This converts the log odds to probabilities.

Because a logistic regression model estimates the probability of the outcome, it is up to you to determine the threshold at which the probability implies action. One must balance the extremes of being too cautious versus being too aggressive. For example, if you were to solicit only the people with a 99% or greater donation probability, you may miss out on many people with lower estimated probabilities that still choose to donate. This balance is particularly important to consider for severely imbalanced outcomes, such as in this dataset where donations are relatively rare.

# Estimate the donation probability
donors$donation_prob <- predict(donation_model, type = "response")

# Find the donation probability of the average prospect
mean(donors$donated)
[1] 0.05040551
# Predict a donation if probability of donation is greater than average (0.0504)
donors$donation_pred <- ifelse(donors$donation_prob > 0.0504, 1, 0)

# Calculate the model's accuracy
mean(donors$donated == donors$donation_pred)
[1] 0.794815

With an accuracy of nearly 80%, the model seems to be doing its job. But is it too good to be true?

Despite this relatively high accuracy, the result is misleading due to the rarity of outcome being predicted.

What would the accuracy have been if a model had simply predicted “no donation” for each person?

mean(donors$donated == 0)
[1] 0.9495945

95%!!

With an accuracy of only 80%, the model is actually performing WORSE than if it were to predict non-donor for every record.

3.2 Model performance tradeoffs

3.2.1 Calculating ROC Curves and AUC

We have demonstrated that accuracy is a very misleading measure of model performance on imbalanced datasets. Graphing the model’s performance better illustrates the tradeoff between a model that is overly agressive and one that is overly passive.

We you will create a ROC curve and compute the area under the curve (AUC) to evaluate the logistic regression model of donations you built earlier.

library(pROC)
# Load the pROC package
library(pROC)

# Create a ROC curve
ROC <- roc(donors$donated, donors$donation_prob)
Setting levels: control = 0, case = 1
Setting direction: controls < cases
# Plot the ROC curve
plot(ROC, col = "blue")


# Calculate the area under the curve (AUC)
auc(ROC)
Area under the curve: 0.5102

Based on this visualization, the model isn’t doing much better than baseline— a model doing nothing but making predictions at random.

3.3 Dummy variables, missing data, and interactions

3.3.1 Coding categorical features

Sometimes a dataset contains numeric values that represent a categorical feature.

In the donors dataset, wealth_rating uses numbers to indicate the donor’s wealth level:

  • 0 = Unknown
  • 1 = Low
  • 2 = Medium
  • 3 = High
# Convert the wealth rating to a factor
donors$wealth_levels <- factor(donors$wealth_rating, levels = c(0, 1, 2, 3), labels = c("Unknown", "Low", "Medium", "High"))

# Use relevel() to change reference category
donors$wealth_levels <- relevel(donors$wealth_levels, ref = "Medium")

# See how our factor coding impacts the model
summary(glm(donated ~ wealth_levels, data = donors, family = "binomial"))

Call:
glm(formula = donated ~ wealth_levels, family = "binomial", data = donors)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.3320  -0.3243  -0.3175  -0.3175   2.4582  

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)          -2.91894    0.03614 -80.772   <2e-16 ***
wealth_levelsUnknown -0.04373    0.04243  -1.031    0.303    
wealth_levelsLow     -0.05245    0.05332  -0.984    0.325    
wealth_levelsHigh     0.04804    0.04768   1.008    0.314    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 37330  on 93461  degrees of freedom
Residual deviance: 37323  on 93458  degrees of freedom
AIC: 37331

Number of Fisher Scoring iterations: 5

3.3.2 Handling missing data

Some of the prospective donors have missing age data. Unfortunately, R will exclude any cases with NA values when building a regression model.

One workaround is to replace, or impute, the missing values with an estimated value. After doing so, you may also create a missing data indicator to model the possibility that cases with missing data are different in some way from those without.

# Find the average age among non-missing values
summary(donors$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   1.00   48.00   62.00   61.65   75.00   98.00   22546 
# Impute missing age values with the mean age
donors$imputed_age <- ifelse(is.na(donors$age), round(mean(donors$age, na.rm = TRUE), 2), donors$age)

# Create missing value indicator for age
donors$missing_age <- ifelse(is.na(donors$age), 1, 0)

This is one way to handle missing data, but be careful! Sometimes missing data has to be dealt with using more complicated methods.

Why is it often useful to include this indicator as a predictor in the model?

  • A missing value may represent a unique category by itself
  • There may be an important difference between records with and without missing data
  • Whatever caused the missing value may also be related to the outcome press

3.3.3 Building a more sophisticated model

One of the best predictors of future giving is a history of recent, frequent, and large gifts. In marketing terms, this is known as R/F/M:

  • Recency
  • Frequency
  • Money

Donors that haven’t given both recently and frequently may be especially likely to give again; in other words, the combined impact of recency and frequency may be greater than the sum of the separate effects.

Because these predictors together have a greater impact on the dependent variable, their joint effect must be modeled as an interaction.

# Build a recency, frequency, and money (RFM) model
rfm_model <- glm(donated ~ money + recency * frequency, data= donors, family = "binomial")

# Summarize the RFM model to see how the parameters were coded
summary(rfm_model)

Call:
glm(formula = donated ~ money + recency * frequency, family = "binomial", 
    data = donors)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.3696  -0.3696  -0.2895  -0.2895   2.7924  

Coefficients:
                                  Estimate Std. Error z value Pr(>|z|)
(Intercept)                       -3.01142    0.04279 -70.375   <2e-16
moneyMEDIUM                        0.36186    0.04300   8.415   <2e-16
recencyLAPSED                     -0.86677    0.41434  -2.092   0.0364
frequencyINFREQUENT               -0.50148    0.03107 -16.143   <2e-16
recencyLAPSED:frequencyINFREQUENT  1.01787    0.51713   1.968   0.0490
                                     
(Intercept)                       ***
moneyMEDIUM                       ***
recencyLAPSED                     *  
frequencyINFREQUENT               ***
recencyLAPSED:frequencyINFREQUENT *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 37330  on 93461  degrees of freedom
Residual deviance: 36938  on 93457  degrees of freedom
AIC: 36948

Number of Fisher Scoring iterations: 6
# Compute predicted probabilities for the RFM model
rfm_prob <- predict(rfm_model, type = "response")

# Plot the ROC curve and find AUC for the new model
library(pROC)
ROC <- roc(donors$donated, rfm_prob)
Setting levels: control = 0, case = 1
Setting direction: controls < cases
plot(ROC, col = "red")

auc(ROC)
Area under the curve: 0.5785

Based on the ROC curve, you’ve confirmed that past giving patterns are certainly predictive of future giving.

3.4 Automatic feature selection

3.4.1 Stepwise regression

In spite of its utility for feature selection, stepwise regression is not frequently used in disciplines outside of machine learning due to some important caveats.

  • It is not guaranteed to find the best possible model press
  • The stepwise regression procedure violates some statistical assumptions press
  • It can result in a model that makes little sense in the real world press

Though stepwise regression is frowned upon, it may still be useful for building predictive models in the absence of another starting place.

3.4.2 Building a stepwise regression model

In the absence of subject-matter expertise, stepwise regression can assist with the search for the most important predictors of the outcome of interest.

# Specify a null model with no predictors
null_model <- glm(donated ~ 1, data = donors, family = "binomial")

# Specify the full model using all of the potential predictors
full_model <- glm(donated ~ ., data = donors, family = "binomial")

# Use a forward stepwise algorithm to build a parsimonious model
step_model <- step(null_model, scope = list(lower = null_model, upper = full_model), direction = "forward")
Start:  AIC=37332.13
donated ~ 1
using the 70916/93462 rows from a combined fit
                    Df Deviance   AIC
+ frequency          1    28502 37122
+ money              1    28621 37241
+ wealth_rating      1    28705 37326
+ has_children       1    28705 37326
+ age                1    28707 37328
+ imputed_age        1    28707 37328
+ wealth_levels      3    28704 37328
+ interest_veterans  1    28709 37330
+ donation_prob      1    28710 37330
+ donation_pred      1    28710 37330
+ catalog_shopper    1    28710 37330
+ pet_owner          1    28711 37331
<none>                    28714 37332
+ interest_religion  1    28712 37333
+ recency            1    28713 37333
+ bad_address        1    28714 37334
+ veteran            1    28714 37334

Step:  AIC=37024.77
donated ~ frequency
using the 70916/93462 rows from a combined fit
                    Df Deviance   AIC
+ money              1    28441 36966
+ wealth_rating      1    28493 37018
+ wealth_levels      3    28490 37019
+ has_children       1    28494 37019
+ donation_prob      1    28498 37023
+ interest_veterans  1    28498 37023
+ catalog_shopper    1    28499 37024
+ donation_pred      1    28499 37024
+ age                1    28499 37024
+ imputed_age        1    28499 37024
+ pet_owner          1    28499 37024
<none>                    28502 37025
+ interest_religion  1    28501 37026
+ recency            1    28501 37026
+ bad_address        1    28502 37026
+ veteran            1    28502 37027

Step:  AIC=36949.71
donated ~ frequency + money
using the 70916/93462 rows from a combined fit
                    Df Deviance   AIC
+ wealth_levels      3    28427 36942
+ wealth_rating      1    28431 36942
+ has_children       1    28432 36943
+ interest_veterans  1    28438 36948
+ donation_prob      1    28438 36949
+ catalog_shopper    1    28438 36949
+ donation_pred      1    28438 36949
+ age                1    28438 36949
+ imputed_age        1    28438 36949
+ pet_owner          1    28439 36949
<none>                    28441 36950
+ interest_religion  1    28440 36951
+ recency            1    28440 36951
+ bad_address        1    28441 36951
+ veteran            1    28441 36952

Step:  AIC=36945.48
donated ~ frequency + money + wealth_levels
using the 70916/93462 rows from a combined fit
                    Df Deviance   AIC
+ has_children       1    28416 36937
+ age                1    28424 36944
+ imputed_age        1    28424 36944
+ interest_veterans  1    28424 36945
+ donation_prob      1    28424 36945
+ catalog_shopper    1    28424 36945
+ donation_pred      1    28425 36945
<none>                    28427 36945
+ pet_owner          1    28425 36946
+ interest_religion  1    28426 36947
+ recency            1    28426 36947
+ bad_address        1    28427 36947
+ veteran            1    28427 36947

Step:  AIC=36938.4
donated ~ frequency + money + wealth_levels + has_children
using the 70916/93462 rows from a combined fit
                    Df Deviance   AIC
+ pet_owner          1    28413 36937
+ donation_prob      1    28413 36937
+ catalog_shopper    1    28413 36937
+ interest_veterans  1    28413 36937
+ donation_pred      1    28414 36938
<none>                    28416 36938
+ interest_religion  1    28415 36939
+ age                1    28416 36940
+ imputed_age        1    28416 36940
+ recency            1    28416 36940
+ bad_address        1    28416 36940
+ veteran            1    28416 36940

Step:  AIC=36932.25
donated ~ frequency + money + wealth_levels + has_children + 
    pet_owner
using the 70916/93462 rows from a combined fit
                    Df Deviance   AIC
<none>                    28413 36932
+ donation_prob      1    28411 36932
+ interest_veterans  1    28411 36932
+ catalog_shopper    1    28412 36933
+ donation_pred      1    28412 36933
+ age                1    28412 36933
+ imputed_age        1    28412 36933
+ recency            1    28413 36934
+ interest_religion  1    28413 36934
+ bad_address        1    28413 36934
+ veteran            1    28413 36934
# Estimate the stepwise donation probability
step_prob <- predict(step_model, type = "response")

# Plot the ROC of the stepwise model
library(pROC)
ROC <- roc(donors$donated, step_prob)
Setting levels: control = 0, case = 1
Setting direction: controls < cases
plot(ROC, col = "red")

auc(ROC)
Area under the curve: 0.5849

Despite the caveats of stepwise regression, it seems to have resulted in a relatively strong model!

4 Classification Trees

4.1 Making decisions with trees

4.1.1 Building a simple decision tree

The loans dataset contains 11,312 randomly-selected people who applied for and later received loans from Lending Club, a US-based peer-to-peer lending company.

loans <- readRDS("loans.rds")

We will use a decision tree to try to learn patterns in the outcome of these loans (either repaid or default) based on the requested loan amount and credit score at the time of application.

Then, see how the tree’s predictions differ for an applicant with good credit versus one with bad credit.

good_credit <- head(loans %>%
  filter(home_ownership == "MORTGAGE"), 1) 
bad_credit <- head(loans %>%
  filter(home_ownership == "RENT", debt_to_income == "LOW", emp_length == "6 - 9 years"), 1) 
# Load the rpart package
library(rpart)

# Build a lending model predicting loan outcome versus loan amount and credit score
loan_model <- rpart(outcome ~ loan_amount + credit_score, data = loans, method = "class", control = rpart.control(cp = 0))

# Make a prediction for someone with good credit
predict(loan_model, good_credit, type = "class")
     1 
repaid 
Levels: default repaid
# Make a prediction for someone with bad credit
predict(loan_model, bad_credit, type = "class")
      1 
default 
Levels: default repaid

4.1.2 Visualizing classification trees

Due to government rules to prevent illegal discrimination, lenders are required to explain why a loan application was rejected.

The structure of classification trees can be depicted visually, which helps to understand how the tree makes its decisions.

# Examine the loan_model object
loan_model
n= 11312 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 11312 5654 repaid (0.4998232 0.5001768)  
   2) credit_score=AVERAGE,LOW 9490 4437 default (0.5324552 0.4675448)  
     4) credit_score=LOW 1667  631 default (0.6214757 0.3785243) *
     5) credit_score=AVERAGE 7823 3806 default (0.5134859 0.4865141)  
      10) loan_amount=HIGH 2472 1079 default (0.5635113 0.4364887) *
      11) loan_amount=LOW,MEDIUM 5351 2624 repaid (0.4903756 0.5096244)  
        22) loan_amount=LOW 1810  874 default (0.5171271 0.4828729) *
        23) loan_amount=MEDIUM 3541 1688 repaid (0.4767015 0.5232985) *
   3) credit_score=HIGH 1822  601 repaid (0.3298573 0.6701427) *
# Load the rpart.plot package
library(rpart.plot)

# Plot the loan_model with default settings
rpart.plot(loan_model)


# Plot the loan_model with customized settings
rpart.plot(loan_model, type = 3, box.palette = c("red", "green"), fallen.leaves = TRUE)

Using the tree structure, you can clearly see how the tree makes its decisions.

4.2 Growing larger classification trees

A classification tree grows using a divide-and-conquer process. Each time the tree grows larger, it splits groups of data into smaller subgroups, creating new branches in the tree. Divide-and-conquer always looks to create the split resulting in the greatest improvement to purity.

4.2.1 Creating random test datasets

Before building a more sophisticated lending model, it is important to hold out a portion of the loan data to simulate how well it will predict the outcomes of future loan applicants.

As depicted in the following image, you can use 75% of the observations for training and 25% for testing the model.

The sample() function can be used to generate a random sample of rows to include in the training set. Simply supply it the total number of observations and the number needed for training.

Use the resulting vector of row IDs to subset the loans into training and testing datasets.

# Determine the number of rows for training
nrow(loans)
[1] 11312
# Create a random sample of row IDs
sample_rows <- sample(nrow(loans), nrow(loans)*0.75)

# Create the training dataset
loans_train <- loans[sample_rows,]

# Create the test dataset
loans_test <- loans[-sample_rows,]

Creating a test set is an easy way to check your model’s performance.

4.2.2 Building and evaluating a larger tree

Previously, we created a simple decision tree that used the applicant’s credit score and requested loan amount to predict the loan outcome.

Lending Club has additional information about the applicants, such as home ownership status, length of employment, loan purpose, and past bankruptcies, that may be useful for making more accurate predictions.

Using all of the available applicant data, build a more sophisticated lending model using the random training dataset created previously. Then, use this model to make predictions on the testing dataset to estimate the performance of the model on future loan applications.

# Grow a tree using all of the available applicant data
loan_model <- rpart(outcome ~ ., data = loans_train, method = "class", control = rpart.control(cp = 0))

# Make predictions on the test dataset
loans_test$pred <- predict(loan_model, loans_test, type = "class")

# Examine the confusion matrix
table(loans_test$outcome, loans_test$pred)
         
          default repaid
  default     761    688
  repaid      598    781
# Compute the accuracy on the test dataset
mean(loans_test$outcome == loans_test$pred)
[1] 0.5452617

How did adding more predictors change the model’s performance?

Holding out test data reduces the amount of data available for growing the decision tree. In spite of this, it is very important to evaluate decision trees on data it has not seen before.

4.3 Tending to classification trees

4.3.1 Preventing overgrown trees

The tree grown on the full set of applicant data grew to be extremely large and extremely complex, with hundreds of splits and leaf nodes containing only a handful of applicants. This tree would be almost impossible for a loan officer to interpret.

Using the pre-pruning methods for early stopping, we can prevent a tree from growing too large and complex. See how the rpart control options for maximum tree depth and minimum split count impact the resulting tree.

# Grow a tree with maxdepth of 6
loan_model <- rpart(outcome ~ ., data = loans_train, method = "class", control = rpart.control(cp = 0, maxdepth = 6))

# Make a class prediction on the test set
loans_test$pred <- predict(loan_model, loans_test, type = "class")

# Compute the accuracy of the simpler tree
mean(loans_test$outcome == loans_test$pred)
[1] 0.5731966
# Swap maxdepth for a minimum split of 500 
loan_model <- rpart(outcome ~ ., data = loans_train, method = "class", control = rpart.control(cp = 0, minsplit = 500))

# Run this. How does the accuracy change?
loans_test$pred <- predict(loan_model, loans_test, type = "class")
mean(loans_test$pred == loans_test$outcome)
[1] 0.5958274

It may seem surprising, but creating a simpler decision tree may actually result in greater performance on the test dataset.

4.3.2 Creating a nicely pruned tree

Stopping a tree from growing all the way can lead it to ignore some aspects of the data or miss important trends it may have discovered later.

By using post-pruning, you can intentionally grow a large and complex tree then prune it to be smaller and more efficient later on.

# Grow an overly complex tree
loan_model <- rpart(outcome ~ ., data = loans_train, method = "class", control = rpart.control(cp = 0))

# Examine the complexity plot
plotcp(loan_model)


# Prune the tree
loan_model_pruned <- prune(loan_model, cp = 0.0014)

# Compute the accuracy of the pruned tree
loans_test$pred <- predict(loan_model_pruned, loans_test, type = "class")
mean(loans_test$pred == loans_test$outcome)
[1] 0.5947666

As with pre-pruning, creating a simpler tree actually improved the performance of the tree on the test dataset.

4.3.3 Why do trees benefit from pruning?

Classification trees can grow indefinitely, until they are told to stop or run out of data to divide-and-conquer.

Just like trees in nature, classification trees that grow overly large can require pruning to reduce the excess growth. However, this generally results in a tree that classifies fewer training examples correctly.

Why, then, are pre-pruning and post-pruning almost always used?

  • Simpler trees are easier to interpret
  • Simpler trees using early stopping are faster to train
  • Simpler trees may perform better on the testing data press

4.4 Random forest

Groups of classification trees can be combined into an ensemble that generates a single prediction by allowing the trees to “vote” on the outcome.

The diversity among the trees may lead it to discover more subtle patterns. press. The teamwork-based approach of the random forest may help it find important trends a single tree may miss.

4.4.1 Building a random forest model

In spite of the fact that a forest can contain hundreds of trees, growing a decision tree forest is perhaps even easier than creating a single highly-tuned tree.

Using the randomForest package, build a random forest and see how it compares to the single trees you built previously.

Keep in mind that due to the random nature of the forest, the results may vary slightly each time you create the forest.

# Load the randomForest package
library(randomForest)

# Build a random forest model
loan_model <- randomForest(outcome ~ ., data = loans_train)

# Compute the accuracy of the random forest
loans_test$pred <- predict(loan_model, loans_test)
mean(loans_test$pred == loans_test$outcome)
[1] 0.5940594
---
title: "Supervised Learning in R: Classification"
output:
  html_notebook:
    toc: true
    toc_float: true
    toc_collapsed: false
    number_sections: true
    
toc_depth: 3
---
# k-Nearest Neighbors (kNN)

## Classification with Nearest Neighbors

### Recognizing a road sign with kNN

After several trips with a human behind the wheel, it is time for the self-driving car to attempt the test course alone.

As it begins to drive away, its camera captures the following image:

![](http://s3.amazonaws.com/assets.datacamp.com/production/course_2906/datasets/knn_stop_99.gif)

Can we apply a kNN classifier to help the car recognize this sign?

The dataset signs is loaded in your workspace along with the dataframe next_sign, which holds the observation you want to classify.

```{r}
# Load the 'class' package
library(dplyr)
library(class)
```
```{r}
next_sign <- readRDS("next_sing.rds")
signs <- read.csv("knn_traffic_signs.csv", stringsAsFactors=FALSE)
signs <- signs %>% 
  select(-id, -sample)
```
```{r}
# Load the 'class' package
library(class)

# Create a vector of labels
sign_types <- signs$sign_type

# Classify the next sign observed
knn(train = signs[-1], test = next_sign, cl = sign_types)
```
We've trained your first nearest neighbor classifier!
```{r}
next_sign
```
### Exploring the traffic sign dataset

To better understand how the knn() function was able to classify the stop sign, it may help to examine the training dataset it used.

Each previously observed street sign was divided into a 4x4 grid, and the red, green, and blue level for each of the 16 center pixels is recorded as illustrated here.
![](http://s3.amazonaws.com/assets.datacamp.com/production/course_2906/datasets/knn_sign_data.png)

```{r}
# Examine the structure of the signs dataset
str(signs)

# Count the number of signs of each type
table(signs$sign_type)

# Check r10's average red level by sign type
aggregate(r10 ~ sign_type, data = signs, mean)
```
As you might have expected, stop signs tend to have a higher average red value. This is how kNN identifies similar signs.

### Classifying a collection of road signs

Now that the autonomous vehicle has successfully stopped on its own, your team feels confident allowing the car to continue the test course.

The test course includes 59 additional road signs divided into three types:
![](http://s3.amazonaws.com/assets.datacamp.com/production/course_2906/datasets/knn_stop_28.gif)
![](http://s3.amazonaws.com/assets.datacamp.com/production/course_2906/datasets/knn_speed_55.gif)
![](http://s3.amazonaws.com/assets.datacamp.com/production/course_2906/datasets/knn_peds_47.gif)

At the conclusion of the trial, we are asked to measure the car's overall performance at recognizing these sign
```{r}
test_signs <- readRDS("test_signs.rds")
```
```{r}
# Use kNN to identify the test road signs
sign_types <- signs$sign_type
signs_pred <- knn(train = signs[-1], test = test_signs[-1], cl = sign_types)

# Create a confusion matrix of the predicted versus actual values
signs_actual <- test_signs$sign_type
table(signs_pred, signs_actual)

# Compute the accuracy
mean(signs_pred == signs_actual)
```
That self-driving car is really coming along! The confusion matrix lets you look for patterns in the classifier's errors.

## K in kNN

### Understanding the impact of 'k'

There is a complex relationship between k and classification accuracy. Bigger is not always better.

- Smaller k increases the impact of noisy data
- A smaller k may utilize more subtle patterns in the data

### Testing other 'k' values

By default, the knn() function in the class package uses only the single nearest neighbor.

Setting a k parameter allows the algorithm to consider additional nearby neighbors. This enlarges the collection of neighbors which will vote on the predicted class.

Compare k values of 1, 7, and 15 to examine the impact on traffic sign classification accuracy.
```{r}
# Compute the accuracy of the baseline model (default k = 1)
k_1 <- knn(train = signs[-1], test = test_signs[-1], cl = sign_types)
mean(k_1 == signs_actual)

# Modify the above to set k = 7
k_7 <- knn(train = signs[-1], test = test_signs[-1], cl = sign_types, k = 7)
mean(k_7 == signs_actual)

# Set k = 15 and compare to the above
k_15 <- knn(train = signs[-1], test = test_signs[-1], cl = sign_types, k = 15)
mean(k_15 == signs_actual)
```

### Seeing how the neighbors voted

When multiple nearest neighbors hold a vote, it can sometimes be useful to examine whether the voters were unanimous or widely separated.

For example, knowing more about the voters' confidence in the classification could allow an autonomous vehicle to use caution in the case there is any chance at all that a stop sign is ahead.

In this exercise, you will learn how to obtain the voting results from the knn() function.
```{r}
# Use the prob parameter to get the proportion of votes for the winning class
sign_pred <- knn(train = signs[-1], test = test_signs[-1], cl = sign_types, k = 7, prob = TRUE)


# Get the "prob" attribute from the predicted classes
sign_prob <- attr(sign_pred, "prob")

# Examine the first several predictions
head(sign_pred)

# Examine the proportion of votes for the winning class
head(sign_prob)
```
Now you can get an idea of how certain your kNN learner is about its classifications.

## Data preparation for kNN

Before applying kNN to a classification task, it is common practice to rescale the data using a technique like min-max normalization, to ensure all data elements may contribute equal shares to distance.

Rescaling reduces the influence of extreme values on kNN's distance function.

# Naive Bayes

## Understanding Bayesian methods

### Computing probabilities

The where9am data frame contains 91 days (thirteen weeks) worth of data in which Brett recorded his location at 9am each day as well as whether the daytype was a weekend or weekday.

```{r}
library(readr)
locations <- read_csv("locations.csv")
where9am <- locations %>% 
  filter(hour == 9, hourtype == "morning")
```
Using the conditional probability formula below, you can compute the probability that Brett is working in the office, given that it is a weekday.

$\displaystyle P(A|B)=\frac{P(A\ and\ B)}{P(B)}$

Calculations like these are the basis of the Naive Bayes destination prediction model we'll develop later.

A = office
B = weekday

```{r}
# Compute P(A) 
p_A <- nrow(subset(where9am, location == "office")) / nrow(where9am)

# Compute P(B)
p_B <- nrow(subset(where9am, daytype == "weekday")) / nrow(where9am)

# Compute the observed P(A and B)
p_AB <- nrow(subset(where9am, location == "office" & daytype == "weekday")) / nrow(where9am)

# Compute P(A | B) and print its value
p_A_given_B <- p_AB/p_B
p_A_given_B
```
We found that there is a 60% chance Brett is in the office at 9am given that it is a weekday. On the other hand, if Brett is never in the office on a weekend:

- P(office and weekend) = 0.
- P(office | weekend) = 0.
- Brett's location is dependent on the day of the week.

Because the events do not overlap, knowing that one occurred tells you much about the status of the other.

### A simple Naive Bayes location model

The probability that Brett is at work or at home at 9am is highly dependent on whether it is the weekend or a weekday.

To see this finding in action, use the where9am data frame to build a Naive Bayes model on the same data.

We can then use this model to predict the future: where does the model think that Brett will be at 9am on Thursday and at 9am on Saturday?
```{r}
temp_df <- data.frame(daytype=factor(c("weekday", "weekend")))
thursday9am <- subset(temp_df, daytype == "weekday")
saturday9am <- subset(temp_df, daytype == "weekend")
```
```{r}
# Load the naivebayes package
library(naivebayes)

# Build the location prediction model
locmodel <- naive_bayes(location ~ daytype, data = where9am)

# Predict Thursday's 9am location
predict(locmodel, thursday9am)

# Predict Saturdays's 9am location
predict(locmodel, saturday9am)
```
Not surprisingly, Brett is most likely at the office at 9am on a Thursday, but at home at the same time on a Saturday!

### Examining "raw" probabilities

The naivebayes package offers several ways to peek inside a Naive Bayes model.

Typing the name of the model object provides the a priori (overall) and conditional probabilities of each of the model's predictors. If one were so inclined, you might use these for calculating posterior (predicted) probabilities by hand.

Alternatively, R will compute the posterior probabilities for you if the type = "prob" parameter is supplied to the predict() function.

Using these methods, examine how the model's predicted 9am location probability varies from day-to-day. 
```{r}
# The 'naivebayes' package is loaded into the workspace
# and the Naive Bayes 'locmodel' has been built

# Examine the location prediction model
locmodel

# Obtain the predicted probabilities for Thursday at 9am
predict(locmodel, thursday9am , type = "prob")

# Obtain the predicted probabilities for Saturday at 9am
predict(locmodel, saturday9am , type = "prob")
```
Did you notice the predicted probability of Brett being at the office on a Saturday is zero?

### Understanding independence

One event is independent of another if knowing one doesn't give you information about how likely the other is. For example, knowing if it's raining in New York doesn't help you predict the weather in San Francisco. The weather events in the two cities are independent of each other.

## Understanding NB's "naivety"

The Naive Bayes algorithm got its name because it makes a "naive" assumption about event independence. The joint probability of independent events can be computed much more simply by multiplying their individual probabilities.

### A more sophisticated location model

The locations dataset records Brett's location every hour for 13 weeks. Each hour, the tracking information includes the daytype (weekend or weekday) as well as the hourtype (morning, afternoon, evening, or night).

Using this data, build a more sophisticated model to see how Brett's predicted location not only varies by the day of week but also by the time of day. The dataset locations is already loaded in your workspace.

You can specify additional independent variables in your formula using the + sign (e.g. y ~ x + b).
```{r}
weekday_afternoon <- locations[13, ]
weekday_evening <- locations[19, ]
```
```{r}
# Build a NB model of location
locmodel <- naive_bayes(location ~ daytype + hourtype, data = locations)

# Predict Brett's location on a weekday afternoon
predict(locmodel, weekday_afternoon)

# Predict Brett's location on a weekday evening
predict(locmodel, weekday_evening)
```
Your Naive Bayes model forecasts that Brett will be at the office on a weekday afternoon and at home in the evening.

### Preparing for unforeseen circumstances

While Brett was tracking his location over 13 weeks, he never went into the office during the weekend. Consequently, the joint probability of P(office and weekend) = 0.

Explore how this impacts the predicted probability that Brett may go to work on the weekend in the future. Additionally, you can see how using the Laplace correction will allow a small chance for these types of unforeseen circumstances.

```{r}
locations <- locations %>% 
  select(daytype, hourtype, location) %>% 
  mutate(daytype = as.factor(daytype), hourtype = as.factor(hourtype), location = as.factor(location))
```
```{r}
weekend_afternoon <- locations[85, ]
```

```{r}
# Observe the predicted probabilities for a weekend afternoon
predict(locmodel, weekend_afternoon, type = "prob")

# Build a new model using the Laplace correction
locmodel2 <- naive_bayes(location ~ daytype + hourtype, data = locations, laplace = 1)

# Observe the new predicted probabilities for a weekend afternoon
predict(locmodel2, weekend_afternoon, type = "prob")
```
Adding the Laplace correction allows for the small chance that Brett might go to the office on the weekend in the future.

The small probability added to every outcome ensures that they are all possible even if never previously observed.

#  Logistic Regression

## Making binary predictions with regression

### Building simple logistic regression models

The donors dataset contains 93,462 examples of people mailed in a fundraising solicitation for paralyzed military veterans. The donated column is 1 if the person made a donation in response to the mailing and 0 otherwise. This binary outcome will be the dependent variable for the logistic regression model.

The remaining columns are features of the prospective donors that may influence their donation behavior. These are the model's independent variables.
```{r}
donors <- read_csv("donors.csv")
```
When building a regression model, it is often helpful to form a hypothesis about which independent variables will be predictive of the dependent variable. The bad_address column, which is set to 1 for an invalid mailing address and 0 otherwise, seems like it might reduce the chances of a donation. Similarly, one might suspect that religious interest (interest_religion) and interest in veterans affairs (interest_veterans) would be associated with greater charitable giving.

```{r}
# Examine the dataset to identify potential independent variables
str(donors)

# Explore the dependent variable
table(donors$donated)

# Build the donation model
donation_model <- glm(donated ~ bad_address + interest_religion + interest_veterans, 
                      data = donors, family = "binomial")

# Summarize the model results
summary(donation_model)
```
With the model built, we can now use it to make predictions.

### Making a binary prediction

As with many of R's machine learning methods, we can apply the predict() function to the model object to forecast future behavior. By default, predict() outputs predictions in terms of log odds unless type = "response" is specified. This converts the log odds to probabilities.

Because a logistic regression model estimates the probability of the outcome, it is up to you to determine the threshold at which the probability implies action. One must balance the extremes of being too cautious versus being too aggressive. For example, if you were to solicit only the people with a 99% or greater donation probability, you may miss out on many people with lower estimated probabilities that still choose to donate. This balance is particularly important to consider for severely imbalanced outcomes, such as in this dataset where donations are relatively rare.
```{r}
# Estimate the donation probability
donors$donation_prob <- predict(donation_model, type = "response")

# Find the donation probability of the average prospect
mean(donors$donated)

# Predict a donation if probability of donation is greater than average (0.0504)
donors$donation_pred <- ifelse(donors$donation_prob > 0.0504, 1, 0)

# Calculate the model's accuracy
mean(donors$donated == donors$donation_pred)
```
With an accuracy of nearly 80%, the model seems to be doing its job. But is it too good to be true?

Despite this relatively high accuracy, the result is misleading due to the rarity of outcome being predicted.

What would the accuracy have been if a model had simply predicted "no donation" for each person?
```{r}
mean(donors$donated == 0)
```
95%!!

With an accuracy of only 80%, the model is actually performing WORSE than if it were to predict non-donor for every record.

## Model performance tradeoffs

### Calculating ROC Curves and AUC

We have demonstrated that accuracy is a very misleading measure of model performance on imbalanced datasets. Graphing the model's performance better illustrates the tradeoff between a model that is overly agressive and one that is overly passive.

We you will create a ROC curve and compute the area under the curve (AUC) to evaluate the logistic regression model of donations you built earlier.
```{r}
library(pROC)
```
```{r}
# Load the pROC package
library(pROC)

# Create a ROC curve
ROC <- roc(donors$donated, donors$donation_prob)

# Plot the ROC curve
plot(ROC, col = "blue")

# Calculate the area under the curve (AUC)
auc(ROC)
```
 Based on this visualization, the model isn't doing much better than baseline— a model doing nothing but making predictions at random.
 
## Dummy variables, missing data, and interactions

### Coding categorical features
 
Sometimes a dataset contains numeric values that represent a categorical feature.

In the donors dataset, wealth_rating uses numbers to indicate the donor's wealth level:

* 0 = Unknown
* 1 = Low
* 2 = Medium
* 3 = High 

```{r}
# Convert the wealth rating to a factor
donors$wealth_levels <- factor(donors$wealth_rating, levels = c(0, 1, 2, 3), labels = c("Unknown", "Low", "Medium", "High"))

# Use relevel() to change reference category
donors$wealth_levels <- relevel(donors$wealth_levels, ref = "Medium")

# See how our factor coding impacts the model
summary(glm(donated ~ wealth_levels, data = donors, family = "binomial"))
```
### Handling missing data

Some of the prospective donors have missing age data. Unfortunately, R will exclude any cases with NA values when building a regression model.

One workaround is to replace, or impute, the missing values with an estimated value. After doing so, you may also create a missing data indicator to model the possibility that cases with missing data are different in some way from those without.
```{r}
# Find the average age among non-missing values
summary(donors$age)

# Impute missing age values with the mean age
donors$imputed_age <- ifelse(is.na(donors$age), round(mean(donors$age, na.rm = TRUE), 2), donors$age)

# Create missing value indicator for age
donors$missing_age <- ifelse(is.na(donors$age), 1, 0)
```
This is one way to handle missing data, but be careful! Sometimes missing data has to be dealt with using more complicated methods.

Why is it often useful to include this indicator as a predictor in the model?

- A missing value may represent a unique category by itself
- There may be an important difference between records with and without missing data
- Whatever caused the missing value may also be related to the outcome
press

### Building a more sophisticated model

One of the best predictors of future giving is a history of recent, frequent, and large gifts. In marketing terms, this is known as R/F/M:

* Recency
* Frequency
* Money

Donors that haven't given both recently and frequently may be especially likely to give again; in other words, the combined impact of recency and frequency may be greater than the sum of the separate effects.

Because these predictors together have a greater impact on the dependent variable, their joint effect must be modeled as an interaction.
```{r}
# Build a recency, frequency, and money (RFM) model
rfm_model <- glm(donated ~ money + recency * frequency, data= donors, family = "binomial")

# Summarize the RFM model to see how the parameters were coded
summary(rfm_model)

# Compute predicted probabilities for the RFM model
rfm_prob <- predict(rfm_model, type = "response")

# Plot the ROC curve and find AUC for the new model
library(pROC)
ROC <- roc(donors$donated, rfm_prob)
plot(ROC, col = "red")
auc(ROC)
```
Based on the ROC curve, you've confirmed that past giving patterns are certainly predictive of future giving.

## Automatic feature selection

### Stepwise regression

In spite of its utility for feature selection, stepwise regression is not frequently used in disciplines outside of machine learning due to some important caveats. 

* It is not guaranteed to find the best possible model
press
* The stepwise regression procedure violates some statistical assumptions
press
* It can result in a model that makes little sense in the real world
press

Though stepwise regression is frowned upon, it may still be useful for building predictive models in the absence of another starting place.

### Building a stepwise regression model

In the absence of subject-matter expertise, stepwise regression can assist with the search for the most important predictors of the outcome of interest.

```{r}
# Specify a null model with no predictors
null_model <- glm(donated ~ 1, data = donors, family = "binomial")

# Specify the full model using all of the potential predictors
full_model <- glm(donated ~ ., data = donors, family = "binomial")

# Use a forward stepwise algorithm to build a parsimonious model
step_model <- step(null_model, scope = list(lower = null_model, upper = full_model), direction = "forward")

# Estimate the stepwise donation probability
step_prob <- predict(step_model, type = "response")

# Plot the ROC of the stepwise model
library(pROC)
ROC <- roc(donors$donated, step_prob)
plot(ROC, col = "red")
auc(ROC)
```
Despite the caveats of stepwise regression, it seems to have resulted in a relatively strong model!

# Classification Trees

## Making decisions with trees

### Building a simple decision tree

The loans dataset contains 11,312 randomly-selected people who applied for and later received loans from Lending Club, a US-based peer-to-peer lending company.
```{r}
loans <- readRDS("loans.rds")
```
We will use a decision tree to try to learn patterns in the outcome of these loans (either repaid or default) based on the requested loan amount and credit score at the time of application.

Then, see how the tree's predictions differ for an applicant with good credit versus one with bad credit.
```{r}
good_credit <- head(loans %>%
  filter(home_ownership == "MORTGAGE"), 1) 
bad_credit <- head(loans %>%
  filter(home_ownership == "RENT", debt_to_income == "LOW", emp_length == "6 - 9 years"), 1) 
```
```{r}
# Load the rpart package
library(rpart)

# Build a lending model predicting loan outcome versus loan amount and credit score
loan_model <- rpart(outcome ~ loan_amount + credit_score, data = loans, method = "class", control = rpart.control(cp = 0))

# Make a prediction for someone with good credit
predict(loan_model, good_credit, type = "class")

# Make a prediction for someone with bad credit
predict(loan_model, bad_credit, type = "class")
```
### Visualizing classification trees

Due to government rules to prevent illegal discrimination, lenders are required to explain why a loan application was rejected.

The structure of classification trees can be depicted visually, which helps to understand how the tree makes its decisions. 

```{r}
# Examine the loan_model object
loan_model

# Load the rpart.plot package
library(rpart.plot)

# Plot the loan_model with default settings
rpart.plot(loan_model)

# Plot the loan_model with customized settings
rpart.plot(loan_model, type = 3, box.palette = c("red", "green"), fallen.leaves = TRUE)
```
Using the tree structure, you can clearly see how the tree makes its decisions. 

## Growing larger classification trees

A classification tree grows using a divide-and-conquer process. Each time the tree grows larger, it splits groups of data into smaller subgroups, creating new branches in the tree. Divide-and-conquer always looks to create the split resulting in the greatest improvement to purity.

### Creating random test datasets

Before building a more sophisticated lending model, it is important to hold out a portion of the loan data to simulate how well it will predict the outcomes of future loan applicants.

As depicted in the following image, you can use 75% of the observations for training and 25% for testing the model.

![](http://s3.amazonaws.com/assets.datacamp.com/production/course_2906/datasets/dtree_test_set.png)
The sample() function can be used to generate a random sample of rows to include in the training set. Simply supply it the total number of observations and the number needed for training.

Use the resulting vector of row IDs to subset the loans into training and testing datasets.
```{r}
# Determine the number of rows for training
nrow(loans)

# Create a random sample of row IDs
sample_rows <- sample(nrow(loans), nrow(loans)*0.75)

# Create the training dataset
loans_train <- loans[sample_rows,]

# Create the test dataset
loans_test <- loans[-sample_rows,]
```
Creating a test set is an easy way to check your model's performance.

### Building and evaluating a larger tree

Previously, we created a simple decision tree that used the applicant's credit score and requested loan amount to predict the loan outcome.

Lending Club has additional information about the applicants, such as home ownership status, length of employment, loan purpose, and past bankruptcies, that may be useful for making more accurate predictions.

Using all of the available applicant data, build a more sophisticated lending model using the random training dataset created previously. Then, use this model to make predictions on the testing dataset to estimate the performance of the model on future loan applications.
```{r}
# Grow a tree using all of the available applicant data
loan_model <- rpart(outcome ~ ., data = loans_train, method = "class", control = rpart.control(cp = 0))

# Make predictions on the test dataset
loans_test$pred <- predict(loan_model, loans_test, type = "class")

# Examine the confusion matrix
table(loans_test$outcome, loans_test$pred)

# Compute the accuracy on the test dataset
mean(loans_test$outcome == loans_test$pred)
```
How did adding more predictors change the model's performance?

Holding out test data reduces the amount of data available for growing the decision tree. In spite of this, it is very important to evaluate decision trees on data it has not seen before.

## Tending to classification trees

### Preventing overgrown trees

The tree grown on the full set of applicant data grew to be extremely large and extremely complex, with hundreds of splits and leaf nodes containing only a handful of applicants. This tree would be almost impossible for a loan officer to interpret.

Using the pre-pruning methods for early stopping, we can prevent a tree from growing too large and complex. See how the rpart control options for maximum tree depth and minimum split count impact the resulting tree.
```{r}
# Grow a tree with maxdepth of 6
loan_model <- rpart(outcome ~ ., data = loans_train, method = "class", control = rpart.control(cp = 0, maxdepth = 6))

# Make a class prediction on the test set
loans_test$pred <- predict(loan_model, loans_test, type = "class")

# Compute the accuracy of the simpler tree
mean(loans_test$outcome == loans_test$pred)
```
```{r}
# Swap maxdepth for a minimum split of 500 
loan_model <- rpart(outcome ~ ., data = loans_train, method = "class", control = rpart.control(cp = 0, minsplit = 500))

# Run this. How does the accuracy change?
loans_test$pred <- predict(loan_model, loans_test, type = "class")
mean(loans_test$pred == loans_test$outcome)
```
It may seem surprising, but creating a simpler decision tree may actually result in greater performance on the test dataset.

### Creating a nicely pruned tree

Stopping a tree from growing all the way can lead it to ignore some aspects of the data or miss important trends it may have discovered later.

By using post-pruning, you can intentionally grow a large and complex tree then prune it to be smaller and more efficient later on.
```{r}
# Grow an overly complex tree
loan_model <- rpart(outcome ~ ., data = loans_train, method = "class", control = rpart.control(cp = 0))

# Examine the complexity plot
plotcp(loan_model)

# Prune the tree
loan_model_pruned <- prune(loan_model, cp = 0.0014)

# Compute the accuracy of the pruned tree
loans_test$pred <- predict(loan_model_pruned, loans_test, type = "class")
mean(loans_test$pred == loans_test$outcome)
```
As with pre-pruning, creating a simpler tree actually improved the performance of the tree on the test dataset.

### Why do trees benefit from pruning?

Classification trees can grow indefinitely, until they are told to stop or run out of data to divide-and-conquer.

Just like trees in nature, classification trees that grow overly large can require pruning to reduce the excess growth. However, this generally results in a tree that classifies fewer training examples correctly.

Why, then, are pre-pruning and post-pruning almost always used?

- Simpler trees are easier to interpret
- Simpler trees using early stopping are faster to train
- Simpler trees may perform better on the testing data
press

## Random forest

Groups of classification trees can be combined into an ensemble that generates a single prediction by allowing the trees to "vote" on the outcome.

The diversity among the trees may lead it to discover more subtle patterns.
press. The teamwork-based approach of the random forest may help it find important trends a single tree may miss.

### Building a random forest model

In spite of the fact that a forest can contain hundreds of trees, growing a decision tree forest is perhaps even easier than creating a single highly-tuned tree.

Using the randomForest package, build a random forest and see how it compares to the single trees you built previously.

Keep in mind that due to the random nature of the forest, the results may vary slightly each time you create the forest.

```{r}
# Load the randomForest package
library(randomForest)

# Build a random forest model
loan_model <- randomForest(outcome ~ ., data = loans_train)

# Compute the accuracy of the random forest
loans_test$pred <- predict(loan_model, loans_test)
mean(loans_test$pred == loans_test$outcome)
```

