Data Sources:

Heavily borrowed from:

require(knitr)
## Loading required package: knitr

1.0 Overview

Linear regression assumes that the response variable Y is quantitative. However, in many situations the response variable is qualitative. We will refer to such variables as categorical variables. For example, eye color is categorical since it takes values like brown, blue, and green. Classification involves assigning an observation to one of the classes of a categorical response. Usually, we first predict the probability that an observation belongs to each class.
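To make the step from probabilities to classes concrete, here is a minimal sketch. The probabilities are made-up illustrative values, and the 0.5 cutoff is an assumption rather than anything fixed by the method.

```r
# Hypothetical predicted probabilities of class "Yes" for five observations
# (illustrative values only, not produced by any fitted model)
prob_yes <- c(0.02, 0.10, 0.65, 0.48, 0.91)

# Assign each observation to "Yes" when its probability exceeds the
# assumed cutoff of 0.5, and to "No" otherwise
predicted_class <- factor(ifelse(prob_yes > 0.5, "Yes", "No"),
                          levels = c("No", "Yes"))

# Count how many observations fall in each predicted class
table(predicted_class)
```

A different cutoff could be used when one type of error is more costly than the other.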

There are many classification techniques, or classifiers, that could be used to predict a given qualitative response variable. Examples covered in this notebook include:

Later notebooks will include more complicated classifiers such as:

Just as in linear regression, in classification we have a set of training observations that we use to build a classifier, and we evaluate the model on held-out test data to estimate the out-of-sample error. In this notebook we will use a dataset of credit card information as model inputs to predict whether an individual will default on their credit card payment.
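The training/test idea can be sketched as follows. This is one possible way to split the data, not necessarily the split used later in this notebook: the 80/20 ratio, the seed, and the names `train_idx`, `train`, and `test` are all assumptions made for illustration.

```r
# Sketch of a random train/test split (80/20 ratio and seed are assumptions)
library(ISLR)        # provides the Default credit card data set

set.seed(1)          # make the random split reproducible
n <- nrow(Default)

# Randomly choose 80% of the row indices for the training set
train_idx <- sample(n, size = floor(0.8 * n))

train <- Default[train_idx, ]    # used to build the classifier
test  <- Default[-train_idx, ]   # held out to estimate out-of-sample error

# Number of observations in each part
c(train = nrow(train), test = nrow(test))
```

Because only about 3% of people default, a purely random split can leave very few defaulters in the test set, so checking `table(test$default)` after splitting is a sensible precaution.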

# Load the textbook R package
require(ISLR)
# Load in the credit data
attach(Default)
# Let's take a look at the data
str(Default)
## 'data.frame':    10000 obs. of  4 variables:
##  $ default: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ student: Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 2 1 1 ...
##  $ balance: num  730 817 1074 529 786 ...
##  $ income : num  44362 12106 31767 35704 38463 ...
# How many people actually default? (defaulters as a percentage of non-defaulters)
tmp <- table(default)
(tmp[[2]]/tmp[[1]])*100
## [1] 3.445

We can see that these data have 10000 observations of 4 variables, and that only about 3% of people actually default. Let’s create a few diagnostic plots to get a sense of the data. Remember, the goal here will be to predict whether someone will default on their credit card payment, based on the variables student, balance and income.

library(ggplot2); library(gridExtra)
## Loading required package: grid
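# Scatter plot of income against balance, colored and shaped by default status
x <- qplot(x=balance, y=income, color=default, shape=default, geom='point')+scale_shape(solid=FALSE)
# Boxplots of balance and income by default status (legend hidden; fill=FALSE is deprecated)
y <- qplot(x=default, y=balance, fill=default, geom='boxplot')+guides(fill="none")
z <- qplot(x=default, y=income, fill=default, geom='boxplot')+guides(fill="none")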
# Plot
x