
#### Data Sources:

Heavily borrowed from:

``require(knitr)``
``## Loading required package: knitr``

## 1.0 Overview

Linear regression (tutorial here) assumes that the response variable Y is quantitative. In many situations, however, the response is qualitative. We refer to such variables as categorical variables: eye color, for example, is categorical because it takes values like brown, blue, and green. Classification is the task of assigning each observation to one of these categories, or classes. Usually, we do this by predicting the probability that an observation belongs to each class.
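To make the last point concrete, here is a minimal sketch of turning predicted probabilities into class labels. The probabilities are made-up illustrative values rather than output from any model, and the 0.5 cutoff is an assumption, not a rule.

```r
# Hypothetical predicted probabilities of belonging to class "Yes"
# (illustrative values only, not model output)
prob_yes <- c(0.02, 0.10, 0.65, 0.48, 0.91)

# Assumed cutoff: assign "Yes" when the predicted probability exceeds 0.5
cutoff <- 0.5
predicted_class <- factor(ifelse(prob_yes > cutoff, "Yes", "No"),
                          levels = c("No", "Yes"))

# Predicted labels and how many observations fall in each class
predicted_class
table(predicted_class)
```

A different cutoff trades one kind of error for another, so 0.5 is only a convenient starting point.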

There are many classification techniques, or classifiers, that can be used to predict a qualitative response variable. Examples covered in this notebook include:

• Logistic Regression
• Linear Discriminant Analysis
• K-nearest neighbors

Later notebooks (link here) will include more complicated classifiers such as:

• Tree methods
• Random forests
• Support Vector Machines

Just as in linear regression, classification starts from a set of training observations that we use to build a classifier. We then evaluate the model on held-out test data to estimate its out-of-sample error. In this notebook we will use a dataset of credit card information as model inputs to predict whether an individual will default on their credit card payment.
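The training/test idea can be sketched in base R as below. The helper name `split_train_test`, the 80/20 proportion, and the seed are all assumptions made for illustration, and the data frame is simulated so the snippet runs without any extra packages.

```r
# Split a data frame into training and test sets (base R only).
# `prop_train` and `seed` are illustrative defaults, not values from this notebook.
split_train_test <- function(df, prop_train = 0.8, seed = 1) {
  set.seed(seed)                                   # make the random split reproducible
  n <- nrow(df)
  train_idx <- sample(n, size = floor(prop_train * n))
  list(train = df[train_idx, , drop = FALSE],      # rows used to fit the classifier
       test  = df[-train_idx, , drop = FALSE])     # held-out rows used to estimate error
}

# Simulated stand-in data (not the real credit data)
set.seed(42)
toy <- data.frame(
  balance = runif(100, 0, 2500),
  default = factor(sample(c("No", "Yes"), 100, replace = TRUE, prob = c(0.9, 0.1)),
                   levels = c("No", "Yes"))
)

parts <- split_train_test(toy)
nrow(parts$train)
nrow(parts$test)
```

The same helper could be applied to any data frame, including the credit data loaded in the next step.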

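``````# Load the textbook R package
library(ISLR)
# Load in the credit data; attach() makes its columns (default, student,
# balance, income) available by name in the code below
attach(Default)``````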
``````# Let's take a look at the data
str(Default)``````
``````## 'data.frame':    10000 obs. of  4 variables:
##  \$ default: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  \$ student: Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 2 1 1 ...
##  \$ balance: num  730 817 1074 529 786 ...
##  \$ income : num  44362 12106 31767 35704 38463 ...``````
``````# How many people actually default?
tmp <- table(default)
# Ratio of defaulters (level 2, "Yes") to non-defaulters (level 1, "No"), as a percentage
(tmp[[2]]/tmp[[1]])*100``````
``## [1] 3.445``

We can see that these data have 10000 observations of 4 variables, and that only about 3% of people actually default. Let’s create a few diagnostic plots to get a sense of the data. Remember, the goal here will be to predict whether someone will default on their credit card payment, based on the variables `student`, `balance` and `income`.
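Because so few people default, it helps to look at the class shares directly. The sketch below hard-codes the class counts (9667 "No", 333 "Yes") on the assumption that they match `table(default)`; they reproduce the 3.445 ratio printed above. With the data loaded, `table(Default$default)` could be used instead.

```r
# Assumed class counts for `default` (consistent with the 3.445 ratio above)
counts <- c(No = 9667, Yes = 333)

# Share of each class as a percentage of all observations
pct <- 100 * counts / sum(counts)
round(pct, 2)

# Accuracy of a naive rule that predicts "No" for everyone
baseline_accuracy <- max(counts) / sum(counts)
baseline_accuracy
```

Any classifier built on these data should be judged against that naive baseline, since predicting "No" for everyone is already right most of the time.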

``library(ggplot2); library(gridExtra)``
``## Loading required package: grid``
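``````# Scatter plot of income against balance, colored and shaped by default status
x <- qplot(x=balance, y=income, color=default, shape=default, geom='point')+scale_shape(solid=FALSE)
# Boxplots of balance and income by default status, with the redundant legend hidden
y <- qplot(x=default, y=balance, fill=default, geom='boxplot')+guides(fill="none")
z <- qplot(x=default, y=income, fill=default, geom='boxplot')+guides(fill="none")
# Show the scatter plot
x``````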