Heavily borrowed from:

- Textbook: *An Introduction to Statistical Learning*
- Textbook: *The Elements of Statistical Learning*
- UCLA example

`require(knitr)`

`## Loading required package: knitr`

Linear regression assumes that the response variable *Y* is quantitative. However, in many situations the response variable is *qualitative*. We will generally refer to such variables as **categorical** variables. For example, eye color is categorical since it takes values like *brown*, *blue*, and *green*. Classification involves assigning each observation to one of these categories, or classes. Usually, we predict the probability that an observation belongs to each class.
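To make that last point concrete, here is a minimal sketch of how predicted probabilities can be turned into class labels. The probabilities and the 0.5 cutoff are made-up values for illustration, not output from any model in this notebook.

```r
# Hypothetical predicted probabilities of belonging to the "Yes" class
prob_yes <- c(0.02, 0.40, 0.51, 0.97)

# Assumed cutoff; 0.5 is a common starting point, not a rule
cutoff <- 0.5

# Assign each observation to "Yes" if its probability exceeds the cutoff, otherwise "No"
predicted_class <- factor(ifelse(prob_yes > cutoff, "Yes", "No"),
                          levels = c("No", "Yes"))
predicted_class
```

Lowering the cutoff would label more observations as "Yes", which can matter when one class is rare.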

There are many classification techniques, or *classifiers*, that can be used to predict a qualitative response variable. Examples covered in this notebook include:

- Logistic Regression
- Linear Discriminant Analysis
- K-nearest neighbors

Later notebooks will include more complicated classifiers such as:

- Generalized additive models
- Tree methods
- Random forests
- Support Vector Machines

Just as in linear regression, in classification we have a set of training observations that we use to build a classifier, and we evaluate the model on held-out test data to estimate the *out-of-sample error*. In this notebook we will use a dataset of credit card information as model inputs to predict whether an individual will default on their credit card payment.
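The training/test idea can be sketched in a few lines of base R. The helper name `split_train_test`, the 80/20 split, and the seed are assumptions made for illustration, and the toy data frame stands in for a real dataset.

```r
# Split a data frame into training and test sets.
# `train_frac` and `seed` are illustrative parameters, not values from this notebook.
split_train_test <- function(df, train_frac = 0.8, seed = 1) {
  set.seed(seed)                                        # make the split reproducible
  n_train   <- floor(train_frac * nrow(df))
  train_idx <- sample(nrow(df), size = n_train)         # row indices, sampled without replacement
  test_idx  <- setdiff(seq_len(nrow(df)), train_idx)    # every row not used for training
  list(train = df[train_idx, , drop = FALSE],
       test  = df[test_idx, , drop = FALSE])
}

# Toy data standing in for a real dataset
toy   <- data.frame(x = 1:10, y = factor(rep(c("No", "Yes"), 5)))
parts <- split_train_test(toy)
nrow(parts$train)
nrow(parts$test)
```

With the data used in this notebook, the equivalent call would be `split_train_test(Default)`. The classifier is then fit on `parts$train` only, and its error is measured on `parts$test`.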

```
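# Load the textbook R package (library() stops with an error if it is not installed)
library(ISLR)
# Load in the credit data; attach() puts the columns of Default on the search
# path so they can be referenced by name below (e.g. default, balance)
attach(Default)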
```

```
# Let's take a look at the data
str(Default)
```

```
## 'data.frame': 10000 obs. of 4 variables:
## $ default: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ student: Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 2 1 1 ...
## $ balance: num 730 817 1074 529 786 ...
## $ income : num 44362 12106 31767 35704 38463 ...
```

```
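# How many people actually default?
# Note: this is the number of defaulters per 100 non-defaulters (Yes/No),
# which is slightly larger than the share of all individuals who default
tmp <- table(default)
(tmp[[2]]/tmp[[1]])*100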
```

`## [1] 3.445`
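Dividing the "Yes" count by the "No" count gives defaulters per 100 non-defaulters, which is not quite the same as the percentage of all individuals who default. A minimal sketch of the difference, using made-up labels rather than the `Default` data:

```r
# Toy labels standing in for the default column (hypothetical counts, not the real data)
labels <- factor(c(rep("No", 95), rep("Yes", 5)), levels = c("No", "Yes"))

# Share of all observations labeled "Yes", as a percentage
rate <- mean(labels == "Yes") * 100

# "Yes" observations per 100 "No" observations (the Yes/No ratio)
ratio <- sum(labels == "Yes") / sum(labels == "No") * 100

c(rate = rate, ratio = ratio)
```

When the "Yes" class is rare the two numbers are close, so either supports the same conclusion here. Applied to the notebook's data, `mean(default == "Yes") * 100` would give the default rate directly.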

We can see that these data have 10000 observations of 4 variables, and that only about 3% of people actually default. Let's create a few diagnostic plots to get a sense of the data. Remember, the goal here will be to predict whether someone will default on their credit card payment, based on the variables `student`, `balance`, and `income`.

`library(ggplot2); library(gridExtra)`

`## Loading required package: grid`

```
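# Scatter plot of income against balance, colored and shaped by default status
# (ggplot() is used because qplot() is deprecated in recent ggplot2 releases)
x <- ggplot(Default, aes(x=balance, y=income, color=default, shape=default)) + geom_point() + scale_shape(solid=FALSE)
# Box plots of balance and income by default status; the fill legend is redundant with the x axis
y <- ggplot(Default, aes(x=default, y=balance, fill=default)) + geom_boxplot() + guides(fill="none")
z <- ggplot(Default, aes(x=default, y=income, fill=default)) + geom_boxplot() + guides(fill="none")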
# Plot
x
```