1. What is Machine Learning?

library(ISLR)
## Warning: package 'ISLR' was built under R version 3.3.3
str(Wage)
## 'data.frame':    3000 obs. of  12 variables:
##  $ year      : int  2006 2004 2003 2003 2005 2008 2009 2008 2006 2004 ...
##  $ age       : int  18 24 45 43 50 54 44 30 41 52 ...
##  $ sex       : Factor w/ 2 levels "1. Male","2. Female": 1 1 1 1 1 1 1 1 1 1 ...
##  $ maritl    : Factor w/ 5 levels "1. Never Married",..: 1 1 2 2 4 2 2 1 1 2 ...
##  $ race      : Factor w/ 4 levels "1. White","2. Black",..: 1 1 1 3 1 1 4 3 2 1 ...
##  $ education : Factor w/ 5 levels "1. < HS Grad",..: 1 4 3 4 2 4 3 3 3 2 ...
##  $ region    : Factor w/ 9 levels "1. New England",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ jobclass  : Factor w/ 2 levels "1. Industrial",..: 1 2 1 2 2 2 1 2 2 2 ...
##  $ health    : Factor w/ 2 levels "1. <=Good","2. >=Very Good": 1 2 1 2 1 2 2 1 2 2 ...
##  $ health_ins: Factor w/ 2 levels "1. Yes","2. No": 2 2 1 1 1 1 1 1 1 1 ...
##  $ logwage   : num  4.32 4.26 4.88 5.04 4.32 ...
##  $ wage      : num  75 70.5 131 154.7 75 ...

Basic prediction model

# Build Linear Model: lm_wage
lm_wage <- lm(wage ~ age, data = Wage)
unseen <- data.frame(age = 60)
# Predict the wage for a 60-year old worker
predict(lm_wage, unseen)
##        1 
## 124.1413
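The prediction is just the fitted intercept plus the age coefficient times 60; one quick sanity check, using only the model's fitted coefficients:

coef(lm_wage)                  # fitted intercept and slope for age
sum(coef(lm_wage) * c(1, 60))  # manual prediction for age = 60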

Common ML Problems

  • Classification
  • Regression
  • Clustering

Classification

  1. Predict the category of a new observation

    Earlier observations -> (Estimate) -> Classifier
    Unseen data -> (Classifier) -> Class
  2. Classification applications (see the spam-filter example below)
  • Medical diagnosis: sick vs. not sick
  • Animal recognition: dog, cat, or horse
  3. Qualitative output, predefined classes

Regression

  1. Predictors -> (Regression Function) -> Response

    Relationship: height and weight? Linear? Predict: weight -> height
  2. Regression applications (see the LinkedIn example below)
  • Payments -> credit scores
  • Time -> subscriptions
  • Grades -> landing a job
  3. Quantitative output, learned from previous input-output observations

Clustering

  1. Grouping objects into clusters
  • Similar within a cluster
  • Dissimilar between clusters
  2. Example: grouping similar animal photos (see the iris example below)
  • No labels
  • No right or wrong answer
  • Many possible clusterings

Classification: filtering spam

emails <- data.frame(
  avg_capital_seq = c(1.000, 2.112, 4.123, 1.863, 2.973, 1.687, 5.891,
                      3.167, 1.230, 2.441, 3.555, 3.250, 1.333),
  spam = c(0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1)
)

# Hard-coded classifier: the thresholds below are hand-tuned to these
# 13 emails, so the rules are deliberately overfit to this tiny dataset
spam_classifier <- function(x){
  prediction <- rep(NA, length(x)) # initialize the prediction vector
  prediction[x > 4] <- 1
  prediction[x >= 3 & x <= 4] <- 0
  prediction[x >= 2.2 & x < 3] <- 1
  prediction[x >= 1.4 & x < 2.2] <- 0
  prediction[x > 1.25 & x < 1.4] <- 1
  prediction[x <= 1.25] <- 0
  return(prediction) # prediction is either 0 (no spam) or 1 (spam)
}

# Apply the classifier to the avg_capital_seq column: spam_pred
spam_pred <- spam_classifier(emails$avg_capital_seq)

# Compare spam_pred to emails$spam. Use ==
emails$spam == spam_pred
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
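Every element-wise comparison is TRUE, so the rules reproduce all 13 labels. Collapsing the comparison into a single accuracy number makes the (over)fit explicit:

mean(spam_classifier(emails$avg_capital_seq) == emails$spam)
## [1] 1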

Regression: LinkedIn views for the next 3 days

linkedin <- c(5,7,4,9,11,10,14,17,13,11,18,17,21,21,24,23,28,35,21,27,23)
days <- 1:21  # day index (1:21 is the idiomatic form of seq(1:21))

# Fit a linear model of LinkedIn views per day: linkedin_lm
linkedin_lm <- lm(linkedin ~ days)

# Predict the number of views for the next three days: linkedin_pred
future_days <- data.frame(days = 22:24)
linkedin_pred <- predict(linkedin_lm, future_days)

# Plot historical data and predictions
plot(linkedin ~ days, xlim = c(1, 24))
points(22:24, linkedin_pred, col = "green")
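To show the fitted trend alongside the raw points, the regression line can be overlaid with base R's abline():

abline(linkedin_lm, col = "blue")  # overlay the fitted regression line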

Clustering: separating the iris species

This technique tries to group your objects. It does this without any prior knowledge of what these groups could or should look like. For clustering, the concepts of prior knowledge and unseen observations are less meaningful than for classification and regression.

set.seed(1)

my_iris <- iris[-5]       # drop the Species column: clustering is unsupervised
species <- iris$Species   # keep the true labels to compare against afterwards

# Perform k-means clustering on my_iris: kmeans_iris
kmeans_iris <- kmeans(my_iris, 3)

# Compare the actual Species to the clustering using table()
table(species, kmeans_iris$cluster)
##             
## species       1  2  3
##   setosa     50  0  0
##   versicolor  0  2 48
##   virginica   0 36 14
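Setosa lands cleanly in its own cluster, while versicolor and virginica are partly mixed between the other two.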
# Plot Petal.Width against Petal.Length, coloring by cluster
plot(Petal.Length ~ Petal.Width, data = my_iris, col = kmeans_iris$cluster)
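kmeans() starts from randomly chosen centroids, so the clusters can differ between runs even with the same data and k. A common safeguard, not part of the original exercise, is to try several random starts and keep the best solution via the nstart argument:

kmeans_iris <- kmeans(my_iris, centers = 3, nstart = 20)  # 20 random restarts
table(species, kmeans_iris$cluster)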

Getting practical with supervised learning

library(rpart)
## Warning: package 'rpart' was built under R version 3.3.3
# Fit a classification tree predicting Species from all four measurements
tree <- rpart(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
              data = iris, method = "class")

# A data frame containing unseen observations
unseen <- data.frame(Sepal.Length = c(5.3, 7.2),
                     Sepal.Width = c(2.9, 3.9),
                     Petal.Length = c(1.7, 5.4),
                     Petal.Width = c(0.8, 2.3))

# Predict the label of the unseen observations. Print out the result.
predict(tree, unseen, type="class")
##         1         2 
##    setosa virginica 
## Levels: setosa versicolor virginica
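To inspect which splits the tree learned, it can be drawn with rpart's base plotting methods (a minimal sketch; packages such as rpart.plot give prettier output):

plot(tree, margin = 0.1)   # draw the tree skeleton
text(tree, use.n = TRUE)   # label the splits and leaf counts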

How to do unsupervised learning

set.seed(1)
str(cars)
## 'data.frame':    50 obs. of  2 variables:
##  $ speed: num  4 4 7 7 8 9 10 10 10 11 ...
##  $ dist : num  2 10 4 22 16 10 18 26 34 17 ...
summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00
# Group the dataset into two clusters: km_cars
km_cars <- kmeans(cars, 2)
km_cars$cluster
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 2 1 1 1 1 2 1 2 2 2
## [36] 1 1 2 1 1 2 2 2 2 2 2 2 2 2 2
plot(cars, col = km_cars$cluster)
km_cars$centers
##      speed     dist
## 1 12.84375 27.15625
## 2 19.94444 71.11111
points(km_cars$centers, pch = 22, bg = c(1, 2), cex = 2)

A cluster's centroid is the mean of the observations assigned to it; k-means alternates between assigning each point to its nearest centroid and recomputing the centroids until the assignments stabilize.
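The object returned by kmeans() also reports how compact each cluster is, which is the quantity the algorithm tries to minimize:

km_cars$withinss      # within-cluster sum of squares, one value per cluster
km_cars$tot.withinss  # their total, minimized by the algorithm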