HarvardX - PH125.8x | Machine Learning - Data Science Section 2: Machine Learning Basics / Machine Learning Basics Overview

Ezio Filho

01/07/2021

Overview of this Section:

Starting to use the caret package.
Constructing and interpreting a confusion matrix.
Using conditional probabilities in the context of machine learning.

2.1: Basics of Evaluating Machine Learning Algorithms

Comprehension Check: Basics of Evaluating Machine Learning Algorithms

Q1

For each of the following, indicate whether the outcome is continuous or categorical.

Digit reader
categorical

Height
continuous

Spam filter
categorical

Stock prices
continuous

Sex
categorical

Q2

How many features are available to us for prediction in the mnist digits dataset? You can download the mnist dataset using the read_mnist() function from the dslabs package.

library(dslabs)
mnist <- read_mnist()
ncol(mnist$train$images)

## [1] 784

Comprehension Check: Practice with Machine Learning, Part 1

The following questions all ask you to work with the dataset described below.

The reported_heights and heights datasets were collected from three classes taught in the Departments of Computer Science and Biostatistics, as well as remotely through the Extension School. The Biostatistics class was taught in 2016 along with an online version offered by the Extension School. On 2016-01-25 at 8:15 AM, during one of the lectures, the instructors asked student to fill in the sex and height questionnaire that populated the reported_heights dataset. The online students filled out the survey during the next few days, after the lecture was posted online. We can use this insight to define a variable which we will call type, to denote the type of student, inclass or online.

The code below sets up the dataset for you to analyze in the following exercises:

library(dslabs)
library(dplyr)
library(lubridate)
data(reported_heights)

dat <- mutate(reported_heights, date_time = ymd_hms(time_stamp)) %>%
  filter(date_time >= make_date(2016, 01, 25) & date_time < make_date(2016, 02, 1)) %>%
  mutate(type = ifelse(day(date_time) == 25 & hour(date_time) == 8 & between(minute(date_time), 15, 30), "inclass","online")) %>%
  select(sex, type)

y <- factor(dat$sex, c("Female", "Male"))
x <- dat$type

Q1

The type column of dat indicates whether students took classes in person (“inclass”) or online (“online”). What proportion of the inclass group is female? What proportion of the online group is female?

Enter your answer as a percentage or decimal (eg “50%” or “0.50”) to at least the hundredths place.

In class

(dat %>% 
  group_by(type) %>% 
  summarize(propf = mean(sex == "Female")))[1,] #propf = proportion of females

## # A tibble: 1 x 2
##   type    propf
##   <chr>   <dbl>
## 1 inclass 0.667

Online

(dat %>% 
  group_by(type) %>% 
  summarize(propf = mean(sex == "Female")))[2,] #propf = proportion of females

## # A tibble: 1 x 2
##   type   propf
##   <chr>  <dbl>
## 1 online 0.378

Q2

In the course videos, height cutoffs were used to predict sex. Instead of height, use the type variable to predict sex. Assume that for each class type the students are either all male or all female, based on the most prevalent sex in each class type you calculated in Q1. Report the accuracy of your prediction of sex based on type. You do not need to split the data into training and test sets.

Enter your accuracy as a percentage or decimal (eg “50%” or “0.50”) to at least the hundredths place.

# Based on Question 1, the most prevalent sex in "inclass" is Female, and "Online" is Male. 

 ifelse(x == "inclass", "Female", "Male") %>% 
    factor(levels = levels(y)) -> y_hat 

mean(y_hat==y)

## [1] 0.6333333

Q3

Write a line of code using the table() function to show the confusion matrix between y_hat and y. Use the exact format function(a, b) for your answer and do not name the columns and rows. Your answer should have exactly one space. Enter the line of code below.

table(y_hat, y)

##         y
## y_hat    Female Male
##   Female     26   13
##   Male       42   69

Q4

What is the sensitivity of this prediction? You can use the sensitivity() function from the caret package. Enter your answer as a percentage or decimal (eg “50%” or “0.50”) to at least the hundredths place.

library(caret)
sensitivity(y_hat, y)

## [1] 0.3823529

Q5

What is the specificity of this prediction? You can use the specificity() function from the caret package. Enter your answer as a percentage or decimal (eg “50%” or “0.50”) to at least the hundredths place.

# no need to load caret again
specificity(y_hat, y)

## [1] 0.8414634

Q6

What is the prevalence (% of females) in the dat dataset defined above? Enter your answer as a percentage or decimal (eg “50%” or “0.50”) to at least the hundredths place.

mean(y == "Female")

## [1] 0.4533333

Comprehension Check: Practice with Machine Learning, Part 2

We will practice building a machine learning algorithm using a new dataset, iris, that provides multiple predictors for us to use to train. To start, we will remove the setosa species and we will focus on the versicolor and virginica iris species using the following code:

library(caret)
data(iris)
iris <- iris[-which(iris$Species=='setosa'),]
y <- iris$Species

The following questions all involve work with this dataset.

Q7

First let us create an even split of the data into train and test partitions using createDataPartition() from the caret package. The code with a missing line is given below:

## set.seed(2) # if using R 3.5 or earlier
#set.seed(2, sample.kind="Rounding") # if using R 3.6 or later
## line of code
#test <- iris[test_index,]
#train <- iris[-test_index,]

Which code should be used in place of # line of code above?

A. test_index <- createDataPartition(y,times=1,p=0.5)

B. test_index <- sample(2,length(y),replace=FALSE)

C. test_index <- createDataPartition(y,times=1,p=0.5,list=FALSE)

D. test_index <- rep(1,length(y))

Note: for this question, you may ignore any warning message generated by the code. If you have R 3.6 or later, you should always use the sample.kind argument in set.seed for this course.

Q8

Next we will figure out the singular feature in the dataset that yields the greatest overall accuracy when predicting species. You can use the code from the introduction and from Q7 to start your analysis.

Using only the train iris dataset, for each feature, perform a simple search to find the cutoff that produces the highest accuracy, predicting virginica if greater than the cutoff and versicolor otherwise. Use the seq function over the range of each feature by intervals of 0.1 for this search.

Which feature produces the highest accuracy?

A. Sepal.Length

B. Sepal.Width

C. Petal.Length

D. Petal.Width

How to answer this question?

#First run the code from the last question completing the "line of code' line
# set.seed(2) # if using R 3.5 or earlier
set.seed(2, sample.kind="Rounding") # if using R 3.6 or later
test_index <- createDataPartition(y,times=1,p=0.5,list=FALSE)
test <- iris[test_index,]
train <- iris[-test_index,]

#Second let's find which one gives the highest accuracy

f <- function(x){
    rv <- seq(min(x), max(x), by=0.1) #rv = ranged values
    sapply(rv,
           function(i){
        y_hat <- ifelse(x > i,'virginica','versicolor')
        mean(y_hat == train$Species)} #here we can find the accuracy 
)}

predictions <- apply(train[,-5],MARGIN = 2, FUN = f)


sapply(predictions,max)

## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##         0.70         0.62         0.96         0.94

We can see that Petal.Lenght has the highest accuracy.

Q9

For the feature selected in Q8, use the smart cutoff value from the training data to calculate overall accuracy in the test data. What is the overall accuracy?

predictions <- f(train[,3]) #f is previously created function
rv <- seq(min(train[,3]),max(train[,3]),by=0.1) #rv = ranged values
cutoffs <-rv[which(predictions==max(predictions))]

y_hat <- ifelse(test[,3]>cutoffs[1],'virginica','versicolor')
mean(y_hat==test$Species)

## [1] 0.9

Q10

Notice that we had an overall accuracy greater than 96% in the training data, but the overall accuracy was lower in the test data. This can happen often if we overtrain. In fact, it could be the case that a single feature is not the best choice. For example, a combination of features might be optimal. Using a single feature and optimizing the cutoff as we did on our training data can lead to overfitting.

Given that we know the test data, we can treat it like we did our training data to see if the same feature with a different cutoff will optimize our predictions. Repeat the analysis in Q8 but this time using the test data instead of the training data.

Which feature best optimizes our overall accuracy when using the test set?

A. Sepal.Length

B. Sepal.Width

C. Petal.Length

D. Petal.Width

predictions <- apply(test[,-5], MARGIN = 2, FUN = f) #f is the previously created function
sapply(predictions,max)

## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##         0.78         0.64         0.90         0.94

Q11

Now we will perform some exploratory data analysis on the data.

plot(iris,pch=21,bg=iris$Species)

Notice that Petal.Length and Petal.Width in combination could potentially be more information than either feature alone.

Optimize the the cutoffs for Petal.Length and Petal.Width separately in the train dataset by using the seq function with increments of 0.1. Then, report the overall accuracy when applied to the test dataset by creating a rule that predicts virginica if Petal.Length is greater than the length cutoff OR Petal.Width is greater than the width cutoff, and versicolor otherwise.

What is the overall accuracy for the test data now?

petalLR <- seq(min(train$Petal.Length),max(train$Petal.Length),by=0.1) #PetalLR = Petal Length Range
petalWR <- seq(min(train$Petal.Width),max(train$Petal.Width),by=0.1) #PetalWR = Petal Width Range

length_predictions <- sapply(petalLR,function(i){
        y_hat <- ifelse(train$Petal.Length>i,'virginica','versicolor')
        mean(y_hat==train$Species)
    })

length_cutoff <- petalLR[which.max(length_predictions)] # 4.7

width_predictions <- sapply(petalWR,function(i){
        y_hat <- ifelse(train$Petal.Width>i,'virginica','versicolor')
        mean(y_hat==train$Species)
    })

width_cutoff <- petalWR[which.max(width_predictions)] # 1.5

y_hat <- ifelse(test$Petal.Length>length_cutoff | test$Petal.Width>width_cutoff,'virginica','versicolor')
mean(y_hat==test$Species)

## [1] 0.88

2.2: Conditional Probabilities

Comprehension Check: Conditional Probabilities Part 1

Q1

In a previous module, we covered Bayes’ theorem and the Bayesian paradigm. Conditional probabilities are a fundamental part of this previous covered rule.

\[ P(A \mid B)=P(B \mid A) \frac{P(A)}{P(B)} \] We first review a simple example to go over conditional probabilities.

Assume a patient comes into the doctor’s office to test whether they have a particular disease.

The test is positive \(85 \%\) of the time when tested on a patient with the disease (high sensitivity): \(P( \text{test}+\mid \text{disease} )=0.85\)
The test is negative \(90 \%\) of the time when tested on a healthy patient (high specificity): \(P(\text{test}-\mid \text{heathy})=0.90\)
The disease is prevalent in about \(2 \%\) of the community: \(P(\text{disease})=0.02\)

This answer is pure probability, here is the explanation:

\[ \begin{gathered} P(\text { disease } \mid \text { test }+)=P(\text { test }+\mid \text { disease }) \times \frac{P(\text { disease })}{P(\text { test }+)} \\ =\frac{P(\text { test }+\mid \text { disease }) P(\text { disease })}{P(\text { test }+\mid \text { disease }) P(\text { disease })+P(\text { test }+\mid \text { healthy }) P(\text { healthy })} \\ =\frac{0.85 \times 0.02}{0.85 \times 0.02+0.1 \times 0.98}=0.1478261 \end{gathered} \]

The following 4 questions (Q2-Q5) all relate to implementing this calculation using \(\mathrm{R}\).

We have a hypothetical population of 1 million individuals with the following conditional probabilities as described below:

The test is positive \(85 \%\) of the time when tested on a patient with the disease (high sensitivity): \(P(\text{test}+\mid\text{disease})=0.85\)
The test is negative \(90 \%\) of the time when tested on a healthy patient (high specificity): \(P(\text{test}-\mid\text{heathy})=0.90\)

\(-\) The disease is prevalent in about \(2 \%\) of the community: \(P(\text{disease})=0.02\)

Here is some sample code to get you started:

# set.seed(1) # if using R 3.5 or earlier
set.seed(1, sample.kind = "Rounding") # if using R 3.6 or later
disease <- sample(c(0,1), size=1e6, replace=TRUE, prob=c(0.98,0.02))
test <- rep(NA, 1e6)
test[disease==0] <- sample(c(0,1), size=sum(disease==0), replace=TRUE, prob=c(0.90,0.10))
test[disease==1] <- sample(c(0,1), size=sum(disease==1), replace=TRUE, prob=c(0.15, 0.85))

Q2

What is the probability that a test is positive?

mean(test)

## [1] 0.114509

Q3

What is the probability that an individual has the disease if the test is negative?

mean(disease[test==0])

## [1] 0.003461356

Q4

What is the probability that you have the disease if the test is positive?

Remember: calculate the conditional probability the disease is positive assuming a positive test.

mean(disease[test==1]==1) #  probability of having the disease given a positive test

## [1] 0.1471762

Q5

Compare the prevalence of disease in people who test positive to the overall prevalence of disease.

If a patient’s test is positive, by how many times does that increase their risk of having the disease?

First calculate the probability of having the disease given a positive test, then divide by the probability of having the disease.

mean(disease[test==1]==1)/mean(disease==1)

## [1] 7.389106

Comprehension Check: Conditional Probabilities Part 2

Q6

We are now going to write code to compute conditional probabilities for being male in the heights dataset. Round the heights to the closest inch.

Plot the estimated conditional probability \(P(x)=\operatorname{Pr}(\text{Male}\mid\text{height=x})\) for each \(x\). Part of the code is provided here:

##library(dslabs)
#data("heights")
# MISSING CODE
    #qplot(height, p, data =.)

Which of the following blocks of code can be used to replace # MISSING CODE to make the correct plot?

A .heights %>%
group_by(height) %>%
summarize(p = mean(sex == “Male”)) %>%

B. heights %>%
mutate(height = round(height)) %>%
group_by(height) %>%
summarize(p = mean(sex == “Female”)) %>%

C. heights %>%
mutate(height = round(height)) %>%
summarize(p = mean(sex == “Male”)) %>%

D. heights %>%
mutate(height = round(height)) %>%
group_by(height) %>%
summarize(p = mean(sex == “Male”)) %>%

Now making the plot:

#library(dslabs)
data("heights")
heights %>%  
    mutate(height = round(height)) %>%  
    group_by(height) %>%  
    summarize(p = mean(sex == "Male")) %>%
    qplot(height, p, data =.)

Q7

In the plot we just made in Q6 we see high variability for low values of height. This is because we have few data points. This time use the quantile and the cut() function to assure each group has the same number of points. Note that for any numeric vector x, you can create groups based on quantiles like this: cut(x, quantile(x, seq(0, 1, 0.1)), include.lowest = TRUE).

Part of the code is provided here:

#ps <- seq(0, 1, 0.1)
#heights %>% 
    ## MISSING CODE
    #group_by(g) %>%
    #summarize(p = mean(sex == "Male"), height = mean(height)) %>%
    #qplot(height, p, data =.)

Which of the following lines of code can be used to replace # MISSING CODE to make the correct plot?

A. mutate(g = cut(male, quantile(height, ps), include.lowest = TRUE)) %>%

B. mutate(g = cut(height, quantile(height, ps), include.lowest = TRUE)) %>%

C. mutate(g = cut(female, quantile(height, ps), include.lowest = TRUE)) %>%

D. mutate(g = cut(height, quantile(height, ps))) %>%

Now making the plot:

ps <- seq(0, 1, 0.1)
heights %>% 
  mutate(g = cut(height, quantile(height, ps), include.lowest = TRUE)) %>%
    group_by(g) %>%
    summarize(p = mean(sex == "Male"), height = mean(height)) %>%
    qplot(height, p, data =.)

Q8

You can generate data from a bivariate normal distrubution using the MASS package using the following code:

Sigma <- 9*matrix(c(1,0.5,0.5,1), 2, 2)
dat <- MASS::mvrnorm(n = 10000, c(69, 69), Sigma) %>%
    data.frame() %>% setNames(c("x", "y"))

And you can make a quick plot using plot(dat).

Using an approach similar to that used in the previous exercise, let’s estimate the conditional expectations and make a plot. Part of the code has again been provided for you:

#ps <- seq(0, 1, 0.1)
#dat %>% 
    ## MISSING CODE
    #qplot(x, y, data =.)

Which of the following blocks of code can be used to replace # MISSING CODE to make the correct plot?

A. mutate(g = cut(x, quantile(x, ps), include.lowest = TRUE)) %>%
group_by(g) %>%
summarize(y = mean(y), x = mean(x)) %>%

B. mutate(g = cut(x, quantile(x, ps))) %>%
group_by(g) %>%
summarize(y = mean(y), x = mean(x)) %>%

C. mutate(g = cut(x, quantile(x, ps), include.lowest = TRUE)) %>%
summarize(y = mean(y), x = mean(x)) %>%

D. mutate(g = cut(x, quantile(x, ps), include.lowest = TRUE)) %>%
group_by(g) %>%
summarize(y =(y), x =(x)) %>%

Now making the plot:

ps <- seq(0, 1, 0.1)
dat %>% 
   mutate(g = cut(x, quantile(x, ps), include.lowest = TRUE)) %>%  
   group_by(g) %>%  
   summarize(y = mean(y), x = mean(x)) %>%
    qplot(x, y, data =.)