Acquainting yourself with the data

Reveal the number of observations and variables in two different ways

str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
dim(iris)
## [1] 150   5
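
Alternatively, nrow() and ncol() return the two counts separately:

nrow(iris)
## [1] 150
ncol(iris)
## [1] 5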

Show the first and last observations in the iris data set

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
tail(iris)
##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 145          6.7         3.3          5.7         2.5 virginica
## 146          6.7         3.0          5.2         2.3 virginica
## 147          6.3         2.5          5.0         1.9 virginica
## 148          6.5         3.0          5.2         2.0 virginica
## 149          6.2         3.4          5.4         2.3 virginica
## 150          5.9         3.0          5.1         1.8 virginica

Summarize the iris data set

summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

Which of the following tasks use a machine learning model?

    1. Determine whether an incoming email is spam or not.
    2. Obtain the name of last year’s Giro d’Italia champion.
    3. Automatically tag your new Facebook photos.
    4. Select the student with the highest grade on a statistics course.
# (1) and (3) 

Identify the option that is not a machine learning problem.

    1. Given a viewer’s shopping habits, recommend a product to purchase the next time she visits your website.
    2. Given the symptoms of a patient, identify her illness.
    3. Predict the USD/EUR exchange rate for February 2016.
    4. Compute the mean wage of 10 employees for your company.
# (4)

Basic prediction model

You'll be working with the Wage dataset. It contains the wage and some general information for workers in the mid-Atlantic region of the US. Just like in the video example, there could be a relationship between a worker's age and his wage: older workers tend to earn more on average than their younger counterparts, so you could expect an increasing trend in wage as workers age. We therefore built a linear regression model for you, using lm(): lm_wage. It models the wage of a worker based on his age.

With this linear model lm_wage, built from previous observations, you can predict the wage of new observations. For example, suppose you want to predict the wage of a 60-year-old worker. You can use the predict() function for this. This generic function takes a model as its first argument; the second argument should be some unseen observations as a data frame. predict() is then able to predict outcomes for these observations.

Take a look at the code that builds lm_wage, which models the wage by the age variable.

#install.packages("ISLR")
library(ISLR)
## Warning: package 'ISLR' was built under R version 3.2.2
lm_wage <- lm(wage ~ age, data = Wage)
lm_wage
## 
## Call:
## lm(formula = wage ~ age, data = Wage)
## 
## Coefficients:
## (Intercept)          age  
##     81.7047       0.7073

See how the data frame unseen is created with a single column, age, containing a single value, 60.

unseen <- data.frame(age = 60)

Predict the average wage at age 60 using predict(): pass the arguments lm_wage and unseen. Make sure the result is displayed in the console (don’t assign it to a variable). Can you interpret it?

predict(lm_wage, unseen)
##        1 
## 124.1413
  • Based on the linear model estimated from the Wage dataset, you predicted the average wage of a 60-year-old worker to be around 124 USD a day.
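
As a quick sanity check, you can reconstruct this prediction by hand from the fitted coefficients, since the model is simply intercept + slope * age:

# Manual prediction for age 60: intercept + slope * 60
coef(lm_wage)[1] + coef(lm_wage)[2] * 60
# roughly 81.7047 + 0.7073 * 60, i.e. about 124.14, matching predict()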

Classification problems

Which of the following questions can be answered using a classification algorithm?

    1. How does the exchange rate depend on the GDP?
    2. Does a document contain the handwritten letter S?
    3. How can I group supermarket products using purchase frequency?
# 2

Classification: Filtering spam

  • Filtering spam from relevant emails is a typical machine learning task. Information such as word frequency, character frequency and the number of capital letters can indicate whether an email is spam or not.
  • In the following exercise you’ll work with the dataset emails, which is loaded in your workspace (Source: UCI Machine Learning Repository). Here, several emails have been labeled by humans as spam (1) or not spam (0) and the results are found in the column spam. The feature considered in emails is avg_capital_seq: the average number of sequential capital letters found in each email.
emails <- read.csv("emails.csv")
emails
##    avg_capital_seq spam
## 1            1.000    0
## 2            2.112    0
## 3            4.123    1
## 4            1.863    0
## 5            2.973    1
## 6            1.687    0
## 7            5.891    1
## 8            3.167    0
## 9            1.230    0
## 10           2.441    1
## 11           3.555    0
## 12           3.250    0
## 13           1.333    1

In the code below, you’ll find a crude spam filter we built for you, spam_classifier(), that uses avg_capital_seq to predict whether an email is spam or not.

spam_classifier <- function(x){
  # Start with NA predictions, then fill in 1 (spam) or 0 (not spam)
  # according to hand-picked thresholds on avg_capital_seq
  prediction <- rep(NA, length(x))
  prediction[x > 4] <- 1
  prediction[x >= 3 & x <= 4] <- 0
  prediction[x >= 2.2 & x < 3] <- 1
  prediction[x >= 1.4 & x < 2.2] <- 0
  prediction[x > 1.25 & x < 1.4] <- 1
  prediction[x <= 1.25] <- 0
  return(prediction)
}
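
As an aside, the same rule can be written more compactly with findInterval(), which bins each value by the thresholds. This is just a sketch: values exactly on a boundary (e.g. 1.25 or 4) are classified slightly differently than above, though for the 13 emails here the predictions come out identical.

# Hypothetical compact variant of the same hand-picked rule
spam_classifier2 <- function(x) {
  # bins: 0 for x < 1.25, 1 for [1.25, 1.4), ..., 5 for x >= 4
  bins <- findInterval(x, c(1.25, 1.4, 2.2, 3, 4))
  # map each bin to its spam label
  c(0, 1, 0, 1, 0, 1)[bins + 1]
}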

Pass the avg_capital_seq column of emails to spam_classifier() to determine which emails are spam and which aren’t. Assign the resulting outcomes to spam_pred.

spam_pred <- spam_classifier(emails$avg_capital_seq) 
spam_pred
##  [1] 0 0 1 0 1 0 1 0 0 1 0 0 1

Compare your prediction vector, spam_pred, to the true spam labels in emails$spam with the == operator, and simply print out the result. This can be done in one line of code! How many of the emails were correctly classified?

spam_pred == emails$spam 
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
  • Good job! You correctly filtered the spam 13 out of 13 times! Sadly, the classifier we gave you was made to perfectly filter all 13 examples. If you were to use it on new emails, the results would be far less satisfying.
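
Instead of eyeballing the logical vector, you can let R count the correct classifications for you:

sum(spam_pred == emails$spam)   # number of correct predictions
## [1] 13
mean(spam_pred == emails$spam)  # accuracy as a fraction
## [1] 1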

Regression: LinkedIn views for the next 3 days

You can predict how often your profile will be visited in the future. The instructions will help you predict the number of profile views for the next 3 days, based on the views of the past 3 weeks. The linkedin vector, which contains this information, is already available in your workspace.

linkedin <- c(5,  7,  4,  9, 11, 10, 14, 17, 13, 11, 18, 17, 21, 21, 24, 23, 28, 35, 21, 27, 23)

Create a vector days with the numbers from 1 to 21, which represent the previous 21 days of your LinkedIn views. You can use the seq() function, or simply :.

days <- 1:21
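
An equivalent way that derives the length from the data, so it stays correct if the vector grows:

days <- seq_along(linkedin)  # same as 1:21 for this vector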

Try to fit a linear model that explains the linkedin views based on days. Use the lm() function with the appropriate formula. lm(y ~ x), for example, builds a linear model of y based on x. Assign the resulting linear model to linkedin_lm.

linkedin_lm <- lm(linkedin ~ days)

Using this linear model, predict the number of views for the next three days (22, 23 and 24). Use predict() and the predefined future_days data frame. Assign the result to linkedin_pred.

future_days <- data.frame(days = 22:24)
future_days
##   days
## 1   22
## 2   23
## 3   24
linkedin_pred <- predict(linkedin_lm, future_days)

Plot historical data and predictions

plot(linkedin ~ days, xlim = c(1, 24))
points(22:24, linkedin_pred, col = "green")
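
To judge the fit visually, you can also draw the regression line through the historical points; abline() accepts an lm object directly:

abline(linkedin_lm, col = "red")  # fitted regression line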

Clustering: Separating the iris species

# Set random seed. Don't remove this line.
set.seed(1)
# Chop up iris into my_iris (features) and species (labels)
my_iris <- iris[-5]
head(my_iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1          5.1         3.5          1.4         0.2
## 2          4.9         3.0          1.4         0.2
## 3          4.7         3.2          1.3         0.2
## 4          4.6         3.1          1.5         0.2
## 5          5.0         3.6          1.4         0.2
## 6          5.4         3.9          1.7         0.4
species <- iris$Species
head(species)
## [1] setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica

The clustering itself will be done with the kmeans() function. Note: In problems that have a random aspect (like kmeans()), the set.seed() function will be used to enforce reproducibility. If you fix the seed, the random numbers that are generated afterwards are always the same.
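
A quick way to convince yourself of this: draw random numbers twice with the same seed and compare.

set.seed(1)
first_draw <- runif(3)
set.seed(1)
second_draw <- runif(3)
identical(first_draw, second_draw)  # same seed, so identical draws
## [1] TRUE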

Use the kmeans() function. The first argument is my_iris; the second argument is 3, as you want to find three groups in my_iris. Assign the result to a new variable, kmeans_iris.

kmeans_iris <- kmeans(my_iris, 3)

The actual species of the observations is stored in species. Use table() to compare it to the groups the clustering came up with. These groups can be found in the cluster attribute of kmeans_iris.

table(species, kmeans_iris$cluster)
##             
## species       1  2  3
##   setosa     50  0  0
##   versicolor  0  2 48
##   virginica   0 36 14

Plot Petal.Length against Petal.Width, coloring the points by cluster

plot(Petal.Length ~ Petal.Width, data = my_iris, col = kmeans_iris$cluster)
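
If you like, you can also mark the cluster centers, which kmeans() stores in the centers element (one row per cluster, columns matching my_iris):

points(kmeans_iris$centers[, "Petal.Width"],
       kmeans_iris$centers[, "Petal.Length"],
       pch = 8, cex = 2)  # cluster centers as stars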

Getting practical with supervised learning

In the previous exercises, you used kmeans() to perform clustering on the iris dataset. Remember that you created your own copy of the dataset and dropped the Species attribute? That’s right, you removed the labels of the observations. In this exercise, you will use the same dataset, but instead of dropping the Species labels, you will use them to do some supervised learning!

# Set random seed. Don't remove this line.
set.seed(1)
library(rpart)
## Warning: package 'rpart' was built under R version 3.2.2
# Take a look at the iris dataset
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

The code that builds a supervised learning model with the rpart() function is already written for you. This model trains a decision tree on the iris dataset.

tree <- rpart(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, 
              data = iris, method = "class")
# A dataframe containing unseen observations
unseen <- data.frame(Sepal.Length = c(5.3, 7.2), 
                     Sepal.Width = c(2.9, 3.9), 
                     Petal.Length = c(1.7, 5.4), 
                     Petal.Width = c(0.8, 2.3))
unseen
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1          5.3         2.9          1.7         0.8
## 2          7.2         3.9          5.4         2.3

Use the predict() function with the tree model as the first argument. The second argument should be a data frame containing the observations whose labels you want to predict; in this case, you can use the predefined unseen data frame. The third argument should be type = "class". Simply print out the result of this prediction step.

predict(tree, unseen, type = "class")
##         1         2 
##    setosa virginica 
## Levels: setosa versicolor virginica
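
If you’d rather see how confident the tree is, predict() for rpart models can also return class probabilities with type = "prob"; each row then gives the per-class probabilities for one unseen observation:

predict(tree, unseen, type = "prob")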

How to do unsupervised learning

# Set random seed. Don't remove this line.
set.seed(1)
# Explore the cars dataset
str(cars)
## 'data.frame':    50 obs. of  2 variables:
##  $ speed: num  4 4 7 7 8 9 10 10 10 11 ...
##  $ dist : num  2 10 4 22 16 10 18 26 34 17 ...
summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Use kmeans() with two arguments to group the cars into two clusters based on the cars’ speed and dist columns. Assign the result to km_cars.

km_cars <- kmeans(cars, 2)
print(km_cars$cluster)
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 2 1 1 1 1 2 1 2 2 2
## [36] 1 1 2 1 1 2 2 2 2 2 2 2 2 2 2
  • You can see, for example, that the first 21 observations all end up in cluster 1, while the last ten observations, with the highest speeds, are grouped in cluster 2. However, if you would like a more comprehensive overview of the results, you should definitely visualize them!

Finish the plot() command by coloring the cars based on their cluster. Do this by setting the col argument to the cluster partitioning vector, km_cars$cluster.

plot(cars, col = km_cars$cluster)
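
As with the iris clusters, you can overlay the cluster centers; km_cars$centers is a two-column matrix (speed, dist) matching the plot axes:

points(km_cars$centers, pch = 8, cex = 2)  # the two cluster centers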

From the following list, select the supervised learning problems:

    1. Identify a face on a list of Facebook photos. You can train your system on tagged Facebook pictures.
    2. Given some features, predict whether a fruit has gone bad or not. Several supermarkets provided you with their previous observations and results.
    3. Group DataCamp students into three groups. Students within the same group should be similar, while those in different groups must be dissimilar.
# only (1) and (2) are supervised.