Acquainting yourself with the data
Reveal the number of observations and variables in two different ways
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
dim(iris)
## [1] 150 5
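As an aside, nrow() and ncol() return the two numbers from dim() separately:
nrow(iris)  # 150 observations
ncol(iris)  # 5 variables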
Show the first and last observations in the iris data set
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
tail(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 145 6.7 3.3 5.7 2.5 virginica
## 146 6.7 3.0 5.2 2.3 virginica
## 147 6.3 2.5 5.0 1.9 virginica
## 148 6.5 3.0 5.2 2.0 virginica
## 149 6.2 3.4 5.4 2.3 virginica
## 150 5.9 3.0 5.1 1.8 virginica
Summarize the iris data set
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
Which of the following tasks uses a machine learning model?
- 1 Determine whether an incoming email is spam or not.
- 2 Obtain the name of last year’s Giro d’Italia champion.
- 3 Automatically tag your new Facebook photos.
- 4 Select the student with the highest grade on a statistics course.
# (1) and (3)
Identify the one that is not a machine learning problem.
- 1 Given a viewer’s shopping habits, recommend a product to purchase the next time she visits your website.
- 2 Given the symptoms of a patient, identify her illness.
- 3 Predict the USD/EUR exchange rate for February 2016.
- 4 Compute the mean wage of 10 employees for your company.
# (4)
Basic prediction model
- You'll be working with the Wage dataset. It contains the wage and some general information for workers in the mid-Atlantic region of the US.
- Just like in the video example, there could be a relationship between a worker's age and his wage: older workers tend to earn more on average than their younger counterparts, so you could expect an increasing trend in wage as workers age.
- A linear regression model, lm_wage, has been built for you with lm(). It models the wage of a worker based on his age. With this linear model, built from previous observations, you can predict the wage of new observations.
- For example, suppose you want to predict the wage of a 60-year-old worker. You can use the predict() function for this. This generic function takes a model as its first argument; the second argument should be some unseen observations as a data frame. predict() is then able to predict outcomes for these observations.
Take a look at the code that builds lm_wage, which models the wage by the age variable.
#install.packages("ISLR")
library(ISLR)
## Warning: package 'ISLR' was built under R version 3.2.2
lm_wage <- lm(wage ~ age, data = Wage)
lm_wage
##
## Call:
## lm(formula = wage ~ age, data = Wage)
##
## Coefficients:
## (Intercept) age
## 81.7047 0.7073
See how the data frame unseen is created with a single column, age, containing a single value, 60.
unseen <- data.frame(age = 60)
Predict the average wage at age 60 using predict(): you have to pass the arguments lm_wage and unseen. Make sure the result is displayed in the console (don’t assign it to a variable). Can you interpret the result?
predict(lm_wage, unseen)
## 1
## 124.1413
- Based on the linear model that was estimated from the Wage dataset, you predicted the average wage for a 60-year-old worker to be around 124 USD a day.
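As a quick sanity check, this prediction is simply the fitted regression line evaluated at age 60, which you can reproduce from the model coefficients:
coef(lm_wage)[1] + coef(lm_wage)[2] * 60  # 81.7047 + 0.7073 * 60, about 124.14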
Classification: Filtering spam
Which of the following questions can be answered using a classification algorithm?
- 1 How does the exchange rate depend on the GDP?
- 2 Does a document contain the handwritten letter S?
- 3 How can I group supermarket products using purchase frequency?
# 2
Classification: Filtering spam
- Filtering spam from relevant emails is a typical machine learning task. Information such as word frequency, character frequency and the number of capital letters can indicate whether an email is spam or not.
- In the following exercise you’ll work with the dataset emails, which is loaded in your workspace (Source: UCI Machine Learning Repository). Here, several emails have been labeled by humans as spam (1) or not spam (0), and the results are found in the column spam. The feature considered in emails is avg_capital_seq: the average length of sequences of capital letters found in each email.
emails <- read.csv("emails.csv")
emails
## avg_capital_seq spam
## 1 1.000 0
## 2 2.112 0
## 3 4.123 1
## 4 1.863 0
## 5 2.973 1
## 6 1.687 0
## 7 5.891 1
## 8 3.167 0
## 9 1.230 0
## 10 2.441 1
## 11 3.555 0
## 12 3.250 0
## 13 1.333 1
In the code, you’ll find a crude spamfilter we built for you, spam_classifier() that uses avg_capital_seq to predict whether an email is spam or not.
spam_classifier <- function(x) {
  # Start with an all-NA prediction vector, then fill it in by range of x
  prediction <- rep(NA, length(x))
  prediction[x > 4] <- 1
  prediction[x >= 3 & x <= 4] <- 0
  prediction[x >= 2.2 & x < 3] <- 1
  prediction[x >= 1.4 & x < 2.2] <- 0
  prediction[x > 1.25 & x < 1.4] <- 1
  prediction[x <= 1.25] <- 0
  return(prediction)
}
Pass the avg_capital_seq column of emails to spam_classifier() to determine which emails are spam and which aren’t. Assign the resulting outcomes to spam_pred.
spam_pred <- spam_classifier(emails$avg_capital_seq)
spam_pred
## [1] 0 0 1 0 1 0 1 0 0 1 0 0 1
Compare the vector with your predictions, spam_pred, to the true spam labels in emails$spam with the == operator. Simply print out the result. This can be done in one line of code! How many of the emails were correctly classified?
spam_pred == emails$spam
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
- Good job! You correctly filtered the spam 13 out of 13 times! Sadly, the classifier we gave you was made to perfectly classify all 13 examples. If you were to use it on new emails, the results would be far less satisfying.
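Since TRUE counts as 1 and FALSE as 0 in R, the accuracy can be computed in one step by taking the mean of this logical vector:
mean(spam_pred == emails$spam)  # 1: all 13 emails were classified correctly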
Regression: Linkedin views for the next 3 days
Create a vector days with the numbers from 1 to 21, representing the previous 21 days of your LinkedIn profile views. You can use the seq() function, or simply the : operator.
days <- 1:21
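The linkedin vector with the view counts for each of those 21 days is assumed to be predefined in your workspace (its values are not reproduced here). The linear model used below would then be fit along these lines:
# linkedin: predefined vector of 21 daily view counts (assumed, not shown here)
linkedin_lm <- lm(linkedin ~ days)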
Using this linear model, predict the number of views for the next three days (22, 23 and 24). Use predict() and the predefined future_days data frame. Assign the result to linkedin_pred.
future_days <- data.frame(days = 22:24)
future_days
## days
## 1 22
## 2 23
## 3 24
linkedin_pred <- predict(linkedin_lm, future_days)
Plot historical data and predictions
plot(linkedin ~ days, xlim = c(1, 24))
points(22:24, linkedin_pred, col = "green")

Clustering: Separating the iris species
- Clustering tries to group your objects without any prior knowledge of what these groups could or should look like. In this case, the concepts of prior knowledge and unseen observations are less meaningful than for classification and regression.
- In this exercise, you’ll group irises in 3 distinct clusters, based on several flower characteristics in the iris dataset. It has already been chopped up in a data frame my_iris and a vector species, as shown in the sample code below.
# Set random seed. Don't remove this line.
set.seed(1)
# Chop up iris in my_iris and species
my_iris <- iris[-5]
head(my_iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5.1 3.5 1.4 0.2
## 2 4.9 3.0 1.4 0.2
## 3 4.7 3.2 1.3 0.2
## 4 4.6 3.1 1.5 0.2
## 5 5.0 3.6 1.4 0.2
## 6 5.4 3.9 1.7 0.4
species <- iris$Species
head(species)
## [1] setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica
The clustering itself will be done with the kmeans() function. Note: In problems that have a random aspect (like kmeans()), the set.seed() function will be used to enforce reproducibility. If you fix the seed, the random numbers that are generated afterwards are always the same.
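As a quick illustration (not part of the original exercise), re-running kmeans() after resetting the same seed yields the exact same partitioning; the names first and second below are just for this sketch:
set.seed(1)
first <- kmeans(my_iris, 3)$cluster
set.seed(1)
second <- kmeans(my_iris, 3)$cluster
identical(first, second)  # TRUE: same seed, same random starts, same clusters
set.seed(1)  # reset the seed so the next kmeans() call reproduces the output below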
Use the kmeans() function. The first argument is my_iris; the second argument is 3, as you want to find three groups in my_iris. Assign the result to a new variable, kmeans_iris.
kmeans_iris <- kmeans(my_iris, 3)
The actual species of the observations is stored in species. Use table() to compare it to the groups the clustering came up with. These groups can be found in the cluster element of kmeans_iris.
table(species, kmeans_iris$cluster)
##
## species 1 2 3
## setosa 50 0 0
## versicolor 0 2 48
## virginica 0 36 14
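Note that the cluster numbers 1, 2 and 3 are arbitrary labels, so the columns of this table need not line up with the species. One rough way to quantify the agreement, assuming each species is matched to its dominant cluster, is:
# Match each species to the cluster that contains most of its observations
sum(apply(table(species, kmeans_iris$cluster), 1, max)) / length(species)
# (50 + 48 + 36) / 150, i.e. about 0.89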
Plot Petal.Length against Petal.Width, coloring the points by cluster
plot(Petal.Length ~ Petal.Width, data = my_iris, col = kmeans_iris$cluster)

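As a follow-up (an extra, not in the original exercise), you can overlay the true species on the same scatterplot by encoding them in the plotting symbol, with the color still showing the cluster:
# col = cluster found by kmeans, pch = true species
plot(Petal.Length ~ Petal.Width, data = my_iris,
     col = kmeans_iris$cluster, pch = as.numeric(species))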
Getting practical with supervised learning
The code below builds a supervised learning model with the rpart() function: it trains a decision tree on the iris dataset.
library(rpart)
tree <- rpart(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
              data = iris, method = "class")
# A data frame containing unseen observations
unseen <- data.frame(Sepal.Length = c(5.3, 7.2),
                     Sepal.Width = c(2.9, 3.9),
                     Petal.Length = c(1.7, 5.4),
                     Petal.Width = c(0.8, 2.3))
unseen
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5.3 2.9 1.7 0.8
## 2 7.2 3.9 5.4 2.3
Use the predict() function with the tree model as the first argument. The second argument should be a data frame containing observations whose labels you want to predict; in this case, you can use the predefined unseen data frame. The third argument should be type = "class". Simply print out the result of this prediction step.
predict(tree, unseen, type = "class")
## 1 2
## setosa virginica
## Levels: setosa versicolor virginica
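If you want to see how confident the tree is rather than just the predicted label, predict() also accepts type = "prob", which returns the class probabilities for each unseen observation:
# One row per observation, one column per species
predict(tree, unseen, type = "prob")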
How to do unsupervised learning
- In this exercise, you will group cars into 2 clusters based on two numeric attributes. The observations are in the cars data frame loaded in your workspace; as the str() output below shows, it records each car's speed and stopping distance (dist).
- To cluster the different observations, you will once again use kmeans(). In short, your job is to cluster the cars in 2 groups.
# Set random seed. Don't remove this line.
set.seed(1)
# Explore the cars dataset
str(cars)
## 'data.frame': 50 obs. of 2 variables:
## $ speed: num 4 4 7 7 8 9 10 10 10 11 ...
## $ dist : num 2 10 4 22 16 10 18 26 34 17 ...
summary(cars)
## speed dist
## Min. : 4.0 Min. : 2.00
## 1st Qu.:12.0 1st Qu.: 26.00
## Median :15.0 Median : 36.00
## Mean :15.4 Mean : 42.98
## 3rd Qu.:19.0 3rd Qu.: 56.00
## Max. :25.0 Max. :120.00
Use kmeans() with two arguments to group the cars into two clusters based on the cars’ speed and dist. Assign the result to km_cars.
km_cars <- kmeans(cars, 2)
print(km_cars$cluster)
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 2 1 1 1 1 2 1 2 2 2
## [36] 1 1 2 1 1 2 2 2 2 2 2 2 2 2 2
- The print-out shows, for each of the 50 cars, whether it was assigned to cluster 1 or cluster 2. However, if you would like a more comprehensive overview of the results, you should definitely visualize them!
Finish the plot() command by coloring the cars based on their cluster. Do this by setting the col argument to the cluster partitioning vector, km_cars$cluster.
plot(cars, col = km_cars$cluster)
Print out the clusters’ centroids, which are the centers (means) of each cluster. They can be found in the centers element of km_cars, and can be added to the plot with points().
km_cars$centers
points(km_cars$centers, pch = 22, bg = c(1, 2), cex = 2)
From the following list, select the supervised learning problems:
- 1 Identify a face in a list of Facebook photos. You can train your system on tagged Facebook pictures.
- 2 Given some features, predict whether a fruit has gone bad or not. Several supermarkets provided you with their previous observations and results.
- 3 Group DataCamp students into three groups. Students within the same group should be similar, while those in different groups must be dissimilar.
# Only (1) and (2) are supervised.