Part 1
First, let's import the data:
ads.train <- read.csv("internetads_train.csv", header=TRUE)
ads.test <- read.csv("internetads_test.csv", header=TRUE)
I tried creating a summary of the data, but there are too many columns to read it usefully.
#summary(ads.train)
Next, let's turn all the “?” entries into NAs. Converting each affected column to numeric coerces every non-numeric entry to NA, which is what triggers the warnings below.
ads.train$height = as.numeric(as.character(ads.train$height))
NAs introduced by coercion
ads.train$width = as.numeric(as.character(ads.train$width))
NAs introduced by coercion
ads.train$aratio = as.numeric(as.character(ads.train$aratio))
NAs introduced by coercion
ads.train$local = as.numeric(as.character(ads.train$local))
NAs introduced by coercion
ads.test$height = as.numeric(as.character(ads.test$height))
NAs introduced by coercion
ads.test$width = as.numeric(as.character(ads.test$width))
NAs introduced by coercion
ads.test$aratio = as.numeric(as.character(ads.test$aratio))
NAs introduced by coercion
ads.test$local = as.numeric(as.character(ads.test$local))
NAs introduced by coercion
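The coercion step can be illustrated on a small made-up table. The sketch below assumes a toy CSV string (not the real file) and also shows an alternative that is not used in this notebook: telling read.csv to treat "?" as missing through its na.strings argument.

```r
# Toy CSV text standing in for the real file (all values are made up).
toy_csv <- "height,width,class
125,125,ad.
?,468,ad.
60,?,nonad."

# Approach used above: read the column as a factor, then coerce it.
# Every entry that is not a number (here "?") becomes NA.
raw <- read.csv(text = toy_csv, header = TRUE, stringsAsFactors = TRUE)
height_coerced <- suppressWarnings(as.numeric(as.character(raw$height)))

# Alternative: declare "?" as the missing-value marker while reading,
# so the columns are numeric from the start.
clean <- read.csv(text = toy_csv, header = TRUE, na.strings = "?",
                  strip.white = TRUE)

# Number of missing values per column.
colSums(is.na(clean))
```

Both routes should give the same numeric column; the second one avoids the coercion warnings.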
There are many missing values, so removing the affected rows would discard a large amount of data. Removing the columns that contain missing data would instead lose important information about the height and width of the images. The best solution I have found is to impute the missing values using k-nearest neighbours (KNN), with the knnImputation function from the DMwR library.
#install.packages("DMwR")
library(DMwR)
package 'DMwR' was built under R version 3.3.3
Loading required package: lattice
Loading required package: grid
ads.train.imp <- knnImputation(ads.train, k = 10, scale = F, meth = "weighAvg",distData = NULL)
ads.test.imp <- knnImputation(ads.test, k = 10, scale = F, meth = "weighAvg",distData = NULL)
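To make the idea behind KNN imputation concrete, here is a minimal base R sketch. It is not the DMwR implementation: the helper knn_impute_sketch is hypothetical, it only handles numeric matrices, and the exponentially decaying weights are an assumption chosen to mimic a distance-weighted average.

```r
# Hypothetical helper: fill NAs in a numeric matrix using the k nearest
# complete rows and a distance-weighted average of their values.
knn_impute_sketch <- function(x, k = 2) {
  complete <- x[stats::complete.cases(x), , drop = FALSE]
  for (i in which(!stats::complete.cases(x))) {
    miss <- is.na(x[i, ])
    # Euclidean distance to each complete row, on the observed columns only.
    diffs <- sweep(complete[, !miss, drop = FALSE], 2, x[i, !miss])
    d <- sqrt(rowSums(diffs^2))
    nn <- order(d)[seq_len(min(k, nrow(complete)))]
    w <- exp(-d[nn])  # assumption: closer neighbours get larger weights
    x[i, miss] <- colSums(complete[nn, miss, drop = FALSE] * w) / sum(w)
  }
  x
}

# Made-up data: the last row is missing its second value.
toy <- rbind(c(1, 10), c(2, 20), c(3, 30), c(2, NA))
imputed_k1 <- knn_impute_sketch(toy, k = 1)
imputed_k2 <- knn_impute_sketch(toy, k = 2)
```

With k = 1 the missing value is copied from the single closest complete row; with a larger k it becomes a blend of several neighbours.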
For many other algorithms it would be a good idea to standardize the data. For trees this is unnecessary, because each split only looks for the threshold on a single variable that best separates the classes. It therefore does not matter if one column's values are much larger than another's.
Now there are no more missing values in the data, so we are ready to run the tree algorithms on it.
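The claim that trees do not need standardized inputs can be checked on a small example. The sketch below uses the built-in iris data rather than the ads data (an assumption made so that it runs on its own) and multiplies one predictor by 1000 before refitting.

```r
library(rpart)

# Tree on the original data.
fit_raw <- rpart(Species ~ ., data = iris)

# Same data with one predictor on a much larger scale.
iris_scaled <- iris
iris_scaled$Petal.Length <- iris_scaled$Petal.Length * 1000
fit_scaled <- rpart(Species ~ ., data = iris_scaled)

# Compare the class predictions of the two trees on their own data.
pred_raw <- predict(fit_raw, iris, type = "class")
pred_scaled <- predict(fit_scaled, iris_scaled, type = "class")
same_predictions <- all(pred_raw == pred_scaled)
```

Only the split thresholds change with the scale; the way the observations are partitioned stays the same.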
Part 2
** Build a classification tree using all possible predictors in the training set with a maximum tree depth of 3. **
library(rpart)
Sys.setlocale("LC_ALL", "C")
[1] "C"
tree.3 <- rpart(class ~., data=ads.train.imp, cp=0.001, maxdepth = 3)
** Plot the classification tree and explain, in words, what the decision rules are for classification. **
par(mar=c(0,4.1,0,2.1))
plot(tree.3)
text(tree.3, cex=.75)

The rules go as follows:
If the image is wider than 224 and its height is below 94 but above 58, then it is an ad.
If the image is wider than 224 and its height is above 93, then it is not an ad.
If the image is not wider than 224, its destination URL contains the pattern com, and the ratio between its width and height is less than 4.585, then it is probably an ad.
If the image is not wider than 224, its destination URL contains the pattern com, and the ratio is more than 4.585, then it is probably not an ad.
If the image is not wider than 224, its destination URL does not contain the pattern com, and its URL contains the pattern for ads, then it is probably also an ad.
If the image is not wider than 224, its destination URL does not contain the pattern com, and its URL does not contain the pattern for ads, then it is probably also not an ad.
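The rules above can be written as a small function. This is only a sketch of the rules as stated in words: the function name and its argument names are hypothetical (they are not the column names of the data set), and the branch the list does not describe returns NA.

```r
# Hypothetical encoding of the six rules; argument names are made up.
classify_by_rules <- function(width, height, aratio,
                              dest_has_com, url_has_ads_pattern) {
  if (width > 224) {
    if (height > 93) return("nonad.")
    if (height > 58) return("ad.")
    return(NA_character_)  # wide and short images: not covered by the rules
  }
  if (dest_has_com) {
    # A ratio of exactly 4.585 is not covered by the text; it falls in "nonad." here.
    return(if (aratio < 4.585) "ad." else "nonad.")
  }
  if (url_has_ads_pattern) "ad." else "nonad."
}

# A wide banner of medium height.
classify_by_rules(468, 60, 7.8, dest_has_com = FALSE, url_has_ads_pattern = FALSE)
```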
** Then show a confusion matrix and compute the misclassification error rate on the test set. **
pred.3 <- predict(tree.3, ads.test.imp)
ads.test.imp$class2 <- factor(ifelse(ads.test.imp$class == "ad.", "yes", "no"))
pred.3.final <- ifelse(pred.3[,2] >= 0.5, "no", "yes")
mean(pred.3.final != ads.test.imp$class2)
[1] 0.06161746
table(ads.test.imp$class2, pred.3.final)
     pred.3.final
       no yes
  no  649  12
  yes  36  82
The error rate of the tree with depth 3 is 6.16%. The result does not seem great, but a tree with more layers might do better. Looking at the confusion matrix (rows are the true labels, columns the predictions), the most common error is classifying images that are ads as non-ads: 36 cases, against 12 non-ads classified as ads.
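The error rate can be recomputed directly from the four counts of the confusion matrix. The sketch below retypes the counts reported for the depth 3 tree by hand; the variable names are only illustrative.

```r
# Counts from the depth 3 confusion matrix: rows = actual, columns = predicted.
cm <- matrix(c(649, 36, 12, 82), nrow = 2,
             dimnames = list(actual = c("no", "yes"),
                             predicted = c("no", "yes")))

# Overall misclassification rate: off-diagonal counts over the total.
error_rate <- 1 - sum(diag(cm)) / sum(cm)

# Share of true ads that were predicted as non-ads.
missed_ads <- cm["yes", "no"] / sum(cm["yes", ])

# Share of true non-ads that were predicted as ads.
false_alarms <- cm["no", "yes"] / sum(cm["no", ])

round(c(error = error_rate, missed = missed_ads, false_alarm = false_alarms), 4)
```

Splitting the error by class makes it easier to see that most mistakes come from the ads, which are the smaller class.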
** Now build a classification tree using all possible predictors in the training set with a maximum tree depth of 5. **
tree.5 <- rpart(class ~., data=ads.train.imp, cp=0.001, maxdepth = 5)
** Plot this new tree **
par(mar=c(0,4.1,0,2.1))
plot(tree.5)
text(tree.5, cex=.75)

** How different are the two trees, structurally, and how do their test errors compare? **
pred.5 <- predict(tree.5, ads.test.imp)
#ads.test.imp$class2 <- factor(ifelse(ads.test.imp$class == "ad.", "yes", "no"))
pred.5.final <- ifelse(pred.5[,2] >= 0.5, "no", "yes")
mean(pred.5.final != ads.test.imp$class2)
[1] 0.05519897
table(ads.test.imp$class2, pred.5.final)
     pred.5.final
       no yes
  no  650  11
  yes  32  86
The complexity of the tree has increased, as it has more levels. The first 3 layers are very similar: everything that existed in the tree with depth 3 still exists in this one. The new tree adds more detail, especially where the width is over 224 and the height is lower than 93. It also has many rules where the width is under 224 and the destination URL has the com pattern.
Surprisingly, though, the test errors differ little. The new error is 5.52%, only a small decrease from the 6.16% of the tree with 3 layers.
The most common type of error continues to be classifying images as non-ads when they actually are ads (32 cases).
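The same depth comparison can be written as a loop, asking predict for class labels directly instead of thresholding the probabilities by hand. The sketch below runs on the kyphosis data shipped with rpart (an assumption made so that it runs without the ads files), with an arbitrary train/test split.

```r
library(rpart)

# Arbitrary split of the kyphosis data into training and test rows.
set.seed(1)
idx <- sample(nrow(kyphosis), 60)
train <- kyphosis[idx, ]
test <- kyphosis[-idx, ]

# Fit one tree per maximum depth and record its test error.
depth_error <- sapply(c(3, 5), function(d) {
  fit <- rpart(Kyphosis ~ ., data = train, cp = 0.001, maxdepth = d)
  pred <- predict(fit, test, type = "class")  # predicted labels, no threshold needed
  mean(pred != test$Kyphosis)
})
names(depth_error) <- c("depth3", "depth5")
depth_error
```

Using type = "class" also removes the risk of mixing up which probability column belongs to which label.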
Part 3
** Use bagging with 50 trees to build an ensemble classifier for these images using the training set. **
#install.packages("randomForest")
library(randomForest)
package 'randomForest' was built under R version 3.3.3
randomForest 4.6-12
Type rfNews() to see new features/changes/bug fixes.
set.seed(222)
baf <- randomForest(class ~ ., data=ads.train.imp,ntree=50, importance=TRUE)
** What is the out-of-bag error rate? What test error do you obtain using this new classifier? In your own words, explain what the difference is between the two. **
baf
Call:
randomForest(formula = class ~ ., data = ads.train.imp, ntree = 50, importance = TRUE)
Type of random forest: classification
Number of trees: 50
No. of variables tried at each split: 39
OOB estimate of error rate: 2.6%
Confusion matrix:
       ad. nonad. class.error
ad.    286     55 0.161290323
nonad.  10   2149 0.004631774
bag.pred <- predict(baf, ads.test.imp)
mean(bag.pred != ads.test.imp$class)
[1] 0.03080873
The out-of-bag (OOB) error rate is 2.6%. The error on the test data is 3.1%.
The OOB error is estimated from the training data itself: each tree is grown on a bootstrap sample of the training set, and every training image is then predicted using only the trees whose sample did not contain it. The test error is instead measured on data that was not used at all when training the trees.
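A short simulation shows why out-of-bag observations exist for every tree. The sketch below only draws bootstrap samples, with no model involved; n = 2500 is the number of training rows implied by the confusion matrix above, and the variable names are illustrative.

```r
set.seed(222)
n <- 2500       # training rows (sum of the OOB confusion matrix counts)
n_trees <- 50

# For each tree, draw a bootstrap sample and record the share of rows left out.
oob_fraction <- replicate(n_trees, {
  in_bag <- sample(n, n, replace = TRUE)
  1 - length(unique(in_bag)) / n
})

mean(oob_fraction)  # close to exp(-1), i.e. roughly 37% of the rows per tree
```

Because each row is left out of roughly a third of the trees, those trees can vote on it as if it were unseen data.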
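Both are much lower than the 5.52% test error of the single tree of depth 5. Note that the output above reports 39 variables tried at each split, so this model is a random forest rather than strict bagging, which would consider every predictor at each split.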
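For classification, randomForest defaults to trying floor(sqrt(p)) of the p predictors at each split, while bagging corresponds to setting mtry to p. A minimal sketch of how that could be requested is given below. It uses made-up toy data in place of ads.train.imp, and the model is only fitted if the randomForest package is installed.

```r
# Toy data standing in for ads.train.imp (hypothetical, 20 predictors).
set.seed(222)
toy <- data.frame(matrix(rnorm(200 * 20), ncol = 20))
toy$class <- factor(ifelse(toy$X1 + rnorm(200) > 0, "ad.", "nonad."))

p <- ncol(toy) - 1              # number of predictors
default_mtry <- floor(sqrt(p))  # randomForest's default for classification

if (requireNamespace("randomForest", quietly = TRUE)) {
  # Bagging: every predictor is a candidate at every split.
  bag <- randomForest::randomForest(class ~ ., data = toy,
                                    ntree = 50, mtry = p)
  print(bag$mtry)
}
```

On the ads data the same idea would be mtry = ncol(ads.train.imp) - 1; the resulting error rates would have to be recomputed and may differ from those reported here.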
** Display a variable importance (or feature importance) plot associated with the bagged trees. Just include the top 10 most important variables in this plot. In a couple of sentences, interpret this plot. **
varImpPlot(baf, n.var = 10)

Above we have the variable importance plot. On the left is the mean decrease in accuracy: how much the out-of-bag accuracy drops when a variable's values are randomly permuted. By this measure the destination URL click pattern is the most important, followed by the width and the destination URL pattern http www. On the right, the mean decrease in Gini measures how much the splits on a variable increase node purity, that is, how well that variable splits the data. Here the most important variable is the width, followed by the destination URL pattern com and the destination URL click pattern.
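The idea behind the mean decrease in accuracy can be sketched by hand with a single tree. The example below is a simplification and not what randomForest does internally: it uses the built-in iris data, one rpart tree, and measures accuracy on the training data instead of on out-of-bag samples.

```r
library(rpart)

set.seed(222)
fit <- rpart(Species ~ ., data = iris)
base_acc <- mean(predict(fit, iris, type = "class") == iris$Species)

# Shuffle one predictor at a time and record how much the accuracy drops.
predictors <- setdiff(names(iris), "Species")
perm_drop <- sapply(predictors, function(v) {
  shuffled <- iris
  shuffled[[v]] <- sample(shuffled[[v]])  # break the link between v and the class
  base_acc - mean(predict(fit, shuffled, type = "class") == shuffled$Species)
})

sort(perm_drop, decreasing = TRUE)
```

A variable whose shuffling barely changes the accuracy is one the model does not rely on, which is how the left panel of the plot should be read.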