# Load the required packages
libraries <- c("ggplot2", "dplyr", "caTools", "party",
               "randomForest", "nnet", "e1071")
lapply(libraries, library, character.only = TRUE)
# Sample size
split_size <- 0.8
sample_size <- floor(split_size * nrow(mtcars))
set.seed(10)
train_indices <- sample(seq_len(nrow(mtcars)), size = sample_size)
# Splitting dataframe
train <- mtcars[train_indices, ]
test <- mtcars[-train_indices, ]
# Model training
model <- lm(mpg ~ disp, data = train)
# Model prediction
new_data <- dplyr::tibble(disp = test$disp)
test$output <- predict(model, new_data)
# RMSE
sqrt(sum((test$mpg - test$output)^2) / nrow(test))
## [1] 1.974877
mtcars %>%
ggplot(aes(x = mpg, y = am)) +
geom_point() + theme_bw() +
xlab("Fuel Efficiency (Miles per Gallon)") +
ylab("Vehicle Transmission Type (0 = Automatic, 1 = Manual)")
# Splitting Y and X
label_train <- train[, 9]
data_train <- train[, -9]
# Boosted logistic regression with caTools::LogitBoost
l_model <- LogitBoost(data_train, label_train)
# Prediction (strictly, the label column should be dropped from test here,
# just as it was for training)
Lab <- predict(l_model, test, type = "raw")
l_prediction <- data.frame(car_name = row.names(test), test$mpg, test$am, Lab)
l_prediction
## car_name test.mpg test.am X0 X1
## 1 Mazda RX4 21.0 1 0.5000000 0.5000000000
## 2 Datsun 710 22.8 1 0.8807971 0.1192029220
## 3 Hornet 4 Drive 21.4 0 0.9996646 0.0003353501
## 4 Merc 450SLC 15.2 0 0.9996646 0.0003353501
## 5 Fiat 128 32.4 1 0.1192029 0.8807970780
## 6 Maserati Bora 15.0 1 0.9975274 0.0024726232
## 7 Volvo 142E 21.4 1 0.5000000 0.5000000000
iris %>%
ggplot(aes(x = Petal.Length, y = Petal.Width)) +
geom_point() +
xlab("Petal Lenght") +
ylab("Petal Width") +
theme_minimal()
data_cluster <- tibble(iris$Petal.Length, iris$Petal.Width)
iris_kmeans <- kmeans(data_cluster, 3)
iris$cluster <- as.factor(iris_kmeans$cluster)
iris %>%
ggplot(aes(x = Petal.Length, y = Petal.Width, color = cluster)) +
geom_point() +
xlab("Petal Lenght") +
ylab("Petal Width") +
theme_minimal() +
geom_point(data = as.data.frame(iris_kmeans$centers),
aes(x = `iris$Petal.Length`, y = `iris$Petal.Width`, col = "Centroid"))
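Cross-tabulating the cluster assignments against the actual species shows how well the unsupervised clusters recover the known classes (the call, not shown in the original, is presumably):

table(iris$Species, iris_kmeans$cluster)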
##
## 1 2 3
## setosa 50 0 0
## versicolor 0 2 48
## virginica 0 44 6
### Train - test prediction
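The predictions below assume a regression tree, tree_train, fit earlier on the training split; a minimal sketch using ctree from the party package (loaded above) would be:

# Hypothetical fit; the original call is not shown in this section
tree_train <- ctree(mpg ~ ., data = train)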
test$mpg_tree <- predict(tree_train, test)
test$class <- predict(tree_train, test, type = "node")
t_model <- data.frame(row.names(test), test$mpg, test$mpg_tree, test$class)
t_model
## row.names.test. test.mpg test.mpg_tree test.class
## 1 Mazda RX4 21.0 25.88889 2
## 2 Datsun 710 22.8 25.88889 2
## 3 Hornet 4 Drive 21.4 16.29375 3
## 4 Merc 450SLC 15.2 16.29375 3
## 5 Fiat 128 32.4 25.88889 2
## 6 Maserati Bora 15.0 16.29375 3
## 7 Volvo 142E 21.4 25.88889 2
mtcars_rf <- randomForest(mpg ~ ., data = mtcars, ntree = 100,
keep.forest = FALSE, importance = TRUE)
plot(mtcars_rf, log = "y", main = "")
A neural network, as its name implies, takes its computational form from the way neurons in a biological system work. In essence, for a given list of inputs, a neural network performs a number of processing steps before returning an output. The complexity in neural networks comes in how many processing steps there are, and how complex each particular step might be.
We have a number of aspects in a neural network to be cognizant of:
The input layer: This is a layer that takes in a number of features, including a bias node, which is often just an offset parameter.
The hidden layer, or "compute" layer: This is the layer that computes some function of each feature. The number of nodes in this hidden layer depends on the computation. Sometimes it might be as simple as one node in this layer; other times, the picture might be more complex, with multiple hidden layers.
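The training trace that follows presumably comes from fitting a small feed-forward network with nnet. A minimal sketch consistent with the 27-weight trace (4 inputs and 3 outputs with 3 hidden units gives 5 * 3 + 4 * 3 = 27 weights, counting bias nodes) might be:

# Hypothetical reconstruction; the original fitting call is not shown
iris_nnet <- nnet(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                  data = iris, size = 3)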
## # weights: 27
## initial value 188.800225
## iter 10 value 69.665454
## iter 20 value 69.326775
## iter 30 value 69.314729
## final value 69.314718
## converged
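Tabulating the network's predictions against the actual species gives the confusion matrix below (again a sketch, assuming the hypothetical iris_nnet above):

table(iris$Species, predict(iris_nnet, iris, type = "class"))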
##
## setosa versicolor virginica
## setosa 50 0 0
## versicolor 0 28 22
## virginica 0 21 29
The idea is that we are taking data and trying to find a plane or a line that can separate the data into different classes.
Suppose that we have n features in our data and m observations, or rows. If n is much greater than m (e.g., n = 1000, m = 10), we would want to use a logistic regression. If we have the opposite, we might want to use an SVM instead.
# SVM on iris, dropping the cluster column added earlier
iris_svm <- svm(Species ~ ., data = iris[, -6])
table(iris$Species, predict(iris_svm, iris, type = "class"))
##
## setosa versicolor virginica
## setosa 50 0 0
## versicolor 0 48 2
## virginica 0 2 48
When we talk about values related to the terms mean, variance, and standard deviation in relation to the total population, these are called parameters. When we talk about those same values, but specific to a certain subset of the data, we call them statistics.
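As a quick illustration, if we treat the full iris data as our population (a hypothetical choice), its overall mean is a parameter, while the mean of a random subset is a statistic:

# Parameter: mean petal length of the whole 'population'
mean(iris$Petal.Length)
# Statistic: mean petal length of a random 30-row sample
mean(iris$Petal.Length[sample(1:nrow(iris), 30)])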
Low bias, low variance: The best-case scenario. The samples are pretty well representative of the population.
High bias, low variance: The samples are all pretty consistent, but not really reflective of the population.
Low bias, high variance: The samples vary wildly in their consistency, but some might be representative of the population.
High bias, high variance: The worst-case scenario. The samples are a little more consistent, but not likely to be representative of the population.
iris_df <- data.frame(iris)
sample_index <- sample(1:nrow(iris_df), nrow(iris_df) * 0.75, replace = FALSE)
head(iris_df[sample_index, ])
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species cluster
## 83 5.8 2.7 3.9 1.2 versicolor 3
## 113 6.8 3.0 5.5 2.1 virginica 2
## 75 6.4 2.9 4.3 1.3 versicolor 3
## 80 5.7 2.6 3.5 1.0 versicolor 3
## 139 6.0 3.0 4.8 1.8 virginica 3
## 20 5.1 3.8 1.5 0.3 setosa 1
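For comparison, the first summary below is of the full iris data; the call, not shown in the original, is presumably:

summary(iris)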
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species cluster
## setosa :50 1:50
## versicolor:50 2:46
## virginica :50 3:54
##
##
##
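The second summary is of the 75% simple random sample drawn above, presumably:

summary(iris[sample_index, ])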
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.200 Min. :1.100 Min. :0.100
## 1st Qu.:5.175 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.375
## Median :5.800 Median :3.000 Median :4.400 Median :1.300
## Mean :5.896 Mean :3.059 Mean :3.835 Mean :1.232
## 3rd Qu.:6.400 3rd Qu.:3.400 3rd Qu.:5.100 3rd Qu.:1.900
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species cluster
## setosa :36 1:36
## versicolor:36 2:36
## virginica :40 3:40
##
##
##
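The third summary shows a sample stratified by species, with exactly 25 rows per class. The original call is not shown; a sketch using dplyr might be:

# Hypothetical stratified sample: 25 rows from each species
iris %>%
  group_by(Species) %>%
  sample_n(25) %>%
  ungroup() %>%
  as.data.frame() %>%
  summary()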
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.400 Min. :2.200 Min. :1.200 Min. :0.100
## 1st Qu.:5.200 1st Qu.:2.800 1st Qu.:1.500 1st Qu.:0.350
## Median :5.700 Median :3.000 Median :4.500 Median :1.400
## Mean :5.844 Mean :3.125 Mean :3.765 Mean :1.215
## 3rd Qu.:6.350 3rd Qu.:3.400 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.700 Max. :4.400 Max. :6.900 Max. :2.500
## Species cluster
## setosa :25 1:25
## versicolor:25 2:22
## virginica :25 3:28
##
##
##
sys_sample <- function(N, n) {
  k <- ceiling(N / n)  # sampling interval
  r <- sample(1:k, 1)  # random starting point
  # Note: this sequence overshoots N whenever k * n > N (and strictly the
  # n + 1 should be n - 1 to yield n indices); the out-of-range indices are
  # what produce the NA rows in the summary below
  seq(r, r + k * (n + 1), k)
}
sys_index <- sys_sample(nrow(iris), nrow(iris) * 0.75)
summary(iris[sys_index, ])
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.200 Min. :1.10 Min. :0.10
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.55 1st Qu.:0.35
## Median :5.700 Median :3.000 Median :4.20 Median :1.30
## Mean :5.847 Mean :3.051 Mean :3.74 Mean :1.18
## 3rd Qu.:6.400 3rd Qu.:3.250 3rd Qu.:5.10 3rd Qu.:1.80
## Max. :7.900 Max. :4.400 Max. :6.70 Max. :2.50
## NA's :39 NA's :39 NA's :39 NA's :39
## Species cluster
## setosa :25 1 :25
## versicolor:25 2 :24
## virginica :25 3 :26
## NA's :39 NA's:39
##
##
##
There are two major assumptions that we work with when doing these training/test splits:
set.seed(123)
x <- rnorm(100, 2, 1)
y <- exp(x) + rnorm(5, 0, 2)  # note: the length-5 noise vector is recycled across all 100 points
qplot(x, y) + theme_minimal() + geom_smooth(method = "lm", se = FALSE)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.6481 -3.7122 -1.9390 0.9698 29.8283
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -13.6323 1.6335 -8.345 4.63e-13 ***
## x 11.9801 0.7167 16.715 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.51 on 98 degrees of freedom
## Multiple R-squared: 0.7403, Adjusted R-squared: 0.7377
## F-statistic: 279.4 on 1 and 98 DF, p-value: < 2.2e-16
data_linear <- data.frame(x, y)
data_samples <- sample(1:nrow(data_linear),
nrow(data_linear) * 0.7, replace = FALSE)
training_data <- data_linear[data_samples, ]
test_data <- data_linear[-data_samples, ]
train_linear <- lm(y ~ x, training_data)
train_output <- predict(train_linear, test_data)
\[ RMSE = \sqrt{\frac{1}{n} \sum (y_{predicted} - y_{actual})^2} \]
RMSE_df <- data.frame( predicted = train_output, actual = test_data$y,
SE = ((train_output - test_data$y)^2) / length(train_output))
head(RMSE_df)
## predicted actual SE
## 2 7.874300 6.383579 0.07407499
## 3 28.504227 34.624423 1.24855995
## 4 11.341893 7.233768 0.56255641
## 5 12.019753 6.505638 1.01351529
## 12 14.678243 11.102747 0.42613909
## 15 4.118657 2.335049 0.10604193
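The RMSE itself is the square root of the summed SE column (the SE values are already scaled by n), using the same call as for the later models:

sqrt(sum(RMSE_df$SE))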
## [1] 6.946493
# Note: inside a formula, x^2 does not create a quadratic term (^ denotes
# interaction, and x:x collapses to x), so this fits the same model as y ~ x;
# use I(x^2) for a true quadratic. That is why the RMSE below matches the
# linear model's.
train_quadratic <- lm(y ~ x^2 + x, data = training_data)
quadratic_output <- predict(train_quadratic, test_data)
RMSE_quad_df <- data.frame( predicted = quadratic_output, actual = test_data$y,
SE = ((quadratic_output - test_data$y)^2) / length(train_output))
sqrt(sum(RMSE_quad_df$SE))
## [1] 6.946493
train_polyn <- lm(y ~ poly(x, 4), data = training_data)
polyn_output <- predict(train_polyn, test_data)
RMSE_poly_df <- data.frame( predicted = polyn_output, actual = test_data$y,
SE = ((polyn_output - test_data$y)^2) / length(train_output))
sqrt(sum(RMSE_poly_df$SE))
## [1] 0.8836878
\[ MAE = \frac{1}{n}\sum|y_{predicted} - y_{actual}| \]
#### RRSE
Root relative squared error:
\[ RRSE = \sqrt{\frac{\sum(y_{predicted}-y_{actual})^2}{\sum(\bar{y}_{predicted}-y_{actual})^2}} \]
#### RAE
Relative absolute error:
\[ RAE = \frac{\sum|y_{predicted}-y_{actual}|}{\sum|\bar{y}_{predicted}-y_{actual}|} \]
where \(\bar{y}_{predicted}\) is the average value of our model output, which is just a scalar number.
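Using the linear model's test predictions from earlier, these metrics can be computed directly; a sketch that follows the formulas above:

# Error metrics for the linear model's test predictions
MAE  <- mean(abs(train_output - test_data$y))
RRSE <- sqrt(sum((train_output - test_data$y)^2) /
             sum((mean(train_output) - test_data$y)^2))
RAE  <- sum(abs(train_output - test_data$y)) /
        sum(abs(mean(train_output) - test_data$y))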
iris_df <- iris[, -6] %>%
mutate(Species = as.character(Species),
Species = if_else(Species != "setosa", "other", Species),
Species = as.factor(Species))
iris_sample <- sample(1:nrow(iris_df), nrow(iris_df) * 0.7,
replace = FALSE)
train_iris <- iris_df[iris_sample, ]
test_iris <- iris_df[-iris_sample, ]
iris_rf <- randomForest(Species ~ ., data = train_iris)
iris_prediction <- predict(iris_rf, test_iris)
table(iris_prediction, test_iris$Species)
##
## iris_prediction other setosa
## other 34 0
## setosa 0 11
example <- table(iris_prediction, test_iris$Species)
example[1,1] <- 28
example[1,2] <- 3
example[2,1] <- 2
example[2,2] <- 12
example
##
## iris_prediction other setosa
## other 28 3
## setosa 2 12
TP = 15, TN = 26, FP = 2, FN = 3
Sensitivity (equivalent to hit rate, or recall): This applies when you have a lower threshold set for your classification model. You would set a lower bar if you didn't want to miss out on any plants that could possibly be of a "setosa" type.
\[ Sensitivity = \frac{TP}{TP + FN} \]
Specificity: Logically the same thing as sensitivity, but for the opposite case: predicting whether a plant isn't a "setosa" variant.
\[ Specificity = \frac{TN}{TN + FP} \]
Precision (or positive predictive value): The number of correctly predicted positive cases divided by the total number of predicted positives. If you had a model with very high precision, that would be akin to setting a threshold in your model to say, "Only classify a plant as 'setosa' if we are absolutely sure about it."
\[ Precision = \frac{TP}{TP + FP} \]
Accuracy: The number of correct predictions divided by the total number of cases \[ Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \]
F1 score: The harmonic mean of the precision and recall scores
\[ F1 = \frac{2TP}{2TP + FP + FN} \]
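As a quick check, all five metrics can be computed directly from the counts given above:

# Metric calculations from TP = 15, TN = 26, FP = 2, FN = 3
TP <- 15; TN <- 26; FP <- 2; FN <- 3
sensitivity <- TP / (TP + FN)                    # 0.833
specificity <- TN / (TN + FP)                    # 0.929
precision   <- TP / (TP + FP)                    # 0.882
accuracy    <- (TP + TN) / (TP + TN + FP + FN)   # 0.891
f1          <- 2 * TP / (2 * TP + FP + FN)       # 0.857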
Cross-validation is a statistical technique by which you take your entire dataset, split it into a number of small train/test chunks, evaluate the error for each chunk, and then average those final errors. This approach winds up being a more accurate way of assessing whether your modeling approach has any issues that could be hidden in various combinations of the training and test parts of the dataset.
This involves taking your dataset and splitting it into k chunks. For each of these chunks, you then split the data into a smaller train/test set and then evaluate that individual chunk's error. After you have all the errors for all the chunks, you simply take the average. The advantage of this method is that you can then see the error in all aspects of your data, instead of just testing on one specific subset of it.
set.seed(2)
x <- rnorm(100, 2, 1)
y <- exp(x) + rnorm(5, 0, 2)
data <- data.frame(x, y)
# Shuffle the rows (not strictly needed here, since the simulated rows are
# already in random order; note that the folds below index the original data)
data_shuffled <- data[sample(nrow(data)), ]
folds <- cut(seq(1, nrow(data)), breaks = 10, labels = FALSE)
errors <- c(0)  # placeholder first element; the fold errors occupy positions 2:11
for (i in 1:10) {
fold_indexes <- which(folds == i, arr.ind = TRUE)
test_data <- data[fold_indexes, ]
training_data <- data[-fold_indexes, ]
train_linear <- lm(y ~ x, training_data)
train_output <- predict(train_linear, test_data)
errors <- c(errors,
sqrt(sum((train_output - test_data$y)^2/length(train_output))))
}
errors[2:11]
## [1] 7.091023 8.244583 11.774470 5.571905 9.145773 8.569813 9.325125
## [8] 6.289492 7.452926 7.116796
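The average of these fold errors follows. Matching the value below, the call appears to be mean(errors) over the full vector, which includes the placeholder 0 from the initialization and therefore slightly understates the average (mean(errors[2:11]) is about 8.06):

mean(errors)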
## [1] 7.325628