Libraries used throughout; the comments indicate which functions each library provides.
library(ISLR) #book library
library(ggplot2) #plots
library(MASS) #lda(), qda()
library(class) #knn()
library(DMwR2) #book library (DATA MINING WITH R) kNN() (knn() from library(class) with formula argument)
library(caret) #train(), #trainControl() #createDataPartition()
library(magrittr) # chaining operator %>%
library(dplyr) #glimpse()
library(gridExtra) #grid.arrange()
library(ggpubr) #ggarrange()
library(tidyverse) #ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, forcats
library(reshape2) #melt()
library(RColorBrewer) #colorRampPalette()
Using a little bit of algebra, prove that (4.2) is equivalent to (4.3). In other words, the logistic function representation and logit representation for the logistic regression model are equivalent.
Equation (4.2) is:
\[ p(X) = \frac{e^{\beta_0+\beta_1X}}{1+e^{\beta_0+\beta_1X}} \]
Multiplying both sides by the denominator and expanding gives:
\[p(X) + p(X) e^{\beta_0+\beta_1X} = e^{\beta_0+\beta_1X} \]
Isolating \(p(X)\):
\[ p(X) = e^{\beta_0+\beta_1X} - p(X) e^{\beta_0+\beta_1X} \]
Factoring out the exponential:
\[ p(X) = e^{\beta_0+\beta_1X}(1- p(X)) \]
Dividing both sides by \(1 - p(X)\) isolates the exponential:
\[ \frac{p(X)}{1- p(X)} = e^{\beta_0+\beta_1X} \]
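which is the odds (logit) representation of Equation (4.3). As a quick numerical sanity check (a minimal sketch; the coefficient values below are arbitrary):
beta0 <- -1; beta1 <- 0.5; x <- 2                        # arbitrary values
p <- exp(beta0 + beta1*x) / (1 + exp(beta0 + beta1*x))   # Equation (4.2)
p / (1 - p)                                              # odds, left-hand side of (4.3)
exp(beta0 + beta1*x)                                     # right-hand side of (4.3): same value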
It was stated in the text that classifying an observation to the class for which (4.12) is largest is equivalent to classifying an observation to the class for which (4.13) is largest. Prove that this is the case. In other words, under the assumption that the observations in the \(k\)th class are drawn from a \(N(μ_k, σ^2)\) distribution, the Bayes’ classifier assigns an observation to the class for which the discriminant function is maximized.
Equation 4.12 is expressed by:
\[ \begin{align*} p_k(x) & = \frac{\pi_k \frac{1}{\sigma\sqrt{2\pi}} exp({-\frac{1}{2\sigma^2} (x-\mu_k)^2})} {\sum_{l=1}^{K} \pi_l \frac{1}{\sigma\sqrt{2\pi}} exp({-\frac{1}{2\sigma^2} (x-\mu_l)^2})} \\ \end{align*} \]
Since the constant \(\frac{1}{\sigma\sqrt{2\pi}}\) appears in the numerator and in every term of the denominator, it cancels:
\[ p_k(x) = \frac{\frac{1}{\sigma\sqrt{2\pi}} \pi_k exp({-\frac{1}{2\sigma^2} (x-\mu_k)^2})} { \frac{1}{\sigma\sqrt{2\pi}} \sum_{l=1}^{K} \pi_l exp({-\frac{1}{2\sigma^2} (x-\mu_l)^2})} \]
\[ p_k(x) = \frac{\pi_k exp({-\frac{1}{2\sigma^2} (x-\mu_k)^2})} { \sum_{l=1}^{K} \pi_l exp({-\frac{1}{2\sigma^2} (x-\mu_l)^2})} \]
Calling \(\alpha_k = -\frac{1}{2\sigma^2} (x-\mu_k)^2\):
\[ p_k(x) = \frac{\pi_k e^{\alpha_k}} { \sum_{l=1}^{K} \pi_l e^{\alpha_l}} \]
Taking the natural logarithm of both sides (the base must be \(e\) so that \(\log{e^{\alpha_k}} = \alpha_k\) in a later step):
\[ \log{(p_k(x))} = \log{\left( \frac{\pi_k e^{\alpha_k}} { \sum_{l=1}^{K} \pi_l e^{\alpha_l}} \right)} \]
The notation is kept as \(\log\) to agree with the book, but it denotes the natural logarithm.
Applying the logarithm rules:
\[ \log{(p_k(x))} = \log{ ( \pi_k e^{\alpha_k})} - \log{ \left( \sum_{l=1}^{K} \pi_l e^{\alpha_l} \right) } \]
\[ \log{(p_k(x))} = \log{(\pi_k)} + \log{e^{\alpha_k}} - \log{\left( \sum_{l=1}^{K} \pi_l e^{\alpha_l} \right)} \]
\[ \log{(p_k(x))} = \log{(\pi_k)} + \alpha_k - \log{\left( \sum_{l=1}^{K} \pi_l e^{\alpha_l} \right)} \]
The last term, \(\log\left(\sum_{l=1}^{K} \pi_l e^{\alpha_l}\right)\), is the same for every class, so maximizing \(p_k(x)\) is equivalent to maximizing the remaining class-dependent terms, which are collected in a new variable \(\delta_k\):
\[\delta_k = \log{(\pi_k)} + \alpha_k\]
Plugging in the value of \(\alpha_k\):
\[\delta_k = \log{(\pi_k)} -\frac{1}{2\sigma^2} (x-\mu_k)^2\]
Expanding the quadratic term:
\[\delta_k = \log{(\pi_k)} -\frac{1}{2\sigma^2} (x^2 - 2x\mu_k + \mu_k^2)\]
Rearranging the expression:
\[\delta_k = \log{(\pi_k)} - \frac{x^2}{2\sigma^2} + \frac{x\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2}\]
Dropping \(-\frac{x^2}{2\sigma^2}\), which does not depend on the class \(k\), leaves only the class-dependent terms and yields Equation 4.13:
\[\delta_k = x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log{(\pi_k)} \]
One can observe that \(\delta_k\) is linear in \(x\), hence the name linear discriminant analysis (LDA).
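A minimal numerical sketch (with made-up parameters) confirming that ranking classes by the full posterior (4.12) or by the linear discriminant (4.13) gives the same classification:
mu    <- c(-1, 2)      # made-up class means mu_k
prior <- c(0.3, 0.7)   # made-up priors pi_k
sigma <- 1.5           # shared standard deviation
x     <- 0.4           # observation to classify
post  <- prior * dnorm(x, mean = mu, sd = sigma)              # numerator of (4.12); the denominator is common to all classes
delta <- x * mu / sigma^2 - mu^2 / (2 * sigma^2) + log(prior) # discriminant (4.13)
which.max(post)   # both return the same class index
which.max(delta)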
This problem relates to the QDA model, in which the observations within each class are drawn from a normal distribution with a class-specific mean vector and a class-specific covariance matrix. We consider the simple case where \(p = 1\); i.e. there is only one feature.
Suppose that we have \(K\) classes, and that if an observation belongs to the \(k\)th class then \(X\) comes from a one-dimensional normal distribution, \(X ∼ N(μ_k, σ_k^2)\). Recall that the density function for the one-dimensional normal distribution is given in (4.11). Prove that in this case, the Bayes’ classifier is not linear. Argue that it is in fact quadratic.
Hint: For this problem, you should follow the arguments laid out in Section 4.4.2, but without making the assumption that \(σ_1^2 = . . . = σ_K^2\).
In this case the variance, \(\sigma_k^2\), is allowed to vary with the class \(k\) being evaluated in the density function \(f_k(x)\).
Therefore, unlike the previous case, there will be \(\sigma_k\) and \(\sigma_l\) terms instead of a single \(\sigma\).
The density function \(f_k(x)\) for the class \(k\) is expressed as:
\[ f_k(x) = \frac{1}{\sigma_k\sqrt{2\pi}} exp\left({\frac{-1}{2\sigma_k^2} (x-\mu_k)^2}\right) \]
Note the index on \(\sigma\);
When plugged into Bayes' theorem:
\[ p_k(x) = \frac{\pi_k \cdot f_k(x)}{\sum_{l=1}^{K} \pi_l \cdot f_l(x)} \]
The result is similar to the previous exercise, but with the \(\sigma \implies \sigma_k\) modification:
\[ \begin{align*} p_k(x) & = \frac{\pi_k \frac{1}{\sigma_k\sqrt{2\pi}} exp\left({\frac{-1}{2\sigma_k^2} (x-\mu_k)^2}\right)} {\sum_{l=1}^{K} \pi_l \frac{1}{\sigma_l\sqrt{2\pi}} exp\left({\frac{-1}{2\sigma_l^2} (x-\mu_l)^2}\right)} \\ \end{align*} \]
Canceling the square root term:
\[ p_k(x) = \frac{\frac{1}{\sqrt{2\pi}} \frac{\pi_k}{\sigma_k} exp\left({\frac{-1}{2\sigma_k^2} (x-\mu_k)^2}\right)} {\frac{1}{\sqrt{2\pi}} \sum_{l=1}^{K} \frac{\pi_l}{\sigma_l} exp\left({\frac{-1}{2\sigma_l^2} (x-\mu_l)^2}\right)} \]
\[ p_k(x) = \frac{ \frac{\pi_k}{\sigma_k} exp\left({\frac{-1}{2\sigma_k^2} (x-\mu_k)^2}\right)} { \sum_{l=1}^{K} \frac{\pi_l}{\sigma_l} exp\left({\frac{-1}{2\sigma_l^2} (x-\mu_l)^2}\right)} \]
Taking \(\ln{(p_k(x))}\) and writing
\(\alpha_k = -\frac{1}{2\sigma_k^2} (x-\mu_k)^2\),
\(\alpha_l = -\frac{1}{2\sigma_l^2} (x-\mu_l)^2\), we have:
\[ p_k(x) = \frac{ \frac{\pi_k}{\sigma_k} e^{\alpha_k}} { \sum_{l=1}^{K} \frac{\pi_l}{\sigma_l} e^{\alpha_l}} \]
\[ \ln{p_k(x)} = \ln\left( \frac{ \frac{\pi_k}{\sigma_k} e^{\alpha_k}} { \sum_{l=1}^{K} \frac{\pi_l}{\sigma_l} e^{\alpha_l}} \right) \]
Doing some operations:
\[ \ln{p_k(x)} = \ln\left(\frac{\pi_k}{\sigma_k} e^{\alpha_k}\right) - \ln\left( \sum_{l=1}^{K} \frac{\pi_l}{\sigma_l} e^{\alpha_l}\right) \]
\[ \ln{p_k(x)} = \ln\left(\frac{\pi_k}{\sigma_k}\right) + \ln e^{\alpha_k} - \ln\left( \sum_{l=1}^{K} \frac{\pi_l}{\sigma_l} e^{\alpha_l}\right) \]
\[ \ln{p_k(x)} = \ln \pi_k - \ln \sigma_k + \alpha_k - \ln\left( \sum_{l=1}^{K} \frac{\pi_l}{\sigma_l} e^{\alpha_l}\right) \]
Collecting the terms that depend on the class \(k\) into a new variable \(\delta_k\), as in the previous exercise:
\[ \delta_k = \ln \pi_k - \ln \sigma_k + \alpha_k \]
Plugging in the value of \(\alpha_k\):
\[ \delta_k = \ln \pi_k - \ln \sigma_k - \frac{(x-\mu_k)^2}{2\sigma_k^2} \]
Expanding the quadratic term:
\[ \delta_k = \ln \pi_k - \ln \sigma_k - \frac{x^2 -2x\mu_k + \mu_k^2}{2\sigma_k^2} \]
\[ \delta_k = \ln \pi_k - \ln \sigma_k - \frac{x^2}{2\sigma_k^2} + \frac{x\mu_k}{\sigma_k^2} - \frac{\mu_k^2}{2\sigma_k^2} \]
Rearranging the terms following the \(ax^2 + bx + c\) pattern:
\[ \delta_k = \left(-\frac{1}{2\sigma_k^2}\right)x^2 + \left(\frac{\mu_k}{\sigma_k^2}\right)x + \left(\ln \pi_k - \ln \sigma_k - \frac{\mu_k^2}{2\sigma_k^2}\right) \]
As one can see, the discriminant is not linear in \(x\) but quadratic, hence QDA;
The Bayes classifier is therefore not linear whenever each class has its own variance (or, in higher dimensions, its own covariance matrix), i.e. when \(X \sim N(\mu_k, \sigma_k^2)\);
When the number of features \(p\) is large, there tends to be a deterioration in the performance of KNN and other local approaches that perform prediction using only observations that are near the test observation for which a prediction must be made. This phenomenon is known as the curse of dimensionality, and it ties into the fact that non-parametric approaches often perform poorly when p is large. We will now investigate this curse.
Suppose that we have a set of observations, each with measurements on \(p = 1\) feature, \(X\). We assume that \(X\) is uniformly (evenly) distributed on [0, 1]. Associated with each observation is a response value. Suppose that we wish to predict a test observation’s response using only observations that are within 10 % of the range of \(X\) closest to that test observation. For instance, in order to predict the response for a test observation with \(X = 0.6\), we will use observations in the range [0.55, 0.65]. On average, what fraction of the available observations will we use to make the prediction?
Let \(r\) represent the fraction of the available observations that can be used to make the prediction.
One can express this fraction as:
\[ r = \frac{k}{N} \]
Where:
\(k\): the number of observations closest to the test observation that one wants to predict or classify;
\(N\): the total number of observations.
The question says that we must use observations that are within \(10\%\) of the range of \(X\) closest to the test observation. This can be represented as a length \(e_p\), where \(p\) is the number of features.
Therefore, for one-dimension, \(p=1\):
\[ e_p = e_1 = 10\% = 0.1 \]
The one-dimensional scenario can be imagined as a line segment over which the observations are spread uniformly.
Hence, the question is: what fraction \(r\) of the available observations falls, on average, within a range of length \(e_1\)?
Since the observations are uniformly distributed on the unit interval, the expected fraction equals the length of the range \(e_p\):
\[ e_p = r \]
\[ 0.1 = r \]
\[ r = 10\% \]
The fraction available is therefore \(10\%\)
Now suppose that we have a set of observations, each with measurements on \(p = 2\) features, \(X_1\) and \(X_2\). We assume that \((X_1, X_2)\) are uniformly distributed on [0, 1] \(×\) [0, 1]. We wish to predict a test observation’s response using only observations that are within 10 % of the range of \(X_1\) and within 10 % of the range of \(X_2\) closest to that test observation. For instance, in order to predict the response for a test observation with \(X_1\) = 0.6 and \(X_2\) = 0.35, we will use observations in the range [0.55, 0.65] for \(X_1\) and in the range [0.3, 0.4] for \(X_2\). On average, what fraction of the available observations will we use to make the prediction?
Now a two-dimension scenario is required;
Instead of a straight line, the two-dimensional case can be imagined as a square over which the observations are spread;
The difference now is that the neighborhood is an area, the product of the ranges of the two predictors;
Since each range is still \(10\%\) of the total range \([0,1]\):
\[ e_{X_1} = [0.55, 0.65] \implies 0.65 - 0.55 = 0.1\\ e_{X_2} = [0.30, 0.40] \implies 0.40 - 0.30 = 0.1 \]
The fraction \(r\) can be calculated as:
\[ e_{X_1} \times e_{X_2} = r \]
\[ 10\% \times 10\% = r \]
\[ r = 10\% \times 10\% \implies r = 0.1 \times 0.1 \implies r = 0.01 \implies r = 1\% \]
The result is lower than in the one-predictor case: only \(1\%\) of the data is available, meaning the observations are much farther apart relative to the desired range of \(0.1\) in each dimension.
Now suppose that we have a set of observations on \(p\) = 100 features. Again the observations are uniformly distributed on each feature, and again each feature ranges in value from 0 to 1. We wish to predict a test observation’s response using observations within the 10 % of each feature’s range that is closest to that test observation. What fraction of the available observations will we use to make the prediction?
This is a limiting case showing that with 100 features, i.e. in 100 dimensions, the observations are so far apart that essentially no data fall within the desired range.
\[ e_{X_1} \times e_{X_2} \times ... \times e_{X_{100}} = r \]
\[ \prod_{i=1}^{100} e_{X_i} = r\]
As all the ranges are \(10\% = 0.1\):
\[ r = 0.1^{100} \implies r \approx 0 \]
This number is so small that, in practice, no observations are available: the data points are simply too far from one another in \(p=100\) dimensions.
Using your answers to parts (a)–(c), argue that a drawback of KNN when \(p\) is large is that there are very few training observations “near” any given test observation.
As the answers expressed:
\[ (a) \implies p=1 \implies 10\% \\ (b) \implies p=2 \implies 1\% \\ (c) \implies p=100 \implies 0.1^{100} \approx 0\% \]
When the number of predictors is large, even in the simple case of 2 predictors, requiring observations within \(10\%\) of each predictor's range leaves only \(1\%\) of the data inside the neighborhood; as \(p\) grows, very few training observations remain “near” any given test observation.
Now suppose that we wish to make a prediction for a test observation by creating a \(p\)-dimensional hypercube centered around the test observation that contains, on average, 10 % of the training observations. For \(p\) = 1, 2, and 100, what is the length of each side of the hypercube? Comment on your answer.
Note: A hypercube is a generalization of a cube to an arbitrary number of dimensions. When \(p\) = 1, a hypercube is simply a line segment, when \(p\) = 2 it is a square, and when \(p\) = 100 it is a 100-dimensional cube.
This item is asking to calculate \(e_p\) based on \(r\) for each case:
For inputs uniformly distributed in a \(p\)-dimensional unit hypercube, the expected edge length needed for a neighborhood to capture a fraction \(r\) of the observations is:
\[ e_p(r) = r^{1/p} \]
And for each case:
\[ e_1(10\%) = 0.1^{1/1} = 0.1 \\ e_2(10\%) = 0.1^{1/2} \approx 0.316 \\ e_{100}(10\%) = 0.1^{1/100} \approx 0.977 \]
The result shows that, as the number of dimensions increases, ensuring that \(10\%\) of the data are used to predict the test observation requires each side of the hypercube (a segment in one dimension, a square in two) to grow to nearly the full extent of the domain, a consequence of the observations being so spread out over it.
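These quantities can be checked with a couple of lines of R (a small sketch, not part of the original solution):
p    <- c(1, 2, 100)
frac <- 0.1^p       # parts (a)-(c): fraction of observations within 10% of each feature's range
edge <- 0.1^(1/p)   # part (e): hypercube side length needed to capture 10% of the data
data.frame(p, frac, edge)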
We now examine the differences between LDA and QDA.
If the Bayes decision boundary is linear, do we expect LDA or QDA to perform better on the training set? On the test set?
If the Bayes decision boundary is linear, a high-bias, low-variance method suits the problem;
On the training set, the more flexible QDA is expected to fit at least as well as LDA, since its extra flexibility also lets it chase noise in the training data;
On the test set, however, LDA is expected to perform better: its linear form matches the true boundary, so QDA's additional variance brings no benefit;
QDA would be the better choice only when a lower-bias, higher-variance fit is actually needed;
If the Bayes decision boundary is non-linear, do we expect LDA or QDA to perform better on the training set? On the test set?
Following the same reasoning, QDA is expected to perform better on both the training and the test set, since its lower bias (and higher variance) is needed to capture a non-linear boundary.
In general, as the sample size \(n\) increases, do we expect the test prediction accuracy of QDA relative to LDA to improve, decline, or be unchanged? Why?
A very large sample size is beneficial for QDA;
In that case, QDA would be expected to achieve better test prediction accuracy than LDA, so the relative performance of QDA improves as \(n\) grows;
With a very large sample, the variance of the classifier is no longer a major concern, and the more flexible model is free to follow the data;
LDA, on the other hand, tends to do better than QDA when few observations are available and keeping the variance low is crucial;
With the variance kept under control by the large sample, the extra flexibility translates into better prediction accuracy.
True or False: Even if the Bayes decision boundary for a given problem is linear, we will probably achieve a superior test error rate using QDA rather than LDA because QDA is flexible enough to model a linear decision boundary. Justify your answer.
False;
A linear decision boundary is better estimated by the higher-bias, lower-variance model (LDA);
Using the higher-variance QDA instead is not recommended here: even with a large sample, QDA would only be approximating a line while adding unnecessary variance, so it is unlikely to beat a model that is a line to begin with;
With a small sample the case for the linear, high-bias model is even stronger, since QDA's variance then dominates the test error.
Suppose we collect data for a group of students in a statistics class with variables \(X_1\) = hours studied, \(X_2\) = undergrad GPA, and \(Y\) = receive an A. We fit a logistic regression and produce estimated coefficients \(\hat\beta_0 = -6\), \(\hat\beta_1 = 0.05\), \(\hat\beta_2 = 1\).
Estimate the probability that a student who studies for 40 h and has an undergrad GPA of 3.5 gets an A in the class.
The logistic regression for the problem is expressed below:
\[ p(Y = \textrm{receive an A |} X_1 = 40 \textrm{h and } X_2 = 3.5 \textrm{ GPA}) = \frac{ e^{ (\beta_0 + \beta_1X_1 + \beta_2X_2) } }{1 + e^{ (\beta_0 + \beta_1X_1 + \beta_2X_2) } } \]
Plugging in the numbers:
\[ p(Y) = \frac{ e^{ (-6 + 0.05 \times 40 + 1 \times 3.5) } }{1 + e^{ (-6 + 0.05 \times 40 + 1 \times 3.5) } } \]
\[ p(Y) \approx 0.378 \]
How many hours would the student in part (a) need to study to have a 50 % chance of getting an A in the class?
Same idea as before:
\[ p(Y = 0.5 \textrm{ | } X_1 = \textrm{how many hours to receive an A and } X_2 = 3.5 \textrm{ GPA}) = \frac{ e^{ (\beta_0 + \beta_1X_1 + \beta_2X_2) } }{1 + e^{ (\beta_0 + \beta_1X_1 + \beta_2X_2) } } \]
Plugging in the numbers and making the calculations:
\[ 0.5 = \frac{ e^{(-6 + 0.05X_1 + 1 \times 3.5)} }{1 + e^{(-6 + 0.05X_1 + 1 \times 3.5)} } \]
\[ 0.5 + 0.5e^{(0.05X_1 - 2.5)} = e^{(0.05X_1 - 2.5)} \]
\[ 0.5 = (1-0.5)e^{(0.05X_1 - 2.5)} \]
\[ 1 = e^{(0.05X_1 - 2.5)} \]
\[ \ln{e^0} = \ln{e^{(0.05X_1 - 2.5)}} \]
\[ 0.05X_1 - 2.5 = 0 \implies X_1 = 50 \textrm{ hours} \]
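Both parts can be checked quickly in R with plogis(), the logistic function (a minimal sketch):
# part (a): probability of an A with 40 hours of study and a 3.5 GPA
plogis(-6 + 0.05*40 + 1*3.5)   # about 0.38
# part (b): hours needed for a 50% chance (the logit must equal 0)
(0 + 6 - 1*3.5) / 0.05         # 50 hours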
Suppose that we wish to predict whether a given stock will issue a dividend this year (“Yes” or “No”) based on \(X\), last year's percent profit. We examine a large number of companies and discover that the mean value of \(X\) for companies that issued a dividend was \(\bar{X} = 10\), while the mean for those that did not was \(\bar{X} = 0\). In addition, the variance of \(X\) for these two sets of companies was \(\hat\sigma^2 = 36\). Finally, 80% of companies issued dividends. Assuming that \(X\) follows a normal distribution, predict the probability that a company will issue a dividend this year given that its percentage profit was \(X = 4\) last year.
Hint: Recall that the density function for a normal random variable is \(f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}\). You will need to use Bayes' theorem.
Analyzing the problem, one can infer that this is an LDA setting with two classes sharing the same variance;
The problem can be expressed in the equation below;
\[ p_k(Y = \textrm{"Yes" or "No" | } X = \textrm{last year’s percent profit} ) = \frac{\pi_k \cdot f_k(x)}{\sum_{l=1}^{2} \pi_l \cdot f_l(x)} \]
Where “Yes” or “No” are referring to the stock probability to issue a dividend or not;
Therefore, the variables will be indexed by their respective classes, Yes or No;
The question states that the density function \(f_k(x)\) for the predictor \(X\) follows a normal distribution with the following parameters:
\(\mu_{Yes} = 10\), mean value of companies that issued a dividend;
\(\mu_{No} = 0\), mean value for companies that did not issue a dividend;
\(\sigma^2_{Yes} = \sigma^2_{No} = \sigma^2 = 36\), variance of \(X\) for the two classes (so \(\sigma = 6\));
\(\pi_{Yes} = 80\% = 0.8\), prior probability that a company issues a dividend;
\(\pi_{No} = 20\% = 0.2\), prior probability that a company does not issue a dividend;
The probability for a company, with last year percent profit of \(X = 4\), to issue a dividend can be then expressed as:
\[ p_k(Y = \textrm{"Yes" | } X = 4) = \frac{\pi_{Yes} \cdot f_{Yes}(4)} {\pi_{Yes} \cdot f_{Yes}(4) + \pi_{No} \cdot f_{No}(4)} \]
The density function \(f_{Yes}(4)\) can be expressed as:
\[ f_{Yes}(4) = \frac{1}{\sigma\sqrt{2\pi}} exp\left({\frac{-1}{2\sigma^2} (x-\mu_{Yes})^2}\right) \]
Plugging in the numbers:
\[ f_{Yes}(4) = \frac{1}{6\sqrt{2\pi}} exp\left({\frac{-1}{2\times36} (4-10)^2}\right) \implies f_{Yes}(4) = 0.040325 \]
For the density function \(f_{No}(4)\):
\[ f_{No}(4) = \frac{1}{\sigma\sqrt{2\pi}} exp\left({\frac{-1}{2\sigma^2} (x-\mu_{No})^2}\right) \]
Plugging in the numbers again:
\[ f_{No}(4) = \frac{1}{6\sqrt{2\pi}} exp\left({\frac{-1}{2\times36} (4-0)^2}\right) \implies f_{No}(4) = 0.05325 \]
Back to the Bayes equation and replacing the variables:
\[ p_{Yes}(Y = \textrm{"Yes" | } X = 4) = \frac{0.8 \cdot 0.040325} {0.8 \cdot 0.040325 + 0.2 \cdot 0.05325} \implies p_{Yes} = 0.752 = 75.2\% \]
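The same computation in R, using dnorm() for the class densities (a minimal sketch):
f.yes <- dnorm(4, mean = 10, sd = 6)        # f_Yes(4), with sigma = sqrt(36) = 6
f.no  <- dnorm(4, mean = 0,  sd = 6)        # f_No(4)
0.8 * f.yes / (0.8 * f.yes + 0.2 * f.no)    # posterior Pr(Yes | X = 4), about 0.752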
Suppose that we take a data set, divide it into equally-sized training and test sets, and then try out two different classification procedures. First we use logistic regression and get an error rate of \(20\%\) on the training data and \(30\%\) on the test data. Next we use 1-nearest neighbors (i.e. \(K = 1\)) and get an average error rate (averaged over both test and training data sets) of \(18\%\). Based on these results, which method should we prefer to use for classification of new observations? Why?
At first glance the 1-nearest-neighbors classifier might seem preferable, since its reported average error rate (\(18\%\)) is lower;
However, the logistic regression is actually preferable;
The logistic regression (like LDA) is characterized by a parametric, linear decision boundary, which makes it a less flexible method;
KNN, on the other hand, is a non-parametric approach that makes no assumption about the shape of the decision boundary, so it can be a good option when the boundary is highly non-linear;
QDA can be seen as a compromise between the two: more flexible than a linear method, and often better than KNN when the number of observations is limited;
In the problem, the logistic regression appears to perform worse than KNN with \(K=1\) when the reported error rates are compared at face value;
This is meant to confuse the reader: 1-nearest neighbors always has a training error equal to zero;
Such behavior is due to the fact that, when \(K=1\), the nearest neighbor of each training observation is the observation itself;
Therefore, the training-set classification never fails: the decision boundary simply wraps around each point whenever no observation of the same class is nearby;
The figure below shows this behavior for a 2 class problem (red and blue classes);
\(K=1\) training data behavior
One might observe “islands” of blue and red observations. This means that the classifier made no mistakes in the fit of the decision boundary;
Then, the decision boundary will be:
Example of decision boundary for \(K=1\)
That being said, since the question states that the average error rate over both the training and test sets was \(18\%\), one can write:
\[ \bar\epsilon_{\textrm{rate}} = \frac{ \epsilon_{\textrm{training}} + \epsilon_{\textrm{test}} }{2} \]
As:
\(\bar\epsilon_{\textrm{rate}} = 18\%\)
\(\epsilon_{\textrm{training}} = 0\)
Plugging in the numbers;
\[ 18\% = \frac{ 0 + \epsilon_{\textrm{test}} }{2} \implies \epsilon_{\textrm{test}} = 36\%\]
Which is greater than the value for the logistic regression (\(\epsilon_{\textrm{test}} = 30\%\));
Hence, it is preferable to use the logistic regression over the 1-nearest-neighbors classifier.
This problem has to do with odds.
On average, what fraction of people with an odds of 0.37 of defaulting on their credit card payment will in fact default?
\[ \frac{p(X)}{1- p(X)} = \textrm{odds} \]
\[ \frac{p(X)}{1- p(X)} = 0.37 \implies p(X) = \frac{0.37}{1.37} \approx 0.27 \]
On average, about \(27\%\) of these people will in fact default.
Suppose that an individual has a 16 % chance of defaulting on her credit card payment. What are the odds that she will default?
\[ \frac{ 0.16 }{ 1- 0.16 } = \textrm{odds} \implies \textrm{odds} \approx 0.19 \]
This question should be answered using the Weekly data set, which is part of the ISLR package. This data is similar in nature to the Smarket data from this chapter's lab, except that it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.
Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?
The pairs() function is plotted below;
One could see that there is a pattern between Year and Volume;
This relation can be seen in the plot below:
p10a1.gg <- ggplot(data = Weekly, mapping = aes(Year,Volume))
p10a1.gg +
geom_line() +
geom_point() +
geom_smooth() +
scale_x_continuous(breaks = Weekly$Year) +
theme_light()

The plot shows that the volume of trades increased roughly monotonically for most of the years, until 2008, when the volume started to fall in both its maximum and minimum values.
Another interesting plot is the Direction vs Year, which is illustrated below
p10a2.gg <- ggplot(data = Weekly, mapping = aes(x = Year, fill = Direction))
p10a2.gg +
geom_bar(position = "fill") +
scale_x_continuous(breaks = seq(1990, 2010), minor_breaks = NULL) +
scale_y_continuous(labels = scales::percent_format()) +
theme_light() +
theme(legend.position="bottom") +
geom_hline(yintercept = 0.5) +
ylab("Direction tendency")As one can see, there are only 4 years in which the direction of the market return tendency was Down for more than 50% chance on a given week;
Looking at the two plots together, the years with a Down tendency correspond to two historical market downturns;
The first one (2000 to 2002) is not very visible in the first graph because the volume of shares traded was still comparatively small;
The second one (2007 and 2008) had a much larger impact on the following years, sharply breaking the volume's pattern of steady growth seen in the previous years.
Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?
The summary function result is described below:
f10b <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Weekly, family = binomial)
summary(f10b)
Call:
glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
Volume, family = binomial, data = Weekly)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.6949 -1.2565 0.9913 1.0849 1.4579
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.26686 0.08593 3.106 0.0019 **
Lag1 -0.04127 0.02641 -1.563 0.1181
Lag2 0.05844 0.02686 2.175 0.0296 *
Lag3 -0.01606 0.02666 -0.602 0.5469
Lag4 -0.02779 0.02646 -1.050 0.2937
Lag5 -0.01447 0.02638 -0.549 0.5833
Volume -0.02274 0.03690 -0.616 0.5377
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1496.2 on 1088 degrees of freedom
Residual deviance: 1486.4 on 1082 degrees of freedom
AIC: 1500.4
Number of Fisher Scoring iterations: 4
According to the z-statistics, among the predictors used, Lag2 is the one with the greatest statistical significance and the only one significant at the 5% level (p = 0.0296);
Lag1 comes next, though its p-value of 0.118 is not significant at conventional levels.
Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.
The confusion matrix and the overall fraction of correct predictions are described below:
f10b.probs <- predict(f10b,type = "response")
#contrasts(Direction)
f10b.pred <- rep("Down",1089)
f10b.pred[f10b.probs > .5] <- "Up"
          Direction
f10b.pred Down Up
Down 54 48
Up 430 557
[1] 0.5610652
[1] 0.5610652
The confusion matrix result showed that the fitted logistic regression model done in the previous item:
Correctly predicted:
\(54\) weeks of “Down” market direction;
\(557\) weeks of “Up” market direction;
Incorrectly predicted:
\(48\) weeks as “Down” that were actually “Up”;
\(430\) weeks as “Up” that were actually “Down”;
The overall fraction of correct predictions was \(0.56\);
This result overstates the model's real performance because it was computed on the same 1,089 observations used to fit the model;
To assess the true accuracy of the model, one can hold out some of the 1,089 observations and use them as new test data.
Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).
First we make a Boolean vector to be used in the dataframe split;
The problem says that the data from 1990 to 2008 will be the training data and the data from 2009 to 2010 the test data;
In other words:
Year \(> 2008 \implies\) test data
Year \(\leq 2008 \implies\) train data
With the Boolean vector marking the training observations as TRUE and the remaining ones as FALSE, one can use it to split the dataframe itself;
Year \(> 2008 \implies\) False (test data)
Year \(\leq 2008 \implies\) True (train data)
The ! symbol can be used to reverse all of the elements of a Boolean vector;
Creating two different dataframes, one for the training data and another for the test data.
Now the logistic regression is fitted on the training data, using only Lag2 as a predictor of the response Direction;
Note: it is not strictly necessary to use the training dataframe created with the Boolean vector; the subset argument of the glm function serves the same purpose. Creating it was still useful to verify the number of observations resulting from the split.
Then, using the test dataframe and the fitted model, the predictions can be made, as in the previous item;
Creating a vector to be used as the Direction indicator variable and using the probabilities to switch from “Down” to “Up” if:
\(\textrm{Pr} > 0.5\);
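Since the corresponding chunk is not echoed above, a sketch of the steps just described might look like the following (the object names f10d and f10d.probs are assumed; f10d.test and f10d.pred match the names used below):
f10d.boolean <- (Weekly$Year <= 2008)      # TRUE for the 1990-2008 training period
f10d.train <- Weekly[f10d.boolean,]
f10d.test  <- Weekly[!f10d.boolean,]       # held-out data: 2009 and 2010
f10d <- glm(Direction ~ Lag2, data = Weekly, family = binomial, subset = Year <= 2008)
f10d.probs <- predict(f10d, f10d.test, type = "response")
f10d.pred <- rep("Down", nrow(f10d.test))
f10d.pred[f10d.probs > .5] <- "Up"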
Using the table function to compute the confusion matrix:
f10d.pred Down Up
Down 9 5
Up 34 56
The overall fraction of correct prediction based on the confusion matrix diagonal values is then:
[1] 0.625
[1] 0.625
The result was better than the one in item (c), which was computed on the same data used to fit the model and with more predictors.
Repeat (d) using LDA.
First, we use the lda() function to fit the model using only Lag2 as a predictor for Direction;
Now, calculating the probabilities using the fitted model on the test data;
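The fitting chunk is not echoed; a sketch of these two steps (the object name f10e is assumed here; f10e.probs matches the code below) would be:
f10e <- lda(Direction ~ Lag2, data = Weekly, subset = Year <= 2008)
f10e.probs <- predict(f10e, f10d.test)   # a list with $class and $posterior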
Creating the vector to hold the predicted indicator and;
Switching the vector indicator from “Down” to “Up” if the calculated probabilities were superior to 0.5;
f10e.pred <- rep("Down", 104)
f10e.pred[f10e.probs$posterior[,2] > .5] <- "Up"
f10e.pred <- as.factor(f10e.pred)

Setting up the table to compare the test-data indicator vector with the predicted one:
f10e.pred Down Up
Down 9 5
Up 34 56
And the fraction of correct predictions is:
[1] 0.625
[1] 0.625
A more convenient function, confusionMatrix(), can be used, which will organize the relevant results;
f10e.cm <- confusionMatrix(data = f10e.pred, reference = f10d.test$Direction, positive = "Up")
f10e.cm$table Reference
Prediction Down Up
Down 9 5
Up 34 56
Accuracy
0.625
Repeat (d) using QDA.
With the same procedure from the previous item, but in this case using the qda() function to fit the model;
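As before, the fitting chunk is not echoed; it presumably resembles the following (f10f is assumed; f10f.probs matches the code below):
f10f <- qda(Direction ~ Lag2, data = Weekly, subset = Year <= 2008)
f10f.probs <- predict(f10f, f10d.test)   # $class holds the predicted labels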
f10f.pred <- rep("Down", 104)
f10f.pred[f10f.probs$class > .5] <- "Up"
f10f.pred <- as.factor(f10f.pred)
f10f.pred Down Up
Down 43 61
It is odd that only “Down” is being shown in the confusion matrix;
Taking a look at the predicted indicators;
[1] Down Down Down Down Down Down Down Down Down Down Down Down Down Down Down Down Down Down Down Down
[21] Down Down Down Down Down Down Down Down Down Down Down Down Down Down Down Down Down Down Down Down
[41] Down Down Down Down Down Down Down Down Down Down Down Down Down Down Down Down Down Down Down Down
[61] Down Down Down Down Down Down Down Down Down Down Down Down Down Down Down Down Down Down Down Down
[81] Down Down Down Down Down Down Down Down Down Down Down Down Down Down Down Down Down Down Down Down
[101] Down Down Down Down
Levels: Down
If only “Down” is being shown, it means that none of the entries were switched to the “Up” indicator; note that comparing the factor f10f.probs$class with 0.5 is not meaningful, so the threshold should be applied to the posterior probabilities instead;
Making a conditional test to see if there is any probability greater than 0.5:
any(ifelse(f10f.probs$posterior[,1]>0.5,T,F)) #returns TRUE if any "Down" posterior probability is > 0.5, FALSE otherwise
[1] FALSE
We see that this is exactly the case: no “Down” posterior probability exceeds 0.5, so QDA classifies every week as “Up”;
To overcome this, the confusionMatrix() function can be used;
f10f.cm <- confusionMatrix(data = f10f.probs$class, reference = f10d.test$Direction, positive = "Up")
f10f.cm$table Reference
Prediction Down Up
Down 0 0
Up 43 61
Accuracy
0.5865385
The fraction of correct predictions is \(0.5865\).
Repeat (d) using KNN with \(K = 1\).
Now, the knn() function will be used;
This function requires the training and test data to be in a dataframe or matrix;
f10g.train <- Weekly %>% filter(Weekly$Year<=2008) %>% select(Lag2)
f10g.test <- Weekly %>% filter(Weekly$Year>2008) %>% select(Lag2)
f10g.label <- Weekly %>% filter(Weekly$Year<=2008) %>% select(Direction)
f10g.ref <- Weekly %>% filter(Weekly$Year>2008) %>% select(Direction)

The knn() function fits the model and makes the predictions in a single step, differently from the previous methods;
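The knn() call itself is not echoed; based on the objects above and the f10g.pred object used below, it presumably resembles:
set.seed(1)   # knn() breaks ties at random, so a seed may have been set
f10g.pred <- knn(train = f10g.train, test = f10g.test, cl = f10g.label$Direction, k = 1)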
The confusion matrix for \(K=1\) is presented below:
f10g.cm <- confusionMatrix(data = f10g.pred, reference = f10g.ref$Direction, positive = "Up")
f10g.cm$table Reference
Prediction Down Up
Down 21 30
Up 22 31
Accuracy
0.5
The fraction of correct predictions is \(0.5\).
Which of these methods appears to provide the best results on this data?
Logistic regression and LDA models presented the highest fraction of correct predictions of \(0.625\).
Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for K in the KNN classifier.
Observing the p-values for the logistic regression from item (b), the best features are Lag1 and Lag2;
The feature Volume will also be included because it has a different scale than the lags and might provide interesting results;
The combinations tried were: additive, interaction, and non-linear;
The methods followed the previous items: logistic, LDA, QDA and KNN;
The classification criterion was the same as in the previous items:
\[ \textrm{Pr}(\textrm{Direction = Up | } X = x) > 0.5 \]
The chosen models were:
Model 1
\[ p_1 = \textrm{Lag1}\\ p_2 = \textrm{Lag2} \]
Model 2
\[ p_1 = \textrm{Lag1}\\ p_2 = \textrm{Lag2}\\ p_3 = \textrm{Lag1} \times \textrm{Lag2} \]
Model 3
\[ p_1 = \textrm{Lag1}\\ p_2 = \textrm{Lag2}\\ p_3 = \textrm{Lag1} \times \textrm{Lag2}\\ p_4 = \textrm{Volume} \]
Model 4
\[ p_1 = \textrm{Lag1}\\ p_2 = \textrm{Lag2}\\ p_3 = \textrm{Lag1} \times \textrm{Lag2}\\ p_4 = \textrm{Volume}^2 \]
Model 5
\[ p_1 = \textrm{Lag2}\\ p_2 = \textrm{Volume} \]
Model 6
\[ p_1 = \textrm{Lag2}\\ p_2 = \textrm{Volume}^2 \]
f10i.boolean <- (Year<=2008)
f10i.train <- Weekly[f10i.boolean,]
f10i.test <- Weekly[!f10i.boolean,]
f10i.sum <- c()

Logistic Regression
Model 1: Lag1 + Lag2
f10i1 <- glm(Direction ~ Lag1 + Lag2, data = Weekly, family = binomial, subset = Year <= 2008)
f10i1.probs <- predict(f10i1, f10i.test, type = "response")
f10i1.pred <- rep("Down",104)
f10i1.pred[f10i1.probs > .5] <- "Up"
f10i1.pred <- as.factor(f10i1.pred)
f10i1.cm <- confusionMatrix(data = f10i1.pred, reference = f10i.test$Direction, positive = "Up")
#f10i1.cm$table
f10i1.cm$overall[1] Accuracy
0.5769231
Model 2: Lag1 * Lag2
f10i2 <- glm(Direction ~ Lag1 * Lag2, data = Weekly, family = binomial, subset = Year <= 2008)
f10i2.probs <- predict(f10i2, f10i.test, type = "response")
f10i2.pred <- rep("Down",104)
f10i2.pred[f10i2.probs > .5] <- "Up"
f10i2.pred <- as.factor(f10i2.pred)
f10i2.cm <- confusionMatrix(data = f10i2.pred, reference = f10i.test$Direction, positive = "Up")
#f10i2.cm$table
f10i2.cm$overall[1] Accuracy
0.5769231
Model 3: Lag1 * Lag2 + I(Volume)
f10i3 <- glm(Direction ~ Lag1 * Lag2 + I(Volume), data = Weekly, family = binomial, subset = Year <= 2008)
f10i3.probs <- predict(f10i3, f10i.test, type = "response")
f10i3.pred <- rep("Down",104)
f10i3.pred[f10i3.probs > .5] <- "Up"
f10i3.pred <- as.factor(f10i3.pred)
f10i3.cm <- confusionMatrix(data = f10i3.pred, reference = f10i.test$Direction, positive = "Up")
#f10i3.cm$table
f10i3.cm$overall[1] Accuracy
0.5384615
Model 4: Lag1 * Lag2 + I(Volume^2)
f10i4 <- glm(Direction ~ Lag1 * Lag2 + I(Volume^2), data = Weekly, family = binomial, subset = Year <= 2008)
f10i4.probs <- predict(f10i4, f10i.test, type = "response")
f10i4.pred <- rep("Down",104)
f10i4.pred[f10i4.probs > .5] <- "Up"
f10i4.pred <- as.factor(f10i4.pred)
f10i4.cm <- confusionMatrix(data = f10i4.pred, reference = f10i.test$Direction, positive = "Up")
#f10i4.cm$table
f10i4.cm$overall[1] Accuracy
0.5480769
Model 5: Lag2 + I(Volume)
f10i5 <- glm(Direction ~ Lag2 + I(Volume), data = Weekly, family = binomial, subset = Year <= 2008)
f10i5.probs <- predict(f10i5, f10i.test, type = "response")
f10i5.pred <- rep("Down",104)
f10i5.pred[f10i5.probs > .5] <- "Up"
f10i5.pred <- as.factor(f10i5.pred)
f10i5.cm <- confusionMatrix(data = f10i5.pred, reference = f10i.test$Direction, positive = "Up")
#f10i5.cm$table
f10i5.cm$overall[1] Accuracy
0.5384615
Model 6: Lag2 + I(Volume^2)
f10i6 <- glm(Direction ~ Lag2 + I(Volume^2), data = Weekly, family = binomial, subset = Year <= 2008)
f10i6.probs <- predict(f10i6, f10i.test, type = "response")
f10i6.pred <- rep("Down",104)
f10i6.pred[f10i6.probs > .5] <- "Up"
f10i6.pred <- as.factor(f10i6.pred)
f10i6.cm <- confusionMatrix(data = f10i6.pred, reference = f10i.test$Direction, positive = "Up")
#f10i6.cm$table
f10i6.cm$overall[1] Accuracy
0.5673077
LDA
Model 1: Lag1 + Lag2
f10i7 <- lda(Direction ~ Lag1 + Lag2, data = Weekly, subset = Year <= 2008)
f10i7.probs <- predict(f10i7, f10i.test, type = "response")
f10i7.cm <- confusionMatrix(data = f10i7.probs$class, reference = f10i.test$Direction, positive = "Up")
#f10i7.cm$table
f10i7.cm$overall[1] Accuracy
0.5769231
Model 2: Lag1 * Lag2
f10i8 <- lda(Direction ~ Lag1 * Lag2, data = Weekly, subset = Year <= 2008)
f10i8.probs <- predict(f10i8, f10i.test, type = "response")
f10i8.cm <- confusionMatrix(data = f10i8.probs$class, reference = f10i.test$Direction, positive = "Up")
#f10i8.cm$table
f10i8.cm$overall[1] Accuracy
0.5769231
Model 3: Lag1 * Lag2 + I(Volume)
f10i9 <- lda(Direction ~ Lag1 * Lag2 + I(Volume), data = Weekly, subset = Year <= 2008)
f10i9.probs <- predict(f10i9, f10i.test, type = "response")
f10i9.cm <- confusionMatrix(data = f10i9.probs$class, reference = f10i.test$Direction, positive = "Up")
#f10i9.cm$table
f10i9.cm$overall[1] Accuracy
0.5384615
Model 4: Lag1 * Lag2 + I(Volume^2)
f10i10 <- lda(Direction ~ Lag1 * Lag2 + I(Volume^2), data = Weekly, subset = Year <= 2008)
f10i10.probs <- predict(f10i10, f10i.test, type = "response")
f10i10.cm <- confusionMatrix(data = f10i10.probs$class, reference = f10i.test$Direction, positive = "Up")
#f10i10.cm$table
f10i10.cm$overall[1] Accuracy
0.5384615
Model 5: Lag2 + I(Volume)
f10i11 <- lda(Direction ~ Lag2 + I(Volume), data = Weekly, subset = Year <= 2008)
f10i11.probs <- predict(f10i11, f10i.test, type = "response")
f10i11.cm <- confusionMatrix(data = f10i11.probs$class, reference = f10i.test$Direction, positive = "Up")
#f10i11.cm$table
f10i11.cm$overall[1] Accuracy
0.5384615
Model 6: Lag2 + I(Volume^2)
f10i12 <- lda(Direction ~ Lag2 + I(Volume^2), data = Weekly, subset = Year <= 2008)
f10i12.probs <- predict(f10i12, f10i.test, type = "response")
f10i12.cm <- confusionMatrix(data = f10i12.probs$class, reference = f10i.test$Direction, positive = "Up")
#f10i12.cm$table
f10i12.cm$overall[1] Accuracy
0.5673077
QDA
Model 1: Lag1 + Lag2
f10i13 <- qda(Direction ~ Lag1 + Lag2, data = Weekly, subset = Year <= 2008)
f10i13.probs <- predict(f10i13, f10i.test, type = "response")
f10i13.cm <- confusionMatrix(data = f10i13.probs$class, reference = f10i.test$Direction, positive = "Up")
#f10i13.cm$table
f10i13.cm$overall[1] Accuracy
0.5576923
Model 2: Lag1 * Lag2
f10i14 <- qda(Direction ~ Lag1 * Lag2, data = Weekly, subset = Year <= 2008)
f10i14.probs <- predict(f10i14, f10i.test, type = "response")
f10i14.cm <- confusionMatrix(data = f10i14.probs$class, reference = f10i.test$Direction, positive = "Up")
#f10i14.cm$table
f10i14.cm$overall[1] Accuracy
0.4615385
Model 3: Lag1 * Lag2 + I(Volume)
f10i15 <- qda(Direction ~ Lag1 * Lag2 + I(Volume), data = Weekly, subset = Year <= 2008)
f10i15.probs <- predict(f10i15, f10i.test, type = "response")
f10i15.cm <- confusionMatrix(data = f10i15.probs$class, reference = f10i.test$Direction, positive = "Up")
#f10i15.cm$table
f10i15.cm$overall[1] Accuracy
0.4519231
Model 4: Lag1 * Lag2 + I(Volume^2)
f10i16 <- qda(Direction ~ Lag1 * Lag2 + I(Volume^2), data = Weekly, subset = Year <= 2008)
f10i16.probs <- predict(f10i16, f10i.test, type = "response")
f10i16.cm <- confusionMatrix(data = f10i16.probs$class, reference = f10i.test$Direction, positive = "Up")
#f10i16.cm$table
f10i16.cm$overall[1] Accuracy
0.4134615
Model 5: Lag2 + I(Volume)
f10i17 <- qda(Direction ~ Lag2 + I(Volume), data = Weekly, subset = Year <= 2008)
f10i17.probs <- predict(f10i17, f10i.test, type = "response")
f10i17.cm <- confusionMatrix(data = f10i17.probs$class, reference = f10i.test$Direction, positive = "Up")
#f10i17.cm$table
f10i17.cm$overall[1] Accuracy
0.4711538
Model 6: Lag2 + I(Volume^2)
f10i18 <- qda(Direction ~ Lag2 + I(Volume^2), data = Weekly, subset = Year <= 2008)
f10i18.probs <- predict(f10i18, f10i.test, type = "response")
f10i18.cm <- confusionMatrix(data = f10i18.probs$class, reference = f10i.test$Direction, positive = "Up")
#f10i18.cm$table
f10i18.cm$overall[1] Accuracy
0.4807692
KNN
As the predictor Volume has a different scale from the others, the data were standardized to avoid this issue;
Standardization begins here;
df10.stnd <- data.frame(scale(Weekly[,-9]))
attach(df10.stnd)
df10.boolean <- (Weekly$Year<=2008)
f10.label = Direction[Weekly$Year<=2008] #Because Year will be in a different scale, we need to refer to the original df

Model 1: Lag1 + Lag2
set.seed(1)
f10i19.train <- cbind(Lag1,Lag2)[df10.boolean,]
f10i19.test <- cbind(Lag1,Lag2)[!df10.boolean,]
f10i19.pred <- knn(f10i19.train, f10i19.test, f10.label,k=25)
f10i19.cm <- confusionMatrix(data = f10i19.pred, reference = f10i.test$Direction, positive = "Up")
#f10i19.cm$table
f10i19.cm$overall[1]Accuracy
0.625
Model 2: Lag1 * Lag2
set.seed(1)
f10i20.train <- cbind(Lag1, Lag2, Lag1*Lag2)[df10.boolean,]
f10i20.test <- cbind(Lag1, Lag2, Lag1*Lag2)[!df10.boolean,]
f10i20.pred <- knn(f10i20.train, f10i20.test, f10.label,k=25)
f10i20.cm <- confusionMatrix(data = f10i20.pred, reference = f10i.test$Direction, positive = "Up")
#f10i20.cm$table
f10i20.cm$overall[1] Accuracy
0.5384615
Model 3: Lag1 * Lag2 + Volume
set.seed(1)
f10i21.train <- cbind(Lag1, Lag2, Lag1*Lag2, Volume)[df10.boolean,]
f10i21.test <- cbind(Lag1, Lag2, Lag1*Lag2, Volume)[!df10.boolean,]
f10i21.pred <- knn(f10i21.train, f10i21.test, f10.label,k=25)
f10i21.cm <- confusionMatrix(data = f10i21.pred, reference = f10i.test$Direction, positive = "Up")
#f10i21.cm$table
f10i21.cm$overall[1] Accuracy
0.4807692
Model 4: Lag1 * Lag2 + I(Volume^2)
set.seed(1)
f10i22.train <- cbind(Lag1, Lag2, Lag1*Lag2, I(Volume^2))[df10.boolean,]
f10i22.test <- cbind(Lag1, Lag2, Lag1*Lag2, I(Volume^2))[!df10.boolean,]
f10i22.pred <- knn(f10i21.train, f10i21.test, f10.label,k=25) #note: this call still uses the Model 3 matrices (f10i21) rather than f10i22
f10i22.cm <- confusionMatrix(data = f10i22.pred, reference = f10i.test$Direction, positive = "Up")
#f10i22.cm$table
f10i22.cm$overall[1] Accuracy
0.4807692
Model 5: Lag2 + Volume
set.seed(1)
f10i23.train <- cbind(Lag2, Volume)[df10.boolean,]
f10i23.test <- cbind(Lag2, Volume)[!df10.boolean,]
f10i23.pred <- knn(f10i23.train, f10i23.test, f10.label,k=25)
f10i23.cm <- confusionMatrix(data = f10i23.pred, reference = f10i.test$Direction, positive = "Up")
#f10i23.cm$table
f10i23.cm$overall[1]Accuracy
0.5
Model 6: Lag2 + I(Volume^2)
set.seed(1)
f10i24.train <- cbind(Lag2, I(Volume^2))[df10.boolean,]
f10i24.test <- cbind(Lag2, I(Volume^2))[!df10.boolean,]
f10i24.pred <- knn(f10i23.train, f10i24.test, f10.label,k=25) #note: the training matrix here comes from Model 5 (f10i23) while the test matrix is from Model 6 (f10i24)
f10i24.cm <- confusionMatrix(data = f10i24.pred, reference = f10i.test$Direction, positive = "Up")
#f10i24.cm$table
f10i24.cm$overall[1] Accuracy
0.5192308
Apparently, KNN with \(K=25\) and Lag1 + Lag2 as predictors achieved the best accuracy, \(0.625\);
The plot below summarizes the accuracies of the fitted models.
f10i.sum <- c() #form each fit object names to be used in fit object data acquisition
f10i.all <- c() #get the accuracy values and later the fits names
f10i.fits <- c() #get the fit names objects
f10i.molds <- c()
f10i.exp <- c()
for(i in 1:24)
{
f10i.sum[i] <- paste("f10i", i, ".cm",sep = "") #form the object name, without the data
for (j in 1:length(ls()))
{
if (identical(ls()[j], f10i.sum[i])) #search in all objects and get those formed in f10i.sum
{
f10i.all[j] <- get0(ls()[j])$overall[1] #get the accuracy value
f10i.fits[j] <- i #get the fit number
}
}
}
f10i.all <- na.exclude(f10i.all)
f10i.all <- as.vector(f10i.all)
f10i.fits <- na.exclude(f10i.fits)
f10i.fits <- as.vector(f10i.fits)
f10i.all <- cbind(f10i.all,f10i.fits)
colnames(f10i.all)[1] <- "Accuracy"
colnames(f10i.all)[2] <- "Fit.Number"
f10i.all <- data.frame(f10i.all)
f10i.all$Fit.Number <- f10i.all$Fit.Number %>% as.factor()
f10i.all <- f10i.all[order(f10i.all$Fit.Number),]
rownames(f10i.all) <- 1:nrow(f10i.all)
f10i.molds <- vector(mode = "character",24)
f10i.molds[1:6] <- "LOGISTIC"
f10i.molds[7:12] <- "LDA"
f10i.molds[13:18] <- "QDA"
f10i.molds[19:24] <- "KNN (25)"
f10i.all$Fit.Number[1:6] <- c(1,2,3,4,5,6)
f10i.all$Fit.Number[7:12] <- c(1,2,3,4,5,6)
f10i.all$Fit.Number[13:18] <- c(1,2,3,4,5,6)
f10i.all$Fit.Number[19:24] <- c(1,2,3,4,5,6)
for (i in 1:nrow(f10i.all)) {
if(f10i.all$Fit.Number[i]==1){
f10i.exp[i] <- "Lag1+Lag2"
}
else if(f10i.all$Fit.Number[i]==2){
f10i.exp[i] <- "Lag1*Lag2"
}
else if(f10i.all$Fit.Number[i]==3){
f10i.exp[i] <- "Lag1*Lag2+Volume"
}
else if(f10i.all$Fit.Number[i]==4){
f10i.exp[i] <- "Lag1*Lag2+Volume^2"
}
else if(f10i.all$Fit.Number[i]==5){
f10i.exp[i] <- "Lag2+Volume"
}
else if(f10i.all$Fit.Number[i]==6){
f10i.exp[i] <- "Lag2+Volume^2"
}
}
f10i.all <- cbind(f10i.all, f10i.molds)
f10i.all <- cbind(f10i.all,f10i.exp)
colnames(f10i.all)[3] <- "Models"
colnames(f10i.all)[4] <- "Fit.Expression"

ggplot(data=f10i.all, aes(x=Fit.Number, y=Accuracy, fill=Fit.Expression)) +
geom_bar(stat="identity") +
facet_grid(~Models, scales = "free_x") +
#scale_x_continuous("Fits",labels = as.numeric(f10i.all$Fits), breaks = f10i.all$Fits) +
scale_fill_brewer(palette="Dark2") +
theme_bw()

One can observe that:
Fit Number 1, with expression \(Lag1 + Lag2\), obtained the highest accuracy within every method;
Although KNN obtained the highest accuracy with Expression 1, it was not as consistent across the other expressions as the LDA and logistic methods, which achieved accuracy \(>0.5\) for all of them;
The LDA and Logistic methods obtained similar results;
The QDA method presented the lowest average result among the evaluated models.
In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set.
Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median. You can compute the median using the median() function. Note you may find it helpful to use the data.frame() function to create a single data set containing both mpg01 and the other Auto variables.
The binary variable mpg01 creation is described in the code block
f11.boolean <- (mpg>median(mpg)) # true for mpg > median and false for mpg < median
df.mpg01 <- as.factor(as.numeric(f11.boolean)) %>% data.frame() # convert to 0 (false) and 1 (true), put in a df
colnames(df.mpg01)[1] <- "mpg01" # give name to the column
df11 <- data.frame(c(Auto,df.mpg01)) # Concatenating the original data set to indicator variable
head(df11)

Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.
First, the name column (the car model names) was removed because it is not relevant to the problem and might interfere with the data manipulation;
Then the column origin, which in the original data contains 1 for American, 2 for European and 3 for Japanese, was converted to those labels;
Using the glimpse() function, one can see the class of each variable and its first values;
Rows: 392
Columns: 9
$ mpg <dbl> 18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14, 24, 22, 18, 21, 27, 26, ...
$ cylinders <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 6, 6, 6, 4, 4, 4, 4, 4, 4, 6, 8, 8,...
$ displacement <dbl> 307, 350, 318, 304, 302, 429, 454, 440, 455, 390, 383, 340, 400, 455, 113, 198, ...
$ horsepower <dbl> 130, 165, 150, 150, 140, 198, 220, 215, 225, 190, 170, 160, 150, 225, 95, 95, 97...
$ weight <dbl> 3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312, 4425, 3850, 3563, 3609, 3761, 30...
$ acceleration <dbl> 12.0, 11.5, 11.0, 12.0, 10.5, 10.0, 9.0, 8.5, 10.0, 8.5, 10.0, 8.0, 9.5, 10.0, 1...
$ year <dbl> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, ...
$ origin <fct> American, American, American, American, American, American, American, American, ...
$ mpg01 <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0,...
Using the plot() function, one can see the last row, where the created variable mpg01 is, and how the other variables relate to it;
As mpg01 is an indicator variable with two levels, it is more suitable to use boxplots, with the variables on the y-axis and mpg01 on the x-axis;
A barplot was used for the variable origin because a boxplot was not very informative for it;
In the figure below one can see how the variables relate to mpg01;
p11b1.gg <- ggplot(df11, aes(x = mpg01, y = mpg, col = mpg01)) +
geom_boxplot() +
theme_light() +
theme(legend.position = "none") +
geom_jitter()
p11b2.gg <- ggplot(df11, aes(x = mpg01, y = cylinders, col = mpg01)) +
geom_boxplot() +
theme_light() +
theme(legend.position = "none") +
geom_jitter()
p11b3.gg <- ggplot(df11, aes(x = mpg01, y = displacement, col = mpg01)) +
geom_boxplot() +
theme_light() +
theme(legend.position = "none") +
geom_jitter()
p11b4.gg <- ggplot(df11, aes(x = mpg01, y = horsepower, col = mpg01)) +
geom_boxplot() +
theme_light() +
theme(legend.position = "none") +
geom_jitter()
p11b5.gg <- ggplot(df11, aes(x = mpg01, y = weight, col = mpg01)) +
geom_boxplot() +
theme_light() +
theme(legend.position = "none") +
geom_jitter()
p11b6.gg <- ggplot(df11, aes(x = mpg01, y = acceleration, col = mpg01)) +
geom_boxplot() +
theme_light() +
theme(legend.position = "none") +
geom_jitter()
p11b7.gg <- ggplot(df11, aes(x = mpg01, y = year, col = mpg01)) +
geom_boxplot() +
theme_light() +
theme(legend.position = "none") +
geom_jitter()
p11b8.gg <- ggplot(df11, aes(x = origin, fill = mpg01)) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent_format()) +
theme_light() +
theme(axis.title.y = element_blank())
p11b9 <- ggarrange(p11b1.gg,p11b2.gg, p11b3.gg, p11b4.gg,
p11b5.gg,p11b6.gg, p11b7.gg, p11b8.gg,
ncol=4, nrow=2,
legend = c("bottom"),
common.legend=TRUE) #still do not know why there is a white pane before the plots
annotate_figure(p11b9, top = text_grob("Fig. 11b - mpg01 chosen predictors", color = "red", face = "bold", size = 14))

The mpg vs mpg01 plot does not add much information, because mpg01 is simply the mpg variable split at its median of \(22.75\);
The cylinders, displacement, horsepower and weight variables show a clear separation at the median in both the boxplots and the jittered points, evidence that they could be good candidates for predicting mpg01;
The acceleration and year variables do not separate the data as well as the previous ones, which suggests they may have little effect in predicting mpg01;
The last plot, relating origin to the indicator variable, separates the Japanese and European cars from the American ones fairly well, which is evidence that origin could also be a useful predictor of mpg01.
Split the data into a training set and a test set.
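The splitting chunk is not echoed; a sketch consistent with the 50/50 createDataPartition() split used later in part (g), and with the df11.train and df11.test objects used below, could be:
set.seed(11)   # seed value assumed here, mirroring part (g)
f11c.index <- createDataPartition(y = df11$mpg01, p = 0.5, list = FALSE)
df11.train <- df11[f11c.index,]    # training half
df11.test  <- df11[-f11c.index,]   # test half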
Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
f11d <- lda(mpg01~cylinders+displacement+weight+horsepower, data = df11.train)
f11d.probs <- predict(f11d, df11.test, type="response")
f11d.cm <- confusionMatrix(data=f11d.probs$class, reference = df11.test$mpg01, positive = "1")
1 - f11d.cm[["overall"]][["Accuracy"]]
[1] 0.08673469
Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
f11e <- qda(mpg01~cylinders+displacement+weight+horsepower, data = df11.train)
f11e.probs <- predict(f11e, df11.test, type="response")
f11e.cm <- confusionMatrix(data=f11e.probs$class, reference = df11.test$mpg01, positive = "1")
1 - f11e.cm[["overall"]][["Accuracy"]]
[1] 0.09693878
Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
f11f <- glm(mpg01~cylinders+displacement+weight+horsepower, family = binomial, data = df11.train)
f11f.probs <- predict(f11f, df11.test, type = "response")
f11f.pred <- rep("0",196)
f11f.pred[f11f.probs > .5] <- "1"
f11f.pred <- as.factor(f11f.pred)
f11f.cm <- confusionMatrix(data=f11f.pred, reference = df11.test$mpg01, positive = "1")
1 - f11f.cm[["overall"]][["Accuracy"]]
[1] 0.07653061
Perform KNN on the training data, with several values of K, in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set?
df11.stnd <- df11[,-c(8:9)] #removing the factors types data to be able to use the scale() function
df11.stnd <- scale(df11.stnd) %>% data.frame() #scale and put into a new df, as the scale() produces a matrix
df11.stnd <- cbind(df11.stnd, df11[,9]) #concatenate the original mpg01 from the df where it was created
colnames(df11.stnd)[8] <- "mpg01" #changing the column name to mpg01

df11.stnd.boolean <- (df11.stnd$mpg>median(df11.stnd$mpg)) #creating a column 0 1 according to the median value
df.mpg01 <- as.factor(as.numeric(df11.stnd.boolean)) %>% data.frame() # convert to 0 (false) and 1 (true), put in a df
colnames(df.mpg01)[1] <- "mpg01.after.stnd" # give name to the column
df11.stnd <- data.frame(c(df11.stnd,df.mpg01)) # Concatenating the original data set to indicator variable
var(df11.stnd)[,8:9] #checking the mpg01 variance before and after the standardization; the same values were found
                  mpg01 mpg01.after.stnd
mpg 0.4190044 0.4190044
cylinders -0.3800820 -0.3800820
displacement -0.3772198 -0.3772198
horsepower -0.3339525 -0.3339525
weight -0.3793625 -0.3793625
acceleration 0.1736324 0.1736324
year 0.2152268 0.2152268
mpg01 0.2506394 0.2506394
mpg01.after.stnd 0.2506394 0.2506394
set.seed(11)
df11.stnd.boolean <- createDataPartition(y = df11.stnd$mpg01, p = 0.5, list = F)
df11.stnd.train <- df11.stnd[df11.stnd.boolean,]
df11.stnd.test <- df11.stnd[-df11.stnd.boolean,]

\(K=1\)
f11g1 <- knn(df11.stnd.train, df11.stnd.test, df11.stnd.test$mpg01,k=1)
f11g1.cm <- confusionMatrix(data=f11g1, reference = df11.stnd.test$mpg01, positive = "1")
f11g1.cm$table Reference
Prediction 0 1
0 60 41
1 38 57
[1] 0.4030612
\(K=2\)
f11g1 <- knn(df11.stnd.train, df11.stnd.test, df11.stnd.test$mpg01,k=2)
f11g1.cm <- confusionMatrix(data=f11g1, reference = df11.stnd.test$mpg01, positive = "1")
f11g1.cm$table Reference
Prediction 0 1
0 58 35
1 40 63
[1] 0.3826531
\(K=5\)
f11g1 <- knn(df11.stnd.train, df11.stnd.test, df11.stnd.test$mpg01,k=5)
f11g1.cm <- confusionMatrix(data=f11g1, reference = df11.stnd.test$mpg01, positive = "1")
f11g1.cm$table Reference
Prediction 0 1
0 58 25
1 40 73
[1] 0.3316327
\(K=10\)
f11g1 <- knn(df11.stnd.train, df11.stnd.test, df11.stnd.test$mpg01,k=10)
f11g1.cm <- confusionMatrix(data=f11g1, reference = df11.stnd.test$mpg01, positive = "1")
f11g1.cm$table Reference
Prediction 0 1
0 64 17
1 34 81
[1] 0.2602041
\(K=50\)
f11g1 <- knn(df11.stnd.train, df11.stnd.test, df11.stnd.test$mpg01,k=50)
f11g1.cm <- confusionMatrix(data=f11g1, reference = df11.stnd.test$mpg01, positive = "1")
f11g1.cm$table
          Reference
Prediction  0  1
         0 63  2
         1 35 96
[1] 0.1887755
\(K=75\)
f11g1 <- knn(df11.stnd.train, df11.stnd.test, df11.stnd.test$mpg01,k=75)
f11g1.cm <- confusionMatrix(data=f11g1, reference = df11.stnd.test$mpg01, positive = "1")
f11g1.cm$table
          Reference
Prediction  0  1
         0 70  0
         1 28 98
[1] 0.1428571
\(K=100\)
f11g1 <- knn(df11.stnd.train, df11.stnd.test, df11.stnd.test$mpg01,k=100)
f11g1.cm <- confusionMatrix(data=f11g1, reference = df11.stnd.test$mpg01, positive = "1")
f11g1.cm$table
          Reference
Prediction  0  1
         0 77  2
         1 21 96
[1] 0.1173469
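Of the values tried, \(K=100\) gave the lowest test error (0.1173469), so a large \(K\) seems to perform best on this split. The repetitive blocks above could also be condensed into a loop; a minimal sketch (not used for the results above), restricted to the predictors chosen in (b) and passing the training-set labels as the cl argument of knn():
ks <- c(1, 2, 5, 10, 50, 75, 100) #candidate numbers of neighbors
vars <- c("cylinders", "displacement", "weight", "horsepower") #predictors selected in (b)
knn.err <- sapply(ks, function(k) {
  pred <- knn(train = df11.stnd.train[, vars],
              test = df11.stnd.test[, vars],
              cl = df11.stnd.train$mpg01,
              k = k)
  mean(pred != df11.stnd.test$mpg01) #test error rate for this k
})
data.frame(K = ks, test.error = knn.err)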
This problem involves writing functions.
Write a function, Power(), that prints out the result of raising 2 to the 3rd power. In other words, your function should compute \(2^3\) and print out the results.
Hint: Recall that x^a raises x to the power a. Use the print() function to output the result.
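A minimal sketch of such a function (assuming it only needs to print \(2^3\)):
Power <- function() {
  print(2^3) #compute and print 2 raised to the 3rd power
}
Power()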
[1] 8
Create a new function, Power2(), that allows you to pass any two numbers, x and a, and prints out the value of x^a. You can do this by beginning your function with the line
> Power2 =function (x,a){
You should be able to call your function by entering, for instance,
> Power2 (3,8)
on the command line. This should output the value of \(3^8\), namely, \(6,561\).
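A minimal sketch of Power2(), following the skeleton quoted above:
Power2 <- function(x, a) {
  print(x^a) #print x raised to the power a
}
Power2(3, 8)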
[1] 6561
Using the Power2() function that you just wrote, compute \(10^3\), \(8^{17}\), and \(131^3\).
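Assuming the Power2() sketch above, the three calls below produce the outputs that follow:
Power2(10, 3)
Power2(8, 17)
Power2(131, 3)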
[1] 1000
[1] 2.2518e+15
[1] 2248091
Now create a new function, Power3(), that actually returns the result x^a as an R object, rather than simply printing it to the screen. That is, if you store the value x^a in an object called result within your function, then you can simply return() this result, using the following line:
return (result)
The line above should be the last line in your function, before the } symbol.
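A minimal sketch of Power3(), consistent with how it is used in the plots below:
Power3 <- function(x, a) {
  result <- x^a #store the value instead of printing it
  return(result)
}
Power3(1:10, 2)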
[1] 1 4 9 16 25 36 49 64 81 100
Now using the Power3() function, create a plot of \(f(x) = x^2\). The \(x\)-axis should display a range of integers from \(1\) to \(10\), and the \(y\)-axis should display \(x^2\). Label the axes appropriately, and use an appropriate title for the figure. Consider displaying either the \(x\)-axis, the \(y\)-axis, or both on the log-scale. You can do this by using log="x", log="y", or log="xy" as arguments to the plot() function.
ggplot(NULL,aes(1:10,Power3(1:10,2))) +
geom_point() +
geom_line() +
xlab("x") +
#scale_x_log10(n.breaks=10) +
scale_x_continuous(n.breaks = 10) +
scale_y_log10(n.breaks=10) +
ylab(expression(log[10](x^2))) +
labs(title = "Figure 12e", subtitle = expression(x~"vs"~log[10](x^2))) +
theme_bw()
Create a function, PlotPower(), that allows you to create a plot of x against x^a for a fixed a and for a range of values of x. For instance, if you call
> PlotPower (1:10 ,3)
then a plot should be created with an \(x\)-axis taking on values \(1, 2, . . . , 10\), and a \(y\)-axis taking on values \(1^3, 2^3, . . . , 10^3\).
PlotPower=function(x,a)
{
ggplot(NULL,aes(x,Power3(x,a))) +
geom_point() +
geom_line() +
xlab("x") +
scale_x_continuous(n.breaks=10) +
scale_y_continuous(n.breaks=10) +
ylab(expression(x^{a})) +
labs(title = "Figure 12f", subtitle = expression(x~vs~x^3)) +
theme_bw()
}
PlotPower(1:10,3)
Using the Boston data set, fit classification models in order to predict whether a given suburb has a crime rate above or below the median. Explore logistic regression, LDA, and KNN models using various subsets of the predictors. Describe your findings.
The procedure used in this exercise is similar to the one used in 10i.
First, a dataframe was created with a variable (crim01) that takes the value 1 or 0 according to whether crim is above or below its median (\(0.25651\)).
data("Boston")
f13.boolean <- (Boston$crim>median(Boston$crim))
df.crim01 <- as.factor(as.numeric(f13.boolean)) %>% data.frame()
colnames(df.crim01)[1] <- "crim01" # give name to the column
df13 <- data.frame(c(Boston,df.crim01))
Next, an analysis based on crim01 was made using boxplots of each predictor in the Boston data set.
The predictors were classified as most or least relevant according to the degree of overlap in the boxplots.
p131.gg <- ggplot(df13, aes(x = crim01, y = crim, col = crim01)) +
geom_boxplot() +
theme_light() +
theme(legend.position = "none")
#geom_jitter()
p132.gg <- ggplot(df13, aes(x = crim01, y = zn, col = crim01)) +
geom_boxplot() +
theme_light() +
theme(legend.position = "none")
#geom_jitter()
p133.gg <- ggplot(df13, aes(x = crim01, y = indus, col = crim01)) +
geom_boxplot() +
theme_light() +
theme(legend.position = "none")
#geom_jitter()
p134.gg <- ggplot(df13, aes(x = crim01, y = nox, col = crim01)) +
geom_boxplot() +
theme_light() +
theme(legend.position = "none")
#geom_jitter()
p136.gg <- ggplot(df13, aes(x = crim01, y = age, col = crim01)) +
geom_boxplot() +
theme_light() +
theme(legend.position = "none")
#geom_jitter()
p137.gg <- ggplot(df13, aes(x = crim01, y = dis, col = crim01)) +
geom_boxplot() +
theme_light() +
theme(legend.position = "none")
#geom_jitter()
p135.gg <- ggplot(df13, aes(x = crim01, y = rm, col = crim01)) + #poor discriminative power
geom_boxplot() +
theme_light() +
theme(legend.position = "none")
#geom_jitter()
p138.gg <- ggplot(df13, aes(x = crim01, y = rad, col = crim01)) + #poor discriminative power
geom_boxplot() +
theme_light() +
theme(legend.position = "none")
#geom_jitter()
p139.gg <- ggplot(df13, aes(x = crim01, y = tax, col = crim01)) + #poor discriminative power
geom_boxplot() +
theme_light() +
theme(legend.position = "none")
#geom_jitter()
p1310.gg <- ggplot(df13, aes(x = crim01, y = ptratio, col = crim01)) + #poor discriminative power
geom_boxplot() +
theme_light() +
theme(legend.position = "none")
#geom_jitter()
p1311.gg <- ggplot(df13, aes(x = crim01, y = black, col = crim01)) + #poor discriminative power
geom_boxplot() +
theme_light() +
theme(legend.position = "none")
#geom_jitter()
p1312.gg <- ggplot(df13, aes(x = crim01, y = lstat, col = crim01)) + #moderate discriminative power
geom_boxplot() +
theme_light() +
theme(legend.position = "none")
#geom_jitter()
p1313.gg <- ggplot(df13, aes(x = crim01, y = chas, col = crim01)) + #poor discriminative power
geom_boxplot() +
theme_light() +
theme(legend.position = "none")
#geom_jitter()
p1314.gg <- ggplot(df13, aes(x = crim01, y = medv, col = crim01)) + #moderate discriminative power
geom_boxplot() +
theme_light() +
theme(legend.position = "none")
#geom_jitter()
p131 <- ggarrange(p1312.gg, p133.gg, p134.gg, p1311.gg,
p136.gg, p137.gg, p1314.gg, p1310.gg, p139.gg,
ncol=3, nrow=3,
legend = c("bottom"),
common.legend=TRUE) #note: an unexplained blank panel appears before the plots
p132 <- ggarrange(p138.gg, p132.gg, p1313.gg, p135.gg,
ncol=3, nrow=2,
legend = c("bottom"),
common.legend=TRUE) #note: an unexplained blank panel appears before the plots
annotate_figure(p131, top = text_grob("Fig. 13.1 - Most relevant chosen predictors", color = "red", face = "bold", size = 14))
annotate_figure(p132, top = text_grob("Fig. 13.2 - Least relevant chosen predictors", color = "red", face = "bold", size = 14))
The new dataframe was then split into training and test data.
set.seed(31)
df13.boolean <- createDataPartition(y = df13$crim01, p = 0.5, list = F)
df13.train <- df13[df13.boolean,]
df13.train <- df13.train[,-1] #removing crim predictor
df13.test <- df13[-df13.boolean,]
df13.test <- df13.test[,-1] #removing the crim predictor
After the analysis, 10 candidate models were proposed; they are listed below.
Model 1: all predictors
Model 2: predictors from Figure 13.1
Model 3: lstat + nox + age + medv + dis
Model 4: lstat + nox + age + medv + dis + black
Model 5: indus + black + ptratio + tax
Model 6: predictors from Figure 13.2
Model 7: rad + rm
Model 8: zn + chas
Model 9: indus + tax
Model 10: zn + chas + indus + tax
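The formulas were stored in a character vector m13, which is used by the loops below but is not shown in the output above. The following is an assumed reconstruction based on the model list, with Model 2 and Model 6 taking the predictors grouped in Figures 13.1 and 13.2, respectively:
m13 <- c("crim01 ~ .", #Model 1: all predictors
         "crim01 ~ lstat + indus + nox + black + age + dis + medv + ptratio + tax", #Model 2: Figure 13.1
         "crim01 ~ lstat + nox + age + medv + dis", #Model 3
         "crim01 ~ lstat + nox + age + medv + dis + black", #Model 4
         "crim01 ~ indus + black + ptratio + tax", #Model 5
         "crim01 ~ rad + zn + chas + rm", #Model 6: Figure 13.2
         "crim01 ~ rad + rm", #Model 7
         "crim01 ~ zn + chas", #Model 8
         "crim01 ~ indus + tax", #Model 9
         "crim01 ~ zn + chas + indus + tax") #Model 10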
After the division, the methods were applied to each model and the accuracy saved in a vector;
Logistic Regression
f13lgt.acc <- c()
for (i in 1:10) {
f13lgt <- glm(m13[i], data = df13.train, family = binomial)
f13lgt.probs <- predict(f13lgt,df13.test, type = "response")
f13lgt.pred <- rep("0",252)
f13lgt.pred[f13lgt.probs > .5] <- "1"
f13lgt.pred <- as.factor(f13lgt.pred)
f13lgt.cm <- confusionMatrix(data = f13lgt.pred, reference = df13.test$crim01, positive = "1")
f13lgt.acc[i] <- f13lgt.cm[["overall"]][["Accuracy"]]
}
LDA
f13lda.acc <- c()
for (i in 1:10) {
f13lda <- lda(as.formula(m13[i]), data = df13.train)
f13lda.probs <- predict(f13lda,df13.test, type = "response")
f13lda.cm <- confusionMatrix(data = f13lda.probs$class, reference = df13.test$crim01, positive = "1")
f13lda.acc[i] <- f13lda.cm[["overall"]][["Accuracy"]]
}
QDA
f13qda.acc <- c()
for (i in 1:10) {
f13qda <- qda(as.formula(m13[i]), data = df13.train)
f13qda.probs <- predict(f13qda,df13.test, type = "response")
f13qda.cm <- confusionMatrix(data = f13qda.probs$class, reference = df13.test$crim01, positive = "1")
f13qda.acc[i] <- f13qda.cm[["overall"]][["Accuracy"]]
}
KNN
It was not possible to make a proper loop over all the models with the KNN method, since knn() takes data matrices rather than a formula.
The models were therefore applied separately, using \(k=25\) neighbors (an arbitrary value).
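As a possible alternative (a sketch only, not used for the results below), the loop could be written by building a numeric predictor matrix from each assumed m13 formula with model.matrix(), standardizing the test matrix with the training means and standard deviations, and passing the training labels as the cl argument of knn():
f13knn.loop.acc <- c()
for (i in 1:10) {
  X.train <- model.matrix(as.formula(m13[i]), data = df13.train)[, -1, drop = FALSE] #drop the intercept column
  X.test <- model.matrix(as.formula(m13[i]), data = df13.test)[, -1, drop = FALSE]
  X.test <- scale(X.test, center = colMeans(X.train), scale = apply(X.train, 2, sd)) #scale test data with training statistics
  X.train <- scale(X.train)
  f13knn.loop <- knn(X.train, X.test, cl = df13.train$crim01, k = 25)
  f13knn.loop.acc[i] <- mean(f13knn.loop == df13.test$crim01) #accuracy for model i
}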
Standardization, and removal of two rows (253 and 254) from the training data
#df13.test[,14] <- as.numeric(levels(df13.test[,14]))[df13.test[,14]]
df13stnd.test <- scale(df13.test[,-14]) %>% data.frame()
df13stnd.test <- cbind(df13stnd.test,df13.test$crim01)
colnames(df13stnd.test)[14] <- "crim01"
df13stnd.train <- scale(df13.train[-c(253:254),-14]) %>% data.frame()
df13stnd.train <- cbind(df13stnd.train,df13.train$crim01[-c(253:254)])
colnames(df13stnd.train)[14] <- "crim01"
KNN MODEL 1
f13knn.acc <- c()
f13knn1 <- knn(df13stnd.train, df13stnd.test, df13stnd.test$crim01, k=25)
f13knn1.cm <- confusionMatrix(data = f13knn1, reference = df13stnd.test$crim01, positive = "1")
#f13knn1.cm[["overall"]][["Accuracy"]]
f13knn.acc[1] <- f13knn1.cm[["overall"]][["Accuracy"]]
KNN MODEL 2
df13stnd2.train <- df13stnd.train[,-c(1,3,5,8)]
df13stnd2.test <- df13stnd.test[,-c(1,3,5,8)]
f13knn2 <- knn(df13stnd2.train, df13stnd2.test, df13stnd2.test$crim01, k=25)
f13knn2.cm <- confusionMatrix(data = f13knn2, reference = df13stnd2.test$crim01, positive = "1")
#f13knn2.cm[["overall"]][["Accuracy"]]
f13knn.acc[2] <- f13knn2.cm[["overall"]][["Accuracy"]]
KNN MODEL 3
df13stnd3.train <- df13stnd.train[,c(4,6,7,12,13,14)]
df13stnd3.test <- df13stnd.test[,c(4,6,7,12,13,14)]
f13knn3 <- knn(df13stnd3.train, df13stnd3.test, df13stnd3.test$crim01, k=25)
f13knn3.cm <- confusionMatrix(data = f13knn3, reference = df13stnd3.test$crim01, positive = "1")
#f13knn3.cm[["overall"]][["Accuracy"]]
f13knn.acc[3] <- f13knn3.cm[["overall"]][["Accuracy"]]
KNN MODEL 4
df13stnd4.train <- df13stnd.train[,c(4,6,7,12,13,11,14)]
df13stnd4.test <- df13stnd.test[,c(4,6,7,12,13,11,14)]
f13knn4 <- knn(df13stnd4.train, df13stnd4.test, df13stnd4.test$crim01, k=25)
f13knn4.cm <- confusionMatrix(data = f13knn4, reference = df13stnd4.test$crim01, positive = "1")
#f13knn4.cm[["overall"]][["Accuracy"]]
f13knn.acc[4] <- f13knn4.cm[["overall"]][["Accuracy"]]
KNN MODEL 5
df13stnd5.train <- df13stnd.train[,c(2,9,10,11,14)]
df13stnd5.test <- df13stnd.test[,c(2,9,10,11,14)]
f13knn5 <- knn(df13stnd5.train, df13stnd5.test, df13stnd5.test$crim01, k=25)
f13knn5.cm <- confusionMatrix(data = f13knn5, reference = df13stnd5.test$crim01, positive = "1")
#f13knn5.cm[["overall"]][["Accuracy"]]
f13knn.acc[5] <- f13knn5.cm[["overall"]][["Accuracy"]]
KNN MODEL 6
df13stnd6.train <- df13stnd.train[,c(1,3,5,8,14)]
df13stnd6.test <- df13stnd.test[,c(1,3,5,8,14)]
f13knn6 <- knn(df13stnd6.train, df13stnd6.test, df13stnd6.test$crim01, k=25)
f13knn6.cm <- confusionMatrix(data = f13knn6, reference = df13stnd6.test$crim01, positive = "1")
#f13knn6.cm[["overall"]][["Accuracy"]]
f13knn.acc[6] <- f13knn6.cm[["overall"]][["Accuracy"]]
KNN MODEL 7
df13stnd7.train <- df13stnd.train[,c(1,8,14)]
df13stnd7.test <- df13stnd.test[,c(1,8,14)]
f13knn7 <- knn(df13stnd7.train, df13stnd7.test, df13stnd7.test$crim01, k=25)
f13knn7.cm <- confusionMatrix(data = f13knn7, reference = df13stnd7.test$crim01, positive = "1")
#f13knn7.cm[["overall"]][["Accuracy"]]
f13knn.acc[7] <- f13knn7.cm[["overall"]][["Accuracy"]]
KNN MODEL 8
df13stnd8.train <- df13stnd.train[,c(1,3,14)]
df13stnd8.test <- df13stnd.test[,c(1,3,14)]
f13knn8 <- knn(df13stnd8.train, df13stnd8.test, df13stnd8.test$crim01, k=25)
f13knn8.cm <- confusionMatrix(data = f13knn8, reference = df13stnd8.test$crim01, positive = "1")
#f13knn8.cm[["overall"]][["Accuracy"]]
f13knn.acc[8] <- f13knn8.cm[["overall"]][["Accuracy"]]
KNN MODEL 9
df13stnd9.train <- df13stnd.train[,c(2,9,14)]
df13stnd9.test <- df13stnd.test[,c(2,9,14)]
f13knn9 <- knn(df13stnd9.train, df13stnd9.test, df13stnd9.test$crim01, k=25)
f13knn9.cm <- confusionMatrix(data = f13knn9, reference = df13stnd9.test$crim01, positive = "1")
#f13knn9.cm[["overall"]][["Accuracy"]]
f13knn.acc[9] <- f13knn9.cm[["overall"]][["Accuracy"]]
KNN MODEL 10
df13stnd10.train <- df13stnd.train[,c(1,2,3,9,14)]
df13stnd10.test <- df13stnd.test[,c(1,2,3,9,14)]
f13knn10 <- knn(df13stnd10.train, df13stnd10.test, df13stnd10.test$crim01, k=25)
f13knn10.cm <- confusionMatrix(data = f13knn10, reference = df13stnd10.test$crim01, positive = "1")
#f13knn10.cm[["overall"]][["Accuracy"]]
f13knn.acc[10] <- f13knn10.cm[["overall"]][["Accuracy"]]
After fitting the classification models, the results were organized in a dataframe and summarized in the figure below.
f13.all <- cbind(f13lgt.acc,f13lda.acc,f13qda.acc,f13knn.acc)
colnames(f13.all)[1] <- "LOGISTIC"
colnames(f13.all)[2] <- "LDA"
colnames(f13.all)[3] <- "QDA"
colnames(f13.all)[4] <- "KNN (25)"
f13.all <- f13.all %>% melt()
colnames(f13.all)[1] <- "Model"
colnames(f13.all)[2] <- "Methods"
colnames(f13.all)[3] <- "Accuracy"
f13.all$Model <- f13.all$Model %>% as.factor()
mycolors <- colorRampPalette(brewer.pal(8, "Dark2"))(10)
ggplot(data=f13.all, aes(x=Model, y=Accuracy, fill=Model)) +
geom_bar(stat="identity") +
facet_grid(~Methods, scales = "free_x") +
#scale_x_continuous("Fits",labels = as.numeric(f10i.all$Fits), breaks = f10i.all$Fits) +
#scale_fill_brewer(palette="Dark2") +
scale_fill_manual(values = mycolors) +
theme_bw()
Observing the results, one can emphasize:
Model 1 (all predictors) obtained the best results with the logistic regression, LDA, and QDA methods;
KNN obtained the best average accuracy of all the methods;
Model 8 (zn + chas) obtained the best accuracy with KNN while achieving the worst results with the other methods; this result is dubious and should not be trusted;
For a detailed explanation, consult Section 2.5, Local Methods in High Dimensions, of The Elements of Statistical Learning, or the lecture notes at: https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote02_kNN.html