Warning: package 'ISLR2' was built under R version 4.5.3
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 4.0.0 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(psych)
Attaching package: 'psych'
The following objects are masked from 'package:ggplot2':
%+%, alpha
library(ggplot2)library(GGally)library(mlbench)
Warning: package 'mlbench' was built under R version 4.5.2
library(tidyr)library(modelr)
Warning: package 'modelr' was built under R version 4.5.2
library(MASS)
Warning: package 'MASS' was built under R version 4.5.2
Attaching package: 'MASS'
The following object is masked from 'package:dplyr':
select
The following object is masked from 'package:ISLR2':
Boston
library(gtsummary)
Attaching package: 'gtsummary'
The following object is masked from 'package:MASS':
select
library(caret)
Warning: package 'caret' was built under R version 4.5.2
Loading required package: lattice
Attaching package: 'caret'
The following object is masked from 'package:purrr':
lift
attach(Weekly)
13
From the describe function, volume is the one that has the highest skew at 1.62 which signifies a long tail. From the graphical representation we see that the volume is indeed showing a long tail. Volume also demonstrates a non linear pattern with year from the graphical output, but the two features also have the highest correlation value of .8
Year Lag1 Lag2 Lag3
Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580
Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410
Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472
3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090
Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
Lag4 Lag5 Volume Today
Min. :-18.1950 Min. :-18.1950 Min. :0.08747 Min. :-18.1950
1st Qu.: -1.1580 1st Qu.: -1.1660 1st Qu.:0.33202 1st Qu.: -1.1540
Median : 0.2380 Median : 0.2340 Median :1.00268 Median : 0.2410
Mean : 0.1458 Mean : 0.1399 Mean :1.57462 Mean : 0.1499
3rd Qu.: 1.4090 3rd Qu.: 1.4050 3rd Qu.:2.05373 3rd Qu.: 1.4050
Max. : 12.0260 Max. : 12.0260 Max. :9.32821 Max. : 12.0260
Direction
Down:484
Up :605
ggpairs(Weekly, progress =FALSE)
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
From the model summary we see that only Lag2 is statistically significant with a P value of .0296. Holding all other variables constant one unit increase in Lag2, the odds of the market moving up increase by \(6.02\%\approx (e^{.05844}-1)*100\).
log_model <-glm(Direction ~ . -Year - Today, data = Weekly, family ="binomial")summary(log_model)
Call:
glm(formula = Direction ~ . - Year - Today, family = "binomial",
data = Weekly)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.26686 0.08593 3.106 0.0019 **
Lag1 -0.04127 0.02641 -1.563 0.1181
Lag2 0.05844 0.02686 2.175 0.0296 *
Lag3 -0.01606 0.02666 -0.602 0.5469
Lag4 -0.02779 0.02646 -1.050 0.2937
Lag5 -0.01447 0.02638 -0.549 0.5833
Volume -0.02274 0.03690 -0.616 0.5377
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1496.2 on 1088 degrees of freedom
Residual deviance: 1486.4 on 1082 degrees of freedom
AIC: 1500.4
Number of Fisher Scoring iterations: 4
set.seed(42)partitions<-resample_partition(Weekly, c(train=.8, test=.2))train_df<-as.data.frame(partitions$train)test_df<-as.data.frame(partitions$test)train_df|>tbl_summary( by = Direction, include =c(Lag1, Lag2, Lag3, Lag4, Lag5, Volume))|>add_p()
Characteristic
Down
N = 3911
Up
N = 4801
p-value2
Lag1
0.37 (-0.94, 1.59)
0.13 (-1.21, 1.21)
0.033
Lag2
0.15 (-1.45, 1.32)
0.33 (-0.98, 1.44)
0.036
Lag3
0.24 (-1.16, 1.41)
0.23 (-1.07, 1.45)
>0.9
Lag4
0.34 (-1.00, 1.52)
0.24 (-1.15, 1.32)
0.2
Lag5
0.33 (-1.11, 1.44)
0.13 (-1.18, 1.32)
0.3
Volume
1.08 (0.35, 1.89)
0.94 (0.33, 2.23)
0.6
1 Median (Q1, Q3)
2 Wilcoxon rank sum test
From the corrected labeling as positive = “Up”, we see that the matrix points at a false positive having a count. This would cause a client to purchase a stock that will eventually lose value and lose money altogether. Sensitivity has a high percentage of \(81\%\) meaning that a large percentage of Up days were successfully identified. Unfortunately with a specificity of \(15\%\) this points to many Down days not being identified, reinforced by the matrix count.
From the output below all models performed equally as bad at around \(54\%\) accuracy. Having to choose a specific model I would choose to optimize the KNN. Though it has the lowest accuracy of \(44\%\) it does have the advantage of having a high specificity.
MODEL : Log Reg
Reference
Prediction Down Up
Down 7 4
Up 87 100
Overall Accuracy: 0.5404
Test Error: 0.4596
Sensitivity: 0.9615
Specificity: 0.0745
MODEL : LDA
Reference
Prediction Down Up
Down 7 4
Up 87 100
Overall Accuracy: 0.5404
Test Error: 0.4596
Sensitivity: 0.9615
Specificity: 0.0745
MODEL : Qda
Reference
Prediction Down Up
Down 0 0
Up 94 104
Overall Accuracy: 0.5253
Test Error: 0.4747
Sensitivity: 1
Specificity: 0
MODEL : Bayes
Reference
Prediction Down Up
Down 0 0
Up 94 104
Overall Accuracy: 0.5253
Test Error: 0.4747
Sensitivity: 1
Specificity: 0
MODEL : KNN
Reference
Prediction Down Up
Down 42 55
Up 52 49
Overall Accuracy: 0.4596
Test Error: 0.5404
Sensitivity: 0.4712
Specificity: 0.4468
MODEL : Optimal - KNN
Reference
Prediction Down Up
Down 27 52
Up 67 52
Overall Accuracy: 0.399
Test Error: 0.601
Sensitivity: 0.5
Specificity: 0.2872
14
The auto data set is loaded and Cylinders and Origin were changed to type Factor since Cylinders (3-8) is categorical on the number of cylinders an engine has and Origin is a location (1-3). Once the mpg01 indicator was created there was no need for mpg since it will be directly highly correlated to the indicator variable. From the pairs plot one can see that cylinders looks like a good candidate as it is able to differentiate quite well between the mpg01 categories. Displacement, horsepower and weight are also showing clear differences of mean among between the two groups of mpg01 which could indicate good predictors.
attach(Auto)
The following object is masked from package:lubridate:
origin
The following object is masked from package:ggplot2:
mpg
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
set.seed(42)auto_partition<-resample_partition(AutoClean, c(train=.8, test=.2))auto_train_df<-as.data.frame(auto_partition$train)auto_test_df<-as.data.frame(auto_partition$test)auto_train_df|>tbl_summary( by = mpg01, include =c(origin, cylinders, displacement, horsepower, weight, acceleration))|>add_p()
Characteristic
0
N = 1561
1
N = 1571
p-value2
origin
<0.001
1
135 (87%)
58 (37%)
2
12 (7.7%)
42 (27%)
3
9 (5.8%)
57 (36%)
cylinders
<0.001
3
3 (1.9%)
1 (0.6%)
4
18 (12%)
143 (91%)
5
1 (0.6%)
1 (0.6%)
6
55 (35%)
10 (6.4%)
8
79 (51%)
2 (1.3%)
displacement
260 (225, 350)
105 (91, 122)
<0.001
horsepower
121 (100, 150)
76 (67, 88)
<0.001
weight
3,617 (3,094, 4,132)
2,220 (2,020, 2,595)
<0.001
acceleration
14.50 (13.00, 16.30)
16.40 (14.70, 17.80)
<0.001
1 n (%); Median (Q1, Q3)
2 Pearson’s Chi-squared test; Fisher’s exact test; Wilcoxon rank sum test
The QDA had the lowest test error .05 out of all models. The KNN model for various K had K=3 performing the best with a test error of : .08
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
cat('***********OPTION 1 FEATURES ',paste0(all_of(opt_1)),'***********')
Warning: Using `all_of()` outside of a selecting function was deprecated in tidyselect
1.2.0.
ℹ See details at
<https://tidyselect.r-lib.org/reference/faq-selection-context.html>
***********OPTION 1 FEATURES zn nox dis tax black medv ***********
For both models KNN where K=3 proved to be the best choice of model regardless of the feature set chosen. Sensitivity in option 2 of features was pushed to 98%.