mtcars
#Load libraries
library(psych)
library(tidyverse)
library(tidymodels)
library(vip)
library(ISLR2)
# Load up dataset
data("mtcars")
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
#summary data
sum <- apply(mtcars,2,summary)
#calculate standard deviation
sds <- apply(mtcars,2,sd)
#calculate standard error
se <- sds/sqrt(2)
#combine to have the descriptive statistics
descriptive <- rbind(sum,sds,se)
descriptive <- t(descriptive)
descriptive
## Min. 1st Qu. Median Mean 3rd Qu. Max. sds se
## mpg 10.400 15.42500 19.200 20.090625 22.80 33.900 6.0269481 4.2616958
## cyl 4.000 4.00000 6.000 6.187500 8.00 8.000 1.7859216 1.2628373
## disp 71.100 120.82500 196.300 230.721875 326.00 472.000 123.9386938 87.6378909
## hp 52.000 96.50000 123.000 146.687500 180.00 335.000 68.5628685 48.4812692
## drat 2.760 3.08000 3.695 3.596563 3.92 4.930 0.5346787 0.3780750
## wt 1.513 2.58125 3.325 3.217250 3.61 5.424 0.9784574 0.6918739
## qsec 14.500 16.89250 17.710 17.848750 18.90 22.900 1.7869432 1.2635597
## vs 0.000 0.00000 0.000 0.437500 1.00 1.000 0.5040161 0.3563932
## am 0.000 0.00000 0.000 0.406250 1.00 1.000 0.4989909 0.3528399
## gear 3.000 3.00000 4.000 3.687500 4.00 5.000 0.7378041 0.5217063
## carb 1.000 2.00000 2.000 2.812500 4.00 8.000 1.6152000 1.1421189
mpg
and other
predictors using visualization. Based on the plot, which predictors may
have transformations like log(x), sqrt(x) or
x^2.# Scatterplot between mpg and the other predictors
par(mfrow = c(3, 4))
for (i in 2:ncol(mtcars)) {
par(mar = c(3, 3, 2, 2)) # reduce the margin size
plot(mtcars$mpg, mtcars[,i], main = paste("mpg vs.", names(mtcars)[i]))
}
Because we use the linear model for our data, to identify whether we
have to transform our data or not, we check whether the linearity
between mpg with predictors is violated or not.
To check the violation of linearity, we use the diagnostic plots including:
1.Residuals vs. Fitted Plot: This plot shows the relationship between the fitted values (predicted values) and the residuals (difference between observed and predicted values). It is used to check if the assumption of linearity is valid. If there is a clear pattern or curve in the plot, it suggests that the linearity assumption has been violated.
2.Normal Q-Q Plot: This plot shows if the residuals are normally distributed. If the residuals fall close to the diagonal line, then the assumption of normality is met.
3.Scale-Location Plot: This plot checks if the residuals are spread out equally along the range of predictors. If the plot shows a horizontal line, then the assumption of equal variance is met.
4.Residuals vs. Leverage Plot: This plot shows the leverage of each observation and the corresponding residuals. High leverage points are those that have extreme values of predictors, and can have a large impact on the regression line. This plot is used to identify any influential points, which are those that have both high leverage and high residuals.
cyl
mtcars %>% ggplot(aes(mpg,cyl)) + geom_point() + geom_smooth(method= "lm")
par(mfrow=c(2,2))
cyl_lm <- lm(mpg~cyl,data= mtcars)
plot(cyl_lm)
Because
cyl
is categorical data, there is only weak linear
relation between cyl
and mpg.
The
transformation of cyl
into other kinds cannot help to fit
the linearity relationship, even make it more complicated for the model.
Hence, we do not need to transfer the data. Other model or dummy
variable may better fit to consider the relation between
cyl
and mpg
disp
par(mfrow=c(2,1))
hist(mtcars$disp)
hist(log(mtcars$disp))
par(mfrow=c(2,2))
disp_lm <- lm(mpg~log(disp),data= mtcars)
plot(disp_lm)
Because
disp
is continuous data, I check the distribution
of the data. I see that disp
does not present a normal
distribution. Hence, I transform data disp
into
log(disp)
. After transformation, I check the linearity
validation between mpg
and log(disp)
. The
diagnostic plots show a linear relation between mpg
and
log(disp)
hp
par(mfrow=c(2,1))
hist(mtcars$hp)
hist(log(mtcars$hp))
par(mfrow=c(2,2))
hp_lm <- lm(mpg~log(hp),data= mtcars)
plot(hp_lm)
Same with disp
, after we check the distribution of
hp
data, we can see a huge improvement after transforming
hp
to log(hp)
. The linearity validation
between mpg
and log(hp)
is also confirmed
through 4 diagnostic charts when observations are followed the straight
line in Normal Q-Q and the residuals are spread out equally along the
range of predictors in other 3 charts.
drat
par(mfrow=c(2,2))
drat_lm <- lm(mpg~drat,data= mtcars)
plot(drat_lm)
After seeing four figures, we can see in Normal Q-Q, the observations
follow the straight line and do not go too much far from the straight
line, indicating a linear relation. Additionally, three other charts
also show a random distribution of data around red lines with no clear
pattern. Hence, we do not need to transform
drat.
wt
par(mfrow=c(2,2))
wt_lm <- lm(mpg~wt,data= mtcars)
plot(wt_lm)
From the 4 charts we can see a linear relation between
wt
and mpg
; hence, we do not need to transform wt
data.
qsec
par(mfrow=c(2,2))
qsec_lm <- lm(mpg~qsec,data= mtcars)
plot(qsec_lm)
From the 4 charts we can see a linear relation between
qsec
and mpg
; hence, we do not need to transform
qsec
data.
vs
mtcars %>% ggplot(aes(mpg,vs)) + geom_point() + geom_smooth(method= "lm")
par(mfrow=c(2,2))
vs_lm <- lm(mpg~vs,data= mtcars)
plot(vs_lm)
Same with
cyl
, vs
is categorical data type so
it does not present a perfect linearity. However, looking at the
diagnostic plots, it is acceptable to use linear relation between these
two variables.
am
mtcars %>% ggplot(aes(mpg,am)) + geom_point() + geom_smooth(method= "lm")
par(mfrow=c(2,2))
am_lm <- lm(mpg~am,data= mtcars)
plot(am_lm)
Same analysis with
vs
is pointed out for am
predictor.
gear
par(mfrow=c(2,2))
gear_lm <- lm(mpg~gear,data= mtcars)
plot(gear_lm)
Same analysis with
vs
and am
is pointed out
for gear
predictor.
carb
par(mfrow=c(1,2))
hist(mtcars$carb)
hist(log(mtcars$carb))
par(mfrow=c(2,2))
carb_lm <- lm(mpg~log(carb),data= mtcars)
plot(carb_lm)
Because the distribution of
carb
is not normal, I take the
logarit to normalize the data. Then, we check the linearity between
carb
and mpg
through diagnostic plots. The
linearity is confirmed after data transformation.
Among predictors, I take logarit for disp
,
hp
, and carb
to transform them into more
predictable. I keep other variables as their original forms.
mpg
across all
predictors and show the estimated resultslin_reg <- lm(mpg ~ ., data = mtcars)
summary(lin_reg)
##
## Call:
## lm(formula = mpg ~ ., data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4506 -1.6044 -0.1196 1.2193 4.6271
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.30337 18.71788 0.657 0.5181
## cyl -0.11144 1.04502 -0.107 0.9161
## disp 0.01334 0.01786 0.747 0.4635
## hp -0.02148 0.02177 -0.987 0.3350
## drat 0.78711 1.63537 0.481 0.6353
## wt -3.71530 1.89441 -1.961 0.0633 .
## qsec 0.82104 0.73084 1.123 0.2739
## vs 0.31776 2.10451 0.151 0.8814
## am 2.52023 2.05665 1.225 0.2340
## gear 0.65541 1.49326 0.439 0.6652
## carb -0.19942 0.82875 -0.241 0.8122
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared: 0.869, Adjusted R-squared: 0.8066
## F-statistic: 13.93 on 10 and 21 DF, p-value: 3.793e-07
As we can see from the p-value, none of the variables has
statistically significant impact on the variable mpg
,
indicating that there’s some problems with the chosen model for
predicting mpg
A VIF value of 1 indicates no multicollinearity, while values greater than 1 suggest increasing levels of multicollinearity. Generally, a VIF value greater than 5 or 10 is considered high and may indicate that the corresponding predictor variable should be removed from the model.
car::vif(lin_reg)
## cyl disp hp drat wt qsec vs am
## 15.373833 21.620241 9.832037 3.374620 15.164887 7.527958 4.965873 4.648487
## gear carb
## 5.357452 7.908747
From the result, we can see that there are multicollinearity problems among predictors in our model: 3 variables having vif > 10 and 7 variables having vif > 5.
3 variables cyl
, disp
, and wt
having vif larger than 10.
disp
, and (2) excluding disp
and
cyl
from predictors. Are there any improvements observed
from the regression results?disp
reg1 <- lm(mpg ~ . - disp, data = mtcars)
summary(reg1)
##
## Call:
## lm(formula = mpg ~ . - disp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7863 -1.4055 -0.2635 1.2029 4.4753
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.55052 18.52585 0.677 0.5052
## cyl 0.09627 0.99715 0.097 0.9240
## hp -0.01295 0.01834 -0.706 0.4876
## drat 0.92864 1.60794 0.578 0.5694
## wt -2.62694 1.19800 -2.193 0.0392 *
## qsec 0.66523 0.69335 0.959 0.3478
## vs 0.16035 2.07277 0.077 0.9390
## am 2.47882 2.03513 1.218 0.2361
## gear 0.74300 1.47360 0.504 0.6191
## carb -0.61686 0.60566 -1.018 0.3195
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.623 on 22 degrees of freedom
## Multiple R-squared: 0.8655, Adjusted R-squared: 0.8105
## F-statistic: 15.73 on 9 and 22 DF, p-value: 1.183e-07
After removing disp
as a predictor, the linear
regression result shows that wt
is a significant variable
and the adjusted R-squared increased from 0.8066 to 0.8105
disp
and cyl
reg2 <- lm(mpg ~ . -disp -cyl, data = mtcars)
summary(reg2)
##
## Call:
## lm(formula = mpg ~ . - disp - cyl, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8187 -1.3903 -0.3045 1.2269 4.5183
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.80810 12.88582 1.072 0.2950
## hp -0.01225 0.01649 -0.743 0.4650
## drat 0.88894 1.52061 0.585 0.5645
## wt -2.60968 1.15878 -2.252 0.0342 *
## qsec 0.63983 0.62752 1.020 0.3185
## vs 0.08786 1.88992 0.046 0.9633
## am 2.42418 1.91227 1.268 0.2176
## gear 0.69390 1.35294 0.513 0.6129
## carb -0.61286 0.59109 -1.037 0.3106
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.566 on 23 degrees of freedom
## Multiple R-squared: 0.8655, Adjusted R-squared: 0.8187
## F-statistic: 18.5 on 8 and 23 DF, p-value: 2.627e-08
After removing disp
and cyl
as predictors,
the linear regression result shows that wt
is a significant
variable and the adjusted R-squared increased from 0.8066 to 0.8187
Out of the 3 models, it seems like the reg2
model
provides best fit to the value mpg
Carseats
Sales
using Price
, Urban
, and US
data("Carseats")
lm.fit <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(lm.fit)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
\(Sales = 13.04 - (0.05*Price) -
(0.02*Urban) + (1.2*US)\)
+ Urban = 1 if the store is in an urban location and 0 otherwise
+ US = 1 if the store is based in the US and 0 otherwise
Based on the p-value, we can reject the null hypothesis for the following predictors: Price and US
lm.fit2 <- lm(Sales ~ Price + US, data = Carseats)
summary(lm.fit2)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
We can compare fits by comparing some diagnostics that are available
at the bottom of the summary()
output.
lm.fit | lm.fit2 | Better Fit | |
---|---|---|---|
Residual standard error | 2.472 | 2.469 | lm.fit2 |
Multiple R-squared | 0.2393 | 0.2393 | draw |
Adjusted R-squared | 0.2335 | 0.2354 | lm.fit2 |
F-statistic, p-value | 41.52, < 2e-16 | 62.43, < 2e_16 | lm.fit2 |
So the model from (e),
lm.fit2
, seems to be a better fit.
confint(lm.fit2, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
plot(lm.fit2, which = 5)
> From the plots, there is no evidence that there exists such
points.
library(quantmod)
library(TTR)
library(xts)
#Load the data
symbols <- "SPY"
portfolioPrices <- NULL
for (symbol in symbols) {
SP500 <- cbind(portfolioPrices,
getSymbols.yahoo(symbol,
from = '2019-01-01',
to = '2023-12-31',
auto.assign = FALSE)[,6])
}
colnames(SP500) <- "Price"
head(SP500)
## Price
## 2019-01-02 229.8434
## 2019-01-03 224.3586
## 2019-01-04 231.8737
## 2019-01-07 233.7020
## 2019-01-08 235.8977
## 2019-01-09 237.0001
# SMA
sma5 = lag(SMA(SP500, n = 5)) #notice the use of the lag function to take lagged values
# EMA
ema5 = lag(EMA(SP500, n = 5))
# MACD
macd1 = lag(MACD(SP500))
# RSI
rsi1 = lag(RSI(SP500, 5))
# log returns
ret1 = lag(dailyReturn(SP500, type = "log"))
# price director indicator
dir = ifelse(SP500$Price >= lag(SP500$Price, 5), 1, 0) #direction variable compared to 5 day before price
d_ex1 = cbind(dir, ret1, sma5, ema5, macd1, rsi1)
# change column names
colnames(d_ex1) = c("Direction", "Ret", "SMA", "EMA", "MACD", "Signal", "RSI")
quantmod
packagechartSeries(SP500, theme = "white", name = "S&P500 Closing Prices and Indicators")
addTA(d_ex1[, 1], col = 1,legend = "Direction" )
addTA(d_ex1[, -1], on = NA, col = rainbow(6), legend = TRUE)
* The above graph can be improved using ggplot2
library(tidyr)
library(ggplot2)
# create a dataset and convert data to long
d_plot = merge.xts(SP500, d_ex1)
# remove NAs and then convert to long
d_plot = na.omit(d_plot)
# convert to dataframe
d_plot = data.frame(Date = index(d_plot), coredata(d_plot))
d_plot_long = pivot_longer(d_plot, -c(Date, Direction), values_to = "value",
names_to = "Indicator")
# change direction to a factor
d_plot_long$Direction = as.factor(d_plot_long$Direction)
(p2_ex = ggplot(d_plot_long, aes(Date, value, color = Indicator)) + geom_path(stat = "identity") +
facet_grid(Indicator ~ ., scale = "free") + theme_minimal())
* The above is a line chart of indicators and prices. We can also create
a box plot to visualise the variance in the values relative to the
direction
p2_ex = ggplot(d_plot_long, aes(value, Indicator, fill = Direction)) +
geom_boxplot()
p2_ex + theme_minimal() + labs(title = "TA Indicators vs Price Direction") +
scale_fill_manual(name = "Price Direction", values = c("orange", "lightblue"))
* Some differences per category can be noticed
# remove NAs
d_ex1 = na.omit(d_ex1)
# convert to data frame
d_ex1 = as.data.frame(d_ex1)
# convert direction to a factor for classification
d_ex1$Direction = as.factor(d_ex1$Direction)
idx1 = c(1:round(nrow(d_ex1) * 0.7)) #create index for first 70% values to be in the testing set
d_train1 = d_ex1[idx1, ] #training set
d_test1 = d_ex1[-idx1, ] #testing set
Training Setup
library(caret)
set.seed(999)
# control
cntrl1 = trainControl(method = "timeslice", initialWindow = 250, horizon = 30,
fixedWindow = TRUE)
# preprocesing
prep1 = c("center", "scale")
# logistic regression
logit_ex1 = train(Direction ~ ., data = d_train1, method = "glm", family = "binomial",
trControl = cntrl1, preProcess = prep1)
logit_ex1 #final model accuracy
## Generalized Linear Model
##
## 857 samples
## 6 predictor
## 2 classes: '0', '1'
##
## Pre-processing: centered (6), scaled (6)
## Resampling: Rolling Forecasting Origin Resampling (30 held-out with a fixed window)
## Summary of sample sizes: 250, 250, 250, 250, 250, 250, ...
## Resampling results:
##
## Accuracy Kappa
## 0.8212226 0.5973125
summary(logit_ex1$finalModel) #summary of the final model
##
## Call:
## NULL
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.7096 0.1047 6.775 1.24e-11 ***
## Ret -0.1581 0.1321 -1.197 0.23126
## SMA -70.4855 9.8538 -7.153 8.48e-13 ***
## EMA 70.4190 9.8455 7.152 8.52e-13 ***
## MACD 1.3354 0.5435 2.457 0.01400 *
## Signal -1.4259 0.4964 -2.872 0.00408 **
## RSI 1.7737 0.2015 8.802 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1139.51 on 856 degrees of freedom
## Residual deviance: 610.51 on 850 degrees of freedom
## AIC: 624.51
##
## Number of Fisher Scoring iterations: 6
library(vip)
vip(logit_ex1, geom = "point") + theme_minimal()
Predictive Accuracy
* ML models should be checked for predictive accuracy on the test
set
* caret
provides predict
function to create
predictions
* These predictions can be assessed based on the confusion matrix
Confusion Matrix - When applying classification
models, we often use a confusion matrix to evaluate certain performance
measures.
* A confusion matrix is simply a matrix that compares actual categorical
levels (or events) to the predicted categorical levels.
* Prediction of the right level; refer to this as a true positive.
* Prediction of a level or event that did not happen this is called a
false positive (i.e. we predicted a customer would redeem a coupon and
they did not).
* Alternatively, when we do not predict a level or event and it does
happen that this is called a false negative (i.e. a customer that we did
not predict to redeem a coupon does).
* Let’s predict using the final model and assess using the confusion
matrix.
pred1 = predict(logit_ex1, newdata = d_test1) #prediction on the test data
confusionMatrix(data = pred1, reference = d_test1$Direction)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 117 28
## 1 32 190
##
## Accuracy : 0.8365
## 95% CI : (0.7946, 0.8729)
## No Information Rate : 0.594
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.6596
##
## Mcnemar's Test P-Value : 0.6985
##
## Sensitivity : 0.7852
## Specificity : 0.8716
## Pos Pred Value : 0.8069
## Neg Pred Value : 0.8559
## Prevalence : 0.4060
## Detection Rate : 0.3188
## Detection Prevalence : 0.3951
## Balanced Accuracy : 0.8284
##
## 'Positive' Class : 0
##