Question 1: Answer the following using the dataset `mtcars`

(a) Load up data mtcars and generate its descriptive statistics

#Load libraries
library(psych)
library(tidyverse)
library(tidymodels)
library(vip)
library(ISLR2)

# Load up dataset
data("mtcars")
head(mtcars)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

#summary data
sum <- apply(mtcars,2,summary)

#calculate standard deviation
sds <- apply(mtcars,2,sd)

#calculate standard error
se <- sds/sqrt(2)

#combine to have the descriptive statistics
descriptive <- rbind(sum,sds,se)
descriptive <- t(descriptive)
descriptive

##        Min.   1st Qu.  Median       Mean 3rd Qu.    Max.         sds         se
## mpg  10.400  15.42500  19.200  20.090625   22.80  33.900   6.0269481  4.2616958
## cyl   4.000   4.00000   6.000   6.187500    8.00   8.000   1.7859216  1.2628373
## disp 71.100 120.82500 196.300 230.721875  326.00 472.000 123.9386938 87.6378909
## hp   52.000  96.50000 123.000 146.687500  180.00 335.000  68.5628685 48.4812692
## drat  2.760   3.08000   3.695   3.596563    3.92   4.930   0.5346787  0.3780750
## wt    1.513   2.58125   3.325   3.217250    3.61   5.424   0.9784574  0.6918739
## qsec 14.500  16.89250  17.710  17.848750   18.90  22.900   1.7869432  1.2635597
## vs    0.000   0.00000   0.000   0.437500    1.00   1.000   0.5040161  0.3563932
## am    0.000   0.00000   0.000   0.406250    1.00   1.000   0.4989909  0.3528399
## gear  3.000   3.00000   4.000   3.687500    4.00   5.000   0.7378041  0.5217063
## carb  1.000   2.00000   2.000   2.812500    4.00   8.000   1.6152000  1.1421189

(b) Assess the relationship between `mpg` and other predictors using visualization. Based on the plot, which predictors may have transformations like log(x), sqrt(x) or x^2.

# Scatterplot between mpg and the other predictors
par(mfrow = c(3, 4))

for (i in 2:ncol(mtcars)) {
    par(mar = c(3, 3, 2, 2)) # reduce the margin size
  plot(mtcars$mpg, mtcars[,i], main = paste("mpg vs.", names(mtcars)[i]))
}

Because we use the linear model for our data, to identify whether we have to transform our data or not, we check whether the linearity between mpg with predictors is violated or not.

To check the violation of linearity, we use the diagnostic plots including:

1.Residuals vs. Fitted Plot: This plot shows the relationship between the fitted values (predicted values) and the residuals (difference between observed and predicted values). It is used to check if the assumption of linearity is valid. If there is a clear pattern or curve in the plot, it suggests that the linearity assumption has been violated.

2.Normal Q-Q Plot: This plot shows if the residuals are normally distributed. If the residuals fall close to the diagonal line, then the assumption of normality is met.

3.Scale-Location Plot: This plot checks if the residuals are spread out equally along the range of predictors. If the plot shows a horizontal line, then the assumption of equal variance is met.

4.Residuals vs. Leverage Plot: This plot shows the leverage of each observation and the corresponding residuals. High leverage points are those that have extreme values of predictors, and can have a large impact on the regression line. This plot is used to identify any influential points, which are those that have both high leverage and high residuals.

Variable: `cyl`

mtcars %>% ggplot(aes(mpg,cyl)) + geom_point() + geom_smooth(method= "lm")

par(mfrow=c(2,2))
cyl_lm <- lm(mpg~cyl,data= mtcars)
plot(cyl_lm)

Because cyl is categorical data, there is only weak linear relation between cyl and mpg. The transformation of cyl into other kinds cannot help to fit the linearity relationship, even make it more complicated for the model. Hence, we do not need to transfer the data. Other model or dummy variable may better fit to consider the relation between cyl and mpg

Variable: `disp`

par(mfrow=c(2,1))
hist(mtcars$disp)
hist(log(mtcars$disp))

par(mfrow=c(2,2))
disp_lm <- lm(mpg~log(disp),data= mtcars)
plot(disp_lm)

Because disp is continuous data, I check the distribution of the data. I see that disp does not present a normal distribution. Hence, I transform data disp into log(disp). After transformation, I check the linearity validation between mpg and log(disp). The diagnostic plots show a linear relation between mpg and log(disp)

Variable: `hp`

par(mfrow=c(2,1))
hist(mtcars$hp)
hist(log(mtcars$hp))

par(mfrow=c(2,2))
hp_lm <- lm(mpg~log(hp),data= mtcars)
plot(hp_lm)

Same with disp, after we check the distribution of hp data, we can see a huge improvement after transforming hp to log(hp). The linearity validation between mpg and log(hp) is also confirmed through 4 diagnostic charts when observations are followed the straight line in Normal Q-Q and the residuals are spread out equally along the range of predictors in other 3 charts.

Variable: `drat`

par(mfrow=c(2,2))
drat_lm <- lm(mpg~drat,data= mtcars)
plot(drat_lm)

After seeing four figures, we can see in Normal Q-Q, the observations follow the straight line and do not go too much far from the straight line, indicating a linear relation. Additionally, three other charts also show a random distribution of data around red lines with no clear pattern. Hence, we do not need to transform drat.

Variable: `wt`

par(mfrow=c(2,2))
wt_lm <- lm(mpg~wt,data= mtcars)
plot(wt_lm)

From the 4 charts we can see a linear relation between wt and mpg; hence, we do not need to transform wt data.

Variable: `qsec`

par(mfrow=c(2,2))
qsec_lm <- lm(mpg~qsec,data= mtcars)
plot(qsec_lm)

From the 4 charts we can see a linear relation between qsec and mpg; hence, we do not need to transform qsec data.

Variable: `vs`

mtcars %>% ggplot(aes(mpg,vs)) + geom_point() + geom_smooth(method= "lm")

par(mfrow=c(2,2))
vs_lm <- lm(mpg~vs,data= mtcars)
plot(vs_lm)

Same with cyl, vs is categorical data type so it does not present a perfect linearity. However, looking at the diagnostic plots, it is acceptable to use linear relation between these two variables.

Variable: `am`

mtcars %>% ggplot(aes(mpg,am)) + geom_point() + geom_smooth(method= "lm")

par(mfrow=c(2,2))
am_lm <- lm(mpg~am,data= mtcars)
plot(am_lm)

Same analysis with vs is pointed out for am predictor.

Variable: `gear`

par(mfrow=c(2,2))
gear_lm <- lm(mpg~gear,data= mtcars)
plot(gear_lm)

Same analysis with vs and am is pointed out for gear predictor.

Variable: `carb`

par(mfrow=c(1,2))
hist(mtcars$carb)
hist(log(mtcars$carb))

par(mfrow=c(2,2))
carb_lm <- lm(mpg~log(carb),data= mtcars)
plot(carb_lm)

Because the distribution of carb is not normal, I take the logarit to normalize the data. Then, we check the linearity between carb and mpg through diagnostic plots. The linearity is confirmed after data transformation.

In conclusion:

Among predictors, I take logarit for disp, hp, and carb to transform them into more predictable. I keep other variables as their original forms.

(c). Run the multiple regression on `mpg` across all predictors and show the estimated results

lin_reg <- lm(mpg ~ ., data = mtcars)
summary(lin_reg)

## 
## Call:
## lm(formula = mpg ~ ., data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4506 -1.6044 -0.1196  1.2193  4.6271 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 12.30337   18.71788   0.657   0.5181  
## cyl         -0.11144    1.04502  -0.107   0.9161  
## disp         0.01334    0.01786   0.747   0.4635  
## hp          -0.02148    0.02177  -0.987   0.3350  
## drat         0.78711    1.63537   0.481   0.6353  
## wt          -3.71530    1.89441  -1.961   0.0633 .
## qsec         0.82104    0.73084   1.123   0.2739  
## vs           0.31776    2.10451   0.151   0.8814  
## am           2.52023    2.05665   1.225   0.2340  
## gear         0.65541    1.49326   0.439   0.6652  
## carb        -0.19942    0.82875  -0.241   0.8122  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared:  0.869,  Adjusted R-squared:  0.8066 
## F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07

As we can see from the p-value, none of the variables has statistically significant impact on the variable mpg, indicating that there’s some problems with the chosen model for predicting mpg

(d) Using car::vif to estimate if there is any multicollinearity among predictors. Find predictors with vif higher than 10.

A VIF value of 1 indicates no multicollinearity, while values greater than 1 suggest increasing levels of multicollinearity. Generally, a VIF value greater than 5 or 10 is considered high and may indicate that the corresponding predictor variable should be removed from the model.

car::vif(lin_reg)

##       cyl      disp        hp      drat        wt      qsec        vs        am 
## 15.373833 21.620241  9.832037  3.374620 15.164887  7.527958  4.965873  4.648487 
##      gear      carb 
##  5.357452  7.908747

From the result, we can see that there are multicollinearity problems among predictors in our model: 3 variables having vif > 10 and 7 variables having vif > 5.

3 variables cyl, disp, and wt having vif larger than 10.

(e) Rerun the multiple regressions by (1) excluding `disp`, and (2) excluding `disp` and `cyl` from predictors. Are there any improvements observed from the regression results?

1. Excluding `disp`

reg1 <- lm(mpg ~ . - disp, data = mtcars)
summary(reg1)

## 
## Call:
## lm(formula = mpg ~ . - disp, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7863 -1.4055 -0.2635  1.2029  4.4753 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 12.55052   18.52585   0.677   0.5052  
## cyl          0.09627    0.99715   0.097   0.9240  
## hp          -0.01295    0.01834  -0.706   0.4876  
## drat         0.92864    1.60794   0.578   0.5694  
## wt          -2.62694    1.19800  -2.193   0.0392 *
## qsec         0.66523    0.69335   0.959   0.3478  
## vs           0.16035    2.07277   0.077   0.9390  
## am           2.47882    2.03513   1.218   0.2361  
## gear         0.74300    1.47360   0.504   0.6191  
## carb        -0.61686    0.60566  -1.018   0.3195  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.623 on 22 degrees of freedom
## Multiple R-squared:  0.8655, Adjusted R-squared:  0.8105 
## F-statistic: 15.73 on 9 and 22 DF,  p-value: 1.183e-07

After removing disp as a predictor, the linear regression result shows that wt is a significant variable and the adjusted R-squared increased from 0.8066 to 0.8105

2. Excluding `disp` and `cyl`

reg2 <- lm(mpg ~ . -disp -cyl, data = mtcars)
summary(reg2)

## 
## Call:
## lm(formula = mpg ~ . - disp - cyl, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8187 -1.3903 -0.3045  1.2269  4.5183 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 13.80810   12.88582   1.072   0.2950  
## hp          -0.01225    0.01649  -0.743   0.4650  
## drat         0.88894    1.52061   0.585   0.5645  
## wt          -2.60968    1.15878  -2.252   0.0342 *
## qsec         0.63983    0.62752   1.020   0.3185  
## vs           0.08786    1.88992   0.046   0.9633  
## am           2.42418    1.91227   1.268   0.2176  
## gear         0.69390    1.35294   0.513   0.6129  
## carb        -0.61286    0.59109  -1.037   0.3106  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.566 on 23 degrees of freedom
## Multiple R-squared:  0.8655, Adjusted R-squared:  0.8187 
## F-statistic:  18.5 on 8 and 23 DF,  p-value: 2.627e-08

After removing disp and cyl as predictors, the linear regression result shows that wt is a significant variable and the adjusted R-squared increased from 0.8066 to 0.8187

In conclusion:

Out of the 3 models, it seems like the reg2 model provides best fit to the value mpg

Question 2: Answer the following using the dataset `Carseats`

(a) Fit a multiple regression model to predict `Sales` using `Price`, `Urban`, and `US`

data("Carseats")
lm.fit <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(lm.fit)

## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

(b) Provide an interpretation of each coefcient in the model.

On average, when the price of the child car seats increases by $1 then sales will drop by about $54, ceteris paribus.
There is not a statistically significant difference in sales between stores in urban location and stores in rural location, ceteris paribus.
On average, if a store is based in the US, sales will increase by $12,000, ceteris paribus.

(c) Write out the model in equation form

$Sales = 13.04 - (0.05*Price) - (0.02*Urban) + (1.2*US)$
+ Urban = 1 if the store is in an urban location and 0 otherwise
+ US = 1 if the store is based in the US and 0 otherwise

(d) For which of the predictors can you reject the null hypothesis $H_0$ : $β_j$ = 0?

Based on the p-value, we can reject the null hypothesis for the following predictors: Price and US

(e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

lm.fit2 <- lm(Sales ~ Price + US, data = Carseats)
summary(lm.fit2)

## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

(f) How well do the models in (a) and (e) fit the data?

We can compare fits by comparing some diagnostics that are available at the bottom of the summary() output.

	lm.fit	lm.fit2	Better Fit
Residual standard error	2.472	2.469	lm.fit2
Multiple R-squared	0.2393	0.2393	draw
Adjusted R-squared	0.2335	0.2354	lm.fit2
F-statistic, p-value	41.52, < 2e-16	62.43, < 2e_16	lm.fit2

So the model from (e), lm.fit2, seems to be a better fit.

g) Using the model from (e), obtain 95 % confidence intervals for the coeficient(s).

confint(lm.fit2, level = 0.95)

##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632

h) Is there evidence of outliers or high leverage observations in the model from (e)?

plot(lm.fit2, which = 5)

> From the plots, there is no evidence that there exists such points.

Question 3: Use data of the S&P500 daily index starting from 2019 to 2023

14.3.1 Data & Indicators

library(quantmod)
library(TTR)
library(xts)

#Load the data
symbols <- "SPY"
portfolioPrices <- NULL
for (symbol in symbols) {
  SP500 <- cbind(portfolioPrices, 
                           getSymbols.yahoo(symbol,
                                            from = '2019-01-01',
                                            to = '2023-12-31',
                                            auto.assign = FALSE)[,6])
}

colnames(SP500) <- "Price"

head(SP500)

##               Price
## 2019-01-02 229.8434
## 2019-01-03 224.3586
## 2019-01-04 231.8737
## 2019-01-07 233.7020
## 2019-01-08 235.8977
## 2019-01-09 237.0001

# SMA
sma5 = lag(SMA(SP500, n = 5))  #notice the use of the lag function to take lagged values
# EMA
ema5 = lag(EMA(SP500, n = 5))
# MACD
macd1 = lag(MACD(SP500))
# RSI
rsi1 = lag(RSI(SP500, 5))
# log returns
ret1 = lag(dailyReturn(SP500, type = "log"))

# price director indicator
dir = ifelse(SP500$Price >= lag(SP500$Price, 5), 1, 0)  #direction variable compared to  5 day before price

Combine all the indicators and response variable in a data frame

d_ex1 = cbind(dir, ret1, sma5, ema5, macd1, rsi1)

# change column names
colnames(d_ex1) = c("Direction", "Ret", "SMA", "EMA", "MACD", "Signal", "RSI")

14.3.2 Visualise the data

Using the quantmod package

chartSeries(SP500, theme = "white", name = "S&P500 Closing Prices and Indicators")

addTA(d_ex1[, 1], col = 1,legend = "Direction" )

addTA(d_ex1[, -1], on = NA, col = rainbow(6), legend = TRUE)

* The above graph can be improved using ggplot2

library(tidyr)
library(ggplot2)
# create a dataset and convert data to long
d_plot = merge.xts(SP500, d_ex1)
# remove NAs and then convert to long

d_plot = na.omit(d_plot)
# convert to dataframe
d_plot = data.frame(Date = index(d_plot), coredata(d_plot))
d_plot_long = pivot_longer(d_plot, -c(Date, Direction), values_to = "value",
    names_to = "Indicator")

# change direction to a factor
d_plot_long$Direction = as.factor(d_plot_long$Direction)

(p2_ex = ggplot(d_plot_long, aes(Date, value, color = Indicator)) + geom_path(stat = "identity") +
    facet_grid(Indicator ~ ., scale = "free") + theme_minimal())

* The above is a line chart of indicators and prices. We can also create a box plot to visualise the variance in the values relative to the direction

p2_ex = ggplot(d_plot_long, aes(value, Indicator, fill = Direction)) +
    geom_boxplot()

p2_ex + theme_minimal() + labs(title = "TA Indicators vs Price Direction") +
    scale_fill_manual(name = "Price Direction", values = c("orange", "lightblue"))

* Some differences per category can be noticed

14.3.3 Using Logistic Regression

70:30 data split
Time series sampling
Stratified sampling will not keep the time order and hence its avoided

# remove NAs
d_ex1 = na.omit(d_ex1)
# convert to data frame
d_ex1 = as.data.frame(d_ex1)
# convert direction to a factor for classification

d_ex1$Direction = as.factor(d_ex1$Direction)

idx1 = c(1:round(nrow(d_ex1) * 0.7))  #create index for first 70% values to be in the testing set
d_train1 = d_ex1[idx1, ]  #training set
d_test1 = d_ex1[-idx1, ]  #testing set

Training Setup

library(caret)
set.seed(999)
# control
cntrl1 = trainControl(method = "timeslice", initialWindow = 250, horizon = 30,
    fixedWindow = TRUE)
# preprocesing
prep1 = c("center", "scale")
# logistic regression
logit_ex1 = train(Direction ~ ., data = d_train1, method = "glm", family = "binomial",
    trControl = cntrl1, preProcess = prep1)
logit_ex1  #final model accuracy

## Generalized Linear Model 
## 
## 857 samples
##   6 predictor
##   2 classes: '0', '1' 
## 
## Pre-processing: centered (6), scaled (6) 
## Resampling: Rolling Forecasting Origin Resampling (30 held-out with a fixed window) 
## Summary of sample sizes: 250, 250, 250, 250, 250, 250, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.8212226  0.5973125

summary(logit_ex1$finalModel)  #summary of the final model

## 
## Call:
## NULL
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.7096     0.1047   6.775 1.24e-11 ***
## Ret          -0.1581     0.1321  -1.197  0.23126    
## SMA         -70.4855     9.8538  -7.153 8.48e-13 ***
## EMA          70.4190     9.8455   7.152 8.52e-13 ***
## MACD          1.3354     0.5435   2.457  0.01400 *  
## Signal       -1.4259     0.4964  -2.872  0.00408 ** 
## RSI           1.7737     0.2015   8.802  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1139.51  on 856  degrees of freedom
## Residual deviance:  610.51  on 850  degrees of freedom
## AIC: 624.51
## 
## Number of Fisher Scoring iterations: 6

Variable importance
In the case of multiple predictor variables, we want to understand which variable is the most influential in predicting the response variable.

library(vip)
vip(logit_ex1, geom = "point") + theme_minimal()

Predictive Accuracy
* ML models should be checked for predictive accuracy on the test set
* caret provides predict function to create predictions
* These predictions can be assessed based on the confusion matrix

Confusion Matrix - When applying classification models, we often use a confusion matrix to evaluate certain performance measures.
* A confusion matrix is simply a matrix that compares actual categorical levels (or events) to the predicted categorical levels.
* Prediction of the right level; refer to this as a true positive.
* Prediction of a level or event that did not happen this is called a false positive (i.e. we predicted a customer would redeem a coupon and they did not).
* Alternatively, when we do not predict a level or event and it does happen that this is called a false negative (i.e. a customer that we did not predict to redeem a coupon does).
* Let’s predict using the final model and assess using the confusion matrix.

pred1 = predict(logit_ex1, newdata = d_test1)  #prediction on the test data
confusionMatrix(data = pred1, reference = d_test1$Direction)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 117  28
##          1  32 190
##                                           
##                Accuracy : 0.8365          
##                  95% CI : (0.7946, 0.8729)
##     No Information Rate : 0.594           
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.6596          
##                                           
##  Mcnemar's Test P-Value : 0.6985          
##                                           
##             Sensitivity : 0.7852          
##             Specificity : 0.8716          
##          Pos Pred Value : 0.8069          
##          Neg Pred Value : 0.8559          
##              Prevalence : 0.4060          
##          Detection Rate : 0.3188          
##    Detection Prevalence : 0.3951          
##       Balanced Accuracy : 0.8284          
##                                           
##        'Positive' Class : 0               
##

The model provides decent accuracy which is higher than the No Information Rate and also statistically significant.
There are other analyses which can be conducted on logistic regression, for example, ROC (Receiver Operating Characteristic) curve analysis for the Area Under the Curve.

Midterm: Financial Software Package

PhamMinhTam

2024-04-25

Question 1: Answer the following using the dataset `mtcars`

(a) Load up data mtcars and generate its descriptive statistics

(b) Assess the relationship between `mpg` and other predictors using visualization. Based on the plot, which predictors may have transformations like log(x), sqrt(x) or x^2.

Variable: `cyl`

Variable: `disp`

Variable: `hp`

Variable: `drat`

Variable: `wt`

Variable: `qsec`

Variable: `vs`

Variable: `am`

Variable: `gear`

Variable: `carb`

In conclusion:

(c). Run the multiple regression on `mpg` across all predictors and show the estimated results

(d) Using car::vif to estimate if there is any multicollinearity among predictors. Find predictors with vif higher than 10.

(e) Rerun the multiple regressions by (1) excluding `disp`, and (2) excluding `disp` and `cyl` from predictors. Are there any improvements observed from the regression results?

1. Excluding `disp`

2. Excluding `disp` and `cyl`

In conclusion:

Question 2: Answer the following using the dataset `Carseats`

(a) Fit a multiple regression model to predict `Sales` using `Price`, `Urban`, and `US`

(b) Provide an interpretation of each coefcient in the model.

(c) Write out the model in equation form

(d) For which of the predictors can you reject the null hypothesis \(H_0\) : \(β_j\) = 0?

(e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

(f) How well do the models in (a) and (e) fit the data?

g) Using the model from (e), obtain 95 % confidence intervals for the coeficient(s).

h) Is there evidence of outliers or high leverage observations in the model from (e)?

Question 3: Use data of the S&P500 daily index starting from 2019 to 2023

14.3.1 Data & Indicators

14.3.2 Visualise the data

14.3.3 Using Logistic Regression

Midterm: Financial Software Package

PhamMinhTam

2024-04-25

Question 1: Answer the following using the dataset mtcars

(a) Load up data mtcars and generate its descriptive statistics

(b) Assess the relationship between mpg and other predictors using visualization. Based on the plot, which predictors may have transformations like log(x), sqrt(x) or x^2.

Variable: cyl

Variable: disp

Variable: hp

Variable: drat

Variable: wt

Variable: qsec

Variable: vs

Variable: am

Variable: gear

Variable: carb

In conclusion:

(c). Run the multiple regression on mpg across all predictors and show the estimated results

(d) Using car::vif to estimate if there is any multicollinearity among predictors. Find predictors with vif higher than 10.

(e) Rerun the multiple regressions by (1) excluding disp, and (2) excluding disp and cyl from predictors. Are there any improvements observed from the regression results?

1. Excluding disp

2. Excluding disp and cyl

In conclusion:

Question 2: Answer the following using the dataset Carseats

(a) Fit a multiple regression model to predict Sales using Price, Urban, and US

(b) Provide an interpretation of each coefcient in the model.

(c) Write out the model in equation form

(d) For which of the predictors can you reject the null hypothesis \(H_0\) : \(β_j\) = 0?

(e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

(f) How well do the models in (a) and (e) fit the data?

g) Using the model from (e), obtain 95 % confidence intervals for the coeficient(s).

h) Is there evidence of outliers or high leverage observations in the model from (e)?

Question 3: Use data of the S&P500 daily index starting from 2019 to 2023

14.3.1 Data & Indicators

14.3.2 Visualise the data

14.3.3 Using Logistic Regression

Question 1: Answer the following using the dataset `mtcars`

(b) Assess the relationship between `mpg` and other predictors using visualization. Based on the plot, which predictors may have transformations like log(x), sqrt(x) or x^2.

Variable: `cyl`

Variable: `disp`

Variable: `hp`

Variable: `drat`

Variable: `wt`

Variable: `qsec`

Variable: `vs`

Variable: `am`

Variable: `gear`

Variable: `carb`

(c). Run the multiple regression on `mpg` across all predictors and show the estimated results

(e) Rerun the multiple regressions by (1) excluding `disp`, and (2) excluding `disp` and `cyl` from predictors. Are there any improvements observed from the regression results?

1. Excluding `disp`

2. Excluding `disp` and `cyl`

Question 2: Answer the following using the dataset `Carseats`

(a) Fit a multiple regression model to predict `Sales` using `Price`, `Urban`, and `US`