Homework #2

Regression Analysis

Use the Brokerage Satisfaction data set to predict the Overall_Satisfaction_with_Electronic_Trades.

bro_lm <- lm(Overall_Satisfaction_with_Electronic_Trades ~ Satisfaction_with_Trade_Price + Satisfaction_with_Speed_of_Execution , data = Brokerage)
summary(bro_lm)

## 
## Call:
## lm(formula = Overall_Satisfaction_with_Electronic_Trades ~ Satisfaction_with_Trade_Price + 
##     Satisfaction_with_Speed_of_Execution, data = Brokerage)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.58886 -0.13863 -0.09120  0.05781  0.64613 
## 
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                           -0.6633     0.8248  -0.804 0.438318    
## Satisfaction_with_Trade_Price          0.7746     0.1521   5.093 0.000348 ***
## Satisfaction_with_Speed_of_Execution   0.4897     0.2016   2.429 0.033469 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3435 on 11 degrees of freedom
## Multiple R-squared:  0.7256, Adjusted R-squared:  0.6757 
## F-statistic: 14.54 on 2 and 11 DF,  p-value: 0.0008157

The firm name (the Brokerage column) was excluded from the model. There are 14 different firms and 14 rows of data, so each firm appears only once and there is not enough information to determine whether the firm itself has an impact on overall satisfaction with electronic trades.

1a.

Based on the p-value alone, there is a statistically significant relationship between overall satisfaction and speed of execution.

1b.

To determine the significance of speed of execution, I used the p-value that corresponds to its t-statistic, which is equal to 0.033469. I assumed a significance level of \(\alpha\) = 0.05; a p-value below that threshold is the first indicator of significance.
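As a quick check on where this p-value comes from, the two-sided p-value can be recomputed from the t-statistic and the residual degrees of freedom shown in the summary. This is a minimal sketch that uses the rounded values printed above (t = 2.429 on 11 degrees of freedom), so it should only agree with the reported 0.033469 to about three decimal places.

```r
# Values copied from the summary(bro_lm) output above (rounded as printed)
t_speed <- 2.429   # t value for Satisfaction_with_Speed_of_Execution
df_res  <- 11      # residual degrees of freedom

# Two-sided p-value: probability of a |t| at least this large under H0
p_speed <- 2 * pt(-abs(t_speed), df = df_res)
p_speed

# Compare against the assumed significance level
alpha <- 0.05
p_speed < alpha
```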

1c.

Similarly, based on the p-value there is likely a relationship between satisfaction with trade price and overall satisfaction.

1d.

The same assumption (\(\alpha\) = 0.05) was made when determining the significance of this variable. The p-value corresponding to the t-statistic is 0.000348, well below that threshold.

1e.

The proportion of variance explained by the model is given by the multiple R-squared value: 72.56% of the variance in overall satisfaction is explained by the two predictors. Given that the data set has only 14 observations, this is a strong fit and another indicator of a linear relationship between the dependent and independent variables. The adjusted R-squared, which accounts for the number of predictors, is 0.6757.
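The adjusted R-squared and the overall F-statistic in the summary can both be recovered from the multiple R-squared, the sample size, and the number of predictors. The sketch below uses only the rounded values printed above, so small differences in the last digit are expected.

```r
# Values copied from the summary(bro_lm) output above
r2 <- 0.7256   # multiple R-squared
n  <- 14       # observations (one per firm)
p  <- 2        # predictors, not counting the intercept

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Overall F-statistic expressed through R-squared
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))

round(c(adj_r2 = adj_r2, f_stat = f_stat), 4)
```

Both numbers should land close to the 0.6757 and 14.54 reported in the summary.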

1f.

Calculate the residual values.

bro_resid <- resid(bro_lm)
bro_resid

##            1            2            3            4            5            6 
## -0.133395541 -0.210856526  0.646131773  0.480581057 -0.149979422 -0.062674498 
##            7            8            9           10           11           12 
## -0.588858465  0.392491325  0.076204545 -0.140379336 -0.112435149  0.002632365 
##           13           14 
## -0.129506736 -0.069955392
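Two properties of these residuals can be checked directly from the printed values: they sum to zero (because the model includes an intercept), and they reproduce the residual standard error from the summary. The vector below is copied from the output above; the variable names are illustrative.

```r
# Residuals copied from the bro_resid output above
res <- c(-0.133395541, -0.210856526,  0.646131773,  0.480581057,
         -0.149979422, -0.062674498, -0.588858465,  0.392491325,
          0.076204545, -0.140379336, -0.112435149,  0.002632365,
         -0.129506736, -0.069955392)

# With an intercept in the model, least-squares residuals sum to zero
sum(res)

# Residual standard error: sqrt(RSS / residual degrees of freedom)
df_res <- length(res) - 3   # 14 observations minus 3 estimated coefficients
rse <- sqrt(sum(res^2) / df_res)
rse
```

The second value should match the 0.3435 on 11 degrees of freedom shown in the model summary, up to rounding.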

1g.

Check whether or not the residuals are normally distributed.

plot(bro_lm, which = 2)

The Q-Q plot resembles that of a normal distribution, with only slight deviations toward the endpoints. This supports the assumption of normally distributed residuals and suggests that a transformation of the variables is not needed for further analysis.
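A formal test can complement the visual check. The sketch below applies the Shapiro-Wilk test (`shapiro.test()` from base R's stats package) to the residuals printed in 1f. This test is not part of the original analysis, and with only 14 observations it has limited power, so it should be read alongside the plot rather than in place of it.

```r
# Residuals copied from the bro_resid output in 1f
res <- c(-0.133395541, -0.210856526,  0.646131773,  0.480581057,
         -0.149979422, -0.062674498, -0.588858465,  0.392491325,
          0.076204545, -0.140379336, -0.112435149,  0.002632365,
         -0.129506736, -0.069955392)

# H0: the residuals come from a normal distribution
sw <- shapiro.test(res)
sw$statistic
sw$p.value   # a small p-value (e.g. below 0.05) would be evidence against normality
```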

Examining Leverage

2a.

Determine average leverage:

leverage <- hatvalues(bro_lm)
m_lev <- mean(leverage)
m_lev_round <- round(m_lev, 4)
m_lev

## [1] 0.2142857

The mean of the leverages is equal to 0.2143.
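The mean leverage does not depend on the data values: the hat values always sum to the number of estimated coefficients, so the mean is that count divided by the number of observations. A minimal check, with illustrative variable names:

```r
n_obs   <- 14   # rows in the Brokerage data
n_coefs <- 3    # intercept + two predictors

# Mean leverage = number of coefficients / number of observations
mean_leverage <- n_coefs / n_obs
mean_leverage

# A common rule of thumb flags points at or above twice the mean leverage
cutoff <- 2 * mean_leverage
cutoff
```

This gives 3/14 for the mean, matching the 0.2142857 above, and a cutoff of 6/14 for the high-leverage check.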

2b.

Identify data points that would be considered to have high leverage.

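library(dplyr)       # if_else() and %>%
library(knitr)       # kable()
library(kableExtra)  # kable_styling()
# Flag rows whose leverage is at least twice the mean, then show them without the flag column (6)
lev_df <- cbind(Brokerage, leverage)
lev_df$High_lev <- if_else(lev_df$leverage >= 2 * m_lev, 1, 0)
pretty_tab <- kable(lev_df[lev_df$High_lev == 1, -c(6)], "html") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)
pretty_tab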

	Brokerage	Satisfaction_with_Trade_Price	Satisfaction_with_Speed_of_Execution	Overall_Satisfaction_with_Electronic_Trades	leverage
11	Interactive Brokers	3.7	3.9	4	0.4508002
14	Banc of America Investment Services	1.0	4.0	2	0.7462128

The data points considered to have high leverage are rows 11 and 14, using a cutoff of twice the mean leverage. The firms in these rows are Interactive Brokers and Banc of America Investment Services. Their leverages are 0.4508 and 0.7462, respectively, while the mean leverage is 0.2143.
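The same flagging step can be written in base R without any table-formatting packages. Because the Brokerage data frame is not reproduced here, the sketch below uses the built-in `mtcars` data purely as a stand-in, and `flag_high_leverage()` is a hypothetical helper name rather than an existing function.

```r
# Stand-in model: mtcars ships with R and is used here only for illustration
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Hypothetical helper: rows whose leverage reaches `multiple` times the mean
flag_high_leverage <- function(model, multiple = 2) {
  h <- hatvalues(model)
  cutoff <- multiple * mean(h)
  data.frame(row = names(h), leverage = unname(h))[h >= cutoff, ]
}

flag_high_leverage(fit)
```

Applied to the model in this report, the call would be `flag_high_leverage(bro_lm)`.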

2c.

Analyze Cook’s distance and determine if there are any significant outliers.

cooks_dist <- cooks.distance(bro_lm)
cooks_dist

##            1            2            3            4            5            6 
## 0.0079790547 0.0243407712 0.1518660850 0.0756708677 0.0059271795 0.0012425789 
##            7            8            9           10           11           12 
## 0.2471814307 0.0888808107 0.0062442356 0.0210623352 0.0533835032 0.0000111262 
##           13           14 
## 0.0062752299 0.1601935254

plot(bro_lm, which = 4)

Based on the Cook's distance values, no point is influential enough to warrant removal from the data set. Points 3, 7, and 14 have higher values than the rest (about 0.152, 0.247, and 0.160), but all are well below the common rule-of-thumb cutoff of 1, so none raises a large enough concern.
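To make the judgment above concrete, the printed Cook's distances can be compared against two commonly quoted rules of thumb: a cutoff of 1 and a stricter cutoff of 4/n. Neither cutoff appears in the original analysis, so they are assumptions used here only for illustration.

```r
# Cook's distances copied from the cooks_dist output above
cooks <- c(0.0079790547, 0.0243407712, 0.1518660850, 0.0756708677,
           0.0059271795, 0.0012425789, 0.2471814307, 0.0888808107,
           0.0062442356, 0.0210623352, 0.0533835032, 0.0000111262,
           0.0062752299, 0.1601935254)

n <- length(cooks)

# Most influential observation and its distance
which.max(cooks)
max(cooks)

# Observations exceeding each rule-of-thumb cutoff (empty if none)
which(cooks > 4 / n)   # stricter cutoff
which(cooks > 1)       # looser cutoff
```

The largest value belongs to observation 7, and no observation exceeds either cutoff.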

Predictions

3a through 3d.

Satisfaction_with_Speed_of_Execution = c(2, 3, 3, 2) 
Satisfaction_with_Trade_Price = c(4, 5, 4, 4)

pred_df <- as.data.frame(cbind(Satisfaction_with_Speed_of_Execution, Satisfaction_with_Trade_Price))

Pred_Satisfaction <- predict(bro_lm,pred_df)

Predicted_Values_DF <- cbind(pred_df, Pred_Satisfaction)

pretty_tab2 <- kable(Predicted_Values_DF, "html") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), position = "center", full_width = TRUE)
pretty_tab2

Satisfaction_with_Speed_of_Execution	Satisfaction_with_Trade_Price	Pred_Satisfaction
2	4	3.414448
3	5	4.678726
3	4	3.904117
2	4	3.414448
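The predictions in the table can be reproduced by hand from the fitted equation. The sketch below plugs the rounded coefficients from the model summary into the linear predictor, so the results should agree with `predict()` to about three decimal places.

```r
# Coefficients copied from the summary(bro_lm) output (rounded as printed)
b0      <- -0.6633   # intercept
b_price <-  0.7746   # Satisfaction_with_Trade_Price
b_speed <-  0.4897   # Satisfaction_with_Speed_of_Execution

# The four new points used above
price <- c(4, 5, 4, 4)
speed <- c(2, 3, 3, 2)

# Fitted equation: overall = b0 + b_price * price + b_speed * speed
pred_by_hand <- b0 + b_price * price + b_speed * speed
round(pred_by_hand, 3)
```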

3a through 3d with Confidence Intervals

predict(bro_lm,pred_df, interval = "confidence", level = .95, type = "response" )

##        fit      lwr      upr
## 1 3.414448 2.724739 4.104158
## 2 4.678726 3.891330 5.466123
## 3 3.904117 3.426445 4.381789
## 4 3.414448 2.724739 4.104158
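Each interval above has the form fit ± t × s × sqrt(h), where t is the 97.5% quantile of the t distribution on 11 degrees of freedom, s is the residual standard error, and h = x0' (X'X)^{-1} x0 is the leverage of the new point x0. The sketch below works backwards from the printed intervals to the implied leverage of each prediction point. It relies on the rounded s = 0.3435 from the summary, so the results are approximate.

```r
# Intervals copied from the predict() output above
ci <- data.frame(
  fit = c(3.414448, 4.678726, 3.904117, 3.414448),
  lwr = c(2.724739, 3.891330, 3.426445, 2.724739),
  upr = c(4.104158, 5.466123, 4.381789, 4.104158)
)

s      <- 0.3435          # residual standard error (rounded, from the summary)
df_res <- 11              # residual degrees of freedom
t_crit <- qt(0.975, df_res)

# Half-width = t_crit * s * sqrt(h), so solve for h
half_width <- (ci$upr - ci$lwr) / 2
se_fit     <- half_width / t_crit
h_implied  <- (se_fit / s)^2
round(h_implied, 3)
```

The implied leverages come out at roughly 0.83, 1.08, and 0.40 for the first three rows (the fourth row repeats the first). These belong to the points as entered in `pred_df`, that is, trade price 4, 5, 4 with speed of execution 2, 3, 3.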

Extrapolation

4a.

max_l <- max(leverage)

The maximum leverage for the model is 0.7462128, which belongs to row 14 (Banc of America Investment Services).

Testing Predicted Values

5a.

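# Entries follow the columns of model.matrix(bro_lm):
# (Intercept), Satisfaction_with_Trade_Price, Satisfaction_with_Speed_of_Execution
x_new <- c(1, 2, 4)

X <- model.matrix(bro_lm)

# Leverage of the new point: x_new' (X'X)^{-1} x_new
t(x_new) %*% solve(t(X) %*% X) %*% x_new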

##           [,1]
## [1,] 0.3236157

Because the calculated leverage (0.3236) is smaller than the maximum leverage within the data (0.7462), this point is not considered extrapolation.
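When the new point is typed as a bare vector, its entries must follow the column order of the model matrix: intercept, then trade price, then speed of execution. Read that way, `c(1, 2, 4)` is trade price 2 and speed of execution 4, whereas the first row of `pred_df` in section 3 has speed 2 and trade price 4; if that row is the intended point, the vector would be `c(1, 4, 2)`. One way to avoid the ambiguity is to build the row from a named data frame. The sketch below assumes a hypothetical helper, `new_point_leverage()`, and uses the built-in `mtcars` data as a stand-in because the Brokerage data is not reproduced here.

```r
# Stand-in model: mtcars ships with R and is used here only for illustration
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Hypothetical helper: leverage of new points given as a named data frame,
# so the column order always matches the fitted model
new_point_leverage <- function(model, newdata) {
  X  <- model.matrix(model)
  # delete.response() drops the outcome so newdata only needs the predictors
  x0 <- model.matrix(delete.response(terms(model)), data = newdata)
  # Row-wise x0' (X'X)^{-1} x0
  rowSums((x0 %*% solve(crossprod(X))) * x0)
}

h_new <- new_point_leverage(fit, data.frame(wt = 3, hp = 150))
h_max <- max(hatvalues(fit))

h_new > h_max   # TRUE would flag the point as extrapolation
```

For this report the call would look like `new_point_leverage(bro_lm, pred_df)`, compared against `max(hatvalues(bro_lm))`.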

5b.

# Column order as in X: intercept, trade price = 3, speed of execution = 5
x_new2 <- c(1, 3, 5)
t(x_new2) %*% solve(t(X) %*% X) %*% x_new2

##          [,1]
## [1,] 1.169531

Because the calculated leverage (1.1695) is greater than the maximum leverage within the data (0.7462), this point is considered extrapolation.

5c.

# Column order as in X: intercept, trade price = 3, speed of execution = 4
x_new3 <- c(1, 3, 4)
t(x_new3) %*% solve(t(X) %*% X) %*% x_new3

##           [,1]
## [1,] 0.2932324

Because the calculated leverage (0.2932) is smaller than the maximum leverage within the data (0.7462), this point is not considered extrapolation.

5d.

# Column order as in X: intercept, trade price = 2, speed of execution = 4
x_new4 <- c(1, 2, 4)
t(x_new4) %*% solve(t(X) %*% X) %*% x_new4

##           [,1]
## [1,] 0.3236157

Because the calculated leverage (0.3236) is smaller than the maximum leverage within the data (0.7462), this point is not considered extrapolation.

Elyse Stasil

2023-09-20
