Regression Analysis
Use the Brokerage Satisfaction data set to predict the Overall_Satisfaction_with_Electronic_Trades.
bro_lm <- lm(Overall_Satisfaction_with_Electronic_Trades ~ Satisfaction_with_Trade_Price + Satisfaction_with_Speed_of_Execution, data = Brokerage)
summary(bro_lm)
##
## Call:
## lm(formula = Overall_Satisfaction_with_Electronic_Trades ~ Satisfaction_with_Trade_Price +
## Satisfaction_with_Speed_of_Execution, data = Brokerage)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.58886 -0.13863 -0.09120 0.05781 0.64613
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.6633 0.8248 -0.804 0.438318
## Satisfaction_with_Trade_Price 0.7746 0.1521 5.093 0.000348 ***
## Satisfaction_with_Speed_of_Execution 0.4897 0.2016 2.429 0.033469 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3435 on 11 degrees of freedom
## Multiple R-squared: 0.7256, Adjusted R-squared: 0.6757
## F-statistic: 14.54 on 2 and 11 DF, p-value: 0.0008157
The firm name was removed from the data set as a predictor. There are 14 different firms and 14 rows of data, so each firm appears only once and there is not enough information to determine whether the firm itself has an impact on overall satisfaction with electronic trades.
1a.
Based on the p-value alone, there is a statistically significant relationship between overall satisfaction and speed of execution.
1b.
To determine the significance of speed of execution, I used the p-value that corresponds to its t-statistic, which is 0.033469. I assumed a significance level of \(\alpha\) = 0.05; a p-value below that level is the first indicator of significance.
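As a sanity check, the reported p-value can be reproduced from the t value and the 11 residual degrees of freedom printed in the summary output. This is a minimal sketch; the variable names are illustrative and are not part of the original analysis.

```r
# Values copied from the summary(bro_lm) output (rounded as printed)
t_speed  <- 2.429  # t value for Satisfaction_with_Speed_of_Execution
df_resid <- 11     # residual degrees of freedom
alpha    <- 0.05   # assumed significance level

# Two-sided p-value from the t distribution
p_speed <- 2 * pt(-abs(t_speed), df = df_resid)

p_speed          # close to the reported 0.033469; differs only through rounding of t
p_speed < alpha  # TRUE: significant at the 5% level
```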
1c.
Similarly, the p-value indicates a statistically significant relationship between trade price and overall satisfaction.
1d.
The same assumption (\(\alpha\) = 0.05) was used to assess this variable. The p-value corresponding to the t-statistic is 0.000348, well below that level.
1e.
The proportion of variance explained by the model is given by the multiple R-squared value: 72.56% of the variance in overall satisfaction is explained (67.57% after adjusting for the number of predictors). Given that the data set has only 14 observations, the model fits well, which is another indicator of a strong linear relationship between the dependent and independent variables.
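The adjusted R-squared and the F-statistic in the summary output both follow from the multiple R-squared, the sample size, and the number of predictors. The sketch below recomputes them from the printed (rounded) values; the variable names are illustrative.

```r
# Values copied from the summary(bro_lm) output
r2 <- 0.7256  # Multiple R-squared (rounded as printed)
n  <- 14      # observations
k  <- 2       # predictors (trade price, speed of execution)

# Adjusted R-squared penalises the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Overall F-statistic and its p-value on k and n - k - 1 degrees of freedom
f_stat <- (r2 / k) / ((1 - r2) / (n - k - 1))
f_pval <- pf(f_stat, k, n - k - 1, lower.tail = FALSE)

round(adj_r2, 4)  # 0.6757, as reported
round(f_stat, 2)  # 14.54, as reported
f_pval            # close to the reported 0.0008157
```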
1f.
Calculate the residual values.
## 1 2 3 4 5 6
## -0.133395541 -0.210856526 0.646131773 0.480581057 -0.149979422 -0.062674498
## 7 8 9 10 11 12
## -0.588858465 0.392491325 0.076204545 -0.140379336 -0.112435149 0.002632365
## 13 14
## -0.129506736 -0.069955392
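Each residual is the observed overall satisfaction minus the fitted value. The raw data are not reproduced in this report except for rows 11 and 14 (shown in the high-leverage table in 2b), so the sketch below recomputes only those two residuals from the rounded coefficients. The short column names are illustrative.

```r
# Rounded coefficients as printed in the summary output
b0      <- -0.6633
b_price <-  0.7746
b_speed <-  0.4897

# Rows 11 and 14, the only rows whose raw values appear in this report
rows <- data.frame(
  row     = c(11, 14),
  price   = c(3.7, 1.0),
  speed   = c(3.9, 4.0),
  overall = c(4, 2)
)

# residual = observed - fitted
fitted_vals   <- b0 + b_price * rows$price + b_speed * rows$speed
rows$residual <- rows$overall - fitted_vals

rows  # residuals agree with the reported -0.1124 and -0.0700 up to coefficient rounding
```

With the fitted model available, `residuals(bro_lm)` returns all 14 values at once.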
1g.
Check whether or not the residuals are normally distributed
The QQ-plot resembles that of a normal distribution, with only slight deviations toward the end points. This supports the assumption of normally distributed residuals and suggests that a transformation of the variables is not needed for further analysis.
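The plot itself can be redrawn from the residuals listed in 1f. The sketch below does that with base R graphics; the Shapiro-Wilk test at the end is an addition that was not part of the original analysis and is offered only as a numeric complement to the visual check.

```r
# Residuals copied from the output in 1f
res <- c(-0.133395541, -0.210856526,  0.646131773,  0.480581057,
         -0.149979422, -0.062674498, -0.588858465,  0.392491325,
          0.076204545, -0.140379336, -0.112435149,  0.002632365,
         -0.129506736, -0.069955392)

# Normal Q-Q plot with a reference line through the quartiles
qqnorm(res, main = "Normal Q-Q plot of residuals")
qqline(res)

# Shapiro-Wilk test (not in the original report); a large p-value
# means no evidence against normality
sw <- shapiro.test(res)
sw$p.value
```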
Examining Leverage
2a.
Determine average leverage:
## [1] 0.2142857
The mean of the leverages is equal to 0.2143.
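The mean leverage does not depend on the data values: for a least-squares fit it always equals the number of estimated coefficients divided by the number of observations, here 3 / 14. The sketch below illustrates this with `hatvalues()` on simulated toy data, which is an assumption made for illustration and is not the Brokerage data.

```r
set.seed(1)  # toy data only; NOT the Brokerage data
n <- 14
toy <- data.frame(x1 = runif(n, 1, 5), x2 = runif(n, 1, 5))
toy$y <- 1 + 0.5 * toy$x1 + 0.5 * toy$x2 + rnorm(n, sd = 0.3)

# Same structure as bro_lm: intercept plus two predictors
toy_lm <- lm(y ~ x1 + x2, data = toy)

# Leverages are the diagonal of the hat matrix
toy_leverage <- hatvalues(toy_lm)

mean(toy_leverage)        # equals p / n
length(coef(toy_lm)) / n  # 3 / 14 = 0.2142857
```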
2b.
Identify data points that would be considered to have high leverage.
library(dplyr)       # if_else(), %>%
library(knitr)       # kable()
library(kableExtra)  # kable_styling()
# leverage and m_lev are assumed to hold hatvalues(bro_lm) and mean(leverage) from 2a
lev_df <- cbind(Brokerage, leverage)
lev_df$High_lev <- if_else(lev_df$leverage >= 2 * m_lev, 1, 0)
pretty_tab <- kable(lev_df[lev_df$High_lev == 1, -c(6)], "html") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)
pretty_tab

| | Brokerage | Satisfaction_with_Trade_Price | Satisfaction_with_Speed_of_Execution | Overall_Satisfaction_with_Electronic_Trades | leverage |
|---|---|---|---|---|---|
| 11 | Interactive Brokers | 3.7 | 3.9 | 4 | 0.4508002 |
| 14 | Banc of America Investment Services | 1.0 | 4.0 | 2 | 0.7462128 |
The data points with high leverage are rows 11 and 14, which correspond to Interactive Brokers and Banc of America Investment Services. Their leverages are 0.4508 and 0.7462, both above the cutoff of twice the mean leverage (2 × 0.2143 = 0.4286).
2c.
Analyze Cook’s distance and determine if there are any significant outliers.
## 1 2 3 4 5 6
## 0.0079790547 0.0243407712 0.1518660850 0.0756708677 0.0059271795 0.0012425789
## 7 8 9 10 11 12
## 0.2471814307 0.0888808107 0.0062442356 0.0210623352 0.0533835032 0.0000111262
## 13 14
## 0.0062752299 0.1601935254
Based on the Cook’s distance values, no point is influential enough to warrant removal from the data set. Points 3, 7, and 14 have the largest values (0.1519, 0.2472, and 0.1602), but all remain well below the common rule-of-thumb cutoff of 1.
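Cook’s distance combines the size of a residual with the leverage of its point: D_i = e_i^2 / (p s^2) × h_i / (1 − h_i)^2, where p is the number of coefficients and s^2 the residual variance. The sketch below reproduces the reported values for rows 11 and 14, the only rows whose leverages appear in this report, and then checks all reported distances against a cutoff of 1. The cutoff is a common rule of thumb, not a value stated in the original analysis.

```r
# Residuals copied from the output in 1f
res <- c(-0.133395541, -0.210856526,  0.646131773,  0.480581057,
         -0.149979422, -0.062674498, -0.588858465,  0.392491325,
          0.076204545, -0.140379336, -0.112435149,  0.002632365,
         -0.129506736, -0.069955392)

p  <- 3                               # intercept + two slopes
s2 <- sum(res^2) / (length(res) - p)  # residual variance; sqrt(s2) is about 0.3435

# Leverages for rows 11 and 14, copied from the table in 2b
h <- c(`11` = 0.4508002, `14` = 0.7462128)
e <- res[c(11, 14)]

cooks_manual <- (e^2 / (p * s2)) * h / (1 - h)^2
cooks_manual  # compare with the reported 0.0533835 and 0.1601935

# All reported Cook's distances, copied from the output in 2c
cooks_reported <- c(0.0079790547, 0.0243407712, 0.1518660850, 0.0756708677,
                    0.0059271795, 0.0012425789, 0.2471814307, 0.0888808107,
                    0.0062442356, 0.0210623352, 0.0533835032, 0.0000111262,
                    0.0062752299, 0.1601935254)

which.max(cooks_reported)  # 7: the most influential point
any(cooks_reported > 1)    # FALSE: nothing exceeds the rule-of-thumb cutoff
```

With the fitted model available, `cooks.distance(bro_lm)` returns the same quantities directly.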
Predictions
3a through 3d.
# New observations to predict (3a through 3d), one row per case
pred_df <- data.frame(
  Satisfaction_with_Speed_of_Execution = c(2, 3, 3, 2),
  Satisfaction_with_Trade_Price = c(4, 5, 4, 4)
)
Pred_Satisfaction <- predict(bro_lm, pred_df)
Predicted_Values_DF <- cbind(pred_df, Pred_Satisfaction)
pretty_tab2 <- kable(Predicted_Values_DF, "html") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), position = "center", full_width = TRUE)
pretty_tab2

| Satisfaction_with_Speed_of_Execution | Satisfaction_with_Trade_Price | Pred_Satisfaction |
|---|---|---|
| 2 | 4 | 3.414448 |
| 3 | 5 | 4.678726 |
| 3 | 4 | 3.904117 |
| 2 | 4 | 3.414448 |
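The predictions in the table can be checked by hand, since `predict()` for a linear model simply evaluates the fitted equation at the new values. The sketch below uses the rounded coefficients from the summary output, so it agrees with the table only up to rounding; the short names are illustrative.

```r
# Rounded coefficients as printed in the summary output
b <- c(intercept = -0.6633, price = 0.7746, speed = 0.4897)

# The four new observations from 3a through 3d
new_points <- data.frame(
  speed = c(2, 3, 3, 2),
  price = c(4, 5, 4, 4)
)

# Fitted equation: intercept + b_price * price + b_speed * speed
manual_pred <- b[["intercept"]] +
  b[["price"]] * new_points$price +
  b[["speed"]] * new_points$speed

round(manual_pred, 2)  # 3.41 4.68 3.90 3.41
```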
Testing Predicted Values
5a.
## [,1]
## [1,] 0.3236157
Because the calculated leverage (0.3236) is smaller than the maximum leverage within the data (0.7462), this point is not considered an extrapolation.
5b.
## [,1]
## [1,] 1.169531
Because the calculated leverage (1.1695) is greater than the maximum leverage within the data (0.7462), this point is considered an extrapolation.
5c.
## [,1]
## [1,] 0.2932324
Because the calculated leverage (0.2932) is smaller than the maximum leverage within the data (0.7462), this point is not considered an extrapolation.
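The leverage of a new point x0 = (1, price, speed) is x0' (X'X)^-1 x0, where X is the model matrix of the fitted data. The 1 × 1 matrix outputs above suggest a matrix product of this kind was used, though the code was not shown. The sketch below illustrates the computation and the comparison against the maximum in-sample leverage on simulated toy data; the data, the helper `new_leverage()`, and the test points are all assumptions for illustration and are not the Brokerage data.

```r
set.seed(42)  # toy data only; NOT the Brokerage data
n <- 14
toy <- data.frame(price = runif(n, 1, 4), speed = runif(n, 2, 4))
toy$overall <- -0.5 + 0.8 * toy$price + 0.5 * toy$speed + rnorm(n, sd = 0.3)
toy_lm <- lm(overall ~ price + speed, data = toy)

X       <- model.matrix(toy_lm)  # columns: intercept, price, speed
XtX_inv <- solve(crossprod(X))   # (X'X)^-1

# Hypothetical helper: leverage of a new point x0 = (1, price, speed)
new_leverage <- function(price, speed) {
  x0 <- c(1, price, speed)
  drop(t(x0) %*% XtX_inv %*% x0)
}

# Largest leverage among the points used to fit the model
max_lev <- max(hatvalues(toy_lm))

h_center <- new_leverage(mean(toy$price), mean(toy$speed))  # equals 1 / n
h_far    <- new_leverage(10, 10)                            # far outside the toy data

h_center > max_lev  # FALSE: inside the data cloud, not an extrapolation
h_far > max_lev     # TRUE: flagged as an extrapolation
```

Comparing against the maximum leverage, rather than only the range of each predictor, also catches combinations of values that are individually in range but jointly unlike anything in the data.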