1. The Excel file titled Brokerage Satisfaction was read into R below. Regression will be used to predict Overall Satisfaction with Electronic Trades. Columns that are unlikely to be useful for this task will be removed from the dataset, and all remaining columns will be used in a single regression to predict overall satisfaction. Because the dataset is small, all rows will be used as the training set (i.e., all data rows are used to build the regression model).
The first library is used to read the Excel document and the second is used to format tables for the dataset. The Brokerage column was removed because it is not useful for answering questions about this dataset. The head of the dataset (the first few rows) is printed with kable to inspect the remaining columns and verify that the Brokerage column has been removed.
library(readxl)  # read_excel() to import the .xlsx file
library(knitr)   # kable() for formatted tables
brokerage <- read_excel("G:/Other computers/My Laptop/Documents/Richard 621/DATA/Brokerage Satisfaction.xlsx")
brokerage$Brokerage <- NULL  # drop the identifier column before modeling
kable(head(brokerage))       # confirm the remaining columns
| Satisfaction_with_Trade_Price | Satisfaction_with_Speed_of_Execution | Overall_Satisfaction_with_Electronic_Trades |
|---|---|---|
| 3.2 | 3.1 | 3.2 |
| 3.3 | 3.1 | 3.2 |
| 3.1 | 3.3 | 4.0 |
| 2.8 | 3.5 | 3.7 |
| 2.9 | 3.2 | 3.0 |
| 2.4 | 3.2 | 2.7 |
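As an optional sanity check (an addition beyond the original steps), the dimensions and column names confirm that 14 rows and three columns remain after removing the Brokerage column.
dim(brokerage)    # expect 14 rows and 3 columns
names(brokerage)  # Brokerage should no longer be listed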
Below, a regression model is fit on all remaining data (the Brokerage column having already been removed), using every other column as a predictor of Overall_Satisfaction_with_Electronic_Trades.
brokerage_1 <- lm(Overall_Satisfaction_with_Electronic_Trades ~ ., data = brokerage)
summary(brokerage_1)
##
## Call:
## lm(formula = Overall_Satisfaction_with_Electronic_Trades ~ .,
## data = brokerage)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.58886 -0.13863 -0.09120 0.05781 0.64613
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.6633 0.8248 -0.804 0.438318
## Satisfaction_with_Trade_Price 0.7746 0.1521 5.093 0.000348 ***
## Satisfaction_with_Speed_of_Execution 0.4897 0.2016 2.429 0.033469 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3435 on 11 degrees of freedom
## Multiple R-squared: 0.7256, Adjusted R-squared: 0.6757
## F-statistic: 14.54 on 2 and 11 DF, p-value: 0.0008157
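As a sanity check (a small sketch added here; fit_summary is a name introduced for illustration), the p-values and R^2 cited in the answers below can be extracted directly from the summary object.
fit_summary <- summary(brokerage_1)
fit_summary$coefficients[, "Pr(>|t|)"]  # p-values for the intercept and both predictors
fit_summary$r.squared                   # multiple R-squared, 0.7256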
A. Based on the p-value in the regression model, there is a relationship between overall satisfaction and satisfaction with speed of execution.
B. The p-value used to determine this relationship is 0.033469.
C. Based on the p-value in the regression model, there is also a relationship between overall satisfaction and satisfaction with trade price.
D. The p-value used to determine this relationship is 0.000348.
E. The proportion of variation in the sample explained by the model is given by R^2, which is 0.7256 (about 73%).
F. The residuals for each data point are calculated below.
brokerage_1$residuals
## 1 2 3 4 5 6
## -0.133395541 -0.210856526 0.646131773 0.480581057 -0.149979422 -0.062674498
## 7 8 9 10 11 12
## -0.588858465 0.392491325 0.076204545 -0.140379336 -0.112435149 0.002632365
## 13 14
## -0.129506736 -0.069955392
G. A Q-Q plot is generated below to assess the normality of the residuals. The plot suggests the residuals are not all normally distributed, because several points fall well away from the reference line.
qqnorm(brokerage_1$residuals, main = "brokerage_1")  # Q-Q plot of the residuals
qqline(brokerage_1$residuals)                        # reference line for perfect normality
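As a complementary formal check to the visual Q-Q assessment (an optional addition), the Shapiro-Wilk test can be applied to the residuals; a small p-value would indicate a departure from normality.
shapiro.test(brokerage_1$residuals)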
2A. The leverage of each data point is calculated below.
hatvalues(brokerage_1)
## 1 2 3 4 5 6 7
## 0.12226809 0.14248379 0.10348052 0.09498002 0.07909281 0.09225511 0.17268548
## 8 9 10 11 12 13 14
## 0.14817354 0.22725402 0.22639250 0.45080022 0.28805237 0.10586879 0.74621276
The average leverage is calculated below
mean(hatvalues(brokerage_1))
## [1] 0.2142857
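This value is not specific to this dataset: for any linear model, the average of the hat values equals p/n, the number of estimated coefficients divided by the number of observations. A quick confirmation:
length(coef(brokerage_1)) / nrow(brokerage)  # 3 / 14 = 0.2142857, matching the mean above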
2B. If a data point’s leverage is more than twice the average, it is generally considered a high-leverage point. Using the average found in part A, double the average is calculated below; data points 11 and 14 exceed this threshold, as can be seen in the leverages above.
mean(hatvalues(brokerage_1))*2
## [1] 0.4285714
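The high-leverage points can also be flagged programmatically rather than by inspection (a minimal sketch; h is a name introduced here):
h <- hatvalues(brokerage_1)
which(h > 2 * mean(h))  # flags points 11 and 14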
2C. Cook’s distance is calculated to determine whether any of the high-leverage points is influential. A common rule of thumb is that a point with a Cook’s distance greater than 1 may be influential and should be considered for removal or special treatment. Below, Cook’s distance is calculated for each data point and the diagnostic plots are generated. None of the values exceeds 1, and the fourth plot (Residuals vs Leverage) supports the conclusion that no data point needs to be removed.
cooks.distance(brokerage_1)
## 1 2 3 4 5 6
## 0.0079790547 0.0243407712 0.1518660850 0.0756708677 0.0059271795 0.0012425789
## 7 8 9 10 11 12
## 0.2471814307 0.0888808107 0.0062442356 0.0210623352 0.0533835032 0.0000111262
## 13 14
## 0.0062752299 0.1601935254
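The cutoff comparison can also be checked directly (a small added sketch):
which(cooks.distance(brokerage_1) > 1)  # empty result: no point exceeds 1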
plot(brokerage_1)  # four diagnostic plots; the fourth (Residuals vs Leverage) overlays Cook’s distance contours
3A-D. Below are the predictions for overall satisfaction with electronic trades when satisfaction with speed of execution is 2, 3, 3, 2 and satisfaction with trade price is 4, 5, 4, 3, respectively. A data frame of the new values is shown first, followed by the predictions. Satisfaction with speed of execution is listed second so that the columns appear in the same order as in the original data.
Overall_Sat_predict <- data.frame(Satisfaction_with_Trade_Price = c(4, 5, 4, 3), Satisfaction_with_Speed_of_Execution = c(2, 3, 3, 2))
Overall_Sat_predict
## Satisfaction_with_Trade_Price Satisfaction_with_Speed_of_Execution
## 1 4 2
## 2 5 3
## 3 4 3
## 4 3 2
predict(brokerage_1,Overall_Sat_predict, type="response")
## 1 2 3 4
## 3.414448 4.678726 3.904117 2.639838
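Since these are point predictions, prediction intervals can also be requested (an optional extension beyond the question); intervals widen for inputs far from the training data, which foreshadows the extrapolation check in questions 4 and 5.
predict(brokerage_1, Overall_Sat_predict, interval = "prediction")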
4. To determine whether a prediction will involve extrapolation, the largest leverage among all training data points must first be found. The maximum leverage belongs to data point 14, and its value is shown below.
max(hatvalues(brokerage_1))
## [1] 0.7462128
5. If a new data point has a leverage higher than this maximum (0.74621276), the prediction will involve extrapolation. The design matrix X is calculated first and shown below; for a new predictor vector x_new (with a leading 1 for the intercept), the leverage is h_new = x_new' (X'X)^(-1) x_new, which is the calculation carried out after the table.
X <- model.matrix(brokerage_1)  # 14 x 3 design matrix, including the intercept column
kable(X)
| (Intercept) | Satisfaction_with_Trade_Price | Satisfaction_with_Speed_of_Execution |
|---|---|---|
| 1 | 3.2 | 3.1 |
| 1 | 3.3 | 3.1 |
| 1 | 3.1 | 3.3 |
| 1 | 2.8 | 3.5 |
| 1 | 2.9 | 3.2 |
| 1 | 2.4 | 3.2 |
| 1 | 2.7 | 3.8 |
| 1 | 2.4 | 3.7 |
| 1 | 2.6 | 2.6 |
| 1 | 2.3 | 2.7 |
| 1 | 3.7 | 3.9 |
| 1 | 2.5 | 2.5 |
| 1 | 3.0 | 3.0 |
| 1 | 1.0 | 4.0 |
Below are the leverages of the new data points from question 3, computed with the formula above.
x_newa <- c(1, 4, 2)  # case 3A: intercept, trade price 4, speed of execution 2
t(x_newa) %*% solve(t(X) %*% X) %*% x_newa
## [,1]
## [1,] 0.8323367
x_newb <- c(1, 5, 3)  # case 3B: trade price 5, speed of execution 3
t(x_newb) %*% solve(t(X) %*% X) %*% x_newb
## [,1]
## [1,] 1.08481
x_newc <- c(1, 4, 3)  # case 3C: trade price 4, speed of execution 3
t(x_newc) %*% solve(t(X) %*% X) %*% x_newc
## [,1]
## [1,] 0.3992325
x_newd <- c(1, 3, 2)  # case 3D: trade price 3, speed of execution 2
t(x_newd) %*% solve(t(X) %*% X) %*% x_newd
## [,1]
## [1,] 0.6074396
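The same four leverages can be computed in one step by stacking the new points into a matrix (a compact alternative to the element-by-element calculations above; X_new is a name introduced here).
X_new <- cbind(1, as.matrix(Overall_Sat_predict))  # 4 x 3: intercept column plus the two predictors
diag(X_new %*% solve(t(X) %*% X) %*% t(X_new))     # leverages for cases 3A-3D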
The new data points in 3A and 3B would involve extrapolation, since their leverages (0.832 and 1.085) exceed the maximum training leverage of 0.746; the points in 3C and 3D (0.399 and 0.607) do not.