1. The excel file titled Brokerage Satisfaction was read into R below. Regression will be used to predict the Overall Satisfaction with Electronic Trades. Column(s) will be removed from the dataset that will likely not be useful in performing this task, and all remaining columns of the dataset will be used to perform a single regression to predict the overall satisfaction. Because the dataset is small, all rows of the dataset will be used as the training set (i.e., use all data rows to build the regression model)

The first library was used to read the excel document and the second library was used to aid in making the table for the dataset. The Brokerage column was removed as it is not useful to answer questions regarding this dataset. The head of the dataset (first few datapoints) with kable was performed to observe the columns left and verify the Brokerage column has been removed.

library(readxl)
library(knitr)
brokerage <- read_excel("G:/Other computers/My Laptop/Documents/Richard 621/DATA/Brokerage Satisfaction.xlsx")
brokerage$Brokerage=NULL
kable(head(brokerage))
Satisfaction_with_Trade_Price Satisfaction_with_Speed_of_Execution Overall_Satisfaction_with_Electronic_Trades
3.2 3.1 3.2
3.3 3.1 3.2
3.1 3.3 4.0
2.8 3.5 3.7
2.9 3.2 3.0
2.4 3.2 2.7

Below is a regression model developed including all data except the Brokerage column that was removed

brokerage_1 <- lm(Overall_Satisfaction_with_Electronic_Trades ~ ., data = brokerage)
summary(brokerage_1)
## 
## Call:
## lm(formula = Overall_Satisfaction_with_Electronic_Trades ~ ., 
##     data = brokerage)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.58886 -0.13863 -0.09120  0.05781  0.64613 
## 
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                           -0.6633     0.8248  -0.804 0.438318    
## Satisfaction_with_Trade_Price          0.7746     0.1521   5.093 0.000348 ***
## Satisfaction_with_Speed_of_Execution   0.4897     0.2016   2.429 0.033469 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3435 on 11 degrees of freedom
## Multiple R-squared:  0.7256, Adjusted R-squared:  0.6757 
## F-statistic: 14.54 on 2 and 11 DF,  p-value: 0.0008157

A. Based on the P value in the regression model there is a relationship with overall and the satisfaction with speed of execution

B. The P value used to determine the relationship is 0.033469

C. Based on the P value in the regression model there is a relationship with overall and the satisfaction with trade price

D. The P value used to determine the relationship is 0.000348

E. The variation in the sample is explained by R^2 which is 73%

F. The residuals of the datapoints are calculated below

brokerage_1$residuals
##            1            2            3            4            5            6 
## -0.133395541 -0.210856526  0.646131773  0.480581057 -0.149979422 -0.062674498 
##            7            8            9           10           11           12 
## -0.588858465  0.392491325  0.076204545 -0.140379336 -0.112435149  0.002632365 
##           13           14 
## -0.129506736 -0.069955392

G. A QQ plot is generated below to observe the normality of the residuals. The plot shows that the datapoints are not all normally distributed because there are some datapoints that are far away from the plotted line.

qqnorm(brokerage_1$residuals, main= "brokerage_1")
qqline(brokerage_1$residuals)

2A. The leverage of data points is contained below

hatvalues(brokerage_1)
##          1          2          3          4          5          6          7 
## 0.12226809 0.14248379 0.10348052 0.09498002 0.07909281 0.09225511 0.17268548 
##          8          9         10         11         12         13         14 
## 0.14817354 0.22725402 0.22639250 0.45080022 0.28805237 0.10586879 0.74621276

The average leverage is calculated below

mean(hatvalues(brokerage_1))
## [1] 0.2142857

2B. If a data point’s leverage is more than twice the average, it is generally considered to be a high leverage point. Using the average found in part a, determine which datapoints are considered to have high leverage. Below is the calculation for double the average. Data points 11 and 14 have double the average as can bee seen above.

mean(hatvalues(brokerage_1))*2
## [1] 0.4285714

2C. Cook’s distance is calculated to determine if any of the leverage points are greater than 1, which means it may be influential and should be considered for removal from the data set or treated differently. Below Cook’s distance is calculated for each data point and plotted. None of the data points are above 1 and the fourth graph below supports that none of the data points should be removed.

cooks.distance(brokerage_1)
##            1            2            3            4            5            6 
## 0.0079790547 0.0243407712 0.1518660850 0.0756708677 0.0059271795 0.0012425789 
##            7            8            9           10           11           12 
## 0.2471814307 0.0888808107 0.0062442356 0.0210623352 0.0533835032 0.0000111262 
##           13           14 
## 0.0062752299 0.1601935254
plot(brokerage_1)

3A-D. Below are the predictions for overall satisfaction with electronic trades when speed of execution is 2,3,3,2 and trade price is 4,5,4,3, respectively. A data frame is shown first followed by the prediction with new values. Speed of execution is moved to the right in the vector since that is the way it is reflected in the orginal data.

Overall_Sat_predict=data.frame(Satisfaction_with_Trade_Price=c(4,5,4,3), Satisfaction_with_Speed_of_Execution=c(2,3,3,2))
Overall_Sat_predict
##   Satisfaction_with_Trade_Price Satisfaction_with_Speed_of_Execution
## 1                             4                                    2
## 2                             5                                    3
## 3                             4                                    3
## 4                             3                                    2
predict(brokerage_1,Overall_Sat_predict, type="response")
##        1        2        3        4 
## 3.414448 4.678726 3.904117 2.639838

4.
To determine if a prediction will involve extrapolation, the largest leverage of all datapoints needs to be determined. The maximum leverage is datapoint number 14 and its value can be observed below.

max(hatvalues(brokerage_1))
## [1] 0.7462128

5. If a new datapoint has a leverage higher than the maximum (0.74621276), then the prediction will involve extrapolation. The design matrix is calculated first and can be observed below. The design matrix is used in the equation to calculate leverages based on new data points.

X=model.matrix(brokerage_1)
kable(X)
(Intercept) Satisfaction_with_Trade_Price Satisfaction_with_Speed_of_Execution
1 3.2 3.1
1 3.3 3.1
1 3.1 3.3
1 2.8 3.5
1 2.9 3.2
1 2.4 3.2
1 2.7 3.8
1 2.4 3.7
1 2.6 2.6
1 2.3 2.7
1 3.7 3.9
1 2.5 2.5
1 3.0 3.0
1 1.0 4.0

Below are the leverages of the new data points from question 3

x_newa=c(1,4,2)
t(x_newa)%*%solve(t(X)%*%X)%*%x_newa
##           [,1]
## [1,] 0.8323367
x_newb=c(1,5,3)
t(x_newb)%*%solve(t(X)%*%X)%*%x_newb
##         [,1]
## [1,] 1.08481
x_newc=c(1,4,3)
t(x_newc)%*%solve(t(X)%*%X)%*%x_newc
##           [,1]
## [1,] 0.3992325
x_newd=c(1,3,2)
t(x_newd)%*%solve(t(X)%*%X)%*%x_newd
##           [,1]
## [1,] 0.6074396

New data points in 3A and 3B would be considered extrapolation since they are greater than the maximum leverage.