library(readxl)
broker <- read_xlsx("C:/Users/justt/Desktop/School/621/Assignment/Homework 2/Brokerage Satisfaction.xlsx")
model1 <- lm(Overall_Satisfaction_with_Electronic_Trades ~  Satisfaction_with_Trade_Price + Satisfaction_with_Speed_of_Execution, data = broker)
summary(model1)
## 
## Call:
## lm(formula = Overall_Satisfaction_with_Electronic_Trades ~ Satisfaction_with_Trade_Price + 
##     Satisfaction_with_Speed_of_Execution, data = broker)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.58886 -0.13863 -0.09120  0.05781  0.64613 
## 
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                           -0.6633     0.8248  -0.804 0.438318    
## Satisfaction_with_Trade_Price          0.7746     0.1521   5.093 0.000348 ***
## Satisfaction_with_Speed_of_Execution   0.4897     0.2016   2.429 0.033469 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3435 on 11 degrees of freedom
## Multiple R-squared:  0.7256, Adjusted R-squared:  0.6757 
## F-statistic: 14.54 on 2 and 11 DF,  p-value: 0.0008157
  1. Read the excel file titled Brokerage Satisfaction into R. You will use regression to predict the Overall_Satisfaction_with_Electronic_Trades. Remove any column(s) from the dataset that will likely not be useful in performing this task, and use all remaining columns of the dataset to perform a single regression to predict the overall satisfaction. Because the dataset is small, use all rows of the dataset as your training set (i.e., use all data rows to build the regression model).
  1. Based on your resulting p-values, is there likely a relationship between the overall satisfaction and the Satisfaction_with_Speed_of_Execution?

Yes. Because the p-value is below the 0.05 threshold, there is likely a statistically significant relationship between Overall_Satisfaction_with_Electronic_Trades and Satisfaction_with_Speed_of_Execution.

  1. What p-value did you use to answer part a?

For part a, I used the p-value of 0.033469.

  1. Is there likely a relationship between the overall satisfaction and the Satisfaction_with_Trade_Price?

Yes. Because the p-value is below the 0.05 threshold, there is likely a statistically significant relationship between Overall_Satisfaction_with_Electronic_Trades and Satisfaction_with_Trade_Price. This p-value is much smaller than the one for Satisfaction_with_Speed_of_Execution, so the evidence for a relationship is stronger for Satisfaction_with_Trade_Price. The p-value alone does not measure the size of the effect.

  1. What p-value did you use to answer part c?

For part c, I used the p-value of 0.000348.
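
The p-values quoted in parts b and d can also be pulled out of the model object instead of being read off the printed summary. The following is a minimal sketch, assuming the one-decimal values in the printed data table are the exact contents of the Excel file; the data frame is retyped here only so the example runs on its own, and the names `p_values` and `significant` are illustrative.

```r
# Data transcribed from the printed table (assumption: printed values are exact)
broker <- data.frame(
  Satisfaction_with_Trade_Price = c(3.2, 3.3, 3.1, 2.8, 2.9, 2.4, 2.7, 2.4, 2.6, 2.3, 3.7, 2.5, 3.0, 1.0),
  Satisfaction_with_Speed_of_Execution = c(3.1, 3.1, 3.3, 3.5, 3.2, 3.2, 3.8, 3.7, 2.6, 2.7, 3.9, 2.5, 3.0, 4.0),
  Overall_Satisfaction_with_Electronic_Trades = c(3.2, 3.2, 4.0, 3.7, 3.0, 2.7, 2.7, 3.4, 2.7, 2.3, 4.0, 2.5, 3.0, 2.0)
)
model1 <- lm(Overall_Satisfaction_with_Electronic_Trades ~ Satisfaction_with_Trade_Price +
               Satisfaction_with_Speed_of_Execution, data = broker)

# The coefficient table is a matrix; its last column holds the p-values
p_values <- coef(summary(model1))[, "Pr(>|t|)"]
p_values

# Predictors that are significant at the 0.05 level (intercept excluded)
significant <- names(p_values)[-1][p_values[-1] < 0.05]
significant
```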

  1. How much of the variation in the sample values of Overall_Satisfaction_with_Electronic_Trades does the estimated regression model explain?

The estimated regression model explains 72.56% of the variation in the sample values of Overall_Satisfaction_with_Electronic_Trades (Multiple R-squared = 0.7256).
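
As a check on what Multiple R-squared means, it can be recomputed as one minus the ratio of the residual sum of squares to the total sum of squares. This is a sketch under the assumption that the printed one-decimal data values are exact; the variable names `rss`, `tss`, and `r_squared` are my own.

```r
# Data transcribed from the printed table (assumption: printed values are exact)
broker <- data.frame(
  Satisfaction_with_Trade_Price = c(3.2, 3.3, 3.1, 2.8, 2.9, 2.4, 2.7, 2.4, 2.6, 2.3, 3.7, 2.5, 3.0, 1.0),
  Satisfaction_with_Speed_of_Execution = c(3.1, 3.1, 3.3, 3.5, 3.2, 3.2, 3.8, 3.7, 2.6, 2.7, 3.9, 2.5, 3.0, 4.0),
  Overall_Satisfaction_with_Electronic_Trades = c(3.2, 3.2, 4.0, 3.7, 3.0, 2.7, 2.7, 3.4, 2.7, 2.3, 4.0, 2.5, 3.0, 2.0)
)
model1 <- lm(Overall_Satisfaction_with_Electronic_Trades ~ Satisfaction_with_Trade_Price +
               Satisfaction_with_Speed_of_Execution, data = broker)

y <- broker$Overall_Satisfaction_with_Electronic_Trades
rss <- sum(residuals(model1)^2)  # variation left unexplained by the model
tss <- sum((y - mean(y))^2)      # total variation around the sample mean
r_squared <- 1 - rss / tss       # share of variation the model explains
r_squared
```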

  1. Calculate the residuals of the datapoints in the Brokerage Satisfaction excel file.

The calculated residuals are:

model1$residuals
##            1            2            3            4            5            6 
## -0.133395541 -0.210856526  0.646131773  0.480581057 -0.149979422 -0.062674498 
##            7            8            9           10           11           12 
## -0.588858465  0.392491325  0.076204545 -0.140379336 -0.112435149  0.002632365 
##           13           14 
## -0.129506736 -0.069955392

The plots of the residuals, by variable, are:

plot(broker$Satisfaction_with_Trade_Price, model1$residuals, pch = 20)
abline(h = 0, col = "grey")

plot(broker$Satisfaction_with_Speed_of_Execution, model1$residuals, pch = 20)
abline(h = 0, col = "grey")

  1. Check the normality of the residuals by creating a QQ-plot. Based on the plot, are the residuals normally distributed?
qqnorm(model1$residuals, main = "model1")
qqline(model1$residuals)

  1. Examine the leverage of the datapoints by doing the following:
  1. Determine the average leverage of the datapoints in the Brokerage Satisfaction excel file.
mean(hatvalues(model1))
## [1] 0.2142857

The average of the leverages of the data points is 0.2142857.

  1. If a datapoint’s leverage is more than twice the average, it is generally considered to be a high leverage point. Using the average that you found in part a, determine which datapoints are considered to have high leverage.

Twice the average leverage is:

mean(hatvalues(model1))*2
## [1] 0.4285714

The leverages of the data points are:

cbind(broker, leverage = hatvalues(model1))
##                              Brokerage Satisfaction_with_Trade_Price
## 1                      Scottrade, Inc.                           3.2
## 2                       Charles Schwab                           3.3
## 3          Fidelity Brokerage Services                           3.1
## 4                        TD Ameritrade                           2.8
## 5                    E*Trade Financial                           2.9
## 6                         (Not listed)                           2.4
## 7          Vanguard Brokerage Services                           2.7
## 8              USAA Brokerage Services                           2.4
## 9                          Thinkorswim                           2.6
## 10             Wells Fargo Investments                           2.3
## 11                 Interactive Brokers                           3.7
## 12                           Zecco.com                           2.5
## 13                Firstrade Securities                           3.0
## 14 Banc of America Investment Services                           1.0
##    Satisfaction_with_Speed_of_Execution
## 1                                   3.1
## 2                                   3.1
## 3                                   3.3
## 4                                   3.5
## 5                                   3.2
## 6                                   3.2
## 7                                   3.8
## 8                                   3.7
## 9                                   2.6
## 10                                  2.7
## 11                                  3.9
## 12                                  2.5
## 13                                  3.0
## 14                                  4.0
##    Overall_Satisfaction_with_Electronic_Trades   leverage
## 1                                          3.2 0.12226809
## 2                                          3.2 0.14248379
## 3                                          4.0 0.10348052
## 4                                          3.7 0.09498002
## 5                                          3.0 0.07909281
## 6                                          2.7 0.09225511
## 7                                          2.7 0.17268548
## 8                                          3.4 0.14817354
## 9                                          2.7 0.22725402
## 10                                         2.3 0.22639250
## 11                                         4.0 0.45080022
## 12                                         2.5 0.28805237
## 13                                         3.0 0.10586879
## 14                                         2.0 0.74621276

Data point 11 (Interactive Brokers) and data point 14 (Banc of America Investment Services) have high leverage: their leverages, 0.45080022 and 0.74621276, both exceed twice the average leverage (0.4285714).
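
Rather than scanning the table by eye, the high-leverage points can be selected directly. A minimal sketch, assuming the printed data values are exact; `h`, `threshold`, and `high_leverage` are illustrative names.

```r
# Data transcribed from the printed table (assumption: printed values are exact)
broker <- data.frame(
  Satisfaction_with_Trade_Price = c(3.2, 3.3, 3.1, 2.8, 2.9, 2.4, 2.7, 2.4, 2.6, 2.3, 3.7, 2.5, 3.0, 1.0),
  Satisfaction_with_Speed_of_Execution = c(3.1, 3.1, 3.3, 3.5, 3.2, 3.2, 3.8, 3.7, 2.6, 2.7, 3.9, 2.5, 3.0, 4.0),
  Overall_Satisfaction_with_Electronic_Trades = c(3.2, 3.2, 4.0, 3.7, 3.0, 2.7, 2.7, 3.4, 2.7, 2.3, 4.0, 2.5, 3.0, 2.0)
)
model1 <- lm(Overall_Satisfaction_with_Electronic_Trades ~ Satisfaction_with_Trade_Price +
               Satisfaction_with_Speed_of_Execution, data = broker)

h <- hatvalues(model1)
# The mean leverage equals (number of coefficients) / (number of rows) = 3/14
threshold <- 2 * mean(h)
high_leverage <- which(h > threshold)  # row numbers above twice the average
high_leverage
```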

  1. If a datapoint with unusually high leverage changes our regression or other predictive model greatly, we may possibly consider re-building the model without that abnormal datapoint. To determine the extent to which a datapoint impacts our model, we need to calculate the influence of the datapoints on our model. One measure for this influence is Cook’s Distance. If Cook’s Distance is greater than 1, the point is influential enough that we may want to consider either removing it from the model, or at the very least, treating it differently than the other data. Based on Cook’s Distance, are there outliers that may possibly need to be removed and/or treated differently in the dataset? Create a plot using R to support your answer.
cooks.distance(model1)
##            1            2            3            4            5            6 
## 0.0079790547 0.0243407712 0.1518660850 0.0756708677 0.0059271795 0.0012425789 
##            7            8            9           10           11           12 
## 0.2471814307 0.0888808107 0.0062442356 0.0210623352 0.0533835032 0.0000111262 
##           13           14 
## 0.0062752299 0.1601935254
plot(model1)

None of the Cook’s Distances is larger than 1; the largest is 0.2471814307, for data point 7. The fourth chart produced by plot(model1) confirms that no point lies on or outside the Cook’s Distance contour of 1.
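
The same conclusion can be checked numerically, and the Cook’s Distance values can be plotted on their own. This is a sketch assuming the printed data values are exact; `d` is an illustrative name, and `which = 4` selects the Cook’s Distance panel of `plot()` for an `lm` object.

```r
# Data transcribed from the printed table (assumption: printed values are exact)
broker <- data.frame(
  Satisfaction_with_Trade_Price = c(3.2, 3.3, 3.1, 2.8, 2.9, 2.4, 2.7, 2.4, 2.6, 2.3, 3.7, 2.5, 3.0, 1.0),
  Satisfaction_with_Speed_of_Execution = c(3.1, 3.1, 3.3, 3.5, 3.2, 3.2, 3.8, 3.7, 2.6, 2.7, 3.9, 2.5, 3.0, 4.0),
  Overall_Satisfaction_with_Electronic_Trades = c(3.2, 3.2, 4.0, 3.7, 3.0, 2.7, 2.7, 3.4, 2.7, 2.3, 4.0, 2.5, 3.0, 2.0)
)
model1 <- lm(Overall_Satisfaction_with_Electronic_Trades ~ Satisfaction_with_Trade_Price +
               Satisfaction_with_Speed_of_Execution, data = broker)

d <- cooks.distance(model1)
which(d > 1)    # rows whose Cook's Distance exceeds 1 (empty if none do)
which.max(d)    # row with the largest Cook's Distance

# Bar-style plot of Cook's Distance for each observation
plot(model1, which = 4)
```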

  1. Use the regression model that you created in #1 to predict the Overall_Satisfaction_with_Electronic_Trades in each of the following scenarios:
  1. Satisfaction_with_Speed_of_Execution = 2, Satisfaction_with_Trade_Price = 4
X= model.matrix(model1)
X
##    (Intercept) Satisfaction_with_Trade_Price
## 1            1                           3.2
## 2            1                           3.3
## 3            1                           3.1
## 4            1                           2.8
## 5            1                           2.9
## 6            1                           2.4
## 7            1                           2.7
## 8            1                           2.4
## 9            1                           2.6
## 10           1                           2.3
## 11           1                           3.7
## 12           1                           2.5
## 13           1                           3.0
## 14           1                           1.0
##    Satisfaction_with_Speed_of_Execution
## 1                                   3.1
## 2                                   3.1
## 3                                   3.3
## 4                                   3.5
## 5                                   3.2
## 6                                   3.2
## 7                                   3.8
## 8                                   3.7
## 9                                   2.6
## 10                                  2.7
## 11                                  3.9
## 12                                  2.5
## 13                                  3.0
## 14                                  4.0
## attr(,"assign")
## [1] 0 1 2
x_new = c(1, 4, 2)
t(x_new)%*%solve(t(X)%*%X)%*%x_new
##           [,1]
## [1,] 0.8323367

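The leverage of this new point is 0.8323367. Using the rounded coefficients from the summary of model1, the predicted Overall_Satisfaction_with_Electronic_Trades is approximately -0.6633 + 0.7746(4) + 0.4897(2) ≈ 3.41.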

  1. Satisfaction_with_Speed_of_Execution = 3, Satisfaction_with_Trade_Price = 5
x_newb = c(1, 5, 3)
t(x_newb)%*%solve(t(X)%*%X)%*%x_newb
##         [,1]
## [1,] 1.08481

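The leverage of this new point is 1.08481. Using the rounded coefficients from the summary of model1, the predicted Overall_Satisfaction_with_Electronic_Trades is approximately -0.6633 + 0.7746(5) + 0.4897(3) ≈ 4.68.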

  1. Satisfaction_with_Speed_of_Execution = 3, Satisfaction_with_Trade_Price = 4
x_newc = c(1, 4, 3)
t(x_newc)%*%solve(t(X)%*%X)%*%x_newc
##           [,1]
## [1,] 0.3992325

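The leverage of this new point is 0.3992325. Using the rounded coefficients from the summary of model1, the predicted Overall_Satisfaction_with_Electronic_Trades is approximately -0.6633 + 0.7746(4) + 0.4897(3) ≈ 3.90.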

  1. Satisfaction_with_Speed_of_Execution = 2, Satisfaction_with_Trade_Price = 3
x_newd = c(1, 3, 2)
t(x_newd)%*%solve(t(X)%*%X)%*%x_newd
##           [,1]
## [1,] 0.6074396

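The leverage of this new point is 0.6074396. Using the rounded coefficients from the summary of model1, the predicted Overall_Satisfaction_with_Electronic_Trades is approximately -0.6633 + 0.7746(3) + 0.4897(2) ≈ 2.64.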
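
The four predictions asked for in this question can be obtained in one call to `predict()`. A minimal sketch, assuming the printed data values are exact; the data frame `new_points` and its row labels a to d are my own, chosen to match the four scenarios.

```r
# Data transcribed from the printed table (assumption: printed values are exact)
broker <- data.frame(
  Satisfaction_with_Trade_Price = c(3.2, 3.3, 3.1, 2.8, 2.9, 2.4, 2.7, 2.4, 2.6, 2.3, 3.7, 2.5, 3.0, 1.0),
  Satisfaction_with_Speed_of_Execution = c(3.1, 3.1, 3.3, 3.5, 3.2, 3.2, 3.8, 3.7, 2.6, 2.7, 3.9, 2.5, 3.0, 4.0),
  Overall_Satisfaction_with_Electronic_Trades = c(3.2, 3.2, 4.0, 3.7, 3.0, 2.7, 2.7, 3.4, 2.7, 2.3, 4.0, 2.5, 3.0, 2.0)
)
model1 <- lm(Overall_Satisfaction_with_Electronic_Trades ~ Satisfaction_with_Trade_Price +
               Satisfaction_with_Speed_of_Execution, data = broker)

# One row per scenario; column names must match the predictors in the model
new_points <- data.frame(
  Satisfaction_with_Trade_Price = c(4, 5, 4, 3),
  Satisfaction_with_Speed_of_Execution = c(2, 3, 3, 2),
  row.names = c("a", "b", "c", "d")
)

# Predicted overall satisfaction for each scenario
predictions <- predict(model1, newdata = new_points)
predictions
```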

  1. Extrapolation occurs when a prediction is made outside of the region of data used to train the model. This is “risky” because we do not have any prior data to see what tends to happen in the particular situation that we are trying to predict. To determine if a prediction will involve extrapolation, we need to first determine the largest leverage of all datapoints. What is the maximum leverage of the datapoints in the Brokerage Satisfaction excel file?
max(hatvalues(model1))
## [1] 0.7462128

The maximum leverage of the datapoints in the Brokerage Satisfaction excel file is 0.7462128.

  1. If a new datapoint has a leverage higher than the maximum (calculated in #4), then the prediction will involve extrapolation.
  1. Was the prediction made in part a of #3 considered to be extrapolation? (Note that predictions made with extrapolation are not extremely trustworthy.)

The max leverage of the original data set is 0.7462128. The leverage of this new data point is 0.8323367. This is larger than the max leverage of the original data set so this is an extrapolation.

  1. Was the prediction made in part b of #3 considered to be extrapolation?

The max leverage of the original data set is 0.7462128. The leverage of this new data point is 1.08481. This is larger than the max leverage of the original data set so this is an extrapolation.

  1. Was the prediction made in part c of #3 considered to be extrapolation?

The max leverage of the original data set is 0.7462128. The leverage of this new data point is 0.3992325. This is smaller than the max leverage of the original data set so this is not an extrapolation.

  1. Was the prediction made in part d of #3 considered to be extrapolation?

The max leverage of the original data set is 0.7462128. The leverage of this new data point is 0.6074396. This is smaller than the max leverage of the original data set, so this is not an extrapolation.
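
The four extrapolation checks can be done together by computing the leverage of every new point and comparing it with the maximum leverage in the training data. This is a sketch, assuming the printed data values are exact; `new_X`, `new_leverage`, and `is_extrapolation` are illustrative names.

```r
# Data transcribed from the printed table (assumption: printed values are exact)
broker <- data.frame(
  Satisfaction_with_Trade_Price = c(3.2, 3.3, 3.1, 2.8, 2.9, 2.4, 2.7, 2.4, 2.6, 2.3, 3.7, 2.5, 3.0, 1.0),
  Satisfaction_with_Speed_of_Execution = c(3.1, 3.1, 3.3, 3.5, 3.2, 3.2, 3.8, 3.7, 2.6, 2.7, 3.9, 2.5, 3.0, 4.0),
  Overall_Satisfaction_with_Electronic_Trades = c(3.2, 3.2, 4.0, 3.7, 3.0, 2.7, 2.7, 3.4, 2.7, 2.3, 4.0, 2.5, 3.0, 2.0)
)
model1 <- lm(Overall_Satisfaction_with_Electronic_Trades ~ Satisfaction_with_Trade_Price +
               Satisfaction_with_Speed_of_Execution, data = broker)

X <- model.matrix(model1)
XtX_inv <- solve(crossprod(X))  # (X'X)^{-1}

# Rows are scenarios a-d; columns are intercept, trade price, speed of execution
new_X <- cbind(1, c(4, 5, 4, 3), c(2, 3, 3, 2))

# Leverage of each new point: x' (X'X)^{-1} x, computed row by row
new_leverage <- rowSums((new_X %*% XtX_inv) * new_X)

# A prediction involves extrapolation if its leverage exceeds the training maximum
is_extrapolation <- new_leverage > max(hatvalues(model1))
data.frame(part = c("a", "b", "c", "d"),
           leverage = new_leverage,
           extrapolation = is_extrapolation)
```

Computing all four leverages from one matrix avoids rebuilding and reprinting the model matrix for each scenario.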