Use the Brokerage Satisfaction excel file to answer the following questions in R. Create an R Markdown file to answer the questions, and then “knit” your file to create an HTML document. Your HTML document should contain both textual explanations of your answers, as well as all R code needed to support your work. Submit both your HTML document and original R Markdown file (i.e., your Rmd file).
#import the read excel library
library(readxl)
Brokerage <- read_excel("/Users/kamriefoster/Downloads/BrokerageSatisfaction.xlsx")
model <- lm(ElectronicTrades ~ ., data = Brokerage)
summary(model)
##
## Call:
## lm(formula = ElectronicTrades ~ ., data = Brokerage)
##
## Residuals:
## ALL 14 residuals are 0: no residual degrees of freedom!
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value
## (Intercept) 2.700e+00 NaN NaN
## ` Brokerage`Banc of America Investment Services -7.000e-01 NaN NaN
## ` Brokerage`Charles Schwab 5.000e-01 NaN NaN
## ` Brokerage`E*Trade Financial 3.000e-01 NaN NaN
## ` Brokerage`Fidelity Brokerage Services 1.300e+00 NaN NaN
## ` Brokerage`Firstrade Securities 3.000e-01 NaN NaN
## ` Brokerage`Interactive Brokers 1.300e+00 NaN NaN
## ` Brokerage`Scottrade, Inc. 5.000e-01 NaN NaN
## ` Brokerage`TD Ameritrade 1.000e+00 NaN NaN
## ` Brokerage`Thinkorswim -2.736e-16 NaN NaN
## ` Brokerage`USAA Brokerage Services 7.000e-01 NaN NaN
## ` Brokerage`Vanguard Brokerage Services -3.686e-16 NaN NaN
## ` Brokerage`Wells Fargo Investments -4.000e-01 NaN NaN
## ` Brokerage`Zecco.com -2.000e-01 NaN NaN
## TradePrice NA NA NA
## SpeedofExecution NA NA NA
## Pr(>|t|)
## (Intercept) NaN
## ` Brokerage`Banc of America Investment Services NaN
## ` Brokerage`Charles Schwab NaN
## ` Brokerage`E*Trade Financial NaN
## ` Brokerage`Fidelity Brokerage Services NaN
## ` Brokerage`Firstrade Securities NaN
## ` Brokerage`Interactive Brokers NaN
## ` Brokerage`Scottrade, Inc. NaN
## ` Brokerage`TD Ameritrade NaN
## ` Brokerage`Thinkorswim NaN
## ` Brokerage`USAA Brokerage Services NaN
## ` Brokerage`Vanguard Brokerage Services NaN
## ` Brokerage`Wells Fargo Investments NaN
## ` Brokerage`Zecco.com NaN
## TradePrice NA
## SpeedofExecution NA
##
## Residual standard error: NaN on 0 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: NaN
## F-statistic: NaN on 13 and 0 DF, p-value: NA
#linear regression cannot use a column with characters
model1 <- lm(ElectronicTrades ~ SpeedofExecution + TradePrice, data = Brokerage)
summary(model1)
##
## Call:
## lm(formula = ElectronicTrades ~ SpeedofExecution + TradePrice,
## data = Brokerage)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.58886 -0.13863 -0.09120 0.05781 0.64613
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.6633 0.8248 -0.804 0.438318
## SpeedofExecution 0.4897 0.2016 2.429 0.033469 *
## TradePrice 0.7746 0.1521 5.093 0.000348 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3435 on 11 degrees of freedom
## Multiple R-squared: 0.7256, Adjusted R-squared: 0.6757
## F-statistic: 14.54 on 2 and 11 DF, p-value: 0.0008157
-Based on your resulting p-values, is there likely a relationship between the overall satisfaction and the Satisfaction_with_Speed_of_Execution?
Based on my resulting p-values there is likely a relationship between the overall satisfaction and the Satisfaction_with_Speed_of_Execution.
-What p-value did you use to answer part a?
The p-value used is 0.033469 (which is under the 0.05 value needed for significance).
-Is there likely a relationship between the overall satisfaction and the Satisfaction_with_Trade_Price?
Based on my resulting p-values there is likely a relationship between the overall satisfaction and the the Satsifaction_with_Trade_Price.
-What p-value did you use to answer part c?
The p-value used is 0.000348 (which is under the 0.05 value needed for significance).
-How much of the variation in the sample values of Overall_Satisfaction_with_Electronic_Trades does the estimated regression model explain?
0.6757 of the variation in the sample values is estimated by the regression model (using the adjusted r-squared value since there are more than more than one independent variable).
-Calculate the residuals of the datapoints in the Brokerage Satisfaction excel file.
model1$residuals
## 1 2 3 4 5 6
## -0.133395541 -0.210856526 0.646131773 0.480581057 -0.149979422 -0.062674498
## 7 8 9 10 11 12
## -0.588858465 0.392491325 0.076204545 -0.140379336 -0.112435149 0.002632365
## 13 14
## -0.129506736 -0.069955392
-Check the normality of the residuals by creating a QQ-plot. Based on the plot, are the residuals normally distributed?
qqnorm(model1$residuals, main = "model1")
qqline(model1$residuals)
The QQ-plot shows that the residuals are not normally distributed. The distribution appears to be “heavy-tailed.”
-Determine the average leverage of the datapoints in the Brokerage Satisfaction excel file
leverage = hatvalues(model1)
leverage
## 1 2 3 4 5 6 7
## 0.12226809 0.14248379 0.10348052 0.09498002 0.07909281 0.09225511 0.17268548
## 8 9 10 11 12 13 14
## 0.14817354 0.22725402 0.22639250 0.45080022 0.28805237 0.10586879 0.74621276
mean(hatvalues(model1))
## [1] 0.2142857
-If a datapoint’s leverage is more than twice the average, it is generally considered to be a high leverage point. Using the average that you found in part a, determine which datapoints are considered to have high leverage.
2*mean(hatvalues(model1))
## [1] 0.4285714
There are two datapoints that are considered to be high leverage points. The two datapoints are number 11 (value of 0.45080022) and 14 (0.74621276).
-If a datapoint with unusually high leverage changes our regression or other predictive model greatly, we may possibly consider re-building the model without that abnormal datapoint. To determine the extent to which a datapoint impacts our model, we need to calculate the influence of the datapoints on our model. One measure for this influence is Cook’s Distance. If Cook’s Distance is greater than 1, the point is influential enough that we may want to consider either removing it from the model, or at the very least, treating it differently than the other data. Based on Cook’s Distance, are there outliers that may possibly need to be removed and/or treated differently in the dataset? Create a plot using R to support your answer.
cooks.distance(model1)
## 1 2 3 4 5 6
## 0.0079790547 0.0243407712 0.1518660850 0.0756708677 0.0059271795 0.0012425789
## 7 8 9 10 11 12
## 0.2471814307 0.0888808107 0.0062442356 0.0210623352 0.0533835032 0.0000111262
## 13 14
## 0.0062752299 0.1601935254
plot(model1)
There are no datapoints that have values over 1, thus there appear to be no outliers that should be removed from the dataset (due to that data points influence on the model). The plot (the last one of the 4) also supports this conclusion from the data that was calculated in R.
-Satisfaction_with_Speed_of_Execution = 2, Satisfaction_with_Trade_Price = 4 3.414448
-Satisfaction_with_Speed_of_Execution = 3, Satisfaction_with_Trade_Price = 5 4.678726
-Satisfaction_with_Speed_of_Execution = 3, Satisfaction_with_Trade_Price = 4 3.904117
-Satisfaction_with_Speed_of_Execution = 2, Satisfaction_with_Trade_Price = 3 2.639838
observations_for_pred = data.frame(SpeedofExecution = c(2, 3, 3, 2), TradePrice = c(4, 5, 4, 3))
observations_for_pred
## SpeedofExecution TradePrice
## 1 2 4
## 2 3 5
## 3 3 4
## 4 2 3
predict(model1, observations_for_pred, type = "response")
## 1 2 3 4
## 3.414448 4.678726 3.904117 2.639838
max(hatvalues(model1))
## [1] 0.7462128
The maximum leverage of the datapoints in the Brokerage Satisfaction excel file is 0.7462128.
X = model.matrix(model1)
X
## (Intercept) SpeedofExecution TradePrice
## 1 1 3.1 3.2
## 2 1 3.1 3.3
## 3 1 3.3 3.1
## 4 1 3.5 2.8
## 5 1 3.2 2.9
## 6 1 3.2 2.4
## 7 1 3.8 2.7
## 8 1 3.7 2.4
## 9 1 2.6 2.6
## 10 1 2.7 2.3
## 11 1 3.9 3.7
## 12 1 2.5 2.5
## 13 1 3.0 3.0
## 14 1 4.0 1.0
## attr(,"assign")
## [1] 0 1 2
x1_new = c(1,2,4)
t(x1_new)%*%solve(t(X)%*%X)%*%x1_new
## [,1]
## [1,] 0.8323367
x2_new = c(1,3,5)
t(x2_new)%*%solve(t(X)%*%X)%*%x2_new
## [,1]
## [1,] 1.08481
x3_new = c(1,3,4)
t(x3_new)%*%solve(t(X)%*%X)%*%x3_new
## [,1]
## [1,] 0.3992325
x4_new = c(1,2,3)
t(x4_new)%*%solve(t(X)%*%X)%*%x4_new
## [,1]
## [1,] 0.6074396
-Was the prediction made in part a of #3 considered to be extrapolation? (Note that predictions made with extrapolation are not extremely trustworthy.)
Yes the prediction made in part a of #3 is considered to be extrapolation. The value calculated is higher than the value calculated for the max leverage (0.8323367 > 0.7462128)
-Was the prediction made in part b of #3 considered to be extrapolation?
Yes the prediction made in part b of #3 is considered to be extrapolation. The value calculated is higher than the value calculated for the max leverage (1.08481 > 0.7462128)
-Was the prediction made in part c of #3 considered to be extrapolation?
No the prediction made in part c of #3 is not considered to be extrapolation. The value calculated is lower than the value calculated for the max leverage (0.3992325 < 0.7462128)
-Was the prediction made in part d of #3 considered to be extrapolation?
No the prediction made in part d of #3 is not considered to be extrapolation. The value calculated is lower than the value calculated for the max leverage (0.6074396 < 0.7462128)