Use the Brokerage Satisfaction excel file to answer the following questions in R. Create an R Markdown file to answer the questions, and then “knit” your file to create an HTML document. Your HTML document should contain both textual explanations of your answers and all R code needed to support your work. Submit both your HTML document and your original R Markdown file (i.e., your Rmd file).

  1. Read the excel file titled Brokerage Satisfaction into R. You will use regression to predict the Overall_Satisfaction_with_Electronic_Trades. Remove any column(s) from the dataset that will likely not be useful in performing this task, and use all remaining columns of the dataset to perform a single regression to predict the overall satisfaction. Because the dataset is small, use all rows of the dataset as your training set (i.e., use all data rows to build the regression model).
# Import the readxl library to read Excel files
library(readxl)

Brokerage <- read_excel("/Users/kamriefoster/Downloads/BrokerageSatisfaction.xlsx")

# First attempt: regress overall satisfaction on all remaining columns
model <- lm(ElectronicTrades ~ ., data = Brokerage)
summary(model)
## 
## Call:
## lm(formula = ElectronicTrades ~ ., data = Brokerage)
## 
## Residuals:
## ALL 14 residuals are 0: no residual degrees of freedom!
## 
## Coefficients: (2 not defined because of singularities)
##                                                    Estimate Std. Error t value
## (Intercept)                                       2.700e+00        NaN     NaN
## `  Brokerage`Banc of America Investment Services -7.000e-01        NaN     NaN
## `  Brokerage`Charles Schwab                       5.000e-01        NaN     NaN
## `  Brokerage`E*Trade Financial                    3.000e-01        NaN     NaN
## `  Brokerage`Fidelity Brokerage Services          1.300e+00        NaN     NaN
## `  Brokerage`Firstrade Securities                 3.000e-01        NaN     NaN
## `  Brokerage`Interactive Brokers                  1.300e+00        NaN     NaN
## `  Brokerage`Scottrade, Inc.                      5.000e-01        NaN     NaN
## `  Brokerage`TD Ameritrade                        1.000e+00        NaN     NaN
## `  Brokerage`Thinkorswim                         -2.736e-16        NaN     NaN
## `  Brokerage`USAA Brokerage Services              7.000e-01        NaN     NaN
## `  Brokerage`Vanguard Brokerage Services         -3.686e-16        NaN     NaN
## `  Brokerage`Wells Fargo Investments             -4.000e-01        NaN     NaN
## `  Brokerage`Zecco.com                           -2.000e-01        NaN     NaN
## TradePrice                                               NA         NA      NA
## SpeedofExecution                                         NA         NA      NA
##                                                  Pr(>|t|)
## (Intercept)                                           NaN
## `  Brokerage`Banc of America Investment Services      NaN
## `  Brokerage`Charles Schwab                           NaN
## `  Brokerage`E*Trade Financial                        NaN
## `  Brokerage`Fidelity Brokerage Services              NaN
## `  Brokerage`Firstrade Securities                     NaN
## `  Brokerage`Interactive Brokers                      NaN
## `  Brokerage`Scottrade, Inc.                          NaN
## `  Brokerage`TD Ameritrade                            NaN
## `  Brokerage`Thinkorswim                              NaN
## `  Brokerage`USAA Brokerage Services                  NaN
## `  Brokerage`Vanguard Brokerage Services              NaN
## `  Brokerage`Wells Fargo Investments                  NaN
## `  Brokerage`Zecco.com                                NaN
## TradePrice                                             NA
## SpeedofExecution                                       NA
## 
## Residual standard error: NaN on 0 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:    NaN 
## F-statistic:   NaN on 13 and 0 DF,  p-value: NA
# The Brokerage (firm name) column is a unique identifier for each row, so it is not
# useful for prediction; including it saturates the model (zero residual degrees of
# freedom). Refit using only the two numeric predictors.
model1 <- lm(ElectronicTrades ~ SpeedofExecution + TradePrice, data = Brokerage)
summary(model1)
## 
## Call:
## lm(formula = ElectronicTrades ~ SpeedofExecution + TradePrice, 
##     data = Brokerage)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.58886 -0.13863 -0.09120  0.05781  0.64613 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -0.6633     0.8248  -0.804 0.438318    
## SpeedofExecution   0.4897     0.2016   2.429 0.033469 *  
## TradePrice         0.7746     0.1521   5.093 0.000348 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3435 on 11 degrees of freedom
## Multiple R-squared:  0.7256, Adjusted R-squared:  0.6757 
## F-statistic: 14.54 on 2 and 11 DF,  p-value: 0.0008157
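
An equivalent way to drop the unneeded column is to remove it by name and regress on everything that remains. This is only a sketch: the identifier column's exact name in the Excel file may include leading spaces, as the backticked names in the summary output above suggest, so it is matched by pattern here.

# Drop any column whose name contains "Brokerage" (the firm-name identifier),
# then regress overall satisfaction on all remaining columns
Brokerage_numeric <- Brokerage[ , !grepl("Brokerage", names(Brokerage))]
model1_alt <- lm(ElectronicTrades ~ ., data = Brokerage_numeric)
summary(model1_alt)  # should match the model1 summary above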

-Based on your resulting p-values, is there likely a relationship between the overall satisfaction and the Satisfaction_with_Speed_of_Execution?

Yes. Based on the resulting p-values, there is likely a relationship between the overall satisfaction and the Satisfaction_with_Speed_of_Execution.

-What p-value did you use to answer part a?

The p-value used is 0.033469, which is below the 0.05 significance threshold.

-Is there likely a relationship between the overall satisfaction and the Satisfaction_with_Trade_Price?

Yes. Based on the resulting p-values, there is likely a relationship between the overall satisfaction and the Satisfaction_with_Trade_Price.

-What p-value did you use to answer part c?

The p-value used is 0.000348, which is also below the 0.05 significance threshold.
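
For reference, these p-values can also be pulled directly from the coefficient table of the model summary rather than read off the printed output (a minimal sketch):

# Coefficient p-values extracted from the model summary
summary(model1)$coefficients["SpeedofExecution", "Pr(>|t|)"]
summary(model1)$coefficients["TradePrice", "Pr(>|t|)"]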

-How much of the variation in the sample values of Overall_Satisfaction_with_Electronic_Trades does the estimated regression model explain?

About 67.6% of the variation in the sample values of Overall_Satisfaction_with_Electronic_Trades is explained by the regression model, using the adjusted R-squared of 0.6757 (appropriate since there is more than one independent variable); the unadjusted multiple R-squared is 0.7256.
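
Both R-squared values can be pulled directly from the model summary as well (a minimal sketch):

# Multiple R-squared and adjusted R-squared of the fitted model
summary(model1)$r.squared
summary(model1)$adj.r.squared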

-Calculate the residuals of the datapoints in the Brokerage Satisfaction excel file.

model1$residuals
##            1            2            3            4            5            6 
## -0.133395541 -0.210856526  0.646131773  0.480581057 -0.149979422 -0.062674498 
##            7            8            9           10           11           12 
## -0.588858465  0.392491325  0.076204545 -0.140379336 -0.112435149  0.002632365 
##           13           14 
## -0.129506736 -0.069955392
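
As a sanity check, the residuals can be recomputed by hand as observed minus fitted values (a sketch, assuming the response column in the data is named ElectronicTrades, as in the model formula):

# Residuals are the observed responses minus the fitted values
Brokerage$ElectronicTrades - fitted(model1)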

-Check the normality of the residuals by creating a QQ-plot. Based on the plot, are the residuals normally distributed?

# QQ-plot of the residuals with a reference line for a normal distribution
qqnorm(model1$residuals, main = "model1")
qqline(model1$residuals)

The QQ-plot shows that the residuals do not appear to be normally distributed: the points deviate from the reference line in both tails, suggesting a “heavy-tailed” distribution.
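
As a supplementary check (not required by the question), a formal normality test could also be applied to the residuals; a small p-value would indicate a departure from normality:

# Shapiro-Wilk normality test on the residuals (supplement to the QQ-plot)
shapiro.test(model1$residuals)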

  2. Examine the leverage of the datapoints by doing the following:

-Determine the average leverage of the datapoints in the Brokerage Satisfaction excel file

# Leverage (hat values) of each datapoint
leverage = hatvalues(model1)
leverage
##          1          2          3          4          5          6          7 
## 0.12226809 0.14248379 0.10348052 0.09498002 0.07909281 0.09225511 0.17268548 
##          8          9         10         11         12         13         14 
## 0.14817354 0.22725402 0.22639250 0.45080022 0.28805237 0.10586879 0.74621276
mean(hatvalues(model1))
## [1] 0.2142857

-If a datapoint’s leverage is more than twice the average, it is generally considered to be a high leverage point. Using the average that you found in part a, determine which datapoints are considered to have high leverage.

2*mean(hatvalues(model1))
## [1] 0.4285714

Two datapoints exceed twice the average leverage (0.4285714) and are therefore considered high leverage points: observation 11 (leverage 0.45080022) and observation 14 (leverage 0.74621276).
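
The same comparison can be made programmatically (a minimal sketch, using the leverage vector computed above):

# Flag observations whose leverage exceeds twice the average leverage
which(leverage > 2 * mean(leverage))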

-If a datapoint with unusually high leverage changes our regression or other predictive model greatly, we may possibly consider re-building the model without that abnormal datapoint. To determine the extent to which a datapoint impacts our model, we need to calculate the influence of the datapoints on our model. One measure for this influence is Cook’s Distance. If Cook’s Distance is greater than 1, the point is influential enough that we may want to consider either removing it from the model, or at the very least, treating it differently than the other data. Based on Cook’s Distance, are there outliers that may possibly need to be removed and/or treated differently in the dataset? Create a plot using R to support your answer.

cooks.distance(model1)
##            1            2            3            4            5            6 
## 0.0079790547 0.0243407712 0.1518660850 0.0756708677 0.0059271795 0.0012425789 
##            7            8            9           10           11           12 
## 0.2471814307 0.0888808107 0.0062442356 0.0210623352 0.0533835032 0.0000111262 
##           13           14 
## 0.0062752299 0.1601935254
plot(model1)

No datapoint has a Cook’s Distance greater than 1, so there appear to be no outliers that need to be removed or treated differently because of their influence on the model. The Residuals vs Leverage plot (the last of the four diagnostic plots produced by plot(model1)) also supports this conclusion.
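
The same check can be expressed directly in code, and a dedicated Cook’s Distance plot can be requested on its own (a sketch):

# Any observations with Cook's Distance greater than 1?
which(cooks.distance(model1) > 1)
# Base R's fourth diagnostic plot shows Cook's Distance by observation
plot(model1, which = 4)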

  3. Use the regression model that you created in #1 to predict the Overall_Satisfaction_with_Electronic_Trades in each of the following scenarios:

-Satisfaction_with_Speed_of_Execution = 2, Satisfaction_with_Trade_Price = 4: predicted overall satisfaction = 3.414448

-Satisfaction_with_Speed_of_Execution = 3, Satisfaction_with_Trade_Price = 5: predicted overall satisfaction = 4.678726

-Satisfaction_with_Speed_of_Execution = 3, Satisfaction_with_Trade_Price = 4: predicted overall satisfaction = 3.904117

-Satisfaction_with_Speed_of_Execution = 2, Satisfaction_with_Trade_Price = 3: predicted overall satisfaction = 2.639838

observations_for_pred = data.frame(SpeedofExecution = c(2, 3, 3, 2), TradePrice = c(4, 5, 4, 3))
observations_for_pred
##   SpeedofExecution TradePrice
## 1                2          4
## 2                3          5
## 3                3          4
## 4                2          3
predict(model1, observations_for_pred, type = "response")
##        1        2        3        4 
## 3.414448 4.678726 3.904117 2.639838
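
As a check, the first prediction can be reproduced by hand from the fitted coefficients (a sketch):

# Prediction for SpeedofExecution = 2, TradePrice = 4 built from the coefficients;
# it should reproduce the first predicted value above (3.414448)
b <- coef(model1)
unname(b["(Intercept)"] + b["SpeedofExecution"] * 2 + b["TradePrice"] * 4)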

  4. Extrapolation occurs when a prediction is made outside of the region of data used to train the model. This is “risky” because we do not have any prior data to see what tends to happen in the particular situation that we are trying to predict. To determine if a prediction will involve extrapolation, we need to first determine the largest leverage of all datapoints. What is the maximum leverage of the datapoints in the Brokerage Satisfaction excel file?

max(hatvalues(model1))
## [1] 0.7462128

The maximum leverage of the datapoints in the Brokerage Satisfaction excel file is 0.7462128.

  5. If a new datapoint has a leverage higher than the maximum (calculated in #4), then the prediction will involve extrapolation. The leverage of a new point is computed as x_new' (X'X)^(-1) x_new, where X is the model matrix of the training data and x_new contains an intercept term of 1 followed by the new predictor values.

# Model matrix of the training data (intercept plus the two predictors)
X = model.matrix(model1)
X
##    (Intercept) SpeedofExecution TradePrice
## 1            1              3.1        3.2
## 2            1              3.1        3.3
## 3            1              3.3        3.1
## 4            1              3.5        2.8
## 5            1              3.2        2.9
## 6            1              3.2        2.4
## 7            1              3.8        2.7
## 8            1              3.7        2.4
## 9            1              2.6        2.6
## 10           1              2.7        2.3
## 11           1              3.9        3.7
## 12           1              2.5        2.5
## 13           1              3.0        3.0
## 14           1              4.0        1.0
## attr(,"assign")
## [1] 0 1 2
# New observation from part a of #3: intercept 1, SpeedofExecution = 2, TradePrice = 4
x1_new = c(1,2,4)
t(x1_new)%*%solve(t(X)%*%X)%*%x1_new
##           [,1]
## [1,] 0.8323367
# New observation from part b of #3: intercept 1, SpeedofExecution = 3, TradePrice = 5
x2_new = c(1,3,5)
t(x2_new)%*%solve(t(X)%*%X)%*%x2_new
##         [,1]
## [1,] 1.08481
# New observation from part c of #3: intercept 1, SpeedofExecution = 3, TradePrice = 4
x3_new = c(1,3,4)
t(x3_new)%*%solve(t(X)%*%X)%*%x3_new
##           [,1]
## [1,] 0.3992325
# New observation from part d of #3: intercept 1, SpeedofExecution = 2, TradePrice = 3
x4_new = c(1,2,3)
t(x4_new)%*%solve(t(X)%*%X)%*%x4_new
##           [,1]
## [1,] 0.6074396
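
The same computation can be done for all four new observations at once, using the prediction data frame from #3 (a minimal sketch):

# Leverage of each new point: h_new = x_new' (X'X)^(-1) x_new, computed row by row
X_new <- cbind(1, as.matrix(observations_for_pred))
h_new <- apply(X_new, 1, function(x) t(x) %*% solve(t(X) %*% X) %*% x)
h_new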

-Was the prediction made in part a of #3 considered to be extrapolation? (Note that predictions made with extrapolation are not extremely trustworthy.)

Yes, the prediction made in part a of #3 is considered to be extrapolation: the leverage calculated for this new point is higher than the maximum leverage of the training data (0.8323367 > 0.7462128).

-Was the prediction made in part b of #3 considered to be extrapolation?

Yes, the prediction made in part b of #3 is considered to be extrapolation: the leverage calculated for this new point is higher than the maximum leverage of the training data (1.08481 > 0.7462128).

-Was the prediction made in part c of #3 considered to be extrapolation?

No, the prediction made in part c of #3 is not considered to be extrapolation: the leverage calculated for this new point is lower than the maximum leverage of the training data (0.3992325 < 0.7462128).

-Was the prediction made in part d of #3 considered to be extrapolation?

No, the prediction made in part d of #3 is not considered to be extrapolation: the leverage calculated for this new point is lower than the maximum leverage of the training data (0.6074396 < 0.7462128).
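
These four comparisons can also be made in a single step (a sketch, using the h_new vector from the vectorized sketch above):

# TRUE indicates the corresponding prediction involves extrapolation
h_new > max(hatvalues(model1))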