Week 3 Homework Part 1

Use the Brokerage Satisfaction excel file to answer the following questions in R. Create an R Markdown file to answer the questions, and then “knit” your file to create an HTML document. Your HTML document should contain both textual explanations of your answers, as well as all R code needed to support your work.

  1. Read the excel file titled Brokerage Satisfaction into R. You will use regression to predict the Overall_Satisfaction_with_Electronic_Trades. Remove any column(s) from the dataset that will likely not be useful in performing this task, and use all remaining columns of the dataset to perform a single regression to predict the overall satisfaction. Because the dataset is small, use all rows of the dataset as your training set (i.e., use all data rows to build the regression model). Using your results, you should be able to write an equation to predict Overall_Satisfaction_with_Electronic_Trades as:

B1(Satisfaction_with_Speed_of_Execution) + B2(Satisfaction_with_Trade_Price) + B3, where B1, B2, and B3 are real numbers

#import the read excel library
library(readxl)

brokerage <- read_excel("/Users/kamriefoster/Downloads/BrokerageSatisfaction.xlsx")
brokerage <- brokerage[,-1]
brokerage
## # A tibble: 14 × 3
##    TradePrice SpeedofExecution ElectronicTrades
##         <dbl>            <dbl>            <dbl>
##  1        3.2              3.1              3.2
##  2        3.3              3.1              3.2
##  3        3.1              3.3              4  
##  4        2.8              3.5              3.7
##  5        2.9              3.2              3  
##  6        2.4              3.2              2.7
##  7        2.7              3.8              2.7
##  8        2.4              3.7              3.4
##  9        2.6              2.6              2.7
## 10        2.3              2.7              2.3
## 11        3.7              3.9              4  
## 12        2.5              2.5              2.5
## 13        3                3                3  
## 14        1                4                2
# The first column was dropped above because it holds character values, which are not useful as a predictor here
model1 <- lm(ElectronicTrades ~ SpeedofExecution + TradePrice, data = brokerage)
summary(model1)
## 
## Call:
## lm(formula = ElectronicTrades ~ SpeedofExecution + TradePrice, 
##     data = brokerage)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.58886 -0.13863 -0.09120  0.05781  0.64613 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -0.6633     0.8248  -0.804 0.438318    
## SpeedofExecution   0.4897     0.2016   2.429 0.033469 *  
## TradePrice         0.7746     0.1521   5.093 0.000348 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3435 on 11 degrees of freedom
## Multiple R-squared:  0.7256, Adjusted R-squared:  0.6757 
## F-statistic: 14.54 on 2 and 11 DF,  p-value: 0.0008157
# Confidence intervals for coefficients
confint(model1, level = 0.80)
##                        10 %      90 %
## (Intercept)      -1.7879306 0.4612749
## SpeedofExecution  0.2148115 0.7645252
## TradePrice        0.5672241 0.9819956
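
The coefficient table can be turned into the requested prediction equation directly from the fitted model. The sketch below is self-contained: it retypes the 14 rows from the tibble printed above (an assumption that the printed values are the exact data) and also rebuilds the 80% intervals by hand as estimate plus or minus a t quantile times the standard error. The names `b`, `se`, `t_crit`, and `ci_manual` are illustrative only.

```r
# Data retyped from the tibble printed above (first column already dropped)
brokerage <- data.frame(
  TradePrice       = c(3.2, 3.3, 3.1, 2.8, 2.9, 2.4, 2.7, 2.4, 2.6, 2.3, 3.7, 2.5, 3.0, 1.0),
  SpeedofExecution = c(3.1, 3.1, 3.3, 3.5, 3.2, 3.2, 3.8, 3.7, 2.6, 2.7, 3.9, 2.5, 3.0, 4.0),
  ElectronicTrades = c(3.2, 3.2, 4.0, 3.7, 3.0, 2.7, 2.7, 3.4, 2.7, 2.3, 4.0, 2.5, 3.0, 2.0)
)
model1 <- lm(ElectronicTrades ~ SpeedofExecution + TradePrice, data = brokerage)

# Write the fitted equation in the B1, B2, B3 form asked for in the question
b <- coef(model1)
cat(sprintf("ElectronicTrades = %.4f * SpeedofExecution + %.4f * TradePrice + (%.4f)\n",
            b["SpeedofExecution"], b["TradePrice"], b["(Intercept)"]))

# 80% confidence intervals by hand: estimate +/- t quantile * standard error
se <- sqrt(diag(vcov(model1)))
t_crit <- qt(0.90, df = df.residual(model1))  # two-sided 80% leaves 10% in each tail
ci_manual <- cbind(lower = b - t_crit * se, upper = b + t_crit * se)
ci_manual
```

With the rounded coefficients from the summary, this gives B1 = 0.4897, B2 = 0.7746, and B3 = -0.6633.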

Fill in the blanks for the following statements:

  1. Fill in the blanks:
# Create a data frame with new observations in order to make predictions 
observations_for_pred = data.frame(SpeedofExecution = c(3), TradePrice = c(4))

observations_for_pred
##   SpeedofExecution TradePrice
## 1                3          4
# Obtain prediction for new observations
predict(model1, observations_for_pred, type = "response")
##        1 
## 3.904117
# Obtain prediction interval for new observations
predict(model1, observations_for_pred, interval = "prediction", level = 0.90, type = "response")
##        fit      lwr      upr
## 1 3.904117 3.174452 4.633781
# Obtain confidence Interval for mean response for new observations
predict(model1, observations_for_pred, interval = "confidence", level = 0.90, type = "response")
##        fit      lwr      upr
## 1 3.904117 3.514362 4.293871
# Create a data frame with new observations in order to make predictions 
observations_for_pred = data.frame(SpeedofExecution = c(2), TradePrice = c(3))

observations_for_pred
##   SpeedofExecution TradePrice
## 1                2          3
# Obtain prediction for new observations
predict(model1, observations_for_pred, type = "response")
##        1 
## 2.639838
# Obtain prediction interval for new observations
predict(model1, observations_for_pred, interval = "prediction", level = 0.85, type = "response")
##        fit      lwr      upr
## 1 2.639838 1.965909 3.313768
# Obtain confidence Interval for mean response for new observations
predict(model1, observations_for_pred, interval = "confidence", level = 0.85, type = "response")
##        fit      lwr      upr
## 1 2.639838 2.225554 3.054123
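
The prediction interval is wider than the confidence interval for the mean response because it adds the variance of a single new observation to the uncertainty in the fitted mean. A minimal sketch of that difference for the first new observation (SpeedofExecution = 3, TradePrice = 4) follows. It retypes the data from the printed tibble so that it runs on its own, and the names `x0`, `h0`, `pi_manual`, and `ci_manual` are illustrative, not part of the assignment.

```r
# Data retyped from the tibble printed earlier (assumed to be the exact values)
brokerage <- data.frame(
  TradePrice       = c(3.2, 3.3, 3.1, 2.8, 2.9, 2.4, 2.7, 2.4, 2.6, 2.3, 3.7, 2.5, 3.0, 1.0),
  SpeedofExecution = c(3.1, 3.1, 3.3, 3.5, 3.2, 3.2, 3.8, 3.7, 2.6, 2.7, 3.9, 2.5, 3.0, 4.0),
  ElectronicTrades = c(3.2, 3.2, 4.0, 3.7, 3.0, 2.7, 2.7, 3.4, 2.7, 2.3, 4.0, 2.5, 3.0, 2.0)
)
model1 <- lm(ElectronicTrades ~ SpeedofExecution + TradePrice, data = brokerage)

# New point in model-matrix order: intercept, SpeedofExecution, TradePrice
x0 <- c(1, 3, 4)
X  <- model.matrix(model1)
h0 <- drop(t(x0) %*% solve(t(X) %*% X) %*% x0)  # leverage of the new point

s      <- summary(model1)$sigma                 # residual standard error
fit    <- sum(coef(model1) * x0)                # point prediction
t_crit <- qt(0.95, df = df.residual(model1))    # two-sided 90% level

# Mean response: uncertainty in the fitted line only
ci_manual <- fit + c(-1, 1) * t_crit * s * sqrt(h0)
# New observation: adds the error variance of one more data point (the "1 +")
pi_manual <- fit + c(-1, 1) * t_crit * s * sqrt(1 + h0)

rbind(confidence = ci_manual, prediction = pi_manual)
```

The two rows should agree with the `lwr` and `upr` columns that `predict()` returned for this observation at the 90% level.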
  1. Use unit normal scaling to calculate standardized regression coefficients for the model that you created in #1. Based on these coefficients, which covariate is more influential in predicting overall satisfaction? Is the Satisfaction_with_Speed_of_Execution more influential than the Satisfaction_with_Trade_Price? Or is the Satisfaction_with_Trade_Price more influential than the Satisfaction_with_Speed_of_Execution?
# Inspect the data and the unscaled model before applying unit normal scaling
head(brokerage)
## # A tibble: 6 × 3
##   TradePrice SpeedofExecution ElectronicTrades
##        <dbl>            <dbl>            <dbl>
## 1        3.2              3.1              3.2
## 2        3.3              3.1              3.2
## 3        3.1              3.3              4  
## 4        2.8              3.5              3.7
## 5        2.9              3.2              3  
## 6        2.4              3.2              2.7
summary(model1)
## 
## Call:
## lm(formula = ElectronicTrades ~ SpeedofExecution + TradePrice, 
##     data = brokerage)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.58886 -0.13863 -0.09120  0.05781  0.64613 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -0.6633     0.8248  -0.804 0.438318    
## SpeedofExecution   0.4897     0.2016   2.429 0.033469 *  
## TradePrice         0.7746     0.1521   5.093 0.000348 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3435 on 11 degrees of freedom
## Multiple R-squared:  0.7256, Adjusted R-squared:  0.6757 
## F-statistic: 14.54 on 2 and 11 DF,  p-value: 0.0008157
# transform the data using unit normal scaling
brokerage_unit_normal = as.data.frame(apply(brokerage, 2, function(x){(x - mean(x))/sd(x)}))

# redo regression
model1_unit_normal <- lm(ElectronicTrades ~ SpeedofExecution + TradePrice, data = brokerage_unit_normal)

#obtain standardized regression coefficients
model1_unit_normal
## 
## Call:
## lm(formula = ElectronicTrades ~ SpeedofExecution + TradePrice, 
##     data = brokerage_unit_normal)
## 
## Coefficients:
##      (Intercept)  SpeedofExecution        TradePrice  
##        4.115e-16         3.870e-01         8.115e-01
summary(model1_unit_normal)
## 
## Call:
## lm(formula = ElectronicTrades ~ SpeedofExecution + TradePrice, 
##     data = brokerage_unit_normal)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.97638 -0.22987 -0.15121  0.09586  1.07134 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      4.115e-16  1.522e-01   0.000 1.000000    
## SpeedofExecution 3.870e-01  1.593e-01   2.429 0.033469 *  
## TradePrice       8.115e-01  1.593e-01   5.093 0.000348 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5695 on 11 degrees of freedom
## Multiple R-squared:  0.7256, Adjusted R-squared:  0.6757 
## F-statistic: 14.54 on 2 and 11 DF,  p-value: 0.0008157

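The Satisfaction_with_Trade_Price is more influential than the Satisfaction_with_Speed_of_Execution: its standardized coefficient (0.8115) is roughly twice that of Satisfaction_with_Speed_of_Execution (0.3870), so a one-standard-deviation change in trade-price satisfaction shifts the predicted overall satisfaction more.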
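
The standardized coefficients can also be obtained without refitting, by rescaling each unscaled slope by sd(x) / sd(y). The sketch below checks that shortcut against the unit-normal-scaled fit. It retypes the data from the printed tibble so that it is self-contained, and `beta_std` and `sds` are illustrative names.

```r
# Data retyped from the tibble printed earlier (assumed to be the exact values)
brokerage <- data.frame(
  TradePrice       = c(3.2, 3.3, 3.1, 2.8, 2.9, 2.4, 2.7, 2.4, 2.6, 2.3, 3.7, 2.5, 3.0, 1.0),
  SpeedofExecution = c(3.1, 3.1, 3.3, 3.5, 3.2, 3.2, 3.8, 3.7, 2.6, 2.7, 3.9, 2.5, 3.0, 4.0),
  ElectronicTrades = c(3.2, 3.2, 4.0, 3.7, 3.0, 2.7, 2.7, 3.4, 2.7, 2.3, 4.0, 2.5, 3.0, 2.0)
)
model1 <- lm(ElectronicTrades ~ SpeedofExecution + TradePrice, data = brokerage)

# Unscaled slopes (drop the intercept) and the sample SD of every column
b   <- coef(model1)[-1]
sds <- sapply(brokerage, sd)

# Standardized slope = unscaled slope * sd(predictor) / sd(response)
beta_std <- b * sds[names(b)] / sds["ElectronicTrades"]
beta_std

# The covariate with the larger absolute standardized slope is more influential
names(which.max(abs(beta_std)))
```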

Week 3 Homework Part 2

Use the data_RocketProp csv file to answer the following questions in R. Create an R Markdown file to answer the questions, and then “knit” your file to create an HTML document. Your HTML document should contain both textual explanations of your answers, as well as all R code needed to support your work.

# Read the CSV file into R
rocket <- read.csv("/Users/kamriefoster/Downloads/data_RocketProp.csv")
      1. Create a linear regression to predict y based on x.
model <- lm(y ~ ., data = rocket)
summary(model)
## 
## Call:
## lm(formula = y ~ ., data = rocket)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -215.98  -50.68   28.74   66.61  106.76 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2627.822     44.184   59.48  < 2e-16 ***
## x            -37.154      2.889  -12.86 1.64e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 96.11 on 18 degrees of freedom
## Multiple R-squared:  0.9018, Adjusted R-squared:  0.8964 
## F-statistic: 165.4 on 1 and 18 DF,  p-value: 1.643e-10
X = model.matrix(model)
X
##    (Intercept)     x
## 1            1 15.50
## 2            1 23.75
## 3            1  8.00
## 4            1 17.00
## 5            1  5.50
## 6            1 19.00
## 7            1 24.00
## 8            1  2.50
## 9            1  7.50
## 10           1 11.00
## 11           1 13.00
## 12           1  3.75
## 13           1 25.00
## 14           1  9.75
## 15           1 22.00
## 16           1 18.00
## 17           1  6.00
## 18           1 12.50
## 19           1  2.00
## 20           1 21.50
## attr(,"assign")
## [1] 0 1
  1. Calculate the leverage of all datapoints in the data_RocketProp csv file.
hatvalues(model)
##          1          2          3          4          5          6          7 
## 0.05412893 0.14750959 0.07598722 0.06195725 0.10586587 0.07872092 0.15225968 
##          8          9         10         11         12         13         14 
## 0.15663134 0.08105925 0.05504393 0.05011875 0.13350221 0.17238964 0.06179345 
##         15         16         17         18         19         20 
## 0.11742196 0.06943538 0.09898644 0.05067227 0.16667373 0.10984216
  1. What is the maximum leverage calculated in part a?
max(hatvalues(model))
## [1] 0.1723896

The maximum leverage calculated in part a is 0.17238964.
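
Leverage depends only on the predictor values, not on y, so the `hatvalues(model)` output can be reproduced from the x column of the model matrix alone. The following sketch builds the hat matrix H = X (X'X)^-1 X' explicitly and reads the leverages off its diagonal. The x values are retyped from the printed model matrix, and `H` and `h` are illustrative names.

```r
# x values retyped from the model matrix printed above
x <- c(15.50, 23.75, 8.00, 17.00, 5.50, 19.00, 24.00, 2.50, 7.50, 11.00,
       13.00, 3.75, 25.00, 9.75, 22.00, 18.00, 6.00, 12.50, 2.00, 21.50)

# Design matrix: a column of ones for the intercept, then x
X <- cbind(1, x)

# Hat matrix and its diagonal (the leverage of each observation)
H <- X %*% solve(t(X) %*% X) %*% t(X)
h <- diag(H)

max(h)        # largest leverage
which.max(h)  # observation that attains it
sum(h)        # equals the number of coefficients (2) for a full-rank X
```

The largest leverage belongs to observation 13 (x = 25), the point farthest from the mean of x.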

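# Leverage of a new point at x = 25.5: h_new = x_new' (X'X)^-1 x_new
x1_new = c(1, 25.5)  # leading 1 is the intercept term
t(x1_new) %*% solve(t(X) %*% X) %*% x1_new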
##           [,1]
## [1,] 0.1831324

Yes, predicting y when x is 25.5 would be extrapolation, because the leverage of the new point is higher than the maximum leverage among the data points used to fit the model (0.1831324 > 0.17238964).

  1. Suppose that we want to use the regression created in #1 to predict the value of y when x is 15. Would this prediction be considered extrapolation?
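# Leverage of a new point at x = 15: h_new = x_new' (X'X)^-1 x_new
x2_new = c(1, 15)  # leading 1 is the intercept term
t(x2_new) %*% solve(t(X) %*% X) %*% x2_new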
##            [,1]
## [1,] 0.05242319

No, predicting y when x is 15 would not be extrapolation, because the leverage of the new point is lower than the maximum leverage among the data points used to fit the model (0.05242319 < 0.17238964).
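
For a simple linear regression with an intercept, the matrix expression used in the two extrapolation checks above reduces to h_new = 1/n + (x_new - mean(x))^2 / sum((x - mean(x))^2). The sketch below wraps that closed form in a small helper. The function names `leverage_new` and `is_extrapolation` are hypothetical, and the x values are retyped from the printed model matrix.

```r
# x values retyped from the model matrix printed earlier
x <- c(15.50, 23.75, 8.00, 17.00, 5.50, 19.00, 24.00, 2.50, 7.50, 11.00,
       13.00, 3.75, 25.00, 9.75, 22.00, 18.00, 6.00, 12.50, 2.00, 21.50)

# Closed-form leverage for one predictor plus intercept (hypothetical helper)
leverage_new <- function(x_new, x) {
  1 / length(x) + (x_new - mean(x))^2 / sum((x - mean(x))^2)
}

# Largest leverage among the training points
h_max <- max(leverage_new(x, x))

# A new point is flagged as extrapolation when its leverage exceeds h_max
is_extrapolation <- function(x_new) leverage_new(x_new, x) > h_max

leverage_new(25.5, x); is_extrapolation(25.5)
leverage_new(15, x);   is_extrapolation(15)
```

This shortcut applies only to the one-predictor case. With several predictors the matrix form is needed, since a point can lie inside the range of each predictor separately and still be outside the joint region covered by the data.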

      1. Calculate Cook’s Distance for all datapoints in the data_RocketProp csv file.
cooks.distance(model)
##            1            2            3            4            5            6 
## 0.0373281981 0.0497291858 0.0010260760 0.0161482719 0.3343768993 0.2290842436 
##            7            8            9           10           11           12 
## 0.0270491200 0.0191323748 0.0003959877 0.0047094549 0.0012482345 0.0761514881 
##           13           14           15           16           17           18 
## 0.0889892211 0.0192517639 0.0166302585 0.0387158541 0.0005955991 0.0041888627 
##           19           20 
## 0.1317143774 0.0425721512
max(cooks.distance(model))
## [1] 0.3343769

Max Cook’s distance is 0.3343769.

plot(model)

Based on the answer in part b, no data points are of concern as influential outliers in the dataset, since every Cook’s Distance value is below the common cutoff of 1 (the largest is 0.3343769).
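
Cook's Distance combines each point's residual with its leverage: D_i = e_i^2 * h_i / (p * s^2 * (1 - h_i)^2), where p is the number of coefficients and s^2 is the residual variance. Because the y values of data_RocketProp are not reproduced in this document, the sketch below uses R's built-in `cars` dataset purely as a stand-in to show that the formula matches `cooks.distance()`. The names `fit`, `cooks_manual`, and `flagged` are illustrative.

```r
# Stand-in data: the built-in `cars` dataset, NOT the rocket propellant data
fit <- lm(dist ~ speed, data = cars)

e  <- residuals(fit)                 # raw residuals
h  <- hatvalues(fit)                 # leverages
p  <- length(coef(fit))              # number of coefficients (intercept + slope)
s2 <- sum(e^2) / df.residual(fit)    # residual variance estimate

# Cook's Distance from its definition
cooks_manual <- e^2 * h / (p * s2 * (1 - h)^2)

# Same rule of thumb as in the answer above: flag points with D_i > 1
flagged <- which(cooks_manual > 1)

head(cbind(manual = cooks_manual, builtin = cooks.distance(fit)))
```

Swapping `cars` for the `rocket` data frame and `dist ~ speed` for `y ~ x` would apply the same check to the homework model.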