Week 3 Homework Part 1

Use the Brokerage Satisfaction Excel file to answer the following questions in R. Create an R Markdown file to answer the questions, and then “knit” your file to create an HTML document. Your HTML document should contain both textual explanations of your answers and all R code needed to support your work.

  1. Read the Excel file titled Brokerage Satisfaction into R. You will use regression to predict the Overall_Satisfaction_with_Electronic_Trades. Remove any column(s) from the dataset that will likely not be useful in performing this task, and use all remaining columns of the dataset to perform a single regression to predict the overall satisfaction. Because the dataset is small, use all rows of the dataset as your training set (i.e., use all data rows to build the regression model). Using your results, you should be able to write an equation to predict Overall_Satisfaction_with_Electronic_Trades as:

B1(Satisfaction_with_Speed_of_Execution) + B2(Satisfaction_with_Trade_Price) + B3, where B1, B2, and B3 are real numbers

brokerage <- read.csv("C:/Users/raze1/OneDrive/Desktop/UIndy/MSDA 621/Homework/2/Brokerage Satisfaction.csv")
summary(brokerage)
##   Brokerage             Price           Speed          Overall     
##  Length:14          Min.   :1.000   Min.   :2.500   Min.   :2.000  
##  Class :character   1st Qu.:2.425   1st Qu.:3.025   1st Qu.:2.700  
##  Mode  :character   Median :2.750   Median :3.200   Median :3.000  
##                     Mean   :2.707   Mean   :3.257   Mean   :3.029  
##                     3rd Qu.:3.075   3rd Qu.:3.650   3rd Qu.:3.350  
##                     Max.   :3.700   Max.   :4.000   Max.   :4.000
head(brokerage)
##                     Brokerage Price Speed Overall
## 1             Scottrade, Inc.   3.2   3.1     3.2
## 2              Charles Schwab   3.3   3.1     3.2
## 3 Fidelity Brokerage Services   3.1   3.3     4.0
## 4               TD Ameritrade   2.8   3.5     3.7
## 5           E*Trade Financial   2.9   3.2     3.0
## 6                (Not listed)   2.4   3.2     2.7
colnames(brokerage)
## [1] "Brokerage" "Price"     "Speed"     "Overall"
brokerage$Brokerage <- NULL
colnames(brokerage)
## [1] "Price"   "Speed"   "Overall"
brokerage_model <- lm(Overall~., data = brokerage)
summary(brokerage_model)
## 
## Call:
## lm(formula = Overall ~ ., data = brokerage)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.58886 -0.13863 -0.09120  0.05781  0.64613 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.6633     0.8248  -0.804 0.438318    
## Price         0.7746     0.1521   5.093 0.000348 ***
## Speed         0.4897     0.2016   2.429 0.033469 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3435 on 11 degrees of freedom
## Multiple R-squared:  0.7256, Adjusted R-squared:  0.6757 
## F-statistic: 14.54 on 2 and 11 DF,  p-value: 0.0008157
confint(brokerage_model, level = .8)
##                   10 %      90 %
## (Intercept) -1.7879306 0.4612749
## Price        0.5672241 0.9819956
## Speed        0.2148115 0.7645252

Fill in the blanks for the following statements:
a) There is an 80% probability that the number B1 will fall between 0.2148115 and 0.7645252.
b) There is an 80% probability that the number B2 will fall between 0.5672241 and 0.9819956.
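The 80% limits reported by confint() can be cross-checked by hand as estimate ± t × standard error. The sketch below is only an illustration: it uses the rounded estimates and standard errors printed in the summary above rather than the original data, so it agrees with confint() to about three or four decimals. The variable names are made up for this example.

```r
# Rounded values copied from the summary(brokerage_model) output above.
est <- c(Price = 0.7746, Speed = 0.4897)   # coefficient estimates
se  <- c(Price = 0.1521, Speed = 0.2016)   # standard errors
df_resid <- 11                             # residual degrees of freedom
level <- 0.80

# Two-sided interval: estimate +/- t critical value * standard error
t_crit <- qt(1 - (1 - level) / 2, df = df_resid)
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
print(round(ci, 4))
```

The Speed row corresponds to B1 and the Price row to B2 in the equation from question 1.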

Q2A = data.frame(Speed = c(3), Price = c(4))
Q2C = data.frame(Speed = c(2), Price = c(3))
predictA<-predict(brokerage_model, Q2A, type = "response")
predictC<-predict(brokerage_model, Q2C, type = "response")
predictA
##        1 
## 3.904117
predictC
##        1 
## 2.639838
predictA_prediction<-predict(brokerage_model, Q2A , interval = "prediction", level = .9, type = "response")
predictA_prediction
##        fit      lwr      upr
## 1 3.904117 3.174452 4.633781
predictA_confidence<-predict(brokerage_model, Q2A, interval = "confidence", level = .9, type = "response")
predictA_confidence
##        fit      lwr      upr
## 1 3.904117 3.514362 4.293871
predictC_prediction<-predict(brokerage_model, Q2C, interval = "prediction", level = .85, type = "response")
predictC_prediction
##        fit      lwr      upr
## 1 2.639838 1.965909 3.313768
predictC_confidence<-predict(brokerage_model, Q2C, interval = "confidence", level = .85, type = "response")
predictC_confidence
##        fit      lwr      upr
## 1 2.639838 2.225554 3.054123
  2. Fill in the blanks:
    a) Suppose that we want to use the regression model created in the preceding question to predict the Overall_Satisfaction_with_Electronic_Trades when the Satisfaction_with_Speed_of_Execution is 3, and the Satisfaction_with_Trade_Price is 4. There is a 90% chance that this prediction will fall between 3.174452 and 4.633781.
    b) When the Satisfaction_with_Speed_of_Execution is 3 and the Satisfaction_with_Trade_Price is 4, there is a 90% chance that the mean response (i.e., mean value of the target variable) will fall between 3.514362 and 4.293871.
    c) Suppose that we want to use the regression model created in the preceding question to predict the Overall_Satisfaction_with_Electronic_Trades when the Satisfaction_with_Speed_of_Execution is 2, and the Satisfaction_with_Trade_Price is 3. There is an 85% chance that this prediction will fall between 1.965909 and 3.313768.
    d) When the Satisfaction_with_Speed_of_Execution is 2 and the Satisfaction_with_Trade_Price is 3, there is an 85% chance that the mean response will fall between 2.225554 and 3.054123.
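The prediction interval is wider than the confidence interval for the mean response because it adds the residual variance to the uncertainty in the fitted mean. As a rough check, assuming only the rounded numbers printed above (the coefficients, the residual standard error 0.3435, and the 90% confidence limits for Speed = 3, Price = 4), the sketch below recomputes the point prediction by hand and rebuilds the 90% prediction interval from the confidence interval. Variable names are illustrative.

```r
# Rounded values copied from the output above.
b0 <- -0.6633; b_price <- 0.7746; b_speed <- 0.4897
sigma_hat <- 0.3435    # residual standard error
df_resid  <- 11        # residual degrees of freedom

# Point prediction for Speed = 3, Price = 4, straight from the fitted equation
fit_A <- b0 + b_price * 4 + b_speed * 3

# Back out the standard error of the fitted mean from the 90% confidence interval
t_crit    <- qt(0.95, df = df_resid)
half_conf <- (4.293871 - 3.514362) / 2
se_fit    <- half_conf / t_crit

# The prediction interval adds the residual variance to the variance of the fitted mean
half_pred <- t_crit * sqrt(se_fit^2 + sigma_hat^2)
print(c(fit = fit_A, lwr = fit_A - half_pred, upr = fit_A + half_pred))
```

Because the inputs are rounded, the result matches predict(..., interval = "prediction") only to about three decimals.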
  3. Use unit normal scaling to calculate standardized regression coefficients for the model that you created in #1. Based on these coefficients, which covariate is more influential in predicting overall satisfaction? Is the Satisfaction_with_Speed_of_Execution more influential than the Satisfaction_with_Trade_Price? Or is the Satisfaction_with_Trade_Price more influential than the Satisfaction_with_Speed_of_Execution?
lm(formula = Overall ~ Price + Speed, data = brokerage)
## 
## Call:
## lm(formula = Overall ~ Price + Speed, data = brokerage)
## 
## Coefficients:
## (Intercept)        Price        Speed  
##     -0.6633       0.7746       0.4897
brokerage_unit_normal = as.data.frame(apply(brokerage, 2, function(x){(x-mean(x))/sd(x)}))
brokerage_model_unit_normal <- lm(Overall ~ Price + Speed, data = brokerage_unit_normal)
summary(brokerage_model_unit_normal)
## 
## Call:
## lm(formula = Overall ~ Price + Speed, data = brokerage_unit_normal)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.97638 -0.22987 -0.15121  0.09586  1.07134 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.115e-16  1.522e-01   0.000 1.000000    
## Price       8.115e-01  1.593e-01   5.093 0.000348 ***
## Speed       3.870e-01  1.593e-01   2.429 0.033469 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5695 on 11 degrees of freedom
## Multiple R-squared:  0.7256, Adjusted R-squared:  0.6757 
## F-statistic: 14.54 on 2 and 11 DF,  p-value: 0.0008157
lm(formula = Overall ~ Price + Speed, data = brokerage_unit_normal)
## 
## Call:
## lm(formula = Overall ~ Price + Speed, data = brokerage_unit_normal)
## 
## Coefficients:
## (Intercept)        Price        Speed  
##   4.115e-16    8.115e-01    3.870e-01

Satisfaction_with_Trade_Price is more influential than Satisfaction_with_Speed_of_Execution: its standardized coefficient (0.8115) is roughly twice that of Speed (0.3870).
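Unit normal scaling is equivalent to rescaling each raw slope by sd(x)/sd(y). The sketch below illustrates this on mtcars, a dataset that ships with base R and is used here purely as a stand-in, because the full brokerage data is not reproduced in this document. The model mpg ~ wt + hp is an arbitrary choice for the demonstration.

```r
# mtcars is a stand-in dataset, not the brokerage data.
d <- mtcars[, c("mpg", "wt", "hp")]

raw_fit <- lm(mpg ~ wt + hp, data = d)
std_fit <- lm(mpg ~ wt + hp, data = as.data.frame(scale(d)))  # unit normal scaling

# Standardized slope = raw slope * sd(x) / sd(y)
by_hand <- coef(raw_fit)[c("wt", "hp")] * sapply(d[c("wt", "hp")], sd) / sd(d$mpg)

print(rbind(scaled_fit = coef(std_fit)[c("wt", "hp")], by_hand = by_hand))
```

Under the same relationship, 0.7746 × sd(Price)/sd(Overall) should reproduce the standardized Price coefficient of 0.8115. The standard deviations are not printed in this document, so that check is not carried out here.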

Week 3 Homework Part 2

Use the data_RocketProp csv file to answer the following questions in R. Create an R Markdown file to answer the questions, and then “knit” your file to create an HTML document. Your HTML document should contain both textual explanations of your answers, as well as all R code needed to support your work.

    1. Create a linear regression to predict y based on x.
rocket <- read.csv("C:/Users/raze1/OneDrive/Desktop/UIndy/MSDA 621/Homework/3/data_RocketProp.csv")
summary(rocket)
##        y              x         
##  Min.   :1678   Min.   : 2.000  
##  1st Qu.:1783   1st Qu.: 7.125  
##  Median :2183   Median :12.750  
##  Mean   :2131   Mean   :13.363  
##  3rd Qu.:2342   3rd Qu.:19.625  
##  Max.   :2654   Max.   :25.000
head(rocket)
##         y     x
## 1 2158.70 15.50
## 2 1678.15 23.75
## 3 2316.00  8.00
## 4 2061.30 17.00
## 5 2207.50  5.50
## 6 1708.30 19.00
rocket_model<-lm(y ~ x, data = rocket)
summary(rocket_model)
## 
## Call:
## lm(formula = y ~ x, data = rocket)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -215.98  -50.68   28.74   66.61  106.76 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2627.822     44.184   59.48  < 2e-16 ***
## x            -37.154      2.889  -12.86 1.64e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 96.11 on 18 degrees of freedom
## Multiple R-squared:  0.9018, Adjusted R-squared:  0.8964 
## F-statistic: 165.4 on 1 and 18 DF,  p-value: 1.643e-10
  2. Create the design matrix for your regression model.
x=model.matrix(rocket_model)
x
##    (Intercept)     x
## 1            1 15.50
## 2            1 23.75
## 3            1  8.00
## 4            1 17.00
## 5            1  5.50
## 6            1 19.00
## 7            1 24.00
## 8            1  2.50
## 9            1  7.50
## 10           1 11.00
## 11           1 13.00
## 12           1  3.75
## 13           1 25.00
## 14           1  9.75
## 15           1 22.00
## 16           1 18.00
## 17           1  6.00
## 18           1 12.50
## 19           1  2.00
## 20           1 21.50
## attr(,"assign")
## [1] 0 1
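The design matrix is what lm() uses to solve the least-squares normal equations, beta = (X'X)^(-1) X'y. As a sketch, the code below rebuilds the data frame from the 20 rows listed in this document (so no file path is needed) and solves the normal equations directly. The names X and beta_hat are illustrative.

```r
# Data typed in from the rows printed in this document.
rocket <- data.frame(
  y = c(2158.70, 1678.15, 2316.00, 2061.30, 2207.50, 1708.30, 1784.70,
        2575.00, 2357.90, 2256.70, 2165.20, 2399.55, 1779.80, 2336.75,
        1765.30, 2053.50, 2414.40, 2200.50, 2654.20, 1753.70),
  x = c(15.50, 23.75, 8.00, 17.00, 5.50, 19.00, 24.00, 2.50, 7.50, 11.00,
        13.00, 3.75, 25.00, 9.75, 22.00, 18.00, 6.00, 12.50, 2.00, 21.50)
)

# Design matrix: a column of ones for the intercept plus the x column
X <- model.matrix(y ~ x, data = rocket)

# Solve (X'X) beta = X'y without forming the inverse explicitly
beta_hat <- solve(t(X) %*% X, t(X) %*% rocket$y)
print(beta_hat)
```

The result should agree with the intercept and slope reported by summary(rocket_model).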
  3. a) Calculate the leverage of all datapoints in the data_RocketProp csv file.
mean(hatvalues(rocket_model))
## [1] 0.1
hatvalues(rocket_model)
##          1          2          3          4          5          6          7 
## 0.05412893 0.14750959 0.07598722 0.06195725 0.10586587 0.07872092 0.15225968 
##          8          9         10         11         12         13         14 
## 0.15663134 0.08105925 0.05504393 0.05011875 0.13350221 0.17238964 0.06179345 
##         15         16         17         18         19         20 
## 0.11742196 0.06943538 0.09898644 0.05067227 0.16667373 0.10984216
cbind(rocket, leverage = hatvalues(rocket_model))
##          y     x   leverage
## 1  2158.70 15.50 0.05412893
## 2  1678.15 23.75 0.14750959
## 3  2316.00  8.00 0.07598722
## 4  2061.30 17.00 0.06195725
## 5  2207.50  5.50 0.10586587
## 6  1708.30 19.00 0.07872092
## 7  1784.70 24.00 0.15225968
## 8  2575.00  2.50 0.15663134
## 9  2357.90  7.50 0.08105925
## 10 2256.70 11.00 0.05504393
## 11 2165.20 13.00 0.05011875
## 12 2399.55  3.75 0.13350221
## 13 1779.80 25.00 0.17238964
## 14 2336.75  9.75 0.06179345
## 15 1765.30 22.00 0.11742196
## 16 2053.50 18.00 0.06943538
## 17 2414.40  6.00 0.09898644
## 18 2200.50 12.50 0.05067227
## 19 2654.20  2.00 0.16667373
## 20 1753.70 21.50 0.10984216
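For a regression with a single predictor, leverage has a closed form: h_i = 1/n + (x_i - mean(x))^2 / sum((x_j - mean(x))^2). The sketch below, which rebuilds the data from the rows listed above, compares that formula with hatvalues(). It also shows why the leverages average 0.1: they sum to the number of coefficients (2), and there are 20 rows.

```r
# Data typed in from the rows printed in this document.
rocket <- data.frame(
  y = c(2158.70, 1678.15, 2316.00, 2061.30, 2207.50, 1708.30, 1784.70,
        2575.00, 2357.90, 2256.70, 2165.20, 2399.55, 1779.80, 2336.75,
        1765.30, 2053.50, 2414.40, 2200.50, 2654.20, 1753.70),
  x = c(15.50, 23.75, 8.00, 17.00, 5.50, 19.00, 24.00, 2.50, 7.50, 11.00,
        13.00, 3.75, 25.00, 9.75, 22.00, 18.00, 6.00, 12.50, 2.00, 21.50)
)

n   <- nrow(rocket)
dev <- rocket$x - mean(rocket$x)

# Closed-form leverage for simple linear regression
h_closed <- 1 / n + dev^2 / sum(dev^2)

# Leverage as reported by R for the fitted model
h_lm <- hatvalues(lm(y ~ x, data = rocket))

print(all.equal(unname(h_lm), h_closed))
print(sum(h_closed))   # equals the number of model coefficients
```

Points far from the mean of x (such as x = 25 or x = 2) get the largest leverage, which is what the table above shows.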
  b) What is the maximum leverage calculated in part a? The maximum leverage is 0.17238964.
  c) Suppose that we want to use the regression created in #1 to predict the value of y when x is 25.5. Would this prediction be considered extrapolation?
x_newA=c(1, 25.5)
t(x_newA)%*%solve(t(x)%*%x)%*%x_newA
##           [,1]
## [1,] 0.1831324

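This is considered extrapolation: the leverage of the new point (0.1831324) exceeds the maximum leverage of the training data (0.17238964), and x = 25.5 lies outside the observed range of x (2 to 25).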

  d) Suppose that we want to use the regression created in #1 to predict the value of y when x is 15. Would this prediction be considered extrapolation?
x_newB=c(1, 15)
t(x_newB)%*%solve(t(x)%*%x)%*%x_newB
##            [,1]
## [1,] 0.05242319

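This is not considered extrapolation: the leverage of the new point (0.05242319) is below the maximum leverage of the training data (0.17238964), and x = 15 lies within the observed range of x (2 to 25).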
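One common convention for this check compares the leverage of the new point, x0'(X'X)^(-1)x0, with the largest leverage among the training rows. The sketch below wraps that comparison in a small helper. The function name leverage_new is hypothetical, and the data frame is rebuilt from the rows listed earlier in this document.

```r
# Data typed in from the rows printed in this document.
rocket <- data.frame(
  y = c(2158.70, 1678.15, 2316.00, 2061.30, 2207.50, 1708.30, 1784.70,
        2575.00, 2357.90, 2256.70, 2165.20, 2399.55, 1779.80, 2336.75,
        1765.30, 2053.50, 2414.40, 2200.50, 2654.20, 1753.70),
  x = c(15.50, 23.75, 8.00, 17.00, 5.50, 19.00, 24.00, 2.50, 7.50, 11.00,
        13.00, 3.75, 25.00, 9.75, 22.00, 18.00, 6.00, 12.50, 2.00, 21.50)
)
fit <- lm(y ~ x, data = rocket)

# Hypothetical helper: leverage of a new design row x0 = c(1, x_value)
leverage_new <- function(fit, x0) {
  X <- model.matrix(fit)
  drop(t(x0) %*% solve(t(X) %*% X) %*% x0)
}

h_max <- max(hatvalues(fit))       # largest leverage in the training data
h_a   <- leverage_new(fit, c(1, 25.5))
h_b   <- leverage_new(fit, c(1, 15))

# TRUE means the new point has more leverage than any training row
print(c(x_25.5 = h_a > h_max, x_15 = h_b > h_max))
```

With a single predictor, this comparison flags exactly the points that lie farther from the mean of x than any training row.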

  4. a) Calculate Cook’s Distance for all datapoints in the data_RocketProp csv file.
cooks.distance(rocket_model)
##            1            2            3            4            5            6 
## 0.0373281981 0.0497291858 0.0010260760 0.0161482719 0.3343768993 0.2290842436 
##            7            8            9           10           11           12 
## 0.0270491200 0.0191323748 0.0003959877 0.0047094549 0.0012482345 0.0761514881 
##           13           14           15           16           17           18 
## 0.0889892211 0.0192517639 0.0166302585 0.0387158541 0.0005955991 0.0041888627 
##           19           20 
## 0.1317143774 0.0425721512
cbind(rocket, leverage = hatvalues(rocket_model), Cooks = cooks.distance(rocket_model))
##          y     x   leverage        Cooks
## 1  2158.70 15.50 0.05412893 0.0373281981
## 2  1678.15 23.75 0.14750959 0.0497291858
## 3  2316.00  8.00 0.07598722 0.0010260760
## 4  2061.30 17.00 0.06195725 0.0161482719
## 5  2207.50  5.50 0.10586587 0.3343768993
## 6  1708.30 19.00 0.07872092 0.2290842436
## 7  1784.70 24.00 0.15225968 0.0270491200
## 8  2575.00  2.50 0.15663134 0.0191323748
## 9  2357.90  7.50 0.08105925 0.0003959877
## 10 2256.70 11.00 0.05504393 0.0047094549
## 11 2165.20 13.00 0.05011875 0.0012482345
## 12 2399.55  3.75 0.13350221 0.0761514881
## 13 1779.80 25.00 0.17238964 0.0889892211
## 14 2336.75  9.75 0.06179345 0.0192517639
## 15 1765.30 22.00 0.11742196 0.0166302585
## 16 2053.50 18.00 0.06943538 0.0387158541
## 17 2414.40  6.00 0.09898644 0.0005955991
## 18 2200.50 12.50 0.05067227 0.0041888627
## 19 2654.20  2.00 0.16667373 0.1317143774
## 20 1753.70 21.50 0.10984216 0.0425721512
plot(rocket_model)

  b) What is the maximum Cook’s Distance calculated in part a? 0.3343768993 (observation 5).
  c) Based on your answer to part b, are there any outliers in the dataset that we should be concerned about? No, none of the calculated Cook’s Distances approaches 1.
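Cook's Distance combines residual size and leverage: D_i = e_i^2 / (p × s^2) × h_i / (1 - h_i)^2, where p is the number of coefficients and s is the residual standard error. As a sketch, the code below rebuilds the data from the rows listed in this document, computes the formula by hand, and compares it with cooks.distance(). which.max() then identifies the row with the largest value.

```r
# Data typed in from the rows printed in this document.
rocket <- data.frame(
  y = c(2158.70, 1678.15, 2316.00, 2061.30, 2207.50, 1708.30, 1784.70,
        2575.00, 2357.90, 2256.70, 2165.20, 2399.55, 1779.80, 2336.75,
        1765.30, 2053.50, 2414.40, 2200.50, 2654.20, 1753.70),
  x = c(15.50, 23.75, 8.00, 17.00, 5.50, 19.00, 24.00, 2.50, 7.50, 11.00,
        13.00, 3.75, 25.00, 9.75, 22.00, 18.00, 6.00, 12.50, 2.00, 21.50)
)
fit <- lm(y ~ x, data = rocket)

e  <- residuals(fit)           # raw residuals
h  <- hatvalues(fit)           # leverages
p  <- length(coef(fit))        # number of coefficients (intercept + slope)
s2 <- summary(fit)$sigma^2     # residual variance estimate

# Cook's Distance by hand
D <- e^2 / (p * s2) * h / (1 - h)^2

print(all.equal(unname(D), unname(cooks.distance(fit))))
print(which.max(D))            # row with the largest Cook's Distance
print(max(D))
```

Using which.max() and max() instead of reading the printed table by eye avoids missing the largest entry when the values wrap across several lines of output.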