Read the Excel file titled Brokerage Satisfaction into R. You will use regression to predict Overall_Satisfaction_with_Electronic_Trades. Remove any column(s) from the dataset that will likely not be useful in performing this task, and use all remaining columns of the dataset to perform a single regression to predict the overall satisfaction. Because the dataset is small, use all rows of the dataset as your training set (i.e., use all data rows to build the regression model). Using your results, you should be able to write an equation to predict Overall_Satisfaction_with_Electronic_Trades as: B1(Satisfaction_with_Speed_of_Execution) + B2(Satisfaction_with_Trade_Price) + B3, where B1, B2, and B3 are real numbers. Fill in the blanks for the following statements:
library(readxl)
broker <- read_xlsx("C:/Users/justt/Desktop/School/621/Assignment/Homework 3/Brokerage Satisfaction.xlsx")
colnames(broker)
## [1] " Brokerage"
## [2] "Satisfaction_with_Trade_Price"
## [3] "Satisfaction_with_Speed_of_Execution"
## [4] "Overall_Satisfaction_with_Electronic_Trades"
# Brokerage is a label identifying each firm rather than a predictor, so it is
# removed and will not be used in these models.
broker <- broker[,-1]
colnames(broker)
## [1] "Satisfaction_with_Trade_Price"
## [2] "Satisfaction_with_Speed_of_Execution"
## [3] "Overall_Satisfaction_with_Electronic_Trades"
# Shorthand vectors named after the equation's terms: B1 multiplies speed of
# execution, B2 multiplies trade price, and B3 here holds the response
B1 <- broker$Satisfaction_with_Speed_of_Execution
B2 <- broker$Satisfaction_with_Trade_Price
B3 <- broker$Overall_Satisfaction_with_Electronic_Trades
# B1/B2/B3 are free-standing vectors rather than columns of broker, so the
# data = broker argument is redundant; lm() resolves them from the workspace
model1 <- lm(B3 ~ B1 + B2, data = broker)
summary(model1)
##
## Call:
## lm(formula = B3 ~ B1 + B2, data = broker)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.58886 -0.13863 -0.09120 0.05781 0.64613
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.6633 0.8248 -0.804 0.438318
## B1 0.4897 0.2016 2.429 0.033469 *
## B2 0.7746 0.1521 5.093 0.000348 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3435 on 11 degrees of freedom
## Multiple R-squared: 0.7256, Adjusted R-squared: 0.6757
## F-statistic: 14.54 on 2 and 11 DF, p-value: 0.0008157
Both slope coefficients are statistically significant predictors of Overall_Satisfaction_with_Electronic_Trades at the 0.05 level (p = 0.0335 for speed of execution, p = 0.0003 for trade price). Reading the estimates off the coefficient table, the requested equation is: Overall_Satisfaction_with_Electronic_Trades = 0.4897(Satisfaction_with_Speed_of_Execution) + 0.7746(Satisfaction_with_Trade_Price) - 0.6633.
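As a cross-check, the same numbers can be pulled straight from the fitted model object (a minimal sketch using only base R):
# extract the fitted coefficients that fill in the equation
b <- coef(model1)
b[["B1"]] # slope on Satisfaction_with_Speed_of_Execution (0.4897)
b[["B2"]] # slope on Satisfaction_with_Trade_Price (0.7746)
b[["(Intercept)"]] # the equation's B3 constant (-0.6633)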
confint(model1, level = 0.80)
## 10 % 90 %
## (Intercept) -1.7879306 0.4612749
## B1 0.2148115 0.7645252
## B2 0.5672241 0.9819956
We are 80% confident that the true coefficient B1 falls between 0.2148115 and 0.7645252.
We are 80% confident that the true coefficient B2 falls between 0.5672241 and 0.9819956. (Strictly speaking, each coefficient is a fixed number; the 80% describes how often intervals constructed this way would capture it over repeated samples.)
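These intervals can be reproduced by hand from each estimate, its standard error, and the t critical value on the residual degrees of freedom (a minimal sketch; an 80% interval uses the 0.90 quantile of the t distribution):
# rebuild confint(model1, level = 0.80) from first principles
est <- coef(summary(model1))[, "Estimate"]
se <- coef(summary(model1))[, "Std. Error"]
tcrit <- qt(0.90, df = df.residual(model1)) # 11 residual df here
cbind(lower_10 = est - tcrit * se, upper_90 = est + tcrit * se)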
B3_pred <- data.frame(B1 = c(3), B2 = c(4)) # new case: speed of execution = 3, trade price = 4
B3_pred
## B1 B2
## 1 3 4
predict(model1, B3_pred, type = "response")
## 1
## 3.904117
predict(model1, B3_pred, interval = "prediction", level = 0.90, type = "response")
## fit lwr upr
## 1 3.904117 3.174452 4.633781
With 90% confidence, a single new observation with these predictor values will fall between 3.174452 and 4.633781; a prediction interval covers an individual response rather than the mean.
predict(model1, B3_pred, interval = "confidence", level = 0.90, type = "response")
## fit lwr upr
## 1 3.904117 3.514362 4.293871
When Satisfaction_with_Speed_of_Execution is 3 and Satisfaction_with_Trade_Price is 4, we are 90% confident that the mean response falls between 3.514362 and 4.293871. This confidence interval is narrower than the prediction interval above because it covers only the average response, not the scatter of individual observations.
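The gap between the two intervals is visible in their standard errors (a minimal sketch built from the model's covariance matrix; the prediction interval simply adds the residual variance):
# mean-response vs. prediction standard errors at B1 = 3, B2 = 4
x0 <- c(1, 3, 4) # intercept, B1, B2, matching the order of coef(model1)
fit <- sum(x0 * coef(model1))
se_mean <- sqrt(drop(t(x0) %*% vcov(model1) %*% x0))
se_pred <- sqrt(se_mean^2 + summary(model1)$sigma^2) # adds residual variance
tcrit <- qt(0.95, df.residual(model1))
c(fit = fit,
  conf = fit + c(-1, 1) * tcrit * se_mean,
  pred = fit + c(-1, 1) * tcrit * se_pred)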
B3_pred <- data.frame(B1 = c(2), B2 = c(3)) # new case: speed of execution = 2, trade price = 3
B3_pred
## B1 B2
## 1 2 3
predict(model1, B3_pred, type = "response")
## 1
## 2.639838
predict(model1, B3_pred, interval = "prediction", level = 0.85, type = "response")
## fit lwr upr
## 1 2.639838 1.965909 3.313768
With 85% confidence, a single new observation with these predictor values will fall between 1.965909 and 3.313768.
predict(model1, B3_pred, interval = "confidence", level = 0.85, type = "response")
## fit lwr upr
## 1 2.639838 2.225554 3.054123
With 85% confidence, the mean response at these predictor values falls between 2.225554 and 3.054123.
# Need for standardized regression coefficients and unit normal scaling
head(broker)
## # A tibble: 6 × 3
## Satisfaction_with_Trade_Price Satisfaction_with_Speed_of_Exe… Overall_Satisfa…
## <dbl> <dbl> <dbl>
## 1 3.2 3.1 3.2
## 2 3.3 3.1 3.2
## 3 3.1 3.3 4
## 4 2.8 3.5 3.7
## 5 2.9 3.2 3
## 6 2.4 3.2 2.7
summary(model1) # unscaled fit, repeated for comparison with the standardized version below
##
## Call:
## lm(formula = B3 ~ B1 + B2, data = broker)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.58886 -0.13863 -0.09120 0.05781 0.64613
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.6633 0.8248 -0.804 0.438318
## B1 0.4897 0.2016 2.429 0.033469 *
## B2 0.7746 0.1521 5.093 0.000348 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3435 on 11 degrees of freedom
## Multiple R-squared: 0.7256, Adjusted R-squared: 0.6757
## F-statistic: 14.54 on 2 and 11 DF, p-value: 0.0008157
# transform the data using unit normal (z-score) scaling
broker_unit_normal <- as.data.frame(apply(broker, 2, function(x){(x - mean(x))/sd(x)}))
# repoint B1/B2/B3 at the scaled columns so the formula below does not
# silently reuse the original unscaled vectors from the workspace
B1 <- broker_unit_normal$Satisfaction_with_Speed_of_Execution
B2 <- broker_unit_normal$Satisfaction_with_Trade_Price
B3 <- broker_unit_normal$Overall_Satisfaction_with_Electronic_Trades
model1_unit_normal <- lm(B3 ~ B1 + B2, data = broker_unit_normal)
# obtain the standardized regression coefficients
model1_unit_normal
##
## Call:
## lm(formula = B3 ~ B1 + B2, data = broker_unit_normal)
##
## Coefficients:
## (Intercept) B1 B2
## -0.6633 0.4897 0.7746
Based on these coefficients, which covariate is more influential in predicting overall satisfaction: Satisfaction_with_Speed_of_Execution or Satisfaction_with_Trade_Price?
Based on the standardized coefficients, Satisfaction_with_Trade_Price (0.7746) is more influential in predicting overall satisfaction than Satisfaction_with_Speed_of_Execution (0.4897): a one-standard-deviation increase in trade-price satisfaction moves the predicted response further than the same increase in speed-of-execution satisfaction.
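An equivalent cross-check needs no refit: each raw slope rescaled by sd(x)/sd(y) should reproduce its standardized counterpart (a minimal sketch using the unscaled broker data and model1):
# standardized slope = raw slope * sd(x) / sd(y)
sd_y <- sd(broker$Overall_Satisfaction_with_Electronic_Trades)
c(B1_std = coef(model1)[["B1"]] *
    sd(broker$Satisfaction_with_Speed_of_Execution) / sd_y,
  B2_std = coef(model1)[["B2"]] *
    sd(broker$Satisfaction_with_Trade_Price) / sd_y)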
Use the data_RocketProp.csv file to answer the following questions in R. Create an R Markdown file to answer the questions, and then “knit” your file to create an HTML document. Your HTML document should contain both textual explanations of your answers, as well as all R code needed to support your work.
rocket <- read.csv("C:/Users/justt/Desktop/School/621/Assignment/Homework 3/data_RocketProp.csv")
colnames(rocket)
## [1] "y" "x"
model2 <- lm(y ~ x, data = rocket)
summary(model2)
##
## Call:
## lm(formula = y ~ x, data = rocket)
##
## Residuals:
## Min 1Q Median 3Q Max
## -215.98 -50.68 28.74 66.61 106.76
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2627.822 44.184 59.48 < 2e-16 ***
## x -37.154 2.889 -12.86 1.64e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 96.11 on 18 degrees of freedom
## Multiple R-squared: 0.9018, Adjusted R-squared: 0.8964
## F-statistic: 165.4 on 1 and 18 DF, p-value: 1.643e-10
X <- model.matrix(model2) # design matrix: a column of 1s for the intercept plus the x values
X
## (Intercept) x
## 1 1 15.50
## 2 1 23.75
## 3 1 8.00
## 4 1 17.00
## 5 1 5.50
## 6 1 19.00
## 7 1 24.00
## 8 1 2.50
## 9 1 7.50
## 10 1 11.00
## 11 1 13.00
## 12 1 3.75
## 13 1 25.00
## 14 1 9.75
## 15 1 22.00
## 16 1 18.00
## 17 1 6.00
## 18 1 12.50
## 19 1 2.00
## 20 1 21.50
## attr(,"assign")
## [1] 0 1
hatvalues(model2)
## 1 2 3 4 5 6 7
## 0.05412893 0.14750959 0.07598722 0.06195725 0.10586587 0.07872092 0.15225968
## 8 9 10 11 12 13 14
## 0.15663134 0.08105925 0.05504393 0.05011875 0.13350221 0.17238964 0.06179345
## 15 16 17 18 19 20
## 0.11742196 0.06943538 0.09898644 0.05067227 0.16667373 0.10984216
The leverage values for all data points in this dataset are shown in the output above.
max(hatvalues(model2))
## [1] 0.1723896
The maximum leverage, belonging to observation 13, is 0.17238964; max() prints it as 0.1723896 only because of R's default display precision, not because it was rounded up.
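For reference, these leverages are the diagonal of the hat matrix H = X(X'X)^-1 X', which can be verified against hatvalues() (a minimal sketch reusing the X defined above):
# leverages by hand: h_ii = diag(X (X'X)^-1 X')
H <- X %*% solve(t(X) %*% X) %*% t(X)
all.equal(unname(diag(H)), unname(hatvalues(model2))) # TRUE
which.max(hatvalues(model2)) # observation 13 has the largest leverage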
x_new <- c(1, 25.5) # candidate point in design-matrix form: intercept, x = 25.5
t(x_new) %*% solve(t(X) %*% X) %*% x_new # leverage-type quantity for the new point
## [,1]
## [1,] 0.1831324
Yes. For x = 25.5, the quantity x_new'(X'X)^-1 x_new is 0.1831324; note this is the leverage of the candidate point, not a predicted value of y. Since it exceeds the maximum leverage among the training points, 0.17238964, predicting at x = 25.5 is considered extrapolation.
x_new <- c(1, 15) # candidate point in design-matrix form: intercept, x = 15
t(x_new) %*% solve(t(X) %*% X) %*% x_new # leverage-type quantity for the new point
## [,1]
## [1,] 0.05242319
No. For x = 15, x_new'(X'X)^-1 x_new is 0.05242319, which is below the maximum leverage of 0.17238964, so x = 15 lies within the region covered by the training data and this prediction is not considered extrapolation.
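The two checks can be wrapped in a small helper (a hypothetical convenience function written for this assignment, not part of any package):
# hypothetical helper: flag extrapolation when the candidate point's
# leverage-type quantity exceeds the maximum training leverage
is_extrapolation <- function(x_new, X, model) {
  h00 <- drop(t(x_new) %*% solve(t(X) %*% X) %*% x_new)
  h00 > max(hatvalues(model))
}
is_extrapolation(c(1, 25.5), X, model2) # TRUE: outside the training region
is_extrapolation(c(1, 15), X, model2) # FALSE: inside the training region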
cooks.distance(model2)
## 1 2 3 4 5 6
## 0.0373281981 0.0497291858 0.0010260760 0.0161482719 0.3343768993 0.2290842436
## 7 8 9 10 11 12
## 0.0270491200 0.0191323748 0.0003959877 0.0047094549 0.0012482345 0.0761514881
## 13 14 15 16 17 18
## 0.0889892211 0.0192517639 0.0166302585 0.0387158541 0.0005955991 0.0041888627
## 19 20
## 0.1317143774 0.0425721512
The Cook’s distance values for all data points in this dataset are shown in the output above.
max(cooks.distance(model2))
## [1] 0.3343769
The maximum Cook’s distance is 0.3343769, at observation 5 (the full-precision value in the table above is 0.3343768993).
No, all points fall below the conventional cutoff of 1.0 for Cook’s distance, so there are no influential outliers that we need to be concerned about.
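As a sanity check, Cook’s distance can be rebuilt from the standardized residuals and leverages, using D_i = r_i^2 * h_i / (p * (1 - h_i)) with p model parameters (a minimal sketch):
# Cook's distance by hand from studentized residuals and leverages
p <- length(coef(model2)) # 2 parameters: intercept and slope
r <- rstandard(model2) # internally studentized residuals
h <- hatvalues(model2)
all.equal(r^2 * h / (p * (1 - h)), cooks.distance(model2)) # TRUE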
plot(model2) # residuals vs. fitted, normal Q-Q, scale-location, and residuals vs. leverage diagnostics