Read the excel file titled Brokerage Satisfaction into R. You will use regression to predict the Overall_Satisfaction_with_Electronic_Trades. Remove any column(s) from the dataset that will likely not be useful in performing this task, and use all remaining columns of the dataset to perform a single regression to predict the overall satisfaction. Because the dataset is small, use all rows of the dataset as your training set (i.e., use all data rows to build the regression model). Using your results, you should be able to write an equation to predict Overall_Satisfaction_with_Electronic_Trades as: \[B_1(Satisfaction\_with\_Speed\_of\_Execution) + B_2(Satisfaction\_with\_Trade\_Price) + B_3\] where B1, B2, and B3 are real numbers.
library(readxl)
Brokerage_Satisfaction <- read_xlsx("C:/Users/Lynx/Documents/MSDA/621/Brokerage Satisfaction.xlsx")
str(Brokerage_Satisfaction)
## tibble [14 × 4] (S3: tbl_df/tbl/data.frame)
## $ Brokerage : chr [1:14] "Scottrade, Inc." "Charles Schwab" "Fidelity Brokerage Services" "TD Ameritrade" ...
## $ Satisfaction_with_Trade_Price : num [1:14] 3.2 3.3 3.1 2.8 2.9 2.4 2.7 2.4 2.6 2.3 ...
## $ Satisfaction_with_Speed_of_Execution : num [1:14] 3.1 3.1 3.3 3.5 3.2 3.2 3.8 3.7 2.6 2.7 ...
## $ Overall_Satisfaction_with_Electronic_Trades: num [1:14] 3.2 3.2 4 3.7 3 2.7 2.7 3.4 2.7 2.3 ...
B1 <- Brokerage_Satisfaction$Satisfaction_with_Speed_of_Execution
B2 <- Brokerage_Satisfaction$Satisfaction_with_Trade_Price
B3 <- Brokerage_Satisfaction$Overall_Satisfaction_with_Electronic_Trades
model <- lm(B3 ~ B1 + B2, data = Brokerage_Satisfaction)
summary(model)
##
## Call:
## lm(formula = B3 ~ B1 + B2, data = Brokerage_Satisfaction)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.58886 -0.13863 -0.09120 0.05781 0.64613
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.6633 0.8248 -0.804 0.438318
## B1 0.4897 0.2016 2.429 0.033469 *
## B2 0.7746 0.1521 5.093 0.000348 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3435 on 11 degrees of freedom
## Multiple R-squared: 0.7256, Adjusted R-squared: 0.6757
## F-statistic: 14.54 on 2 and 11 DF, p-value: 0.0008157
# Confidence intervals for coefficients
confint(model, level = 0.80)
## 10 % 90 %
## (Intercept) -1.7879306 0.4612749
## B1 0.2148115 0.7645252
## B2 0.5672241 0.9819956
Fill-in the blanks for the following statements:
0.215 and 0.765.0.567 and 0.982.Fill in the blanks:
# Create a data frame with new observations in order to make predictions
observation = data.frame(B1 = c(3), B2 = c(4))
observation
## B1 B2
## 1 3 4
# Obtain prediction for new observations
predict(model, observation, type = "response")
## 1
## 3.904117
# Obtain prediction interval for new observations
predict(model, observation, interval = "prediction", level = 0.90, type = "response")
## fit lwr upr
## 1 3.904117 3.174452 4.633781
3.174 and
4.634.# Obtain confidence Interval for mean response for new observations
predict(model, observation, interval = "confidence", level = 0.90, type = "response")
## fit lwr upr
## 1 3.904117 3.514362 4.293871
3.514 and 4.294.observation2 = data.frame(B1 = c(2), B2 = c(3))
observation2
## B1 B2
## 1 2 3
predict(model, observation2, type = "response")
## 1
## 2.639838
predict(model, observation2, interval = "prediction", level = 0.85, type = "response")
## fit lwr upr
## 1 2.639838 1.965909 3.313768
1.966 and
3.314.predict(model, observation2, interval = "confidence", level = 0.85, type = "response")
## fit lwr upr
## 1 2.639838 2.225554 3.054123
2.226 and
3.054.Use unit normal scaling to calculate standardized regression coefficients for the model that you created in #1. Based on these coefficients, which covariate is more influential in predicting overall satisfaction? Is the Satisfaction_with_Speed_of_Execution more influential than the Satisfaction_with_Trade_Price? Or is the Satisfaction_with_Trade_Price more influential than the Satisfaction_with_Speed_of_Execution?
# Remove non-numeric variable
brokerage <- Brokerage_Satisfaction[,-1]
colnames(brokerage)
## [1] "Satisfaction_with_Trade_Price"
## [2] "Satisfaction_with_Speed_of_Execution"
## [3] "Overall_Satisfaction_with_Electronic_Trades"
# Need for Standardized regression coefficients and unit normal scaling
head(brokerage)
## # A tibble: 6 × 3
## Satisfaction_with_Trade_Price Satisfaction_with_Speed_of_Exe… Overall_Satisfa…
## <dbl> <dbl> <dbl>
## 1 3.2 3.1 3.2
## 2 3.3 3.1 3.2
## 3 3.1 3.3 4
## 4 2.8 3.5 3.7
## 5 2.9 3.2 3
## 6 2.4 3.2 2.7
summary(model)
##
## Call:
## lm(formula = B3 ~ B1 + B2, data = Brokerage_Satisfaction)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.58886 -0.13863 -0.09120 0.05781 0.64613
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.6633 0.8248 -0.804 0.438318
## B1 0.4897 0.2016 2.429 0.033469 *
## B2 0.7746 0.1521 5.093 0.000348 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3435 on 11 degrees of freedom
## Multiple R-squared: 0.7256, Adjusted R-squared: 0.6757
## F-statistic: 14.54 on 2 and 11 DF, p-value: 0.0008157
# transform the data using unit normal scaling
brokerage_unit_normal = as.data.frame(apply(brokerage, 2, function(x){(x - mean(x))/sd(x)}))
# redo regression
model_unit_normal <- lm(B3 ~ B1 + B2, data = brokerage_unit_normal)
#obtain standardized regression coefficients
model_unit_normal
##
## Call:
## lm(formula = B3 ~ B1 + B2, data = brokerage_unit_normal)
##
## Coefficients:
## (Intercept) B1 B2
## -0.6633 0.4897 0.7746
summary(model_unit_normal)
##
## Call:
## lm(formula = B3 ~ B1 + B2, data = brokerage_unit_normal)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.58886 -0.13863 -0.09120 0.05781 0.64613
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.6633 0.8248 -0.804 0.438318
## B1 0.4897 0.2016 2.429 0.033469 *
## B2 0.7746 0.1521 5.093 0.000348 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3435 on 11 degrees of freedom
## Multiple R-squared: 0.7256, Adjusted R-squared: 0.6757
## F-statistic: 14.54 on 2 and 11 DF, p-value: 0.0008157
# Compare with the original regression, only estimates change
summary(model)
##
## Call:
## lm(formula = B3 ~ B1 + B2, data = Brokerage_Satisfaction)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.58886 -0.13863 -0.09120 0.05781 0.64613
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.6633 0.8248 -0.804 0.438318
## B1 0.4897 0.2016 2.429 0.033469 *
## B2 0.7746 0.1521 5.093 0.000348 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3435 on 11 degrees of freedom
## Multiple R-squared: 0.7256, Adjusted R-squared: 0.6757
## F-statistic: 14.54 on 2 and 11 DF, p-value: 0.0008157
Based on these coefficients, the B2 covariate is more influential in predicting the overall satisfaction than the B1 covariate.
library(readr)
## Warning: package 'readr' was built under R version 4.2.1
data_RocketProp <- read_csv("C:/Users/Lynx/Documents/MSDA/621/data_RocketProp.csv")
## Rows: 20 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (2): y, x
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
model2 <- lm(y ~ x, data = data_RocketProp)
summary(model2)
##
## Call:
## lm(formula = y ~ x, data = data_RocketProp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -215.98 -50.68 28.74 66.61 106.76
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2627.822 44.184 59.48 < 2e-16 ***
## x -37.154 2.889 -12.86 1.64e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 96.11 on 18 degrees of freedom
## Multiple R-squared: 0.9018, Adjusted R-squared: 0.8964
## F-statistic: 165.4 on 1 and 18 DF, p-value: 1.643e-10
## The design matrix for a model is a column of 1's followed by all of the independent variables in the dataset. It can be created using the following code:
X = model.matrix(model2)
X
## (Intercept) x
## 1 1 15.50
## 2 1 23.75
## 3 1 8.00
## 4 1 17.00
## 5 1 5.50
## 6 1 19.00
## 7 1 24.00
## 8 1 2.50
## 9 1 7.50
## 10 1 11.00
## 11 1 13.00
## 12 1 3.75
## 13 1 25.00
## 14 1 9.75
## 15 1 22.00
## 16 1 18.00
## 17 1 6.00
## 18 1 12.50
## 19 1 2.00
## 20 1 21.50
## attr(,"assign")
## [1] 0 1
## Leverage is a measure of how "far" a datapoint is from the center of the dataset.
leverage <- hatvalues(model2)
leverage
## 1 2 3 4 5 6 7
## 0.05412893 0.14750959 0.07598722 0.06195725 0.10586587 0.07872092 0.15225968
## 8 9 10 11 12 13 14
## 0.15663134 0.08105925 0.05504393 0.05011875 0.13350221 0.17238964 0.06179345
## 15 16 17 18 19 20
## 0.11742196 0.06943538 0.09898644 0.05067227 0.16667373 0.10984216
max(leverage)
## [1] 0.1723896
## A prediction is considered to be extrapolation when the leverage is higher than the maximum within our dataset.
x_new = c(1, 25.5)
## We calculate leverage **for a point not in our dataset** using the following code:
t(x_new)%*%solve(t(X)%*%X)%*%x_new
## [,1]
## [1,] 0.1831324
Since the leverage of x_new is greater than the maximum leverage
calculated in part 1, this prediction would be considered
extrapolation.
x_new2 = c(1, 15)
t(x_new2)%*%solve(t(X)%*%X)%*%x_new2
## [,1]
## [1,] 0.05242319
Since the leverage of x_new2 is less than the maximum leverage
calculated in part 1, this prediction would not be
considered extrapolation.
cooks <- cooks.distance(model2)
cooks
## 1 2 3 4 5 6
## 0.0373281981 0.0497291858 0.0010260760 0.0161482719 0.3343768993 0.2290842436
## 7 8 9 10 11 12
## 0.0270491200 0.0191323748 0.0003959877 0.0047094549 0.0012482345 0.0761514881
## 13 14 15 16 17 18
## 0.0889892211 0.0192517639 0.0166302585 0.0387158541 0.0005955991 0.0041888627
## 19 20
## 0.1317143774 0.0425721512
max(cooks)
## [1] 0.3343769
no outliers to be
concerned about.