ISyE 6414 Regression Analysis — Spring 2023 Homework 7 For all of the problems in this assignment, please submit the relevant outputs and your R codes (if you used R). In addition, unless otherwise indicated, assume that α = 0.05. In a large metropolitan department store, the number of hours worked (Y ) per day by the clerical staff may depend on the following variables: X1: Number of pieces of mail processed X2: Number of money orders and gift certificates sold X3: Number of window payments transacted X4: Number of change order transactions processed X5: Number of checks cashed X6: Number of pieces of mail processed on an “as available” basis X7: Number of bus tickets sold X8: Working day of the week (they do not work on Sundays). Since X8 is a qualitative variable with six levels, five dummy variables will be used in the model. The attached table 6414-HW7-Clerical.txt provides the observed activities for 54 working days. Answer Questions 1–5 by using the information above. 1. Consider the following first-order multiple linear regression model built by using the variables X2, X5, and X8 for this problem. Y = β0 + β1X2 + β2X5 + β3D1 + β4D2 + β5D3 + β6D4 + β7D5 + , where Dj = { 1, if the day of the week is j 0, otherwise, for j = 1, . . . , 5. Solve the model and submit your output. Are the predictors X2, X5, and X8 statistically significant? You can answer by looking at their p-values.
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.2.2
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.2.1
## corrplot 0.92 loaded
data <- read.table("6414-HW7-Clerical.txt", header =TRUE)
data
## DAY Y X1 X2 X3 X4 X5 X6 X7
## 1 M 128.5 7781 100 886 235 644 56 737
## 2 T 113.6 7004 110 962 388 589 57 1029
## 3 W 146.6 7267 61 1342 398 1081 59 830
## 4 Th 124.3 2129 102 1153 457 891 57 1468
## 5 F 100.4 4878 45 803 577 537 49 335
## 6 S 119.2 3999 144 1127 345 563 64 918
## 7 M 109.5 11777 123 627 326 402 60 335
## 8 T 128.5 5764 78 748 161 495 57 962
## 9 W 131.2 7392 172 876 219 823 62 665
## 10 Th 112.2 8100 126 685 287 555 86 577
## 11 F 95.4 4736 115 436 235 456 38 214
## 12 S 124.6 4337 110 899 127 573 73 484
## 13 M 103.7 3079 96 570 180 428 59 456
## 14 T 103.6 7273 51 826 118 463 53 907
## 15 W 133.2 4091 116 1060 206 961 67 951
## 16 Th 111.4 3390 70 957 284 745 77 1446
## 17 F 97.7 6319 58 559 220 539 41 440
## 18 S 132.1 7447 83 1050 174 553 63 1133
## 19 M 135.9 7100 80 568 124 428 55 456
## 20 T 131.3 8035 115 709 174 498 78 968
## 21 W 150.4 5579 83 568 223 683 79 660
## 22 Th 124.9 4338 78 900 115 556 84 555
## 23 F 97.0 6895 18 442 118 479 41 203
## 24 S 114.1 3629 133 644 155 505 57 781
## 25 M 88.3 5149 92 389 124 405 59 236
## 26 T 117.6 5241 110 612 222 477 55 616
## 27 W 128.2 2917 69 1057 378 970 80 1210
## 28 Th 138.8 4390 70 974 195 1027 81 1452
## 29 F 109.5 4957 24 783 358 893 51 616
## 30 S 118.9 7099 130 1419 374 609 62 957
## 31 M 122.2 7337 128 1137 238 461 51 968
## 32 T 142.8 8301 115 946 191 771 74 719
## 33 W 133.9 4889 86 750 214 513 69 489
## 34 Th 100.2 6308 81 461 132 430 49 341
## 35 F 116.8 6908 145 864 164 549 57 902
## 36 S 97.3 5345 116 604 127 360 48 126
## 37 M 98.0 6994 59 714 107 473 53 726
## 38 T 136.5 6781 78 917 171 805 74 1100
## 39 W 111.7 3142 106 809 335 702 70 1721
## 40 Th 98.6 5738 27 546 126 455 52 502
## 41 F 116.2 4931 174 891 129 481 71 737
## 42 S 108.9 6501 69 643 129 334 47 473
## 43 M 120.6 5678 94 828 107 384 52 1083
## 44 T 131.8 4619 100 777 164 834 67 841
## 45 W 112.4 1832 124 626 158 571 71 627
## 46 Th 92.5 5445 52 432 121 458 42 313
## 47 F 120.0 4123 84 432 153 544 42 654
## 48 S 112.2 5884 89 1061 100 391 31 280
## 49 M 113.0 5505 45 562 84 444 36 814
## 50 T 138.7 2882 94 601 139 799 44 907
## 51 W 122.1 2395 89 637 201 747 30 1666
## 52 Th 86.6 6847 14 810 230 547 40 614
## 53 F 116.8 6808 148 854 164 539 57 905
## 54 S 122.2 5684 79 1161 105 371 39 285
D1 <- ifelse(data$DAY == "M",1,0)
D2 <- ifelse(data$DAY == "T",1,0)
D3 <- ifelse(data$DAY == "W",1,0)
D4 <- ifelse(data$DAY == "Th",1,0)
D5 <- ifelse(data$DAY == "F",1,0)
model <- lm(Y ~ X2 + X5 + D1 + D2 + D3 + D4 + D5, data = data)
summary(model)
##
## Call:
## lm(formula = Y ~ X2 + X5 + D1 + D2 + D3 + D4 + D5, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.8935 -6.7983 -0.7877 5.3274 26.7106
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 85.04109 7.74906 10.974 1.95e-14 ***
## X2 0.08810 0.04470 1.971 0.0548 .
## X5 0.04700 0.01034 4.544 3.98e-05 ***
## D1 -0.98761 5.24524 -0.188 0.8515
## D2 3.85589 5.48805 0.703 0.4858
## D3 -0.76465 6.11147 -0.125 0.9010
## D4 -10.74410 5.68588 -1.890 0.0651 .
## D5 -11.42393 5.31590 -2.149 0.0369 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.02 on 46 degrees of freedom
## Multiple R-squared: 0.5468, Adjusted R-squared: 0.4778
## F-statistic: 7.929 on 7 and 46 DF, p-value: 2.747e-06
The table shows that because X2’s p-value is greater than 0.05, it does not exhibit statistical importance. However, because X5’s p-value is less than 0.05, it exhibits statistical importance. Sunday is chosen as the starting point.
Additionally, the statistical importance of the dummy variable D5, which stands for Day-Friday, can be used to explain the significance of the qualitative variable X8. The other dummy variables (D1, D2, D3, and D4), however, have p-values that are greater than the predetermined cutoff of 0.05, suggesting that they are not statistically significant. However, X8 is regarded as important because there is at least one significant level.
model_aov <- aov(Y~DAY, data = data)
summary(model_aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## DAY 5 3772 754.4 4.228 0.00289 **
## Residuals 48 8564 178.4
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
To determine which days are significantly different from each other, we can perform pairwise t-tests with a Bonferroni correction for multiple comparisons.
mean = aggregate(Y ~ DAY, data = data, mean)
mean
## DAY Y
## 1 F 107.7556
## 2 M 113.3000
## 3 S 116.6111
## 4 T 127.1556
## 5 Th 109.9444
## 6 W 129.9667
Null hypothesis: μM= μT= μW= μTh= μF= μS Alternative Hypothesis: μM != μT != μW != μTh != μF != μS The alternative hypothesis contends that at least one mean differs from the others while the ANOVA null hypothesis states that all group means are equal. The null hypothesis in the specific case suggests that the typical everyday working hours are the same on all weekdays. The alternative theory, on the other hand, contends that the mean working hours vary from day to day. The obtained p-value was below the significance level of 0.05, so the results indicate rejecting the null hypothesis. Therefore, it can be concluded that the mean daily working hours are different on different dates. The calculated mean values, which show disparity among the means, support this claim.
mean_X2 <- mean(data$X2)
mean_X5 <- mean(data$X5)
Xdata <- data.frame(X2 = rep(mean_X2, 54), X5 = rep(mean_X5, 54), D1=D1, D2= D2, D3=D3, D4=D4, D5=D5)
Xdata
## X2 X5 D1 D2 D3 D4 D5
## 1 91.81481 588.7222 1 0 0 0 0
## 2 91.81481 588.7222 0 1 0 0 0
## 3 91.81481 588.7222 0 0 1 0 0
## 4 91.81481 588.7222 0 0 0 1 0
## 5 91.81481 588.7222 0 0 0 0 1
## 6 91.81481 588.7222 0 0 0 0 0
## 7 91.81481 588.7222 1 0 0 0 0
## 8 91.81481 588.7222 0 1 0 0 0
## 9 91.81481 588.7222 0 0 1 0 0
## 10 91.81481 588.7222 0 0 0 1 0
## 11 91.81481 588.7222 0 0 0 0 1
## 12 91.81481 588.7222 0 0 0 0 0
## 13 91.81481 588.7222 1 0 0 0 0
## 14 91.81481 588.7222 0 1 0 0 0
## 15 91.81481 588.7222 0 0 1 0 0
## 16 91.81481 588.7222 0 0 0 1 0
## 17 91.81481 588.7222 0 0 0 0 1
## 18 91.81481 588.7222 0 0 0 0 0
## 19 91.81481 588.7222 1 0 0 0 0
## 20 91.81481 588.7222 0 1 0 0 0
## 21 91.81481 588.7222 0 0 1 0 0
## 22 91.81481 588.7222 0 0 0 1 0
## 23 91.81481 588.7222 0 0 0 0 1
## 24 91.81481 588.7222 0 0 0 0 0
## 25 91.81481 588.7222 1 0 0 0 0
## 26 91.81481 588.7222 0 1 0 0 0
## 27 91.81481 588.7222 0 0 1 0 0
## 28 91.81481 588.7222 0 0 0 1 0
## 29 91.81481 588.7222 0 0 0 0 1
## 30 91.81481 588.7222 0 0 0 0 0
## 31 91.81481 588.7222 1 0 0 0 0
## 32 91.81481 588.7222 0 1 0 0 0
## 33 91.81481 588.7222 0 0 1 0 0
## 34 91.81481 588.7222 0 0 0 1 0
## 35 91.81481 588.7222 0 0 0 0 1
## 36 91.81481 588.7222 0 0 0 0 0
## 37 91.81481 588.7222 1 0 0 0 0
## 38 91.81481 588.7222 0 1 0 0 0
## 39 91.81481 588.7222 0 0 1 0 0
## 40 91.81481 588.7222 0 0 0 1 0
## 41 91.81481 588.7222 0 0 0 0 1
## 42 91.81481 588.7222 0 0 0 0 0
## 43 91.81481 588.7222 1 0 0 0 0
## 44 91.81481 588.7222 0 1 0 0 0
## 45 91.81481 588.7222 0 0 1 0 0
## 46 91.81481 588.7222 0 0 0 1 0
## 47 91.81481 588.7222 0 0 0 0 1
## 48 91.81481 588.7222 0 0 0 0 0
## 49 91.81481 588.7222 1 0 0 0 0
## 50 91.81481 588.7222 0 1 0 0 0
## 51 91.81481 588.7222 0 0 1 0 0
## 52 91.81481 588.7222 0 0 0 1 0
## 53 91.81481 588.7222 0 0 0 0 1
## 54 91.81481 588.7222 0 0 0 0 0
Y_pred = predict(model, discriminant_data = Xdata)
min_hours = which.min(Y_pred)
min_hours
## 23
## 23
Friday
var <- c("Y", "X1", "X2", "X3", "X4", "X5", "X6", "X7")
data2 <- data[, var]
pairs(data2)
cor(data2)
## Y X1 X2 X3 X4 X5
## Y 1.00000000 -0.00783577 0.28199476 0.45895267 0.07811222 0.57156203
## X1 -0.00783577 1.00000000 0.03000473 0.05808654 -0.04876899 -0.27604899
## X2 0.28199476 0.03000473 1.00000000 0.23147193 0.02888080 -0.01591777
## X3 0.45895267 0.05808654 0.23147193 1.00000000 0.42857717 0.45273679
## X4 0.07811222 -0.04876899 0.02888080 0.42857717 1.00000000 0.45651739
## X5 0.57156203 -0.27604899 -0.01591777 0.45273679 0.45651739 1.00000000
## X6 0.48155843 -0.01772820 0.33118216 0.29335165 0.18918461 0.40129215
## X7 0.43411353 -0.30144997 0.13767889 0.45356780 0.28771687 0.57468549
## X6 X7
## Y 0.4815584 0.4341135
## X1 -0.0177282 -0.3014500
## X2 0.3311822 0.1376789
## X3 0.2933516 0.4535678
## X4 0.1891846 0.2877169
## X5 0.4012922 0.5746855
## X6 1.0000000 0.3186671
## X7 0.3186671 1.0000000
For a cutoff of 0.5, X5 and X7 positively correlated
model_2 <- lm(Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + DAY, data = data)
library(car)
## Warning: package 'car' was built under R version 4.2.3
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.2.1
vif(model_2)
## GVIF Df GVIF^(1/(2*Df))
## X1 1.511473 1 1.229420
## X2 1.458887 1 1.207844
## X3 2.619486 1 1.618483
## X4 1.507899 1 1.227965
## X5 2.953366 1 1.718536
## X6 1.645622 1 1.282818
## X7 2.043674 1 1.429571
## DAY 4.336939 5 1.158026
Variance inflation factor (VIF) levels greater than 10 typically indicate multicollinearity among predictors. However, all VIF values for the entire first-order model in this case are lower than 10, suggesting no appreciable multicollinearity among the predictors. Therefore, it can be concluded that the entire first-order model’s predictor factors do not significantly correlate with one another.
ln(y) = ln(θ0) + θ1 ln(x) + θ2 ln(t^2) + ε
Now we can see that this equation is in the form of a linear regression model where the coefficients are the parameters we want to estimate, and ln(y), ln(x), and ln(t^2) are the predictor variables. We can use linear regression to estimate the coefficients.
Given the resulting solution to the transformed model, ˆy∗ = 3 + 2.5x∗ + 1.3t, we can see that the coefficients in the original equation are:
ˆθ0 = e^3 ˆθ1 = 2.5 ˆθ2 = 1.3
Plug in the given values of the coefficients: ln(y) = 1.099 + 0.405 * ln(x) + 0.223 * t + ε
For t = 8 and x = 12: ln(y) = 1.099 + 0.405 * ln(12) + 0.223 * 8 + ε ln(y) = 1.099 + 0.405 * 2.4849 + 1.7844 + ε ln(y) = 3.3833 + ε
Taking the exponential of both sides: y = exp(3.3833 + ε) ˆy∗ = exp(3.3833) = 29.39 (rounded to two decimal places)
To calculate ˆy: y = θ0 * x^(θ1) * exp(θ2tln(e)) = θ0 * x^(θ1) * e^(θ2*t)
For t = 8 and x = 12, and using the estimated coefficients: ˆy = θ0 * 12^(θ1) * e^(θ28) ˆy = 29.39^(1/θ0) * 12^(0.405/θ1) * e^(0.2238/θ2)
Suppose you are investigating allegations of gender discrimination in the hiring process of a particular firm. The claim is that females are much less likely to be hired than males with the same background, experience, and other qualifications. To perform this study a logistic regression model was developed, and 28 former applicants were used to fit the model. The response variable, yi = 1 if applicant i is hired; yi = 0 if i is not hired. The following variables were used as predictors: EDUC: Years of higher education (4, 6, or 8) EXP: Years of experience GENDER = 1 if male applicant; 0 if female applicant Answer Questions 8–12 by using the data given in 6414-HW7-Discrim.csv.
data8 <- read.csv("6414-HW7-Discrim.csv")
log_model <- glm(HIRE ~ ., data=data8, family=binomial)
summary(log_model)
##
## Call:
## glm(formula = HIRE ~ ., family = binomial, data = data8)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4380 -0.4573 -0.1009 0.1294 2.1804
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -14.2483 6.0805 -2.343 0.0191 *
## EDUC 1.1549 0.6023 1.917 0.0552 .
## EXP 0.9098 0.4293 2.119 0.0341 *
## GENDER 5.6037 2.6028 2.153 0.0313 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 35.165 on 27 degrees of freedom
## Residual deviance: 14.735 on 24 degrees of freedom
## AIC: 22.735
##
## Number of Fisher Scoring iterations: 7
1 - pchisq(log_model$null.deviance-log_model$deviance, 1)
## [1] 6.18563e-06
H0: β1 = β2 = β3 = 0 H1: β1!=0for atleast 1 value of i A very low p-value from the data analysis shows that the logistic regression model is suitable for the dataset. The test statistic in this study, the deviance, has a three-degree-of-freedom chi-squared distribution. It is possible to determine whether the regression model is suitable and at least one predictor is important in explaining the response variable by computing the critical value from this distribution.
The odds ratio between males and females is estimated to be 5.6 to 1 by the beta coefficient for gender, which is determined to be 5.6037. This shows that men are more likely than women to be employed. The corresponding pvalue for this coefficient is 0.0313, indicating that at a significance threshold of 0.05, this factor is statistically significant.
confint(log_model, level = 0.95)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -31.3101488 -5.628821
## EDUC 0.2049156 2.744315
## EXP 0.3302805 2.136563
## GENDER 1.8033770 12.674158
Confidence Interval: [0.33, 2.14]
iscriminant_data <- data.frame(EDUC=6, EXP=5, GENDER=0)
predict(log_model, discriminant_data = discriminant_data, type="response")
## 1 2 3 4 5 6
## 0.0040730658 0.0175489472 0.9768829419 0.7338510832 0.0001634423 0.0928208655
## 7 8 9 10 11 12
## 0.0992702542 0.0024990525 0.0016437556 0.9835245625 0.0992702542 0.3869926554
## 13 14 15 16 17 18
## 0.0004058834 0.2788778392 0.6281262147 0.0246124941 0.6443889709 0.3088615307
## 19 20 21 22 23 24
## 0.0369764902 0.0424842835 0.0061845776 0.1524780906 0.9941978529 0.1915301359
## 25 26 27 28
## 0.0163126738 0.3088615307 0.9937780455 0.9733825060
probability = 0.6