Hw7_Regression_Abhilasha

ISyE 6414 Regression Analysis — Spring 2023 Homework 7 For all of the problems in this assignment, please submit the relevant outputs and your R codes (if you used R). In addition, unless otherwise indicated, assume that α = 0.05. In a large metropolitan department store, the number of hours worked (Y ) per day by the clerical staff may depend on the following variables: X1: Number of pieces of mail processed X2: Number of money orders and gift certificates sold X3: Number of window payments transacted X4: Number of change order transactions processed X5: Number of checks cashed X6: Number of pieces of mail processed on an “as available” basis X7: Number of bus tickets sold X8: Working day of the week (they do not work on Sundays). Since X8 is a qualitative variable with six levels, five dummy variables will be used in the model. The attached table 6414-HW7-Clerical.txt provides the observed activities for 54 working days. Answer Questions 1–5 by using the information above. 1. Consider the following first-order multiple linear regression model built by using the variables X2, X5, and X8 for this problem. Y = β0 + β1X2 + β2X5 + β3D1 + β4D2 + β5D3 + β6D4 + β7D5 + , where Dj = { 1, if the day of the week is j 0, otherwise, for j = 1, . . . , 5. Solve the model and submit your output. Are the predictors X2, X5, and X8 statistically significant? You can answer by looking at their p-values.

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.2.2

library(corrplot)

## Warning: package 'corrplot' was built under R version 4.2.1

## corrplot 0.92 loaded

data <- read.table("6414-HW7-Clerical.txt", header =TRUE)
data

##    DAY     Y    X1  X2   X3  X4   X5 X6   X7
## 1    M 128.5  7781 100  886 235  644 56  737
## 2    T 113.6  7004 110  962 388  589 57 1029
## 3    W 146.6  7267  61 1342 398 1081 59  830
## 4   Th 124.3  2129 102 1153 457  891 57 1468
## 5    F 100.4  4878  45  803 577  537 49  335
## 6    S 119.2  3999 144 1127 345  563 64  918
## 7    M 109.5 11777 123  627 326  402 60  335
## 8    T 128.5  5764  78  748 161  495 57  962
## 9    W 131.2  7392 172  876 219  823 62  665
## 10  Th 112.2  8100 126  685 287  555 86  577
## 11   F  95.4  4736 115  436 235  456 38  214
## 12   S 124.6  4337 110  899 127  573 73  484
## 13   M 103.7  3079  96  570 180  428 59  456
## 14   T 103.6  7273  51  826 118  463 53  907
## 15   W 133.2  4091 116 1060 206  961 67  951
## 16  Th 111.4  3390  70  957 284  745 77 1446
## 17   F  97.7  6319  58  559 220  539 41  440
## 18   S 132.1  7447  83 1050 174  553 63 1133
## 19   M 135.9  7100  80  568 124  428 55  456
## 20   T 131.3  8035 115  709 174  498 78  968
## 21   W 150.4  5579  83  568 223  683 79  660
## 22  Th 124.9  4338  78  900 115  556 84  555
## 23   F  97.0  6895  18  442 118  479 41  203
## 24   S 114.1  3629 133  644 155  505 57  781
## 25   M  88.3  5149  92  389 124  405 59  236
## 26   T 117.6  5241 110  612 222  477 55  616
## 27   W 128.2  2917  69 1057 378  970 80 1210
## 28  Th 138.8  4390  70  974 195 1027 81 1452
## 29   F 109.5  4957  24  783 358  893 51  616
## 30   S 118.9  7099 130 1419 374  609 62  957
## 31   M 122.2  7337 128 1137 238  461 51  968
## 32   T 142.8  8301 115  946 191  771 74  719
## 33   W 133.9  4889  86  750 214  513 69  489
## 34  Th 100.2  6308  81  461 132  430 49  341
## 35   F 116.8  6908 145  864 164  549 57  902
## 36   S  97.3  5345 116  604 127  360 48  126
## 37   M  98.0  6994  59  714 107  473 53  726
## 38   T 136.5  6781  78  917 171  805 74 1100
## 39   W 111.7  3142 106  809 335  702 70 1721
## 40  Th  98.6  5738  27  546 126  455 52  502
## 41   F 116.2  4931 174  891 129  481 71  737
## 42   S 108.9  6501  69  643 129  334 47  473
## 43   M 120.6  5678  94  828 107  384 52 1083
## 44   T 131.8  4619 100  777 164  834 67  841
## 45   W 112.4  1832 124  626 158  571 71  627
## 46  Th  92.5  5445  52  432 121  458 42  313
## 47   F 120.0  4123  84  432 153  544 42  654
## 48   S 112.2  5884  89 1061 100  391 31  280
## 49   M 113.0  5505  45  562  84  444 36  814
## 50   T 138.7  2882  94  601 139  799 44  907
## 51   W 122.1  2395  89  637 201  747 30 1666
## 52  Th  86.6  6847  14  810 230  547 40  614
## 53   F 116.8  6808 148  854 164  539 57  905
## 54   S 122.2  5684  79 1161 105  371 39  285

D1 <- ifelse(data$DAY == "M",1,0)
D2 <- ifelse(data$DAY == "T",1,0)
D3 <- ifelse(data$DAY == "W",1,0)
D4 <- ifelse(data$DAY == "Th",1,0)
D5 <- ifelse(data$DAY == "F",1,0)
model <- lm(Y ~ X2 + X5 + D1 + D2 + D3 + D4 + D5, data = data)
summary(model)

## 
## Call:
## lm(formula = Y ~ X2 + X5 + D1 + D2 + D3 + D4 + D5, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.8935  -6.7983  -0.7877   5.3274  26.7106 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  85.04109    7.74906  10.974 1.95e-14 ***
## X2            0.08810    0.04470   1.971   0.0548 .  
## X5            0.04700    0.01034   4.544 3.98e-05 ***
## D1           -0.98761    5.24524  -0.188   0.8515    
## D2            3.85589    5.48805   0.703   0.4858    
## D3           -0.76465    6.11147  -0.125   0.9010    
## D4          -10.74410    5.68588  -1.890   0.0651 .  
## D5          -11.42393    5.31590  -2.149   0.0369 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.02 on 46 degrees of freedom
## Multiple R-squared:  0.5468, Adjusted R-squared:  0.4778 
## F-statistic: 7.929 on 7 and 46 DF,  p-value: 2.747e-06

The table shows that because X2’s p-value is greater than 0.05, it does not exhibit statistical importance. However, because X5’s p-value is less than 0.05, it exhibits statistical importance. Sunday is chosen as the starting point.

Additionally, the statistical importance of the dummy variable D5, which stands for Day-Friday, can be used to explain the significance of the qualitative variable X8. The other dummy variables (D1, D2, D3, and D4), however, have p-values that are greater than the predetermined cutoff of 0.05, suggesting that they are not statistically significant. However, X8 is regarded as important because there is at least one significant level.

Is the mean number of hours worked on a working day of a week identical for each day? Support your answer by hypothesis testing and expected values.

model_aov <- aov(Y~DAY, data = data)
summary(model_aov)

##             Df Sum Sq Mean Sq F value  Pr(>F)   
## DAY          5   3772   754.4   4.228 0.00289 **
## Residuals   48   8564   178.4                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

To determine which days are significantly different from each other, we can perform pairwise t-tests with a Bonferroni correction for multiple comparisons.

mean = aggregate(Y ~ DAY, data = data, mean)
mean

##   DAY        Y
## 1   F 107.7556
## 2   M 113.3000
## 3   S 116.6111
## 4   T 127.1556
## 5  Th 109.9444
## 6   W 129.9667

Null hypothesis: μM= μT= μW= μTh= μF= μS Alternative Hypothesis: μM != μT != μW != μTh != μF != μS The alternative hypothesis contends that at least one mean differs from the others while the ANOVA null hypothesis states that all group means are equal. The null hypothesis in the specific case suggests that the typical everyday working hours are the same on all weekdays. The alternative theory, on the other hand, contends that the mean working hours vary from day to day. The obtained p-value was below the significance level of 0.05, so the results indicate rejecting the null hypothesis. Therefore, it can be concluded that the mean daily working hours are different on different dates. The calculated mean values, which show disparity among the means, support this claim.

Which day of the week has the lowest mean number of hours worked when X2 and X5 are kept constant for all days?

mean_X2 <- mean(data$X2)
mean_X5 <- mean(data$X5)
Xdata <- data.frame(X2 = rep(mean_X2, 54), X5 = rep(mean_X5, 54), D1=D1, D2= D2, D3=D3, D4=D4, D5=D5)
Xdata

##          X2       X5 D1 D2 D3 D4 D5
## 1  91.81481 588.7222  1  0  0  0  0
## 2  91.81481 588.7222  0  1  0  0  0
## 3  91.81481 588.7222  0  0  1  0  0
## 4  91.81481 588.7222  0  0  0  1  0
## 5  91.81481 588.7222  0  0  0  0  1
## 6  91.81481 588.7222  0  0  0  0  0
## 7  91.81481 588.7222  1  0  0  0  0
## 8  91.81481 588.7222  0  1  0  0  0
## 9  91.81481 588.7222  0  0  1  0  0
## 10 91.81481 588.7222  0  0  0  1  0
## 11 91.81481 588.7222  0  0  0  0  1
## 12 91.81481 588.7222  0  0  0  0  0
## 13 91.81481 588.7222  1  0  0  0  0
## 14 91.81481 588.7222  0  1  0  0  0
## 15 91.81481 588.7222  0  0  1  0  0
## 16 91.81481 588.7222  0  0  0  1  0
## 17 91.81481 588.7222  0  0  0  0  1
## 18 91.81481 588.7222  0  0  0  0  0
## 19 91.81481 588.7222  1  0  0  0  0
## 20 91.81481 588.7222  0  1  0  0  0
## 21 91.81481 588.7222  0  0  1  0  0
## 22 91.81481 588.7222  0  0  0  1  0
## 23 91.81481 588.7222  0  0  0  0  1
## 24 91.81481 588.7222  0  0  0  0  0
## 25 91.81481 588.7222  1  0  0  0  0
## 26 91.81481 588.7222  0  1  0  0  0
## 27 91.81481 588.7222  0  0  1  0  0
## 28 91.81481 588.7222  0  0  0  1  0
## 29 91.81481 588.7222  0  0  0  0  1
## 30 91.81481 588.7222  0  0  0  0  0
## 31 91.81481 588.7222  1  0  0  0  0
## 32 91.81481 588.7222  0  1  0  0  0
## 33 91.81481 588.7222  0  0  1  0  0
## 34 91.81481 588.7222  0  0  0  1  0
## 35 91.81481 588.7222  0  0  0  0  1
## 36 91.81481 588.7222  0  0  0  0  0
## 37 91.81481 588.7222  1  0  0  0  0
## 38 91.81481 588.7222  0  1  0  0  0
## 39 91.81481 588.7222  0  0  1  0  0
## 40 91.81481 588.7222  0  0  0  1  0
## 41 91.81481 588.7222  0  0  0  0  1
## 42 91.81481 588.7222  0  0  0  0  0
## 43 91.81481 588.7222  1  0  0  0  0
## 44 91.81481 588.7222  0  1  0  0  0
## 45 91.81481 588.7222  0  0  1  0  0
## 46 91.81481 588.7222  0  0  0  1  0
## 47 91.81481 588.7222  0  0  0  0  1
## 48 91.81481 588.7222  0  0  0  0  0
## 49 91.81481 588.7222  1  0  0  0  0
## 50 91.81481 588.7222  0  1  0  0  0
## 51 91.81481 588.7222  0  0  1  0  0
## 52 91.81481 588.7222  0  0  0  1  0
## 53 91.81481 588.7222  0  0  0  0  1
## 54 91.81481 588.7222  0  0  0  0  0

Y_pred = predict(model, discriminant_data = Xdata)
min_hours = which.min(Y_pred)
min_hours

## 23 
## 23

Friday

Now consider all predictors except the qualitative variable (X8). Produce and submit the scatter plot matrix of Y and all quantitative predictors. By observing the plots of Xi vs. Xj , state which predictors may be related (correlated).

var <- c("Y", "X1", "X2", "X3", "X4", "X5", "X6", "X7")
data2 <- data[, var]
pairs(data2)

cor(data2)

##              Y          X1          X2         X3          X4          X5
## Y   1.00000000 -0.00783577  0.28199476 0.45895267  0.07811222  0.57156203
## X1 -0.00783577  1.00000000  0.03000473 0.05808654 -0.04876899 -0.27604899
## X2  0.28199476  0.03000473  1.00000000 0.23147193  0.02888080 -0.01591777
## X3  0.45895267  0.05808654  0.23147193 1.00000000  0.42857717  0.45273679
## X4  0.07811222 -0.04876899  0.02888080 0.42857717  1.00000000  0.45651739
## X5  0.57156203 -0.27604899 -0.01591777 0.45273679  0.45651739  1.00000000
## X6  0.48155843 -0.01772820  0.33118216 0.29335165  0.18918461  0.40129215
## X7  0.43411353 -0.30144997  0.13767889 0.45356780  0.28771687  0.57468549
##            X6         X7
## Y   0.4815584  0.4341135
## X1 -0.0177282 -0.3014500
## X2  0.3311822  0.1376789
## X3  0.2933516  0.4535678
## X4  0.1891846  0.2877169
## X5  0.4012922  0.5746855
## X6  1.0000000  0.3186671
## X7  0.3186671  1.0000000

For a cutoff of 0.5, X5 and X7 positively correlated

Produce and submit VIF values of all predictors for a complete first-order model (all variables are included). What is your conclusion on possible multicollinearity?

model_2 <- lm(Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + DAY, data = data)
library(car)

## Warning: package 'car' was built under R version 4.2.3

## Loading required package: carData

## Warning: package 'carData' was built under R version 4.2.1

vif(model_2)

##         GVIF Df GVIF^(1/(2*Df))
## X1  1.511473  1        1.229420
## X2  1.458887  1        1.207844
## X3  2.619486  1        1.618483
## X4  1.507899  1        1.227965
## X5  2.953366  1        1.718536
## X6  1.645622  1        1.282818
## X7  2.043674  1        1.429571
## DAY 4.336939  5        1.158026

Variance inflation factor (VIF) levels greater than 10 typically indicate multicollinearity among predictors. However, all VIF values for the entire first-order model in this case are lower than 10, suggesting no appreciable multicollinearity among the predictors. Therefore, it can be concluded that the entire first-order model’s predictor factors do not significantly correlate with one another.

Taking log on both sides

ln(y) = ln(θ0) + θ1 ln(x) + θ2 ln(t^2) + ε

Now we can see that this equation is in the form of a linear regression model where the coefficients are the parameters we want to estimate, and ln(y), ln(x), and ln(t^2) are the predictor variables. We can use linear regression to estimate the coefficients.

Given the resulting solution to the transformed model, ˆy∗ = 3 + 2.5x∗ + 1.3t, we can see that the coefficients in the original equation are:

ˆθ0 = e^3 ˆθ1 = 2.5 ˆθ2 = 1.3

Consider Problem 6. Calculate ˆy∗ and ˆy when t = 8 and x = 12

Plug in the given values of the coefficients: ln(y) = 1.099 + 0.405 * ln(x) + 0.223 * t + ε

For t = 8 and x = 12: ln(y) = 1.099 + 0.405 * ln(12) + 0.223 * 8 + ε ln(y) = 1.099 + 0.405 * 2.4849 + 1.7844 + ε ln(y) = 3.3833 + ε

Taking the exponential of both sides: y = exp(3.3833 + ε) ˆy∗ = exp(3.3833) = 29.39 (rounded to two decimal places)

To calculate ˆy: y = θ0 * x^(θ1) * exp(θ2tln(e)) = θ0 * x^(θ1) * e^(θ2*t)

For t = 8 and x = 12, and using the estimated coefficients: ˆy = θ0 * 12^(θ1) * e^(θ28) ˆy = 29.39^(1/θ0) * 12^(0.405/θ1) * e^(0.2238/θ2)

Suppose you are investigating allegations of gender discrimination in the hiring process of a particular firm. The claim is that females are much less likely to be hired than males with the same background, experience, and other qualifications. To perform this study a logistic regression model was developed, and 28 former applicants were used to fit the model. The response variable, yi = 1 if applicant i is hired; yi = 0 if i is not hired. The following variables were used as predictors: EDUC: Years of higher education (4, 6, or 8) EXP: Years of experience GENDER = 1 if male applicant; 0 if female applicant Answer Questions 8–12 by using the data given in 6414-HW7-Discrim.csv.

Solve the logistic regression model, and submit the solution.

data8 <- read.csv("6414-HW7-Discrim.csv")
log_model <- glm(HIRE ~ ., data=data8, family=binomial)
summary(log_model)

## 
## Call:
## glm(formula = HIRE ~ ., family = binomial, data = data8)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4380  -0.4573  -0.1009   0.1294   2.1804  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -14.2483     6.0805  -2.343   0.0191 *
## EDUC          1.1549     0.6023   1.917   0.0552 .
## EXP           0.9098     0.4293   2.119   0.0341 *
## GENDER        5.6037     2.6028   2.153   0.0313 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 35.165  on 27  degrees of freedom
## Residual deviance: 14.735  on 24  degrees of freedom
## AIC: 22.735
## 
## Number of Fisher Scoring iterations: 7

Conduct a hypothesis test to see whether the logistic regression model is appropriate or not. State the null and alternative hypotheses, the test statistic, critical value, and your conclusion (i.e., give a statement in plain English interpreting this result).

1 - pchisq(log_model$null.deviance-log_model$deviance, 1)

## [1] 6.18563e-06

H0: β1 = β2 = β3 = 0 H1: β1!=0for atleast 1 value of i A very low p-value from the data analysis shows that the logistic regression model is suitable for the dataset. The test statistic in this study, the deviance, has a three-degree-of-freedom chi-squared distribution. It is possible to determine whether the regression model is suitable and at least one predictor is important in explaining the response variable by computing the critical value from this distribution.

What do you say about the claim that females are less likely to be hired? Why? Support your answer by referring to the relevant predictor and its p-value.

The odds ratio between males and females is estimated to be 5.6 to 1 by the beta coefficient for gender, which is determined to be 5.6037. This shows that men are more likely than women to be employed. The corresponding pvalue for this coefficient is 0.0313, indicating that at a significance threshold of 0.05, this factor is statistically significant.

Calculate an approximate 95% confidence interval for the true value of the “Experience” coefficient (β2).

confint(log_model, level = 0.95)

## Waiting for profiling to be done...

##                   2.5 %    97.5 %
## (Intercept) -31.3101488 -5.628821
## EDUC          0.2049156  2.744315
## EXP           0.3302805  2.136563
## GENDER        1.8033770 12.674158

Confidence Interval: [0.33, 2.14]

Using this model, what is the estimated probability that a female applicant with 6 years of higher education and 5 years of experience will be hired?

iscriminant_data <- data.frame(EDUC=6, EXP=5, GENDER=0)
predict(log_model, discriminant_data = discriminant_data, type="response")

##            1            2            3            4            5            6 
## 0.0040730658 0.0175489472 0.9768829419 0.7338510832 0.0001634423 0.0928208655 
##            7            8            9           10           11           12 
## 0.0992702542 0.0024990525 0.0016437556 0.9835245625 0.0992702542 0.3869926554 
##           13           14           15           16           17           18 
## 0.0004058834 0.2788778392 0.6281262147 0.0246124941 0.6443889709 0.3088615307 
##           19           20           21           22           23           24 
## 0.0369764902 0.0424842835 0.0061845776 0.1524780906 0.9941978529 0.1915301359 
##           25           26           27           28 
## 0.0163126738 0.3088615307 0.9937780455 0.9733825060

probability = 0.6