Consider the following data, gathered from the bbk database from June 1st to 9th for the time periods 18-20 and 20-21. We import and display the data.
data <- read.csv(file='bbk_test.csv')
head(data)
##           X teachers num_of_requesting success_rate num_of_waiting waiting_ave
## 1 Mon 18-20        7          63.71429    0.2668161       15.71429    563.4286
## 2 Tue 18-20       11          38.70000    0.4082687       15.30000    258.7000
## 3 Wed 18-20       14          49.92857    0.3204578       15.85714    291.6429
## 4 Thu 18-20        8          66.37500    0.2636535       16.12500    460.3750
## 5 Fri 18-20       12          22.83333    0.5364963       11.25000    106.7500
## 6 Mon 20-21        4          62.00000    0.1370968        9.75000   1197.2500
##   num_of_cancel cancel_waiting answering_percentage
## 1      41.57143      319.85714            0.9146429
## 2      17.70000      169.90000            0.7707639
## 3      29.28571      251.78571            0.7970437
## 4      43.25000      251.87500            0.8787326
## 5       6.50000       64.41667            0.6849884
## 6      47.75000      578.50000            1.1323611
There are 10 statistical units, one for each weekday and time-period combination (the X column above). There are eight variables describing these units:
teachers: the number of teachers assigned to the time period.
num_of_requesting: the number of students asking questions.
success_rate: the number of questions requested by students / the number of questions solved.
num_of_waiting: the number of students waiting in line.
waiting_ave: the average wait time before a question is answered.
num_of_cancel: the number of students cancelling their request.
cancel_waiting: the average time that students wait before deciding to cancel the request.
answering_percentage: the fraction of lecture time spent answering questions (time spent answering questions / lecture time); each lecture is 2 hours = 7200 s.
Let us first briefly understand the structure of a lecture. 1) BBK assigns some number of teachers to each time slot. 2) During the lecture, students can request '1 on 1' tutoring; each student brings their own question that needs help. 3) The teachers answer the students one by one; if no student is asking a question, they present the material they have prepared (this is where answering_percentage comes from). 4) A student can cancel the request if they no longer need help or have simply been waiting too long (this is where the cancel_waiting variable comes from).
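To make the answering_percentage definition concrete, here is a minimal sketch assuming a hypothetical vector of per-question answering times (in seconds) for one 2-hour lecture; these raw logs are illustrative only and are not part of bbk_test.csv.
# Hypothetical per-question answering times (seconds) during one 2-hour lecture.
# Illustrative only: bbk_test.csv already contains the aggregated answering_percentage.
answer_times <- c(420, 610, 355, 780, 290, 505)
lecture_time <- 2 * 60 * 60                # 7200 seconds
sum(answer_times) / lecture_time           # fraction of lecture time spent answering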
with(data,plot(teachers,answering_percentage,xlab="Number of Teachers",ylab="Answering_percentage"))
mod<- lm(answering_percentage~teachers,data=data)
abline(mod)
summary(mod)
##
## Call:
## lm(formula = answering_percentage ~ teachers, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.07793 -0.06087 -0.04420 0.03707 0.17580
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.27186 0.08862 14.351 5.43e-07 ***
## teachers -0.04241 0.00888 -4.776 0.0014 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09415 on 8 degrees of freedom
## Multiple R-squared: 0.7403, Adjusted R-squared: 0.7079
## F-statistic: 22.81 on 1 and 8 DF, p-value: 0.001398
MSE <- sum(mod$residuals^2)/(10-2)  # RSS / (n - 2); with n = 10 the residual df is 8
round(MSE, 6)
## [1] 0.008864
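As a quick consistency check, the square root of this MSE should equal the residual standard error reported by summary(mod) above (0.09415):
# sqrt(MSE) is the estimated error standard deviation, i.e. the residual standard error
c(sqrt_MSE = sqrt(MSE), residual_se = summary(mod)$sigma)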
Now we would like to study the association between the number of teachers and the answering percentage. The association between the number of teachers and the answering percentage is significant (F* = 22.81; p-value = 0.001398). The association is negative and approximately linear, with a correlation of r = -sqrt(R^2) = -sqrt(0.7403) ≈ -0.86.
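The same value can be obtained directly as the sample correlation between the two variables (a quick check):
# The sample correlation; its square equals the R-squared of the simple regression
with(data, cor(teachers, answering_percentage))   # approximately -0.86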
There is a strong negative association between the number of teachers and the answering percentage.
The estimated regression model is: Est. answering percentage = 1.27186 - 0.04241 (number of teachers), with sqrt(MSE) = 0.0941.
However, should we increase the number of teachers in order to decrease the answering percentage?
Let us try to find another variable that might be related to both of these variables (a confounding variable) and that could explain the observed association.
A possible confounding variable: suppose the average wait time is related to both the number of teachers and the answering percentage. We now stratify the units with a dummy variable: the median of the average wait time is 511.9, and we use this level to define a dummy variable:
DummyHigh = 1 if the average wait time >= 511.9; DummyHigh = 0 if the average wait time < 511.9. We have thus divided the units into two groups: high average wait time and low average wait time.
summary(data$waiting_ave)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 93.33 266.94 511.90 491.14 628.71 1197.25
## define the dummy variable
data$DummyHigh <- as.numeric(data$waiting_ave>=511.9)
with(data,plot(teachers,answering_percentage,xlab="Number of Teachers",ylab="Answering_percentage",col=factor(DummyHigh), cex=1.25))
with(data,text(teachers,answering_percentage,DummyHigh,cex=0.75,pos=4))
Now we have a linear model with 3 predictors to estimate both lines: x1 = DummyHigh; x2 = Number of Teachers; x3 = DummyHigh * Number of Teachers.
E{Y} = β0 + β1x1 + β2x2 + β3x3
     = (β0 + β1) + (β2 + β3)(Number of Teachers), when DummyHigh = 1
     = β0 + β2(Number of Teachers), when DummyHigh = 0
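As a side note, the explicit product term used below and R's interaction shorthand DummyHigh * teachers specify exactly the same model (a minimal equivalence sketch):
# DummyHigh * teachers expands to DummyHigh + teachers + DummyHigh:teachers,
# so both formulas fit the same three-predictor model.
m1 <- lm(answering_percentage ~ DummyHigh + teachers + I(DummyHigh * teachers), data = data)
m2 <- lm(answering_percentage ~ DummyHigh * teachers, data = data)
all.equal(fitted(m1), fitted(m2))                 # should be TRUE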
From the plot, we can see a pattern: time slots with a small number of teachers and an average wait time above 511.9 tend to spend a larger fraction of time answering questions, while slots with more teachers and an average wait time below 511.9 tend to spend a smaller fraction.
mod_ave_wait <- lm(answering_percentage ~ DummyHigh + teachers + I(DummyHigh*teachers),data=data)
mod_ave_wait$coefficients
## (Intercept) DummyHigh teachers
## 1.14381895 -0.02469339 -0.03380728
## I(DummyHigh * teachers)
## 0.01750198
InterceptHigh=mod_ave_wait$coefficients[1]+mod_ave_wait$coefficients[2]
SlopeHigh=mod_ave_wait$coefficients[3]+mod_ave_wait$coefficients[4]
InterceptLow=mod_ave_wait$coefficients[1]
SlopeLow=mod_ave_wait$coefficients[3]
c(InterceptHigh,SlopeHigh,InterceptLow,SlopeLow)
## (Intercept) teachers (Intercept) teachers
## 1.11912555 -0.01630529 1.14381895 -0.03380728
# Here we obtain the two regression lines:
# Stratum         Intercept           Slope
# DummyHigh = 1   b0 + b1 = 1.1191    b2 + b3 = -0.0163
# DummyHigh = 0   b0      = 1.1438    b2      = -0.0338
with(data,plot(teachers,answering_percentage,
xlab="Number of Teachers",ylab="Answering Percentage",
col=factor(DummyHigh), cex=1.25))
with(data,text(teachers,answering_percentage,DummyHigh,cex=0.75,pos=4))
abline(InterceptHigh,SlopeHigh)
abline(InterceptLow,SlopeLow,lty=2)
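As a check, fitting a separate simple regression within each stratum reproduces these intercepts and slopes, since the interaction model allows each group its own line:
# Per-stratum fits give the same intercepts and slopes as the interaction model
# (only the residual variance is estimated differently).
coef(lm(answering_percentage ~ teachers, data = subset(data, DummyHigh == 1)))
coef(lm(answering_percentage ~ teachers, data = subset(data, DummyHigh == 0)))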
summary(mod_ave_wait)
##
## Call:
## lm(formula = answering_percentage ~ DummyHigh + teachers + I(DummyHigh *
## teachers), data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.090346 -0.061999 -0.008957 0.060186 0.126527
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.14382 0.20908 5.471 0.00156 **
## DummyHigh -0.02469 0.26977 -0.092 0.93005
## teachers -0.03381 0.01707 -1.980 0.09497 .
## I(DummyHigh * teachers) 0.01750 0.02970 0.589 0.57717
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0935 on 6 degrees of freedom
## Multiple R-squared: 0.8079, Adjusted R-squared: 0.7119
## F-statistic: 8.412 on 3 and 6 DF, p-value: 0.01434
From the p-values we can see that, after taking the average wait time into account, there is no significant evidence that the number of teachers is associated with the answering percentage, even though there is a clear negative association between the two when the average wait time is not considered.
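One way to check this more formally is a partial F-test that drops both teacher-related terms at once (a sketch, not part of the original output):
# Does the number of teachers (main effect plus interaction) explain anything
# once the high/low wait-time dummy is in the model?
mod_dummy_only <- lm(answering_percentage ~ DummyHigh, data = data)
anova(mod_dummy_only, mod_ave_wait)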
Let us use another approach to adjust for the confounding variable: include it in the model as a numeric predictor, instead of converting the quantitative variable into a categorical one.
mod <- lm(answering_percentage ~ teachers + waiting_ave, data=data)
summary( mod)
##
## Call:
## lm(formula = answering_percentage ~ teachers + waiting_ave, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.100849 -0.010913 0.001982 0.005464 0.132574
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.7731169 0.2023287 3.821 0.00653 **
## teachers -0.0100014 0.0141171 -0.708 0.50156
## waiting_ave 0.0003952 0.0001512 2.614 0.03470 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07159 on 7 degrees of freedom
## Multiple R-squared: 0.8686, Adjusted R-squared: 0.8311
## F-statistic: 23.14 on 2 and 7 DF, p-value: 0.000822
Hence:
the partial association between the answering percentage and the number of teachers is negative and not significant;
the partial association between the answering percentage and the average wait time is positive and significant.
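Confidence intervals for the partial effects tell the same story (a quick sketch using confint()):
# 95% confidence intervals for the additive model: the interval for teachers
# contains 0, while the interval for waiting_ave does not.
confint(mod)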
Now let us add one more independent variable and see how the association between the answering percentage and the independent variables changes.