Consider the follwoing data gathering from bbk database from June 1st to 9th in specific time period 14-16 and 16-18. We import and display the data.

data <- read.csv(file='bbk_test.csv')
head(data)
##           X teachers num_of_requesting success_rate num_of_waiting waiting_ave
## 1 Mon 18-20        7          63.71429    0.2668161       15.71429    563.4286
## 2 Tue 18-20       11          38.70000    0.4082687       15.30000    258.7000
## 3 Wed 18-20       14          49.92857    0.3204578       15.85714    291.6429
## 4  Thu18-20        8          66.37500    0.2636535       16.12500    460.3750
## 5 Fri 18-20       12          22.83333    0.5364963       11.25000    106.7500
## 6 Mon 20-21        4          62.00000    0.1370968        9.75000   1197.2500
##   num_of_cancel cancel_waiting answering_percentage
## 1      41.57143      319.85714            0.9146429
## 2      17.70000      169.90000            0.7707639
## 3      29.28571      251.78571            0.7970437
## 4      43.25000      251.87500            0.8787326
## 5       6.50000       64.41667            0.6849884
## 6      47.75000      578.50000            1.1323611

There are 4 statistical units, each unit represent the different time periods from June 5th and 6th. There are 6 variables describing these units:

teachers: the number of teacher assigned to the time period. num_of_requesting: the number of students asking question. success_rate: the number of questions requested by students / the number of questions solved num_of_waiting: number of students waiting in line. waiting_ave: the average wait time of answering question. num_of_cancel: the number of students canceling the request. cancel_waiting: the average time that students wait till they decide to cancel the request. Answering_Percentage: The rate of time using on answering question (time of answering question/the lecture time). Each lecture is 2 hours = 7200s

Let us first understand the structure of the lecture breifly. 1) BBK will assign some number of teachers to each time spot. 2) During the lecture, there will be students who request a ‘1 on 1’ tutoring, each student will provide their own question that needs help. 3) The teachers will answer each student one by one. If there is no student asking question, then they will present the material they prepared (that is where the answering_percentage comes from). 4) A student can choose to cancel the request if they no longer needs help or they just waiting too long (that is where the cancel_waiting variable comes from)

with(data,plot(teachers,answering_percentage,xlab="Number of Teacher",ylab="Answering_percentage"))
mod<- lm(answering_percentage~teachers,data=data)
abline(mod)

summary(mod)
## 
## Call:
## lm(formula = answering_percentage ~ teachers, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.07793 -0.06087 -0.04420  0.03707  0.17580 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.27186    0.08862  14.351 5.43e-07 ***
## teachers    -0.04241    0.00888  -4.776   0.0014 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09415 on 8 degrees of freedom
## Multiple R-squared:  0.7403, Adjusted R-squared:  0.7079 
## F-statistic: 22.81 on 1 and 8 DF,  p-value: 0.001398
MSE <- sum(mod$residuals^2)/(8-2)
MSE
## [1] 0.01181807

Now we would like to study the association between the number of teachers and the answering percentage. The association between the number of teachers and anwering percentage is significant. (F*= 22.81; p-value= 0.001398). The association is negative and approximately linear with a correlation of r = -(R2)0.5 = -0.84

There is a strong association between the number of teachers and the anwwering percentage,

The regression model we have is Est.Answering percentage = 1.27186 + -0.04241(Number of teachers) with sqrt(MSE) = 0.01182

however, should we increase the number of teacher to decrease the percentage of answering?

Let us try to find another variable that might be related to both variables (confounding variable) and which explains the association that has been observed.

possible confouding variable: let’s try to think that the waiting average is related to the numbers and percentage. Now tratify the unit with a dummy variable:the median of the waiting average is 511.9 and we use this level to define a dummy variable:

DummyHigh = 1;waiting average > 511.9 0;waiting average < 511.9 we have divided the units into 2 groups: high waiting average and low waiting average.

summary(data$waiting_ave)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   93.33  266.94  511.90  491.14  628.71 1197.25
## define the dummy variable
data$DummyHigh <- as.numeric(data$waiting_ave>=511.9)
with(data,plot(teachers,answering_percentage,xlab="Number of Teacher",ylab="Answering_percentage",col=factor(DummyHigh), cex=1.25))
with(data,text(teachers,answering_percentage,DummyHigh,cex=0.75,pos=4))

Now we have a linear model with 3 predictors to estimate both lines: x1=DummyHigh; x2=Number of teachers ; x3=DummyHigh * Number of Teachers.

E{Y}= β0+β1x1+β2x2+β3x3 = β0+β1+(β2+β3)(Number of Teachers), DummyHigh =1 = β0+β2(Number of Teachers), DummyHigh =0

From the plot, we can see the pattern that, with a small number of teachers and average wait time higher than 511.9, the time used on answering questions is higher; with a bigger number of teachers and average wait time lower than 511.9, the time used on anwering questions is lower.

mod_ave_wait <- lm (answering_percentage ~ DummyHigh + teachers + I(DummyHigh*teachers),data=data)
mod_ave_wait$coefficient
##             (Intercept)               DummyHigh                teachers 
##              1.14381895             -0.02469339             -0.03380728 
## I(DummyHigh * teachers) 
##              0.01750198
InterceptHigh=mod_ave_wait$coefficients[1]+mod_ave_wait$coefficients[2]
SlopeHigh=mod_ave_wait$coefficients[3]+mod_ave_wait$coefficients[4]
InterceptLow=mod_ave_wait$coefficients[1] 
SlopeLow=mod_ave_wait$coefficients[3]
c(InterceptHigh,SlopeHigh,InterceptLow,SlopeLow)
## (Intercept)    teachers (Intercept)    teachers 
##  1.11912555 -0.01630529  1.14381895 -0.03380728
#here we obtain the two regressions: 
#Stratum             Intercept          Slope
#DummyHigh = 1    b0+b1=1.1191       b2+b3=-0.0163
#DummyHigh = 0      b0=1.1438            b2=-0.0338

with(data,plot(teachers,answering_percentage,
xlab="Number of Teachers",ylab="Answering Percentage",
col=factor(DummyHigh), cex=1.25))
with(data,text(teachers,answering_percentage,cex=0.75, pos=4))
abline(InterceptHigh,SlopeHigh)
abline(InterceptLow,SlopeLow,lty=2)

summary(mod_ave_wait)
## 
## Call:
## lm(formula = answering_percentage ~ DummyHigh + teachers + I(DummyHigh * 
##     teachers), data = data)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.090346 -0.061999 -0.008957  0.060186  0.126527 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)   
## (Intercept)              1.14382    0.20908   5.471  0.00156 **
## DummyHigh               -0.02469    0.26977  -0.092  0.93005   
## teachers                -0.03381    0.01707  -1.980  0.09497 . 
## I(DummyHigh * teachers)  0.01750    0.02970   0.589  0.57717   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0935 on 6 degrees of freedom
## Multiple R-squared:  0.8079, Adjusted R-squared:  0.7119 
## F-statistic: 8.412 on 3 and 6 DF,  p-value: 0.01434

We can realize from the p-value, after taking the average wait time into the account, there is no significant evidence to show that the number of teachers is associated to the answering percentage, even though there is a claer negative assocation between these two when the average wait time is not being considered.

Let us use another approach to measure to confouding variable. We include the confounding varialbe in the model as a numeric predicotr, instead of transfering the quantitative variable into a categorical variable.

mod <- lm(answering_percentage ~ teachers + waiting_ave, data=data)
summary(   mod)
## 
## Call:
## lm(formula = answering_percentage ~ teachers + waiting_ave, data = data)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.100849 -0.010913  0.001982  0.005464  0.132574 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  0.7731169  0.2023287   3.821  0.00653 **
## teachers    -0.0100014  0.0141171  -0.708  0.50156   
## waiting_ave  0.0003952  0.0001512   2.614  0.03470 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07159 on 7 degrees of freedom
## Multiple R-squared:  0.8686, Adjusted R-squared:  0.8311 
## F-statistic: 23.14 on 2 and 7 DF,  p-value: 0.000822

Hence

the partial association between the answering percentage and the number of teachers is negative and not significant.

the partial association between the answering percentage and waiting average is positive and significant.

Now let us add in one more independent variable to see how’s the association between the answering percentage and these two independents variable.