Multiple Regression:

2022-12-07

Abstract

I explore an IT support data set to determine whether four factors contribute to the positive or negative sentiment associated with IT support interactions: the age of a support ticket (i.e., the length of time it takes to resolve the ticket), the number of words exchanged between IT support staff and the customer (i.e., the length of the conversation measured in words), the number of unique update events (i.e., the number of times the ticket is updated by IT support staff or the user), and the number of reassignments (i.e., the number of IT support staff involved in resolving the user’s technical issue).

To perform this analysis, I am using a 2022 IT support tickets data set from a liberal arts college in NYC containing IT support requests from faculty, staff and students. This data set was sourced from my job (where I manage this group). For an IT support organization, the ability to detect positive and negative interactions between employees and users is crucial to operations; identifying factors that contribute to and increase positive sentiment is essential to customer satisfaction and employee retention efforts.

A multiple regression analysis was performed for these specific variables because ticket age, number of updates, communication length and number of reassignments are controllable elements of a support interaction.

If the variables contribute to the sentiment score prediction, it should then be possible to dial these elements up or down to improve support interactions (i.e., increase their sentiment scores).

Abstract Summary

As a manager of the IT support group for a liberal arts college supporting faculty, staff and students…

I want to be able to identify variables that affect positive or negative interactions between users and IT support staff…

so that I can affect those interactions in a positive direction.

I used a multiple regression to explore 4 predictor variables for a response variable.

Where I started…

The data is an extract from an IT support tickets database containing 6390 support tickets from Jan 1 2022 to Nov 4 2022, sourced from my job.

An IT support ticket is generated every time a user and IT support staff interact to resolve a technical issue.

A sentiment analysis using the SentimentR package was run against the dialogue text (the conversation that occurs between users and IT support staff).

A bit more about the data…

SentimentR split the dialogue into ~119,000 sentences
SentimentR derived the word count and sentiment score
For reasons of confidentiality and data sensitivity, the dialogue columns were removed from the data sets before publishing to GitHub

Data Snapshot

number	total_words	update_count	reassignment_count	age_at_resolution_days	total_sentiment
INC0113479	NA	4	0	11.85	4.442119
INC0113480	339	4	0	0.92	7.626745
INC0113481	775	6	1	4.18	5.159090
INC0113482	138	3	0	4.01	1.689466
INC0113483	330	5	1	15.01	3.315560
INC0113484	59	3	0	0.99	2.285912
INC0113485	253	3	0	0.06	1.513980
INC0113486	538	14	2	1.96	7.682345
INC0113487	1772	14	0	17.72	12.561692
INC0113488	198	5	0	14.86	1.148016
INC0113489	260	4	0	0.57	3.325893
INC0113490	288	8	0	0.11	4.528785
INC0113491	123	6	0	0.07	2.871882
INC0113492	207	5	0	14.01	2.186718
INC0113493	1579	21	4	2.05	15.082365

A few more summary stats…

number	total_words	update_count	reassignment_count	age_at_resolution_days	total_sentiment
Length:6390	Min. : 5.0	Min. : 0.000	Min. : 0.0000	Min. : -2.41	Min. :-18.240
Class :character	1st Qu.: 105.0	1st Qu.: 3.000	1st Qu.: 0.0000	1st Qu.: 0.06	1st Qu.: 1.027
Mode :character	Median : 197.0	Median : 5.000	Median : 0.0000	Median : 1.09	Median : 1.993
NA	Mean : 304.3	Mean : 6.708	Mean : 0.4502	Mean : 7.72	Mean : 2.919
NA	3rd Qu.: 363.0	3rd Qu.: 8.000	3rd Qu.: 0.0000	3rd Qu.: 8.01	3rd Qu.: 3.596
NA	Max. :4998.0	Max. :168.000	Max. :10.0000	Max. :280.01	Max. : 68.527
NA	NA’s :741	NA	NA	NA’s :212	NA

Framing the Context

This is what our data contains:

A support ticket (represents the customer and analyst interaction)
of a certain age (the time it takes to resolve the support issue)
at a certain level of engagement (represented by the total number of words in the dialogue)
with a certain number of update events (represented by the update count)
being addressed by a certain number of technicians throughout its lifespan (represented by the number of reassignments)
achieves a certain level of negative or positive sentiment (the sentiment score)

This is what we ask (The Research Question)

Do ticket age, number of words, frequency of updates, and number of reassignments predict the sentiment score for each incident?

Predictor Variables:

ticket age
word count
update count
reassignment count

Response Variable:

sentiment score

The Distribution of IT Support Sentiment

Multiple Regression - 5 Step Approach

Selected 4 predictor variables from the data set and 1 response variable
Checked the residual assumptions: homoscedasticity, normality, and the normal probability (Q-Q) plot
Checked linear relationships for each predictor variable against the response variable
Checked collinearity among the predictor variables
Checked variables for statistical significance: p < .05

The Residuals - Mixed Results

The homoscedasticity assumption does not appear to be met
The distribution of the residuals looks approximately normal
The normal probability (Q-Q) plot, however, indicates skewness in the residuals

Linear Relationship Check

We can consider all of these variables in the multiple regression

Collinearity Check: Variance Inflation Factor (VIF)

VIF calculation checks for collinearity among the predictor variables
As a common rule of thumb, VIF values greater than 5 indicate problematic collinearity among variables
In this case, none of the 4 predictor variables shows problematic collinearity
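
To make the VIF check concrete, here is a minimal Python sketch. It is not the code behind the result above (the output in this deck comes from R). It assumes the standard identity that each predictor's VIF equals the matching diagonal entry of the inverse of the predictors' correlation matrix. It uses only the 14 complete rows from the Data Snapshot, so its values will differ from those for the full data set. The helper names `pearson` and `invert` are illustrative.

```python
from math import sqrt

# The 14 complete rows from the Data Snapshot (INC0113479 is skipped: total_words is NA).
# Columns: total_words, update_count, reassignment_count, age_at_resolution_days
rows = [
    (339, 4, 0, 0.92), (775, 6, 1, 4.18), (138, 3, 0, 4.01), (330, 5, 1, 15.01),
    (59, 3, 0, 0.99), (253, 3, 0, 0.06), (538, 14, 2, 1.96), (1772, 14, 0, 17.72),
    (198, 5, 0, 14.86), (260, 4, 0, 0.57), (288, 8, 0, 0.11), (123, 6, 0, 0.07),
    (207, 5, 0, 14.01), (1579, 21, 4, 2.05),
]
names = ["total_words", "update_count", "reassignment_count", "age_at_resolution_days"]


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment the matrix with the identity: [A | I] becomes [I | A^-1].
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular: predictors are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


columns = list(zip(*rows))
corr = [[pearson(a, b) for b in columns] for a in columns]
inv = invert(corr)

# VIF for predictor j is the j-th diagonal entry of the inverse correlation matrix.
vif = {name: inv[i][i] for i, name in enumerate(names)}
for name, value in vif.items():
    flag = "above 5" if value > 5 else "at or below 5"
    print(f"{name}: VIF = {value:.2f} ({flag})")
```

A VIF of 1 means a predictor is uncorrelated with the others, and the value grows as the other predictors explain more of it.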

Check coefficients, t values and p values

## 
## Call:
## lm(formula = total_sentiment ~ total_words + update_count + reassignment_count + 
##     age_at_resolution_days, data = df2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -34.35  -0.74  -0.01   0.69  34.83 
## 
## Coefficients:
##                         Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)            0.4468407  0.0338804   13.19 <0.0000000000000002 ***
## total_words            0.0066172  0.0000718   92.20 <0.0000000000000002 ***
## update_count           0.0412394  0.0042201    9.77 <0.0000000000000002 ***
## reassignment_count     0.2779674  0.0275527   10.09 <0.0000000000000002 ***
## age_at_resolution_days 0.0041227  0.0013825    2.98              0.0029 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.75 on 5454 degrees of freedom
##   (931 observations deleted due to missingness)
## Multiple R-squared:  0.713,  Adjusted R-squared:  0.712 
## F-statistic: 3.38e+03 on 4 and 5454 DF,  p-value: <0.0000000000000002
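
The figures in this output can be cross-checked against the summary table with a little arithmetic. The Python sketch below uses only numbers copied from this deck. It assumes that total_words and age_at_resolution_days are the only columns with missing values, as the summary table indicates, and that each t value is the estimate divided by its standard error. The variable names are illustrative.

```python
# Figures copied from the summary table and the lm() output in this section.
total_tickets = 6390
deleted_for_missingness = 931
na_total_words = 741
na_age = 212
n_coefficients = 5  # intercept + 4 predictors

# Rows that actually entered the regression, and the resulting residual degrees of freedom.
rows_used = total_tickets - deleted_for_missingness
residual_df = rows_used - n_coefficients

# Inclusion-exclusion: tickets missing BOTH total_words and age
# (assumes no other column contributes missing values).
missing_both = na_total_words + na_age - deleted_for_missingness

# name: (estimate, standard error, t value as reported)
coefficients = {
    "(Intercept)": (0.4468407, 0.0338804, 13.19),
    "total_words": (0.0066172, 0.0000718, 92.20),
    "update_count": (0.0412394, 0.0042201, 9.77),
    "reassignment_count": (0.2779674, 0.0275527, 10.09),
    "age_at_resolution_days": (0.0041227, 0.0013825, 2.98),
}

# t value = estimate / standard error; small gaps come from rounding in the printed output.
t_values = {name: est / se for name, (est, se, _) in coefficients.items()}

print(f"rows used: {rows_used}, residual df: {residual_df}, missing both: {missing_both}")
for name, t in t_values.items():
    print(f"{name}: recomputed t = {t:.2f}, reported t = {coefficients[name][2]:.2f}")
```

The recomputed degrees of freedom match the 5454 reported above, and the arithmetic implies that 22 tickets are missing both the word count and the ticket age.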

The Profile for Positive IT Support Sentiment

increasing number of words (lengthier dialogue)
increasing number of updates (frequent dialogue)
increasing number of reassignments (more analysts)
increasing ticket age (longer time to resolution)

Equation for the sentiment score prediction

y = 0.4468407 + 0.0066172(words) + 0.0412394(updates) + 0.2779674(reassignments) + 0.0041227(age)
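
As a worked example of the equation, the Python sketch below applies the fitted coefficients to one ticket from the Data Snapshot (INC0113480) and to a hypothetical ticket that sits at the median of every predictor in the summary table. The function names `term_contributions` and `predict_sentiment` are illustrative, and the median ticket is a constructed example, not a real record.

```python
# Coefficients copied from the lm() output in this deck.
COEF = {
    "intercept": 0.4468407,
    "total_words": 0.0066172,
    "update_count": 0.0412394,
    "reassignment_count": 0.2779674,
    "age_at_resolution_days": 0.0041227,
}


def term_contributions(ticket):
    """Return how much each term adds to the predicted sentiment score."""
    parts = {"intercept": COEF["intercept"]}
    for name, value in ticket.items():
        parts[name] = COEF[name] * value
    return parts


def predict_sentiment(ticket):
    """Predicted sentiment score: the intercept plus each coefficient times its predictor."""
    return sum(term_contributions(ticket).values())


# INC0113480 from the Data Snapshot; its observed total_sentiment is 7.626745.
snapshot_ticket = {
    "total_words": 339,
    "update_count": 4,
    "reassignment_count": 0,
    "age_at_resolution_days": 0.92,
}
observed = 7.626745
predicted = predict_sentiment(snapshot_ticket)
residual = observed - predicted  # positive: the model under-predicts this ticket

# Hypothetical ticket at the median of every predictor (medians from the summary table).
median_ticket = {
    "total_words": 197,
    "update_count": 5,
    "reassignment_count": 0,
    "age_at_resolution_days": 1.09,
}
median_parts = term_contributions(median_ticket)

print(f"INC0113480: predicted {predicted:.3f}, observed {observed:.3f}, residual {residual:.3f}")
for name, part in median_parts.items():
    print(f"median ticket, {name}: {part:.4f}")
print(f"median ticket, predicted total: {sum(median_parts.values()):.3f}")
```

For the median ticket, the word count term is by far the largest contribution, while the ticket age term adds less than 0.01 to the score.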

Interpretation & Conclusion

With an adjusted R-squared of 0.712, the model accounts for roughly 71% of the variance in the sentiment score, which makes it a decent predictor of sentiment.
The positive effects of word count and update count make sense: typically, users and employees are more “positive” when they receive more information and more frequent updates about a technical issue, since this indicates active engagement to resolve it.
It is unclear why higher reassignment counts and older tickets contribute to a positive sentiment score. Typically, these variables have an inverse relationship with sentiment: the more technicians involved in handling an issue and the longer it takes to resolve, the more “negative” the experience tends to be.

Limitations

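The residual assumptions were not completely met: homoscedasticity does not appear to hold, and the Q-Q plot indicates skewness.
The coefficients for reassignment count and ticket age are positive and statistically significant, which contradicts real-world experience.
The predictor and response variables need additional cleanup: for example, the minimum ticket age is -2.41 days, and 931 of the 6390 tickets were dropped because of missing values.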