2022-12-07

Abstract

I explore an IT support data set to determine whether four characteristics of a support interaction contribute to the positive or negative sentiment associated with it: the age of the ticket (i.e., the length of time it takes to resolve it), the number of words exchanged between IT support staff and the customer (i.e., the length of the conversation), the number of update events (i.e., the number of times the ticket is updated by IT support staff or the user), and the number of reassignments (i.e., the number of IT support staff involved in resolving the user’s technical issue).

To perform this analysis, I am using a 2022 IT support tickets data set from a liberal arts college in NYC containing IT support requests from faculty, staff and students. This data set was sourced from my job (where I manage this group). For an IT support organization, the ability to detect positive and negative interactions between employees and users is crucial to the operations of the group; identifying factors that contribute to and increase positive sentiment is essential to customer satisfaction and employee retention efforts.

A multiple regression analysis was performed for these specific variables because ticket age, number of updates, communication length and number of reassignments are controllable elements of a support interaction.

If the variables contribute to the sentiment score prediction, it should then be possible to dial up or dial down these elements to improve support interactions (i.e., increase their sentiment scores).

Abstract Summary

  • As a manager of the IT support group for a liberal arts college supporting faculty, staff and students…


  • I want to be able to identify variables that affect positive or negative interactions between users and IT support staff…


  • so that I can affect those interactions in a positive direction.


I used a multiple regression to explore 4 predictor variables for a response variable.

Where I started…

  • The data is an extract from an IT support tickets database containing 6390 support tickets from Jan 1 2022 to Nov 4 2022, sourced from my job.


  • An IT support ticket is generated every time a user and IT support staff interact to resolve a technical issue.


  • A sentiment analysis using the SentimentR package was run against the dialogue text (the conversation that occurs between users & IT support staff).

A bit more about the data…

  • SentimentR split the dialogue into ~119,000 sentences
  • SentimentR derived the word count and sentiment score
  • For reasons of confidentiality and data sensitivity, the dialogue columns were removed from the data sets before publishing to GitHub
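The roll-up from sentence-level scores to one row per ticket can be sketched in base R. This is a minimal illustration, not the author's pipeline: the `sentences` data frame, its ticket numbers, and its values are made up, and it assumes (the source does not confirm this) that `total_words` and `total_sentiment` are sums over a ticket's sentences.

```r
# Hypothetical sentence-level scores: one row per sentence, tagged with its ticket.
# Ticket numbers and values are invented for illustration only.
sentences <- data.frame(
  number     = c("INC0000001", "INC0000001",
                 "INC0000002", "INC0000002", "INC0000002"),
  word_count = c(12, 8, 20, 5, 10),
  sentiment  = c(0.25, -0.10, 0.40, 0.00, 0.15)
)

# Assumption: ticket-level totals are sums across the ticket's sentences.
tickets <- aggregate(
  cbind(total_words = word_count, total_sentiment = sentiment) ~ number,
  data = sentences,
  FUN  = sum
)

print(tickets)
```

If the totals are indeed sums, longer conversations accumulate larger scores in either direction, which is worth keeping in mind when reading the word-count coefficient later on.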

Data Snapshot

number      total_words  update_count  reassignment_count  age_at_resolution_days  total_sentiment
INC0113479           NA             4                   0                   11.85         4.442119
INC0113480          339             4                   0                    0.92         7.626745
INC0113481          775             6                   1                    4.18         5.159090
INC0113482          138             3                   0                    4.01         1.689466
INC0113483          330             5                   1                   15.01         3.315560
INC0113484           59             3                   0                    0.99         2.285912
INC0113485          253             3                   0                    0.06         1.513980
INC0113486          538            14                   2                    1.96         7.682345
INC0113487         1772            14                   0                   17.72        12.561692
INC0113488          198             5                   0                   14.86         1.148016
INC0113489          260             4                   0                    0.57         3.325893
INC0113490          288             8                   0                    0.11         4.528785
INC0113491          123             6                   0                    0.07         2.871882
INC0113492          207             5                   0                   14.01         2.186718
INC0113493         1579            21                   4                    2.05        15.082365

A few more summary stats…

number: character column, length 6390

statistic  total_words  update_count  reassignment_count  age_at_resolution_days  total_sentiment
Min.               5.0         0.000              0.0000                   -2.41          -18.240
1st Qu.          105.0         3.000              0.0000                    0.06            1.027
Median           197.0         5.000              0.0000                    1.09            1.993
Mean             304.3         6.708              0.4502                    7.72            2.919
3rd Qu.          363.0         8.000              0.0000                    8.01            3.596
Max.            4998.0       168.000             10.0000                  280.01           68.527
NA's               741             0                   0                     212                0
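The summary shows missing values in two columns and a negative minimum ticket age, both of which affect how many rows a regression can use. The sketch below shows one way to count them in base R. The five-row `df` is a made-up stand-in that only borrows the published column names, so the counts it prints are not the real ones.

```r
# Hypothetical mini data set mirroring the published column names.
df <- data.frame(
  total_words            = c(339, NA, 138, 59, NA),
  update_count           = c(4, 4, 3, 3, 5),
  reassignment_count     = c(0, 0, 0, 0, 1),
  age_at_resolution_days = c(0.92, 11.85, NA, -2.41, NA),
  total_sentiment        = c(7.63, 4.44, 1.69, 2.29, 3.32)
)

# Missing values per column.
na_per_column <- colSums(is.na(df))

# Rows with any missing value are dropped when a model is fit with lm() defaults.
usable    <- complete.cases(df)
n_dropped <- sum(!usable)

# Negative ages are not missing, but they are suspicious and worth reviewing.
negative_age_rows <- which(df$age_at_resolution_days < 0)

print(na_per_column)
print(n_dropped)
print(negative_age_rows)
```

Rows missing both values are only dropped once, so the number of dropped rows can be smaller than the sum of the per-column NA counts. In the real data, 741 + 212 = 953 missing values against 931 dropped observations would be consistent with 22 tickets missing both word count and age.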

Framing the Context

This is what our data contains:

  • A support ticket (represents the customer and analyst interaction)

  • of a certain age (the time it takes to resolve the support issue)

  • at a certain level of engagement (represented by the total number of words in the dialogue)

  • with a certain number of update events (represented by the update count)

  • being addressed by a certain number of technicians throughout its lifespan (represented by the number of reassignments)

  • achieves a certain level of negative or positive sentiment (the sentiment score)

This is what we ask (The Research Question)

Does the ticket age, number of words, frequency of updates, and number of reassignments predict the sentiment score for each incident?


Predictor Variables:

  • ticket age
  • word count
  • update count
  • reassignment count


Response Variable:

  • sentiment score

The Distribution of IT Support Sentiment

Multiple Regression - 5 Step Approach

  1. Selected 4 predictor variables from the data set and 1 response variable
  2. Checked the residuals assumptions: homoscedasticity, normality, and the normal probability (Q-Q) plot
  3. Checked linear relationships for each predictor variable against the response variable
  4. Checked collinearity among the predictor variables
  5. Checked variables for statistical significance: p < .05

The Residuals - Mixed Results

  • The homoscedasticity assumption does not appear to be met
  • The residuals appear approximately normally distributed overall
  • However, the normal probability (Q-Q) plot indicates skewness in the residuals
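The source does not say how these assumptions were checked beyond the plots, so the following is only a sketch of one numeric complement to a visual check. It runs a Breusch-Pagan-style test (Koenker's studentized variant, computed by hand) on simulated data whose noise grows with the predictor. The variables `x` and `y` and all simulated values are assumptions for illustration.

```r
set.seed(1)

# Simulated data that is heteroscedastic by construction:
# the noise standard deviation is proportional to x.
n <- 500
x <- runif(n, 1, 10)
y <- 1 + 0.5 * x + rnorm(n, sd = 0.3 * x)

fit <- lm(y ~ x)
res <- residuals(fit)

# Studentized Breusch-Pagan statistic: regress the squared residuals on the
# predictors; n * R-squared is approximately chi-squared with k degrees of
# freedom (k = 1 predictor here) when the error variance is constant.
res_sq  <- res^2
aux     <- lm(res_sq ~ x)
lm_stat <- n * summary(aux)$r.squared
p_value <- pchisq(lm_stat, df = 1, lower.tail = FALSE)

cat("BP statistic:", lm_stat, " p-value:", p_value, "\n")

# Visual checks, drawn only in an interactive session.
if (interactive()) {
  plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
  qqnorm(res)
  qqline(res)
}
```

A small p-value argues against constant variance. With several thousand tickets, even mild departures will be flagged, so the plots remain the better guide to how serious the problem is.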

Linear Relationship Check

We can consider all of these variables in the multiple regression

Collinearity Check: Variance Inflation Factor (VIF)

  • VIF calculation checks for collinearity among the predictor variables
  • As a rule of thumb, VIF values greater than 5 indicate collinearity among variables
  • In this case, the VIF values do not indicate collinearity among the 4 predictor variables
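For readers unfamiliar with the calculation, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The sketch below computes it by hand in base R on simulated, independent predictors that only borrow the column names. The source does not say which function was used; `vif()` from the car package is one common choice.

```r
set.seed(42)

# Simulated predictors, drawn independently, so VIFs should sit close to 1.
n <- 200
predictors <- data.frame(
  total_words            = rpois(n, 300),
  update_count           = rpois(n, 6),
  reassignment_count     = rpois(n, 0.5),
  age_at_resolution_days = rexp(n, rate = 1 / 8)
)

# VIF for each predictor: regress it on the remaining predictors
# and convert the resulting R-squared to 1 / (1 - R-squared).
vif_manual <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  aux    <- lm(reformulate(others, response = v), data = predictors)
  1 / (1 - summary(aux)$r.squared)
})

print(round(vif_manual, 3))
```

A VIF of 1 means a predictor is uncorrelated with the others; values grow as a predictor becomes more predictable from the rest.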

Check coefficients, t values and p values

## 
## Call:
## lm(formula = total_sentiment ~ total_words + update_count + reassignment_count + 
##     age_at_resolution_days, data = df2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -34.35  -0.74  -0.01   0.69  34.83 
## 
## Coefficients:
##                         Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)            0.4468407  0.0338804   13.19 <0.0000000000000002 ***
## total_words            0.0066172  0.0000718   92.20 <0.0000000000000002 ***
## update_count           0.0412394  0.0042201    9.77 <0.0000000000000002 ***
## reassignment_count     0.2779674  0.0275527   10.09 <0.0000000000000002 ***
## age_at_resolution_days 0.0041227  0.0013825    2.98              0.0029 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.75 on 5454 degrees of freedom
##   (931 observations deleted due to missingness)
## Multiple R-squared:  0.713,  Adjusted R-squared:  0.712 
## F-statistic: 3.38e+03 on 4 and 5454 DF,  p-value: <0.0000000000000002

The Profile for Positive IT Support Sentiment

  • increasing number of words (lengthier dialogue)
  • increasing number of updates (frequent dialogue)
  • increasing number of reassignments (more analysts)
  • increasing ticket age (longer time to resolution)


Equation for the sentiment score prediction

y = 0.4468407 + 0.0066172(words) + 0.0412394(updates) + 0.2779674(reassignments) + 0.0041227(age)
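As a worked example, the equation can be applied to one ticket from the data snapshot, INC0113480 (339 words, 4 updates, 0 reassignments, 0.92 days old). The coefficients are copied from the regression output; the function name `predict_sentiment` is a hypothetical helper, not part of the original analysis.

```r
# Coefficients copied from the regression output.
coefs <- c(
  intercept     = 0.4468407,
  words         = 0.0066172,
  updates       = 0.0412394,
  reassignments = 0.2779674,
  age           = 0.0041227
)

# Hypothetical helper that evaluates the fitted equation for one ticket.
predict_sentiment <- function(words, updates, reassignments, age_days) {
  unname(
    coefs["intercept"] +
      coefs["words"]         * words +
      coefs["updates"]       * updates +
      coefs["reassignments"] * reassignments +
      coefs["age"]           * age_days
  )
}

# Ticket INC0113480 from the data snapshot.
predicted <- predict_sentiment(words = 339, updates = 4,
                               reassignments = 0, age_days = 0.92)
observed  <- 7.626745

cat("predicted:", round(predicted, 3), " observed:", observed, "\n")
```

For this ticket the prediction is about 2.86 against an observed score of 7.63, a reminder that individual tickets can sit far from the fitted line even when the overall fit is good.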

Interpretation & Conclusion

  • With an adjusted R-squared of 0.712, the model explains about 71% of the variance in the sentiment score, which makes it a decent predictor of sentiment.

  • The positive effects of the number of words and the number of updates make sense: typically, users & employees are more “positive” when they receive more information and more frequent updates about technical issues, as this indicates active engagement to resolve the issue.

  • It is unclear why higher reassignment counts and aging tickets contribute to a positive sentiment score. Typically, there is an inverse relationship for these variables: the more technicians involved in handling an issue and the longer it takes to resolve, the more “negative” the experience tends to be.

Limitations

  • The residuals assumptions were not completely met.
  • The model found statistically significant positive coefficients for reassignment count and ticket age, which contradicts real-world experience.
  • The predictor and response variables need additional cleanup.