Data 607 Final Project

One of my goals in taking this degree program is to apply what I learn at work. Our group is in the midst of setting up a new change management organization. One of the greatest challenges we face involves measuring change management effectiveness. Toward that end I created a survey with 21 responses, and used a standardized sentiment value database with positive and negative values for 98,732 words to build a model that will rate the textual input in a numerical way. This will allow me to compare change management effectiveness across multiple distinct initiatives.

1. Loading the packages required

library(stringr)
library(RCurl)
library(knitr)
library(rmongodb)
library(jsonlite)
library(dplyr)
library(sqldf)
library(tidyr)
library(ggplot2)

2. Creating the value dictionary

I used SentiWordNet 3.0, a lexical resource for opinion mining. It can be found at sentiwordnet.isti.cnr.it. After loading the dataset I added column names, eliminated all words with both a positive and negative value of 0 and eliminated any duplicate words.

wordsense <- read.table('https://raw.githubusercontent.com/jeffnieman11/Data607_Final_Project/master/SentiWordNet_3.0.0_20130122.txt', sep="\t")

#C:/Users/jenieman/Documents/CUNY/Data 607/SentiWordnet/home/swn/www/admin/dump/SentiWordNet_3.0.0_20130122.txt

colnames(wordsense) <- c("Letter", "ID", "Positive_Value", "Negative_Value", "Word")
wordsense_use <- subset(wordsense, Positive_Value!=0 | Negative_Value!=0)

wordsense_use1 <- subset(wordsense_use, !duplicated(wordsense_use[,5]))
head(wordsense_use1)

##    Letter   ID Positive_Value Negative_Value       Word
## 1       a 1740          0.125           0.00       able
## 2       a 2098          0.000           0.75     unable
## 11      a 3700          0.250           0.00 dissilient
## 12      a 3829          0.250           0.00 parturient
## 21      a 5107          0.500           0.00      uncut
## 22      a 5205          0.500           0.00   absolute

3. Putting the data into Mongodb

To allow simple ways to check words in dictionary I created a mongodb. I originally planned on using this to compare with my surveys but later on decided to go with a SQL join shown below.

mongofinal<- mongo.create()
mongofinal

## [1] 0
## attr(,"mongo")
## <pointer: 0x000000000a11a328>
## attr(,"class")
## [1] "mongo"
## attr(,"host")
## [1] "127.0.0.1"
## attr(,"name")
## [1] ""
## attr(,"username")
## [1] ""
## attr(,"password")
## [1] ""
## attr(,"db")
## [1] "admin"
## attr(,"timeout")
## [1] 0

mongo.is.connected(mongofinal)

## [1] TRUE

db <- "group.wordsense"
mfinal <- "group.wordsense.word_values"

mongo.get.database.collections(mongofinal, db)

## character(0)

wordsense.1 <- lapply(split(wordsense_use, 1:nrow(wordsense_use1)), function(x) mongo.bson.from.JSON(toJSON(x)))

wordsense.1[1:3]

## $`1`
##  1 : 3    
##      Letter : 2   a
##      ID : 16      1740
##      Positive_Value : 1   0.125000
##      Negative_Value : 16      0
##      Word : 2     able
## 
##  2 : 3    
##      Letter : 2   r
##      ID : 16      90228
##      Positive_Value : 1   0.250000
##      Negative_Value : 1   0.375000
##      Word : 2     perilously
## 
## 
## $`2`
##  1 : 3    
##      Letter : 2   a
##      ID : 16      2098
##      Positive_Value : 16      0
##      Negative_Value : 1   0.750000
##      Word : 2     unable
## 
##  2 : 3    
##      Letter : 2   r
##      ID : 16      90424
##      Positive_Value : 1   0.500000
##      Negative_Value : 16      0
##      Word : 2     wearily
## 
## 
## $`3`
##  1 : 3    
##      Letter : 2   a
##      ID : 16      3700
##      Positive_Value : 1   0.250000
##      Negative_Value : 16      0
##      Word : 2     dissilient
## 
##  2 : 3    
##      Letter : 2   r
##      ID : 16      90551
##      Positive_Value : 1   0.125000
##      Negative_Value : 16      0
##      Word : 2     vitally

mongo.insert.batch(mongofinal, mfinal, wordsense.1)

## [1] TRUE

mongo.count(mongofinal, mfinal)

## [1] 317909

4. Creating the survey

I used an internal tool to create a survey. Here is a summary of the questions:

For which initiative have you participated in training?
Which training tools have you used?
Rank the following tools from most effective to least effective? (VOD, Web, Instructor, Virtual, OJT, QRG, Powerpoint, Other)
What do you like best about your top rated choice in question 3?
What are some factors that make an ideal training experience?
How would you describe your normal training experience?

5. Ranking the training methods

I loaded the raw data obtained from the survey into R. I then transformed the dataset and used that to rank the training methods we use from most effective to least with graphics displaying the answers.

survey.raw <- read.csv('https://raw.githubusercontent.com/jeffnieman11/Data607_Final_Project/master/survey_raw_text.csv', header = T)

#C:/Users/jenieman/Documents/CUNY/Data 607/survey_raw_text.csv
#rank most effective means of CM training
training_methods <- survey.raw[,c(1, 13:20)]
colnames(training_methods) <- c("ID", "VOD", "Web", "Instructor", "Virtual_Instructor", "OJT", "QRG", "PowerPoint", "Other")

training.1 <- summarise_each(training_methods, funs(mean))
training.2 <- training.1[,2:9]
training.2 <- gather(training.2, "Method", "Value")
training.2 <- arrange(training.2, Value)
training.2

##               Method    Value
## 1                OJT 3.095238
## 2              Other 3.619048
## 3         Instructor 3.666667
## 4                Web 3.809524
## 5 Virtual_Instructor 3.904762
## 6                QRG 4.095238
## 7         PowerPoint 4.142857
## 8                VOD 4.190476

ggplot(data=training.2, aes(x=reorder(Method, Value), y=Value)) + geom_bar(stat="identity", color="blue", fill="steelblue", width=0.5) + scale_y_continuous() + xlab("CM Training Methods from most to least popular") + ylab("Score (lower value means more popular)")

Conclusion: To no surprise the most preferred form of CM training is OJT (On the Job Training). Unfortunately a very common form of training , the VOD (Video on Demand) is the least preferred form of training. Our training lead and I will be presenting this analysis to our service manager.

6. Evaluating the effectiveness of initiatives

I subsetted the responses into those reflected into three categories of initiative: CSCC, SOT and other. I then pulled out the evaluation text from the last question in the survey and transformed the text into a data frame of words (eliminating capital letters and punctuation). I also created a test negative and a test positive text as a control.

training.evaluation <- subset(survey.raw[,c(1:4,23)])
colnames(training.evaluation) <- c("ID", "CSCC", "SOT", "Other", "Feedback")
#eliminate blank responses
training.evaluation.1 <- training.evaluation[c(1:8, 10:19),]

CSCC <- subset(training.evaluation.1, CSCC==1)
SOT <- subset(training.evaluation.1, SOT==1)
Other <- subset(training.evaluation.1, Other==1)
n.CSCC <- nrow(CSCC)
n.SOT <- nrow(SOT)
n.Other <- nrow(Other)

CSCC_text <- paste(unlist(CSCC$Feedback), collapse=" ")
CSCC_text1 <- gsub("[[:punct:]]","", CSCC_text)
CSCC_text2 <- tolower(CSCC_text1)
CSCC_text3 <- as.vector(strsplit(CSCC_text2, " "))
CSCC_text4 <- data.frame(unlist(CSCC_text3), ncol=1)
colnames(CSCC_text4) <- "words"
head(CSCC_text4)

##   words NA
## 1    it  1
## 2   was  1
## 3  good  1
## 4  very  1
## 5  good  1
## 6     i  1

SOT_text <- paste(unlist(SOT$Feedback), collapse=" ")
SOT_text1 <- gsub("[[:punct:]]","", SOT_text)
SOT_text2 <- tolower(SOT_text1)
SOT_text3 <- as.vector(strsplit(SOT_text2, " "))
SOT_text4 <- data.frame(unlist(SOT_text3), ncol=1)
colnames(SOT_text4) <- "words"
head(SOT_text4)

##       words NA
## 1         i  1
## 2      have  1
## 3     taken  1
## 4        so  1
## 5      many  1
## 6 trainings  1

Other_text <- paste(unlist(Other$Feedback), collapse=" ")
Other_text1 <- gsub("[[:punct:]]","", Other_text)
Other_text2 <- tolower(Other_text1)
Other_text3 <- as.vector(strsplit(Other_text2, " "))
Other_text4 <- data.frame(unlist(Other_text3), ncol=1)
colnames(Other_text4) <- "words"
head(Other_text4)

##        words NA
## 1          i  1
## 2     review  1
## 3   training  1
## 4 collateral  1
## 5          i  1
## 6       dont  1

#Control reviews
Neg_text <- "This is a terrible product.  The training was boring and long.  I fell asleep because of the poor presentation skills."
Neg_text1 <- gsub("[[:punct:]]","", Neg_text)
Neg_text2 <- tolower(Neg_text1)
Neg_text3 <- as.vector(strsplit(Neg_text2, " "))
Neg_text4 <- data.frame(unlist(Neg_text3), ncol=1)
colnames(Neg_text4) <- "words"
n.Neg<- 1
head(Neg_text4)

##      words NA
## 1     this  1
## 2       is  1
## 3        a  1
## 4 terrible  1
## 5  product  1
## 6           1

Pos_text <-  "I loved the presentation.  I was fully engaged and enjoyed the entire process.  The adoption went smoothly and the support has been wonderful.  Thanks!"
Pos_text1 <- gsub("[[:punct:]]","", Pos_text)
Pos_text2 <- tolower(Pos_text1)
Pos_text3 <- as.vector(strsplit(Pos_text2, " "))
Pos_text4 <- data.frame(unlist(Pos_text3), ncol=1)
colnames(Pos_text4) <- "words"
n.Pos<- 1
head(Pos_text4)

##          words NA
## 1            i  1
## 2        loved  1
## 3          the  1
## 4 presentation  1
## 5               1
## 6            i  1

7. SQL Joins and Sentiment Value

Using SQL I joined each word found in the value dictionary to the text for that initiative. I added positive values and subtracted negative values. I then normalized by dividing by the number of responses. This statistical analysis will allow me to compare my findings across initiatives.

CSCC.join<- sqldf("select wordsense_use1.Word, wordsense_use1.Positive_Value, wordsense_use1.Negative_Value from wordsense_use1 inner join CSCC_text4 on wordsense_use1.Word= CSCC_text4.words")
CSCC.review <- (sum(CSCC.join$Positive_Value) - sum(CSCC.join$Negative_Value))/n.CSCC
head(CSCC.join)

##   Word Positive_Value Negative_Value
## 1 able          0.125              0
## 2 good          0.625              0
## 3 good          0.625              0
## 4 good          0.625              0
## 5 good          0.625              0
## 6 well          0.625              0

CSCC.review

## [1] 0.4166667

SOT.join<- sqldf("select wordsense_use1.Word, wordsense_use1.Positive_Value, wordsense_use1.Negative_Value from wordsense_use1 inner join SOT_text4 on wordsense_use1.Word = SOT_text4.words")
SOT.review <- (sum(SOT.join$Positive_Value) - sum(SOT.join$Negative_Value))/n.SOT
head(SOT.join)

##   Word Positive_Value Negative_Value
## 1 able          0.125              0
## 2 live          0.500              0
## 3 good          0.625              0
## 4 good          0.625              0
## 5 good          0.625              0
## 6 well          0.625              0

SOT.review

## [1] 0.2954545

Other.join<- sqldf("select wordsense_use1.Word, wordsense_use1.Positive_Value, wordsense_use1.Negative_Value from wordsense_use1 inner join Other_text4 on wordsense_use1.Word = Other_text4.words")
Other.review <- (sum(Other.join$Positive_Value) - sum(Other.join$Negative_Value))/n.Other
head(Other.join)

##      Word Positive_Value Negative_Value
## 1    good          0.625           0.00
## 2    good          0.625           0.00
## 3    good          0.625           0.00
## 4    well          0.625           0.00
## 5   taken          0.125           0.50
## 6 forward          0.250           0.25

Other.review

## [1] 0.4791667

#Test two control reviews to see if it works
Neg.join<- sqldf("select wordsense_use1.Word, wordsense_use1.Positive_Value, wordsense_use1.Negative_Value from wordsense_use1 inner join Neg_text4 on Neg_text4.words=wordsense_use1.Word")
Neg.review <- (sum(Neg.join$Positive_Value) - sum(Neg.join$Negative_Value))/n.Neg
head(Neg.join)

##       Word Positive_Value Negative_Value
## 1     long          0.125          0.375
## 2 terrible          0.000          0.625
## 3     poor          0.000          0.875
## 4 training          0.125          0.000

Neg.review

## [1] -1.625

Pos.join<- sqldf("select wordsense_use1.Word, wordsense_use1.Positive_Value, wordsense_use1.Negative_Value from wordsense_use1 inner join Pos_text4 on Pos_text4.words=wordsense_use1.Word")
Pos.review <- (sum(Pos.join$Positive_Value) - sum(Pos.join$Negative_Value))/n.Pos
head(Pos.join)

##       Word Positive_Value Negative_Value
## 1    loved          0.500          0.000
## 2  engaged          0.000          0.125
## 3   thanks          0.000          0.125
## 4    fully          0.375          0.000
## 5 smoothly          0.125          0.000
## 6  support          0.375          0.000

Pos.review

## [1] 1.125

Review <- c(CSCC.review, SOT.review, Other.review, Neg.review, Pos.review)
Init <- c("CSCC", "SOT", "Other", "Neg_Control", "Pos_Control")
Review.findings <- data.frame(Init, Review)
Review.findings <- arrange(Review.findings, Review)
Review.findings

##          Init     Review
## 1 Neg_Control -1.6250000
## 2         SOT  0.2954545
## 3        CSCC  0.4166667
## 4       Other  0.4791667
## 5 Pos_Control  1.1250000

ggplot(data=Review.findings, aes(x=reorder(Init, Review), y=Review)) + geom_bar(stat="identity", color="blue", fill="red", width=0.5) + scale_y_continuous() + xlab("Initiatives ranked from lowest to highest")

8.Confirm statistical significance of results with hypothesis

The results seem to show a lower score for SOT in comparison with other initiatives. My null hypothesis is that there is no difference between SOT evaluations and non-SOT evaluations. To do this I created a loop to generate scores for each individual review.

#SOT
SOT.list <- vector("list", length = nrow(SOT))
for (i in 1:nrow(SOT)){
  SOT_text <- paste(unlist(SOT[i,5]), collapse=" ")
  SOT_text1 <- gsub("[[:punct:]]","", SOT_text)
  SOT_text2 <- tolower(SOT_text1)
  SOT_text3 <- as.vector(strsplit(SOT_text2, " "))
  SOT_text4 <- data.frame(unlist(SOT_text3), ncol=1)
  colnames(SOT_text4) <- "words"
  head(SOT_text4) 
  SOT.join<- sqldf("select wordsense_use1.Word, wordsense_use1.Positive_Value, wordsense_use1.Negative_Value from wordsense_use1 inner join SOT_text4 on wordsense_use1.Word = SOT_text4.words")
SOT.review <- (sum(SOT.join$Positive_Value) - sum(SOT.join$Negative_Value))
SOT.list[[i]] <- SOT.review
print(SOT.review)
}

## [1] -0.5
## [1] -0.125
## [1] 1.25
## [1] 1.5
## [1] 0.125
## [1] 0.625
## [1] -0.25
## [1] 0
## [1] 0.125
## [1] 0.375
## [1] 0.125

#nonSOT
CSCC.list <- vector("list", length = nrow(CSCC))
for (i in 1:nrow(CSCC)){
  CSCC_text <- paste(unlist(CSCC[i,5]), collapse=" ")
  CSCC_text1 <- gsub("[[:punct:]]","", CSCC_text)
  CSCC_text2 <- tolower(CSCC_text1)
  CSCC_text3 <- as.vector(strsplit(CSCC_text2, " "))
  CSCC_text4 <- data.frame(unlist(CSCC_text3), ncol=1)
  colnames(CSCC_text4) <- "words"
  CSCC.join<- sqldf("select wordsense_use1.Word, wordsense_use1.Positive_Value, wordsense_use1.Negative_Value from wordsense_use1 inner join CSCC_text4 on wordsense_use1.Word = CSCC_text4.words")
CSCC.review <- (sum(CSCC.join$Positive_Value) - sum(CSCC.join$Negative_Value))
CSCC.list[[i]] <- CSCC.review
print(CSCC.review)
}

## [1] 0.625
## [1] 1.125
## [1] -0.5
## [1] -0.625
## [1] 1.5
## [1] 0.375

Other.list <- vector("list", length = nrow(Other))
for (i in 1:nrow(Other)){
  Other_text <- paste(unlist(Other[i,5]), collapse=" ")
  Other_text1 <- gsub("[[:punct:]]","", Other_text)
  Other_text2 <- tolower(Other_text1)
  Other_text3 <- as.vector(strsplit(Other_text2, " "))
  Other_text4 <- data.frame(unlist(Other_text3), ncol=1)
  colnames(Other_text4) <- "words"
  Other.join<- sqldf("select wordsense_use1.Word, wordsense_use1.Positive_Value, wordsense_use1.Negative_Value from wordsense_use1 inner join Other_text4 on wordsense_use1.Word = Other_text4.words")
Other.review <- (sum(Other.join$Positive_Value) - sum(Other.join$Negative_Value))
Other.list[[i]] <- Other.review
print(Other.review)
}

## [1] 0.375
## [1] 1.125
## [1] -0.5
## [1] 0.5
## [1] -0.125
## [1] 1.5

review.list <- matrix(c(SOT.list,Other.list,CSCC.list))
colnames(review.list) <- "Value"
program <- c(rep("SOT", times =11), rep("nonSOT", times =12))
review.init <- data.frame(program, review.list)

nonSOT.value <- subset(review.init, program=="nonSOT")
nonSOT.value$Value <- as.numeric(nonSOT.value$Value)
mean(nonSOT.value$Value)

## [1] 0.4479167

SOT.value <- subset(review.init, program=="SOT")
SOT.value$Value <- as.numeric(SOT.value$Value)
m <- mean(SOT.value$Value)
sdev <- sd(SOT.value$Value)
SE <- sdev/sqrt(nrow(SOT.value))
CI <- 1.96*SE
m + CI

## [1] 0.6577545

m - CI

## [1] -0.06684544

Conclusion:

While the SOT program has a lower mean (0.295) than the nonSOT programs (0.448), the 95% confidence intervals for the data we pulled are between -0.067 and 0.658. Since the mean for the nonSOT program fits within this confidence interval we need to accept the null hypothesis. I cannot conclusively say that the SOT program has more negative reviews than the nonSOT programs.