Project Proposal

Introduction

The business case for this project is to reduce the time, and therefore the cost, of resolving a problem. Selecting the best resolution from a list of FAQs can be time consuming, and trial and error can prolong downtime on critical systems. A recommender system that draws on the experience of similar users working on the same kinds of issues would help reduce time to resolution.

Objective

The objective of this project is to find a dataset of historical resolutions to problems, or answers to questions, and to rank those resolutions. Answers that resolve the issue accurately, or that match a high number of similar responses, are given higher weight in the ranking.

Methodology

The methodology is to review the dataset for similar resolutions, then rank and weight them to identify the best resolution for each frequently asked question.

Data Description

We begin our analysis with the “question answer” dataset on Kaggle. The dataset includes 169 files, and the question and answer files contain the following columns:

  • ArticleTitle - the name of the Wikipedia article from which the questions and answers originally came.
  • Question - the question.
  • Answer - the answer.
  • DifficultyFromQuestioner - the prescribed difficulty rating for the question.
  • DifficultyFromAnswerer - a difficulty rating assigned by the individual who evaluated and answered the question, which may differ from DifficultyFromQuestioner.
  • ArticleFile - the name of the file containing the relevant article.

Import Data

In the import data section we start by reading only the question and answer files. Each file covers a specific set of topics. Each file is read into a data frame, the first column name is corrected, and the three data frames are combined with rbind into a single data frame.

library(dplyr)  # provides glimpse(), used below

s08<- as.data.frame(read.csv('https://raw.githubusercontent.com/apag101/Data612/master/Projects/FinalProject/Data/S08_question_answer_pairs.txt', sep ='\t', stringsAsFactors = FALSE))
s09<- as.data.frame(read.csv('https://raw.githubusercontent.com/apag101/Data612/master/Projects/FinalProject/Data/S09_question_answer_pairs.txt', sep ='\t', stringsAsFactors = FALSE))
s10<- as.data.frame(read.csv('https://raw.githubusercontent.com/apag101/Data612/master/Projects/FinalProject/Data/S10_question_answer_pairs.txt', sep ='\t', stringsAsFactors = FALSE))

# Rename the first column to remove stray UTF characters from its name

names(s08)[1] <- "ArticleTitle"
names(s09)[1] <- "ArticleTitle"
names(s10)[1] <- "ArticleTitle"

# Combine the three data frames into one
qa<-rbind(s08,s09,s10)
glimpse(qa)
## Rows: 3,995
## Columns: 6
## $ ArticleTitle             <chr> "Abraham_Lincoln", "Abraham_Lincoln", "Abr...
## $ Question                 <chr> "Was Abraham Lincoln the sixteenth Preside...
## $ Answer                   <chr> "yes", "Yes.", "yes", "Yes.", "no", "No.",...
## $ DifficultyFromQuestioner <chr> "easy", "easy", "easy", "easy", "easy", "e...
## $ DifficultyFromAnswerer   <chr> "easy", "easy", "medium", "easy", "medium"...
## $ ArticleFile              <chr> "S08_set3_a4", "S08_set3_a4", "S08_set3_a4...

Review/Transform Data

In the review and transform section we tabulate the two difficulty columns (shown below) and notice some NULL values. We set the NULLs to NA, then select the ArticleTitle, Question, Answer and ArticleFile columns and keep complete cases only. We remove punctuation, convert the Answer column to lower case and add a Count column that counts unique rows. The result reduces the row count from 3995 to 2793 and increases the column count from 4 to 5.
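
The cleaning steps described above can be sketched in base R as follows. This is a minimal illustration on a made-up data frame, not the code used to produce the output below: qa_demo, its values and its file names are assumptions, and punctuation is stripped from the Answer column only, which is one reading of the description.

```r
# Toy stand-in for the combined qa data frame (values are illustrative only).
qa_demo <- data.frame(
  ArticleTitle = c("Duck", "Duck", "Duck", "Drum"),
  Question     = c("Do all ducks quack?", "Do all ducks quack?",
                   "Are adult ducks fast fliers?", "NULL"),
  Answer       = c("No.", "no", "NULL", "Yes."),
  ArticleFile  = c("demo_a1", "demo_a1", "demo_a1", "demo_a2"),
  stringsAsFactors = FALSE
)

# 1. Treat the literal string "NULL" as a missing value.
qa_demo[qa_demo == "NULL"] <- NA

# 2. Keep the four columns of interest and drop rows with any NA.
qa_clean <- qa_demo[, c("ArticleTitle", "Question", "Answer", "ArticleFile")]
qa_clean <- qa_clean[complete.cases(qa_clean), ]

# 3. Normalise answers so that "No." and "no" compare equal.
qa_clean$Answer <- tolower(gsub("[[:punct:]]", "", qa_clean$Answer))

# 4. Collapse identical rows and record how many times each one occurred.
qa_clean$Count <- 1L
qa_clean <- aggregate(Count ~ ArticleTitle + Question + Answer + ArticleFile,
                      data = qa_clean, FUN = sum)

print(qa_clean)
```

In this toy data the two "Do all ducks quack?" rows collapse into a single row with a Count of 2, which is the role the Count column plays in the search results later in the report.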

## 
##                      easy         hard       medium         NULL S08_set3_a10 
##            1         1035          972         1034          952            1
## 
##              easy     hard   medium     NULL too easy too hard 
##        8     1344      733     1222      570       14      104
## [1] 3995    4
## [1] 3425    4
## [1] 2794    5
## [1] 2793    5

Word Term Frequency Checks

In this section we use all of the text files to display word and term frequencies. These files are the raw data from which the question and answer files were created. The top words overall are city, language and world, as the frequency and word cloud diagrams confirm.

This second review separates the question and answer files (_pairs.txt) from the raw data files that contain the full text of the articles. The plots show little overlap between the top words in the question and answer text and those in the full text. The best approach is therefore to match search terms against the question and answer files, then link the results to the full text for further information. This approach relies on the accuracy of the question and answer files.
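
A word frequency check of this kind can be sketched in base R as shown below. The helper top_words, its small stop-word list and the two sample texts are hypothetical and only illustrate the comparison; the plots in this report were produced from the full set of files.

```r
# Hypothetical helper: return the n most frequent words in a character vector.
top_words <- function(text, n = 3, stop_words = c("the", "is", "a", "of")) {
  # Lower-case the text and split on anything that is not a letter.
  words <- unlist(strsplit(tolower(text), "[^a-z]+"))
  # Drop empty tokens and stop words.
  words <- words[nzchar(words) & !(words %in% stop_words)]
  freq <- sort(table(words), decreasing = TRUE)
  head(freq, n)
}

# Illustrative stand-ins for the raw article text and the question text.
raw_text <- c("The city is a world city", "Language of the city")
qa_text  <- c("Is the city large?", "Is the language old?")

print(top_words(raw_text))
print(top_words(qa_text))
```

Comparing the two tables side by side is a quick way to see how much the top words of the two sources overlap.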

Search Query Result

This first approach attempts to use a sparse matrix, built from the results of the bind_tf_idf function, to measure the similarity of each question to each document number. The code displays a 5x5 slice of the question-by-document matrix.

## 5 x 5 sparse Matrix of class "dgCMatrix"
##                                                                    830     2554
## According to Reader's Digest, is Finland best for living?     7.697575 .       
## Approximately how many species of Testudines are alive today? .        7.697575
## Are a wolf's teeth its main weapons?                          .        .       
## Are adult ducks fast fliers?                                  .        .       
## Are all dialects of Korean similar to each other?             .        .       
##                                                                   1054      698
## According to Reader's Digest, is Finland best for living?     .        .       
## Approximately how many species of Testudines are alive today? .        .       
## Are a wolf's teeth its main weapons?                          7.697575 .       
## Are adult ducks fast fliers?                                  .        7.697575
## Are all dialects of Korean similar to each other?             .        .       
##                                                                   1477
## According to Reader's Digest, is Finland best for living?     .       
## Approximately how many species of Testudines are alive today? .       
## Are a wolf's teeth its main weapons?                          .       
## Are adult ducks fast fliers?                                  .       
## Are all dialects of Korean similar to each other?             7.697575
##                           Do all ducks quack? 
##                                             1 
## Is the drum a member of the percussion group? 
##                                             0
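
To see how a matrix like the one above can end up with the same weight in every non-zero cell, it may help to compute the weighting by hand. The sketch below is a base R illustration, not the code used in this project. It assumes each whole question is treated as a single term that occurs once in its own document, and it uses tf = n / total terms in the document and idf = ln(number of documents / documents containing the term). The toy questions and the names counts, n_docs and weights are illustrative.

```r
questions <- c("Do all ducks quack?",
               "Are adult ducks fast fliers?",
               "Is the drum a member of the percussion group?")

# One row per (document, term) pair; here each whole question is one term.
counts <- data.frame(doc  = as.character(seq_along(questions)),
                     term = questions,
                     n    = 1,
                     stringsAsFactors = FALSE)

n_docs <- length(unique(counts$doc))
# Total number of terms in each document.
doc_total <- tapply(counts$n, counts$doc, sum)
# Number of distinct documents in which each term appears.
docs_with_term <- tapply(counts$doc, counts$term,
                         function(d) length(unique(d)))

counts$tf     <- as.numeric(counts$n / doc_total[counts$doc])
counts$idf    <- as.numeric(log(n_docs / docs_with_term[counts$term]))
counts$tf_idf <- counts$tf * counts$idf

# Question-by-document matrix: one non-zero weight per row, all equal to log(3).
weights <- xtabs(tf_idf ~ term + doc, data = counts)
print(round(weights, 4))
```

Under these assumptions every question receives the same weight, ln(N) for N documents, in exactly one column. A matrix of that shape says which document a question belongs to but carries no information about how similar two different questions are, which may be one reason this approach was not pursued.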

This second and final approach uses a QrySearch function, adapted for our dataset from the code linked in the appendix. It applies the VectorSource and VCorpus functions to the QA raw data, then builds a term-document matrix from the queryTerm and the QA VCorpus documents. It scores the similarity between the queryTerm and the raw data and returns the Question, Answer, similarity Score, Count of similar answers and the ArticleFile name of the answer.
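
The general idea of scoring a query against each question can be sketched in base R as below. This is not the QrySearch function itself: rank_questions and tokenize are hypothetical names, the sketch uses raw word counts with cosine similarity, and it strips punctuation, so its scores will differ from the QrySearch output shown later.

```r
# Split text into lower-case word tokens, dropping punctuation.
tokenize <- function(x) {
  words <- unlist(strsplit(tolower(x), "[^a-z0-9]+"))
  words[nzchar(words)]
}

# Hypothetical stand-in for a query search: rank questions by cosine
# similarity between word-count vectors of the query and each question.
rank_questions <- function(query, questions) {
  doc_tokens <- lapply(questions, tokenize)
  vocab <- sort(unique(c(unlist(doc_tokens), tokenize(query))))
  # Count how often each vocabulary word occurs in a token vector.
  to_vector <- function(tokens) {
    as.numeric(table(factor(tokens, levels = vocab)))
  }
  q <- to_vector(tokenize(query))
  score <- vapply(doc_tokens, function(tokens) {
    d <- to_vector(tokens)
    denom <- sqrt(sum(d^2)) * sqrt(sum(q^2))
    if (denom == 0) 0 else sum(d * q) / denom
  }, numeric(1))
  out <- data.frame(Question = questions, Score = round(score, 4),
                    stringsAsFactors = FALSE)
  out[order(-out$Score), ]
}

questions <- c("Do all ducks quack?",
               "Are adult ducks fast fliers?",
               "Is the drum a member of the percussion group?")

print(rank_questions("Do all ducks quack?", questions))
print(rank_questions("ducks", questions))
```

In this sketch an exact match scores 1 and a single-word query scores lower, because the query vector covers only part of the matching question.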

We add a further function, getcontent, to display the first entry that contains text in the answer's article file.

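getcontent<-function(res = searchRes)
{
  # Build the article file name from the first search result, e.g. "S10_set6_a9.txt.clean"
  f<-paste(res$ArticleFile[1],'.txt.clean',sep="")
  # file.path() builds the path portably instead of hard-coding Windows backslashes
  d<-read.csv(file.path('.','Data','text_data',f), sep ='\t', stringsAsFactors = FALSE)
  # Return row 2, column 1: the first entry that contains article text
  d[2,1]
}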

The first execution searches for the full question string and the score is 1. The second execution searches for a single term and the score is 0.5888. This makes sense, as the full string is an exact match. The last output is the text of the article file for this search.

QrySearch("Was he regarded as a mostly reclusive artist?")
## [1] "Used 4.1549 seconds"
##                                        Question
## 1 Was he regarded as a mostly reclusive artist?
##                                              Answer Score Count ArticleFile
## 1 yes he was regarded as a mostly reclusive artist      1     1 S10_set6_a9
QrySearch("artist?")
## [1] "Used 3.9404 seconds"
##                                        Question
## 1 Was he regarded as a mostly reclusive artist?
##                                              Answer  Score Count ArticleFile
## 1 yes he was regarded as a mostly reclusive artist  0.5888     1 S10_set6_a9
getcontent()
## [1] "Paul Jackson Pollock (January 28, 1912 – August 11, 1956) was an influential American painter and a major figure in the abstract expressionist movement. During his lifetime, Pollock enjoyed considerable fame and notoriety. He was regarded as a mostly reclusive artist. He had a volatile personality and struggled with alcoholism all of his life. In 1945, he married the artist Lee Krasner, who became an important influence on his career and on his legacy. He died at the age of 44 in an alcohol-related, single-car crash. In December 1956, he was given a memorial retrospective exhibition at the Museum of Modern Art (MoMA) in New York City, and a larger more comprehensive exhibition there in 1967. More recently, in 1998 and 1999, his work was honored with large-scale retrospective exhibitions at MoMA and at The Tate in London. In 2000, Pollock was the subject of an Academy Award–winning film directed by and starring Ed Harris. "

This second set of executions first searches for a single term, which scores 0.6994. The second search uses the full question text without the question mark and scores 0.4584. Interestingly, this is lower than the single-term search, which suggests that the trailing question mark is treated as part of the last term ("pneumonia?"). The last search is the full string with the question mark and scores 1. The final output is the text of the article file for this search.

QrySearch("pneumonia?")
## [1] "Used 3.7779 seconds"
##                           Question Answer  Score Count ArticleFile
## 1 Did his mother die of pneumonia?     no 0.6994     2 S08_set3_a4
QrySearch("Did his mother die of pneumonia")
## [1] "Used 3.9115 seconds"
##                           Question Answer  Score Count ArticleFile
## 1 Did his mother die of pneumonia?     no 0.4584     2 S08_set3_a4
QrySearch("Did his mother die of pneumonia?")
## [1] "Used 4.2822 seconds"
##                           Question Answer Score Count ArticleFile
## 1 Did his mother die of pneumonia?     no     1     2 S08_set3_a4
getcontent()
## [1] "Lincoln closely supervised the victorious war effort, especially the selection of top generals, including Ulysses S. Grant. Historians have concluded that he handled the factions of the Republican Party well, bringing leaders of each faction into his cabinet and forcing them to cooperate. Lincoln successfully defused a war scare with the United Kingdom in 1861. Under his leadership, the Union took control of the border slave states at the start of the war. Additionally, he managed his own reelection in the 1864 presidential election."

Conclusion

We began with the objective of using a question and answer dataset to search for strings and to rank the answers so that we could choose the best ones. We used a dataset from Kaggle, and reviewed and transformed the data. We performed word and term frequency checks and found that the QA dataset had very little similarity with the raw dataset. We determined that we needed an approach that focuses the query on the questions and uses the raw data to give additional information. We attempted a similarity approach using sim2, but instead chose to modify an existing function that focuses on the question file. The result is a function that returns data based on the user's query, along with file output that gives additional information.