The purpose of this project is to classify documents in the Caribbean Community Climate Change Center's Regional Clearing House.
The official taxonomy contains 102 terms.
To reduce storage and computational complexity, the analysis focuses on the following topics:
- 1100 Impacts of Climate Change
- 1200 Data and Observations
- 1441 Policy Making
- 1400 Adaptation
- 1322 Renewable Energy
- 1010 General on Climate Change
- 1445 Water Resource Management
Random forests were used to classify the data.
The data were manually sourced and compiled from the Clearing House.
An initial sample of 2,650 documents was taken.
The corresponding metadata was sourced separately, due to software licensing constraints.
Textual data is held in the Description field of the metadata file and in the PDF documents themselves.
These two sources are modeled separately.
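The code in this write-up assumes the following packages are attached. The list below is inferred from the functions called throughout the report (the original setup chunk is not shown), so it may differ slightly from the actual environment.
#packages inferred from the function calls used in this report (assumption)
library(dplyr)           # select, filter, mutate, distinct
library(splitstackshape) # cSplit
library(stringr)         # str_detect, str_replace_all
library(tidytext)        # unnest_tokens, cast_dtm, bind_tf_idf, stop_words
library(SnowballC)       # wordStem
library(tm)              # weightTfIdf, removeSparseTerms
library(caret)           # downSample, train, trainControl
library(ggplot2)         # plots
library(DT)              # datatable
library(pdftools)        # pdf_text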
#load consolidated data
descriptionData <- read.csv2('data/metadata.csv', sep=",", header = TRUE, stringsAsFactors = FALSE)
print(paste0("Number of variables: ",ncol(descriptionData)))## [1] "Number of variables: 428"
print(paste0("Number of observations: ",nrow(descriptionData)))## [1] "Number of observations: 2650"
descriptionData <- select(descriptionData, ID, Description, TopicID, Topic) %>%
filter(!is.na(Description), Description != '',Description != '\n',!is.na(TopicID), TopicID != '') %>%
mutate(Description = as.character(Description)) %>%
cSplit("TopicID", sep = ";", direction = "long") %>%
distinct(ID,Description, .keep_all = TRUE) %>%
cSplit("Topic", sep = ";", direction = "long") %>%
distinct(ID,Description, .keep_all = TRUE)
names(descriptionData) <-c("ID","Description","ClassID","Class")
print(paste0("Number of variables: ",ncol(descriptionData)))## [1] "Number of variables: 4"
print(paste0("Number of observations: ",nrow(descriptionData)))## [1] "Number of observations: 1602"
**Preview Data**
datatable(head(descriptionData), options = list(filter = FALSE), filter = "none")
#Plot Distribution
(descriptionData %>%
group_by(Class) %>%
summarise(Freq=n()) %>%
top_n(10) %>%
ungroup() %>%
ggplot(aes(x = reorder(Class,Freq), y = Freq,fill=Freq, label= `Freq`)) +
geom_bar(stat = "identity", show.legend = F) +
geom_text(size = 2, position = position_stack(vjust = 0.5)) +
coord_flip() +
labs(title = "Distribution of Topics (Top 10)", x = "Topic", y = "Frequency"))## Selecting by Freq
The data are highly imbalanced.
This will pose problems for ML algorithms, especially the Random Forest used in this project.
The data set is therefore down-sampled with caret's downSample() function, which randomly discards observations from the majority classes until every class matches the size of the smallest one.
l <- c(109,81,10,119,17,125,55)
ll <- c("Impacts of Climate Change",
"Data and Observations",
"Policy Making",
"Adaptation",
"Renewable Energy",
"General on Climate Change",
"Water Resource Management")
mlData <- filter(descriptionData, ClassID %in% l) %>%
mutate(ClassID = factor(ClassID))
#levels(mlData$ClassID) <- ll
down_train <- downSample(mlData,mlData$ClassID,list = TRUE)
mlData <- down_train$x
mlData <- mlData[sample(1:nrow(mlData)), ]
mlDataTrain <- mlData
nTrain = ceiling(nrow(mlData) * 0.70)
mlDataTrain <-mlData[c(1:nTrain),]
mlDataTest <- mlData[c((nTrain + 1):nrow(mlData)),]
print(paste0("Observations in training set: ",nrow(mlDataTrain)))## [1] "Observations in training set: 196"
print(paste0("Observations in test set: ",nrow(mlDataTest)))## [1] "Observations in test set: 84"
data_tokens <- mlDataTrain %>%
#create tidy data frame and converts words to lower case
unnest_tokens(output = word, input = Description) %>%
# remove numbers
filter(!str_detect(word, "^[0-9]*$")) %>%
# remove stop words
anti_join(stop_words) %>%
# stem the words (derivations of words are mapped )
# eg. possibly = possible
mutate(word = SnowballC::wordStem(word))
## Joining, by = "word"
datatable(head(data_tokens), options = list(filter = FALSE), filter = "none")
Both TF and TF-IDF weightings were tested.
data_dtm <- data_tokens %>%
# get count of each token in each document
count(ID, word) %>%
cast_dtm(document = ID, term = word, value = n
,weighting = tm::weightTfIdf
)
data_dtm## <<DocumentTermMatrix (documents: 196, terms: 1543)>>
## Non-/sparse entries: 4967/297461
## Sparsity : 98%
## Maximal term length: 15
## Weighting : term frequency - inverse document frequency (normalized) (tf-idf)
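For reference, the plain term-frequency variant mentioned above is the same pipeline without the weighting argument; this sketch mirrors the chunk used later for the PDF contents, and the name data_dtm_tf is illustrative only.
#plain term-frequency alternative (sketch); only the weighting changes
data_dtm_tf <- data_tokens %>%
  count(ID, word) %>%
  cast_dtm(document = ID, term = word, value = n) # cast_dtm defaults to tm::weightTf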
data_dtm <- removeSparseTerms(data_dtm, sparse = .90)** Create TFIDF for plotting **
data_tfidf <- data_tokens %>%
count(ClassID, word) %>%
bind_tf_idf(word, ClassID, n)
plot_data <- data_tfidf %>%
arrange(desc(tf_idf)) %>%
mutate(word = factor(word, levels = rev(unique(word))))
plot_data %>%
filter(ClassID %in% l) %>%
mutate(ClassID = factor(ClassID, levels=l,
labels = ll)) %>%
group_by(ClassID) %>%
top_n(5) %>%
ungroup() %>%
ggplot(aes(word, tf_idf)) +
geom_col() +
labs(title = "Influencial Words by Topic", x = NULL, y = "IF-IDF") +
facet_wrap(~ClassID, scales = "free") +
coord_flip()
## Selecting by tf_idf
set.seed(1001)
system.time({
RF_Traineds <- train(x = as.matrix(data_dtm),
y = factor(mlDataTrain$ClassID),
method = "rf",
ntree = 1000,
num.threads = 7,
trControl = trainControl(method = "oob"))
})
RF_Traineds$finalModel
As we can see, the model based on the Description field did not perform well at all, even with the long wait of 1,000 trees.
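The held-out test set created above is never scored in this write-up. The sketch below shows one way it could be evaluated, under assumptions flagged in the comments: test documents are cast onto the training vocabulary, and labels are aligned to the DTM's document IDs (cast_dtm() orders documents by ID, so the same alignment may be worth applying to the training labels as well). Variable names are illustrative.
#hedged sketch, not part of the original run: score the held-out test set
test_tokens <- mlDataTest %>%
  unnest_tokens(output = word, input = Description) %>%
  filter(!str_detect(word, "^[0-9]*$")) %>%
  anti_join(stop_words) %>%
  mutate(word = SnowballC::wordStem(word))
#cast the test documents, then project them onto the training vocabulary
#(note: IDF weights here come from the test corpus alone; reusing the training
#IDF would be more rigorous)
test_mat_full <- test_tokens %>%
  count(ID, word) %>%
  cast_dtm(document = ID, term = word, value = n, weighting = tm::weightTfIdf) %>%
  as.matrix()
train_terms <- Terms(data_dtm)
test_mat <- matrix(0, nrow(test_mat_full), length(train_terms),
                   dimnames = list(rownames(test_mat_full), train_terms))
common <- intersect(colnames(test_mat_full), train_terms)
test_mat[, common] <- test_mat_full[, common]
#align the true labels to the DTM's document IDs before comparing
preds <- predict(RF_Traineds, newdata = test_mat)
truth <- factor(mlDataTest$ClassID[match(rownames(test_mat), as.character(mlDataTest$ID))],
                levels = levels(preds))
confusionMatrix(preds, truth)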
Let’s see how the Random Forest on actual PDF contents performs.
The corpus contained 14 GB of data.
This text needs to be loaded into the data frame containing the metadata.
The file names included title text with the document ID embedded, so each file must be renamed based on its ID before it can be matched to the metadata.
for(i in 1:nrow(descriptionData)){
if(descriptionData$ID[i] > 1000){
doc <- list.files(path = "data/corpus/", pattern = paste0(descriptionData$ID[i],").pdf$"),full.names = TRUE)
if(length(doc) > 0){
new_name <- paste0("data/corpus/",descriptionData$ID[i],".pdf")
print(new_name)
file.rename(doc[1],new_name)
}
}
}
data2$Content <- rep("",nrow(data2))
for(i in nrow(mlData2):1){
doc <- paste0("data/corpus/",mlData2$ID[i],".pdf")
if(file.exists(doc)){
content <- pdftools::pdf_text(doc)
content <- paste(content, collapse = ' ')
content <- str_replace_all(content, "[\r\n]", " ")
content <- str_replace_all(content, "[[:punct:]]", " ")
content <- str_replace_all(content, "[[:digit:]]", " ")
content <- str_replace_all(content, ",", " ")
content <- str_replace_all(content, "<", " ")
content <- str_replace_all(content, ">", " ")
data2[which(data2$ID==mlData2$ID[i]),]$Content <- content
}
}
write.csv(select(data2,ID,ClassID,Content),file="data/pdf_contents.csv",quote = TRUE )
#load consolidated data
contentData <- read.csv2('data/pdf_contents.csv', sep=",", header = TRUE, stringsAsFactors = FALSE)
print(paste0("Number of variables: ",ncol(contentData)))## [1] "Number of variables: 3"
print(paste0("Number of observations: ",nrow(contentData)))## [1] "Number of observations: 574"
l <- c(109,81,10,119,17,125,55)
ll <- c("Impacts of Climate Change",
"Data and Observations",
"Policy Making",
"Adaptation",
"Renewable Energy",
"General on Climate Change",
"Water Resource Management")
mlData <- filter(contentData, ClassID %in% l) %>%
mutate(ClassID = as.integer(ClassID)) %>%
mutate(ClassID = factor(ClassID))
#levels(mlData$ClassID) <- ll
down_train <- downSample(mlData,mlData$ClassID,list = TRUE)
mlData <- down_train$x
mlData <- mlData[sample(1:nrow(mlData)), ]
mlDataTrain <- mlData
nTrain = ceiling(nrow(mlData) * 0.70)
mlDataTrain <-mlData[c(1:nTrain),]
mlDataTest <- mlData[c((nTrain + 1):nrow(mlData)),]
print(paste0("Observations in training set: ",nrow(mlDataTrain)))## [1] "Observations in training set: 59"
print(paste0("Observations in test set: ",nrow(mlDataTest)))## [1] "Observations in test set: 25"
data_tokens <- mlDataTrain %>%
#create tidy data frame and converts words to lower case
unnest_tokens(output = word, input = Content) %>%
# remove numbers
filter(!str_detect(word, "^[0-9]*$")) %>%
# remove stop words
anti_join(stop_words) %>%
# stem the words (derivations of words are mapped )
# eg. possibly = possible
mutate(word = SnowballC::wordStem(word))
## Joining, by = "word"
datatable(head(data_tokens), options = list(filter = FALSE), filter = "none")
data_dtm <- data_tokens %>%
# get count of each token in each document
count(ID, word) %>%
cast_dtm(document = ID, term = word, value = n
#,weighting = tm::weightTfIdf
)
data_dtm## <<DocumentTermMatrix (documents: 51, terms: 21712)>>
## Non-/sparse entries: 78081/1029231
## Sparsity : 93%
## Maximal term length: 119
## Weighting : term frequency (tf)
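The maximal term length of 119 characters suggests the PDF extraction produced some run-together tokens. One hedged refinement would be to cap token length in the tokenization step above and re-cast the DTM; the cutoff of 25 characters is illustrative only.
#optional cleanup, not in the original run: drop implausibly long tokens that
#typically come from run-together words in PDF extraction, then re-cast the DTM
data_tokens <- data_tokens %>%
  filter(str_length(word) <= 25) # illustrative cutoff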
data_dtm <- removeSparseTerms(data_dtm, sparse = .90)** Create TFIDF for plotting **
data_tfidf <- data_tokens %>%
count(ClassID, word) %>%
bind_tf_idf(word, ClassID, n)
plot_data <- data_tfidf %>%
arrange(desc(tf_idf)) %>%
mutate(word = factor(word, levels = rev(unique(word))))
plot_data %>%
filter(ClassID %in% l) %>%
mutate(ClassID = factor(ClassID, levels=l,
labels = ll)) %>%
group_by(ClassID) %>%
top_n(5) %>%
ungroup() %>%
ggplot(aes(word, tf_idf)) +
geom_col() +
labs(title = "Influencial Words by Topic", x = NULL, y = "IF-IDF") +
facet_wrap(~ClassID, scales = "free") +
coord_flip()
## Selecting by tf_idf
#the DTM holds 51 documents while mlDataTrain has 59 rows (PDFs with no
#extractable text drop out of the DTM), so subset and align the labels
#to the DTM's document IDs before training
dtm_ids <- rownames(as.matrix(data_dtm))
train_labels <- factor(mlDataTrain$ClassID[match(dtm_ids, as.character(mlDataTrain$ID))])
system.time({
  RF_Traineds <- train(x = as.matrix(data_dtm),
                       y = train_labels,
                       method = "rf",
                       ntree = 200,
                       num.threads = 7,
                       trControl = trainControl(method = "oob"))
})
RF_Traineds$finalModel
While the performance is much better, it is not quite ready for production.
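Before digging into the labels, it can help to see which terms the forest actually relies on. caret's varImp() exposes the underlying randomForest importance scores; this step is not part of the original run.
#inspect the most influential terms in the fitted forest (sketch)
imp <- varImp(RF_Traineds)
plot(imp, top = 20) # top 20 terms by mean decrease in Gini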
There seem to be issues with the original classification of the data.
After investigation, the likely reasons for the models' poor performance are:
- Non-standard document tagging practices
- Limited computing resources (laptop)
Documents are assigned to multiple topics of the same group.
The main issue stems from the fact that the same documents are tagged with both the parent topic and its child topics. This is likely to confuse the Random Forest at the splits. It would have been better to select only the child topics and to use the parent topics as tags.
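If the taxonomy's parent-child structure were available as a lookup table, restricting the labels to child (leaf) topics while keeping the parent as a tag could look like the sketch below. Here topicTree is a hypothetical table with ChildID and ParentID columns built from the official taxonomy; it is not part of the original data set.
#hedged sketch: 'topicTree' (ChildID, ParentID) is a hypothetical lookup table
childOnly <- descriptionData %>%
  #keep only rows labelled with a child (leaf) topic
  inner_join(topicTree, by = c("ClassID" = "ChildID")) %>%
  #carry the parent topic along as an extra tag rather than a competing class label
  rename(ParentClassID = ParentID)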
It is recommended to do a thorough review of the document tagging strategy. This has always been placed on the back-burner!
The garbage-in, garbage-out rule also applies to "smart" algorithms!
This has been a challenging but fruitful exercise. Thanks to DATA 607, I have learnt the skills necessary to think like a data scientist!