This paper presents a text-mining analysis of the last statements of executed offenders. Several methods are applied in turn: a pre-analysis of word frequencies, categorisation, clustering, and sentiment analysis. The code and its explanation accompany each step.
The dataset includes information on offenders executed by the Texas Department of Criminal Justice from 1982 to November 8th, 2017. In Furman v. Georgia (1972) the Supreme Court considered a group of consolidated cases and severely restricted the death penalty. Like other states, however, Texas adjusted its legislation to address the Court's concerns and reinstated capital punishment in 1973. Texas adopted execution by lethal injection in 1977, and in 1982, the starting year of this dataset, the first offender was executed by this method.
The dataset consists of 545 observations with 21 variables. The last variable, LastStatement, was analysed and is presented below.
# Libraries used in the analysis
library(tm)
library(SnowballC)
library(stringr)
library(tidyverse)
library(tidytext)
library(glue)
library(data.table)
library(class)
library(caret)
library(wordcloud)
library(cluster)
library(topicmodels) # LDA
library(SentimentAnalysis)
Data structure
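A sketch of the data-loading step is given below; the file name and the read.csv call are assumptions, since the original loading code is not shown.
#load the executed offenders data (file name assumed)
offenders <- read.csv("Texas Last Statement.csv", stringsAsFactors = FALSE)
dim(offenders)        # 545 observations, 21 variables
head(offenders, 2)    # first two records, shown below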
## Execution LastName FirstName TDCJNumber Age Race CountyOfConviction
## 1 545 Cardenas Ruben 999275 47 Hispanic Hidalgo
## 2 544 Pruett Robert 999411 38 White Bee
## AgeWhenReceived EducationLevel NativeCounty PreviousCrime Codefendants
## 1 28 11 1 0 0
## 2 22 8 0 1 0
## NumberVictim WhiteVictim HispanicVictim BlackVictim VictimOther.Races
## 1 1 0 1 0 0
## 2 1 1 0 0 0
## FemaleVictim MaleVictim
## 1 1 0
## 2 0 1
## LastStatement
## 1 this is my statement: my final words. first, i want to thank my family for believing in me and being there with me till the end. i love you all very much! and i know that you love me too! life does go on. next, i would like to also thank my attorney's maurie levin, alicia amezcua rodriguez and sandra babcock for all their hard work they have done to help me out. i am so thankful. i would also like to thank the mexican consul for all their help too, and every government official that was trying to help me out too. thank you maricela luna and julia thimm for being such good friends! now! i will not and cannot apologize for someone else's crime, but, i will be back for justice! you can count on that! thank you.
## 2 i just want to let everyone in here to know i love you so much. i've hurt a lot of people and a lot of people have hurt me. i love y'all so much. life don't end here it goes on forever. i've had to learn lessons in life the hard way. one day there won't be a need to hurt people. i love y'all so much. i'm ready to go but i'll be back. nighty night everybody, nighty night everybody. i'm done warden.
The graph above shows the words that were used most often. 'Love' appears more than twice as often as any other single word. The top five words are: love, family, know, thank, and will. Considering their meaning, they can be split into two groups: the first contains love and family, the things that seem to matter most to those sentenced to death; the second consists of verbs, which may indicate a willingness to act before death. Note that 'love' can, of course, also be used as a verb.
Based on the graph above, the word 'god' was used most often by Black offenders, while the word 'family' was used most often by Hispanic offenders. This may hint at which group places more emphasis on religion and which on family.
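The charts above, and the code that follows, assume that the statements have already been converted into a corpus docs and a document-term matrix dtm. A minimal preprocessing sketch with the tm package is given below; the exact transformations used in the original analysis are an assumption, and the barplot call is only an approximation of the frequency chart shown above.
# Assumed preprocessing pipeline: corpus, cleaning, stemming, document-term matrix
docs <- Corpus(VectorSource(offenders$LastStatement))
docs <- tm_map(docs, content_transformer(tolower))        # lower-case all text
docs <- tm_map(docs, removePunctuation)                   # strip punctuation
docs <- tm_map(docs, removeNumbers)                       # strip digits
docs <- tm_map(docs, removeWords, stopwords("english"))   # drop common stop words
docs <- tm_map(docs, stripWhitespace)                     # collapse extra whitespace
docs <- tm_map(docs, stemDocument)                        # stem terms ("family" -> "famili")
dtm  <- DocumentTermMatrix(docs)                          # 545 statements x terms
# Approximate recreation of the frequency chart above
freq_all <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
barplot(head(freq_all, 10), las = 2, main = "Most frequent words")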
freq <- rowSums(as.matrix(dtm))   # number of (non-unique) terms in each statement
length(freq)
## [1] 545
ord <- order(freq)    # statements ordered by their term count
m <- as.matrix(dtm)   # dense matrix form of the document-term matrix
# Remove sparse terms: keep only terms appearing in at least ~20% of the statements
dtms <- removeSparseTerms(dtm, 0.8)
freq <- colSums(as.matrix(dtms))   # frequency of each remaining term
dtms
## <<DocumentTermMatrix (documents: 545, terms: 13)>>
## Non-/sparse entries: 2072/5013
## Sparsity : 71%
## Maximal term length: 6
## Weighting : term frequency (tf)
For the most common words, the term associations look as follows:
# Look at the term correlations, specify the words you would like to investigate
#findAssocs(dtm, c("love" , "famili"), corlimit=0.30)
findAssocs(dtms, "love", corlimit=0.3)
## $love
## god know famili will
## 0.37 0.36 0.35 0.33
#findAssocs(dtms, "famili", corlimit=0.3)
#findAssocs(dtms, "god", corlimit=0.3)
# Draw the word cloud
freq <- colSums(as.matrix(dtm))            # term frequencies over the full DTM
dark2 <- brewer.pal(6, "Dark2")            # colour palette (RColorBrewer, loaded with wordcloud)
wordcloud(names(freq), freq, max.words=100, rot.per=0.2, colors=dark2)
The word cloud above is an additional way of showing the most common words.
# Hierarchical clustering: euclidean distance between terms, k is the number of clusters
dtms <- removeSparseTerms(dtm, 0.8)
d <- dist(t(dtms), method="euclidean")   # distances between terms (columns of the DTM)
fit <- hclust(d=d, method="complete")    # complete-linkage agglomerative clustering
plot.new()
plot(fit, hang=-1)
groups <- cutree(fit, k=3)
rect.hclust(fit, k=3, border="red")
We can observe three main clusters: one consists only of the word 'love', and the remaining terms form two further groups. The middle cluster on the dendrogram shows no obvious pattern at first sight, while the third cluster contains the verbs that pair naturally with 'I': 'I will', 'I know', 'I want', 'I say'.
We can also check whether there are any main topics into which the statements can be split.
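A sketch of a topic model fitted with LDA from the topicmodels package follows; the number of topics k = 3 and the random seed are assumptions, not parameters taken from the original analysis.
# Topic modelling sketch (assumed k = 3 topics)
non_empty <- rowSums(as.matrix(dtm)) > 0                  # LDA needs at least one term per document
lda_fit <- LDA(dtm[non_empty, ], k = 3, control = list(seed = 1234))
terms(lda_fit, 5)                                         # five most probable terms per topic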
Sentiment analysis can be used to determine whether the last statements are predominantly positive or negative. As an example, the sentiment of a single statement is computed first.
## [1] "yes sir, i would first like to say to the sanchez family how sorry i am.\377 words cannot begin to express how sorry\377 i am and the hurt that i have caused you and your family.\377 may this bring you peace and forgiveness. i am sorry.to my family, thank you for all your love and support.\377 i am at peace.\377 jesus christ is lord.\377 i love you all.\377 thank you warden that is it."
analyzeSentiment(docs[5])   # sentiment scores of the fifth statement under several dictionaries (GI, HE, LM, QDAP)
## WordCount SentimentGI NegativityGI PositivityGI SentimentHE NegativityHE
## 1 31 0.1612903 0.06451613 0.2258065 0 0
## PositivityHE SentimentLM NegativityLM PositivityLM RatioUncertaintyLM
## 1 0 -0.03225806 0.03225806 0 0.03225806
## SentimentQDAP NegativityQDAP PositivityQDAP
## 1 0.1290323 0.06451613 0.1935484
Based on the default dictionaries of the SentimentAnalysis package, this statement is neither clearly positive nor clearly negative: the GI and QDAP scores are slightly positive, the HE score is neutral, and the LM score is slightly negative.
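The same check can be extended to all statements; the sketch below applies analyzeSentiment to the whole corpus and converts the General Inquirer scores into a positive/neutral/positive direction summary. The choice of the GI dictionary for this summary is an assumption.
# Sentiment of every last statement, summarised as a direction
sentiments <- analyzeSentiment(docs)
table(convertToDirection(sentiments$SentimentGI))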