Social Dilemma
Let’s Begin
In this article we will build a text classification model with Naive Bayes. Our objective is to predict whether the sentiment of a tweet is Negative, Neutral, or Positive.
## user_name user_location
## 1 Mari Smith San Diego, California
## 2 Mari Smith San Diego, California
## 3 Varun Tyagi Goa, India
## 4 Casey Conway Sydney, New South Wales
## 5 Charlotte Paul Darlington
## 6 Denny Hulme Manchester, England
## user_description
## 1 Premier Facebook Marketing Expert | Social Media Thought Leader | Keynote Speaker | Dynamic Live Video Host | Ambassador | 🇨🇦ðŸ\217´ó \201§ó \201¢ó \201³ó \201£ó \201´ó \201¿ðŸ‡ºðŸ‡¸
## 2 Premier Facebook Marketing Expert | Social Media Thought Leader | Keynote Speaker | Dynamic Live Video Host | Ambassador | 🇨🇦ðŸ\217´ó \201§ó \201¢ó \201³ó \201£ó \201´ó \201¿ðŸ‡ºðŸ‡¸
## 3 Indian | Tech Solution Artist & Hospitality Expert 💻 | Socially Liberal | Travel Enthu | Beer Lover | Passionate about Slow & Sustainable Travel (& Living too)
## 4 Head of Diversity & Inclusion @RugbyAU | It's not a tan, I'm Aboriginal. A gay one ðŸ\217³ï¸\217â\200\215ðŸŒ\210 | IG: casey_conway | 100% my views etc ✌ðŸ\217¾
## 5 Instagram Charlottejyates
## 6
## user_created user_followers user_friends user_favourites user_verified
## 1 2007-09-11 22:22:51 579942 288625 11610 False
## 2 2007-09-11 22:22:51 579942 288625 11610 False
## 3 2009-09-06 10:36:01 257 204 475 False
## 4 2012-12-28 21:45:06 11782 1033 12219 True
## 5 2012-05-28 20:43:08 278 387 5850 False
## 6 2009-11-06 15:20:09 336 616 4748 False
## date
## 1 2020-09-16 20:55:33
## 2 2020-09-16 20:53:17
## 3 2020-09-16 20:51:57
## 4 2020-09-16 20:51:46
## 5 2020-09-16 20:51:11
## 6 2020-09-16 20:50:33
## text
## 1 @musicmadmarc @SocialDilemma_ @netflix @Facebook I'm also reminded of the very poignant quote by French philosopherâ\200¦ https://t.co/CA52aepW6K
## 2 @musicmadmarc @SocialDilemma_ @netflix @Facebook haa, hey Marc. I get what you're saying & don't agree. 🤪\n\nWhicheveâ\200¦ https://t.co/nsVtPHjUs8
## 3 Go watch â\200œThe Social Dilemmaâ\200\235 on Netflix!\n\nItâ\200\231s the best 100 minutes youâ\200\231ll spend in 2020. I bet you💯â\200¦ https://t.co/GSWCx3E9tG
## 4 I watched #TheSocialDilemma last night. Iâ\200\231m scared for humanity. \n\nIâ\200\231m not sure what to do but Iâ\200\231ve logged out of Fâ\200¦ https://t.co/luOBcjCJFb
## 5 The problem of me being on my phone most the time while trying to watch #TheSocialDilemma 🤦ðŸ\217¼â\200\215â\231\200ï¸\217
## 6 #TheSocialDilemma ðŸ\230³ wow!! We need regulations on social media platforms and quick!!
## hashtags source is_retweet Sentiment
## 1 Twitter Web App False Neutral
## 2 Twitter Web App False Neutral
## 3 Twitter for iPhone False Positive
## 4 ['TheSocialDilemma'] Twitter for iPhone False Negative
## 5 ['TheSocialDilemma'] Twitter for iPhone False Positive
## 6 ['TheSocialDilemma'] Twitter for iPhone False Positive
## 'data.frame': 20068 obs. of 14 variables:
## $ user_name : chr "Mari Smith" "Mari Smith" "Varun Tyagi" "Casey Conway" ...
## $ user_location : chr "San Diego, California" "San Diego, California" "Goa, India" "Sydney, New South Wales" ...
## $ user_description: chr "Premier Facebook Marketing Expert | Social Media Thought Leader | Keynote Speaker | Dynamic Live Video Host | A"| __truncated__ "Premier Facebook Marketing Expert | Social Media Thought Leader | Keynote Speaker | Dynamic Live Video Host | A"| __truncated__ "Indian | Tech Solution Artist & Hospitality Expert 💻 | Socially Liberal | Travel Enthu | Beer Lover | Passio"| __truncated__ "Head of Diversity & Inclusion @RugbyAU | It's not a tan, I'm Aboriginal. A gay one ðŸ\217³ï¸\217â\200\215ðŸŒ\21"| __truncated__ ...
## $ user_created : chr "2007-09-11 22:22:51" "2007-09-11 22:22:51" "2009-09-06 10:36:01" "2012-12-28 21:45:06" ...
## $ user_followers : int 579942 579942 257 11782 278 336 120 696 2180 5011 ...
## $ user_friends : int 288625 288625 204 1033 387 616 841 444 1570 2422 ...
## $ user_favourites : int 11610 11610 475 12219 5850 4748 546 10551 18692 619 ...
## $ user_verified : chr "False" "False" "False" "True" ...
## $ date : chr "2020-09-16 20:55:33" "2020-09-16 20:53:17" "2020-09-16 20:51:57" "2020-09-16 20:51:46" ...
## $ text : chr "@musicmadmarc @SocialDilemma_ @netflix @Facebook I'm also reminded of the very poignant quote by French philoso"| __truncated__ "@musicmadmarc @SocialDilemma_ @netflix @Facebook haa, hey Marc. I get what you're saying & don't agree. ðŸ¤"| __truncated__ "Go watch â\200œThe Social Dilemmaâ\200\235 on Netflix!\n\nItâ\200\231s the best 100 minutes youâ\200\231ll spen"| __truncated__ "I watched #TheSocialDilemma last night. Iâ\200\231m scared for humanity. \n\nIâ\200\231m not sure what to do bu"| __truncated__ ...
## $ hashtags : chr "" "" "" "['TheSocialDilemma']" ...
## $ source : chr "Twitter Web App" "Twitter Web App" "Twitter for iPhone" "Twitter for iPhone" ...
## $ is_retweet : chr "False" "False" "False" "False" ...
## $ Sentiment : chr "Neutral" "Neutral" "Positive" "Negative" ...
We have a lot of columns and observations, so let's wrangle the data.
Text Mining
I will reduce the data to the first 10,000 observations so that my computer can build the DocumentTermMatrix without struggling (honestly, that's because my computer is a POTATO). For this article we keep only 2 columns: text and Sentiment.
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
dataset = dataset[1:10000,] %>%
  select(text, Sentiment) %>%
  mutate(Sentiment = as.factor(Sentiment))
head(dataset)
## text
## 1 @musicmadmarc @SocialDilemma_ @netflix @Facebook I'm also reminded of the very poignant quote by French philosopherâ\200¦ https://t.co/CA52aepW6K
## 2 @musicmadmarc @SocialDilemma_ @netflix @Facebook haa, hey Marc. I get what you're saying & don't agree. 🤪\n\nWhicheveâ\200¦ https://t.co/nsVtPHjUs8
## 3 Go watch â\200œThe Social Dilemmaâ\200\235 on Netflix!\n\nItâ\200\231s the best 100 minutes youâ\200\231ll spend in 2020. I bet you💯â\200¦ https://t.co/GSWCx3E9tG
## 4 I watched #TheSocialDilemma last night. Iâ\200\231m scared for humanity. \n\nIâ\200\231m not sure what to do but Iâ\200\231ve logged out of Fâ\200¦ https://t.co/luOBcjCJFb
## 5 The problem of me being on my phone most the time while trying to watch #TheSocialDilemma 🤦ðŸ\217¼â\200\215â\231\200ï¸\217
## 6 #TheSocialDilemma ðŸ\230³ wow!! We need regulations on social media platforms and quick!!
## Sentiment
## 1 Neutral
## 2 Neutral
## 3 Positive
## 4 Negative
## 5 Positive
## 6 Positive
Check the proportion of each Sentiment class in the data
##
## Negative Neutral Positive
## 0.1809 0.3501 0.4690
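The code behind this table is not shown above. A minimal sketch of how class proportions like these can be computed in base R, using a small made-up label vector (the `sentiment` values below are hypothetical and only stand in for `dataset$Sentiment`):

```r
# Toy labels standing in for dataset$Sentiment (hypothetical values)
sentiment <- factor(c("Positive", "Neutral", "Positive", "Negative", "Positive",
                      "Neutral", "Positive", "Neutral", "Negative", "Positive"))

# Count each class, then turn the counts into proportions that sum to 1
class_prop <- prop.table(table(sentiment))
round(class_prop, 2)
```

Checking the proportions first tells us whether the classes are balanced. Here, as in the real data, Positive is the largest class.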
Corpus Text Data
Before we build a model with this data we need a cleansing step, and before cleansing we must convert the text data into a corpus with library(tm).
## Loading required package: NLP
Text cleaning
Check the text data before cleaning
## [1] "Although I am a sucker for things that are curated for me and I get so much joy from the perfect #youtube autoplay list. #thesocialdilemma"
Text cleaning process
In this section we transform our text data in several ways: remove numbers, convert to lowercase, remove stopwords, remove symbols, remove punctuation, stem the words, and collapse repeated whitespace.
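library(tm)
# NOTE: the definition of `transformer` is not shown in this article. This is an
# assumed helper that replaces every match of `pattern` with a space.
transformer = content_transformer(function(x, pattern) gsub(pattern, " ", x))

data.corpus = tm_map(data.corpus, removeNumbers)                     # drop digits
data.corpus = tm_map(data.corpus, content_transformer(tolower))      # lowercase everything
data.corpus = tm_map(data.corpus, removeWords, stopwords("english")) # drop English stopwords
data.corpus = tm_map(data.corpus, transformer, "/")                  # replace symbols with spaces
data.corpus = tm_map(data.corpus, transformer, "@")
data.corpus = tm_map(data.corpus, transformer, "-")
data.corpus = tm_map(data.corpus, transformer, "\\.")
data.corpus = tm_map(data.corpus, removePunctuation)                 # drop remaining punctuation
data.corpus = tm_map(data.corpus, stemDocument)                      # reduce words to their stems
data.corpus = tm_map(data.corpus, stripWhitespace)                   # collapse repeated spaces
After cleaning text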
## [1] "although sucker thing curat get much joy perfect youtub autoplay list thesocialdilemma"
Tokenization
In this section we convert the corpus into a Document-Term Matrix (DTM) through a process called tokenization. Tokenization breaks each sentence into terms. A term can be a single word, a pair of words, and so on. The point is that in the DTM each term becomes a predictor whose value is the frequency of that term in the document.
## <<DocumentTermMatrix (documents: 10000, terms: 15634)>>
## Non-/sparse entries: 99077/156240923
## Sparsity : 100%
## Maximal term length: 220
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs dilemma documentari https just media must netflix social
## 2567 0 0 1 0 0 0 1 0
## 2933 1 0 1 0 1 0 0 2
## 4168 1 0 1 0 0 0 0 1
## 4327 0 0 1 0 0 0 0 0
## 4623 0 0 1 0 0 0 1 0
## 5179 0 0 1 0 0 0 0 0
## 5243 0 0 1 1 1 0 0 1
## 5660 0 0 1 0 0 0 0 0
## 5911 1 0 1 0 1 0 0 2
## 6033 0 0 1 0 0 0 1 1
## Terms
## Docs thesocialdilemma watch
## 2567 1 2
## 2933 0 1
## 4168 0 0
## 4327 0 0
## 4623 0 0
## 5179 0 0
## 5243 0 1
## 5660 0 0
## 5911 0 0
## 6033 0 1
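To make the structure of a DTM concrete, here is a small sketch on three made-up, already cleaned documents. The names `docs`, `toy_corpus`, and `toy_dtm` are hypothetical and are not part of the pipeline above. It only assumes the `tm` package.

```r
library(tm)

# Three tiny, already-cleaned documents (hypothetical examples)
docs <- c("watch social dilemma netflix",
          "social media social dilemma",
          "netflix documentari watch watch")

# Build a corpus, then tokenize it into a Document-Term Matrix
toy_corpus <- VCorpus(VectorSource(docs))
toy_dtm <- DocumentTermMatrix(toy_corpus)

# One row per document, one column per distinct term
dim(toy_dtm)

# Each cell is the number of times the term occurs in the document
toy_mat <- as.matrix(toy_dtm)
toy_mat
```

Each row is a document and each column is a term, so the word "social" gets a count of 2 in the second document. The real matrix has the same shape, just with 10,000 rows and 15,634 columns, which is why it is almost entirely zeros.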
We can see which words appear in at least 500 observations across all our data. These frequent words are good candidate predictors.
Cross Validation
In this section we split the data into training and test sets, with 75% of the observations going to the training set.
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)
index = sample(nrow(data.dtm), nrow(data.dtm)*0.75)
data_train = data.dtm[index, ]
data_test = data.dtm[-index, ]
Preparing the target labels
Check the proportion of each class in the training labels
## label_train
## Negative Neutral Positive
## 0.18 0.35 0.46
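The label-preparation code itself is not shown here. A minimal sketch of the idea, assuming the labels come from the Sentiment column and are split with the same `index` as the features. The `toy` data frame is made up for illustration; `label_train` and `label_test` follow the names used in this article.

```r
set.seed(100)

# Toy data standing in for the 10,000 tweets (hypothetical values)
toy <- data.frame(
  text = paste("tweet", 1:20),
  Sentiment = factor(rep(c("Negative", "Neutral", "Positive", "Positive"), 5))
)

# Sample 75% of the row numbers for the training set
index <- sample(nrow(toy), nrow(toy) * 0.75)

# The labels must be split with the same index as the features,
# otherwise rows and labels no longer line up
label_train <- toy[index, "Sentiment"]
label_test  <- toy[-index, "Sentiment"]

# Class proportions in the training labels
round(prop.table(table(label_train)), 2)
```

If the training proportions are close to the proportions of the full data, as they are in the output above, the random split did not distort the class balance.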
Further Data Pre-processing
Let's check the dimensions of the training data before we use it to build a model
## <<DocumentTermMatrix (documents: 7500, terms: 15634)>>
## Non-/sparse entries: 74179/117180821
## Sparsity : 100%
## Maximal term length: 220
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs dilemma documentari https just media must netflix social
## 1391 0 1 1 0 0 0 0 0
## 2567 0 0 1 0 0 0 1 0
## 2933 1 0 1 0 1 0 0 2
## 4623 0 0 1 0 0 0 1 0
## 5179 0 0 1 0 0 0 0 0
## 5243 0 0 1 1 1 0 0 1
## 5660 0 0 1 0 0 0 0 0
## 6033 0 0 1 0 0 0 1 1
## 6518 0 0 1 0 0 0 0 0
## 8460 0 0 1 1 0 0 0 0
## Terms
## Docs thesocialdilemma watch
## 1391 0 0
## 2567 1 2
## 2933 0 1
## 4623 0 0
## 5179 0 0
## 5243 0 1
## 5660 0 0
## 6033 0 1
## 6518 0 0
## 8460 0 0
The columns (words) available for prediction are numerous. To reduce noise (words that rarely appear), we will keep only the words that appear fairly often: those with a total frequency of at least 20 in the training data (the lowfreq argument of findFreqTerms()).
## [1] "â\200¦" "â\200\230" "â\200\230usersâ\200\231" "â\200”" "â\200\235"
## [6] "â\200œ"
We keep only the terms listed in data_freq
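The filtering step can be sketched on a toy matrix as follows. This assumes the `tm` package; the documents and the `toy_` names are made up for illustration. Note that `findFreqTerms()` compares the total count of a term across all documents with `lowfreq`.

```r
library(tm)

# Four tiny documents (hypothetical examples)
docs <- c("watch social dilemma",
          "social media social",
          "watch netflix",
          "social dilemma netflix")
toy_dtm <- DocumentTermMatrix(VCorpus(VectorSource(docs)))

# Terms whose total count across all documents is at least 2
toy_freq <- findFreqTerms(toy_dtm, lowfreq = 2)
toy_freq

# Keep only those columns of the matrix
toy_small <- toy_dtm[, toy_freq]
dim(toy_small)

# A stricter threshold leaves fewer terms
length(findFreqTerms(toy_dtm, lowfreq = 3))
```

In this toy example "media" occurs only once, so it is dropped. Raising `lowfreq` shrinks the matrix further, which trades some information for speed.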
## <<DocumentTermMatrix (documents: 7500, terms: 483)>>
## Non-/sparse entries: 51645/3570855
## Sparsity : 99%
## Maximal term length: 19
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs dilemma documentari https just media must netflix social
## 1249 1 1 1 1 0 0 1 1
## 5815 0 0 1 0 0 0 0 0
## 6566 1 0 0 1 0 0 0 1
## 7064 0 0 0 0 0 0 0 0
## 8317 0 0 1 1 0 0 1 1
## 8350 4 0 1 0 0 0 0 4
## 8921 0 0 0 0 1 0 0 1
## 9099 0 0 0 0 1 0 1 1
## 9191 0 0 1 0 1 0 0 1
## 9760 0 0 1 1 0 0 0 0
## Terms
## Docs thesocialdilemma watch
## 1249 0 2
## 5815 0 0
## 6566 1 0
## 7064 1 0
## 8317 0 1
## 8350 0 5
## 8921 1 1
## 9099 1 0
## 9191 0 0
## 9760 1 1
We can see there are 7500 documents (texts) and 483 terms in data_train.
Bernoulli Converter
In this section we convert the counts in our DTM to just 0 and 1:
* if a word appears at least once in a document (count >= 1), the value becomes 1
* if a word does not appear (count = 0), the value stays 0
## [1] 170.25
## [1] 52000
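The conversion and model-training code is not shown in this section. The sketch below illustrates one way to do both, assuming the `e1071` package. The helper `bernoulli_conv`, the toy count matrix, and the choice of `laplace = 1` are assumptions for illustration, not necessarily what was used to build `model_naive`.

```r
library(e1071)

# Convert counts to a two-level factor: 1 if the term appears, 0 otherwise
bernoulli_conv <- function(x) {
  factor(ifelse(x > 0, 1, 0), levels = c(0, 1))
}

# Toy count matrix standing in for the filtered DTM (hypothetical values)
counts <- matrix(c(2, 0, 1,
                   0, 3, 0,
                   1, 0, 0,
                   0, 1, 4),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(NULL, c("love", "scari", "watch")))
labels <- factor(c("Positive", "Negative", "Positive", "Negative"))

# Apply the converter column by column, keeping every column a factor
toy_bn <- as.data.frame(lapply(as.data.frame(counts), bernoulli_conv))

# Laplace smoothing avoids zero probabilities for unseen term/class pairs
toy_model <- naiveBayes(x = toy_bn, y = labels, laplace = 1)

toy_class <- predict(toy_model, newdata = toy_bn, type = "class")
toy_raw   <- predict(toy_model, newdata = toy_bn, type = "raw")
toy_class
```

Keeping both levels (0 and 1) on every column matters: if a column in the test set happens to contain only zeros, the factor still lines up with the table learned during training.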
Predicting
I want both the class and the raw (probability) predictions, so we can use them for the Confusion Matrix and for ROC & AUC.
# predicted class labels, used for the confusion matrix
data_predClass = predict(object = model_naive,
                         newdata = data_test_bn,
                         type = "class")
# predicted class probabilities, used for ROC and AUC
data_predRaw = predict(object = model_naive,
                       newdata = data_test_bn,
                       type = "raw")
head(data_predClass)
## [1] Neutral Neutral Positive Neutral Positive Negative
## Levels: Negative Neutral Positive
Confusion Matrix
## Loading required package: lattice
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
## Confusion Matrix and Statistics
##
## Reference
## Prediction Negative Neutral Positive
## Negative 190 31 74
## Neutral 97 740 213
## Positive 143 95 917
##
## Overall Statistics
##
## Accuracy : 0.7388
## 95% CI : (0.7211, 0.7559)
## No Information Rate : 0.4816
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.573
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: Negative Class: Neutral Class: Positive
## Sensitivity 0.4419 0.8545 0.7616
## Specificity 0.9493 0.8103 0.8164
## Pos Pred Value 0.6441 0.7048 0.7939
## Neg Pred Value 0.8912 0.9131 0.7866
## Prevalence 0.1720 0.3464 0.4816
## Detection Rate 0.0760 0.2960 0.3668
## Detection Prevalence 0.1180 0.4200 0.4620
## Balanced Accuracy 0.6956 0.8324 0.7890
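From this summary we get an accuracy of 73.9%, well above the no-information rate of 48.2%, which means this model is quite good overall. Note, though, that sensitivity for the Negative class is only 44.2%, so many negative tweets are missed. Next, let's look at ROC and AUC.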
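As a check on how these numbers arise, the headline metrics can be recomputed by hand from the confusion matrix printed above. This is a base R sketch; the variable names are only for illustration.

```r
# Confusion matrix copied from the output above (rows = prediction, columns = reference)
cm <- matrix(c(190,  31,  74,
                97, 740, 213,
               143,  95, 917),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = c("Negative", "Neutral", "Positive"),
                             Reference  = c("Negative", "Neutral", "Positive")))

# Accuracy: correctly classified tweets (the diagonal) over all test tweets
accuracy <- sum(diag(cm)) / sum(cm)

# Sensitivity per class: correct predictions over all true members of the class
sensitivity <- diag(cm) / colSums(cm)

# Pos Pred Value per class: correct predictions over all predictions of the class
pos_pred_value <- diag(cm) / rowSums(cm)

round(accuracy, 4)
round(sensitivity, 4)
round(pos_pred_value, 4)
```

The diagonal holds 190 + 740 + 917 = 1847 correct predictions out of 2500 test tweets, which gives the 0.7388 accuracy reported in the summary.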
ROC and AUC for each target class
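Because there are three classes, each one is evaluated one-vs-rest: the predicted probability of the class is compared against a 0/1 label saying whether the tweet really belongs to it. The sketch below shows the idea on made-up scores and also extracts the AUC value with the `ROCR` package. The vectors `prob_positive` and `truth` are hypothetical.

```r
library(ROCR)

# Toy one-vs-rest scores: predicted probability of the "Positive" class
# together with the true labels (hypothetical values)
prob_positive <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)
truth <- c("Positive", "Positive", "Neutral", "Positive",
           "Negative", "Positive", "Neutral", "Negative")

# 1 if the tweet is truly Positive, 0 for every other class
pred_obj <- prediction(predictions = prob_positive,
                       labels = as.numeric(truth == "Positive"))

# ROC curve: true positive rate against false positive rate
roc <- performance(pred_obj, measure = "tpr", x.measure = "fpr")

# AUC is stored in the y.values slot of the performance object
auc <- performance(pred_obj, measure = "auc")@y.values[[1]]
auc
```

An AUC of 0.5 corresponds to the dashed diagonal of a random classifier, and values closer to 1 mean the class is separated better. Calling `plot(roc)` draws the curve in the same way as the plots in this section.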
ROC and AUC for the Positive class probability
library(ROCR)
# create the prediction object
data_roc_positive = prediction(predictions = data_predRaw[,3], # probability of the Positive class
                               labels = as.numeric(label_test == "Positive")) # 1 if the true label is Positive
# create the performance object from the prediction object
perfPositive = performance(prediction.obj = data_roc_positive,
                           measure = "tpr", # tpr = true positive rate
                           x.measure = "fpr") # fpr = false positive rate
asd = as.data.frame(data_predClass)
# draw the plot
plot(perfPositive)
abline(0,1, lty = 2) # dashed diagonal = random classifier
ROC and AUC for the Neutral class probability
data_roc_neutral = prediction(predictions = data_predRaw[,2], # probability of the Neutral class
                              labels = as.numeric(label_test == "Neutral")) # 1 if the true label is Neutral
# create the performance object for the Neutral class
perfNeutral = performance(prediction.obj = data_roc_neutral,
                          measure = "tpr", # tpr = true positive rate
                          x.measure = "fpr") # fpr = false positive rate
plot(perfNeutral)
abline(0,1, lty = 2) # dashed diagonal = random classifier
ROC and AUC for the Negative class probability
data_roc_negative = prediction(predictions = data_predRaw[,1], # probability of the Negative class
                               labels = as.numeric(label_test == "Negative")) # 1 if the true label is Negative
# create the performance object for the Negative class
perfNegative = performance(prediction.obj = data_roc_negative,
                           measure = "tpr", # tpr = true positive rate
                           x.measure = "fpr") # fpr = false positive rate
plot(perfNegative)
abline(0,1, lty = 2) # dashed diagonal = random classifier
Make a Better Model
In this part I will only tell you how to make the model better: adjust lowfreq in findFreqTerms(). You can set it to whatever value you want. Earlier I tried changing that parameter to 50 and got an accuracy of 70.6%. If you have free time, go ahead and try other values for lowfreq. Why don't I show how to build a better model here? Because my computer takes a very long time to compute it and, even worse, my computer will ...
Conclusion
Before we draw a conclusion, I hope you have all watched this film. My conclusion is this: if I were a person who knows technology like AI, machine learning, deep learning, and so on, I would express a positive sentiment, or at least keep that perspective. That is because we all live in a world of technology, which means we understand how technology works to make life easier, or maybe to support business. It is a powerful tool, and we use it all the time and everywhere.