Social Dilemma

Let’s Begin

In this article we will use text data to build a classification model with naive Bayes. The objective is to predict whether the sentiment of a text is Negative, Neutral, or Positive.

dataset = read.csv("TheSocialDilemma.csv")
head(dataset)

##        user_name           user_location
## 1     Mari Smith   San Diego, California
## 2     Mari Smith   San Diego, California
## 3    Varun Tyagi              Goa, India
## 4   Casey Conway Sydney, New South Wales
## 5 Charlotte Paul              Darlington
## 6    Denny Hulme     Manchester, England
##                                                                                                                                                          user_description
## 1 Premier Facebook Marketing Expert | Social Media Thought Leader | Keynote Speaker | Dynamic Live Video Host | Ambassador | ðŸ‡¨ðŸ‡¦ðŸ\217´ó \201§ó \201¢ó \201³ó \201£ó \201´ó \201¿ðŸ‡ºðŸ‡¸
## 2 Premier Facebook Marketing Expert | Social Media Thought Leader | Keynote Speaker | Dynamic Live Video Host | Ambassador | ðŸ‡¨ðŸ‡¦ðŸ\217´ó \201§ó \201¢ó \201³ó \201£ó \201´ó \201¿ðŸ‡ºðŸ‡¸
## 3      Indian | Tech Solution Artist & Hospitality Expert ðŸ’» | Socially Liberal | Travel Enthu | Beer Lover | Passionate about Slow & Sustainable Travel (& Living too)
## 4                        Head of Diversity & Inclusion @RugbyAU | It's not a tan, I'm Aboriginal. A gay one ðŸ\217³ï¸\217â\200\215ðŸŒ\210 | IG: casey_conway | 100% my views etc âœŒðŸ\217¾
## 5                                                                                                                                               Instagram Charlottejyates
## 6                                                                                                                                                                        
##          user_created user_followers user_friends user_favourites user_verified
## 1 2007-09-11 22:22:51         579942       288625           11610         False
## 2 2007-09-11 22:22:51         579942       288625           11610         False
## 3 2009-09-06 10:36:01            257          204             475         False
## 4 2012-12-28 21:45:06          11782         1033           12219          True
## 5 2012-05-28 20:43:08            278          387            5850         False
## 6 2009-11-06 15:20:09            336          616            4748         False
##                  date
## 1 2020-09-16 20:55:33
## 2 2020-09-16 20:53:17
## 3 2020-09-16 20:51:57
## 4 2020-09-16 20:51:46
## 5 2020-09-16 20:51:11
## 6 2020-09-16 20:50:33
##                                                                                                                                                      text
## 1          @musicmadmarc @SocialDilemma_ @netflix @Facebook I'm also reminded of the very poignant quote by French philosopherâ\200¦ https://t.co/CA52aepW6K
## 2 @musicmadmarc @SocialDilemma_ @netflix @Facebook haa, hey Marc. I get what you're saying &amp; don't agree. ðŸ¤ª\n\nWhicheveâ\200¦ https://t.co/nsVtPHjUs8
## 3           Go watch â\200œThe Social Dilemmaâ\200\235 on Netflix!\n\nItâ\200\231s the best 100 minutes youâ\200\231ll spend in 2020. I bet youðŸ’¯â\200¦ https://t.co/GSWCx3E9tG
## 4  I watched #TheSocialDilemma last night. Iâ\200\231m scared for humanity. \n\nIâ\200\231m not sure what to do but Iâ\200\231ve logged out of Fâ\200¦ https://t.co/luOBcjCJFb
## 5                                             The problem of me being on my phone most the time while trying to watch #TheSocialDilemma ðŸ¤¦ðŸ\217¼â\200\215â\231\200ï¸\217
## 6                                                                  #TheSocialDilemma ðŸ\230³ wow!! We need regulations on social media platforms and quick!!
##               hashtags             source is_retweet Sentiment
## 1                         Twitter Web App      False   Neutral
## 2                         Twitter Web App      False   Neutral
## 3                      Twitter for iPhone      False  Positive
## 4 ['TheSocialDilemma'] Twitter for iPhone      False  Negative
## 5 ['TheSocialDilemma'] Twitter for iPhone      False  Positive
## 6 ['TheSocialDilemma'] Twitter for iPhone      False  Positive

str(dataset)

## 'data.frame':    20068 obs. of  14 variables:
##  $ user_name       : chr  "Mari Smith" "Mari Smith" "Varun Tyagi" "Casey Conway" ...
##  $ user_location   : chr  "San Diego, California" "San Diego, California" "Goa, India" "Sydney, New South Wales" ...
##  $ user_description: chr  "Premier Facebook Marketing Expert | Social Media Thought Leader | Keynote Speaker | Dynamic Live Video Host | A"| __truncated__ "Premier Facebook Marketing Expert | Social Media Thought Leader | Keynote Speaker | Dynamic Live Video Host | A"| __truncated__ "Indian | Tech Solution Artist & Hospitality Expert ðŸ’» | Socially Liberal | Travel Enthu | Beer Lover | Passio"| __truncated__ "Head of Diversity & Inclusion @RugbyAU | It's not a tan, I'm Aboriginal. A gay one ðŸ\217³ï¸\217â\200\215ðŸŒ\21"| __truncated__ ...
##  $ user_created    : chr  "2007-09-11 22:22:51" "2007-09-11 22:22:51" "2009-09-06 10:36:01" "2012-12-28 21:45:06" ...
##  $ user_followers  : int  579942 579942 257 11782 278 336 120 696 2180 5011 ...
##  $ user_friends    : int  288625 288625 204 1033 387 616 841 444 1570 2422 ...
##  $ user_favourites : int  11610 11610 475 12219 5850 4748 546 10551 18692 619 ...
##  $ user_verified   : chr  "False" "False" "False" "True" ...
##  $ date            : chr  "2020-09-16 20:55:33" "2020-09-16 20:53:17" "2020-09-16 20:51:57" "2020-09-16 20:51:46" ...
##  $ text            : chr  "@musicmadmarc @SocialDilemma_ @netflix @Facebook I'm also reminded of the very poignant quote by French philoso"| __truncated__ "@musicmadmarc @SocialDilemma_ @netflix @Facebook haa, hey Marc. I get what you're saying &amp; don't agree. ðŸ¤"| __truncated__ "Go watch â\200œThe Social Dilemmaâ\200\235 on Netflix!\n\nItâ\200\231s the best 100 minutes youâ\200\231ll spen"| __truncated__ "I watched #TheSocialDilemma last night. Iâ\200\231m scared for humanity. \n\nIâ\200\231m not sure what to do bu"| __truncated__ ...
##  $ hashtags        : chr  "" "" "" "['TheSocialDilemma']" ...
##  $ source          : chr  "Twitter Web App" "Twitter Web App" "Twitter for iPhone" "Twitter for iPhone" ...
##  $ is_retweet      : chr  "False" "False" "False" "False" ...
##  $ Sentiment       : chr  "Neutral" "Neutral" "Positive" "Negative" ...

We have many columns and observations, so let's wrangle the data.

Text Mining

I will reduce the data to the first 10,000 observations so that my computer can build the DocumentTermMatrix more easily (honestly, that is because my computer is a potato). We also keep only two columns for this article: text and Sentiment.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

dataset = dataset[1:10000,] %>%
  select(text, Sentiment) %>% 
  mutate(Sentiment = as.factor(Sentiment))

head(dataset)

##                                                                                                                                                      text
## 1          @musicmadmarc @SocialDilemma_ @netflix @Facebook I'm also reminded of the very poignant quote by French philosopherâ\200¦ https://t.co/CA52aepW6K
## 2 @musicmadmarc @SocialDilemma_ @netflix @Facebook haa, hey Marc. I get what you're saying &amp; don't agree. ðŸ¤ª\n\nWhicheveâ\200¦ https://t.co/nsVtPHjUs8
## 3           Go watch â\200œThe Social Dilemmaâ\200\235 on Netflix!\n\nItâ\200\231s the best 100 minutes youâ\200\231ll spend in 2020. I bet youðŸ’¯â\200¦ https://t.co/GSWCx3E9tG
## 4  I watched #TheSocialDilemma last night. Iâ\200\231m scared for humanity. \n\nIâ\200\231m not sure what to do but Iâ\200\231ve logged out of Fâ\200¦ https://t.co/luOBcjCJFb
## 5                                             The problem of me being on my phone most the time while trying to watch #TheSocialDilemma ðŸ¤¦ðŸ\217¼â\200\215â\231\200ï¸\217
## 6                                                                  #TheSocialDilemma ðŸ\230³ wow!! We need regulations on social media platforms and quick!!
##   Sentiment
## 1   Neutral
## 2   Neutral
## 3  Positive
## 4  Negative
## 5  Positive
## 6  Positive

Check the proportion of each Sentiment in the data

prop.table(table(dataset$Sentiment))

## 
## Negative  Neutral Positive 
##   0.1809   0.3501   0.4690

Corpus Text Data

Before we build a model with this data, the text needs to be cleaned. Before cleaning, we must convert the text data to a corpus with library(tm).

library(tm)

## Loading required package: NLP

data.corpus = VCorpus(VectorSource(dataset$text))

Text Cleaning

Check the text before cleaning

data.corpus[[122]]$content

## [1] "Although I am a sucker for things that are curated for me and I get so much joy from the perfect #youtube autoplay list. #thesocialdilemma"

transformer = content_transformer(FUN = function(x, pattern){
 gsub(x = x, # text data
      pattern = pattern, # pattern to look for
      replacement = " ") # replace the pattern with a space " "
})

Text cleaning process

In this section we apply a number of transformations to our text data: removing numbers, converting to lowercase, removing stopwords, removing symbols, removing punctuation, stemming, and collapsing repeated whitespace.

data.corpus = tm_map(data.corpus, removeNumbers)
data.corpus = tm_map(data.corpus, content_transformer(tolower))
data.corpus = tm_map(data.corpus, removeWords, stopwords("english"))
data.corpus = tm_map(data.corpus, transformer, "/")
data.corpus = tm_map(data.corpus, transformer, "@")
data.corpus = tm_map(data.corpus, transformer, "-")
data.corpus = tm_map(data.corpus, transformer, "\\.") 
data.corpus = tm_map(data.corpus, removePunctuation)
data.corpus = tm_map(data.corpus, stemDocument)
data.corpus = tm_map(data.corpus, stripWhitespace)

After cleaning the text

data.corpus[[122]]$content

## [1] "although sucker thing curat get much joy perfect youtub autoplay list thesocialdilemma"
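As a small aside on the transformer helper used above, the following sketch uses only base R's gsub() on made-up strings (the strings are assumptions for illustration, not rows from the dataset). It shows what replacing a symbol with a space does, and why the period is written as "\\." in the cleaning steps.

```r
# Made-up example string (not from the dataset)
tweet = "great film/doc @netflix"

# Replace a literal symbol with a space, as the transformer helper does
no_slash = gsub(pattern = "/", replacement = " ", x = tweet)
no_at = gsub(pattern = "@", replacement = " ", x = no_slash)

# An unescaped "." is a regex wildcard and matches every character
wildcard = gsub(pattern = ".", replacement = " ", x = "a.b")

# Escaping it as "\\." matches only a literal period
escaped = gsub(pattern = "\\.", replacement = " ", x = "a.b")

print(no_at)
print(escaped)
```

The doubled spaces that these replacements leave behind are what stripWhitespace collapses at the end of the pipeline.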

Tokenization

In this section we convert the corpus to a Document-Term Matrix (DTM) through a process called tokenization. Tokenization breaks a sentence into several terms. A term can be a single word, a pair of words, and so on. The point is that in the DTM each term (here, a single word) becomes a predictor whose value is the frequency of that word in the document.

data.dtm = DocumentTermMatrix(data.corpus)

# check the data
inspect(data.dtm)

## <<DocumentTermMatrix (documents: 10000, terms: 15634)>>
## Non-/sparse entries: 99077/156240923
## Sparsity           : 100%
## Maximal term length: 220
## Weighting          : term frequency (tf)
## Sample             :
##       Terms
## Docs   dilemma documentari https just media must netflix social
##   2567       0           0     1    0     0    0       1      0
##   2933       1           0     1    0     1    0       0      2
##   4168       1           0     1    0     0    0       0      1
##   4327       0           0     1    0     0    0       0      0
##   4623       0           0     1    0     0    0       1      0
##   5179       0           0     1    0     0    0       0      0
##   5243       0           0     1    1     1    0       0      1
##   5660       0           0     1    0     0    0       0      0
##   5911       1           0     1    0     1    0       0      2
##   6033       0           0     1    0     0    0       1      1
##       Terms
## Docs   thesocialdilemma watch
##   2567                1     2
##   2933                0     1
##   4168                0     0
##   4327                0     0
##   4623                0     0
##   5179                0     0
##   5243                0     1
##   5660                0     0
##   5911                0     0
##   6033                0     1
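To make the idea of a DTM more concrete, here is a minimal sketch on three made-up documents (the toy_* names and the sentences are assumptions for illustration only). Each row is a document, each column is a term, and each cell is how often that term occurs in that document.

```r
library(tm)

# Three made-up documents (not from the dataset)
toy_docs = c("social media scary",
             "watch social dilemma",
             "scary scary film")

# Same steps as in the article: text -> corpus -> document-term matrix
toy_corpus = VCorpus(VectorSource(toy_docs))
toy_dtm = DocumentTermMatrix(toy_corpus)

# Convert the sparse DTM to a dense matrix so it is easy to read
toy_matrix = as.matrix(toy_dtm)
print(toy_matrix)
```

In this toy matrix the third document has a count of 2 under the term "scary", because the word occurs twice in that document.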

We can look at which words occur at least 500 times across all our data. These frequent words are candidates for good predictors.

dataFreq = findFreqTerms(x = data.dtm, lowfreq = 500)
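One detail worth keeping in mind is that lowfreq in findFreqTerms() is a bound on the total count of a term over all documents, which is not the same as the number of documents that contain it. The sketch below illustrates the difference on made-up documents (the toy_* and by_* names are assumptions for illustration).

```r
library(tm)

# Made-up documents: "scary" occurs 3 times but only in one document
toy_docs = c("scary scary scary", "good film", "good film")
toy_dtm = DocumentTermMatrix(VCorpus(VectorSource(toy_docs)))

# findFreqTerms() thresholds on the total count of each term
by_total_count = findFreqTerms(x = toy_dtm, lowfreq = 3)

# Counting the documents that contain a term is a different measure
doc_frequency = colSums(as.matrix(toy_dtm) > 0)
by_doc_count = names(doc_frequency[doc_frequency >= 2])

print(by_total_count)
print(by_doc_count)
```

Here "scary" passes the total-count threshold even though it appears in a single document, while "good" and "film" appear in two documents but only twice in total.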

Cross Validation

In this section we split the data into a train set and a test set, with 75% of the data going to the train set.

RNGkind(sample.kind = "Rounding")

## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used

set.seed(100)

index = sample(nrow(data.dtm), nrow(data.dtm)*0.75)

data_train = data.dtm[index, ]
data_test = data.dtm[-index, ]

Preparing the target labels

label_train = dataset[index, "Sentiment"] 
label_test = dataset[-index, "Sentiment"]

Check the class proportions of the train labels

round(prop.table(table(label_train)),2)

## label_train
## Negative  Neutral Positive 
##     0.18     0.35     0.46
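The random split above happens to keep the class proportions close to those of the full data. If one wanted to guarantee that, a stratified split is one option. The following is only a sketch in base R on made-up labels (toy_labels, idx_by_class and train_idx are hypothetical names, and this is not the method used in this article).

```r
set.seed(100)

# Made-up labels with roughly the class mix seen above
toy_labels = factor(sample(c("Negative", "Neutral", "Positive"),
                           size = 1000, replace = TRUE,
                           prob = c(0.18, 0.35, 0.47)))

# Row indices grouped by class
idx_by_class = split(seq_along(toy_labels), toy_labels)

# Draw 75% of the indices inside each class
train_idx = unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = floor(0.75 * length(idx)))]
}), use.names = FALSE)

# Compare class proportions: all data vs. the stratified train part
print(round(prop.table(table(toy_labels)), 2))
print(round(prop.table(table(toy_labels[train_idx])), 2))
```

Because the sampling is done within each class, the train proportions can differ from the overall ones only through rounding.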

Further Data Pre-processing

Let's check the dimensions of the train data before we use it to build a model.

inspect(data_train)

## <<DocumentTermMatrix (documents: 7500, terms: 15634)>>
## Non-/sparse entries: 74179/117180821
## Sparsity           : 100%
## Maximal term length: 220
## Weighting          : term frequency (tf)
## Sample             :
##       Terms
## Docs   dilemma documentari https just media must netflix social
##   1391       0           1     1    0     0    0       0      0
##   2567       0           0     1    0     0    0       1      0
##   2933       1           0     1    0     1    0       0      2
##   4623       0           0     1    0     0    0       1      0
##   5179       0           0     1    0     0    0       0      0
##   5243       0           0     1    1     1    0       0      1
##   5660       0           0     1    0     0    0       0      0
##   6033       0           0     1    0     0    0       1      1
##   6518       0           0     1    0     0    0       0      0
##   8460       0           0     1    1     0    0       0      0
##       Terms
## Docs   thesocialdilemma watch
##   1391                0     0
##   2567                1     2
##   2933                0     1
##   4623                0     0
##   5179                0     0
##   5243                0     1
##   5660                0     0
##   6033                0     1
##   6518                0     0
##   8460                0     0

The columns (words) we have for prediction are numerous. To reduce noise (words that rarely appear), we will use only words that appear quite often: at least 20 times in the train data.

data_freq = findFreqTerms(x = data_train, lowfreq = 20)
head(data_freq)

## [1] "â\200¦"         "â\200\230"         "â\200\230usersâ\200\231" "â\200”"         "â\200\235"        
## [6] "â\200œ"

We take only the words that appear in data_freq

data_train = data_train[, data_freq]

inspect(data_train)

## <<DocumentTermMatrix (documents: 7500, terms: 483)>>
## Non-/sparse entries: 51645/3570855
## Sparsity           : 99%
## Maximal term length: 19
## Weighting          : term frequency (tf)
## Sample             :
##       Terms
## Docs   dilemma documentari https just media must netflix social
##   1249       1           1     1    1     0    0       1      1
##   5815       0           0     1    0     0    0       0      0
##   6566       1           0     0    1     0    0       0      1
##   7064       0           0     0    0     0    0       0      0
##   8317       0           0     1    1     0    0       1      1
##   8350       4           0     1    0     0    0       0      4
##   8921       0           0     0    0     1    0       0      1
##   9099       0           0     0    0     1    0       1      1
##   9191       0           0     1    0     1    0       0      1
##   9760       0           0     1    1     0    0       0      0
##       Terms
## Docs   thesocialdilemma watch
##   1249                0     2
##   5815                0     0
##   6566                1     0
##   7064                1     0
##   8317                0     1
##   8350                0     5
##   8921                1     1
##   9099                1     0
##   9191                0     0
##   9760                1     1

We can see that data_train now has 7500 documents (texts) and 483 terms.

Bernoulli Converter

In this section we convert the counts in our DTM to just 0 and 1:
* if a word appears at least once (count >= 1), the value becomes 1
* if a word does not appear (count = 0), the value becomes 0

# custom helper function: 1 if the word appears, 0 otherwise
bernoulli_conv = function(x){
  as.factor(as.numeric(x > 0))
}

# memory.size() and memory.limit() are Windows-only and no longer supported as of R 4.2
memory.size(max = TRUE)

## [1] 170.25

memory.limit(size=52000)

## [1] 52000

data_train_bn = apply(X = data_train, MARGIN = 2, FUN = bernoulli_conv)
# keep only the terms used for training before converting the test set
data_test_bn = apply(X = data_test[, data_freq], MARGIN = 2, FUN = bernoulli_conv)
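To see what this conversion does, here is a minimal sketch on a made-up count matrix (toy_counts, to_binary and toy_binary are hypothetical names used only for illustration). Any count above zero becomes "1", and a zero stays "0".

```r
# Made-up term counts: 2 documents x 3 terms (filled column by column)
toy_counts = matrix(c(0, 2,   # watch
                      1, 0,   # scary
                      0, 5),  # film
                    nrow = 2,
                    dimnames = list(c("doc1", "doc2"),
                                    c("watch", "scary", "film")))

# Same idea as bernoulli_conv: presence (1) or absence (0) of a term
to_binary = function(x) as.character(as.numeric(x > 0))

# Apply the conversion to every column (term)
toy_binary = apply(X = toy_counts, MARGIN = 2, FUN = to_binary)
print(toy_binary)
```

Note that the count of 5 for "film" in the second document is reduced to a plain "1": the Bernoulli view only records whether a word is present, not how often.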

Modelling Naive Bayes

We will just build a simple model, without any extra tuning.

# fit the naive Bayes model
library(e1071)
model_naive = naiveBayes(x = data_train_bn, # predictor data
                          y = label_train, # target data
                          laplace = 1)
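The laplace = 1 argument asks for add-one (Laplace) smoothing, so that a word never seen with a class does not get a probability of exactly zero. The sketch below works through the textbook add-one formula by hand on made-up data (the vectors scary and label and the object smoothed are assumptions for illustration, and this is not a reproduction of e1071's internals).

```r
# Made-up data: does the word "scary" appear (1) or not (0) in each tweet?
scary = c(1, 1, 0, 1, 0, 0, 0, 0)
label = factor(c("Negative", "Negative", "Negative", "Negative",
                 "Positive", "Positive", "Positive", "Positive"))

laplace = 1

# Counts per class (rows) and word value (columns "0" and "1")
tab = table(label, scary)

# Add-one smoothing: add `laplace` to every cell, and add
# laplace * (number of possible values) to each class total
smoothed = (tab + laplace) / (rowSums(tab) + laplace * ncol(tab))
print(round(smoothed, 3))
```

Without smoothing, P(scary = 1 | Positive) would be 0/4 = 0. With add-one smoothing it becomes (0 + 1) / (4 + 2) = 1/6, so a single unseen word cannot rule out a class on its own.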

Predicting

I want both the class predictions and the raw probabilities, so that we can check them with the confusion matrix and with ROC and AUC.

# class predictions first, raw probabilities second
data_predClass = predict(object = model_naive, 
                         newdata = data_test_bn,
                         type = "class")

data_predRaw = predict(object = model_naive, 
                         newdata = data_test_bn,
                         type = "raw")

head(data_predClass)

## [1] Neutral  Neutral  Positive Neutral  Positive Negative
## Levels: Negative Neutral Positive

Confusion Matrix

# confusion matrix with caret
library(caret)

## Loading required package: lattice

## Loading required package: ggplot2

## 
## Attaching package: 'ggplot2'

## The following object is masked from 'package:NLP':
## 
##     annotate

confusionMatrix(data = data_predClass, # predicted classes
                reference = label_test)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Negative Neutral Positive
##   Negative      190      31       74
##   Neutral        97     740      213
##   Positive      143      95      917
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7388          
##                  95% CI : (0.7211, 0.7559)
##     No Information Rate : 0.4816          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.573           
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: Negative Class: Neutral Class: Positive
## Sensitivity                   0.4419         0.8545          0.7616
## Specificity                   0.9493         0.8103          0.8164
## Pos Pred Value                0.6441         0.7048          0.7939
## Neg Pred Value                0.8912         0.9131          0.7866
## Prevalence                    0.1720         0.3464          0.4816
## Detection Rate                0.0760         0.2960          0.3668
## Detection Prevalence          0.1180         0.4200          0.4620
## Balanced Accuracy             0.6956         0.8324          0.7890

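In this summary we get an accuracy of 73.9%, well above the no-information rate of 48.2%, which means the model is quite good. Note, however, that the sensitivity for the Negative class is only 0.44, so many negative tweets are missed. Next, let's look at ROC and AUC.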
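The headline numbers in the summary can be recomputed by hand from the confusion matrix. The sketch below copies the counts from the output above into a plain matrix (cm, accuracy, sensitivity and pos_pred_value are names chosen here for illustration) and derives accuracy, per-class sensitivity, and positive predictive value.

```r
# Counts copied from the confusion matrix above (rows = prediction, columns = reference)
classes = c("Negative", "Neutral", "Positive")
cm = matrix(c(190,  97, 143,   # reference: Negative
               31, 740,  95,   # reference: Neutral
               74, 213, 917),  # reference: Positive
            nrow = 3,
            dimnames = list(Prediction = classes, Reference = classes))

# Accuracy: correct predictions (the diagonal) over all predictions
accuracy = sum(diag(cm)) / sum(cm)

# Sensitivity per class: correct predictions over all true members of the class
sensitivity = diag(cm) / colSums(cm)

# Positive predictive value per class: correct predictions over all predictions of the class
pos_pred_value = diag(cm) / rowSums(cm)

print(round(accuracy, 4))
print(round(sensitivity, 4))
print(round(pos_pred_value, 4))
```

The low sensitivity for Negative comes from the first column: of the 430 truly negative tweets, only 190 are predicted as Negative.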

ROC and AUC for every target

ROC and AUC for the Positive probability

library(ROCR)

# create the prediction object
data_roc_positive = prediction(predictions = data_predRaw[,3], # probability of the Positive class
                        labels = as.numeric(label_test == "Positive")) # 1 if the true label is Positive

# create the performance object from the prediction object
perfPositive = performance(prediction.obj = data_roc_positive,
                    measure = "tpr", # tpr = true positive rate
                    x.measure = "fpr") #fpr = false positive rate

# draw the plot
plot(perfPositive)
abline(0,1, lty = 2)

ROC and AUC for the Neutral probability

data_roc_neutral = prediction(predictions = data_predRaw[,2], # probability of the Neutral class
                        labels = as.numeric(label_test == "Neutral")) # 1 if the true label is Neutral

# create the performance object for the Neutral class
perfNeutral = performance(prediction.obj = data_roc_neutral,
                    measure = "tpr", # tpr = true positive rate
                    x.measure = "fpr") #fpr = false positive rate

plot(perfNeutral)
abline(0,1, lty = 2)

ROC and AUC for the Negative probability

data_roc_negative = prediction(predictions = data_predRaw[,1], # probability of the Negative class
                               labels = as.numeric(label_test == "Negative")) # 1 if the true label is Negative

# create the performance object for the Negative class
perfNegative = performance(prediction.obj = data_roc_negative,
                           measure = "tpr", # tpr = true positive rate
                           x.measure = "fpr") #fpr = false positive rate

plot(perfNegative)
abline(0,1, lty = 2)
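The plots above show the ROC curves, but the AUC value itself is not computed in this article. As a sketch of how it could be obtained with ROCR, the example below uses made-up scores and labels (toy_scores, toy_labels, toy_pred and toy_auc are hypothetical names). With the real data, one would presumably pass an object such as data_roc_positive to performance() instead.

```r
library(ROCR)

# Made-up predicted probabilities and true labels (1 = class of interest)
toy_scores = c(0.9, 0.8, 0.7, 0.4, 0.3, 0.1)
toy_labels = c(1, 1, 0, 1, 0, 0)

# Same prediction object as used for the ROC curves
toy_pred = prediction(predictions = toy_scores, labels = toy_labels)

# measure = "auc" returns the area under the ROC curve
toy_auc = performance(prediction.obj = toy_pred, measure = "auc")@y.values[[1]]
print(toy_auc)
```

In this toy case 8 of the 9 possible (positive, negative) pairs are ranked correctly, so the AUC is 8/9. A value of 0.5 corresponds to the dashed diagonal in the plots, and 1 to a perfect ranking.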

Make a Better Model

In this part I will just describe how we could make the model better: by changing lowfreq in findFreqTerms(). You can set it to whatever value you want. Earlier I tried changing that parameter to 50 and got an accuracy of 70.6%, which is lower than the 73.9% with lowfreq = 20, so smaller values may be worth trying. If you have some free time, go ahead and change the lowfreq parameter. Why don't I show how to make a better model? Because my computer takes a very long time to compute it and, even worse, my computer will...

Conclusion

Before drawing a conclusion, I hope you have all watched the film. My conclusion is this: if I were someone who knows technologies like AI, machine learning, deep learning or something similar, I would express a positive sentiment, or at least approach the film with that kind of thinking. That is because we all know we live in a world of technology, which means we understand how technology works to make life easier, or to support a business. It is a powerful tool, and we use it all the time and everywhere.
