1 Introduction

In a rapidly evolving information age, social media platforms like TikTok face challenges in maintaining content quality. This dataset, sourced from Kaggle, which contains user reports of content claimed to contain inaccurate information, provides an empirical foundation for the development of data-driven solutions. Prediction models built on this dataset are expected to be a step forward in smarter and more responsive content moderation efforts.

This dataset is designed to support research in data analytics and machine learning. By analyzing reports of TikTok users flagging content with claims, this dataset enables the development of prediction models capable of automating the process of identifying claims in videos. This is expected to contribute to improving the efficiency of content moderation on the TikTok platform.

2 Business Question

As a data scientist, we are expected to automate the claims-based video identification process on TikTok to reduce moderator workload and improve content review efficiency. And develop machine learning models to prioritize user reports based on the likelihood of flagged content containing claims.

3 Data Preparation

This dataframe offers comprehensive details on the reports submitted by specific TikTok users. A user’s report is represented by each row, which provides details about the reported content.

To understand more about the dataset we can look into its metadata.

Column Name Type Description
1. claim_status factor Indicates whether the video has been reported for containing a claim (0) or opinion (1).
2. video_id double Unique identifier for each video.
3. video_duration_sec numeric Duration of the video in seconds.
4. video_transcription_text char Transcription of the video’s audio content.
5. verified_status factor Indicates whether the video creator’s account is verified (e.g., ‘Verified’, ‘Not Verified’).
6. author_ban_status factor Indicates if the video’s author has been banned (e.g., ‘active’, ‘banned’, ‘under review’).
7. video_view_count numeric Total number of views the video has received.
8. video_like_count numeric Total number of likes the video has received.
9. video_share_count numeric Total number of times the video has been shared.
10. video_download_count numeric Total number of times the video has been downloaded.
11. video_comment_count numeric Total number of comments on the video.

3.1 Import

Result

Libraries
# Data wrangling
library(dplyr)
library(tidyr)

# Neural Network
library(neuralnet)

# Model evaluation
library(caret)

# Set graphic theme
theme_set(theme_minimal())
options(scipen = 999)
Read & Inspect Data
tiktok <- read.csv("data_input/tiktok_dataset.csv")

The first step is to do is to investigate the imported dataset, because we want to observe the initial and final data of the tiktok dataset. We use the head() and tail() functions.

tiktok %>% 
  head()
tail(tiktok)
Structure Data

To find out the suitable data type, it is checked first with the `glimpse() function.

tiktok %>% 
  glimpse()
#> Rows: 19,382
#> Columns: 12
#> $ X.                       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14…
#> $ claim_status             <chr> "claim", "claim", "claim", "claim", "claim", …
#> $ video_id                 <dbl> 7017666017, 4014381136, 9859838091, 186684799…
#> $ video_duration_sec       <int> 59, 32, 31, 25, 19, 35, 16, 41, 50, 45, 47, 3…
#> $ video_transcription_text <chr> "someone shared with me that drone deliveries…
#> $ verified_status          <chr> "not verified", "not verified", "not verified…
#> $ author_ban_status        <chr> "under review", "active", "active", "active",…
#> $ video_view_count         <chr> "343296", "140877", "902185", "437506", "5616…
#> $ video_like_count         <chr> "19425", "77355", "97690", "239954", "34987",…
#> $ video_share_count        <chr> "241", "19034", "2858", "34812", "4110", "623…
#> $ video_download_count     <chr> "1", "1161", "833", "1234", "547", "4293", "8…
#> $ video_comment_count      <chr> "0", "684", "329", "584", "152", "1857", "544…

3.2 Data Cleansing

To select the most relevant predictor variables, consider the relationship between them and claim_status. Here are some variables that could potentially influence the model:
1. video_duration_sec: Longer videos may be more likely to contain complex claims or opinions.
2. verified_status: Verified accounts may have higher credibility, so their claims are more likely to be believed.
3. author_ban_status: Banned users may be more likely to spread misinformation.
4. video_view_count, video_like_count, video_share_count, video_comment_count, video_download_count: These popularity metrics can indicate how viral a video is and how much interaction it generates, which may correlate with the likelihood of a claim.

We will select several columns that will be used to create a neural network model. The columns used are as follows:
1. claim_status,
2. video_duration_sec,
3. verified_status,
4. author_ban_status,
5. video_view_count,
6, video_like_count,
7. video_share_count,
8. video_comment_count,
9. video_download_count.

tiktok_cl <- 
  tiktok %>% 
    select(c('claim_status', 'video_duration_sec', 'verified_status', 'author_ban_status', 'video_view_count', 'video_like_count', 'video_share_count', 'video_download_count', 'video_comment_count'))
  
head(tiktok_cl)

Columns that need to be adjusted to numeric type:
1. claim_status,
2. video_duration_sec,
3. verified_status,
4. author_ban_status,
5. video_view_count,
6, video_like_count,
7. video_share_count,
8. video_comment_count,
9. video_download_count.

tiktok_clean <- tiktok_cl %>%
  mutate(
    claim_status = case_when(
      claim_status == "claim" ~ 0,
      claim_status == "opinion" ~ 1
    ),
    verified_status = case_when(
      verified_status == "not verified" ~ 1,
      verified_status == "verified" ~ 2
    ),
    author_ban_status = case_when(
      author_ban_status == "active" ~ 1,
      author_ban_status == "banned" ~ 2,
      author_ban_status == "under review" ~ 3
    ),
    across(c('claim_status', 'video_view_count', 'video_like_count',  'video_share_count', 'video_download_count', 'video_comment_count', 'verified_status', 'author_ban_status'), as.integer)
  )

head(tiktok_clean)
tiktok_clean %>% 
  glimpse()
#> Rows: 19,382
#> Columns: 9
#> $ claim_status         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
#> $ video_duration_sec   <int> 59, 32, 31, 25, 19, 35, 16, 41, 50, 45, 47, 30, 5…
#> $ verified_status      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1…
#> $ author_ban_status    <int> 3, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1…
#> $ video_view_count     <int> 343296, 140877, 902185, 437506, 56167, 336647, 75…
#> $ video_like_count     <int> 19425, 77355, 97690, 239954, 34987, 175546, 48619…
#> $ video_share_count    <int> 241, 19034, 2858, 34812, 4110, 62303, 193911, 50,…
#> $ video_download_count <int> 1, 1161, 833, 1234, 547, 4293, 8616, 22, 53, 4104…
#> $ video_comment_count  <int> 0, 684, 329, 584, 152, 1857, 5446, 11, 27, 2540, …

The data types are all correct, then we go to the EDA stage.

4 Exploratory Data Analysis (EDA)

Result

Summary

In order to understand the dataframe, we can use summary function:

summary(tiktok_clean)
#>   claim_status    video_duration_sec verified_status author_ban_status
#>  Min.   :0.0000   Min.   : 5.00      Min.   :1.000   Min.   :1.000    
#>  1st Qu.:0.0000   1st Qu.:18.00      1st Qu.:1.000   1st Qu.:1.000    
#>  Median :0.0000   Median :32.00      Median :1.000   Median :1.000    
#>  Mean   :0.4965   Mean   :32.42      Mean   :1.064   Mean   :1.299    
#>  3rd Qu.:1.0000   3rd Qu.:47.00      3rd Qu.:1.000   3rd Qu.:1.000    
#>  Max.   :1.0000   Max.   :60.00      Max.   :2.000   Max.   :3.000    
#>  NA's   :298                                                          
#>  video_view_count video_like_count   video_share_count video_download_count
#>  Min.   :    20   Min.   :     0.0   Min.   :     0    Min.   :    0       
#>  1st Qu.:  4942   1st Qu.:   810.8   1st Qu.:   115    1st Qu.:    7       
#>  Median :  9954   Median :  3403.5   Median :   717    Median :   46       
#>  Mean   :254709   Mean   : 84304.6   Mean   : 16735    Mean   : 1049       
#>  3rd Qu.:504327   3rd Qu.:125020.0   3rd Qu.: 18222    3rd Qu.: 1156       
#>  Max.   :999817   Max.   :657830.0   Max.   :256130    Max.   :14994       
#>  NA's   :298      NA's   :298        NA's   :298       NA's   :298         
#>  video_comment_count
#>  Min.   :   0.0     
#>  1st Qu.:   1.0     
#>  Median :   9.0     
#>  Mean   : 349.3     
#>  3rd Qu.: 292.0     
#>  Max.   :9599.0     
#>  NA's   :298

You can see the contents of the summary of the tiktok_clean dataset, value of the min, mean, median, max, 1st Qu, 1st Qu and NA’s of each column.

Missing Value

To ensure that this dataset has missing values, we check using the is.na() function.

tiktok_clean %>%
  is.na() %>%
  colSums()
#>         claim_status   video_duration_sec      verified_status 
#>                  298                    0                    0 
#>    author_ban_status     video_view_count     video_like_count 
#>                    0                  298                  298 
#>    video_share_count video_download_count  video_comment_count 
#>                  298                  298                  298

Insights: There are missing values in the dataset

The total observations from the dataframe are 19382 and there are missing values in some columns of 298 rows. You can see data_na below, which is used to see that the missing values are in the same row between several columns.

data_na <- 
tiktok_clean %>%
  filter(is.na(video_view_count) | is.na(video_like_count) | is.na(video_share_count) | 
         is.na(video_download_count) | is.na(video_comment_count) | is.na(claim_status))
head(data_na)
Drop Missing Value

Next, we will drop the rows that have the missing value.

tiktok_fix <- na.omit(tiktok_clean)

tiktok_fix %>% 
  is.na() %>%
  colSums()
#>         claim_status   video_duration_sec      verified_status 
#>                    0                    0                    0 
#>    author_ban_status     video_view_count     video_like_count 
#>                    0                    0                    0 
#>    video_share_count video_download_count  video_comment_count 
#>                    0                    0                    0

Insight: Missing values no longer exist in the tiktok_fix dataframe.

We make sure once again that the dataset has the correct data type.

tiktok_fix %>% 
  glimpse()
#> Rows: 19,084
#> Columns: 9
#> $ claim_status         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
#> $ video_duration_sec   <int> 59, 32, 31, 25, 19, 35, 16, 41, 50, 45, 47, 30, 5…
#> $ verified_status      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1…
#> $ author_ban_status    <int> 3, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1…
#> $ video_view_count     <int> 343296, 140877, 902185, 437506, 56167, 336647, 75…
#> $ video_like_count     <int> 19425, 77355, 97690, 239954, 34987, 175546, 48619…
#> $ video_share_count    <int> 241, 19034, 2858, 34812, 4110, 62303, 193911, 50,…
#> $ video_download_count <int> 1, 1161, 833, 1234, 547, 4293, 8616, 22, 53, 4104…
#> $ video_comment_count  <int> 0, 684, 329, 584, 152, 1857, 5446, 11, 27, 2540, …

4.1 Cross Validation

Result

Data Splitting

Next, we will divide the dataset tiktok_fix into train (train_tiktok) and test (test_tiktok) datasets, maintaining an 80%:20% ratio using training() and testing() functions.

library(rsample)

RNGkind(sample.kind = "Rounding")
set.seed(123)

#split with proportion 80:20
splitter <- initial_split(data = tiktok_fix, prop = 0.8)

# extract to dataframe
train_tiktok <- training(splitter)
test_tiktok <- testing(splitter)

head(train_tiktok)
Check for Imbalance Class

It’s important to check the class distribution in train_tiktok$claim_status before proceeding. This helps mitigate potential bias in the model.

table(train_tiktok$claim_status) %>% 
  prop.table()
#> 
#>         0         1 
#> 0.5037663 0.4962337

Insights :

The dataset exhibits class balance, with 50.37% of samples classified as Claim (0) and 49.62% as Opinion (1).

5 Neural Network

Result

Neural Network Model

We try to create a Neural Network model for the classification case with the target claim_status with Indicates whether the video has been reported for containing a claim (0) or opinion (1).

Using the neuralnet function with parameters :

  • formula = claim_status ~ .
    +claim_status: The target variable to be predicted (i.e., whether a video contains a claim or not).
    • ~ .: This means that all predictor variables (features) in the train_tiktok dataset will be used as input for the model.
  • data = train_tiktok : The dataset used to train the model.
  • hidden = c(2,1) : It means determining the number of neurons in each hidden layer. In this case, there are 2 hidden layers with 2 and 1 neurons respectively.
  • linear.output: FALSE : This indicates that the output of the neural network is not a linear numerical value, but a probability (because it uses a logistic activation function).
  • err.fct: "ce"️ (Cross-Entrop) : It means error/loss function, the value used to measure the amount of error for binary classification.
  • act.fct: “logistic” ️: This means that the activation function used in the output layer is logistic, which produces a value between 0 and 1, suitable for interpretation as a probability.
set.seed(100)
tiktok_nn <- neuralnet(formula = claim_status ~ .,
                           data = train_tiktok,
                           hidden = c(2,1),  
                           linear.output = FALSE,
                           err.fct = "ce",
                           act.fct = "logistic")

The probability values generated by the model can be used to make decisions:

  • If probability > 0.5: The video is classified as containing a claim.
  • If probability <= 0.5: The video is classified as containing no claims or opinions
Predict Neural Network Model

To predict the classification of claim_status using the model that has been created, like other machine learning models neuralnet using the function predict(). We try to predict the classification case using the model that has been created.

  • object = tiktok_nn (NN model)
  • newdata = test_tiktok: variable predictor data to be predicted, namely test data
test_tiktok$pred_nn <- predict(object = tiktok_nn,
                newdata = test_tiktok)

test_tiktok %>% 
  select('claim_status', 'pred_nn') %>% 
  head()

From the prediction results in the form of the above odds, it can be class 1 or 0 by determining the boundaries.

Convert odds to categorical with the function ifelse().
- test_tiktok: tested condition
- 0: claim
- 1: opinion

# conversion of odds into classes

test_tiktok$pred_label_nn <- ifelse(test_tiktok$pred_nn > 0.5, 1 , 0 )

test_tiktok %>% 
  select('claim_status', 'pred_nn', 'pred_label_nn') %>% 
  head()
Evaluation Neural Network Model
confusion_matrix_neural <- confusionMatrix(
                        data = as.factor(test_tiktok$pred_label_nn),
                        reference = as.factor(test_tiktok$claim_status),
                        positive = "0")
print(confusion_matrix_neural)
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction    0    1
#>          0 1887    0
#>          1   30 1900
#>                                                
#>                Accuracy : 0.9921               
#>                  95% CI : (0.9888, 0.9947)     
#>     No Information Rate : 0.5022               
#>     P-Value [Acc > NIR] : < 0.00000000000000022
#>                                                
#>                   Kappa : 0.9843               
#>                                                
#>  Mcnemar's Test P-Value : 0.0000001192         
#>                                                
#>             Sensitivity : 0.9844               
#>             Specificity : 1.0000               
#>          Pos Pred Value : 1.0000               
#>          Neg Pred Value : 0.9845               
#>              Prevalence : 0.5022               
#>          Detection Rate : 0.4944               
#>    Detection Prevalence : 0.4944               
#>       Balanced Accuracy : 0.9922               
#>                                                
#>        'Positive' Class : 0                    
#> 

Insight :

Based on these metrics, the neural network model appears to be highly accurate and effective in predicting claim status on TikTok videos.
- The overall accuracy of 99.21% suggests that the model is making correct predictions most of the time.
- The model is very good at identifying both claims and non-claims, indicating that it’s not biased towards one class over the other, so Sensitivity and Specificity is excellent.
- Every prediction of a claim is indeed a claim, suggesting high confidence in the model’s positive predictions.
Overall, the confusion matrix and statistics indicate that the neural network model is performing exceptionally well in predicting claim status on TikTok videos.

6 Logistic Regression

Result

Logistic Regression Model

After creating a neural network model and its predictions, we want to compare when analyzing using logistic regression as a comparison of the results. Whether the logistic regression model or the neural network model has the best accuracy for modeling this Tiktok dataset.

Create a logistic regression model with the glm() function.

# Model Logistic with all predictor

model_logistic <- glm(claim_status ~ .,
                      data = train_tiktok,
                      family = "binomial")
model_logistic
#> 
#> Call:  glm(formula = claim_status ~ ., family = "binomial", data = train_tiktok)
#> 
#> Coefficients:
#>          (Intercept)    video_duration_sec       verified_status  
#>           5.78147169           -0.00718885            2.27575307  
#>    author_ban_status      video_view_count      video_like_count  
#>          -0.45249587           -0.00037573           -0.00048029  
#>    video_share_count  video_download_count   video_comment_count  
#>          -0.00001337            0.02485430           -0.09643754  
#> 
#> Degrees of Freedom: 15266 Total (i.e. Null);  15258 Residual
#> Null Deviance:       21160 
#> Residual Deviance: 1108  AIC: 1126
# Model Logistic Null predictor

model_log_null <- glm(claim_status ~  1,
                      data = train_tiktok,
                      family = "binomial")

model_log_null
#> 
#> Call:  glm(formula = claim_status ~ 1, family = "binomial", data = train_tiktok)
#> 
#> Coefficients:
#> (Intercept)  
#>    -0.01507  
#> 
#> Degrees of Freedom: 15266 Total (i.e. Null);  15266 Residual
#> Null Deviance:       21160 
#> Residual Deviance: 21160     AIC: 21170

We want to build a logistic regression model with Step-wise Regression with both methods (combination of Backward Elimination and Forward Selection).

The stepwise regression process uses the step() function, by filling in some parameters: model_log_null as the object, and “both” as the direction. For the Both Selection method, we need to define the scope parameter to indicate the maximum upper limit of predictor combinations with model_logistic.

model_tiktok_both <- step(object = model_log_null,
                      direction = "both",
                      scope = list(upper= model_logistic),
                      trace=F)
model_tiktok_both
#> 
#> Call:  glm(formula = claim_status ~ video_view_count + video_like_count + 
#>     verified_status + video_comment_count + video_download_count + 
#>     author_ban_status, family = "binomial", data = train_tiktok)
#> 
#> Coefficients:
#>          (Intercept)      video_view_count      video_like_count  
#>            5.5409881            -0.0003746            -0.0004903  
#>      verified_status   video_comment_count  video_download_count  
#>            2.2627162            -0.0970922             0.0253348  
#>    author_ban_status  
#>           -0.4424207  
#> 
#> Degrees of Freedom: 15266 Total (i.e. Null);  15260 Residual
#> Null Deviance:       21160 
#> Residual Deviance: 1110  AIC: 1124
Predict Logistic Regression

Create the prediction from model_tiktok_both, by filling the parameters of the test data = test_tiktok.

test_tiktok$pred_log <- predict(object = model_tiktok_both,
        newdata = test_tiktok,
        type = "response")

test_tiktok %>% 
  select('claim_status','pred_log') %>% 
  head()
# Convert odds into classes

test_tiktok$pred_label_log <- ifelse(test_tiktok$pred_log > 0.5, 1 , 0 )
test_tiktok %>% 
  select('claim_status','pred_log', 'pred_label_log') %>% 
  head()
Evaluate logistic regression
# confusion matrix
library(caret)
confusionMatrix(data = as.factor(test_tiktok$claim_status), 
                reference = as.factor(test_tiktok$pred_label_log), 
                positive = "0")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction    0    1
#>          0 1904   13
#>          1    0 1900
#>                                                
#>                Accuracy : 0.9966               
#>                  95% CI : (0.9942, 0.9982)     
#>     No Information Rate : 0.5012               
#>     P-Value [Acc > NIR] : < 0.00000000000000022
#>                                                
#>                   Kappa : 0.9932               
#>                                                
#>  Mcnemar's Test P-Value : 0.0008741            
#>                                                
#>             Sensitivity : 1.0000               
#>             Specificity : 0.9932               
#>          Pos Pred Value : 0.9932               
#>          Neg Pred Value : 1.0000               
#>              Prevalence : 0.4988               
#>          Detection Rate : 0.4988               
#>    Detection Prevalence : 0.5022               
#>       Balanced Accuracy : 0.9966               
#>                                                
#>        'Positive' Class : 0                    
#> 

Insight :

Based on these metrics, the logistic regression model with stepwise selection appears to be highly accurate and effective in predicting claim status on TikTok videos.
- The overall accuracy of 99.66% is even higher than the previous model, suggesting that the stepwise selection process effectively identified the most important features for prediction.
- Excellent Sensitivity and Specificity, this model is very good at identifying both claims and non-claims, indicating that it’s not biased towards one class over the other.
- This model has a high precision and recall, indicating that it’s both accurate in identifying claims and avoiding false positives.
Overall, the confusion matrix and statistics indicate that the logistic regression model with stepwise selection is a very effective and accurate model for predicting claim status on TikTok videos.

7 Conclusion

Based on the performance metrics, both the neural network and logistic regression models are well-suited to automate the claims-based video identification process and prioritize user reports on TikTok. These models can significantly reduce the workload on moderators and improve the efficiency of content review by accurately identifying claim-based videos and prioritizing reports accordingly.