In a rapidly evolving information age, social media platforms like TikTok face challenges in maintaining content quality. This dataset, sourced from Kaggle, contains user reports of content claimed to contain inaccurate information and provides an empirical foundation for developing data-driven solutions. Prediction models built on this dataset are expected to be a step toward smarter and more responsive content moderation.
This dataset is designed to support research in data analytics and machine learning. By analyzing reports in which TikTok users flag content as containing claims, we can develop prediction models capable of automating the identification of claims in videos, which is expected to improve the efficiency of content moderation on the TikTok platform.
As data scientists, our task is to automate the claims-based video identification process on TikTok to reduce moderator workload and improve content review efficiency, and to develop machine learning models that prioritize user reports based on the likelihood that the flagged content contains a claim.
This dataframe provides comprehensive details on the reports submitted by TikTok users. Each row represents a single user report and describes the reported content.
To understand the dataset better, we can look at its metadata.
| Column Name | Type | Description |
|---|---|---|
| 1. claim_status | factor | Indicates whether the video has been reported as containing a claim or an opinion (later encoded as 0 = claim, 1 = opinion). |
| 2. video_id | double | Unique identifier for each video. |
| 3. video_duration_sec | numeric | Duration of the video in seconds. |
| 4. video_transcription_text | char | Transcription of the video’s audio content. |
| 5. verified_status | factor | Indicates whether the video creator’s account is verified (e.g., ‘verified’, ‘not verified’). |
| 6. author_ban_status | factor | Indicates if the video’s author has been banned (e.g., ‘active’, ‘banned’, ‘under review’). |
| 7. video_view_count | numeric | Total number of views the video has received. |
| 8. video_like_count | numeric | Total number of likes the video has received. |
| 9. video_share_count | numeric | Total number of times the video has been shared. |
| 10. video_download_count | numeric | Total number of times the video has been downloaded. |
| 11. video_comment_count | numeric | Total number of comments on the video. |
# Data wrangling
library(dplyr)
library(tidyr)
# Neural Network
library(neuralnet)
# Model evaluation
library(caret)
# Visualization and graphic theme (theme_set() comes from ggplot2)
library(ggplot2)
theme_set(theme_minimal())
# avoid scientific notation in printed output
options(scipen = 999)
tiktok <- read.csv("data_input/tiktok_dataset.csv")
The first step is to investigate the imported dataset by looking at its first and last rows. For this we use the head() and tail() functions.
tiktok %>%
head()
tail(tiktok)
To check whether each column has a suitable data type, we first inspect the data with the glimpse() function.
tiktok %>%
glimpse()
#> Rows: 19,382
#> Columns: 12
#> $ X. <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14…
#> $ claim_status <chr> "claim", "claim", "claim", "claim", "claim", …
#> $ video_id <dbl> 7017666017, 4014381136, 9859838091, 186684799…
#> $ video_duration_sec <int> 59, 32, 31, 25, 19, 35, 16, 41, 50, 45, 47, 3…
#> $ video_transcription_text <chr> "someone shared with me that drone deliveries…
#> $ verified_status <chr> "not verified", "not verified", "not verified…
#> $ author_ban_status <chr> "under review", "active", "active", "active",…
#> $ video_view_count <chr> "343296", "140877", "902185", "437506", "5616…
#> $ video_like_count <chr> "19425", "77355", "97690", "239954", "34987",…
#> $ video_share_count <chr> "241", "19034", "2858", "34812", "4110", "623…
#> $ video_download_count <chr> "1", "1161", "833", "1234", "547", "4293", "8…
#> $ video_comment_count <chr> "0", "684", "329", "584", "152", "1857", "544…
To select the most relevant predictor variables, consider their relationship with claim_status. Here are some variables that could potentially influence the model:
1. video_duration_sec: Longer videos may be more likely to contain complex claims or opinions.
2. verified_status: Verified accounts may have higher credibility, so their claims are more likely to be believed.
3. author_ban_status: Banned users may be more likely to spread misinformation.
4. video_view_count, video_like_count, video_share_count, video_comment_count, video_download_count: These popularity metrics indicate how viral a video is and how much interaction it generates, which may correlate with the likelihood of a claim.
We will select the columns used to build the neural network model: claim_status, video_duration_sec, verified_status, author_ban_status, video_view_count, video_like_count, video_share_count, video_comment_count, and video_download_count.
tiktok_cl <-
tiktok %>%
select(c('claim_status', 'video_duration_sec', 'verified_status', 'author_ban_status', 'video_view_count', 'video_like_count', 'video_share_count', 'video_download_count', 'video_comment_count'))
head(tiktok_cl)
Columns that need to be converted to numeric type: claim_status, verified_status, and author_ban_status (categorical labels recoded to integers), plus video_view_count, video_like_count, video_share_count, video_download_count, and video_comment_count (currently stored as character).
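Before recoding, it can help to confirm the exact category labels so that every level is covered by the case_when() mapping below. A minimal optional sketch:
# optional sanity check: list the distinct labels of each categorical column
# (any label not covered by case_when() would silently become NA)
tiktok_cl %>% count(claim_status)
tiktok_cl %>% count(verified_status)
tiktok_cl %>% count(author_ban_status)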
tiktok_clean <- tiktok_cl %>%
mutate(
claim_status = case_when(
claim_status == "claim" ~ 0,
claim_status == "opinion" ~ 1
),
verified_status = case_when(
verified_status == "not verified" ~ 1,
verified_status == "verified" ~ 2
),
author_ban_status = case_when(
author_ban_status == "active" ~ 1,
author_ban_status == "banned" ~ 2,
author_ban_status == "under review" ~ 3
),
across(c('claim_status', 'video_view_count', 'video_like_count', 'video_share_count', 'video_download_count', 'video_comment_count', 'verified_status', 'author_ban_status'), as.integer)
)
head(tiktok_clean)
tiktok_clean %>%
glimpse()
#> Rows: 19,382
#> Columns: 9
#> $ claim_status <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
#> $ video_duration_sec <int> 59, 32, 31, 25, 19, 35, 16, 41, 50, 45, 47, 30, 5…
#> $ verified_status <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1…
#> $ author_ban_status <int> 3, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1…
#> $ video_view_count <int> 343296, 140877, 902185, 437506, 56167, 336647, 75…
#> $ video_like_count <int> 19425, 77355, 97690, 239954, 34987, 175546, 48619…
#> $ video_share_count <int> 241, 19034, 2858, 34812, 4110, 62303, 193911, 50,…
#> $ video_download_count <int> 1, 1161, 833, 1234, 547, 4293, 8616, 22, 53, 4104…
#> $ video_comment_count <int> 0, 684, 329, 584, 152, 1857, 5446, 11, 27, 2540, …
The data types are now all correct, so we move on to the EDA stage. To get an overview of the dataframe, we can use the summary() function:
summary(tiktok_clean)
#> claim_status video_duration_sec verified_status author_ban_status
#> Min. :0.0000 Min. : 5.00 Min. :1.000 Min. :1.000
#> 1st Qu.:0.0000 1st Qu.:18.00 1st Qu.:1.000 1st Qu.:1.000
#> Median :0.0000 Median :32.00 Median :1.000 Median :1.000
#> Mean :0.4965 Mean :32.42 Mean :1.064 Mean :1.299
#> 3rd Qu.:1.0000 3rd Qu.:47.00 3rd Qu.:1.000 3rd Qu.:1.000
#> Max. :1.0000 Max. :60.00 Max. :2.000 Max. :3.000
#> NA's :298
#> video_view_count video_like_count video_share_count video_download_count
#> Min. : 20 Min. : 0.0 Min. : 0 Min. : 0
#> 1st Qu.: 4942 1st Qu.: 810.8 1st Qu.: 115 1st Qu.: 7
#> Median : 9954 Median : 3403.5 Median : 717 Median : 46
#> Mean :254709 Mean : 84304.6 Mean : 16735 Mean : 1049
#> 3rd Qu.:504327 3rd Qu.:125020.0 3rd Qu.: 18222 3rd Qu.: 1156
#> Max. :999817 Max. :657830.0 Max. :256130 Max. :14994
#> NA's :298 NA's :298 NA's :298 NA's :298
#> video_comment_count
#> Min. : 0.0
#> 1st Qu.: 1.0
#> Median : 9.0
#> Mean : 349.3
#> 3rd Qu.: 292.0
#> Max. :9599.0
#> NA's :298
The summary of the tiktok_clean dataset shows the minimum, 1st quartile, median, mean, 3rd quartile, maximum, and number of NA’s for each column.
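To complement the summary, it can also be informative to compare engagement metrics between the two classes (0 = claim, 1 = opinion). A minimal optional sketch:
# compare median engagement per class; na.rm = TRUE because
# missing values have not been removed at this stage
tiktok_clean %>%
  group_by(claim_status) %>%
  summarise(
    median_views  = median(video_view_count, na.rm = TRUE),
    median_likes  = median(video_like_count, na.rm = TRUE),
    median_shares = median(video_share_count, na.rm = TRUE)
  )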
To check whether this dataset has missing values, we use the is.na() function combined with colSums().
tiktok_clean %>%
is.na() %>%
colSums()
#> claim_status video_duration_sec verified_status
#> 298 0 0
#> author_ban_status video_view_count video_like_count
#> 0 298 298
#> video_share_count video_download_count video_comment_count
#> 298 298 298
Insights: There are missing values in the dataset.
The dataframe has 19,382 observations in total, and several columns contain missing values in 298 rows. The data_na subset below shows that these missing values occur in the same rows across several columns.
data_na <-
tiktok_clean %>%
filter(is.na(video_view_count) | is.na(video_like_count) | is.na(video_share_count) |
is.na(video_download_count) | is.na(video_comment_count) | is.na(claim_status))
head(data_na)
Next, we drop the rows that contain missing values.
tiktok_fix <- na.omit(tiktok_clean)
tiktok_fix %>%
is.na() %>%
colSums()
#> claim_status video_duration_sec verified_status
#> 0 0 0
#> author_ban_status video_view_count video_like_count
#> 0 0 0
#> video_share_count video_download_count video_comment_count
#> 0 0 0
Insight: Missing values no longer exist in the
tiktok_fix dataframe.
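As a side note, since tidyr is already loaded, the same result could be obtained with drop_na(); a minimal equivalent sketch:
# equivalent result using tidyr: drop every row containing at least one NA
tiktok_clean %>%
  drop_na() %>%
  nrow()   # should also leave 19,084 rows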
We make sure once again that the dataset has the correct data type.
tiktok_fix %>%
glimpse()
#> Rows: 19,084
#> Columns: 9
#> $ claim_status <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
#> $ video_duration_sec <int> 59, 32, 31, 25, 19, 35, 16, 41, 50, 45, 47, 30, 5…
#> $ verified_status <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1…
#> $ author_ban_status <int> 3, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1…
#> $ video_view_count <int> 343296, 140877, 902185, 437506, 56167, 336647, 75…
#> $ video_like_count <int> 19425, 77355, 97690, 239954, 34987, 175546, 48619…
#> $ video_share_count <int> 241, 19034, 2858, 34812, 4110, 62303, 193911, 50,…
#> $ video_download_count <int> 1, 1161, 833, 1234, 547, 4293, 8616, 22, 53, 4104…
#> $ video_comment_count <int> 0, 684, 329, 584, 152, 1857, 5446, 11, 27, 2540, …
Next, we will split the tiktok_fix dataset into train (train_tiktok) and test (test_tiktok) sets with an 80:20 ratio, using initial_split() together with the training() and testing() functions from the rsample package.
library(rsample)
RNGkind(sample.kind = "Rounding")
set.seed(123)
#split with proportion 80:20
splitter <- initial_split(data = tiktok_fix, prop = 0.8)
# extract to dataframe
train_tiktok <- training(splitter)
test_tiktok <- testing(splitter)
head(train_tiktok)
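As a quick optional check that the split produced roughly the expected 80:20 proportions, a minimal sketch:
# number of rows in each split (roughly 80% and 20% of the 19,084 rows)
nrow(train_tiktok)
nrow(test_tiktok)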
It’s important to check the class distribution of train_tiktok$claim_status before proceeding, so we know whether class imbalance could bias the model.
table(train_tiktok$claim_status) %>%
prop.table()
#>
#> 0 1
#> 0.5037663 0.4962337
Insights:
The training data exhibits class balance, with 50.38% of samples classified as claim (0) and 49.62% as opinion (1).
We now create a neural network model for this classification case, with claim_status as the target, which indicates whether the video has been reported for containing a claim (0) or an opinion (1). We use the neuralnet() function with the following parameters:
- formula = claim_status ~ . : claim_status is the target variable to be predicted (i.e., whether a video contains a claim or not), using all other columns as predictors.
- data = train_tiktok : the dataset used to train the model.
- hidden = c(2, 1) : the number of neurons in each hidden layer; here there are 2 hidden layers with 2 and 1 neurons respectively.
- linear.output = FALSE : the output of the neural network is not a linear numerical value but a probability (because a logistic activation function is used).
- err.fct = "ce" (cross-entropy) : the error/loss function used to measure the amount of error for binary classification.
- act.fct = "logistic" : the activation function, which produces a value between 0 and 1, suitable for interpretation as a probability.
For reproducibility, we set a seed before training.
set.seed(100)
tiktok_nn <- neuralnet(formula = claim_status ~ .,
data = train_tiktok,
hidden = c(2,1),
linear.output = FALSE,
err.fct = "ce",
act.fct = "logistic")
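To visually inspect the fitted architecture and weights, the neuralnet package provides a plot method for the model object; a minimal optional sketch:
# plot the trained network (rep = "best" shows the repetition with the lowest error)
plot(tiktok_nn, rep = "best")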
The probability values generated by the model can then be used to make decisions. To predict the class of claim_status, the neuralnet model, like other machine learning models, uses the predict() function. We apply it to the test data with the following arguments:
- object = tiktok_nn : the trained neural network model.
- newdata = test_tiktok : the predictor data to be predicted, i.e., the test set.
test_tiktok$pred_nn <- predict(object = tiktok_nn,
                               newdata = test_tiktok)
test_tiktok %>%
select('claim_status', 'pred_nn') %>%
head()
The prediction results are probabilities; they can be converted to class 1 or 0 by setting a decision boundary (here 0.5). We convert the probabilities into categorical classes with the ifelse() function:
- condition tested: pred_nn > 0.5
- 0: claim
- 1: opinion
# convert predicted probabilities into class labels using a 0.5 threshold
test_tiktok$pred_label_nn <- ifelse(test_tiktok$pred_nn > 0.5, 1, 0)
test_tiktok %>%
select('claim_status', 'pred_nn', 'pred_label_nn') %>%
head()
confusion_matrix_neural <- confusionMatrix(
data = as.factor(test_tiktok$pred_label_nn),
reference = as.factor(test_tiktok$claim_status),
positive = "0")
print(confusion_matrix_neural)
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 1887 0
#> 1 30 1900
#>
#> Accuracy : 0.9921
#> 95% CI : (0.9888, 0.9947)
#> No Information Rate : 0.5022
#> P-Value [Acc > NIR] : < 0.00000000000000022
#>
#> Kappa : 0.9843
#>
#> Mcnemar's Test P-Value : 0.0000001192
#>
#> Sensitivity : 0.9844
#> Specificity : 1.0000
#> Pos Pred Value : 1.0000
#> Neg Pred Value : 0.9845
#> Prevalence : 0.5022
#> Detection Rate : 0.4944
#> Detection Prevalence : 0.4944
#> Balanced Accuracy : 0.9922
#>
#> 'Positive' Class : 0
#>
Insight:
Based on these metrics, the neural network model appears to be highly accurate and effective in predicting claim status on TikTok videos.
- The overall accuracy of 99.21% suggests that the model makes correct predictions most of the time.
- Sensitivity and specificity are excellent: the model is very good at identifying both claims and non-claims, indicating that it is not biased towards one class over the other.
- Every prediction of a claim is indeed a claim (positive predictive value of 1.0000), suggesting high confidence in the model’s positive predictions.
Overall, the confusion matrix and statistics indicate that the neural network model performs exceptionally well in predicting claim status on TikTok videos.
After creating the neural network model and its predictions, we want to compare it against a logistic regression model to see which of the two achieves the better accuracy on this TikTok dataset.
Create a logistic regression model with the glm()
function.
# Model Logistic with all predictor
model_logistic <- glm(claim_status ~ .,
data = train_tiktok,
family = "binomial")
model_logistic
#>
#> Call: glm(formula = claim_status ~ ., family = "binomial", data = train_tiktok)
#>
#> Coefficients:
#> (Intercept) video_duration_sec verified_status
#> 5.78147169 -0.00718885 2.27575307
#> author_ban_status video_view_count video_like_count
#> -0.45249587 -0.00037573 -0.00048029
#> video_share_count video_download_count video_comment_count
#> -0.00001337 0.02485430 -0.09643754
#>
#> Degrees of Freedom: 15266 Total (i.e. Null); 15258 Residual
#> Null Deviance: 21160
#> Residual Deviance: 1108 AIC: 1126
# Model Logistic Null predictor
model_log_null <- glm(claim_status ~ 1,
data = train_tiktok,
family = "binomial")
model_log_null
#>
#> Call: glm(formula = claim_status ~ 1, family = "binomial", data = train_tiktok)
#>
#> Coefficients:
#> (Intercept)
#> -0.01507
#>
#> Degrees of Freedom: 15266 Total (i.e. Null); 15266 Residual
#> Null Deviance: 21160
#> Residual Deviance: 21160 AIC: 21170
We also want to build a logistic regression model using stepwise regression with direction "both" (a combination of backward elimination and forward selection).
The stepwise regression process uses the step() function with the following parameters: model_log_null as the object and "both" as the direction. For the "both" direction we also need to define the scope parameter, which sets the upper limit of predictor combinations using model_logistic.
model_tiktok_both <- step(object = model_log_null,
direction = "both",
scope = list(upper= model_logistic),
trace=F)
model_tiktok_both
#>
#> Call: glm(formula = claim_status ~ video_view_count + video_like_count +
#> verified_status + video_comment_count + video_download_count +
#> author_ban_status, family = "binomial", data = train_tiktok)
#>
#> Coefficients:
#> (Intercept) video_view_count video_like_count
#> 5.5409881 -0.0003746 -0.0004903
#> verified_status video_comment_count video_download_count
#> 2.2627162 -0.0970922 0.0253348
#> author_ban_status
#> -0.4424207
#>
#> Degrees of Freedom: 15266 Total (i.e. Null); 15260 Residual
#> Null Deviance: 21160
#> Residual Deviance: 1110 AIC: 1124
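To examine the significance of the predictors retained by the stepwise procedure, the standard summary() method can be applied to the model; a minimal optional sketch:
# coefficient estimates, standard errors, and p-values of the stepwise model
summary(model_tiktok_both)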
Next, we create predictions from model_tiktok_both on the test data (newdata = test_tiktok), using type = "response" so that the output is a probability.
test_tiktok$pred_log <- predict(object = model_tiktok_both,
newdata = test_tiktok,
type = "response")
test_tiktok %>%
select('claim_status','pred_log') %>%
head()
# convert predicted probabilities into class labels using a 0.5 threshold
test_tiktok$pred_label_log <- ifelse(test_tiktok$pred_log > 0.5, 1, 0)
test_tiktok %>%
select('claim_status','pred_log', 'pred_label_log') %>%
head()
# confusion matrix
library(caret)
confusionMatrix(data = as.factor(test_tiktok$claim_status),
reference = as.factor(test_tiktok$pred_label_log),
positive = "0")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 1904 13
#> 1 0 1900
#>
#> Accuracy : 0.9966
#> 95% CI : (0.9942, 0.9982)
#> No Information Rate : 0.5012
#> P-Value [Acc > NIR] : < 0.00000000000000022
#>
#> Kappa : 0.9932
#>
#> Mcnemar's Test P-Value : 0.0008741
#>
#> Sensitivity : 1.0000
#> Specificity : 0.9932
#> Pos Pred Value : 0.9932
#> Neg Pred Value : 1.0000
#> Prevalence : 0.4988
#> Detection Rate : 0.4988
#> Detection Prevalence : 0.5022
#> Balanced Accuracy : 0.9966
#>
#> 'Positive' Class : 0
#>
Insight:
Based on these metrics, the logistic regression model with stepwise selection appears to be highly accurate and effective in predicting claim status on TikTok videos.
- The overall accuracy of 99.66% is even higher than that of the neural network, suggesting that the stepwise selection process effectively identified the most important features for prediction.
- Sensitivity and specificity are excellent: the model is very good at identifying both claims and non-claims, indicating that it is not biased towards one class over the other.
- The model has high precision and recall, indicating that it is accurate in identifying claims while avoiding false positives.
Overall, the confusion matrix and statistics indicate that the logistic regression model with stepwise selection is a very effective and accurate model for predicting claim status on TikTok videos.
Based on the performance metrics, both the neural network and logistic regression models are well-suited to automate the claims-based video identification process and prioritize user reports on TikTok. These models can significantly reduce the workload on moderators and improve the efficiency of content review by accurately identifying claim-based videos and prioritizing reports accordingly.
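For a quick side-by-side recap, the test accuracies of the two models can be tabulated from the caret confusion-matrix objects. A minimal sketch, assuming the logistic regression confusion matrix is also stored in an object (here called confusion_matrix_logistic, a name not created above):
# hypothetical recap table of test-set accuracy for both models
confusion_matrix_logistic <- confusionMatrix(data = as.factor(test_tiktok$claim_status),
                                             reference = as.factor(test_tiktok$pred_label_log),
                                             positive = "0")
data.frame(
  model = c("Neural network", "Stepwise logistic regression"),
  accuracy = c(confusion_matrix_neural$overall["Accuracy"],
               confusion_matrix_logistic$overall["Accuracy"])
)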