- Load Libraries and Data: The first chunk loads the necessary libraries and the data from CSV files.
- Preprocessing and Similarity Functions: The second chunk defines the preprocessing and similarity calculation functions.
- Process and Summarize predicted_vs_manual: The third chunk preprocesses the text, calculates similarity, and summarizes the numerical columns for predicted_vs_manual.
- Process and Summarize predicted_no_score: The fourth chunk preprocesses the text, calculates similarity, and summarizes the numerical columns for predicted_no_score.
- Visualizations for predicted_vs_manual: The fifth chunk generates histograms and box-plots for the accuracy and alignment ratings in predicted_vs_manual.
- Visualizations for predicted_no_score: The sixth chunk generates histograms and box-plots for the similarity scores in predicted_no_score.
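The preprocessing and similarity step described in the second chunk can be sketched as follows. This is a minimal illustration in base R, not the report's actual code: the function names `preprocess_text` and `cosine_similarity` are hypothetical, and plain term-frequency cosine similarity is an assumption standing in for whatever the tm, SnowballC, and lsa packages were used for (no stemming or stop-word removal is done here).

```r
# Hypothetical helpers for illustration; the report's own functions are not shown.

# Lower-case the text, keep letters only, and split it into word tokens.
preprocess_text <- function(text) {
  if (is.na(text)) return(character(0))   # treat missing text as empty
  text <- tolower(text)
  text <- gsub("[^a-z ]", " ", text)      # replace punctuation and digits with spaces
  tokens <- strsplit(text, " +")[[1]]
  tokens[nzchar(tokens)]                  # drop empty strings left by leading spaces
}

# Cosine similarity between the term-frequency vectors of two strings.
# Returns 0 when either string has no tokens.
cosine_similarity <- function(a, b) {
  tok_a <- preprocess_text(a)
  tok_b <- preprocess_text(b)
  vocab <- union(tok_a, tok_b)
  if (length(vocab) == 0) return(0)
  va <- as.numeric(table(factor(tok_a, levels = vocab)))
  vb <- as.numeric(table(factor(tok_b, levels = vocab)))
  denom <- sqrt(sum(va^2)) * sqrt(sum(vb^2))
  if (denom == 0) return(0)
  sum(va * vb) / denom
}

same     <- cosine_similarity("Climate change policy", "climate change policy!")
disjoint <- cosine_similarity("election fraud", "vaccine safety")
partial  <- cosine_similarity("a b", "a c")
```

Under these assumptions, a per-row score could be obtained with something like `mapply(cosine_similarity, df$predicted_narrative, df$manual_narrative_name)`, where `df` is a placeholder for either dataset. Two narratives that share no words score exactly 0, which is one way a median similarity of 0 can arise.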
Loading required package: ggplot2
Warning: package 'ggplot2' was built under R version 4.4.1
Loading required package: dplyr
Warning: package 'dplyr' was built under R version 4.4.1
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
Loading required package: readr
Loading required package: tm
Warning: package 'tm' was built under R version 4.4.1
Loading required package: NLP
The following object is masked from 'package:ggplot2':
annotate
Loading required package: SnowballC
Loading required package: tidytext
Warning: package 'tidytext' was built under R version 4.4.1
Loading required package: lsa
Warning: package 'lsa' was built under R version 4.4.1
Loading required package: stringr
New names:
Rows: 111 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): manual_narrative_name, predicted_narrative, back_office_url, origin...
dbl (3): ...1, How accurately do the predicted narratives represent the cont...
lgl (1): Topics (Exploring 10, Matching 0)
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
New names:
• `` -> `...1`
Rows: 727 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): manual_narrative_name, predicted_narrative, back_office_url, origin...
dbl (1): ...1
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
[1] "...1"
[2] "manual_narrative_name"
[3] "predicted_narrative"
[4] "back_office_url"
[5] "original_url"
[6] "How accurately do the predicted narratives represent the content of the articles? (1 to 10)"
[7] "How closely do the predicted narratives align with the manually assigned narratives? (1 to 10)"
[8] "Other feedback"
[9] "Topics (Exploring 10, Matching 0)"
[1] "...1" "manual_narrative_name" "predicted_narrative"
[4] "back_office_url" "original_url"
# A tibble: 1 × 6
  mean_accuracy median_accuracy mean_alignment median_alignment mean_similarity
          <dbl>           <dbl>          <dbl>            <dbl>           <dbl>
1          3.19               1           3.13                1               0
# ℹ 1 more variable: median_similarity <dbl>
# A tibble: 1 × 2
  mean_similarity median_similarity
            <dbl>             <dbl>
1           0.102                 0
Data Visualizations
For predicted_vs_manual, the mean accuracy rating is 3.19 and the mean alignment rating is 3.13, both on a 1-to-10 scale, while both medians are 1. At least half of the rated predictions therefore received the lowest possible score, and the means are pulled up by the remaining, higher-rated predictions. The mean similarity in this dataset is 0. For predicted_no_score, the mean similarity score is 0.102 and the median is 0, which again points to low similarity between predicted and manual narratives.
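The gap between a mean near 3 and a median of 1 is what a heavily skewed rating distribution produces. The sketch below shows this in base R; the `ratings` vector is made-up illustrative data, not the report's, and `summarise_scores` is a hypothetical helper name.

```r
# Illustrative ratings only (NOT the report's data): mostly 1s plus a few high scores.
ratings <- c(1, 1, 1, 1, 1, 1, 2, 8, 9, 10, NA)

# Mean and median of a numeric vector, ignoring missing values.
summarise_scores <- function(x) {
  c(mean = mean(x, na.rm = TRUE), median = median(x, na.rm = TRUE))
}

summary_stats <- summarise_scores(ratings)
print(summary_stats)   # mean 3.5, median 1

# Share of non-missing ratings that sit at the scale minimum of 1.
share_at_min <- mean(ratings == 1, na.rm = TRUE)
print(share_at_min)    # 0.6
```

Here six of the ten non-missing ratings are 1, so the median is 1, while three high scores lift the mean to 3.5. Reporting the share of ratings at the minimum alongside the mean and median makes this kind of skew explicit.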
Alignment Ratings (Sheet 1)
Similarity Scores (Sheet 2)