Executive summary

The current difficulties in the South Korean political sphere stem from controversies surrounding the three proposed revisions to the Broadcasting Laws, which are aimed at countering the control that politicians exert over broadcasting and media. Attempts by political authorities to control the media have been evident throughout South Korea's history, and a resolution still appears to be a long way off. Recent controversies over political bias at NPR, the public broadcaster in the United States, indicate that the problem is not unique to Korea. The BBC in the UK, by contrast, is regarded as a role model of public broadcasting that has achieved both political independence and high-quality content. There is much to learn from how it has overcome challenges similar to those we are facing.

To address the question of whether public broadcasting can avoid political influence, I examine the BBC and NPR and analyze news articles from both broadcasters to understand the differences and similarities in how they present information. Given the current situation at NPR, my expectation is that its articles tend to represent the position of the current Democratic administration and cover government actions more extensively. The BBC, on the other hand, is expected to prioritize neutral, factual reporting of events, in line with its reputation for impartiality.

To test this, I gather and analyze news articles from the period when the conflict between Israel and Hamas escalated and became the focal issue.

Data background

I obtained the data from Kaggle, a data science platform that provides datasets in many areas. The dataset was compiled by a user named ‘Kumar Saksham’ using NewsAPI. It contains news from around the world collected between October and November 2023. This period marks the initial stage of the conflict between Israel and Hamas, which likely attracted significant international attention and therefore produced a high volume of related news articles. The dataset holds information on almost 100,000 news articles, including the outlet that published each article, the title, the description, and so on.

Data loading, cleaning and preprocessing

First, I loaded the CSV file in RStudio. Then I kept only the articles from two news organizations, NPR and BBC News. From these, I selected the articles whose titles include either “Israel” or “Hamas”. To do that, I converted all article titles to lowercase and kept the rows whose titles contain either string. After that, I tokenized the full text of the articles and removed the stop words.

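# Packages used in this report
library(dplyr)      # filter, count, mutate, joins
library(tidyr)      # pivot_wider
library(tidytext)   # unnest_tokens, stop_words, reorder_within, get_sentiments
library(widyr)      # pairwise_count
library(ggplot2)    # bar charts
library(tidygraph)  # as_tbl_graph
library(ggraph)     # network plots

dataset <- read.csv("data.csv")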

dataset_com <- dataset %>% 
  filter(source_name %in% c("BBC News","NPR")) 
  
dataset_com$title <- tolower(dataset_com$title)
  
data_fil <- dataset_com %>% 
  filter(grepl('hamas|israel', title))

dataset_token <- data_fil %>% 
  unnest_tokens(input = full_content, output = word) %>% 
  anti_join(stop_words)

## Joining with `by = join_by(word)`
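
The title filter described above can also be written without overwriting the original titles. The following is a minimal sketch in base R, not the code used for the results in this report. The toy data frame `articles` and its rows are invented for illustration; only the column names `source_name` and `title` come from the dataset.

```r
# Toy data: column names mirror the dataset, the rows are invented.
articles <- data.frame(
  source_name = c("BBC News", "NPR", "NPR", "CNN"),
  title = c("Israel strikes Gaza", "HAMAS releases hostages",
            "Local election results", "Israel and Hamas agree truce"),
  stringsAsFactors = FALSE
)

# Keep the two broadcasters, then match titles case-insensitively.
# ignore.case = TRUE avoids converting the titles to lowercase first.
keep_source <- articles$source_name %in% c("BBC News", "NPR")
keep_title  <- grepl("hamas|israel", articles$title, ignore.case = TRUE)
filtered <- articles[keep_source & keep_title, ]

print(filtered$title)
```

Because the titles keep their original capitalization, they remain readable if they are later shown in a table or used as article identifiers.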

Text data analysis

Individual analysis and figures

Analysis and Figure 1

My first analysis uses the log odds ratio. The log odds ratio identifies words that are used relatively more often in one group than in another, which makes it well suited to comparing two groups, so I used it to find the words that are relatively more important to each of the two broadcasters. Because there were too many words, and I wanted to limit the influence of terms unrelated to the main content, I restricted the analysis to words that a broadcaster used more than 100 times. Some meaningless words remain, but the results still show meaningful patterns. The BBC, the British public broadcaster, shows many Israel-centric words such as Netanyahu, minister, and government, while NPR, the US public broadcaster, features more US-centric terms such as Biden, officials, and aid. One interpretation is that the United States places great importance on its role as a global policeman and that NPR tends to speak for its government. This suggests the possibility that NPR is influenced by its own government, although word frequencies alone cannot establish this.

dataset_odds <- dataset_token %>%
  count(source_name, word, sort = T) %>% 
  filter(n > 100) %>% 
  pivot_wider(names_from = source_name,
              values_from = n, 
              values_fill = list(n = 0)) %>% 
  rename(BBC = `BBC News`) %>% 
  mutate(ratio_BBC = ((BBC + 1)/(sum(BBC + 1))), 
         ratio_NPR = ((NPR + 1)/(sum(NPR + 1)))) %>% 
  mutate(odds_ratio = ratio_BBC/ratio_NPR) %>% 
  mutate(log_odds_ratio = log(odds_ratio)) %>% 
  group_by(source_name = ifelse(log_odds_ratio > 0, "BBC", "NPR")) %>%
  slice_max(abs(log_odds_ratio), n = 10, with_ties = F) %>% 
  mutate(log_odds_ratio = abs(log_odds_ratio))

ggplot(dataset_odds, aes(x = reorder_within(word, log_odds_ratio, source_name),
                  y = log_odds_ratio,
                  fill = source_name)) +
  geom_col(show.legend = F) +
  coord_flip() +
  facet_wrap(~ source_name, scales = "free", ncol = 2) +
  scale_x_reordered() +
  labs(x = NULL)+
  ggtitle("Top 10 words in log odds ratio") +
  theme(plot.title = element_text(hjust = 0.5,size=20,face='bold'))
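
To make the calculation concrete, here is a small worked example in base R. It is a sketch with invented counts, not the report's data, and it applies the same steps as the pipeline above: add one to every count, turn the counts into relative frequencies within each broadcaster, and take the log of the ratio.

```r
# Invented word counts for illustration only.
counts <- data.frame(
  word = c("netanyahu", "biden", "gaza"),
  BBC  = c(30, 5, 100),
  NPR  = c(10, 40, 100),
  stringsAsFactors = FALSE
)

# Add-one smoothing keeps words with a zero count from producing log(0).
counts$ratio_BBC <- (counts$BBC + 1) / sum(counts$BBC + 1)
counts$ratio_NPR <- (counts$NPR + 1) / sum(counts$NPR + 1)

# Positive values lean towards the BBC, negative values towards NPR.
counts$log_odds_ratio <- log(counts$ratio_BBC / counts$ratio_NPR)
counts$leans <- ifelse(counts$log_odds_ratio > 0, "BBC", "NPR")

print(counts[order(-counts$log_odds_ratio),
             c("word", "log_odds_ratio", "leans")])
```

In this toy example "gaza" has the same raw count for both broadcasters but still leans slightly towards the BBC, because the BBC total is smaller. The measure compares shares of each broadcaster's vocabulary, not raw counts.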

Analysis and Figure 2

Next, I wanted to understand how each broadcaster weaves these words into its narrative, and whether the differences are absolute as well as relative. To do this, I used semantic network analysis to find the words associated with the key terms in each article. I used the pairwise_count function to count how often two words appear together in the same article and drew these relationships as a graph. The results show that both broadcasters share the same highly connected key words related to the events, such as Israel, Hamas, and Gaza, which indicates their central place in the coverage. In both graphs, the words with many connections mostly describe factual aspects, with terms like October 7th, attack, and people reflecting the major events that occurred. A notable difference appears in the less connected parts of the graphs: the BBC prominently features the word “children,” while NPR includes “support.” This suggests that the BBC focuses on smaller-scale impacts such as casualties of the war, whereas NPR concentrates on larger entities and the overarching themes driving the war.

data_npr <- dataset_token %>% 
  group_by(source_name) %>% 
  pairwise_count(item = word,
                 feature = title,
                 sort = T) %>% 
  filter(source_name == "NPR")

data_npr$source_name <-NULL

data_bbc <- dataset_token %>% 
  group_by(source_name) %>% 
  pairwise_count(item = word,
                 feature = title,
                 sort = T) %>% 
  filter(source_name == "BBC News")

data_bbc$source_name <-NULL

graph_npr <- data_npr %>%
  filter(n >= 45) %>%
  as_tbl_graph() 

graph_bbc <- data_bbc %>%
  filter(n >= 55) %>%
  as_tbl_graph() 

set.seed(1235)

ggraph(graph_npr, layout = "fr") +
  geom_edge_link(color = "gray50",
                   alpha = 0.4) +
  geom_node_point(color = "deepskyblue", size = 3) +
  geom_node_text(aes(label = name), repel = T, size = 3) +
  ggtitle("Network analysis of NPR") +
  theme_graph()

## Warning in grid.Call(C_stringMetric, as.graphicsAnnot(x$label)): font
## family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

## Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

ggraph(graph_bbc, layout = "fr") +
  geom_edge_link(color = "gray50",
                   alpha = 0.4) +
  geom_node_point(color = "pink", size = 3) +
  geom_node_text(aes(label = name), repel = T, size = 3) +
  ggtitle("Network analysis of BBC") +
  theme_graph()

## Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database
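
The co-occurrence counts behind these graphs can be illustrated with a small example. The sketch below uses base R and invented tokens. It is not the widyr implementation, only the same idea: for each pair of words, count the number of articles in which both appear, then keep the pairs at or above a threshold as graph edges. The threshold of 2 is chosen for the toy data.

```r
# Invented tokens: one row per word occurrence, identified by article title.
tokens <- data.frame(
  title = c("a1", "a1", "a1", "a2", "a2", "a3", "a3", "a3"),
  word  = c("israel", "hamas", "gaza", "israel", "gaza",
            "hamas", "gaza", "gaza"),
  stringsAsFactors = FALSE
)

# Article-by-word incidence matrix: 1 if the word occurs in the article at all.
incidence <- 1 * (unclass(table(tokens$title, tokens$word)) > 0)

# Word-by-word matrix: each entry is the number of articles containing both words.
cooc <- crossprod(incidence)
diag(cooc) <- 0

# Keep each unordered pair once, and only pairs shared by at least 2 articles.
pairs <- which(upper.tri(cooc) & cooc >= 2, arr.ind = TRUE)
edges <- data.frame(
  item1 = rownames(cooc)[pairs[, "row"]],
  item2 = colnames(cooc)[pairs[, "col"]],
  n     = cooc[pairs],
  stringsAsFactors = FALSE
)

print(edges)
```

Repeated words within one article (here "gaza" in `a3`) count only once, so the measure reflects how many articles link two words rather than how often the words are repeated.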

Analysis and Figure 3

Lastly, I looked for the positive and negative words that appear in the articles, since differences in the emotional words each broadcaster uses might indicate its political stance. For this I used sentiment analysis: I matched the words against the Bing sentiment lexicon to select those with emotional nuance, and ranked them by frequency in separate charts for positive and negative words. The results show that the top-ranking words are mostly the same for both broadcasters, in the positive and the negative lists alike. Noticeable differences are unfortunately scarce, but I did observe a contrast in intensity between the BBC's “retaliatory” and NPR's “bombardment”. One interpretation is that the BBC focuses on internal motivations, while NPR concentrates on external events.

bing <- get_sentiments("bing")

bbc_ne <- dataset_token %>% 
  filter(source_name == "BBC News") %>% 
  inner_join(bing) %>% 
  filter(sentiment == "negative") %>% 
  count(source_name, sentiment, word, sort = T)

## Joining with `by = join_by(word)`

bbc_po <- dataset_token %>% 
  filter(source_name == "BBC News") %>% 
  inner_join(bing) %>% 
  filter(sentiment == "positive") %>% 
  count(source_name, sentiment, word, sort = T)

## Joining with `by = join_by(word)`

npr_ne <- dataset_token %>% 
  filter(source_name == "NPR") %>% 
  inner_join(bing) %>% 
  filter(sentiment == "negative") %>% 
  count(source_name, sentiment, word, sort = T)

## Joining with `by = join_by(word)`

npr_po <- dataset_token %>% 
  filter(source_name == "NPR") %>% 
  inner_join(bing) %>% 
  filter(sentiment == "positive") %>% 
  count(source_name, sentiment, word, sort = T)

## Joining with `by = join_by(word)`

negative_bing <- bind_rows(bbc_ne, npr_ne) %>%
  group_by(source_name) %>% 
  slice_max(n, n = 10, with_ties = F)

positive_bing <- bind_rows(bbc_po, npr_po) %>%
  group_by(source_name) %>% 
  slice_max(n, n = 10, with_ties = F)

ggplot(negative_bing, aes(x = reorder_within(word, n, source_name),
                  y = n,
                  fill = source_name)) +
  geom_col(show.legend = F) +
  coord_flip() +
  facet_wrap(~source_name, scales = "free") +
  scale_x_reordered() +
  labs(x = NULL) +
  ggtitle("Top 10 negative words") +
  theme(plot.title = element_text(hjust = 0.5,size=20,face='bold'))

ggplot(positive_bing, aes(x = reorder_within(word, n, source_name),
                  y = n,
                  fill = source_name)) +
  geom_col(show.legend = F) +
  coord_flip() +
  facet_wrap(~source_name, scales = "free") +
  scale_x_reordered() +
  labs(x = NULL) +
  ggtitle("Top 10 positive words") +
  theme(plot.title = element_text(hjust = 0.5,size=20,face='bold'))
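
The sentiment counts can also be computed for both broadcasters in one pass and expressed as shares, which makes the two outlets comparable when one has published more text than the other. The following base R sketch uses invented tokens and a hypothetical four-word lexicon that stands in for the Bing lexicon. It is an illustration, not the code behind the figures above.

```r
# Invented tokens for illustration only.
tokens <- data.frame(
  source_name = c("BBC News", "BBC News", "BBC News", "NPR", "NPR", "NPR"),
  word = c("attack", "support", "attack", "attack", "peace", "crisis"),
  stringsAsFactors = FALSE
)

# Hypothetical mini-lexicon standing in for get_sentiments("bing").
lexicon <- data.frame(
  word = c("attack", "crisis", "support", "peace"),
  sentiment = c("negative", "negative", "positive", "positive"),
  stringsAsFactors = FALSE
)

# Keep only the tokens found in the lexicon, then count per source and word.
tagged <- merge(tokens, lexicon, by = "word")
counts <- aggregate(n ~ source_name + sentiment + word,
                    data = transform(tagged, n = 1), FUN = sum)

# Share of each word among all sentiment words of the same broadcaster.
counts$share <- counts$n / ave(counts$n, counts$source_name, FUN = sum)

print(counts[order(counts$source_name, -counts$n), ])
```

Comparing `share` instead of `n` avoids reading a difference in article volume as a difference in tone.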

The comparative analysis of NPR and the BBC shows some differences that may reflect political influence. Although the differences in the results are not large, I suggest they are still important, because small differences that recur frequently can add up to a dangerous situation. These findings remind us of the need to guarantee press freedom and to look for wiser ways of achieving that goal.

News article analysis between BBC and NPR

임준용

2024.06.13