The Cable News Network (CNN) is a multinational news-based pay television channel headquartered in Atlanta, Georgia. The objective of this project is to identify whether there are any biases in the way reports are aggregated on the CNN website. Our main aim is to identify markers that will help CNN remove any such biases and thereby increase daily site visitors.
The stakeholders of this project are the owners of CNN (AT&T’s WarnerMedia), CNN’s reporters and its viewers.
To prepare for our project, we identified a Kaggle data set named CNN News Articles from 2011 to 2022. It is a cleaned collection of articles published between 2011 and 2022. This data set satisfies the ROCCC criteria, i.e. it is Reliable, Original, Comprehensive, Cited and Current.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ dplyr 1.0.8
## ✓ tidyr 1.2.0 ✓ stringr 1.4.0
## ✓ readr 2.1.2 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(dplyr)
library(ggplot2)
library(tidyr)
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## The following object is masked from 'package:purrr':
##
## transpose
cnn_articles_1 <- read.csv("Downloads/archive (5)/CNN_Articels_clean/CNN_Articels_clean.csv")
cnn_articles_2 <- read.csv("Downloads/archive (5)/CNN_Articels_clean_2/CNN_Articels_clean.csv")
To process the data we’ll have to combine both data sets to get a comprehensive picture. Before that, let’s look at the basic structure of each data set.
summary(cnn_articles_1)
## Index Author Date.published Category
## Min. : 0 Length:4076 Length:4076 Length:4076
## 1st Qu.:1643 Class :character Class :character Class :character
## Median :2670 Mode :character Mode :character Mode :character
## Mean :2608
## 3rd Qu.:3698
## Max. :4729
## Section Url Headline Description
## Length:4076 Length:4076 Length:4076 Length:4076
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Keywords Second.headline Article.text
## Length:4076 Length:4076 Length:4076
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
summary(cnn_articles_2)
## Index Author Date.published Category
## Min. : 0 Length:37949 Length:37949 Length:37949
## 1st Qu.:16094 Class :character Class :character Class :character
## Median :25692 Mode :character Mode :character Mode :character
## Mean :25145
## 3rd Qu.:35235
## Max. :44997
## Section Url Headline Description
## Length:37949 Length:37949 Length:37949 Length:37949
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Keywords Second.headline Article.text
## Length:37949 Length:37949 Length:37949
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
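The two summaries can also be compared programmatically rather than by eye. The following is a minimal sketch on two toy data frames (`a` and `b` are hypothetical stand-ins for `cnn_articles_1` and `cnn_articles_2`); it only checks column names and column classes, not the values themselves.

```r
# Toy stand-ins for cnn_articles_1 and cnn_articles_2 (hypothetical rows).
a <- data.frame(Index = 0:1, Category = c("news", "sport"),
                stringsAsFactors = FALSE)
b <- data.frame(Index = 2:3, Category = c("health", "news"),
                stringsAsFactors = FALSE)

# TRUE when both data frames have the same column names in the same order.
same_names <- identical(names(a), names(b))

# TRUE when each column has the same class in both data frames.
same_types <- identical(sapply(a, class), sapply(b, class))

same_names && same_types
```

If either check is FALSE, the columns would need to be renamed or converted before the two files are combined.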
Now that we’ve established that both data sets have the same structure, let’s join them together:
cnn_article <- full_join(cnn_articles_1, cnn_articles_2)
## Joining, by = c("Index", "Author", "Date.published", "Category", "Section",
## "Url", "Headline", "Description", "Keywords", "Second.headline",
## "Article.text")
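Because no `by` argument is given, `full_join()` matches on every shared column, so here it behaves much like a union of the two files. A more explicit way to express the same intent is to stack the rows and then drop exact duplicates. The sketch below uses two toy data frames (`part_1` and `part_2` are hypothetical); note that `distinct()` also removes duplicates that occur within a single file, which the join would not.

```r
library(dplyr)

# Toy stand-ins for the two files; the row with Index 1 appears in both.
part_1 <- data.frame(Index = c(0, 1), Headline = c("A", "B"))
part_2 <- data.frame(Index = c(1, 2), Headline = c("B", "C"))

# Stack the rows, then keep one copy of each fully identical row.
combined <- bind_rows(part_1, part_2) %>%
  distinct()

nrow(combined)
```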
During processing we find that the ‘Date.published’ column stores the date and time together. We do not need the time, but the publication date can help our analysis. Hence, let us create a new column containing only the date.
cnn_article$date <- as.Date(cnn_article$Date.published)
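It can be worth confirming that every timestamp was parsed. The sketch below assumes the values follow a "YYYY-MM-DD HH:MM:SS" layout (an assumption about the file, illustrated with hypothetical values) and counts how many entries fail to convert.

```r
# Hypothetical timestamps; the layout is an assumption about Date.published.
stamps <- c("2021-07-15 02:46:59", "2022-03-01 18:30:00", "not a date")

# With an explicit format, the trailing time is ignored and
# values that cannot be parsed become NA.
dates <- as.Date(stamps, format = "%Y-%m-%d")

# Number of entries that could not be converted to a date.
sum(is.na(dates))
```

A non-zero count would point to rows whose dates need cleaning before any year or month analysis.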
Now let’s analyze what the data actually contains. As we already know from the process phase of our project, there are two columns in this data named ‘Category’ and ‘Section’. Analyzing what these two columns contain can give us a picture of what the data covers.
cnn_article %>% count(Category)
## Category n
## 1 business 958
## 2 entertainment 470
## 3 health 609
## 4 news 19687
## 5 politics 2536
## 6 sport 17718
## 7 style 1
## 8 travel 39
## 9 vr 5
As we can see from this analysis, certain categories have overwhelmingly more articles than others: ‘news’ (19,687) and ‘sport’ (17,718) together account for most of the data. Three categories, ‘travel’, ‘vr’ and ‘style’, are unclear in meaning and contain very few articles (39, 5 and 1 respectively). We’ll have to remove these from our analysis.
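One way to drop these categories is a `filter()` on the ‘Category’ column. This is a minimal sketch on a toy data frame (`articles`, `dropped` and `articles_kept` are hypothetical names); the same filter could be applied to `cnn_article`.

```r
library(dplyr)

# Toy stand-in for cnn_article with one article per category.
articles <- data.frame(
  Category = c("news", "sport", "style", "travel", "vr", "health")
)

# Categories excluded from the analysis.
dropped <- c("travel", "vr", "style")

# Keep only rows whose Category is not in the excluded set.
articles_kept <- articles %>%
  filter(!Category %in% dropped)

articles_kept %>% count(Category)
```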
cnn_article %>% count(Section)
## Section n
## 1 africa 232
## 2 americas 220
## 3 app-news-section 30
## 4 app-politics-section 2
## 5 app-tech-section 1
## 6 asia 310
## 7 australia 661
## 8 business 307
## 9 business-food 5
## 10 business-india 1
## 11 business-money 5
## 12 cars 8
## 13 celebrities 2
## 14 china 108
## 15 cnn10 1
## 16 economy 72
## 17 energy 24
## 18 entertainment 445
## 19 equestrian 3
## 20 europe 11435
## 21 fashion 1
## 22 foodanddrink 1
## 23 football 5529
## 24 golf 1671
## 25 health 609
## 26 homes 21
## 27 horseracing 1
## 28 india 29
## 29 intl_business 1
## 30 intl_world 2
## 31 investing 91
## 32 justice 25
## 33 living 100
## 34 media 87
## 35 middleeast 153
## 36 motorsport 1486
## 37 movies 2
## 38 opinion 188
## 39 opinions 895
## 40 perspectives 66
## 41 politics 2536
## 42 sailing 1
## 43 showbiz 21
## 44 skiing 3
## 45 sport 6727
## 46 success 50
## 47 tech 220
## 48 tennis 2297
## 49 travel 38
## 50 uk 2232
## 51 us 2253
## 52 vr 5
## 53 weather 149
## 54 world 657
## 55 worldsport 4
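If we look closely at the data, we find that some countries are more represented than others (‘uk’ 2,232 and ‘us’ 2,253 versus ‘china’ 108 and ‘india’ 29), and the same is true of continents (‘europe’ 11,435 versus ‘asia’ 310 and ‘africa’ 232).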
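To make the imbalance easier to read, the geographic sections can be ranked and expressed as shares. The sketch below copies the counts for the geographic sections from the Section table; which sections count as "geographic" is a judgment call made here for illustration, and the shares are relative to these ten sections only, not to all articles.

```r
library(dplyr)

# Counts copied from the Section table (geographic sections only).
regions <- data.frame(
  Section = c("africa", "americas", "asia", "australia", "china",
              "europe", "india", "middleeast", "uk", "us"),
  n = c(232, 220, 310, 661, 108, 11435, 29, 153, 2232, 2253)
)

# Share of each section among these geographic sections, largest first.
region_share <- regions %>%
  mutate(share = round(100 * n / sum(n), 1)) %>%
  arrange(desc(n))

region_share
```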
To analyze the dates, we’ll have to split the ‘date’ column further into year, month and day.
cnn_article_year <- cnn_article %>%
  dplyr::mutate(year = lubridate::year(date),
                month = lubridate::month(date),
                day = lubridate::day(date))
Now let us see how the number of published articles has changed over the years.
cnn_article_ggplot <- cnn_article_year %>% count(year)
View(cnn_article_ggplot)
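`View()` opens an interactive viewer and leaves nothing in a rendered report, so a chart may communicate the trend better. The following is a minimal sketch with ggplot2; `yearly` holds hypothetical counts standing in for `cnn_article_ggplot`, and the axis labels are illustrative.

```r
library(ggplot2)

# Hypothetical yearly counts standing in for cnn_article_ggplot.
yearly <- data.frame(year = 2011:2014, n = c(10, 40, 55, 70))

# One bar per year; factor() keeps the years as discrete axis labels.
p <- ggplot(yearly, aes(x = factor(year), y = n)) +
  geom_col() +
  labs(x = "Year", y = "Articles published")

p
```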
We clearly see that publishing on the site has increased over the years, although the data for the first year (2011) and the last year (2022) is not comprehensive. Further analysis indicates that some months are missing from 2011 and 2022.
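The gaps in the first and last years can be checked by counting how many distinct months appear in each year. This sketch uses a toy data frame (`dated` is hypothetical, with the same `year` and `month` columns as `cnn_article_year`); a year with fewer than 12 months covered is incomplete.

```r
library(dplyr)

# Toy stand-in for cnn_article_year: one row per article.
dated <- data.frame(
  year  = c(2011, 2011, 2012, 2012, 2012),
  month = c(11, 12, 1, 1, 6)
)

# Number of distinct months with at least one article, per year.
coverage <- dated %>%
  group_by(year) %>%
  summarise(months_covered = n_distinct(month))

coverage
```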
Over the course of this analysis we’ve established that there is a definite bias in the way CNN aggregates its news. In a globalized world where anybody can access the CNN website from anywhere, I think CNN is losing out on a lot of potential users. It can change this by doing the following:
Increase reporting from places like China, India and Africa. These places have significant English-speaking populations who can become future CNN website users.
Diversify the type of news that gets published instead of focusing only on politics or sports. Health- and science-related topics should be given due weight.