Step-1: ASK

The Cable News Network (CNN) is a multinational news-based pay television channel headquartered in Atlanta, Georgia. The objective of this project is to identify whether there are any biases in the way reports are aggregated on the CNN website. Our main aim is to identify markers that will help CNN remove any biases, if present, and so increase the number of daily site visitors.

The stakeholders of this project are the owners of CNN (AT&T’s WarnerMedia), the reporters, and the viewers of CNN.

Step-2: PREPARE

To prepare for our project we’ve identified a Kaggle dataset named CNN News Articles from 2011 to 2022. It is a cleaned collection of articles published between 2011 and 2022. This data set fulfills the ROCCC pattern, i.e. it is Reliable, Original, Comprehensive, Cited and Current.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.8
## ✓ tidyr   1.2.0     ✓ stringr 1.4.0
## ✓ readr   2.1.2     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(dplyr)
library(ggplot2)
library(tidyr)
library(data.table)
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## The following object is masked from 'package:purrr':
## 
##     transpose
cnn_articles_1 <- read.csv("Downloads/archive (5)/CNN_Articels_clean/CNN_Articels_clean.csv")
cnn_articles_2 <- read.csv("Downloads/archive (5)/CNN_Articels_clean_2/CNN_Articels_clean.csv")

Step-3: PROCESS

To process the data we’ll have to combine both data sets to get a comprehensive picture. But before that, let’s look at the basic structure of both data sets.

summary(cnn_articles_1)
##      Index         Author          Date.published       Category        
##  Min.   :   0   Length:4076        Length:4076        Length:4076       
##  1st Qu.:1643   Class :character   Class :character   Class :character  
##  Median :2670   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :2608                                                           
##  3rd Qu.:3698                                                           
##  Max.   :4729                                                           
##    Section              Url              Headline         Description       
##  Length:4076        Length:4076        Length:4076        Length:4076       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    Keywords         Second.headline    Article.text      
##  Length:4076        Length:4076        Length:4076       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
## 
summary(cnn_articles_2)
##      Index          Author          Date.published       Category        
##  Min.   :    0   Length:37949       Length:37949       Length:37949      
##  1st Qu.:16094   Class :character   Class :character   Class :character  
##  Median :25692   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :25145                                                           
##  3rd Qu.:35235                                                           
##  Max.   :44997                                                           
##    Section              Url              Headline         Description       
##  Length:37949       Length:37949       Length:37949       Length:37949      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    Keywords         Second.headline    Article.text      
##  Length:37949       Length:37949       Length:37949      
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
## 
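
The two summaries can also be compared programmatically instead of by eye. The following is a minimal sketch on toy data frames; `a` and `b` are hypothetical stand-ins for `cnn_articles_1` and `cnn_articles_2`, and their values are made up.

```r
# Toy stand-ins for cnn_articles_1 and cnn_articles_2 (hypothetical values).
a <- data.frame(Index = 0:1, Author = c("A", "B"), stringsAsFactors = FALSE)
b <- data.frame(Index = 0:2, Author = c("C", "D", "E"), stringsAsFactors = FALSE)

# Same column names in the same order?
same_names <- identical(names(a), names(b))

# Same column classes?
same_types <- identical(sapply(a, class), sapply(b, class))

# TRUE only if both checks pass.
same_names && same_types
```

If either check fails, the column names or types would need to be aligned before combining the files.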

Now that we’ve established that the structure of both data sets is the same, let’s join the data sets together:

cnn_article <- full_join(cnn_articles_1, cnn_articles_2)
## Joining, by = c("Index", "Author", "Date.published", "Category", "Section",
## "Url", "Headline", "Description", "Keywords", "Second.headline",
## "Article.text")
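
Because the join uses every column as a key, it effectively stacks the two files and merges rows that appear in both. The category counts in the ANALYZE step sum to 42,023, two fewer than 4,076 + 37,949 = 42,025, which suggests a small overlap between the files. An alternative way to express the same intent is to stack the rows and drop exact duplicates. This is only a sketch on made-up data (`part_1` and `part_2` are hypothetical), not the method used in this project.

```r
library(dplyr)

# Toy stand-ins for the two files (hypothetical values); one row appears in both.
part_1 <- data.frame(Index = c(0, 1), Headline = c("h0", "h1"))
part_2 <- data.frame(Index = c(1, 2), Headline = c("h1", "h2"))

# Stack the rows, then drop exact duplicates.
combined <- bind_rows(part_1, part_2) %>% distinct()

nrow(combined)
```

Unlike the join, this version makes the de-duplication step explicit, so it is easier to see how many rows were removed.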

During processing we find that the ‘Date.published’ column stores the date and time together. We do not need the time, but the date of publishing can help our analysis. Hence, let us create another column with just the date.

cnn_article$date <- as.Date(cnn_article$Date.published)
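
Calling `as.Date()` without a `format` stops with an error if the first value it inspects is not in a standard form. A more defensive variant, sketched below, passes an explicit format so that bad values become `NA` and can be counted. The timestamps are hypothetical; the actual layout of ‘Date.published’ is assumed to start with `YYYY-MM-DD`.

```r
# Hypothetical timestamps; the real format of Date.published may differ.
stamps <- c("2021-07-15 02:46:59", "2019-03-01 18:05:00", "not a date")

# Parse only the leading date part; unparseable values become NA instead of an error.
dates <- as.Date(substr(stamps, 1, 10), format = "%Y-%m-%d")

# Count the values that failed to parse.
sum(is.na(dates))
```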

Step-4: ANALYZE

Now let’s analyze what the data actually contains. As we already know from the process phase of our project, the data has two columns named ‘Category’ and ‘Section’. Analyzing what these two columns contain can give us a picture of what the data covers.

cnn_article %>% count(Category)
##        Category     n
## 1      business   958
## 2 entertainment   470
## 3        health   609
## 4          news 19687
## 5      politics  2536
## 6         sport 17718
## 7         style     1
## 8        travel    39
## 9            vr     5

As we can see from this analysis, certain categories have overwhelmingly more articles than others. Three categories, ‘travel’, ‘vr’ and ‘style’, are not clear in meaning and contain very few articles (39, 5 and 1 respectively). We’ll have to remove these from our analysis.
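
One way to drop those three categories by name is sketched below. It works on a small data frame built from the counts printed above; `category_counts` and `kept` are hypothetical names used only for this illustration.

```r
library(dplyr)

# Category counts copied from the table above.
category_counts <- data.frame(
  Category = c("business", "entertainment", "health", "news", "politics",
               "sport", "style", "travel", "vr"),
  n = c(958, 470, 609, 19687, 2536, 17718, 1, 39, 5)
)

# Drop the three small, unclear categories by name.
kept <- category_counts %>%
  filter(!Category %in% c("travel", "vr", "style"))

kept$Category
```

The same `filter()` call could be applied to the full article data, since it also has a ‘Category’ column.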

cnn_article %>% count(Section)
##                 Section     n
## 1                africa   232
## 2              americas   220
## 3      app-news-section    30
## 4  app-politics-section     2
## 5      app-tech-section     1
## 6                  asia   310
## 7             australia   661
## 8              business   307
## 9         business-food     5
## 10       business-india     1
## 11       business-money     5
## 12                 cars     8
## 13          celebrities     2
## 14                china   108
## 15                cnn10     1
## 16              economy    72
## 17               energy    24
## 18        entertainment   445
## 19           equestrian     3
## 20               europe 11435
## 21              fashion     1
## 22         foodanddrink     1
## 23             football  5529
## 24                 golf  1671
## 25               health   609
## 26                homes    21
## 27          horseracing     1
## 28                india    29
## 29        intl_business     1
## 30           intl_world     2
## 31            investing    91
## 32              justice    25
## 33               living   100
## 34                media    87
## 35           middleeast   153
## 36           motorsport  1486
## 37               movies     2
## 38              opinion   188
## 39             opinions   895
## 40         perspectives    66
## 41             politics  2536
## 42              sailing     1
## 43              showbiz    21
## 44               skiing     3
## 45                sport  6727
## 46              success    50
## 47                 tech   220
## 48               tennis  2297
## 49               travel    38
## 50                   uk  2232
## 51                   us  2253
## 52                   vr     5
## 53              weather   149
## 54                world   657
## 55           worldsport     4

If we look closely at the data, we’ll find that some countries are more represented than others, and the same is true of continents.

To analyze the dates we’ll have to split the ‘date’ column further into year, month and day.

cnn_article_year <- cnn_article %>%
  dplyr::mutate(year = lubridate::year(date), 
                month = lubridate::month(date), 
                day = lubridate::day(date))

Now let us see how the number of published articles has changed over the years.

cnn_article_ggplot <- cnn_article_year %>% count(year)
View(cnn_article_ggplot)

We clearly see that publishing on the site has increased over the years, although the data for the first year (2011) and the last year (2022) is not comprehensive. Further analysis indicates that some months are missing from 2011 and 2022.
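
A check for incomplete years can be sketched by counting the distinct months that appear in each year. The dates below are made up for illustration (`toy`, `coverage` and `incomplete_years` are hypothetical names); on the real data the same pipeline would start from the article data frame and its ‘date’ column.

```r
library(dplyr)

# Hypothetical publication dates: 2011 covers only two months, 2012 all twelve.
toy <- data.frame(
  date = as.Date(c("2011-11-05", "2011-12-20",
                   sprintf("2012-%02d-15", 1:12)))
)

# Count the distinct calendar months seen in each year.
coverage <- toy %>%
  mutate(year = as.integer(format(date, "%Y")),
         month = as.integer(format(date, "%m"))) %>%
  group_by(year) %>%
  summarise(months_covered = n_distinct(month), .groups = "drop")

# Years with fewer than 12 months are incomplete.
incomplete_years <- coverage$year[coverage$months_covered < 12]
incomplete_years
```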

Step-5: SHARE

The findings from the analysis that are worth sharing are as follows:

To show the change over the years we first have to remove 2011 and 2022, because the data for those years is not complete. Let’s remove them from our data.

# Drop the incomplete years by value rather than by row position.
cnn_article_ggplot_filtered <- dplyr::filter(cnn_article_ggplot, !year %in% c(2011, 2022))

Now let us plot our data.

ggplot(data = cnn_article_ggplot_filtered, mapping = aes(x=year, y=n))+
  geom_smooth()+
  labs(x="Year Published", y="Number of Articles", title = "CNN website News Articles")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Next, let’s clean our data to keep only the categories with a significant number of articles (more than 100).

cnn_article_category <- cnn_article %>% count(Category)
cnn_article_category_filtered <- filter(cnn_article_category, n>100)
View(cnn_article_category_filtered)

Now let us plot our graph.

barplot(cnn_article_category_filtered$n, names.arg = cnn_article_category_filtered$Category, col = 'blue', main = "CNN Articles by Category", ylab = "Number of Articles", cex.names = .90)

As is apparent, the ‘sport’ category is overrepresented compared to ‘health’ or ‘business’. This suggests CNN is overlooking a significant user base interested in things other than sport or politics.

Let us first filter the data according to the regions defined in the ‘Section’ column.

cnn_article_fil <- cnn_article %>% filter(Section %in% c('asia', 'africa', 'australia', 'china', 'india', 'us', 'uk', 'europe'))
cnn_article_continent <- cnn_article_fil %>% count(Section)
View(cnn_article_continent)

We notice that Europe, the US and the UK are overrepresented. Considering CNN is a US company, the US being overrepresented is understandable, but Australia having more coverage than India and China seems odd.

Let us plot a bar chart to further illustrate this inequality.
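
To put numbers on this imbalance, each section's share of the regional articles can be computed. The sketch below uses the section counts printed in the ANALYZE step; `region_counts` and `share_pct` are hypothetical names introduced only for this example.

```r
# Section counts copied from the table in the ANALYZE step.
region_counts <- data.frame(
  Section = c("africa", "asia", "australia", "china", "europe", "india", "uk", "us"),
  n = c(232, 310, 661, 108, 11435, 29, 2232, 2253)
)

# Share of the regional articles held by each section, as a percentage.
region_counts$share_pct <- round(100 * region_counts$n / sum(region_counts$n), 1)

# Largest share first.
region_counts[order(-region_counts$n), ]
```

Shares make the comparison independent of the total number of articles, which is useful if the analysis is repeated on a different time window.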

barplot(cnn_article_continent$n , names.arg = cnn_article_continent$Section, col = 'blue', main = "CNN Articles by Region", ylab ="Number of Articles")

Step-6: ACT

Over the course of this analysis we’ve established that there is a definite bias in the way CNN aggregates its news. In a globalized world where anybody can access the CNN website from anywhere, I think CNN is losing out on a lot of potential users. It can change this by doing the following: