Introduction

In this report, we use the pulitzer data frame from the website FiveThirtyEight to determine whether there is a relationship between a newspaper’s readership and its number of Pulitzer Prizes. By analyzing the data, we intend to determine if earning the Pulitzer Prize has a significant effect on a newspaper’s readership.

This data frame can be obtained directly from FiveThirtyEight at https://data.fivethirtyeight.com or by installing the R package ‘fivethirtyeight’, which is the package loaded below.

Data Overview

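The data frame covers the readership of the top fifty newspapers in the country between 2004 and 2013. It contains the following seven columns: newspaper, circ2004, circ2013, pctchg_circ, num_finals1990_2003, num_finals2004_2014, and num_finals1990_2014.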

We start by loading the package and the data frame:

# Load the fivethirtyeight package, which provides the pulitzer data frame
library(fivethirtyeight)

# Read the Pulitzer data frame
data <- fivethirtyeight::pulitzer
head(data)

Data Processing

We begin by removing entries for which the readership in 2013 has dropped to zero (the paper is no longer in business). Their 100% decline reflects a closure rather than a gradual change in readership, so keeping them would distort the fitted trends.

# Keep only the newspapers that still had readers in 2013
df <- subset(data, circ2013 != 0)
df1 <- as.data.frame(df)
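
As a quick sanity check on this filtering step, the removed rows can be listed before they are dropped. This is only a sketch: it assumes the ‘fivethirtyeight’ package is installed, and the names raw, removed, and kept are illustrative rather than part of the analysis.

```r
library(fivethirtyeight)

# Illustrative names (raw, removed, kept); not used elsewhere in the report
raw <- as.data.frame(fivethirtyeight::pulitzer)

# Rows with zero readership in 2013, i.e. the ones the filter drops
removed <- raw[which(raw$circ2013 == 0),
               c("newspaper", "circ2004", "circ2013", "pctchg_circ")]

# Rows that remain after filtering
kept <- raw[which(raw$circ2013 != 0), ]

print(removed)
cat("Rows removed:", nrow(removed), "of", nrow(raw), "\n")
```

Listing the dropped newspapers makes it easy to confirm that only papers with no 2013 readership were excluded.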

We will only keep the columns pctchg_circ, num_finals1990_2003, num_finals2004_2014, and num_finals1990_2014, and form a new data frame called ‘pulitzer’:

pulitzer <- df1[c("pctchg_circ", "num_finals1990_2003",
                  "num_finals2004_2014", "num_finals1990_2014")]

Findings and Conclusions

We start by visualizing our data to see if we can detect any trends. We will generate three plots:

  1. Percentage change in readership vs. number of awards between 1990 and 2003

  2. Percentage change in readership vs. number of awards between 2004 and 2014

  3. Percentage change in readership vs. total number of awards

We first sort the data frame by the number of awards. Sorting does not change the scatter plots or the fitted lines, but it makes the data frame easier to inspect.

sorted_pulitzer = pulitzer[order(pulitzer$num_finals1990_2003),]
library(ggplot2)
p = ggplot(sorted_pulitzer, aes(x=num_finals1990_2003, y=pctchg_circ))
# Add a scatter plot to the plot object
p = p + geom_point()

# Add a linear regression line to the plot object
p = p + geom_smooth(method="lm")
# Set the x-axis and y-axis labels
p = p + xlab("Number of Pulitzer Prizes, 1990-2003") + ylab("Percentage change in readership")
# Show the plot
p
## `geom_smooth()` using formula = 'y ~ x'

p = ggplot(sorted_pulitzer, aes(x=num_finals2004_2014, y=pctchg_circ))
# Add a scatter plot to the plot object
p = p + geom_point()

# Add a linear regression line to the plot object
p = p + geom_smooth(method="lm")
# Set the x-axis and y-axis labels
p = p + xlab("Number of Pulitzer Prizes, 2004-2014") + ylab("Percentage change in readership")
# Show the plot
p
## `geom_smooth()` using formula = 'y ~ x'

p = ggplot(sorted_pulitzer, aes(x=num_finals1990_2014, y=pctchg_circ))
# Add a scatter plot to the plot object
p = p + geom_point()

# Add a linear regression line to the plot object
p = p + geom_smooth(method="lm")
# Set the x-axis and y-axis labels
p = p + xlab("Number of Pulitzer Prizes, 1990-2014") + ylab("Percentage change in readership")
# Show the plot
p
## `geom_smooth()` using formula = 'y ~ x'
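
The three plots above differ only in the x-axis column and its label, so they could also be produced by one small helper. The function below is a hypothetical convenience wrapper, not part of the original analysis; it assumes a data frame with a pctchg_circ column, such as sorted_pulitzer.

```r
library(ggplot2)

# Hypothetical helper: scatter plot of pctchg_circ against one award-count
# column, with a fitted straight line and its confidence band.
plot_change_vs_awards <- function(df, x_col, x_label) {
  ggplot(df, aes(x = .data[[x_col]], y = pctchg_circ)) +
    geom_point() +
    geom_smooth(method = "lm", formula = y ~ x) +
    xlab(x_label) +
    ylab("Percentage change in readership")
}

# Usage with the sorted_pulitzer data frame built earlier in the report:
# plot_change_vs_awards(sorted_pulitzer, "num_finals2004_2014",
#                       "Number of Pulitzer Prizes, 2004-2014")
```

Passing formula = y ~ x explicitly also avoids the informational message that geom_smooth() prints when it has to choose the formula itself.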

Of the three plots, the one showing the change in readership between 2004 and 2013 against the number of Pulitzer Prizes in roughly the same period (2004-2014) shows the strongest correlation. The relationship is not strictly linear; however, at this stage we consider a linear model sufficient.
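
The visual comparison of the three plots can be complemented with correlation coefficients. The sketch below assumes the ‘fivethirtyeight’ package is installed and rebuilds the filtered data under the illustrative name active. Pearson's coefficient measures linear association, while Spearman's rank coefficient is less sensitive to curvature; no values are quoted here because they have not been computed for this report.

```r
library(fivethirtyeight)

# Rebuild the filtered data (illustrative name: active)
raw <- as.data.frame(fivethirtyeight::pulitzer)
active <- raw[which(raw$circ2013 != 0), ]

predictors <- c("num_finals1990_2003",
                "num_finals2004_2014",
                "num_finals1990_2014")

# Correlation of each award count with the percentage change in readership
pearson <- sapply(predictors, function(col) {
  cor(active[[col]], active$pctchg_circ)
})
spearman <- sapply(predictors, function(col) {
  cor(active[[col]], active$pctchg_circ, method = "spearman")
})

print(round(cbind(pearson, spearman), 3))
```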

In future work, non-linear regression may reveal clearer trends.
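
One simple way to start exploring this is to add a quadratic term and test whether it improves on the straight line. This is a sketch under assumptions, not a result of this report: it assumes the ‘fivethirtyeight’ package is installed, uses the illustrative names active, linear_fit, and quadratic_fit, and picks a degree-2 polynomial only as an example of a curved fit.

```r
library(fivethirtyeight)

# Rebuild the filtered data (illustrative name: active)
raw <- as.data.frame(fivethirtyeight::pulitzer)
active <- raw[which(raw$circ2013 != 0), ]

# Straight-line model, as used in the plots
linear_fit <- lm(pctchg_circ ~ num_finals2004_2014, data = active)

# Same model with an added quadratic term to allow curvature
quadratic_fit <- lm(pctchg_circ ~ poly(num_finals2004_2014, 2), data = active)

# F-test comparing the two nested models
print(anova(linear_fit, quadratic_fit))
```

A small p-value in the comparison would suggest that the curved model fits noticeably better; a smoother such as geom_smooth(method = "loess") is another option for visual exploration.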