In this report, we use the pulitzer data frame from the website FiveThirtyEight to examine whether there is a relationship between a newspaper’s readership and its number of Pulitzer Prizes. In particular, we want to determine whether earning Pulitzer Prizes is associated with a significant change in a newspaper’s readership.
This data frame can be obtained directly from FiveThirtyEight at https://data.fivethirtyeight.com or by installing the R package ‘fivethirtyeight’, which is the package loaded in the code below.
The data frame covers the readership of the top fifty newspapers in the country between 2004 and 2013. It contains the following seven columns:
Newspaper
Daily Circulation, 2004
Daily Circulation, 2013
Change in Daily Circulation, 2004-2013
Pulitzer Prize Winners and Finalists, 1990-2003
Pulitzer Prize Winners and Finalists, 2004-2014
Pulitzer Prize Winners and Finalists, 1990-2014
We start by loading the package and the data frame:
# Load the fivethirtyeight package, which provides the pulitzer data frame
library(fivethirtyeight)
# Read the Pulitzer data frame
data <- fivethirtyeight::pulitzer
head(data)
We begin by removing entries for which the readership in 2013 has dropped to zero (the paper is no longer in business), as these extreme values may distort the analysis.
# Keep only newspapers whose 2013 circulation is not zero
df <- subset(data, circ2013 != 0)
# Convert the result to a plain data frame
df1 <- as.data.frame(df)
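Before discarding rows, it can be useful to see which newspapers are being removed. The following is a minimal sketch of that check; the helper name `split_by_zero` and the toy data frame are illustrative assumptions and are not part of the original analysis.

```r
# Hypothetical helper: split a data frame into the rows that are kept and the
# rows that are dropped, depending on whether a numeric column equals zero.
split_by_zero <- function(df, col) {
  is_zero <- df[[col]] == 0
  is_zero[is.na(is_zero)] <- FALSE  # treat missing values as "not zero"
  list(
    kept    = df[!is_zero, , drop = FALSE],
    dropped = df[is_zero, , drop = FALSE]
  )
}

# Toy data, made up purely for illustration
toy <- data.frame(
  newspaper = c("A", "B", "C"),
  circ2013  = c(1200, 0, 850),
  stringsAsFactors = FALSE
)

parts <- split_by_zero(toy, "circ2013")
nrow(parts$kept)         # 2
parts$dropped$newspaper  # "B"
```

Applied to the real data, `split_by_zero(data, "circ2013")$dropped` would list the newspapers excluded from the rest of the report.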
We will keep only the following columns
pctchg_circ
num_finals1990_2003
num_finals2004_2014
num_finals1990_2014
and form a new data frame called ‘pulitzer’:
pulitzer <- df1[c("pctchg_circ", "num_finals1990_2003",
"num_finals2004_2014" , "num_finals1990_2014")]
We start by visualizing our data to see if we can detect any trends. We will generate three plots:
Percentage change in readership vs. number of awards between 1990 and 2003
Percentage change in readership vs. number of awards between 2004 and 2014
Percentage change in readership vs. total number of awards
We first sort the data frame by the number of awards. Sorting does not change the scatter plots or the fitted lines, but it makes the data easier to inspect.
sorted_pulitzer = pulitzer[order(pulitzer$num_finals1990_2003),]
library(ggplot2)
p = ggplot(sorted_pulitzer, aes(x=num_finals1990_2003, y=pctchg_circ))
# Add a scatter plot to the plot object
p = p + geom_point()
# Add a linear regression line to the plot object
p = p + geom_smooth(method="lm")
# Set the x-axis and y-axis labels
p = p + xlab("Number of Pulitzer Prizes 1990-2003") + ylab("Percentage change in readership")
# Show the plot
p
## `geom_smooth()` using formula = 'y ~ x'
p = ggplot(sorted_pulitzer, aes(x=num_finals2004_2014, y=pctchg_circ))
# Add a scatter plot to the plot object
p = p + geom_point()
# Add a linear regression line to the plot object
p = p + geom_smooth(method="lm")
# Set the x-axis and y-axis labels
p = p + xlab("Number of Pulitzer Prizes 2004-2014") + ylab("Percentage change in readership")
# Show the plot
p
## `geom_smooth()` using formula = 'y ~ x'
p = ggplot(sorted_pulitzer, aes(x=num_finals1990_2014, y=pctchg_circ))
# Add a scatter plot to the plot object
p = p + geom_point()
# Add a linear regression line to the plot object
p = p + geom_smooth(method="lm")
# Set the x-axis and y-axis labels
p = p + xlab("Number of Pulitzer Prizes 1990-2014") + ylab("Percentage change in readership")
# Show the plot
p
## `geom_smooth()` using formula = 'y ~ x'
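The three plots above differ only in the column on the x-axis and in its label, so they could be produced by one small function. This is a sketch under that assumption; the helper name `plot_awards` and its arguments are hypothetical and do not appear in the original report.

```r
library(ggplot2)

# Hypothetical helper: scatter plot of y_col against x_col with a linear fit.
# Column names are passed as strings and looked up through the .data pronoun.
plot_awards <- function(df, x_col, x_label, y_col = "pctchg_circ") {
  ggplot(df, aes(x = .data[[x_col]], y = .data[[y_col]])) +
    geom_point() +
    # Stating the formula explicitly avoids the geom_smooth() message
    geom_smooth(method = "lm", formula = y ~ x) +
    xlab(x_label) +
    ylab("Percentage change in readership")
}

# Usage, assuming a data frame named `pulitzer` with these columns is available
if (exists("pulitzer") && is.data.frame(pulitzer)) {
  print(plot_awards(pulitzer, "num_finals2004_2014",
                    "Number of Pulitzer Prizes 2004-2014"))
}
```

Calling the function once per award column reproduces the three figures without repeating the plotting code.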
Of the three plots, the one showing the change in readership between 2004 and 2013 against the number of Pulitzer Prizes from 2004 to 2014 shows the strongest correlation. The relationship does not appear to be linear; however, at this stage we consider a linear model sufficient.
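The visual comparison can be backed by numbers by computing a correlation coefficient for each award column. The following is a minimal sketch; the helper name `award_correlations` is hypothetical, and using Pearson's coefficient by default is an assumption rather than something the report specifies.

```r
# Hypothetical helper: correlation between the readership change and each
# award-count column. Column names default to those used in this report.
award_correlations <- function(df,
                               predictors = c("num_finals1990_2003",
                                              "num_finals2004_2014",
                                              "num_finals1990_2014"),
                               response = "pctchg_circ",
                               method = "pearson") {
  vapply(predictors, function(col) {
    cor(df[[col]], df[[response]], use = "complete.obs", method = method)
  }, numeric(1))
}

# Usage, assuming a data frame named `pulitzer` with these columns is available
if (exists("pulitzer") && is.data.frame(pulitzer)) {
  print(award_correlations(pulitzer))
  # Spearman's rank correlation does not assume a straight-line relationship
  print(award_correlations(pulitzer, method = "spearman"))
}
```

The column with the largest absolute coefficient is the one most strongly correlated with the change in readership.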
In future work, non-linear regression may reveal clearer trends.
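As one possible starting point for such a comparison, the sketch below fits a straight line and a log-transformed model to the same columns and reports the adjusted R-squared of each. The choice of a `log1p` transform and the helper name `compare_fits` are assumptions made for illustration; the report does not say which non-linear model would be used.

```r
# Hypothetical helper: compare a straight-line fit with a log-transformed fit
# by their adjusted R-squared values.
compare_fits <- function(df,
                         x_col = "num_finals2004_2014",
                         y_col = "pctchg_circ") {
  d <- data.frame(x = df[[x_col]], y = df[[y_col]])
  linear_fit <- lm(y ~ x, data = d)
  # log1p(x) is log(1 + x), so newspapers with zero awards remain usable
  log_fit <- lm(y ~ log1p(x), data = d)
  c(linear = summary(linear_fit)$adj.r.squared,
    log1p  = summary(log_fit)$adj.r.squared)
}

# Usage, assuming a data frame named `pulitzer` with these columns is available
if (exists("pulitzer") && is.data.frame(pulitzer)) {
  print(compare_fits(pulitzer))
}
```

A noticeably higher value for the transformed model would suggest that a non-linear fit is worth pursuing; similar values would support keeping the linear model.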