Week 11 Discussion

Pulitzer Prizes and Newspaper Circulation

We fit a linear regression that uses a newspaper's circulation to predict its number of Pulitzer Prize winners and finalists. Our data set contains daily circulation figures for 2004 and 2013, along with a count of each newspaper's winners and finalists over 2004-2014. Because the counts cover the whole period while circulation is recorded only at its start and near its end, we average the two circulation figures to get a single predictor per newspaper.

library(magrittr)  # provides the %<>% assignment pipe
library(dplyr)     # provides mutate()
pulitzer.data <- read.csv('https://raw.githubusercontent.com/WigodskyD/data-sets/master/pulitzer-circulation-data.csv', stringsAsFactors = FALSE)
# Strip thousands separators so the circulation columns parse as numbers
pulitzer.data[[2]] <- as.numeric(gsub(",", "", pulitzer.data[[2]]))
pulitzer.data[[3]] <- as.numeric(gsub(",", "", pulitzer.data[[3]]))
# Average the 2004 and 2013 daily circulation figures
pulitzer.data %<>%
    mutate(average.circulation = 0.5 * (Daily.Circulation..2004 + Daily.Circulation..2013))
linear.model <- lm(pulitzer.data$Pulitzer.Prize.Winners.and.Finalists..2004.2014 ~ pulitzer.data$average.circulation)
summary(linear.model)
## 
## Call:
## lm(formula = pulitzer.data$Pulitzer.Prize.Winners.and.Finalists..2004.2014 ~ 
##     pulitzer.data$average.circulation)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.768  -3.784  -1.949   0.766  39.626 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       7.405e-01  2.155e+00   0.344 0.732671    
## pulitzer.data$average.circulation 1.450e-05  3.723e-06   3.894 0.000304 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.69 on 48 degrees of freedom
## Multiple R-squared:  0.2401, Adjusted R-squared:  0.2242 
## F-statistic: 15.16 on 1 and 48 DF,  p-value: 0.0003045

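With a p-value of 0.0003045, the slope is significant at the 0.001 level, although circulation explains only about 24% of the variance in the counts (R-squared = 0.2401). The estimated slope of 1.45e-05 corresponds to roughly 1.45 additional winners and finalists per 100,000 in average daily circulation. We now check the residuals to determine whether a linear model is appropriate.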

plot(resid(linear.model))

The residual plot suggests that a simple linear model is not the most appropriate. The spread of the residuals starts out large and then tapers off, which points to non-constant variance.
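Plotting residuals against the observation index only reveals a pattern when the rows happen to be ordered by the predictor. A more common diagnostic plots residuals against fitted values. The following is a minimal sketch on simulated data; the data frame `sim` and its columns are hypothetical stand-ins, not the Pulitzer data.

```r
# Simulated stand-in data (hypothetical; not the Pulitzer data set):
# a count-like response whose spread grows with the predictor.
set.seed(1)
sim <- data.frame(circulation = seq(5e4, 2e6, length.out = 50))
sim$finalists <- rpois(50, lambda = 1 + 1.5e-05 * sim$circulation)

fit <- lm(finalists ~ circulation, data = sim)

# Residuals against fitted values: a funnel shape suggests
# non-constant variance.
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```

The same two plotting calls should work on `linear.model` by replacing `fit`.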

library(lmtest)  # provides bptest()
bptest(linear.model)
## 
##  studentized Breusch-Pagan test
## 
## data:  linear.model
## BP = 16.463, df = 1, p-value = 4.962e-05

With a p-value of 4.962e-05, the Breusch-Pagan test rejects the null hypothesis of constant error variance, so heteroskedasticity is indeed a problem.
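To make the test less of a black box, the studentized Breusch-Pagan statistic can be reproduced by hand: regress the squared residuals on the predictor and multiply the R-squared of that auxiliary regression by the sample size. The sketch below uses base R on simulated data; the variables `x` and `y` are hypothetical and are not taken from the Pulitzer data.

```r
# Simulated data (hypothetical) with noise that grows with x.
set.seed(2)
x <- seq(1, 10, length.out = 50)
y <- 2 + 0.5 * x + rnorm(50, sd = 0.3 * x)

fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on the same predictor.
aux <- lm(resid(fit)^2 ~ x)

# Studentized (Koenker) BP statistic: n * R^2 of the auxiliary regression,
# compared to a chi-squared distribution with one degree of freedom
# (one predictor besides the intercept).
bp_stat <- length(y) * summary(aux)$r.squared
bp_pval <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
c(BP = bp_stat, p.value = bp_pval)
```

A small p-value indicates that the squared residuals vary systematically with the predictor, which is the pattern the test reports for our model.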

caret::BoxCoxTrans(pulitzer.data$Pulitzer.Prize.Winners.and.Finalists..2004.2014)
## Box-Cox Transformation
## 
## 50 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    1.00    3.00    6.72    6.75   62.00 
## 
## Lambda could not be estimated; no transformation is applied

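The attempt at a Box-Cox transformation failed, most likely because the response includes zero counts (the minimum is 0) and Box-Cox requires strictly positive values. A more appropriate regression would need another method, such as weighted least squares or a log transformation that tolerates zeros, for example log(y + 1).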
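As a rough illustration of those two alternatives, the sketch below fits both on simulated data. The data frame `sim` is a hypothetical stand-in, and the weights `1 / circulation` are an assumption made only for illustration; a suitable variance model for the Pulitzer data would have to be checked against its residuals.

```r
# Simulated stand-in data (hypothetical; not the Pulitzer data set).
set.seed(3)
sim <- data.frame(circulation = seq(5e4, 2e6, length.out = 50))
sim$finalists <- rpois(50, lambda = 1 + 1.5e-05 * sim$circulation)

# Option 1: log(y + 1) keeps zero counts, which a plain log()
# or a Box-Cox transformation cannot handle.
log.fit <- lm(log1p(finalists) ~ circulation, data = sim)

# Option 2: weighted least squares, assuming (for illustration only)
# that the error variance is proportional to circulation,
# so each observation is weighted by 1 / circulation.
wls.fit <- lm(finalists ~ circulation, data = sim,
              weights = 1 / circulation)

coef(log.fit)
coef(wls.fit)
```

Either fit should be followed by the same residual checks used above, since neither transformation nor weighting is guaranteed to remove the heteroskedasticity.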