The price of advertising, and therefore advertising revenue, differs from newspaper to newspaper. Publishers conjecture that newspapers that reach more readers create more value for advertisers. Therefore, circulation is an important factor in determining the advertising revenue of a newspaper. Is this conjecture substantiated in practice? You are provided with a sample of the top 70 newspapers in the country, ranked by total gross advertising revenue, and are asked to address the following questions.
require(foreign)
## Loading required package: foreign
# reading the data
adv <- read.spss(file= 'C:\\Users\\kwnstantinos\\Desktop\\Regression Models R\\AdvertisingRevenue.sav', to.data.frame = TRUE)
## Warning in read.spss(file = "C:\\Users\\kwnstantinos\\Desktop\\Regression
## Models R\\AdvertisingRevenue.sav", : C:\Users\kwnstantinos\Desktop
## \Regression Models R\AdvertisingRevenue.sav: Unrecognized record type 7,
## subtype 18 encountered in system file
summary(adv)
## AdRevenue Circulation
## Min. : 61.1 Min. : 0.3310
## 1st Qu.:104.9 1st Qu.: 0.9922
## Median :133.8 Median : 1.6755
## Mean :171.1 Mean : 3.1185
## 3rd Qu.:179.4 3rd Qu.: 2.7433
## Max. :876.9 Max. :32.7000
knitr::kable(head(adv,10),align='c')
| AdRevenue | Circulation |
|---|---|
| 233.259 | 3.751 |
| 396.865 | 7.639 |
| 286.108 | 4.067 |
| 876.907 | 32.700 |
| 304.185 | 3.205 |
| 291.829 | 4.741 |
| 242.679 | 3.118 |
| 640.072 | 23.041 |
| 107.742 | 1.624 |
| 237.004 | 4.027 |
For the preliminary visual analysis we are going to use a scatterplot. On the Y-axis we are going to put the dependent variable which in our case is AdRevenue and on the X-axis the independent variable Circulation.
require(ggplot2)
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.2.3
g <- ggplot(adv, aes(y = AdRevenue, x = Circulation) )
g <- g + geom_point()
g <- g + ylab('Advertising revenue')
g <- g + geom_smooth(stat = 'smooth', method = 'lm', formula=y~x)
g
Our preliminary analysis shows a strong positive correlation between X (the predictor variable) and Y (the response variable). We can also see a few outliers that stand apart from the rest of the data. These outliers seem to follow the overall pattern of the data, so their influence does not appear to be large, but we are going to run additional tests to examine them further.
require(moments)
## Loading required package: moments
sk <- cbind(skewness(adv$AdRevenue), skewness(adv$Circulation))
kur <- cbind(kurtosis(adv$AdRevenue), kurtosis(adv$Circulation))
df <- rbind(sk,kur)
dsk <- data.frame(df)
colnames(dsk) <- c('AdvRevenue','Circulation')
rownames(dsk) <- c('Skewness', 'Kurtosis')
knitr::kable(dsk)
| | AdvRevenue | Circulation |
|---|---|---|
| Skewness | 3.35963 | 4.060297 |
| Kurtosis | 16.52011 | 19.846428 |

We then check our variables using descriptive statistics. Our analysis shows that both variables are strongly right-skewed (3.36 for AdRevenue and 4.06 for Circulation), so we are going to use a log transformation to cope with this problem. The skewness of the variables can also be easily observed using histograms.
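To make the two statistics concrete, here is a minimal sketch that computes them by hand in base R. It assumes the moment-based definitions (central moments with divisor n), which to our knowledge are the ones used by `moments::skewness` and `moments::kurtosis`. The helper names `moment_skewness` and `moment_kurtosis` are ours, and the data are only the first ten rows printed above, so the numbers will not match the full-sample table.

```r
# First ten rows of the sample, copied from the table printed above.
# Illustration only: the figures in the text use all 70 newspapers.
ad_revenue <- c(233.259, 396.865, 286.108, 876.907, 304.185,
                291.829, 242.679, 640.072, 107.742, 237.004)

# Moment-based skewness: m3 / m2^(3/2), central moments with divisor n.
# (Hypothetical helper, assumed to mirror moments::skewness.)
moment_skewness <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

# Moment-based (non-excess) kurtosis: m4 / m2^2.
# A normal distribution gives a value of 3 under this definition.
moment_kurtosis <- function(v) {
  d <- v - mean(v)
  mean(d^4) / mean(d^2)^2
}

raw_skew <- moment_skewness(ad_revenue)
log_skew <- moment_skewness(log(ad_revenue))
raw_kurt <- moment_kurtosis(ad_revenue)
log_kurt <- moment_kurtosis(log(ad_revenue))

# Compare the raw and log-transformed variable side by side.
print(rbind(skewness = c(raw = raw_skew, log = log_skew),
            kurtosis = c(raw = raw_kurt, log = log_kurt)))
```

Even on this small subsample the log transformation pulls the long right tail in, which is the effect we rely on in the next step.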
require(dplyr)
## Loading required package: dplyr
## Warning: package 'dplyr' was built under R version 3.2.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Transf.adv <- adv %>% mutate(logAdRevenue= log(AdRevenue), logCirculation= log(Circulation))
sk2 <- cbind(skewness(Transf.adv$logAdRevenue), skewness(Transf.adv$logCirculation))
kur2 <- cbind(kurtosis(Transf.adv$logAdRevenue), kurtosis(Transf.adv$logCirculation))
df2 <- rbind(sk2,kur2)
dsk2 <- data.frame(df2)
colnames(dsk2) <- c('log.AdvRevenue','logCirculation')
rownames(dsk2) <- c('Skewness', 'Kurtosis')
knitr::kable(dsk2)
| | log.AdvRevenue | logCirculation |
|---|---|---|
| Skewness | 1.168804 | 0.9978602 |
| Kurtosis | 4.788495 | 4.6309187 |

After transforming our variables the skewness is greatly reduced (1.169 for logAdRevenue and 0.998 for logCirculation). The kurtosis values fall as well (4.79 and 4.63, down from 16.52 and 19.85), which is expected, since the same extreme observations inflate both measures. We then redraw the scatterplot with the transformed variables, which also indicates a high correlation between the dependent and the independent variable.
require(ggplot2)
g <- ggplot(Transf.adv, aes(y = logAdRevenue, x = logCirculation) )
g <- g + geom_point()
g <- g + ylab('Logarithm Advertising revenue') + xlab('Logarithm of Circulation')
g <- g + geom_smooth(stat = 'smooth', method = 'lm', formula=y~x)
g
Our new scatterplot shows an improvement over the previous one, since the data points are no longer concentrated in one area but are spread out more evenly.
x <- Transf.adv$logCirculation
y <- Transf.adv$logAdRevenue
fit <- lm(y~x)
summary(fit)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.47022 -0.11142 -0.00532 0.10835 0.42705
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.67473 0.02525 185.16 <2e-16 ***
## x 0.52876 0.02356 22.44 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1768 on 68 degrees of freedom
## Multiple R-squared: 0.881, Adjusted R-squared: 0.8793
## F-statistic: 503.6 on 1 and 68 DF, p-value: < 2.2e-16
Looking at the model summary we can infer the following:
Using the F-test, we test the overall significance of the regression model. Specifically, it tests the null hypothesis that all of the regression coefficients other than the intercept are equal to zero. In other words, it compares the full model against a model with no predictors, in which the estimate of the dependent variable is simply its mean. Here the F-statistic is 503.6 on 1 and 68 degrees of freedom with a p-value below 2.2e-16, so we reject the null hypothesis: the slope is not zero, and the linear model is preferable to the mean-only model.
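As a quick sanity check, the F-statistic can be reproduced from other figures in the same summary. The sketch below uses only numbers copied from the printed output. With a single predictor, F equals the square of the slope's t value, and it can also be written in terms of R-squared. The variable names are ours.

```r
# Figures copied from the summary(fit) output above.
t_slope   <- 22.44   # t value of the slope
r_squared <- 0.881   # Multiple R-squared
df_resid  <- 68      # residual degrees of freedom (n - 2 with n = 70)

# With one predictor, the overall F-statistic equals t^2 of the slope.
f_from_t <- t_slope^2

# Equivalent form: F = (R^2 / 1) / ((1 - R^2) / df_resid).
f_from_r2 <- r_squared / ((1 - r_squared) / df_resid)

# Upper-tail p-value of the reported F = 503.6 on (1, 68) degrees of freedom.
p_value <- pf(503.6, df1 = 1, df2 = df_resid, lower.tail = FALSE)

# Both reconstructions agree with the reported 503.6 up to rounding.
print(c(f_from_t = f_from_t, f_from_r2 = f_from_r2, p_value = p_value))
```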
The regression model is \(\widehat{\log(Y)} = 4.675 + 0.529\,\log(X_1)\), where \(Y\) is AdRevenue, \(X_1\) is Circulation and \(\log\) is the natural logarithm.
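Because the model is fitted on the log scale, it helps to see what it implies on the original scale. The sketch below back-transforms the fitted line using the coefficients from the printed summary. `predict_ad_revenue` is a hypothetical helper of ours, not part of the original analysis. Note that exponentiating a prediction of log(AdRevenue) is a rough point estimate on the original scale, closer to a median than to a mean.

```r
# Coefficients copied from the summary(fit) output above (natural logs).
b0 <- 4.67473   # intercept
b1 <- 0.52876   # slope on log(Circulation)

# Hypothetical helper: back-transform the fitted line to the original scale,
# AdRevenue ~ exp(b0) * Circulation^b1.
predict_ad_revenue <- function(circulation) {
  exp(b0) * circulation^b1
}

# Predicted revenue for a few circulation values (same units as the data).
print(predict_ad_revenue(c(1, 2, 5, 10)))

# The intercept depends on the base of the logarithm, the slope does not.
# Had base-10 logs been used, the same fit would have this intercept:
b0_log10 <- b0 / log(10)
print(b0_log10)
```

The last line shows that an intercept of about 2.030 corresponds to the same model expressed in base-10 logarithms, while `log()` in R is the natural logarithm and gives 4.675.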
The p-values for both coefficients are below 2e-16, which is strong evidence against the null hypotheses \(\beta_0 = 0\) and \(\beta_1 = 0\). These p-values come from two separate t-tests, one per coefficient, and both indicate that the coefficients are significantly different from zero.
The R-squared shown above is the proportion of variance in the dependent variable (the logarithm of AdRevenue) that is explained by the independent variable (the logarithm of Circulation). It is equal to 0.881, meaning that 88.1% of the total variability in log AdRevenue is explained by the model. This is a high value, suggesting that circulation alone predicts advertising revenue well within this sample.
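To show where this number comes from, the following sketch computes R-squared by hand as one minus the ratio of residual to total sum of squares, and compares it with the value reported by `summary()` and with the squared correlation. It uses only the first ten rows printed earlier, so the value differs from the 0.881 of the full sample. The object names (`fit10`, `r2_manual`) are ours.

```r
# First ten rows of the sample, copied from the table printed earlier.
# Illustration only: the full analysis uses all 70 newspapers.
ad_revenue  <- c(233.259, 396.865, 286.108, 876.907, 304.185,
                 291.829, 242.679, 640.072, 107.742, 237.004)
circulation <- c(3.751, 7.639, 4.067, 32.700, 3.205,
                 4.741, 3.118, 23.041, 1.624, 4.027)

x <- log(circulation)
y <- log(ad_revenue)
fit10 <- lm(y ~ x)

# R-squared = 1 - RSS / TSS.
rss <- sum(residuals(fit10)^2)      # residual sum of squares
tss <- sum((y - mean(y))^2)         # total sum of squares around the mean
r2_manual <- 1 - rss / tss

# In simple regression R-squared also equals the squared correlation.
r2_summary <- summary(fit10)$r.squared
r2_cor     <- cor(x, y)^2

print(c(manual = r2_manual, summary = r2_summary, cor_squared = r2_cor))
```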
The Residuals vs Fitted plot shows whether the residuals have non-linear patterns. If there is a non-linear relationship between the predictor and the outcome variable that the model does not capture, the pattern shows up in this plot.
# for plotting all diagnostic plots
# for (i in 1:5){
# plot(fit, which=i)
# }
# Residuals vs Fitted
plot(fit,which = 1)
In our model the residuals are spread evenly around a horizontal line without distinct patterns, which is a good indication that no non-linear relationship has been left unmodelled. A few points look like outliers since they have larger residuals than the rest, but nothing too extreme.
Using the normal Q-Q plot we can check whether the residuals are normally distributed.
# Q-Q plot
plot(fit, which=2)
The residuals follow the straight line closely and do not deviate severely, with the exception of the 3 outliers that can also be seen in the Residuals vs Fitted plot.
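For readers who want to see what the Q-Q plot is made of, here is a minimal sketch that builds its coordinates by hand: the sorted standardized residuals are paired with the normal quantiles at the plotting positions given by `ppoints()`. We assume this mirrors what `plot(fit, which = 2)` draws. The data are again only the first ten rows printed earlier, and the object names are ours.

```r
# First ten rows of the sample, copied from the table printed earlier.
# Illustration only: the full analysis uses all 70 newspapers.
ad_revenue  <- c(233.259, 396.865, 286.108, 876.907, 304.185,
                 291.829, 242.679, 640.072, 107.742, 237.004)
circulation <- c(3.751, 7.639, 4.067, 32.700, 3.205,
                 4.741, 3.118, 23.041, 1.624, 4.027)
fit10 <- lm(log(ad_revenue) ~ log(circulation))

# Standardized residuals, sorted from smallest to largest.
std_res <- sort(rstandard(fit10))

# Theoretical normal quantiles at the usual plotting positions.
theo <- qnorm(ppoints(length(std_res)))

# Each row is one point of the Q-Q plot.
qq <- data.frame(theoretical = theo, sample = as.numeric(std_res))
print(qq)

# Points close to a straight line give a correlation close to 1.
qq_cor <- cor(qq$theoretical, qq$sample)
print(qq_cor)
```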
The Scale-Location plot, also called the Spread-Location plot, shows whether the residuals are spread equally along the range of the predictor. This is how we can check the assumption of equal variance (homoscedasticity).
# Scale-Location
plot(fit, which=3)
In our model we see a horizontal line with equally (randomly) spread points with the exception of a few outliers.
The Cook's distance plot helps us to find influential cases, if any. Not all outliers are influential in linear regression analysis. Even though the data may contain extreme values, these values might not be influential in determining the regression line, meaning that the results would not be much different whether we include or exclude them. Such points follow the trend of the majority of cases and do not really matter. On the other hand, some cases can be very influential even if they appear to lie within a reasonable range of values. They can sit far from the regression line and alter the results if we exclude them from the analysis.
# Cook's distance
plot(fit, which = 4)
In our case we can see that observations 4, 49 and 60 seem to be the most influential, with observation 4 being the most extreme. Similar results were observed in the Scale-Location plot and in the Residuals vs Leverage plot drawn below.
# Residuals vs Leverage
plot(fit, which = 5)
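The quantities behind the last two plots can also be tabulated directly, which makes it easier to see which observations stand out and why. The sketch below computes Cook's distance, leverage and standardized residuals with base R. The cut-offs `4 / n` and `2 * p / n` are common rules of thumb that we assume here for illustration; they are not part of the original analysis. The data are only the first ten rows printed earlier, so the flagged rows do not correspond to the observation numbers of the full sample.

```r
# First ten rows of the sample, copied from the table printed earlier.
# Illustration only: the full analysis uses all 70 newspapers.
ad_revenue  <- c(233.259, 396.865, 286.108, 876.907, 304.185,
                 291.829, 242.679, 640.072, 107.742, 237.004)
circulation <- c(3.751, 7.639, 4.067, 32.700, 3.205,
                 4.741, 3.118, 23.041, 1.624, 4.027)
fit10 <- lm(log(ad_revenue) ~ log(circulation))

n <- length(ad_revenue)       # number of observations
p <- length(coef(fit10))      # number of coefficients (intercept + slope)

diag_tbl <- data.frame(
  obs       = seq_len(n),
  cooks_d   = cooks.distance(fit10),   # overall influence on the fit
  leverage  = hatvalues(fit10),        # how unusual the predictor value is
  std_resid = rstandard(fit10)         # standardized residual
)

# Rule-of-thumb flags (assumed conventions, not from the original report).
diag_tbl$flag_cook     <- diag_tbl$cooks_d  > 4 / n
diag_tbl$flag_leverage <- diag_tbl$leverage > 2 * p / n

# Most influential observations first.
print(diag_tbl[order(-diag_tbl$cooks_d), ])
```

A point with a large residual but low leverage is an outlier that barely moves the line, whereas a point that is large on both counts is the kind of influential case described above.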
Our model \(\widehat{\log(Y)} = 4.675 + 0.529\,\log(X_1)\) has a lot of explanatory power (\(R^2 = 0.881\)). Because both variables are log-transformed, the slope is an elasticity: a 1% increase in circulation is associated with an increase of about 0.529% in advertising revenue. Some observations in our sample, such as 4, 60 and 64, should be given extra attention since they are distant from the other observations and can be considered outliers, but they do not appear to have a strong effect on the fitted model. In conclusion, our data and tests support the publishers' conjecture: circulation alone explains most of the variation in advertising revenue across these newspapers.
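The percentage interpretation of the slope is an approximation that works well for small changes. The short sketch below computes the exact implied change in advertising revenue for a given percentage change in circulation, using the slope from the printed summary. `pct_change_revenue` is a hypothetical helper of ours.

```r
# Slope copied from the summary(fit) output above.
b1 <- 0.52876

# Hypothetical helper: exact percentage change in predicted AdRevenue
# when Circulation changes by pct_circ percent, under the log-log model.
pct_change_revenue <- function(pct_circ) {
  100 * ((1 + pct_circ / 100)^b1 - 1)
}

# Effect of a 1%, 10% and 100% (doubling) increase in circulation.
changes <- c(one = pct_change_revenue(1),
             ten = pct_change_revenue(10),
             double = pct_change_revenue(100))
print(changes)
```

For a 1% increase the exact figure is very close to the slope itself, while for a doubling of circulation the implied increase in revenue is well below 100%, reflecting a slope smaller than one.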