Task 1: Correlation

  1. For the following sets of variables, produce a scatterplot and comment on the relationship between the variables (strength and direction of the relationship).

  1. Number of ads and number of impressions.
## Plot
plot(x = dat$number_ads, y = dat$impressions)

## Covariance
# Covariance using built-in function:
SxSy <- cov(dat$number_ads, dat$impressions)

# Covariance, hipster-style (computed by hand):
N <- nrow(dat)  # number of observations
RESx <- dat$number_ads  - mean(dat$number_ads)
RESy <- dat$impressions - mean(dat$impressions)
(SxSy <- sum(RESx * RESy) / (N - 1))
## [1] 5974098

## Correlation
# Built-in function
r <- cor(dat$number_ads, dat$impressions)

# Hipster version
(r <- SxSy/(sd(dat$number_ads)*sd(dat$impressions)))
## [1] 0.6975701

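There is a fairly strong positive correlation (r ≈ 0.70) between the two variables: as the number of ads on a webpage increases, the number of impressions tends to increase as well. The direction of causality, however, cannot be determined from the correlation alone.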
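
As a sanity check on the hand-computed formulas, the sketch below repeats both calculations on simulated data and compares them with the built-ins. The variables `ads` and `imps` are made-up stand-ins, not the `dat` used above, and the `cor.test()` call is an optional addition that reports a confidence interval for r.

```r
# Simulated stand-in data (hypothetical; not the dat used above)
set.seed(1)
n    <- 200
ads  <- rpois(n, lambda = 6)
imps <- 1000 * ads + rnorm(n, sd = 2000)

# Covariance by hand: sum of cross-products of deviations over (n - 1)
cov_hand <- sum((ads - mean(ads)) * (imps - mean(imps))) / (n - 1)

# Correlation by hand: covariance scaled by both standard deviations
r_hand <- cov_hand / (sd(ads) * sd(imps))

# Both should agree with the built-ins up to floating-point error
all.equal(cov_hand, cov(ads, imps))
all.equal(r_hand, cor(ads, imps))

# cor.test() adds a confidence interval and a p-value for r
ct <- cor.test(ads, imps)
ct$conf.int
```

Because the correlation is the covariance divided by both standard deviations, it is unit-free, which is why it is easier to compare across variable pairs than the raw covariance.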

  2. Number of impressions and number of clicks.
## Plot
plot(x = dat$impressions, y = dat$clicks)

## Covariance
# Covariance using built-in function:
SxSy <- cov(dat$clicks, dat$impressions)

# Covariance, hipster-style: 
RESx <- dat$clicks  - mean(dat$clicks)
RESy <- dat$impressions - mean(dat$impressions)
(SxSy <- sum(RESx * RESy) / (N - 1))
## [1] 122209677

## Correlation
# Built-in function
r <- cor(dat$clicks, dat$impressions)

# Hipster version
(r <- SxSy/(sd(dat$clicks)*sd(dat$impressions)))
## [1] 0.6717794

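There is a moderately strong positive correlation (r ≈ 0.67) between the two variables: as the number of impressions on a webpage increases, the number of clicks tends to increase as well. The direction of causality, however, cannot be determined from the correlation alone.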

Task 2: Simple linear regression

You are curious about the relationships discussed earlier in Task 1, so you decide to dive deeper.

  1. Run a linear regression on the two sets of variables and interpret the coefficients.

Number of ads - impressions

Assumptions check

Judging by the scatter plot and the model diagnostics below, a more suitable, more advanced regression model would be preferable; for simplicity, however, we start with a linear regression model.

x <- dat$number_ads
y <- dat$impressions

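## hipster style: solve the normal equations by hand
X <- cbind(1, x)  # design matrix with an intercept column

# Two equivalent ways to compute (X'X)^-1 X'y:
coeffs <- solve(t(X) %*% X) %*% (t(X) %*% y)
coeffs <- solve(crossprod(X)) %*% crossprod(X, y)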

lr <- lm(impressions ~ number_ads, data = dat)

summaryLR <- summary(lr); summaryLR
## 
## Call:
## lm(formula = impressions ~ number_ads, data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2589730 -1499885  -218481   805872  9582015 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2153615     100183  -21.50   <2e-16 ***
## number_ads    474641      13998   33.91   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1730000 on 1213 degrees of freedom
## Multiple R-squared:  0.4866, Adjusted R-squared:  0.4862 
## F-statistic:  1150 on 1 and 1213 DF,  p-value: < 2.2e-16
plot(y ~ x, pch = 16)
abline(lr, col = "darkred", lwd = 2)
legend("topleft", paste0("R-square = ", round(summaryLR$r.squared, 2)))

par(mfrow = c(2, 2))
plot(lr)
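
The hand-computed coefficients should match what `lm()` reports. The following is a minimal sketch on simulated data (the `x`, `y` and true coefficients here are invented for illustration) that solves the normal equations and compares the result with `coef()`.

```r
# Simulated data with a known intercept (3) and slope (2); hypothetical values
set.seed(2)
n <- 100
x <- runif(n, 1, 10)
y <- 3 + 2 * x + rnorm(n)

# Design matrix with an intercept column
X <- cbind(1, x)

# Solve (X'X) b = X'y as a linear system instead of inverting X'X explicitly
b_hand <- solve(crossprod(X), crossprod(X, y))

# lm() should give the same estimates up to floating-point error
fit <- lm(y ~ x)
all.equal(as.numeric(b_hand), as.numeric(coef(fit)))
```

Passing the right-hand side to `solve()` avoids forming the inverse, which is usually the numerically safer choice; `lm()` itself relies on a QR decomposition rather than the normal equations.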

Number of impressions - clicks

Assumptions check

Judging by the scatter plot and the model diagnostics below, a more suitable, more advanced regression model would be preferable; for simplicity, however, we start with a linear regression model.

y <- dat$clicks
x <- dat$impressions

reg <- lm(clicks ~ impressions, data = dat)

summaryLR <- summary(reg); summaryLR
## 
## Call:
## lm(formula = clicks ~ impressions, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -172.80   -1.75   -1.69   -0.76  828.21 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.684e+00  1.687e+00   0.998    0.318    
## impressions 2.097e-05  6.640e-07  31.585   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 55.85 on 1213 degrees of freedom
## Multiple R-squared:  0.4513, Adjusted R-squared:  0.4508 
## F-statistic: 997.6 on 1 and 1213 DF,  p-value: < 2.2e-16
plot(y ~ x, pch = 16)
abline(reg, col = "darkred", lwd = 2)
legend("topleft", paste0("R-square = ", round(summaryLR$r.squared, 2)))

par(mfrow = c(2, 2))
plot(reg)

  • Prediction:
  1. How many impressions would a website owner sell on average if she published 5 ad slots on her website?
num_ads <- 5 
lr$coefficients[1] + lr$coefficients[2]*num_ads
## (Intercept) 
##      219592
  2. How many clicks would a website owner generate on average if she sold 1M impressions on her website?
num_imps <- 1e+6
reg$coefficients[1] + reg$coefficients[2]*num_imps
## about 22.65, i.e. 1.684 + 2.097e-05 * 1e6 using the rounded estimates reported above
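
Spelling out the prediction formula by hand makes it easy to plug in the wrong variable. One way to guard against that, sketched here on simulated data (the `sim` data frame and its coefficients are invented for illustration), is to let `predict()` match predictors by name through `newdata`.

```r
# Simulated stand-in data (hypothetical; not the dat used above)
set.seed(3)
sim <- data.frame(impressions = runif(150, 0, 5e6))
sim$clicks <- 2e-05 * sim$impressions + rnorm(150, sd = 5)

fit <- lm(clicks ~ impressions, data = sim)

# predict() looks up the predictor by column name in newdata
new_site  <- data.frame(impressions = 1e6)
p_predict <- predict(fit, newdata = new_site)

# The same prediction from the coefficients directly
p_manual <- coef(fit)[1] + coef(fit)[2] * new_site$impressions

all.equal(unname(p_predict), unname(p_manual))
```

If `newdata` lacks a column named like the model's predictor, `predict()` stops with an error instead of silently using a different variable.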

Task 3: Multiple linear regression

  1. Run an appropriate test.
mr <- lm(clicks ~ impressions + number_ads, data = dat)
summary(mr)
## 
## Call:
## lm(formula = clicks ~ impressions + number_ads, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -170.32   -3.36   -1.44   -0.01  830.50 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.687e+00  3.800e+00  -0.444    0.657    
## impressions  2.033e-05  9.267e-07  21.941   <2e-16 ***
## number_ads   6.243e-01  6.305e-01   0.990    0.322    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 55.85 on 1212 degrees of freedom
## Multiple R-squared:  0.4517, Adjusted R-squared:  0.4508 
## F-statistic: 499.3 on 2 and 1212 DF,  p-value: < 2.2e-16
  1. Compare the obtained results with above regression 2 results and comment on whether the results are improved.
  2. Interpret the effect of number of impressions on number of clicks.

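The effect of the number of impressions remains significant and positive when the number of ads is included in the regression, although the estimate becomes slightly smaller (2.033e-05 versus 2.097e-05, i.e. roughly 20 additional clicks per million impressions, holding the number of ads fixed). The number of ads has no significant effect on the number of clicks (p = 0.322), despite its significant effect on the number of impressions. The fit is essentially unchanged: R-squared moves from 0.4513 to 0.4517 and adjusted R-squared stays at 0.4508, so adding the number of ads does not improve the model.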
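
A comparison of nested models like this one can also be made formally. The sketch below uses simulated data in which the number of ads affects clicks only through impressions (all names and values are hypothetical) and compares the two models with adjusted R-squared and a partial F-test via `anova()`.

```r
# Simulated stand-in data (hypothetical; not the dat used above)
set.seed(4)
n      <- 300
ads    <- rpois(n, 6)
imps   <- pmax(0, 4e5 * ads + rnorm(n, sd = 5e5))
clicks <- 2e-05 * imps + rnorm(n, sd = 20)  # ads has no direct effect by construction
sim    <- data.frame(ads, imps, clicks)

small <- lm(clicks ~ imps, data = sim)
big   <- lm(clicks ~ imps + ads, data = sim)

# Adjusted R-squared penalises the extra predictor
c(small = summary(small)$adj.r.squared, big = summary(big)$adj.r.squared)

# Partial F-test: does adding ads reduce the residual sum of squares significantly?
cmp <- anova(small, big)
cmp
```

With a single added predictor, the p-value of this F-test equals the p-value of the t-test for that predictor in the larger model, so the two views always agree.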

  1. How many clicks would a website owner generate on average if she published 5 ad slots and sold 1M impressions on her website?
mr$coefficients[1] + mr$coefficients[2]*num_imps + mr$coefficients[3]*num_ads
## (Intercept) 
##    21.76686