Comment on the relationship between the variables (strength and direction of the relationship).
## Plot
plot(x = dat$number_ads, y = dat$impressions)
## Covariance
# Covariance using built-in function:
SxSy <- cov(dat$number_ads, dat$impressions)
# Covariance, hipster-style:
RESx <- dat$number_ads - mean(dat$number_ads)
RESy <- dat$impressions - mean(dat$impressions)
(SxSy <- sum(RESx * RESy)/(length(RESx) - 1))
## [1] 5974098
## Correlation
# Built-in function
r <- cor(dat$number_ads, dat$impressions)
# Hipster version
(r <- SxSy/(sd(dat$number_ads)*sd(dat$impressions)))
## [1] 0.6975701
There is a fairly strong positive correlation (r ≈ 0.70) between the two variables: as the number of ads on a webpage increases, the number of impressions tends to increase as well. The causal direction, however, cannot be determined from the correlation alone.
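As a sanity check on the hand-computed formulas, the sketch below recomputes covariance and correlation on simulated data and compares them with `cov()` and `cor()`. The data are made up for illustration: `number_ads` and `impressions` here are simulated stand-ins, not the columns of `dat`. It also shows that the correlation is symmetric in its arguments and unaffected by rescaling, which is why it cannot say by itself which variable drives the other.

```r
# Simulated stand-in data (assumption: NOT the original `dat`)
set.seed(1)
n <- 200
number_ads  <- rpois(n, lambda = 6)
impressions <- 400000 * number_ads + rnorm(n, sd = 1e6)

# Covariance by hand: cross-products of deviations from the means, over n - 1
cov_manual <- sum((number_ads - mean(number_ads)) *
                  (impressions - mean(impressions))) / (n - 1)

# Correlation: covariance rescaled by both standard deviations
r_manual <- cov_manual / (sd(number_ads) * sd(impressions))

# Both agree with the built-in functions
stopifnot(isTRUE(all.equal(cov_manual, cov(number_ads, impressions))))
stopifnot(isTRUE(all.equal(r_manual, cor(number_ads, impressions))))

# Swapping the arguments gives the same correlation
stopifnot(isTRUE(all.equal(cor(impressions, number_ads), r_manual)))

# Changing units changes the covariance but leaves the correlation untouched
stopifnot(isTRUE(all.equal(cor(number_ads, impressions / 1000), r_manual)))
```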
## Plot
plot(x = dat$impressions, y = dat$clicks)
## Covariance
# Covariance using built-in function:
SxSy <- cov(dat$clicks, dat$impressions)
# Covariance, hipster-style:
RESx <- dat$clicks - mean(dat$clicks)
RESy <- dat$impressions - mean(dat$impressions)
(SxSy <- sum(RESx * RESy)/(length(RESx) - 1))
## [1] 122209677
## Correlation
# Built-in function
r <- cor(dat$clicks, dat$impressions)
# Hipster version
(r <- SxSy/(sd(dat$clicks)*sd(dat$impressions)))
## [1] 0.6717794
There is a fairly strong positive correlation (r ≈ 0.67) between the two variables: as the number of impressions on a webpage increases, the number of clicks tends to increase as well. The causal direction, however, cannot be determined from the correlation alone.
You are curious about the relationships discussed earlier in Task 1, so you decide to dive deeper.
Assumptions check
Judging by the scatter plot and the model diagnostics below, a more appropriate, more advanced regression model would be preferable; for simplicity, however, we start with a linear regression model.
x <- dat$number_ads
y <- dat$impressions
## hipster style: solve the normal equations (X'X) b = X'y
X <- cbind(1, x)
coeffs <- solve(t(X) %*% X) %*% (t(X) %*% y)
## equivalent, using crossprod():
coeffs <- solve(crossprod(X)) %*% crossprod(X, y)
lr <- lm(impressions ~ number_ads, data = dat)
summaryLR <- summary(lr); summaryLR
##
## Call:
## lm(formula = impressions ~ number_ads, data = dat)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -2589730 -1499885  -218481   805872  9582015
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2153615     100183  -21.50   <2e-16 ***
## number_ads    474641      13998   33.91   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1730000 on 1213 degrees of freedom
## Multiple R-squared: 0.4866, Adjusted R-squared: 0.4862
## F-statistic: 1150 on 1 and 1213 DF, p-value: < 2.2e-16
plot(y ~ x, pch = 16)
abline(lr, col = "darkred", lwd = 2)
legend("topleft", paste0("R-square = ", round(summaryLR$r.squared, 2)))
par(mfrow = c(2, 2))
plot(lr)
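The hand-rolled estimate solves the normal equations, so it should reproduce the coefficients that `lm()` reports. A minimal sketch of that check follows, on simulated data (the values of `x` and `y` are invented stand-ins for the number of ads and impressions, not the original `dat`). It also confirms that in a one-predictor regression R-squared equals the squared correlation, which ties the 0.4866 above to r ≈ 0.70 from Task 1.

```r
# Simulated stand-in data (assumption: NOT the original `dat`)
set.seed(42)
n <- 300
x <- rpois(n, lambda = 6)                       # hypothetical number of ads
y <- -2e6 + 4.5e5 * x + rnorm(n, sd = 1.5e6)    # hypothetical impressions

# Design matrix with an intercept column
X <- cbind(1, x)

# Solve (X'X) b = X'y directly instead of inverting X'X first
beta_hat <- solve(crossprod(X), crossprod(X, y))

fit <- lm(y ~ x)

# Same coefficients as lm(), up to floating-point error
stopifnot(isTRUE(all.equal(as.numeric(beta_hat), unname(coef(fit)))))

# With a single predictor, R-squared is the squared Pearson correlation
stopifnot(isTRUE(all.equal(summary(fit)$r.squared, cor(x, y)^2)))
```

Passing the right-hand side to `solve()` avoids forming the explicit inverse; `lm()` itself uses a QR decomposition, which is numerically more stable when predictors are nearly collinear.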
Assumptions check
Judging by the scatter plot and the model diagnostics below, a more appropriate, more advanced regression model would be preferable; for simplicity, however, we start with a linear regression model.
y <- dat$clicks
x <- dat$impressions
reg <- lm(clicks ~ impressions, data = dat)
summaryLR <- summary(reg); summaryLR
##
## Call:
## lm(formula = clicks ~ impressions, data = dat)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -172.80   -1.75   -1.69   -0.76  828.21
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.684e+00  1.687e+00   0.998    0.318
## impressions 2.097e-05  6.640e-07  31.585   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 55.85 on 1213 degrees of freedom
## Multiple R-squared: 0.4513, Adjusted R-squared: 0.4508
## F-statistic: 997.6 on 1 and 1213 DF, p-value: < 2.2e-16
plot(y ~ x, pch = 16)
abline(reg, col = "darkred", lwd = 2)
legend("topleft", paste0("R-square = ", round(summaryLR$r.squared, 2)))
par(mfrow = c(2, 2))
plot(reg)
num_ads <- 5
lr$coefficients[1] + lr$coefficients[2]*num_ads
## (Intercept)
## 219592
num_imps <- 1e+6
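# Predicted clicks for one million impressions
reg$coefficients[1] + reg$coefficients[2]*num_imps
## (Intercept)
## about 22.65 (1.684 + 2.097e-05 * 1e6, from the rounded coefficients above)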
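Point predictions like these can also be obtained with `predict()`, which matches predictors by column name and so makes it harder to plug the wrong variable into the formula. The sketch below uses simulated data (`dat_sim` and its coefficients are invented for illustration, not the original `dat`) and checks that `predict()` agrees with the manual intercept-plus-slope calculation.

```r
# Simulated stand-in data (assumption: NOT the original `dat`)
set.seed(7)
n <- 300
dat_sim <- data.frame(impressions = runif(n, 0, 5e6))
dat_sim$clicks <- 2 + 2e-5 * dat_sim$impressions + rnorm(n, sd = 20)

fit <- lm(clicks ~ impressions, data = dat_sim)

# New observation: columns must carry the same names as the model's predictors
new_obs <- data.frame(impressions = 1e6)
pred <- predict(fit, newdata = new_obs)

# Manual version: intercept + slope * value
manual <- coef(fit)[["(Intercept)"]] + coef(fit)[["impressions"]] * 1e6
stopifnot(isTRUE(all.equal(unname(pred), manual)))

# A prediction interval conveys how uncertain a single new observation is
predict(fit, newdata = new_obs, interval = "prediction")
```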
mr <- lm(clicks ~ impressions + number_ads, data = dat)
summary(mr)
##
## Call:
## lm(formula = clicks ~ impressions + number_ads, data = dat)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -170.32   -3.36   -1.44   -0.01  830.50
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.687e+00  3.800e+00  -0.444    0.657
## impressions  2.033e-05  9.267e-07  21.941   <2e-16 ***
## number_ads   6.243e-01  6.305e-01   0.990    0.322
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 55.85 on 1212 degrees of freedom
## Multiple R-squared: 0.4517, Adjusted R-squared: 0.4508
## F-statistic: 499.3 on 2 and 1212 DF, p-value: < 2.2e-16
The effect of the number of impressions remains positive and significant when the number of ads is included in the regression, although its estimate becomes slightly smaller (2.033e-05 versus 2.097e-05). The number of ads has no significant effect on the number of clicks once impressions are controlled for (p = 0.322), despite its significant effect on the number of impressions.
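One pattern that would produce this result is ads influencing clicks only through impressions. The sketch below simulates that scenario; it is an illustration under assumed coefficients and noise levels, not a claim about how the original data were generated. In the simulation, `number_ads` has a clear slope when it is the only predictor, but its coefficient shrinks towards zero once `impressions` is in the model, while the `impressions` coefficient stays near its true value.

```r
# Simulated data (assumption: ads affect clicks ONLY via impressions)
set.seed(123)
n <- 1000
number_ads  <- rpois(n, lambda = 6)
impressions <- 4e5 * number_ads + rnorm(n, sd = 1e6)   # ads drive impressions
clicks      <- 2e-5 * impressions + rnorm(n, sd = 20)  # no direct ads term
sim <- data.frame(number_ads, impressions, clicks)

# Ads alone: picks up the indirect effect (roughly 4e5 * 2e-5 = 8 per ad)
simple <- lm(clicks ~ number_ads, data = sim)

# Ads plus impressions: the ads coefficient now estimates the direct effect
multiple <- lm(clicks ~ impressions + number_ads, data = sim)

coef(simple)[["number_ads"]]      # marginal slope of ads
coef(multiple)[["number_ads"]]    # partial slope of ads, given impressions
coef(multiple)[["impressions"]]   # partial slope of impressions

# Full coefficient table, including standard errors and p-values
summary(multiple)$coefficients
```

The simulated impressions are not constrained to be non-negative, which is acceptable for illustrating the regression logic but would not be for a realistic data model.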
mr$coefficients[1] + mr$coefficients[2]*num_imps + mr$coefficients[3]*num_ads
## (Intercept)
## 21.76686