Examples of data

We’ll use the marketing data set from the datarium package, which records the amount of money spent on three advertising media (youtube, facebook and newspaper) together with the resulting sales.

#install.packages("datarium")
data("marketing", package = "datarium")
head(marketing, 4)
##   youtube facebook newspaper sales
## 1  276.12    45.36     83.04 26.52
## 2   53.40    47.16     54.12 12.48
## 3   20.64    55.08     83.16 11.16
## 4  181.80    49.56     70.20 22.20

Building model - simple linear regression with one covariate

We want to build a model for estimating sales based on the advertising budget invested in youtube, as follows:

\(sales = b_0 + b_1 \times youtube\)

You can compute the model coefficients in R as follows:

model <- lm(sales ~ youtube, data = marketing)
summary(model)
## 
## Call:
## lm(formula = sales ~ youtube, data = marketing)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.0632  -2.3454  -0.2295   2.4805   8.6548 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 8.439112   0.549412   15.36   <2e-16 ***
## youtube     0.047537   0.002691   17.67   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.91 on 198 degrees of freedom
## Multiple R-squared:  0.6119, Adjusted R-squared:  0.6099 
## F-statistic: 312.1 on 1 and 198 DF,  p-value: < 2.2e-16
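
With a single covariate, the least-squares estimates that lm() reports have a simple closed form: the slope is cov(x, y) / var(x) and the intercept is mean(y) - slope * mean(x). The sketch below checks this on a toy data frame holding only the four rows printed by head(marketing, 4) above, so its estimates will differ from the full-data fit. The names toy, b0_manual, b1_manual and from_lm are illustrative and not part of the original analysis.

```r
# Toy data: only the four rows shown by head(marketing, 4), not the full data set
toy <- data.frame(
  youtube = c(276.12, 53.40, 20.64, 181.80),
  sales   = c(26.52, 12.48, 11.16, 22.20)
)

# Closed-form least-squares estimates for a model with one covariate
b1_manual <- cov(toy$youtube, toy$sales) / var(toy$youtube)
b0_manual <- mean(toy$sales) - b1_manual * mean(toy$youtube)
manual <- c(b0_manual, b1_manual)

# The same estimates obtained from lm()
toy_model <- lm(sales ~ youtube, data = toy)
from_lm <- unname(coef(toy_model))

# The two rows should agree up to floating-point error
print(rbind(manual = manual, lm = from_lm))
```

Running the same two formulas on the full marketing data should reproduce the intercept and youtube estimates in the summary above.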

Interpretation

To see which predictor variables are significant, you can examine the coefficients table, which shows the estimates of the regression beta coefficients and the p-values of the associated t-statistics:

summary(model)$coefficients
##               Estimate  Std. Error  t value    Pr(>|t|)
## (Intercept) 8.43911226 0.549411528 15.36028 1.40630e-35
## youtube     0.04753664 0.002690607 17.66763 1.46739e-42

For a given predictor, the t-statistic evaluates whether or not there is a significant association between the predictor and the outcome variable, that is, whether the beta coefficient of the predictor is significantly different from zero.
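
As a rough check of how the table is built, the t value is the estimate divided by its standard error, and the p-value is the two-sided tail probability of a t distribution with the residual degrees of freedom. The sketch below recomputes both for youtube from the rounded numbers printed above, so the results agree with the table only up to rounding. The variable names are illustrative.

```r
# Values copied from the youtube row of the coefficients table above
estimate  <- 0.04753664
std_error <- 0.002690607
df_resid  <- 198  # residual degrees of freedom reported by summary(model)

# t-statistic: the estimate divided by its standard error
t_value <- estimate / std_error

# Two-sided p-value from the t distribution with df_resid degrees of freedom
p_value <- 2 * pt(abs(t_value), df = df_resid, lower.tail = FALSE)

print(c(t_value = t_value, p_value = p_value))
```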

It can be seen that changes in the youtube advertising budget are significantly associated with changes in sales.

For a given predictor variable, the coefficient (b) can be interpreted as the average effect on y of a one-unit increase in the predictor, holding all other predictors fixed.

The youtube coefficient suggests that for every 1000-dollar increase in the youtube advertising budget, we can expect an increase of about 0.048*1000 = 48 sales units, on average.

The confidence intervals of the model coefficients can be extracted as follows:

confint(model)
##                  2.5 %     97.5 %
## (Intercept) 7.35566312 9.52256140
## youtube     0.04223072 0.05284256
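
For a linear model, these limits are the estimate plus or minus a t quantile times the standard error. The sketch below rebuilds the 95% interval for youtube from the rounded estimate and standard error printed in the coefficients table, so it matches confint(model) only up to rounding. The variable names are illustrative.

```r
# Values copied from the youtube row of the coefficients table
estimate  <- 0.04753664
std_error <- 0.002690607
df_resid  <- 198  # residual degrees of freedom reported by summary(model)

# Critical value for a two-sided 95% interval
t_crit <- qt(0.975, df = df_resid)

# Interval: estimate plus or minus t_crit standard errors
ci <- c(lower = estimate - t_crit * std_error,
        upper = estimate + t_crit * std_error)
print(ci)
```

Because the interval excludes zero, it leads to the same conclusion as the t-test at the 5% level.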

Prediction

Let us predict the sales volume when the youtube advertising spend is 200, together with a 95% confidence interval for the mean sales at that budget, using the above simple linear regression model.

newData <- data.frame(youtube = 200)
predict(model, newdata = newData, interval = "confidence", level = 0.95)
##        fit      lwr      upr
## 1 17.94644 17.38703 18.50585
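
The fitted value is simply the regression equation evaluated at youtube = 200. The sketch below recomputes it by hand from the coefficient estimates printed earlier; the variable names are illustrative.

```r
# Coefficient estimates copied from the coefficients table
b0 <- 8.43911226   # intercept
b1 <- 0.04753664   # slope for youtube

# Budget at which to predict sales
youtube_budget <- 200

# Point prediction: sales = b0 + b1 * youtube
fit_manual <- b0 + b1 * youtube_budget
print(fit_manual)
```

The lwr and upr columns describe the uncertainty in the mean sales at this budget. For the range in which a single new observation is expected to fall, predict() also accepts interval = "prediction", which gives a wider interval.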