INTRODUCTION:
OBJECTIVE:
DATA DESCRIPTION AND DATA PREPROCESSING:
- Checking for Missing Values:
EXPLORATORY DATA ANALYSIS:
MODEL FITTING:
- DETECTION OF OUTLIERS AND INFLUENTIAL POINTS:
  - Detection of influential points by cook’s distance:
  - Detection of outliers using studentized residuals:
- TRAIN AND TEST DATA:
CHECKING MODEL ASSUMPTIONS:
INFERENTIAL DATA ANALYSIS
CONCLUSION

INTRODUCTION:

Advertising aims to to put a product or service in the spotlight in the hope of grabbing the attention of the potential customers. This is mainly done in order to attract interest, engagement and sales. Therefore, money needs to be spent wisely on advertising in order to reach more potential buyers so that the sales of the product increases. Therefore, studies are conducted by corporate firms so to find a perfect combination of amount spent on promotions and number of units of product sales so that a company can take decisions to gain and increase its sales.

OBJECTIVE:

The primary objective of this study is to analyze the impact of different advertising channels (TV, Billboards, Google Ads, Social Media, Influencer Marketing, and Affiliate Marketing) on product sales. This study aims to determine the statistical significance of these relationships and develop a predictive model to forecast future sales. The goal of the study is to provide actionable insights for optimizing advertising budgets. This will help businesses enhance their marketing strategies and improve sales performance.

DATA DESCRIPTION AND DATA PREPROCESSING:

The data set is obtained from kaggle-Product Advertising Data.

The data set contains information of advertising impact on product sales. The data contains 300 observations.

• TV: Advertising costs on television.

• Billboards: Advertising costs associated with billboards.

• Google_Ads: Advertising costs incurred on Google Ads.

• Social_Media: Advertising expenses on various social media platforms.

• Influencer_Marketing: Costs related to influencer marketing.

• Affiliate_Marketing: Expenditure on affiliate marketing efforts.

• Product_Sold: Number of units sold corresponding to the advertising costs on various platforms.

At first, we import the data set. Then we use str() to find the structure of the data set and information about the class, length and content of each column.

'data.frame':   300 obs. of  7 variables:
 $ TV                  : num  281 703 313 899 767 ...
 $ Billboards          : num  538.8 296.5 295.9 61.3 550.7 ...
 $ Google_Ads          : num  124 558 643 549 652 ...
 $ Social_Media        : num  349 181 506 241 666 ...
 $ Influencer_Marketing: num  243 781 439 279 396 ...
 $ Affiliate_Marketing : num  910 132 464 432 842 ...
 $ Product_Sold        : num  7164 5055 6154 5480 9669 ...

All of the variables are of integer data type.

Now, we take a look at the first few rows of the data set.

TV	Billboards	Google_Ads	Social_Media	Influencer_Marketing	Affiliate_Marketing	Product_Sold
281.42	538.80	123.94	349.30	242.77	910.10	7164
702.97	296.53	558.13	180.55	781.06	132.43	5055
313.14	295.94	642.96	505.71	438.91	464.23	6154
898.52	61.27	548.73	240.93	278.96	432.27	5480
766.52	550.72	651.91	666.33	396.33	841.93	9669
507.13	612.27	230.67	142.96	171.79	965.77	7627
486.64	555.02	203.26	271.62	70.04	366.25	5177
762.09	184.57	176.61	97.85	116.67	251.74	3726
638.60	778.31	796.32	759.04	857.13	515.16	9801
591.48	542.51	400.23	329.15	577.38	931.65	8652

Here, Product_Sold is the response variable.

Our objective is to determine how the regressors are able to explain sales of the product by building up a regression model of Product_Sales on the regressors.

Checking for Missing Values:

For successful data analysis, it is needed to check whether there are any missing values in the data set or not since missing information may lead to erroneous conclusions. If there are missing observations, the rows or columns containing missing values may be deleted or impute the missing value with a constant or some statistics like mean, median or mode of each column in which the missing value is located.

                  TV           Billboards           Google_Ads 
                   0                    0                    0 
        Social_Media Influencer_Marketing  Affiliate_Marketing 
                   0                    0                    0 
        Product_Sold 
                   0

It is found that our data set contains no missing information. Therefore, the analysis can be proceeded.

EXPLORATORY DATA ANALYSIS:

We conduct some initial investigations on the data to discover patterns, to spot anomalies, to test hypotheses and to check assumptions with the help of summary statistics and graphical representations.

Firstly, we use summary() to calculate the summary statistics for each column.

	TV	Billboards	Google_Ads	Social_Media	Influencer_Marketing	Affiliate_Marketing	Product_Sold
X	Min. : 1.04	Min. : 3.63	Min. : 14.86	Min. : 11.69	Min. : 0.77	Min. : 6.74	Min. : 2259
X.1	1st Qu.:273.09	1st Qu.:277.91	1st Qu.:250.53	1st Qu.:265.61	1st Qu.:214.48	1st Qu.:267.68	1st Qu.: 5922
X.2	Median :513.97	Median :533.02	Median :528.97	Median :486.38	Median :480.36	Median :451.31	Median : 7051
X.3	Mean :517.43	Mean :502.64	Mean :512.44	Mean :489.80	Mean :465.73	Mean :484.43	Mean : 7032
X.4	3rd Qu.:774.27	3rd Qu.:745.00	3rd Qu.:763.35	3rd Qu.:695.12	3rd Qu.:703.09	3rd Qu.:721.07	3rd Qu.: 8278
X.5	Max. :998.10	Max. :995.32	Max. :999.23	Max. :996.16	Max. :999.83	Max. :987.58	Max. :12227

Now, looking at the distribution of each of the predictors.

Next, looking at the distribution of the response variable Product_Sold.

Visualizing the relationship between amount on advertising on units sold and the covariates.

Also getting a clear idea about the relationship of the response value with the covariates by observing of the correlation matrix.

	TV	Billboards	Google_Ads	Social_Media	Influencer_Marketing	Affiliate_Marketing	Product_Sold
TV	1.00	-0.03	0.03	-0.04	0.01	0.09	0.37
Billboards	-0.03	1.00	0.05	0.05	-0.01	-0.04	0.48
Google_Ads	0.03	0.05	1.00	0.04	-0.06	-0.13	0.20
Social_Media	-0.04	0.05	0.04	1.00	-0.04	-0.02	0.40
Influencer_Marketing	0.01	-0.01	-0.06	-0.04	1.00	-0.05	0.14
Affiliate_Marketing	0.09	-0.04	-0.13	-0.02	-0.05	1.00	0.61
Product_Sold	0.37	0.48	0.20	0.40	0.14	0.61	1.00

Summary of analysis:

• Response: The histogram of Product_Sold is slightly negatively distributed. The box-plot shows the presence of outliers. The median number of units of products sold is approximately 7100.

• Regressors: The variable TV has a very slight positive skewness. The median expenditure of TV is nearly 500. The variable Billboards has a very slight negative skewness. The median expenditure of Billboards is nearly 530. The variable Google_Ads has a very slight negative skewness. The median expenditure of Google_Ads is nearly 530. The variable Social_Media has a very slight positive skewness. The median expenditure of Social_Media is nearly 490. The variable Influencer_Marketing has a slight positive skewness. The median expenditure of Influencer_Marketing is nearly 480. The variable Affiliate_Marketing has a slight positive skewness. The median expenditure of Affiliate_Marketing is nearly 450.

• Correlation: There is positive correlation between the response and the covariates. Among them, the correlation between the response and Affiliate_Marketing is maximum.

MODEL FITTING:

First, we regress the response against each predictor.

We define a multiple linear regression model as:

$Y_{i}=\beta_{0}+\beta_{1}x_{i1}+\beta_{2}x_{i2}+\beta_{3}x_{i3}+\beta_{4}x_{i4}+\beta_{5}x_{i5}+\beta_{6}x_{i6}+\epsilon_{i}\forall i=1(1)300$ under the assumptions,

• $E(\epsilon_{i})=0\forall i=1(1)300$

• $Var(\epsilon_{i})=\sigma^{2}\forall i=1(1)300$

• $cov(\epsilon_{i},\epsilon_{j})=0\forall i\ne j$

• $\epsilon_{i}\sim N(0,\sigma^{2})\forall i=1(1)300$

Here the regression equation is of the form:

Product_Sold = $\beta_{0}$+$\beta_{1}$TV+$\beta_{2}$Billboards+$\beta_{3}$Google_Ads+$\beta_{4}$Social_Media+$\beta_{5}$Influencer_Marketing+$\beta_{6}$Affiliate_Marketing+$\epsilon_{i}\forall i=1(1)300$ where $(\beta_{1},\beta_{2},\beta_{3},\beta_{4},\beta_{5},\beta_{6})$ are the coefficients of the effects of the covariates and $\beta_{0}$ is the intercept. $\beta_{i}^{'}s\forall i=1(1)6$ are the average changes in number of units sold corresponding to a unit change in ith covariate. $\beta_{0}$ is the average number of units sold when none of the advertisements are used.

Fitting a linear model by ordinary least square:


Call:
lm(formula = Product_Sold ~ ., data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-31.751  -7.248  -0.438   6.874  28.443 

Coefficients:
                     Estimate Std. Error  t value Pr(>|t|)    
(Intercept)          0.088695   2.722564    0.033    0.974    
TV                   2.001110   0.002092  956.708   <2e-16 ***
Billboards           2.998039   0.002180 1375.489   <2e-16 ***
Google_Ads           1.499743   0.002128  704.836   <2e-16 ***
Social_Media         2.500021   0.002195 1138.719   <2e-16 ***
Influencer_Marketing 1.199844   0.002087  574.871   <2e-16 ***
Affiliate_Marketing  3.998916   0.002189 1827.081   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.36 on 293 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:      1 
F-statistic: 1.347e+06 on 6 and 293 DF,  p-value: < 2.2e-16

DETECTION OF OUTLIERS AND INFLUENTIAL POINTS:

Now we find whether are any anomalies in the data set.

Detection of influential points by cook’s distance:

Cook’s Distance measures the change in distance in the fitted regression line if an observation is deleted from the regression equation. It therefore combines the outlier and leverage point diagnostics of a measure. The Cook’s Distance statistic is

$D_{i}=\frac{\Sigma_{j=1}^{n}(\hat{y_j}-\hat{y}_{j(i)})^{2}}{ps^{2}}$ where $s^{2}$ is the Mean Squared Error and $\hat{y}_{j(i)}$ is the fitted response value after deleting the ith observation.

If $D_{i}>\frac{4}{n}$ where n is the number of observations then $D_{i}$is tagged as an influential point.

 [1] "20"  "27"  "42"  "48"  "91"  "133" "152" "164" "186" "195" "216" "224"
[13] "272" "283"

The red line denotes the cut off. There are 14 outliers.

Detection of outliers using studentized residuals:

Studentized residuals are a type of standardized residual used in regression analysis to assess the fit of a model. These help in identifying outliers and influential data points. The studentized residual statistic is $e_{i}^{s}=\frac{e_{i}}{\hat\sigma_{(i)} {\sqrt{1-h_{ii}}}}$ where

• $e_{i}=y_{i}-\hat{y}_{i}$ is the value of the ith residual (the difference between the observed value and the predicted value).

• $\hat\sigma_{(i)}$ is the standard deviation of the residuals calculated without the ith observation.

• $h_{ii}$ is the leverage of the ith observation, a measure of the influence of the ith data point on the fitted value.

At 5% level of significance, if $e_{i}^{s}>2$ then the ith observation can be tagged as an outlier.

[1] "3"   "27"  "48"  "186" "259" "283"

The green line denotes the cut off. There are 6 outliers.

Removing the points (which are both influential as well as outlier) from the dataset to clean the dataset and make the data ready for further analysis.

Original dimension: 300 7

New dimension: 296 7

The original dimension shrinks.

TRAIN AND TEST DATA:

To find the model efficiency, we divide the whole data set into two parts:

• training data set - subset to train a model

• test data set - subset to test the trained model.

Here, we divide the data into training and testing data set in the ratio 80:20.

Now, firstly a multiple linear regression is fitted on the regressors using the training data set.


Call:
lm(formula = y.train ~ ., data = X.train)

Residuals:
     Min       1Q   Median       3Q      Max 
-29.6141  -6.7985  -0.5287   6.8264  29.9193 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)          -0.029360   2.998502   -0.01    0.992    
TV                    2.003122   0.002203  909.13   <2e-16 ***
Billboards            2.997793   0.002299 1303.85   <2e-16 ***
Google_Ads            1.496227   0.002302  649.96   <2e-16 ***
Social_Media          2.499363   0.002354 1061.83   <2e-16 ***
Influencer_Marketing  1.200696   0.002235  537.34   <2e-16 ***
Affiliate_Marketing   3.999571   0.002376 1683.49   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.849 on 229 degrees of freedom
  (4 observations deleted due to missingness)
Multiple R-squared:      1, Adjusted R-squared:      1 
F-statistic: 1.128e+06 on 6 and 229 DF,  p-value: < 2.2e-16

The RSE improves in the training model more than the previous model.

Now, considering the prediction accuracy of the model by predicting on the test dataset:

R^2: 0.9999633

RSE: 4.843545

The model performs very well on the test data set as the Adjusted $R^{2}$ is nearly 1 and the RSE is quite small.

CHECKING MODEL ASSUMPTIONS:

Heteroscedasticity:

In this model, errors might be heteroscedastic because larger budgets can lead to more varied results, different types of ads have different effects, market conditions can change unpredictably, and seasonal factors affect performance. This causes errors to have uneven spread or variability.

The residuals roughly show a constant horizontal band around the 0 line suggesting that the variances of the error terms are equal, thereby indicating no heteroscedasticity in the model.

Linearity:

The absence of patterns in the residuals in the plot, which are randomly dispersed around the 0 line, supports the assumption of linearity.

Autocorrelation:

The advertising data may contain correlated variables because spending on different channels often moves together, external factors like seasons affect all channels and campaigns target similar audiences. These patterns can make the data seem linked or correlated.

Autocorrelation is defined as the correlation between the members of a series of observations. We need to test if $cov(\epsilon_{i},\epsilon_{j})\ne0\forall i\ne j$

We use durbin watson test for detecting autocorrelation: $d=\frac{\sum_{i=2}^{n}(\epsilon_{i}-\epsilon_{i-1})^{2}}{\sum_{i=1}^{n}\epsilon_{i}^{2}}$

The following assumptions are made to use the statistic d:

• Model includes intercept term

• Explanatory variables are non stochastic

• $\epsilon_{i}'s$ are generated from AR(1) model, i.e., ${i}={i-1}+u_{i}i=1(1)300. $

• $\epsilon_{i}\sim N(0,\sigma^{2})\forall i=1(1)300$

• No missing observations

Sample correlation estimate: $\hat{\rho}=\frac{\sum_{i=1}^{n}\hat{\epsilon_{i}}\hat{\epsilon}_{i-1}}{\sqrt{\sum_{i=2}^{n}\hat{\epsilon}_{i-1}^{2}\sum_{i=1}^{n}\hat{\epsilon}_{i}^{2}}}$

Assuming $\sum_{i=1}^{n}\hat{\epsilon}_{i}^{2}\approx\sum_{i=2}^{n}\hat{\epsilon}_{i-1}^{2}$,we have $d=2(1-\hat{\rho})$

We want to test $H_{0}:\hat{\rho}=0$ against $H_{1}:\hat{\rho}\ne0$

If $d$ value turns out to be near $2$ then there is no autocorrelation.

Loading required package: carData

 lag Autocorrelation D-W Statistic p-value
   1     -0.03409199      2.066596   0.552
 Alternative hypothesis: rho != 0

The value of Durbin Watson test statistic d turns out to be 1.922939 and p-value is 0.574 (>0.05). Thus, the null hypothesis cannot be rejected at 5% level of significance and conclude that there is no autocorrelation in error terms.

Multicollinearity:

Advertising data often exhibits multicollinearity due to correlated spending across various channels, common external influences, overlapping audiences and synergistic effects between channels. These factors cause variables to move together, complicating regression analysis and leading to unreliable coefficient estimates.

Multicollinearity means the existence of perfect relationship among all explanatory variables in a regression model. In this model, an exact relationship is said to exist if the following condition is satisfied: $\beta_{1}x_{i1}+\beta_{2}x_{i2}+\beta_{3}x_{i3}+\beta_{4}x_{i4}+\beta_{5}x_{i5}+\beta_{6}x_{i6}=0$ where not all coefficients are simultaneously zero. In terms of linear algebra, we explore an issue of multicollinearity if exact linear relationship among the regressors, i.e., at least one column of X will be linear combination of the others and Rank(X) will not be of full column rank and as a result X’X will not be invertible.

In order to detect multicollinearity, we use a standard measure known as Variance Inflation Factor (VIF).

In the model, $Y_{i}=\beta_{0}+\beta_{1}x_{i1}+\beta_{2}x_{i2}+\beta_{3}x_{i3}+\beta_{4}x_{i4}+\beta_{5}x_{i5}+\beta_{6}x_{i6}+\epsilon_{i}\forall i=1(1)300$, the VIF of the regressor of the jth regressor is defined as: $VIF_{j}=\frac{1}{1-R_{(j)}^{2}}$ where $R_{(j)}^{2}$ is the coefficient of determination from the equation $X_{i}$ on $(X_{1},X_{2},...,X_{j-1},X_{j+1},...,X_{p})$.$VIF_{j}$ measures the dependence of $X_{j}$ on all other 5 regressors. A large VIF value indicates multicollinearity in the model. As a thumb rule, if $VIF>5$ we conclude that there is multicollinearity in the model.

                  TV           Billboards           Google_Ads 
            1.003094             1.006832             1.035421 
        Social_Media Influencer_Marketing  Affiliate_Marketing 
            1.006750             1.015758             1.032356

Since none of the values are greater than 5, therefore we can declare there is no multicollinearity in the model.

Normality of Errors:

Checking the normality of errors in an advertising regression model is essential for conducting inferential data analysis. Normal errors validate the assumptions required for accurate confidence intervals and hypothesis tests, ensuring reliable predictions and inferences in subsequent analysis.

To check the validity of the assumption, we use Q-Q plot. The Q-Q plot is a tool to help us assess if errors are plausibly generated from normal distribution. If the Q-Q plot is roughly straight, we can conclude that errors are from normal distribution.

The lower tail of the Q-Q plot deviates slightly from the straight line. So we to do a formal test to confirm the validity of normality.

Conducting Shapiro-Wilks Test.

The null hypothesis of interest: H_{0}: The errors are generated from a normal distribution

The test statistic for $H_{0}$ is given by: $W=\frac{(\Sigma_{i=1}^{n}a_{i}\epsilon_{(i)})^{2}}{\Sigma_{i=1}^{n}(\epsilon_{i}-\bar{\epsilon})^{2}}$ where

• $\epsilon_{(i)}$ is the ith ordered error term in the model

• $a_{i}^{'}s$ are calculated using the mean, variance and covariance of the $\epsilon_{i}^{'}s$

W is compared against the tabulated values of this statistic’s distribution. If p-value is being used, then if the p-value is smaller than the desired level will lead to rejection of the null hypothesis.


    Shapiro-Wilk normality test

data:  rstudent(model)
W = 0.99564, p-value = 0.7456

The p-value > 0.05, implying that the distribution of the model are not statistically different from normal distribution. Therefore, the normality assumption can be validated.

INFERENTIAL DATA ANALYSIS

The errors are assumed to follow a normal distribution as validated while checking model assumptions.

$\epsilon_{i}\sim N(0,\sigma^{2})\forall i$

First, we check whether there is a relationship between any of the predictors and the response. To do this a hypothesis test is done to answer the question. The null hypothesis is tested: $H_{0}:\beta_{1}=\beta_{2}=\beta_{3}=\beta_{4}=\beta_{5}=\beta_{6}=0$ versus the alternative hypothesis: $H_{1}:$at least one $\beta_{j}$ is non-zero $\forall j=1(1)6$

To test if the beta coefficient is equal to zero, the t-value and the corresponding p-value for the coefficient is observed from the output of summary(model).


Call:
lm(formula = y.train ~ ., data = X.train)

Residuals:
     Min       1Q   Median       3Q      Max 
-29.6141  -6.7985  -0.5287   6.8264  29.9193 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)          -0.029360   2.998502   -0.01    0.992    
TV                    2.003122   0.002203  909.13   <2e-16 ***
Billboards            2.997793   0.002299 1303.85   <2e-16 ***
Google_Ads            1.496227   0.002302  649.96   <2e-16 ***
Social_Media          2.499363   0.002354 1061.83   <2e-16 ***
Influencer_Marketing  1.200696   0.002235  537.34   <2e-16 ***
Affiliate_Marketing   3.999571   0.002376 1683.49   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.849 on 229 degrees of freedom
  (4 observations deleted due to missingness)
Multiple R-squared:      1, Adjusted R-squared:      1 
F-statistic: 1.128e+06 on 6 and 229 DF,  p-value: < 2.2e-16

Since the p-value for all the predictors is <2e-16 (which is less than 0.001), the null hypothesis is rejected and it can be concluded that all the predictors have a significant effect on the response variable Product_Sold. The beta coefficient is significantly different from zero (i.e. $\beta_1\ne0,\beta_2\ne0,\beta_3\ne0,\beta_4\ne0,\beta_5\ne0,\beta_6\ne0$) , indicating that changes in the predictors are associated with changes in the response.

The F-statistic in a linear regression model summary tests the overall significance of the model. It evaluates whether there is a relationship between the response variable and the predictors collectively.

Null hypothesis $H_0$: All regression coefficients are equal to zero, meaning none of the predictors have any explanatory power for the response variable.

Alternative Hypothesis $H_1$: At least one regression coefficient is not equal to zero, meaning at least one predictor has some explanatory power for the response variable.

Since the p-value is <2e-16, which is less than 0.05,the null hypothesis is rejected. This indicates that the regression model with the predictors provides a significantly better fit to the data than a model with no predictors. In other words, at least one of the predictors in the model is significantly related to the response variable. The F-statistic value is quite high (1.347e+06) which indicates that the model explains a significant portion of the variance in the response variable.

All the covariates are statistically significant. The entire model explains the variance in Product_Sold significantly.

CONCLUSION

After analyzing the data, our final model is:

Final Model:

Product_$Sold_i$ = -4.360048 + 2.002190 * $TV_i$ + 3.000949 * $Billboards_i$ + 1.500990 * Google_$Ads_i$ + 2.502018 * Social_$Media_i$ + 1.198624 * Influencer_$Marketing_i$ + 4.001492 * Affiliate_$Marketing_i$ + $\epsilon_{i}\forall i=1(1)300$

It can be concluded that on increasing 1 unit of advertising costs on television the number of units sold increases approximately by 2 units. Similarly, on increasing 1 unit of advertising costs associated with billboards the number of units sold increases approximately by 3 units. By increasing 1 unit advertising costs incurred on Google Ads the number of units sold increases by 1.5 units. The number of units sold increases by approximately 1.5 units by increasing 1 unit of advertising expenses on various social media platforms. Finally on increasing 1 unit of increase in costs related to influencer marketing and expenditure on affiliate marketing efforts, the number of units sold increases by approximately 2 and 4 units respectively.

REGRESSION ANALYSIS ON ADVERTISING DATA

Pallabi Dutta

June 2, 2024