Part 1: Multiple Linear Regression Modeling: Fitting and Interpretation of results

0. Introduction to the Advertising dataset

The Advertising dataset contains advertising expenditures on TV, Radio, and Newspaper, along with the corresponding Sales results. It is sourced from GitHub. The homework task was to find a dataset and apply multiple linear regression, and I chose this one because it has one dependent variable (Sales) and multiple independent variables (TV, Radio, and Newspaper budgets). It is therefore suitable for building a multiple regression model in R and interpreting how each advertising channel contributes to changes in sales.

1. Import the Advertising dataset into R

# Importing dataset
data<-read.csv("C:\\Users\\JIRAGUHA\\Desktop\\AUCA\\R-PROGRAMING\\Mid_Exam\\Advertising.csv")

# 1.1 View first rows
head(data)
##   X    TV radio newspaper sales
## 1 1 230.1  37.8      69.2  22.1
## 2 2  44.5  39.3      45.1  10.4
## 3 3  17.2  45.9      69.3   9.3
## 4 4 151.5  41.3      58.5  18.5
## 5 5 180.8  10.8      58.4  12.9
## 6 6   8.7  48.9      75.0   7.2
# 1.2 Structure of dataset
str(data) # str() displays the structure of an object: its type, the number of observations and variables, the variable names, their data types, and sample values.
## 'data.frame':    200 obs. of  5 variables:
##  $ X        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ TV       : num  230.1 44.5 17.2 151.5 180.8 ...
##  $ radio    : num  37.8 39.3 45.9 41.3 10.8 48.9 32.8 19.6 2.1 2.6 ...
##  $ newspaper: num  69.2 45.1 69.3 58.5 58.4 75 23.5 11.6 1 21.2 ...
##  $ sales    : num  22.1 10.4 9.3 18.5 12.9 7.2 11.8 13.2 4.8 10.6 ...
# 1.3 Viewing dataset summary

summary(data) # This function is used to display descriptive (summary) statistics for each variable in the dataset
##        X                TV             radio          newspaper     
##  Min.   :  1.00   Min.   :  0.70   Min.   : 0.000   Min.   :  0.30  
##  1st Qu.: 50.75   1st Qu.: 74.38   1st Qu.: 9.975   1st Qu.: 12.75  
##  Median :100.50   Median :149.75   Median :22.900   Median : 25.75  
##  Mean   :100.50   Mean   :147.04   Mean   :23.264   Mean   : 30.55  
##  3rd Qu.:150.25   3rd Qu.:218.82   3rd Qu.:36.525   3rd Qu.: 45.10  
##  Max.   :200.00   Max.   :296.40   Max.   :49.600   Max.   :114.00  
##      sales      
##  Min.   : 1.60  
##  1st Qu.:10.38  
##  Median :12.90  
##  Mean   :14.02  
##  3rd Qu.:17.40  
##  Max.   :27.00

The dataset contains 200 observations and 5 variables. One variable (X) is an index holding the row number and is not used in the analysis; the remaining four variables (TV, radio, newspaper, and sales) are numeric and form the analysis variables.

2. Exploratory Data Analysis (EDA)

Before fitting the multiple linear regression model, I conduct an Exploratory Data Analysis (EDA). This process includes checking for missing values, duplicate observations, outliers, and variable distributions. Correlation analysis is also performed to check for multicollinearity among the predictor variables. These procedures are presented step by step to support the reliability and validity of the regression results.

2.1 Check for Missing Values

# Count missing values in each variable

colSums(is.na(data)) # Counts the number of missing values (NA) in each column of the dataset.
##         X        TV     radio newspaper     sales 
##         0         0         0         0         0

The output above shows that each variable has zero missing values (NA). Therefore, the dataset is complete, and no missing-data treatment is required before performing multiple linear regression analysis.

# Total missing values
sum(is.na(data)) # All values are 0, there are no missing values in the dataset.
## [1] 0
# Check for duplicate observations
sum(duplicated(data)) # The output [1] 0 indicates that there are no duplicated rows: each observation is unique.
## [1] 0
# 3. Boxplots of all numeric variables to identify outliers in the Advertising dataset

boxplot(data[, c("TV", "radio", "newspaper", "sales")],
        main = "Boxplots of Advertising Variables(numerical values)",
        col = c("lightblue", "lightgreen", "lightpink", "orange"))

The newspaper boxplot shows two outliers above the upper whisker.

# The boxplot.stats() function identifies and extracts those outlier values

boxplot.stats(data$TV)$out 
## numeric(0)
boxplot.stats(data$radio)$out
## numeric(0)
boxplot.stats(data$newspaper)$out
## [1] 114.0 100.9
boxplot.stats(data$sales)$out
## numeric(0)

The boxplots and the boxplot.stats() function identify two potential outliers, both in the newspaper variable (114.0 and 100.9). These observations are retained because they appear to be valid values rather than data-entry errors. The later diagnostic analysis (Residuals vs Leverage plot) indicates that they do not unduly influence the regression model; therefore, all observations are included in the analysis.
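
The 1.5 × IQR rule behind the boxplot can be checked by hand. The sketch below uses the newspaper quartiles reported by summary(data) above. Note that boxplot.stats() works from hinges rather than these quantiles, so its exact fence may differ slightly. The variable names are illustrative.

```r
# Newspaper quartiles as reported by summary(data) above
q1 <- 12.75
q3 <- 45.10

# Upper fence of the 1.5 * IQR rule (approximation based on quantiles)
iqr <- q3 - q1
upper_fence <- q3 + 1.5 * iqr
upper_fence

# The two values returned by boxplot.stats(data$newspaper)$out
flagged <- c(114.0, 100.9)
flagged > upper_fence   # TRUE for both: each lies above the fence
```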

Correlation analysis examines the relationships between the predictor variables (TV, Radio, and Newspaper advertising expenditures) and the response variable (Sales).

# Relationships between the predictor variables and the response variable

round(cor(data[, c("TV", "radio", "newspaper", "sales")]),2)
##             TV radio newspaper sales
## TV        1.00  0.05      0.06  0.78
## radio     0.05  1.00      0.35  0.58
## newspaper 0.06  0.35      1.00  0.23
## sales     0.78  0.58      0.23  1.00

Interpretation of the correlation Matrix

The correlation matrix shows that TV advertising has a strong positive relationship with sales (r = 0.78), the strongest of the three predictors, followed by radio with a moderate positive correlation (r = 0.58), while newspaper has a weak positive relationship (r = 0.23). Correlations among the predictors are low (TV–radio = 0.05, TV–newspaper = 0.06, radio–newspaper = 0.35), which indicates weak associations and no sign of serious multicollinearity. Overall, the predictors are largely uncorrelated with one another, so they are suitable for multiple regression analysis.
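
As a rough multicollinearity check, variance inflation factors (VIFs) can be derived from the predictor correlations alone, because the VIF of each predictor equals the corresponding diagonal element of the inverse of the predictor correlation matrix. The sketch below uses the rounded correlations printed above, so the values are approximate; the object names are illustrative.

```r
# Predictor correlations copied (rounded) from the matrix above
predictors <- c("TV", "radio", "newspaper")
R <- matrix(c(1.00, 0.05, 0.06,
              0.05, 1.00, 0.35,
              0.06, 0.35, 1.00),
            nrow = 3, byrow = TRUE,
            dimnames = list(predictors, predictors))

# VIF_j is the j-th diagonal element of the inverse correlation matrix
vif <- diag(solve(R))
round(vif, 2)
```

All three values are close to 1, far below the commonly used warning thresholds of 5 or 10, which agrees with the conclusion drawn from the correlations.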

3. Applying Multiple Linear Regression Model on the Advertising dataset

Sales is the dependent variable, and TV, radio, and newspaper are the independent variables. The model to be fitted is: sales = B0 + B1(TV) + B2(radio) + B3(newspaper) + error

# 3.1 Fit multiple linear regression model
MR_Sales <- lm(sales ~ TV + radio + newspaper, data = data)

# 3.2 View results
summary(MR_Sales)
## 
## Call:
## lm(formula = sales ~ TV + radio + newspaper, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.8277 -0.8908  0.2418  1.1893  2.8292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.938889   0.311908   9.422   <2e-16 ***
## TV           0.045765   0.001395  32.809   <2e-16 ***
## radio        0.188530   0.008611  21.893   <2e-16 ***
## newspaper   -0.001037   0.005871  -0.177     0.86    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.686 on 196 degrees of freedom
## Multiple R-squared:  0.8972, Adjusted R-squared:  0.8956 
## F-statistic: 570.3 on 3 and 196 DF,  p-value: < 2.2e-16

Interpretation of results

The fitted multiple linear regression model is:

Predicted sales = 2.9389 + 0.0458(TV) + 0.1885(radio) − 0.0010(newspaper)
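
To show how the fitted equation is used, the sketch below computes a prediction by hand from the reported coefficients. The budget values are hypothetical and chosen only for illustration; with the fitted object available, predict(MR_Sales, newdata = ...) would give the same result up to rounding.

```r
# Coefficients copied from summary(MR_Sales)
b <- c(intercept = 2.938889, TV = 0.045765,
       radio = 0.188530, newspaper = -0.001037)

# Hypothetical budgets, chosen only for illustration
budget <- c(TV = 100, radio = 20, newspaper = 30)

# Intercept plus the sum of coefficient * budget for each channel
predicted_sales <- b[["intercept"]] + sum(b[names(budget)] * budget)
predicted_sales   # about 11.25
```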

Interpretation of the Regression Coefficients

Intercept(B0)= 2.938889

The intercept is the expected sales when TV, radio, and newspaper advertising expenditures are all zero: approximately 2.94 units. It is statistically significant, since its p-value (< 2e-16) is below the common significance levels of 0.05, 0.01, and 0.001.

TV Advertising (B1)= 0.045765

Holding radio and newspaper advertising constant, a one-unit increase in TV advertising expenditure is associated with an average increase of 0.0458 units in sales. This coefficient is highly statistically significant, since its p-value (< 2e-16) is below the common significance levels of 0.05, 0.01, and 0.001, indicating that TV advertising has a significant positive association with sales.

Radio Advertising (B2)= 0.188530

Holding TV and newspaper advertising constant, a one-unit increase in radio advertising expenditure is associated with an average increase of 0.1885 units in sales. This coefficient is highly statistically significant, since its p-value (< 2e-16) is below the common significance levels of 0.05, 0.01, and 0.001, indicating that radio advertising has a significant positive association with sales.

Newspaper Advertising(B3)= -0.001037

Holding TV and radio advertising constant, a one-unit increase in newspaper advertising expenditure is associated with an average decrease of 0.001037 units in sales. However, this coefficient is not statistically significant (p = 0.86 > 0.05), so there is no evidence that newspaper advertising affects sales once TV and radio are accounted for in this fitted model.
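
The significance statements above can also be read from 95% confidence intervals, computed as estimate ± t × standard error with 196 residual degrees of freedom. The sketch below rebuilds them from the reported estimates and standard errors; with the fitted object available, confint(MR_Sales) performs the same calculation. The object names are illustrative.

```r
# Estimates and standard errors copied from summary(MR_Sales)
est <- c(Intercept = 2.938889, TV = 0.045765,
         radio = 0.188530, newspaper = -0.001037)
se  <- c(Intercept = 0.311908, TV = 0.001395,
         radio = 0.008611, newspaper = 0.005871)

# Two-sided 95% critical value on 196 residual degrees of freedom
t_crit <- qt(0.975, df = 196)

ci <- cbind(lower = est - t_crit * se,
            upper = est + t_crit * se)
round(ci, 4)
```

The intervals for TV and radio lie entirely above zero, while the newspaper interval spans zero, which matches the p-values.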

Interpretation of R-squared (= 0.8972) and adjusted R-squared (= 0.8956)

About 89.72% of the variation in sales is explained by TV, radio, and newspaper advertising expenditures (R-squared). After adjusting for the number of predictors, approximately 89.56% of the variation in sales remains explained by the model.
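
Both the adjusted R-squared and the overall F-statistic follow from R-squared, the sample size, and the number of predictors. The sketch below reproduces them from the rounded R-squared in the output, so small rounding differences are expected.

```r
r2 <- 0.8972   # Multiple R-squared from summary(MR_Sales)
n  <- 200      # observations
p  <- 3        # predictors (TV, radio, newspaper)

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 4)

# Overall F-statistic on p and n - p - 1 degrees of freedom
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))
round(f_stat, 1)   # close to the reported 570.3
```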

Multiple Correlation Coefficient (R)

# Multiple Correlation Coefficient
round(sqrt(summary(MR_Sales)$r.squared),3)
## [1] 0.947

Interpretation of this Multiple Correlation Coefficient (R)

In multiple linear regression, the multiple correlation coefficient (R) measures the strength of the relationship between the dependent variable and all predictor variables jointly. It is the square root of the coefficient of determination: with R-squared = 0.8972, R = 0.947, indicating a very strong relationship between Sales and the combined predictors (TV, Radio, and Newspaper). R carries no sign, so it does not give a single overall direction; instead, each predictor influences the response according to the sign and magnitude of its own coefficient, as discussed above.

Diagnostic analysis of the fitted multiple linear regression model

par(mfrow=c(2,2))
plot(MR_Sales)

Interpretation of the four diagnostic plots above

Plot_1: Residuals vs Fitted

The residual points are randomly distributed around the zero line with no systematic pattern, so the linearity assumption appears to be satisfied.

Plot_2: Normal Q-Q Plot

In the Normal Q-Q plot, most points follow the reference line, so the residuals are approximately normally distributed.

Plot_3: Scale-Location Plot

In the Scale-Location plot, the points are evenly spread with no funnel pattern, so the assumption of homoscedasticity (constant variance of residuals) appears to be satisfied.

Plot_4: Residuals vs Leverage

The Residuals vs Leverage plot indicates that most observations are clustered within the acceptable range, and no points exceed the Cook’s distance boundaries. This suggests that there are no highly influential observations that affect the regression model.

4. Conclusion

The multiple linear regression analysis showed that the model is highly effective in explaining sales performance, accounting for approximately 89.72% of the variation in sales. Among the predictors, TV and radio advertising expenditures have significant positive effects on sales, meaning that increased investment in these channels is associated with higher sales. In contrast, newspaper advertising does not have a statistically significant effect on sales. Therefore, the results suggest that TV and radio are the most important advertising channels for improving sales in the Advertising dataset.

Part 2: Read about variable selection methods

0. Introduction to Variable Selection Methods

Variable selection is the process of choosing the most important independent variables to include in a regression model. Its main objectives are to improve model accuracy, reduce complexity, and eliminate irrelevant predictors (explanatory variables). Methods such as forward selection, backward elimination, and stepwise regression are commonly used to determine a suitable set of variables for a regression model.

To demonstrate how variable selection methods are used in practice, the fitted multiple linear regression model for sales is applied using the advertising dataset. The selection process is guided by statistical criteria such as p-values, AIC, BIC, and Adjusted R-squared in order to identify the most appropriate set of predictors. The initial full model is given by:

sales= B0+B1(TV)+B2(radio)+B3(newspaper)

where TV, radio, and newspaper represent the explanatory variables used to predict sales.

Method_1: Forward Selection

Forward selection starts with no variables and adds predictors one at a time, at each step choosing the one that most improves the model (by AIC, BIC, p-value, or Adjusted R²).

# Start with null model
model0 <- lm(sales ~ 1, data = data)

# Add TV (best first predictor)
model1 <- lm(sales ~ TV, data = data)

# Add radio to the best model so far
model2 <- lm(sales ~ TV + radio, data = data)

# Try adding newspaper (final check)
model3 <- lm(sales ~ TV + radio + newspaper, data = data)

# Compare models by AIC value (lower is better)

AIC(model0, model1, model2, model3)
##        df       AIC
## model0  2 1231.3769
## model1  3 1044.0913
## model2  4  780.3941
## model3  5  782.3622

By comparing the AIC values of the four candidate models, model2, which includes TV and radio as predictors, has the lowest AIC value (780.3941).

Sales=B0+B1(TV)+B2(radio)

# Compare models to select the best one based on BIC Value
BIC(model0, model1, model2, model3)
##        df       BIC
## model0  2 1237.9736
## model1  3 1053.9863
## model2  4  793.5874
## model3  5  798.8538

The BIC values above indicate that model2, which includes TV and radio as predictors, has the lowest BIC value (793.5874) among all candidate models, so TV and radio advertising expenditures are the most important predictors of sales.

Sales=B0+B1(TV)+B2(radio)
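
AIC and BIC differ only in how strongly they penalise each estimated parameter: AIC adds 2 per parameter and BIC adds log(n). The sketch below recovers the BIC column from the reported AIC values and df, which shows why BIC favours the smaller model slightly more strongly here. The vector names are illustrative.

```r
# AIC values and parameter counts (df) copied from the AIC() output above
aic <- c(model0 = 1231.3769, model1 = 1044.0913,
         model2 = 780.3941,  model3 = 782.3622)
df  <- c(model0 = 2, model1 = 3, model2 = 4, model3 = 5)
n   <- 200

# BIC replaces the penalty 2 * df with log(n) * df
bic <- aic + df * (log(n) - 2)
round(bic, 4)

names(which.min(aic))   # "model2"
names(which.min(bic))   # "model2"
```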

Method_2: Backward Elimination (Remove weakest variable first)

Backward elimination starts with all predictors and removes the least significant variable step by step.

#  full_model is the model containing all explanatory variables
full_model <- lm(sales ~ TV + radio + newspaper, data = data)
summary(full_model)
## 
## Call:
## lm(formula = sales ~ TV + radio + newspaper, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.8277 -0.8908  0.2418  1.1893  2.8292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.938889   0.311908   9.422   <2e-16 ***
## TV           0.045765   0.001395  32.809   <2e-16 ***
## radio        0.188530   0.008611  21.893   <2e-16 ***
## newspaper   -0.001037   0.005871  -0.177     0.86    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.686 on 196 degrees of freedom
## Multiple R-squared:  0.8972, Adjusted R-squared:  0.8956 
## F-statistic: 570.3 on 3 and 196 DF,  p-value: < 2.2e-16

Sales=2.9389+0.0458(TV)+0.1885(radio)−0.001037(newspaper)

From the fitted model output:

2.1 Identify the least important variable

TV: p-value < 2e-16 (highly significant; t value 32.809)

radio: p-value < 2e-16 (highly significant; t value 21.893)

newspaper: the highest p-value (0.86 > 0.05), not significant

Newspaper (p = 0.86) is removed first because it contributes very little to explaining sales. The model is then refitted as:

Sales = B0 + B1(TV) + B2(radio)

In conclusion, backward elimination suggests dropping newspaper from the model.

2.2 Reduced model

# model1 now contains only two variables, TV and radio (this overwrites the TV-only model1 from Method_1)
model1 <- lm(sales ~ TV + radio, data = data)
summary(model1)
## 
## Call:
## lm(formula = sales ~ TV + radio, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.7977 -0.8752  0.2422  1.1708  2.8328 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.92110    0.29449   9.919   <2e-16 ***
## TV           0.04575    0.00139  32.909   <2e-16 ***
## radio        0.18799    0.00804  23.382   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.681 on 197 degrees of freedom
## Multiple R-squared:  0.8972, Adjusted R-squared:  0.8962 
## F-statistic: 859.6 on 2 and 197 DF,  p-value: < 2.2e-16

Checking if model has been improved

# Compare full_model and model1 by AIC and BIC; the model with the lower value is preferred

AIC(full_model, model1) # This function computes the Akaike Information Criterion (AIC) for two or more competing regression models
##            df      AIC
## full_model  5 782.3622
## model1      4 780.3941
BIC(full_model, model1) # This function computes the Bayesian Information Criterion (BIC) for model comparison 
##            df      BIC
## full_model  5 798.8538
## model1      4 793.5874
summary(full_model)$adj.r.squared
## [1] 0.8956373
summary(model1)$adj.r.squared
## [1] 0.8961505

Both the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) are lower for the reduced model (AIC 780.39 vs 782.36; BIC 793.59 vs 798.85), and the Adjusted R-squared increased slightly from 0.8956 (full model) to 0.8962 (reduced model). This indicates that removing newspaper did not reduce the explanatory power of the model; instead, the model became slightly more parsimonious.
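
The same comparison can be framed as a partial F-test of the reduced model against the full model, which in R is usually run as anova(model1, full_model). The sketch below does the arithmetic by hand, using the full-model residual sum of squares (556.8) and the change from dropping newspaper (0.09) that step() prints for these models in the stepwise section; because those inputs are rounded, the result is approximate.

```r
rss_full <- 556.8   # residual sum of squares of the full model
ss_drop  <- 0.09    # increase in RSS when newspaper is removed
df_full  <- 196     # residual degrees of freedom of the full model

# Partial F-statistic for dropping one predictor
f_stat  <- (ss_drop / 1) / (rss_full / df_full)
p_value <- pf(f_stat, df1 = 1, df2 = df_full, lower.tail = FALSE)

round(c(F = f_stat, p = p_value), 3)
```

For a single dropped predictor the F-statistic is the square of that predictor's t value, so the p-value agrees with the 0.86 reported for newspaper.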

Method_3: Stepwise Selection (Forward + Backward combined)

Stepwise selection is a variable selection technique that combines the ideas of forward selection and backward elimination. The method adds significant variables to the model while also checking whether previously included variables should be removed.

# Apply step() to search for the model with the lowest AIC value.

stepwise_model <- step(full_model, direction = "both") # Performs stepwise selection, adding or removing one predictor at a time until the AIC no longer improves. The AIC printed by step() is on a different scale from AIC(), so only differences between models are comparable.
## Start:  AIC=212.79
## sales ~ TV + radio + newspaper
## 
##             Df Sum of Sq    RSS    AIC
## - newspaper  1      0.09  556.9 210.82
## <none>                    556.8 212.79
## - radio      1   1361.74 1918.6 458.20
## - TV         1   3058.01 3614.8 584.90
## 
## Step:  AIC=210.82
## sales ~ TV + radio
## 
##             Df Sum of Sq    RSS    AIC
## <none>                    556.9 210.82
## + newspaper  1      0.09  556.8 212.79
## - radio      1   1545.62 2102.5 474.52
## - TV         1   3061.57 3618.5 583.10
summary(stepwise_model)
## 
## Call:
## lm(formula = sales ~ TV + radio, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.7977 -0.8752  0.2422  1.1708  2.8328 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.92110    0.29449   9.919   <2e-16 ***
## TV           0.04575    0.00139  32.909   <2e-16 ***
## radio        0.18799    0.00804  23.382   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.681 on 197 degrees of freedom
## Multiple R-squared:  0.8972, Adjusted R-squared:  0.8962 
## F-statistic: 859.6 on 2 and 197 DF,  p-value: < 2.2e-16

Interpretation based on the AIC Values

The stepwise regression procedure started with the full model containing TV, radio, and newspaper, which had an AIC value of 212.79. After removing the variable newspaper, the AIC decreased to 210.82.

Since a lower AIC value indicates a better model, the reduction in AIC from 212.79 to 210.82 suggests that the model containing only TV and radio is preferred. In other words, newspaper advertising does not contribute sufficiently to explaining sales and can be excluded from the model.

Sales= B0+B1(TV)+B2(radio)
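
The AIC values printed by step() (212.79 and 210.82) differ from those returned by AIC() earlier (782.36 and 780.39) because step() relies on extractAIC(), which for linear models drops an additive constant that depends only on n. The sketch below reproduces the step() values from the printed RSS and shows the constant; the helper name step_aic is hypothetical, and the RSS values are rounded, so results match to about two decimals.

```r
n <- 200

# extractAIC-style value for a linear model: n * log(RSS / n) + 2 * k,
# where k is the number of regression coefficients (including the intercept)
step_aic <- function(rss, k, n) n * log(rss / n) + 2 * k

aic_full    <- step_aic(rss = 556.8, k = 4, n = n)   # TV + radio + newspaper
aic_reduced <- step_aic(rss = 556.9, k = 3, n = n)   # TV + radio
round(c(full = aic_full, reduced = aic_reduced), 2)

# Constant separating the two scales: AIC() also counts the error variance
# as a parameter (+2) and keeps the n * (1 + log(2 * pi)) likelihood term
offset <- n * (1 + log(2 * pi)) + 2
round(c(full = aic_full, reduced = aic_reduced) + offset, 2)
```

Because the constant is the same for every model fitted to the same data, both scales rank the models identically.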

Method_4: Best Subset Selection (Try all combinations)

Best Subset Selection is a variable selection method that evaluates all possible combinations of predictor variables and compares the resulting models using criteria such as AIC, BIC, or Adjusted R-squared.

# Testing all possible models
m0 <- lm(sales ~ 1, data = data)
m1 <- lm(sales ~ TV, data = data)
m2 <- lm(sales ~ radio, data = data)
m3 <- lm(sales ~ newspaper, data = data)
m4 <- lm(sales ~ TV + radio, data = data)
m5 <- lm(sales ~ TV + newspaper, data = data)
m6 <- lm(sales ~ radio + newspaper, data = data)
m7 <- lm(sales ~ TV + radio + newspaper, data = data)
# Compare models AIC Values to find the lowest one
AIC(m1, m2, m3, m4, m5, m6, m7)
##    df       AIC
## m1  3 1044.0913
## m2  3 1152.6738
## m3  3 1222.6714
## m4  4  780.3941
## m5  4 1027.7782
## m6  4 1154.4723
## m7  5  782.3622

Key interpretation based on AIC value

The lowest AIC value is 780.3941 (model m4), which corresponds to the model TV + radio and indicates that m4 is the best model according to the AIC criterion.

# Compare models BIC Values to find the lowest one
BIC(m1, m2, m3, m4, m5, m6, m7)
##    df       BIC
## m1  3 1053.9863
## m2  3 1162.5687
## m3  3 1232.5663
## m4  4  793.5874
## m5  4 1040.9714
## m6  4 1167.6655
## m7  5  798.8538

Key interpretation based on BIC value

Based on the BIC criterion, the lowest BIC value is 793.5874 (model m4), which corresponds to the model TV + radio. This shows that TV and radio advertising are the most important predictors of sales, while newspaper advertising does not contribute significantly and should be excluded from the model.
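
Writing out every model by hand becomes impractical as the number of predictors grows. The sketch below shows one way to enumerate all non-empty subsets with combn() and rank them by AIC. To stay self-contained it runs on simulated data, not the Advertising dataset: the data frame sim, its coefficients, and the noise level are assumptions made only for illustration, so its numbers will not match the output above.

```r
# Simulated stand-in data (NOT the Advertising dataset)
set.seed(1)
n <- 200
sim <- data.frame(TV = runif(n, 0, 300),
                  radio = runif(n, 0, 50),
                  newspaper = runif(n, 0, 100))
# Assumed relationship: sales depends on TV and radio only
sim$sales <- 3 + 0.05 * sim$TV + 0.2 * sim$radio + rnorm(n, sd = 1.7)

predictors <- c("TV", "radio", "newspaper")

# All non-empty subsets of the predictors (2^3 - 1 = 7)
subsets <- unlist(
  lapply(seq_along(predictors),
         function(k) combn(predictors, k, simplify = FALSE)),
  recursive = FALSE)

# Fit one lm() per subset and collect AIC and BIC
fits <- lapply(subsets, function(vars)
  lm(reformulate(vars, response = "sales"), data = sim))
results <- data.frame(model = sapply(subsets, paste, collapse = " + "),
                      AIC = sapply(fits, AIC),
                      BIC = sapply(fits, BIC))

results[order(results$AIC), ]                    # best model first
best_vars <- subsets[[which.min(results$AIC)]]   # predictors of the best model
```

Replacing sim with the real data frame would apply the same search to the Advertising data.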

Overall conclusion on Variable Selection Methods

All four variable selection methods discussed above (Forward Selection, Backward Elimination, Stepwise Regression, and Best Subset Selection) were applied to the Advertising dataset, and they produced a consistent result.

Across all methods, the comparison of statistical criteria such as AIC, BIC, Adjusted R-squared, and p-values showed that the variable newspaper does not significantly contribute to explaining variations in sales. In contrast, TV and radio consistently stay as strong and significant predictors in all selection approaches.

This consistency across different methods strengthens the reliability of the result. It confirms that TV and radio advertising expenditures are the most important predictors of sales, while newspaper advertising is not statistically useful and can be excluded.

The selected Sales model is sales ~ TV + radio, with fitted equation: Sales = 2.9211 + 0.04575(TV) + 0.18799(radio)

THE END.

```