## Loading the Data
First, let’s load the dataset.
data_readwrite <- read.csv("readwrite2.csv")
head(data_readwrite)
## id reading writing level
## 1 3070 18 21 3
## 2 1306 20 15 3
## 3 83 14 14 2
## 4 2486 16 20 4
## 5 1938 14 15 2
## 6 397 20 18 4
# Adding Gender Column as a Dummy Variable (female = 0; male = 1)
set.seed(123) # Setting seed for reproducibility
data_readwrite$gender <- sample(c(0, 1),
nrow(data_readwrite),
replace = TRUE)
head(data_readwrite)
## id reading writing level gender
## 1 3070 18 21 3 0
## 2 1306 20 15 3 0
## 3 83 14 14 2 0
## 4 2486 16 20 4 1
## 5 1938 14 15 2 0
## 6 397 20 18 4 1
Let’s predict writing scores from gender (0 = female, 1 = male). The coefficient for gender represents the difference in writing scores between males and females. If the coefficient is positive, it indicates that, on average, males score higher in writing than females. If the coefficient is negative, it suggests that females score higher on average. The significance of the coefficient indicates whether this difference is statistically significant.
mod.dummy.unstd <- lm(writing ~ gender, data=data_readwrite)
summary(mod.dummy.unstd)
##
## Call:
## lm(formula = writing ~ gender, data = data_readwrite)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.0562 -3.0562 0.9438 3.9438 7.9696
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.03042 0.13371 127.370 <2e-16 ***
## gender 0.02582 0.19106 0.135 0.893
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.268 on 1995 degrees of freedom
## Multiple R-squared: 9.151e-06, Adjusted R-squared: -0.0004921
## F-statistic: 0.01826 on 1 and 1995 DF, p-value: 0.8925
The standard interpretation still applies here: a one-unit difference in gender is associated with a 0.02582 difference in writing scores. But a one-unit difference in gender means being male rather than female, so the interpretation can be stated more directly: males are predicted to score 0.02582 points higher on writing than females. However, this difference is not significant, as can be seen from the large p-value.
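As a quick check (a sketch, assuming data_readwrite from above is still loaded), the gender coefficient is simply the difference between the two group means on writing:
with(data_readwrite, mean(writing[gender == 1]) - mean(writing[gender == 0])) # ~0.026, matching the slope for gender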
We can also report standardized regression coefficients:
mod.dummy.std <- lm(scale(writing) ~ scale(gender), data=data_readwrite)
summary(mod.dummy.std)
##
## Call:
## lm(formula = scale(writing) ~ scale(gender), data = data_readwrite)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0597 -0.7162 0.2212 0.9242 1.8677
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.842e-16 2.238e-02 0.000 1.000
## scale(gender) 3.025e-03 2.239e-02 0.135 0.893
##
## Residual standard error: 1 on 1995 degrees of freedom
## Multiple R-squared: 9.151e-06, Adjusted R-squared: -0.0004921
## F-statistic: 0.01826 on 1 and 1995 DF, p-value: 0.8925
From the results, we see that males have predicted writing scores 0.003 SD higher than females. Note that the p-value for the effect of gender is the same regardless; the p-value does not depend on how the units are expressed.
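As a further cross-check (a minimal sketch), the standardized slope can be recovered from the unstandardized model by rescaling with the standard deviations of the predictor and the outcome:
b_unstd <- coef(mod.dummy.unstd)["gender"]
b_unstd * sd(data_readwrite$gender) / sd(data_readwrite$writing) # ~0.003, matching the scale() model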
As we progress through the course, we’ll delve into a variety of statistical tests. One such test is the independent samples t-test, which we’ll use to examine whether there’s a significant difference in the writing scores between genders. Interestingly, when you run this t-test to compare the population means of writing scores across gender categories, the results will mirror those from a regression model using gender as a predictor. In other words, the regression model with a dummy variable for gender essentially replicates the analysis of the t-test.
t.test(scale(data_readwrite$writing) ~ scale(data_readwrite$gender), var.equal = TRUE)
##
## Two Sample t-test
##
## data: scale(data_readwrite$writing) by scale(data_readwrite$gender)
## t = -0.13511, df = 1995, p-value = 0.8925
## alternative hypothesis: true difference in means between group -0.979430381310121 and group 1.02049034617077 is not equal to 0
## 95 percent confidence interval:
## -0.09386118 0.08176166
## sample estimates:
## mean in group -0.979430381310121 mean in group 1.02049034617077
## -0.002962777 0.003086984
Compare the outputs from the regression and the independent samples t-test. You should notice that the results are exactly the same. That’s because regression with a single dichotomous predictor is exactly equivalent to an independent samples t-test - as long as we assume homogeneity of variance in the t-test, which is precisely the same assumption as homoskedasticity. Two fancy names, same concept.
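For reference, the scaling above is not required for this equivalence: a t-test on the raw variables (a minimal sketch, not run here) should give the same t statistic and p-value, just with more readable group labels:
t.test(writing ~ gender, data = data_readwrite, var.equal = TRUE) # same t (about -0.135) and p-value (0.893)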
In the music industry, predicting album sales is a significant concern. Multiple factors can influence the success or failure of an album. In this analysis, we’ll delve into the factors that might predict album sales using a dataset dta_album. This dataset contains data on the number of advertisements (adverts), the airplay time (airplay), album attractiveness (attract), and the location where the album was released (location).
# Load the dataset
dta_album <- read.delim("Album Sales 2.dat", header = TRUE)
# Adding a location column at random (for demonstration purposes)
dta_album$location <- sample(c("A", "B", "C"),
nrow(dta_album),
replace = TRUE)
To explore how these factors predict album sales, we will run a multiple regression model. In R, the lm function from the base package can be used for this purpose. Multiple predictors can be added to the model using the + symbol.
album.model1 <- lm(sales ~ airplay + adverts + location + attract, data = dta_album)
summary(album.model1)
##
## Call:
## lm(formula = sales ~ airplay + adverts + location + attract,
## data = dta_album)
##
## Residuals:
## Min 1Q Median 3Q Max
## -120.023 -27.754 -1.236 30.647 145.283
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -20.766805 18.231484 -1.139 0.256
## airplay 3.352494 0.279098 12.012 < 2e-16 ***
## adverts 0.084784 0.006946 12.206 < 2e-16 ***
## locationB -8.763636 8.346451 -1.050 0.295
## locationC -6.073371 8.267559 -0.735 0.463
## attract 11.051165 2.459551 4.493 1.2e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 47.19 on 194 degrees of freedom
## Multiple R-squared: 0.6666, Adjusted R-squared: 0.658
## F-statistic: 77.59 on 5 and 194 DF, p-value: < 2.2e-16
Intercept (Baseline Prediction): The intercept represents the predicted album sales when all predictors are 0. It’s -20.76, which doesn’t have a clear practical interpretation in this context since factors like attractiveness cannot realistically be zero.
Airplay: For every additional unit of airplay, holding all other factors constant, the album sales increase by 3.35 units. This positive relationship is highly statistically significant (p-value < 2e-16).
Advertisements (adverts): For every unit increase in advertisement spending, the album sales increase by approximately 0.085 units, given other factors are constant. This predictor is also highly significant.
Location (locationB and locationC): The coefficients for locationB and locationC represent the difference in album sales compared to the reference category (which is locationA in this case). Neither of these predictors is statistically significant, suggesting that the location might not play a significant role in predicting sales once other factors are considered.
Album Attractiveness (attract): A one-unit increase in album attractiveness rating is associated with an 11.05 unit increase in album sales, holding other factors constant. This predictor is statistically significant.
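To make these coefficients tangible, here is a small worked example (a sketch with made-up predictor values) that turns the fitted model into a prediction:
# Hypothetical album: airplay = 30, adverts = 500, attract = 7, released in location "A"
new_album <- data.frame(airplay = 30, adverts = 500, attract = 7, location = "A")
predict(album.model1, newdata = new_album)
# Equivalently: -20.77 + 3.35*30 + 0.0848*500 + 11.05*7 (location A is the reference, so no dummy term is added)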
Important Notes:
To calculate standardized regression coefficients, you can use the lm.beta function from the lm.beta package.
# install.packages("lm.beta")
library(lm.beta)
lm.beta(album.model1)
##
## Call:
## lm(formula = sales ~ airplay + adverts + location + attract,
## data = dta_album)
##
## Standardized Coefficients::
## (Intercept) airplay adverts locationB locationC attract
## NA 0.50971805 0.51024116 -0.05157228 -0.03610309 0.19107534
When using standardized coefficients, we are effectively looking at the relationship between each predictor and the outcome when both are measured in standard deviation units. This makes it easier to compare the relative strength of each predictor.
Airplay: A 1 SD increase in airplay is associated with a 0.510 SD increase in album sales, holding all other factors constant.
Advertisements (adverts): A 1 SD increase in advertisement spending is linked with a 0.510 SD rise in album sales when other factors are held constant. Based on the standardized coefficients, airplay and adverts are essentially tied as the strongest predictors in the model.
Location (locationB and locationC): The standardized coefficient for locationB corresponds to a 0.052 SD decrease in album sales relative to locationA, holding all else constant, and the coefficient for locationC to a 0.036 SD decrease relative to locationA. Based on the standardized coefficients, the location variables have a minimal effect on album sales. (Keep in mind that standardized coefficients for dummy variables are awkward to interpret, since a one-SD change in a 0/1 variable does not correspond to a natural group comparison.)
Album Attractiveness (attract): A 1 SD difference in album attractiveness rating corresponds to a 0.190 SD difference in album sales when other predictors are held constant.
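As a cross-check (a minimal sketch), the standardized slopes for the continuous predictors can be recovered by rescaling each unstandardized coefficient with the predictor and outcome standard deviations:
b <- coef(album.model1)
b["airplay"] * sd(dta_album$airplay) / sd(dta_album$sales) # ~0.510
b["adverts"] * sd(dta_album$adverts) / sd(dta_album$sales) # ~0.510
b["attract"] * sd(dta_album$attract) / sd(dta_album$sales) # ~0.191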
Important Notes:
The primary objective of model comparison is to determine which among a set of nested models provides the best fit to the data. In the context of linear regression, this is often achieved by comparing models with different predictors to determine whether the inclusion or exclusion of certain predictors significantly improves the fit.
Below are the specified models used for comparison:
The full model (album.model.full): This model includes the predictors airplay, adverts, location, and attract.
album.model.full <- lm(sales ~ airplay + adverts + location + attract, data = dta_album)
summary(album.model.full)
##
## Call:
## lm(formula = sales ~ airplay + adverts + location + attract,
## data = dta_album)
##
## Residuals:
## Min 1Q Median 3Q Max
## -120.023 -27.754 -1.236 30.647 145.283
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -20.766805 18.231484 -1.139 0.256
## airplay 3.352494 0.279098 12.012 < 2e-16 ***
## adverts 0.084784 0.006946 12.206 < 2e-16 ***
## locationB -8.763636 8.346451 -1.050 0.295
## locationC -6.073371 8.267559 -0.735 0.463
## attract 11.051165 2.459551 4.493 1.2e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 47.19 on 194 degrees of freedom
## Multiple R-squared: 0.6666, Adjusted R-squared: 0.658
## F-statistic: 77.59 on 5 and 194 DF, p-value: < 2.2e-16
A reduced model (album.model.1): This model includes the predictors airplay, adverts, and attract.
album.model.1 <- lm(sales ~ airplay + adverts + attract, data = dta_album)
summary(album.model.1)
##
## Call:
## lm(formula = sales ~ airplay + adverts + attract, data = dta_album)
##
## Residuals:
## Min 1Q Median 3Q Max
## -121.324 -28.336 -0.451 28.967 144.132
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -26.612958 17.350001 -1.534 0.127
## airplay 3.367425 0.277771 12.123 < 2e-16 ***
## adverts 0.084885 0.006923 12.261 < 2e-16 ***
## attract 11.086335 2.437849 4.548 9.49e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 47.09 on 196 degrees of freedom
## Multiple R-squared: 0.6647, Adjusted R-squared: 0.6595
## F-statistic: 129.5 on 3 and 196 DF, p-value: < 2.2e-16
A further reduced model (album.model.2): This model includes only airplay and adverts.
album.model.2 <- lm(sales ~ airplay + adverts, data = dta_album)
summary(album.model.2)
##
## Call:
## lm(formula = sales ~ airplay + adverts, data = dta_album)
##
## Residuals:
## Min 1Q Median 3Q Max
## -112.121 -30.027 3.952 32.072 155.498
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41.123811 9.330952 4.407 1.72e-05 ***
## airplay 3.588789 0.286807 12.513 < 2e-16 ***
## adverts 0.086887 0.007246 11.991 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 49.38 on 197 degrees of freedom
## Multiple R-squared: 0.6293, Adjusted R-squared: 0.6255
## F-statistic: 167.2 on 2 and 197 DF, p-value: < 2.2e-16
Using the anova() Function: In R, the anova() function is used for comparing nested models. It’s essential to note that R is case sensitive; thus, the lowercase anova() function is distinct from the uppercase Anova() function.
Interpretation of Results
For model comparison, the null hypothesis is that the inclusion of additional predictors doesn’t significantly improve the fit, i.e., the change in \(R^2\) (denoted as \(\Delta R^2\)) is 0. If the p-value associated with the F-statistic is below a certain significance level (typically 0.05), we reject the null hypothesis, indicating that the full model (with more predictors) provides a significantly better fit to the data than the reduced model.
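Concretely, the F-statistic for comparing nested models (stated here in its general form) measures the increase in residual sum of squares incurred by dropping predictors, relative to the residual variance of the full model:
\[ F = \frac{(RSS_{\text{reduced}} - RSS_{\text{full}})/(df_{\text{reduced}} - df_{\text{full}})}{RSS_{\text{full}}/df_{\text{full}}} \]
where the df terms are the residual degrees of freedom of each model. This is the quantity the anova() output reports below.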
Change in \(R^2\)
The change in \(R^2\) can be calculated as:
\[ \Delta R^2 = R^2_{\text{full model}} - R^2_{\text{reduced model}} \]
This value indicates the additional variability explained by the full model compared to the reduced model.
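A minimal sketch of computing \(\Delta R^2\) directly from the fitted model objects (assuming the models defined above are in the workspace):
r2_full <- summary(album.model.full)$r.squared
r2_reduced <- summary(album.model.1)$r.squared
r2_full - r2_reduced # change in R-squared attributable to location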
Comparison of album.model.full and album.model.1, using the anova() function:
anova(album.model.full, album.model.1)
## Analysis of Variance Table
##
## Model 1: sales ~ airplay + adverts + location + attract
## Model 2: sales ~ airplay + adverts + attract
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 194 432017
## 2 196 434575 -2 -2557.6 0.5743 0.5641
The p-value is 0.5641, which is greater than 0.05, indicating that the inclusion of the location predictor in album.model.full does not significantly improve the fit compared to album.model.1.
In addition, the change in \(R^2\) is:
\[ \Delta R^2 = 0.6666 - 0.6647 = 0.0019 \]
This indicates that the location predictor explains an additional 0.19% of the variability in sales, which is negligible.
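As a quick arithmetic check (a sketch using the RSS values and residual degrees of freedom from the table above), the F-statistic can be reproduced by hand:
((434575 - 432017) / 2) / (432017 / 194) # ~0.574, matching the F value reported by anova()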
Comparison of album.model.1 and album.model.2:
anova(album.model.1, album.model.2)
## Analysis of Variance Table
##
## Model 1: sales ~ airplay + adverts + attract
## Model 2: sales ~ airplay + adverts
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 196 434575
## 2 197 480428 -1 -45853 20.681 9.492e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value is less than 0.001, suggesting that the inclusion of the attract predictor in album.model.1 provides a significantly better fit to the data than album.model.2, which does not include this predictor.
In addition, the change in \(R^2\) is:
\[ \Delta R^2 = 0.6647 - 0.6293 = 0.0354 \]
This means that the attract predictor explains an additional 3.54% of the variability in sales.
Conclusion
From the above analysis, we can conclude that the model album.model.1, which includes the predictors airplay, adverts, and attract, provides the best fit among the three models. While the inclusion of the location predictor in album.model.full did not significantly improve the fit, the exclusion of the attract predictor from album.model.2 significantly worsened the fit.
Multicollinearity arises when two or more predictor variables in a regression model are closely correlated to each other, meaning that one can be linearly predicted from the others with a significant degree of accuracy.
Issues with Multicollinearity
Coefficient Estimates: Multicollinearity can lead to large standard errors of the coefficient estimates, making them unstable. Unstable coefficients make it difficult to ascertain the effect of predictor variables on the response variable.
P-values: The significance of predictor variables can be affected, potentially leading to a variable that should be significant being considered otherwise.
Model Interpretation: It becomes challenging to determine the effect of individual predictors on the response variable due to their high intercorrelations.
Checking for Multicollinearity: Variance Inflation Factor (VIF)
The Variance Inflation Factor (VIF) provides an index that measures how much the variance of an estimated regression coefficient is increased due to multicollinearity. Typically, a VIF above 10 indicates high multicollinearity, and further investigations might be warranted.
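To make the definition concrete, here is a minimal sketch of computing the VIF for a single predictor by hand (using airplay from album.model.1 as an example): regress that predictor on the remaining predictors and take 1 / (1 - R^2) from that auxiliary regression.
aux <- lm(airplay ~ adverts + attract, data = dta_album) # auxiliary regression for airplay
1 / (1 - summary(aux)$r.squared) # should match car::vif() for airplay (about 1.04)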
Using the vif() function from the car package, we can calculate the VIF for each predictor variable in album.model.1:
library(car)
## Loading required package: carData
vif_results <- car::vif(album.model.1)
vif_results
## airplay adverts attract
## 1.042504 1.014593 1.038455
Interpretation
If any of the VIF values are greater than 10, it suggests that the corresponding predictor might be collinear with other predictors in the model. In such cases, it’s advisable to consider remedial actions, such as removing the predictor, combining it with other predictors, or applying regularization techniques.
All of the VIFs (variance inflation factors) are well below 10, so there appear to be no particular issues with multicollinearity.
While we touched upon this in an earlier section, the topic deserves dedicated attention given its importance in regression modeling.
In R, categorical variables, regardless of whether they have two or more levels, should be treated as factor variables. This ensures that the model doesn’t mistakenly treat them as continuous numeric variables. This distinction is particularly vital when categorical variables are coded numerically, as with values like 1, 2, 3, etc. In such cases, it’s crucial to ensure that these numbers aren’t misinterpreted as quantitative values but rather as distinct categories or groups.
Let’s demonstrate this by creating a categorical variable with numeric codes:
# Creating a categorical variable represented by numbers
dta_album$location_num <- sample(c(1, 2, 3),
nrow(dta_album),
replace = TRUE)
# Converting the numeric-coded variable into a factor
dta_album$location_num <- as.factor(dta_album$location_num)
If a factor is included as a predictor in the lm function, dummy coding happens automatically, with the lowest level serving as the reference category.
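To see exactly which dummy variables R will create (a quick sketch), you can inspect the contrast matrix for the factor, or the model matrix that lm() builds internally:
contrasts(dta_album$location_num) # 0/1 treatment coding, with level "1" as the reference
head(model.matrix(~ location_num, data = dta_album)) # the dummy columns that enter the regression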
model.dummy <- lm(sales ~ location_num, data = dta_album)
summary(model.dummy)
##
## Call:
## lm(formula = sales ~ location_num, data = dta_album)
##
## Residuals:
## Min 1Q Median 3Q Max
## -184.46 -60.69 5.00 50.56 165.54
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 194.462 10.032 19.384 <2e-16 ***
## location_num2 -9.462 14.243 -0.664 0.507
## location_num3 4.975 13.885 0.358 0.720
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 80.88 on 197 degrees of freedom
## Multiple R-squared: 0.005531, Adjusted R-squared: -0.004565
## F-statistic: 0.5479 on 2 and 197 DF, p-value: 0.5791
How to interpret the left column: location_num2 means the location_num variable, group 2, compared to the reference. You can identify R’s choice of reference category by noting which group is absent from the output. Given the presence of location_num2 and location_num3, the reference category must be location_num1.
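A more direct check (a small sketch): with R’s default treatment coding, the first element returned by levels() is the reference category.
levels(dta_album$location_num) # "1" "2" "3"; the first level, "1", serves as the reference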
Interpreting the regression coefficients:
The coefficient for location_num2 suggests that being in category 2 is associated with a decrease of 9.46 in sales compared to the reference category (location_num1).
The coefficient for location_num3 implies that being in category 3 is associated with an increase of 4.98 in sales compared to the reference category (location_num1).
However, neither of these differences is statistically significant, as indicated by their p-values being greater than 0.05.
Now, if we wish to compare location_num2 and location_num3 directly, we could compute the difference using the coefficients. Still, we wouldn’t be able to perform a formal statistical test for this specific comparison based on this output. Once we’ve ascertained the differences between location_num1 and the other categories, there is no residual information in this output for a direct comparison between location_num2 and location_num3.
The relevel function allows you to choose the reference category.
model.dummy.relevel <- lm(sales ~ relevel(factor(location_num),ref="3"), data = dta_album)
summary(model.dummy.relevel)
##
## Call:
## lm(formula = sales ~ relevel(factor(location_num), ref = "3"),
## data = dta_album)
##
## Residuals:
## Min 1Q Median 3Q Max
## -184.46 -60.69 5.00 50.56 165.54
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 199.437 9.599 20.777 <2e-16
## relevel(factor(location_num), ref = "3")1 -4.975 13.885 -0.358 0.720
## relevel(factor(location_num), ref = "3")2 -14.437 13.941 -1.036 0.302
##
## (Intercept) ***
## relevel(factor(location_num), ref = "3")1
## relevel(factor(location_num), ref = "3")2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 80.88 on 197 degrees of freedom
## Multiple R-squared: 0.005531, Adjusted R-squared: -0.004565
## F-statistic: 0.5479 on 2 and 197 DF, p-value: 0.5791
One key question when working with regression models is determining the significance of a particular variable, in this case location_num. None of the coefficients directly answers this on its own, as each reflects the contrast between a particular location and the reference category (which is location_num3 in this instance). To holistically understand the role of location_num in predicting sales, we must assess all of its coefficients concurrently.
Null Hypothesis (H0): All the coefficients are equal to 0. This indicates that location_num does not significantly affect sales.
Alternative Hypothesis (H1): At least one coefficient is not equal to 0, suggesting that some levels of location_num do have an effect on sales.
If even one coefficient isn’t zero, it’s indicative of differences in sales across the various location_num categories. This collective assessment is known as the ‘omnibus test’ or test of ‘joint significance’, and it is carried out with the F-test.
Procedure:
Test Statistic: Start with the F-ratio or F-statistic.
Initial Assumption: The null hypothesis assumes all coefficients are 0.
Sampling Distribution: Given the standard regression assumptions and the null hypothesis, the F-ratio follows an F distribution. The numerator degrees of freedom (df) corresponds to the number of predictors, in this case, 2 (since we have two dummy variables). The denominator df = N - p - 1, giving a denominator df of 197.
Observed F Value: The observed F value in the model is 0.5479. If the null hypothesis holds true, the probability of observing an F value this large (or larger) is 0.5791 (this probability is reproduced in the sketch after this list).
Decision: The p-value is considerably greater than the common alpha threshold of 0.05, so we do not reject the null hypothesis. This implies that sales do not appear to vary significantly across the location_num categories. It’s important to note that this omnibus test only tells us whether there are overall differences; it does not pinpoint where those differences lie. Such distinctions could be made within an ANOVA framework, using post-hoc tests. Alternatively, a more precise approach often involves choosing contrasts based on theoretical knowledge, known as planned comparisons. These are more directly manageable in regression and tend to be more powerful than post-hoc tests, as they use each data point optimally.
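As a quick numerical check (a sketch), the reported p-value for this omnibus test can be reproduced from the F distribution with 2 and 197 degrees of freedom:
pf(0.5479, df1 = 2, df2 = 197, lower.tail = FALSE) # ~0.579, matching the p-value in the summary output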
This framework can be applied to any discrete variable with: