## Loading the Data
First, let’s load the dataset.
data_readwrite <- read.csv("readwrite2.csv")
head(data_readwrite)
## id reading writing level
## 1 3070 18 21 3
## 2 1306 20 15 3
## 3 83 14 14 2
## 4 2486 16 20 4
## 5 1938 14 15 2
## 6 397 20 18 4
# Adding Gender Column as a Dummy Variable (female = 0; male = 1)
set.seed(123) # Setting seed for reproducibility
data_readwrite$gender <- sample(c(0, 1),
nrow(data_readwrite),
replace = TRUE)
head(data_readwrite)
## id reading writing level gender
## 1 3070 18 21 3 0
## 2 1306 20 15 3 0
## 3 83 14 14 2 0
## 4 2486 16 20 4 1
## 5 1938 14 15 2 0
## 6 397 20 18 4 1
Let’s predict writing scores from gender (0 = female, 1 = male). The coefficient for gender represents the difference in writing scores between males and females. If the coefficient is positive, it indicates that, on average, males score higher in writing than females. If the coefficient is negative, it suggests that females score higher on average. The significance of the coefficient indicates whether this difference is statistically significant.
mod.dummy.unstd <- lm(writing ~ gender, data=data_readwrite)
summary(mod.dummy.unstd)
##
## Call:
## lm(formula = writing ~ gender, data = data_readwrite)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.0562 -3.0562 0.9438 3.9438 7.9696
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.03042 0.13371 127.370 <2e-16 ***
## gender 0.02582 0.19106 0.135 0.893
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.268 on 1995 degrees of freedom
## Multiple R-squared: 9.151e-06, Adjusted R-squared: -0.0004921
## F-statistic: 0.01826 on 1 and 1995 DF, p-value: 0.8925
The standard interpretation still applies here: a one-unit difference in gender is associated with a 0.02582 difference in writing scores. But a one-unit difference in gender means being male rather than female, so the interpretation can be stated more directly: males are predicted to score 0.02582 points higher on writing than females. However, this difference is not significant, as can be seen from the large p-value.
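As a quick check (a sketch, assuming data_readwrite from above is still loaded), the gender coefficient is simply the difference between the two group means on writing:
with(data_readwrite, mean(writing[gender == 1]) - mean(writing[gender == 0])) # ~0.026, matching the slope for gender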
We can also report standardized regression coefficients:
mod.dummy.std <- lm(scale(writing) ~ scale(gender), data=data_readwrite)
summary(mod.dummy.std)
##
## Call:
## lm(formula = scale(writing) ~ scale(gender), data = data_readwrite)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0597 -0.7162 0.2212 0.9242 1.8677
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.842e-16 2.238e-02 0.000 1.000
## scale(gender) 3.025e-03 2.239e-02 0.135 0.893
##
## Residual standard error: 1 on 1995 degrees of freedom
## Multiple R-squared: 9.151e-06, Adjusted R-squared: -0.0004921
## F-statistic: 0.01826 on 1 and 1995 DF, p-value: 0.8925
From the results, we see that males have predicted writing scores 0.003 SD higher than females. Note that the p-value for the effect of gender is the same regardless; the p-value does not depend on how the units are expressed.
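As a further cross-check (a minimal sketch), the standardized slope can be recovered from the unstandardized model by rescaling with the standard deviations of the predictor and the outcome:
b_unstd <- coef(mod.dummy.unstd)["gender"]
b_unstd * sd(data_readwrite$gender) / sd(data_readwrite$writing) # ~0.003, matching the scale() model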
As we progress through the course, we’ll delve into a variety of statistical tests. One such test is the independent samples t-test, which we’ll use to examine whether there’s a significant difference in the writing scores between genders. Interestingly, when you run this t-test to compare the population means of writing scores across gender categories, the results will mirror those from a regression model using gender as a predictor. In other words, the regression model with a dummy variable for gender essentially replicates the analysis of the t-test.
t.test(scale(data_readwrite$writing) ~ scale(data_readwrite$gender), var.equal = TRUE)
##
## Two Sample t-test
##
## data: scale(data_readwrite$writing) by scale(data_readwrite$gender)
## t = -0.13511, df = 1995, p-value = 0.8925
## alternative hypothesis: true difference in means between group -0.979430381310121 and group 1.02049034617077 is not equal to 0
## 95 percent confidence interval:
## -0.09386118 0.08176166
## sample estimates:
## mean in group -0.979430381310121 mean in group 1.02049034617077
## -0.002962777 0.003086984
Compare the outputs from the regression and the independent samples t-test. You should notice that the results are exactly the same. That’s because regression with a single dichotomous predictor is exactly equivalent to an independent samples t-test - as long as we assume homogeneity of variance in the t-test, which is precisely the same assumption as homoskedasticity. Two fancy names, same concept.
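For reference, the scaling above is not required for this equivalence: a t-test on the raw variables (a minimal sketch, not run here) should give the same t statistic and p-value, just with more readable group labels:
t.test(writing ~ gender, data = data_readwrite, var.equal = TRUE) # same t (about -0.135) and p-value (0.893)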
In the music industry, predicting album sales is a significant concern. Multiple factors can influence the success or failure of an album. In this analysis, we’ll delve into the factors that might predict album sales using a dataset dta_album. This dataset contains data on the number of advertisements (adverts), the airplay time (airplay), album attractiveness (attract), and the location where the album was released (location).
# Load the dataset
dta_album <- read.delim("Album Sales 2.dat", header = TRUE)
# Adding a location column at random (for demonstration purposes)
dta_album$location <- sample(c("A", "B", "C"),
nrow(dta_album),
replace = TRUE)
To explore how these factors predict album sales, we will run a multiple regression model. In R, the lm function from the base package can be used for this purpose. Multiple predictors can be added to the model using the + symbol.
album.model1 <- lm(sales ~ airplay + adverts + location + attract, data = dta_album)
summary(album.model1)
##
## Call:
## lm(formula = sales ~ airplay + adverts + location + attract,
## data = dta_album)
##
## Residuals:
## Min 1Q Median 3Q Max
## -120.023 -27.754 -1.236 30.647 145.283
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -20.766805 18.231484 -1.139 0.256
## airplay 3.352494 0.279098 12.012 < 2e-16 ***
## adverts 0.084784 0.006946 12.206 < 2e-16 ***
## locationB -8.763636 8.346451 -1.050 0.295
## locationC -6.073371 8.267559 -0.735 0.463
## attract 11.051165 2.459551 4.493 1.2e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 47.19 on 194 degrees of freedom
## Multiple R-squared: 0.6666, Adjusted R-squared: 0.658
## F-statistic: 77.59 on 5 and 194 DF, p-value: < 2.2e-16
Intercept (Baseline Prediction): The intercept represents the predicted album sales when all predictors are 0. It’s -20.76, which doesn’t have a clear practical interpretation in this context since factors like attractiveness cannot realistically be zero.
Airplay: For every additional unit of airplay, holding all other factors constant, the album sales increase by 3.35 units. This positive relationship is highly statistically significant (p-value < 2e-16).
Advertisements (adverts): For every unit increase in advertisement spending, the album sales increase by approximately 0.085 units, given other factors are constant. This predictor is also highly significant.
Location (locationB and locationC): The coefficients for locationB and locationC represent the difference in album sales compared to the reference category (which is locationA in this case). Neither of these predictors is statistically significant, suggesting that the location might not play a significant role in predicting sales once other factors are considered.
Album Attractiveness (attract): A one-unit increase in album attractiveness rating is associated with an 11.05 unit increase in album sales, holding other factors constant. This predictor is statistically significant.
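To make these coefficients tangible, here is a small worked example (a sketch with made-up predictor values) that turns the fitted model into a prediction:
# Hypothetical album: airplay = 30, adverts = 500, attract = 7, released in location "A"
new_album <- data.frame(airplay = 30, adverts = 500, attract = 7, location = "A")
predict(album.model1, newdata = new_album)
# Equivalently: -20.77 + 3.35*30 + 0.0848*500 + 11.05*7 (location A is the reference, so no dummy term is added)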
Important Notes:
To calculate standardized regression coefficients, you can use the lm.beta function from the lm.beta package.
# install.packages("lm.beta")
library(lm.beta)
lm.beta(album.model1)
##
## Call:
## lm(formula = sales ~ airplay + adverts + location + attract,
## data = dta_album)
##
## Standardized Coefficients::
## (Intercept) airplay adverts locationB locationC attract
## NA 0.50971805 0.51024116 -0.05157228 -0.03610309 0.19107534
When using standardized coefficients, we are effectively looking at the relationship between each predictor and the outcome when both are measured in standard deviation units. This makes it easier to compare the relative strength of each predictor.
Airplay: A 1 SD increase in airplay is associated with a 0.510 SD increase in album sales, holding all other factors constant.
Advertisements (adverts): A 1 SD increase in advertisement spending is linked with a 0.510 SD rise in album sales when other factors are held constant. Based on the standardized coefficients, airplay and adverts are essentially tied as the strongest predictors in the model.
Location (locationB and locationC): The standardized coefficient for locationB corresponds to a 0.052 SD decrease in album sales relative to locationA, holding all else constant, and the coefficient for locationC to a 0.036 SD decrease relative to locationA. Based on the standardized coefficients, the location variables have a minimal effect on album sales. (Keep in mind that standardized coefficients for dummy variables are awkward to interpret, since a one-SD change in a 0/1 variable does not correspond to a natural group comparison.)
Album Attractiveness (attract): A 1 SD difference in album attractiveness rating corresponds to a 0.190 SD difference in album sales when other predictors are held constant.
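As a cross-check (a minimal sketch), the standardized slopes for the continuous predictors can be recovered by rescaling each unstandardized coefficient with the predictor and outcome standard deviations:
b <- coef(album.model1)
b["airplay"] * sd(dta_album$airplay) / sd(dta_album$sales) # ~0.510
b["adverts"] * sd(dta_album$adverts) / sd(dta_album$sales) # ~0.510
b["attract"] * sd(dta_album$attract) / sd(dta_album$sales) # ~0.191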
Important Notes:
The primary objective of model comparison is to determine which among a set of nested models provides the best fit to the data. In the context of linear regression, this is often achieved by comparing models with different predictors to determine whether the inclusion or exclusion of certain predictors significantly improves the fit.
Below are the specified models used for comparison:
The full model (album.model.full): This model includes the predictors airplay, adverts, location, and attract.
album.model.full <- lm(sales ~ airplay + adverts + location + attract, data = dta_album)
summary(album.model.full)
##
## Call:
## lm(formula = sales ~ airplay + adverts + location + attract,
## data = dta_album)
##
## Residuals:
## Min 1Q Median 3Q Max
## -120.023 -27.754 -1.236 30.647 145.283
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -20.766805 18.231484 -1.139 0.256
## airplay 3.352494 0.279098 12.012 < 2e-16 ***
## adverts 0.084784 0.006946 12.206 < 2e-16 ***
## locationB -8.763636 8.346451 -1.050 0.295
## locationC -6.073371 8.267559 -0.735 0.463
## attract 11.051165 2.459551 4.493 1.2e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 47.19 on 194 degrees of freedom
## Multiple R-squared: 0.6666, Adjusted R-squared: 0.658
## F-statistic: 77.59 on 5 and 194 DF, p-value: < 2.2e-16
A reduced model (album.model.1): This model includes the predictors airplay, adverts, and attract.
album.model.1 <- lm(sales ~ airplay + adverts + attract, data = dta_album)
summary(album.model.1)
##
## Call:
## lm(formula = sales ~ airplay + adverts + attract, data = dta_album)
##
## Residuals:
## Min 1Q Median 3Q Max
## -121.324 -28.336 -0.451 28.967 144.132
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -26.612958 17.350001 -1.534 0.127
## airplay 3.367425 0.277771 12.123 < 2e-16 ***
## adverts 0.084885 0.006923 12.261 < 2e-16 ***
## attract 11.086335 2.437849 4.548 9.49e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 47.09 on 196 degrees of freedom
## Multiple R-squared: 0.6647, Adjusted R-squared: 0.6595
## F-statistic: 129.5 on 3 and 196 DF, p-value: < 2.2e-16
A further reduced model (album.model.2): This model includes only airplay and adverts.
album.model.2 <- lm(sales ~ airplay + adverts, data = dta_album)
summary(album.model.2)
##
## Call:
## lm(formula = sales ~ airplay + adverts, data = dta_album)
##
## Residuals:
## Min 1Q Median 3Q Max
## -112.121 -30.027 3.952 32.072 155.498
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41.123811 9.330952 4.407 1.72e-05 ***
## airplay 3.588789 0.286807 12.513 < 2e-16 ***
## adverts 0.086887 0.007246 11.991 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 49.38 on 197 degrees of freedom
## Multiple R-squared: 0.6293, Adjusted R-squared: 0.6255
## F-statistic: 167.2 on 2 and 197 DF, p-value: < 2.2e-16
Using the anova() Function: In R, the anova() function is used for comparing nested models. It’s essential to note that R is case sensitive; thus, the lowercase anova() function is distinct from the uppercase Anova() function.
Interpretation of Results
For model comparison, the null hypothesis is that the inclusion of additional predictors doesn’t significantly improve the fit, i.e., the change in \(R^2\) (denoted as \(\Delta R^2\)) is 0. If the p-value associated with the F-statistic is below a certain significance level (typically 0.05), we reject the null hypothesis, indicating that the full model (with more predictors) provides a significantly better fit to the data than the reduced model.
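Concretely, the F-statistic for comparing nested models (stated here in its general form) measures the increase in residual sum of squares incurred by dropping predictors, relative to the residual variance of the full model:
\[ F = \frac{(RSS_{\text{reduced}} - RSS_{\text{full}})/(df_{\text{reduced}} - df_{\text{full}})}{RSS_{\text{full}}/df_{\text{full}}} \]
where the df terms are the residual degrees of freedom of each model. This is the quantity the anova() output reports below.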
Change in \(R^2\)
The change in \(R^2\) can be calculated as:
\[ \Delta R^2 = R^2_{\text{full model}} - R^2_{\text{reduced model}} \]
This value indicates the additional variability explained by the full model compared to the reduced model.
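A minimal sketch of computing \(\Delta R^2\) directly from the fitted model objects (assuming the models defined above are in the workspace):
r2_full <- summary(album.model.full)$r.squared
r2_reduced <- summary(album.model.1)$r.squared
r2_full - r2_reduced # change in R-squared attributable to location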
Comparison of album.model.full and album.model.1, using the anova() function:
anova(album.model.full, album.model.1)
## Analysis of Variance Table
##
## Model 1: sales ~ airplay + adverts + location + attract
## Model 2: sales ~ airplay + adverts + attract
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 194 432017
## 2 196 434575 -2 -2557.6 0.5743 0.5641
The p-value is 0.5641, which is greater than 0.05, indicating that the inclusion of the location predictor in album.model.full does not significantly improve the fit compared to album.model.1.
In addition, the change in \(R^2\) is:
\[ \Delta R^2 = 0.6666 - 0.6647 = 0.0019 \]
This indicates that the location predictor explains an additional 0.19% of the variability in sales, which is negligible.
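As a quick arithmetic check (a sketch using the RSS values and residual degrees of freedom from the table above), the F-statistic can be reproduced by hand:
((434575 - 432017) / 2) / (432017 / 194) # ~0.574, matching the F value reported by anova()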
Comparison of album.model.1 and album.model.2:
anova(album.model.1, album.model.2)
## Analysis of Variance Table
##
## Model 1: sales ~ airplay + adverts + attract
## Model 2: sales ~ airplay + adverts
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 196 434575
## 2 197 480428 -1 -45853 20.681 9.492e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value is less than 0.001, suggesting that the inclusion of the attract predictor in album.model.1 provides a significantly better fit to the data than album.model.2, which does not include this predictor.
In addition, the change in \(R^2\) is:
\[ \Delta R^2 = 0.6647 - 0.6293 = 0.0354 \]
This means that the attract predictor explains an additional 3.54% of the variability in sales.
Conclusion
From the above analysis, we can conclude that the model album.model.1, which includes the predictors airplay, adverts, and attract, provides the best fit among the three models. While the inclusion of the location predictor in album.model.full did not significantly improve the fit, the exclusion of the attract predictor from album.model.2 significantly worsened the fit.
Multicollinearity arises when two or more predictor variables in a regression model are closely correlated to each other, meaning that one can be linearly predicted from the others with a significant degree of accuracy.
Issues with Multicollinearity
Coefficient Estimates: Multicollinearity can lead to large standard errors of the coefficient estimates, making them unstable. Unstable coefficients make it difficult to ascertain the effect of predictor variables on the response variable.
P-values: The significance of predictor variables can be affected, potentially leading to a variable that should be significant being considered otherwise.
Model Interpretation: It becomes challenging to determine the effect of individual predictors on the response variable due to their high intercorrelations.
Checking for Multicollinearity: Variance Inflation Factor (VIF)
The Variance Inflation Factor (VIF) provides an index that measures how much the variance of an estimated regression coefficient is increased due to multicollinearity. Typically, a VIF above 10 indicates high multicollinearity, and further investigations might be warranted.
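To make the definition concrete, here is a minimal sketch of computing the VIF for a single predictor by hand (using airplay from album.model.1 as an example): regress that predictor on the remaining predictors and take 1 / (1 - R^2) from that auxiliary regression.
aux <- lm(airplay ~ adverts + attract, data = dta_album) # auxiliary regression for airplay
1 / (1 - summary(aux)$r.squared) # should match car::vif() for airplay (about 1.04)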
Using the vif() function from the car package, we can calculate the VIF for each predictor variable in album.model.1:
library(car)
## Loading required package: carData
vif_results <- car::vif(album.model.1)
vif_results
## airplay adverts attract
## 1.042504 1.014593 1.038455
Interpretation
If any of the VIF values are greater than 10, it suggests that the corresponding predictor might be collinear with other predictors in the model. In such cases, it’s advisable to consider remedial actions, such as removing the predictor, combining it with other predictors, or applying regularization techniques.
All of the VIFs (variance inflation factors) are well below 10, so there appear to be no particular issues with multicollinearity.
While we touched upon this in an earlier section, the topic deserves dedicated attention given its importance in regression modeling.
In R, categorical variables, regardless of whether they have two or more levels, should be treated as factor variables. This ensures that the model doesn’t mistakenly treat them as continuous numeric variables. This distinction is particularly vital when categorical variables are coded numerically, as with values like 1, 2, 3, etc. In such cases, it’s crucial to ensure that these numbers aren’t misinterpreted as quantitative values but rather as distinct categories or groups.
Let’s demonstrate this by creating a categorical variable with numeric codes:
# Creating a categorical variable represented by numbers
dta_album$location_num <- sample(c(1, 2, 3),
nrow(dta_album),
replace = TRUE)
# Converting the numeric-coded variable into a factor
dta_album$location_num <- as.factor(dta_album$location_num)
If a factor is included as a predictor in the lm function, dummy coding happens automatically, with the lowest level serving as the reference category.
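To see exactly which dummy variables R will create (a quick sketch), you can inspect the contrast matrix for the factor, or the model matrix that lm() builds internally:
contrasts(dta_album$location_num) # 0/1 treatment coding, with level "1" as the reference
head(model.matrix(~ location_num, data = dta_album)) # the dummy columns that enter the regression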
model.dummy <- lm(sales ~ location_num, data = dta_album)
summary(model.dummy)
##
## Call:
## lm(formula = sales ~ location_num, data = dta_album)
##
## Residuals:
## Min 1Q Median 3Q Max
## -184.46 -60.69 5.00 50.56 165.54
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 194.462 10.032 19.384 <2e-16 ***
## location_num2 -9.462 14.243 -0.664 0.507
## location_num3 4.975 13.885 0.358 0.720
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 80.88 on 197 degrees of freedom
## Multiple R-squared: 0.005531, Adjusted R-squared: -0.004565
## F-statistic: 0.5479 on 2 and 197 DF, p-value: 0.5791
How to interpret the left column: location_num2 means the location_num variable, group 2, compared to the reference. You can identify R’s choice of reference category by noting which group is absent from the output. Given the presence of location_num2 and location_num3, the reference category must be location_num1.
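A more direct check (a small sketch): with R’s default treatment coding, the first element returned by levels() is the reference category.
levels(dta_album$location_num) # "1" "2" "3"; the first level, "1", serves as the reference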
Interpreting the regression coefficients:
The coefficient for location_num2 suggests that being in category 2 is associated with a decrease of 9.46 in sales compared to the reference category (location_num1).
The coefficient for location_num3 implies that being in category 3 is associated with an increase of 4.98 in sales compared to the reference category (location_num1).
However, neither of these differences is statistically significant, as indicated by their p-values being greater than 0.05.
Now, if we wish to compare location_num2 and location_num3 directly, we could compute the difference using the coefficients. Still, we wouldn’t be able to perform a formal statistical test for this specific comparison based on this output. Once we’ve ascertained the differences between location_num1 and the other categories, there is no residual information in this output for a direct comparison between location_num2 and location_num3.
The relevel function allows you to choose the reference category.
model.dummy.relevel <- lm(sales ~ relevel(factor(location_num),ref="3"), data = dta_album)
summary(model.dummy.relevel)
##
## Call:
## lm(formula = sales ~ relevel(factor(location_num), ref = "3"),
## data = dta_album)
##
## Residuals:
## Min 1Q Median 3Q Max
## -184.46 -60.69 5.00 50.56 165.54
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 199.437 9.599 20.777 <2e-16
## relevel(factor(location_num), ref = "3")1 -4.975 13.885 -0.358 0.720
## relevel(factor(location_num), ref = "3")2 -14.437 13.941 -1.036 0.302
##
## (Intercept) ***
## relevel(factor(location_num), ref = "3")1
## relevel(factor(location_num), ref = "3")2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 80.88 on 197 degrees of freedom
## Multiple R-squared: 0.005531, Adjusted R-squared: -0.004565
## F-statistic: 0.5479 on 2 and 197 DF, p-value: 0.5791
One key question when working with regression models is determining the significance of a particular variable, in this case location_num. None of the coefficients directly answers this on its own, as each reflects the contrast between a particular location and the reference category (which is location_num3 in this instance). To holistically understand the role of location_num in predicting sales, we must assess all of its coefficients concurrently.
Null Hypothesis (H0): All the coefficients are equal to 0. This indicates that location_num does not significantly affect sales.
Alternative Hypothesis (H1): At least one coefficient is not equal to 0, suggesting that some levels of location_num do have an effect on sales.
If even one coefficient isn’t zero, it’s indicative of differences in sales across the various location_num categories. This collective assessment is known as the ‘omnibus test’ or test of ‘joint significance’, and it is carried out with the F-test.
Procedure:
Test Statistic: Start with the F-ratio or F-statistic.
Initial Assumption: The null hypothesis assumes all coefficients are 0.
Sampling Distribution: Given the standard regression assumptions and the null hypothesis, the F-ratio follows an F distribution. The numerator degrees of freedom (df) corresponds to the number of predictors, in this case, 2 (since we have two dummy variables). The denominator df = N - p - 1, giving a denominator df of 197.
Observed F Value: The observed F value in the model is 0.5479. If the null hypothesis holds true, the probability of observing an F value this large (or larger) is 0.5791 (this probability is reproduced in the sketch after this list).
Decision: The p-value is considerably greater than the common alpha threshold of 0.05, so we do not reject the null hypothesis. This implies that sales do not appear to vary significantly across the location_num categories. It’s important to note that this omnibus test only tells us whether there are overall differences; it does not pinpoint where those differences lie. Such distinctions could be made within an ANOVA framework, using post-hoc tests. Alternatively, a more precise approach often involves choosing contrasts based on theoretical knowledge, known as planned comparisons. These are more directly manageable in regression and tend to be more powerful than post-hoc tests, as they use each data point optimally.
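As a quick numerical check (a sketch), the reported p-value for this omnibus test can be reproduced from the F distribution with 2 and 197 degrees of freedom:
pf(0.5479, df1 = 2, df2 = 197, lower.tail = FALSE) # ~0.579, matching the p-value in the summary output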
This framework can be applied to any discrete variable with: