Introduction

The purpose of this assignment is to perform multiple linear regression on the Carseats data set. This data set contains child car seat sales for 400 store locations including prices, competitor prices, and other information about the store locations and communities in which they operate.

Lab

This data set contains sales of child car seats at 400 different stores. There are 11 variables, including unit sales (Sales), prices charged by competitor (CompPrice), community income level (Income), local advertising budget (Advertising), population size in the region (Population), price charged for the car seat (Price), quality of shelving location for the car seats (ShelveLoc), average age of the local population (Age), education level at each location (Education), whether the location is urban or not (Urban), and whether the location is in the US or not (US).

1) Pick 2-3 predictors (independent variables) and one response (dependent variable). List them. Perform appropriate data explorations. State your research questions.

As the first step of our data exploration, we can load the Carseats data set and use the summary() command to get acquainted with our data. This gives us the min and max values, the mean, median, and 1st and 3rd quartiles for most variables. For others, we see count by level.

library(ISLR)
data("Carseats")
summary(Carseats)
##      Sales          CompPrice       Income        Advertising    
##  Min.   : 0.000   Min.   : 77   Min.   : 21.00   Min.   : 0.000  
##  1st Qu.: 5.390   1st Qu.:115   1st Qu.: 42.75   1st Qu.: 0.000  
##  Median : 7.490   Median :125   Median : 69.00   Median : 5.000  
##  Mean   : 7.496   Mean   :125   Mean   : 68.66   Mean   : 6.635  
##  3rd Qu.: 9.320   3rd Qu.:135   3rd Qu.: 91.00   3rd Qu.:12.000  
##  Max.   :16.270   Max.   :175   Max.   :120.00   Max.   :29.000  
##    Population        Price        ShelveLoc        Age          Education   
##  Min.   : 10.0   Min.   : 24.0   Bad   : 96   Min.   :25.00   Min.   :10.0  
##  1st Qu.:139.0   1st Qu.:100.0   Good  : 85   1st Qu.:39.75   1st Qu.:12.0  
##  Median :272.0   Median :117.0   Medium:219   Median :54.50   Median :14.0  
##  Mean   :264.8   Mean   :115.8                Mean   :53.32   Mean   :13.9  
##  3rd Qu.:398.5   3rd Qu.:131.0                3rd Qu.:66.00   3rd Qu.:16.0  
##  Max.   :509.0   Max.   :191.0                Max.   :80.00   Max.   :18.0  
##  Urban       US     
##  No :118   No :142  
##  Yes:282   Yes:258  
##                     
##                     
##                     
## 

We can also run str() to see the structure of our data. For this analysis, I am going to keep it to numeric variables only, so this command helps me easily weed out the non-numeric variables.

str(Carseats)
## 'data.frame':    400 obs. of  11 variables:
##  $ Sales      : num  9.5 11.22 10.06 7.4 4.15 ...
##  $ CompPrice  : num  138 111 113 117 141 124 115 136 132 132 ...
##  $ Income     : num  73 48 35 100 64 113 105 81 110 113 ...
##  $ Advertising: num  11 16 10 4 3 13 0 15 0 0 ...
##  $ Population : num  276 260 269 466 340 501 45 425 108 131 ...
##  $ Price      : num  120 83 80 97 128 72 108 120 124 124 ...
##  $ ShelveLoc  : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
##  $ Age        : num  42 65 59 55 38 78 71 67 76 76 ...
##  $ Education  : num  17 10 12 14 13 16 15 10 10 17 ...
##  $ Urban      : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
##  $ US         : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...

Now that we understand the structure of our data and which variables are numeric, we can plot these variables against each other to understand their distributions. The command pairs() allows you to plot select variables as shown below.

pairs(~Sales+CompPrice+Income+Advertising+Population+Price+Age+Education, data=Carseats)

Upon plotting we can see below that Education looks like it may be a factor rather than a numeric variable, so we can re-plot without this variable.

In the analysis titled MSDS 660 Week 2 Assignment, I tested the linear relationship between Advertising budget and Sales. This time, I would like to understand what other variables in addition to Advertising have a relationship with Sales.

pairs(~Sales+CompPrice+Income+Advertising+Population+Price+Age, data=Carseats)

Let’s start by testing Advertising, Income, and Price’s relationship with Sales. We can write out our model as follows.

model <- lm(Sales~Price+Advertising+Income, data=Carseats)

When testing multiple variables, we need to be cognizant of collinearity and make sure that this is not present amongst our predictor variables. Collinearity occurs when there is strong correlation between two variables; and multi-collinearity occurs when this exists between two or more variables.

As a first step, we can use cor() to evaluate correlation. This will populate a correlation matrix as shown below.

#install.packages("dplyr")
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
cor(select(Carseats,1,3,4,6))
##                  Sales      Income Advertising       Price
## Sales        1.0000000  0.15195098  0.26950678 -0.44495073
## Income       0.1519510  1.00000000  0.05899471 -0.05669820
## Advertising  0.2695068  0.05899471  1.00000000  0.04453687
## Price       -0.4449507 -0.05669820  0.04453687  1.00000000

We can see in the above that some correlation is present, however it is difficult to determine simply by looking at this output whether collinearity exists. To test for collinearity, we can calculate VIF. VIF stands for Variance Inflation Factor. It is equal to the ratio of the overall model variance to the variance of the model that only includes the single independent variable. If VIF is equal to 1, there is no correlation. If equal to 1-5, there is moderate correlation. If greater than 10, there is high correlation.

We can calculate VIF by using vif() from the car package, against our model as shown below. Since all VIFs ar equal to 1, we can confirm that there is no collinearity.

#install.packages("car")
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
vif(model)
##       Price Advertising      Income 
##    1.005546    1.005814    1.007056

2) Test the entire model (or significance of the model) using a global F-test. State both null and alternative hypothesis. What is the value of the F-statistic test or p-value? What is your conclusion? Discuss.

H0: β1 = β2 = β3 = 0

H1: At least one β <> 0

We can run the global F-test by using the command summary() against our model.

F-statistic is an indicator of whether there is a relationship between our predictor and response variables. If the null hypothesis is true, you can expect F to have a value close to 1.0 most of the time. A large F ratio means that the variation among group means is more than you’d expect to see by chance. (GraphPad Prism 9 Statistics Guide - Interpreting Results: One-Way ANOVA, n.d.)

summary(model)
## 
## Call:
## lm(formula = Sales ~ Price + Advertising + Income, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.4568 -1.5420 -0.0753  1.4975  6.7877 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.172701   0.682736  17.829  < 2e-16 ***
## Price       -0.053836   0.005051 -10.658  < 2e-16 ***
## Advertising  0.120237   0.017985   6.685 7.85e-11 ***
## Income       0.011066   0.004276   2.588     0.01 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.382 on 396 degrees of freedom
## Multiple R-squared:  0.2938, Adjusted R-squared:  0.2884 
## F-statistic: 54.91 on 3 and 396 DF,  p-value: < 2.2e-16

The value of the F-statistic is 54.19 (p-value < 2.2e-16) which clearly means we must reject the null hypothesis at the 0.05 significance level.

3) Test significance of each explanatory variable (X). (You can use some results from the previous test with this one). State both the null and alternative hypotheses of each explanatory variable (list each in pairs). Which predictor(s) will cause you to reject the null hypothesis? Discuss. Give a brief interpretation of each coefficient in the model

H0: β1 = 0
H1: β1 <> 0

H0: β2 = 0
H1: β2 <> 0

H0: β3 = 0
H1: β3 <> 0

Based on the results from the test in the step above, Price and Advertising will result in rejecting the null hypothesis, however Income will not.
- Price: p-value is 2e-16 which is smaller than the 0.05 significance level. The slope is approximately -0.06 which means for every $1 increase in price, sales will decrease by $56.
- Advertising: p-value is 7.85e-11 which is smaller than the 0.05 significance level. The slope is approximately 0.12 which means for every $1,000 increase in Advertising budget, sales will increase by $120. - Income: p-value is 0.01 which is larger than the 0.05 significance level. This means there is no relationship between Income and Sales.

We can also see that adjusted R-squared is 0.2884. R-squared is a measure of the fit to the model, with 1 representing a perfect fit. At 0.2884, we can see that the linear model does fit to some extent, however it is not perfect.

4) Now, try a different model (e.g., include more predictors or use fewer predictors).Which model fits the data better? How do you select the model? Explain your answers.

Since Income does not have a relationship with Sales, we can remove it from our model to see if this helps it fit the data better.

#install.packages("mctest")
library(mctest)
new_model <- lm(Sales~Price+Advertising, data=Carseats)

By running anova() below, we can compare the two models.

Null hypothesis: β3 = 0
Alternative hypothesis: β3 <> 0

We can see below that F=6.6955 (p-value = 0.01), which means we cannot reject the null hypothesis. This means Income do not significantly contribute to Sales when Price and Advertising are present in the model.

anova(new_model, model)

5) Are there any other tests you should include in the analysis

We can use plot() to check for model assumptions:
- Residuals vs. Fitted: This plot shows if residuals have non-linear patterns. Residuals are the differences between observed values and fitted values on the line of best fit. If residuals are spread equally around the the horizontal line, that is a good indication that you have a linear relationship. In this case, the points look about equally spread.
- Normal Q-Q: If residuals on this plot follow a straight line, then they are normally distributed. In this case, residuals look fairly straight.
- Scale-Location: The plot shows if residuals are spread equally along the ranges of predictors, which is how we can check the assumption of equal variance (homoscedasticity). In this case, the line is relatively flat so the variance should be fairly constant.
- Residuals vs. Leverage: This helps us to find influential outliers. If the the red dashed Cook’s distance line is not visible as shown below, we do not have any influential outliers.

plot(new_model)

6) Address any other concerns you might have.

While Price and Advertising have a statistically significant relationship with Sales, I do have some concerns:

  • Advertising:
    • For every $1,000 increase in Advertising budget, we are only increasing Sales by $120. This means they are losing money when they invest in Advertising. So although Advertising budget can be a predictor of Sales, this model helps confirm that they may want to consider not Advertising at all if they stick with their current strategy
    • There are a lot of factors that go into an Advertising strategy (ad creative, where the ads are placed, who they are targeting, etc.). Companies should constantly be aiming to increase the effectiveness of their advertising by optimizing their strategy; so unless this company is not changing their Advertising strategy at all, this model will be inaccurate
  • Price:
    • The concept of how Price impacts Sales would need to also take costs and profit into account to be provide real guidance. For example, Sales increases when Price goes down. However, if Price goes down so much that it is lower than what the company purchased the materials for, the company is not going to make any profit. Therefore, this model won’t be helpful in driving decision making unless all factors are taken into account

Conclusion

We can conclude from our multiple linear regression model that Price and Advertising are predictors of Sales, however Income level of the community is not.

Sources

GraphPad Prism 9 Statistics Guide—Interpreting results: One-way ANOVA. (n.d.). Retrieved March 27, 2021, from https://www.graphpad.com/guides/prism/latest/statistics/f_ratio_and_anova_table_(one-way_anova).htm