The purpose of this assignment is to perform multiple linear regression on the Carseats data set. This data set contains child car seat sales for 400 store locations including prices, competitor prices, and other information about the store locations and communities in which they operate.
This data set contains sales of child car seats at 400 different stores. There are 11 variables, including unit sales (Sales), prices charged by competitor (CompPrice), community income level (Income), local advertising budget (Advertising), population size in the region (Population), price charged for the car seat (Price), quality of shelving location for the car seats (ShelveLoc), average age of the local population (Age), education level at each location (Education), whether the location is urban or not (Urban), and whether the location is in the US or not (US).
As the first step of our data exploration, we can load the Carseats data set and use the summary() command to get acquainted with our data. This gives us the min and max values, the mean, median, and 1st and 3rd quartiles for most variables. For others, we see count by level.
library(ISLR)
data("Carseats")
summary(Carseats)
## Sales CompPrice Income Advertising
## Min. : 0.000 Min. : 77 Min. : 21.00 Min. : 0.000
## 1st Qu.: 5.390 1st Qu.:115 1st Qu.: 42.75 1st Qu.: 0.000
## Median : 7.490 Median :125 Median : 69.00 Median : 5.000
## Mean : 7.496 Mean :125 Mean : 68.66 Mean : 6.635
## 3rd Qu.: 9.320 3rd Qu.:135 3rd Qu.: 91.00 3rd Qu.:12.000
## Max. :16.270 Max. :175 Max. :120.00 Max. :29.000
## Population Price ShelveLoc Age Education
## Min. : 10.0 Min. : 24.0 Bad : 96 Min. :25.00 Min. :10.0
## 1st Qu.:139.0 1st Qu.:100.0 Good : 85 1st Qu.:39.75 1st Qu.:12.0
## Median :272.0 Median :117.0 Medium:219 Median :54.50 Median :14.0
## Mean :264.8 Mean :115.8 Mean :53.32 Mean :13.9
## 3rd Qu.:398.5 3rd Qu.:131.0 3rd Qu.:66.00 3rd Qu.:16.0
## Max. :509.0 Max. :191.0 Max. :80.00 Max. :18.0
## Urban US
## No :118 No :142
## Yes:282 Yes:258
##
##
##
##
We can also run str() to see the structure of our data. For this analysis, I am going to keep it to numeric variables only, so this command helps me easily weed out the non-numeric variables.
str(Carseats)
## 'data.frame': 400 obs. of 11 variables:
## $ Sales : num 9.5 11.22 10.06 7.4 4.15 ...
## $ CompPrice : num 138 111 113 117 141 124 115 136 132 132 ...
## $ Income : num 73 48 35 100 64 113 105 81 110 113 ...
## $ Advertising: num 11 16 10 4 3 13 0 15 0 0 ...
## $ Population : num 276 260 269 466 340 501 45 425 108 131 ...
## $ Price : num 120 83 80 97 128 72 108 120 124 124 ...
## $ ShelveLoc : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
## $ Age : num 42 65 59 55 38 78 71 67 76 76 ...
## $ Education : num 17 10 12 14 13 16 15 10 10 17 ...
## $ Urban : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
## $ US : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
Now that we understand the structure of our data and which variables are numeric, we can plot these variables against each other to understand their distributions. The command pairs() allows you to plot select variables as shown below.
pairs(~Sales+CompPrice+Income+Advertising+Population+Price+Age+Education, data=Carseats)
Upon plotting we can see below that Education looks like it may be a factor rather than a numeric variable, so we can re-plot without this variable.
In the analysis titled MSDS 660 Week 2 Assignment, I tested the linear relationship between Advertising budget and Sales. This time, I would like to understand what other variables in addition to Advertising have a relationship with Sales.
pairs(~Sales+CompPrice+Income+Advertising+Population+Price+Age, data=Carseats)
Let’s start by testing Advertising, Income, and Price’s relationship with Sales. We can write out our model as follows.
model <- lm(Sales~Price+Advertising+Income, data=Carseats)
When testing multiple variables, we need to be cognizant of collinearity and make sure that this is not present amongst our predictor variables. Collinearity occurs when there is strong correlation between two variables; and multi-collinearity occurs when this exists between two or more variables.
As a first step, we can use cor() to evaluate correlation. This will populate a correlation matrix as shown below.
#install.packages("dplyr")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
cor(select(Carseats,1,3,4,6))
## Sales Income Advertising Price
## Sales 1.0000000 0.15195098 0.26950678 -0.44495073
## Income 0.1519510 1.00000000 0.05899471 -0.05669820
## Advertising 0.2695068 0.05899471 1.00000000 0.04453687
## Price -0.4449507 -0.05669820 0.04453687 1.00000000
We can see in the above that some correlation is present, however it is difficult to determine simply by looking at this output whether collinearity exists. To test for collinearity, we can calculate VIF. VIF stands for Variance Inflation Factor. It is equal to the ratio of the overall model variance to the variance of the model that only includes the single independent variable. If VIF is equal to 1, there is no correlation. If equal to 1-5, there is moderate correlation. If greater than 10, there is high correlation.
We can calculate VIF by using vif() from the car package, against our model as shown below. Since all VIFs ar equal to 1, we can confirm that there is no collinearity.
#install.packages("car")
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
vif(model)
## Price Advertising Income
## 1.005546 1.005814 1.007056
H0: β1 = β2 = β3 = 0
H1: At least one β <> 0
We can run the global F-test by using the command summary() against our model.
F-statistic is an indicator of whether there is a relationship between our predictor and response variables. If the null hypothesis is true, you can expect F to have a value close to 1.0 most of the time. A large F ratio means that the variation among group means is more than you’d expect to see by chance. (GraphPad Prism 9 Statistics Guide - Interpreting Results: One-Way ANOVA, n.d.)
summary(model)
##
## Call:
## lm(formula = Sales ~ Price + Advertising + Income, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.4568 -1.5420 -0.0753 1.4975 6.7877
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.172701 0.682736 17.829 < 2e-16 ***
## Price -0.053836 0.005051 -10.658 < 2e-16 ***
## Advertising 0.120237 0.017985 6.685 7.85e-11 ***
## Income 0.011066 0.004276 2.588 0.01 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.382 on 396 degrees of freedom
## Multiple R-squared: 0.2938, Adjusted R-squared: 0.2884
## F-statistic: 54.91 on 3 and 396 DF, p-value: < 2.2e-16
The value of the F-statistic is 54.19 (p-value < 2.2e-16) which clearly means we must reject the null hypothesis at the 0.05 significance level.
H0: β1 = 0
H1: β1 <> 0
H0: β2 = 0
H1: β2 <> 0
H0: β3 = 0
H1: β3 <> 0
Based on the results from the test in the step above, Price and Advertising will result in rejecting the null hypothesis, however Income will not.
- Price: p-value is 2e-16 which is smaller than the 0.05 significance level. The slope is approximately -0.06 which means for every $1 increase in price, sales will decrease by $56.
- Advertising: p-value is 7.85e-11 which is smaller than the 0.05 significance level. The slope is approximately 0.12 which means for every $1,000 increase in Advertising budget, sales will increase by $120. - Income: p-value is 0.01 which is larger than the 0.05 significance level. This means there is no relationship between Income and Sales.
We can also see that adjusted R-squared is 0.2884. R-squared is a measure of the fit to the model, with 1 representing a perfect fit. At 0.2884, we can see that the linear model does fit to some extent, however it is not perfect.
Since Income does not have a relationship with Sales, we can remove it from our model to see if this helps it fit the data better.
#install.packages("mctest")
library(mctest)
new_model <- lm(Sales~Price+Advertising, data=Carseats)
By running anova() below, we can compare the two models.
Null hypothesis: β3 = 0
Alternative hypothesis: β3 <> 0
We can see below that F=6.6955 (p-value = 0.01), which means we cannot reject the null hypothesis. This means Income do not significantly contribute to Sales when Price and Advertising are present in the model.
anova(new_model, model)
We can use plot() to check for model assumptions:
- Residuals vs. Fitted: This plot shows if residuals have non-linear patterns. Residuals are the differences between observed values and fitted values on the line of best fit. If residuals are spread equally around the the horizontal line, that is a good indication that you have a linear relationship. In this case, the points look about equally spread.
- Normal Q-Q: If residuals on this plot follow a straight line, then they are normally distributed. In this case, residuals look fairly straight.
- Scale-Location: The plot shows if residuals are spread equally along the ranges of predictors, which is how we can check the assumption of equal variance (homoscedasticity). In this case, the line is relatively flat so the variance should be fairly constant.
- Residuals vs. Leverage: This helps us to find influential outliers. If the the red dashed Cook’s distance line is not visible as shown below, we do not have any influential outliers.
plot(new_model)
While Price and Advertising have a statistically significant relationship with Sales, I do have some concerns:
We can conclude from our multiple linear regression model that Price and Advertising are predictors of Sales, however Income level of the community is not.
GraphPad Prism 9 Statistics Guide—Interpreting results: One-way ANOVA. (n.d.). Retrieved March 27, 2021, from https://www.graphpad.com/guides/prism/latest/statistics/f_ratio_and_anova_table_(one-way_anova).htm