The purpose of this assignment is to perform linear regression on two variables from the Carseats data set. This data set contains child car seat sales for 400 store locations including prices, competitor prices, and other information about the store locations and communities in which they operate.
To start, we will install the ISLR package which contains our Carseats data set. We can do this by running the install.packages() command. We can see below that this was installed successfully.
install.packages("ISLR")
## Installing package into 'C:/Users/linds/OneDrive/Documents/R/win-library/4.0'
## (as 'lib' is unspecified)
## package 'ISLR' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\linds\AppData\Local\Temp\Rtmpy2QxqK\downloaded_packages
Next, we need to load in our ISLR package, and then the Carseats data set from that package. We can load the package with the library() command, and the data set with the data() command.
library(ISLR)
data("Carseats")
View(Carseats)
We can check that these loaded successfully by running the head() command, which will preview the first 6 rows of data with all of our variables.
head(Carseats)
This data set contains sales of child car seats at 400 different stores. As confirmed by the nrow() command which counts the number of rows, each row represents an individual store meaning there are 400 observations. There are 11 variables, as we can see in the chart above, and also confirmed by ncol() which counts the number of columns.
Variables included in this data set are unit sales (Sales), prices charged by competitor (CompPrice), community income level (Income), local advertising budget (Advertising), population size in the region (Population), price charged for the car seat (Price), quality of shelving location for the car seats (ShelveLoc), average age of the local population (Age), education level at each location (Education), whether the location is urban or not (Urban), and whether the location is in the US or not (US).
nrow(Carseats)
## [1] 400
ncol(Carseats)
## [1] 11
As the first step of our data exploration, we can use the summary() command. This gives us the min and max values, the mean, median, and 1st and 3rd quartiles for most variables. For others, we see count by level.
summary(Carseats)
## Sales CompPrice Income Advertising
## Min. : 0.000 Min. : 77 Min. : 21.00 Min. : 0.000
## 1st Qu.: 5.390 1st Qu.:115 1st Qu.: 42.75 1st Qu.: 0.000
## Median : 7.490 Median :125 Median : 69.00 Median : 5.000
## Mean : 7.496 Mean :125 Mean : 68.66 Mean : 6.635
## 3rd Qu.: 9.320 3rd Qu.:135 3rd Qu.: 91.00 3rd Qu.:12.000
## Max. :16.270 Max. :175 Max. :120.00 Max. :29.000
## Population Price ShelveLoc Age Education
## Min. : 10.0 Min. : 24.0 Bad : 96 Min. :25.00 Min. :10.0
## 1st Qu.:139.0 1st Qu.:100.0 Good : 85 1st Qu.:39.75 1st Qu.:12.0
## Median :272.0 Median :117.0 Medium:219 Median :54.50 Median :14.0
## Mean :264.8 Mean :115.8 Mean :53.32 Mean :13.9
## 3rd Qu.:398.5 3rd Qu.:131.0 3rd Qu.:66.00 3rd Qu.:16.0
## Max. :509.0 Max. :191.0 Max. :80.00 Max. :18.0
## Urban US
## No :118 No :142
## Yes:282 Yes:258
##
##
##
##
We can use the plot() command to see pairwise scatterplots and their distributions. However, given there are so many variables, this is plotting very small.
plot(Carseats)
To narrow this down, we can use the str() command to look at the structure of our data. We can see that ShelveLoc, Urban, and US are all factors; these will not work seamlessly with linear regression, so let’s remove from this analysis.
str(Carseats)
## 'data.frame': 400 obs. of 11 variables:
## $ Sales : num 9.5 11.22 10.06 7.4 4.15 ...
## $ CompPrice : num 138 111 113 117 141 124 115 136 132 132 ...
## $ Income : num 73 48 35 100 64 113 105 81 110 113 ...
## $ Advertising: num 11 16 10 4 3 13 0 15 0 0 ...
## $ Population : num 276 260 269 466 340 501 45 425 108 131 ...
## $ Price : num 120 83 80 97 128 72 108 120 124 124 ...
## $ ShelveLoc : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
## $ Age : num 42 65 59 55 38 78 71 67 76 76 ...
## $ Education : num 17 10 12 14 13 16 15 10 10 17 ...
## $ Urban : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
## $ US : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
In addition to removing factor data types, we can also use a package called pairsD3 to increase the size of our pairs scatterplots and make them a bit easier to read.
Now that we can see the plots better, although Education was classified as a number, the data looks more similar to that of a factor.
install.packages("pairsD3")
## Installing package into 'C:/Users/linds/OneDrive/Documents/R/win-library/4.0'
## (as 'lib' is unspecified)
## package 'pairsD3' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\linds\AppData\Local\Temp\Rtmpy2QxqK\downloaded_packages
library(pairsD3)
pairsD3(Carseats[,c(1,2,3,4,5,6,8,9)], opacity = 0.89, cex = 2, width = 1200, big=TRUE)
#(Change Plot Size of Pairs Plot in R, n.d.)
Using the unique() command, we can see that Education looks like it is a factor with 9 levels.
sort(unique(Carseats$Education, decreasing = FALSE))
## [1] 10 11 12 13 14 15 16 17 18
Now plotting without Education, we can see how the data is distributed.
Looking at these distributions, there appears to be a few different linear relationships.
It is important to note that correlation does not imply causation. For example, competitor price and price are both correlated. Does this mean that one is causing the other, or rather could their prices both be impacted by similar factors - cost of materials, demand, etc.?
I am interested in understanding which variables have a relationship with sales, so my focus will be on the Sales scatterplots. Price and Sales appear to have a strong linear relationship, however I think this one is a bit obvious and I don’t think it would be that valuable to perform regression on it without taking profit into account. The next variable that appears to have a correlation is Advertising. Given I am currently in the Advertising field, I think it would be interesting to test if Advertising budget has a relationship with Sales.
pairsD3(Carseats[,c(1,2,3,4,5,6,8)], opacity = 0.89, cex = 2, width = 1200, big=TRUE)
To pull the Advertising Budget vs. Sales scatterplot out to review more closely, we can narrow down to the below.
plot(Carseats$Advertising, main = "Advertising Budget vs. Car Seat Sales", xlab = "Advertising Budget (in thousands of dollars)", ylab = "Car Seat Sales (in thousands of dollars)", Carseats$Sales, col = "red")
Before we decide to use Advertising Budget in our test, we can make sure that the correlation coefficient is strong relative to other variables.
Using the cor() command, we can see how closely two variables are correlated. This will output a correlation coefficient between -1 and 1; the closer to -1 or 1, the stronger the correlation. A negative correlation coefficient would imply a negative correlation, in that an increase in one variable is correlated with a decrease in the other; and vice versa for a positive correlation coefficient.
We can only run cor() against numeric data. To only select numeric variables, We can use select() from the dplyr package.
The below output shows there is a correlation coefficient of 0.27 between Advertising and Sales, which for Sales is the second highest after Price. This means that there is a positive correlation between Advertising and Sales; as one increases, so should the other. While 0.27 is not the strongest correlation, it is worth exploring whether or not this correlation is significant.
install.packages("dplyr")
## Installing package into 'C:/Users/linds/OneDrive/Documents/R/win-library/4.0'
## (as 'lib' is unspecified)
## package 'dplyr' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\linds\AppData\Local\Temp\Rtmpy2QxqK\downloaded_packages
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
cor(select(Carseats,1:6,8:9))
## Sales CompPrice Income Advertising Population
## Sales 1.00000000 0.06407873 0.151950979 0.269506781 0.050470984
## CompPrice 0.06407873 1.00000000 -0.080653423 -0.024198788 -0.094706516
## Income 0.15195098 -0.08065342 1.000000000 0.058994706 -0.007876994
## Advertising 0.26950678 -0.02419879 0.058994706 1.000000000 0.265652145
## Population 0.05047098 -0.09470652 -0.007876994 0.265652145 1.000000000
## Price -0.44495073 0.58484777 -0.056698202 0.044536874 -0.012143620
## Age -0.23181544 -0.10023882 -0.004670094 -0.004557497 -0.042663355
## Education -0.05195524 0.02519705 -0.056855422 -0.033594307 -0.106378231
## Price Age Education
## Sales -0.44495073 -0.231815440 -0.051955242
## CompPrice 0.58484777 -0.100238817 0.025197050
## Income -0.05669820 -0.004670094 -0.056855422
## Advertising 0.04453687 -0.004557497 -0.033594307
## Population -0.01214362 -0.042663355 -0.106378231
## Price 1.00000000 -0.102176839 0.011746599
## Age -0.10217684 1.000000000 0.006488032
## Education 0.01174660 0.006488032 1.000000000
Using the PerformanceAnalytics package, we can see the distribution of each variable - Advertising Budget and Sales - as well as the 0.27 correlation coefficient that we identified above. We can also see the same scatterplot we had created above, however this time plotted with a line of best fit.
install.packages("PerformanceAnalytics")
## Installing package into 'C:/Users/linds/OneDrive/Documents/R/win-library/4.0'
## (as 'lib' is unspecified)
## package 'PerformanceAnalytics' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\linds\AppData\Local\Temp\Rtmpy2QxqK\downloaded_packages
library(PerformanceAnalytics)
## Loading required package: xts
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
##
## first, last
##
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
##
## legend
data(quadprog)
## Warning in data(quadprog): data set 'quadprog' not found
AdvertisingSales <- data.frame(Carseats$Advertising,Carseats$Sales)
colnames(AdvertisingSales) <- c("Advertising Budget","Car Seat Sales")
chart.Correlation(AdvertisingSales, histogram=TRUE,pch=19)
Since I am interested in seeing if Advertising budget has a relationship with sales, Advertising will be my predictor variable and Sales my response variable.
We can perform linear regression on these variables by running lm(). The output tells us that the line of best fit intersects at 6.7370, as we can also see in the plot below. lm() also tells us the slope, which is 0.1144. This means for every thousand ad dollars, sales will increase by $114.40.
lm_model <- lm(Carseats$Sales~Carseats$Advertising)
lm_model
##
## Call:
## lm(formula = Carseats$Sales ~ Carseats$Advertising)
##
## Coefficients:
## (Intercept) Carseats$Advertising
## 6.7370 0.1144
plot(Carseats$Advertising, Carseats$Sales, xlab = "Advertising Budget (in thousands of dollars)", ylab = "Car Seat Sales (in thousands of dolalrs)", main = "Advertising Budget vs. Car Seat Sales")
abline(lm(Carseats$Sales~Carseats$Advertising), col="red")
Null Hypothesis: β1 = 0, there is no relationship between Advertising budget and car seat sales.
Alternative Hypothesis: β1 <> 0, there is a relationship between Advertising budget and car seat sales.
By running the command plot(), we can see 4 different plots that will help us determine if a linear regression is the right fit for this data set and if there might be any violations in normality or other required assumptions:
- Residuals vs. Fitted: This plot shows if residuals have non-linear patterns. Residuals are the differences between observed values and fitted values on the line of best fit. If residuals are spread equally around the the horizontal line, that is a good indication that you have a linear relationship. In this case, the points look about equally spread.
- Normal Q-Q: If residuals on this plot follow a straight line, then they are normally distributed. In this case, residuals look pretty straight.
- Scale-Location: The plot shows if residuals are spread equally along the ranges of predictors, which is how we can check the assumption of equal variance (homoscedasticity). In this case, the line is relatively flat so the variance should be fairly constant.
- Residuals vs. Leverage: This helps us to find influential outliers. If the the red dashed Cook’s distance line is not visible as shown below, we do not have any influential outliers.
(Understanding Diagnostic Plots for Linear Regression Analysis | University of Virginia Library Research Data Services + Sciences, n.d.)
plot(lm_model)
By running summary() against our linear regression model, we can see a summary of residual data points, the intercept and slope, as well as the below information:
- Standard Error: The average amount that the coefficient estimates vary from the actual average value of our response variable. We had previously calculated that the slope was 0.114. In this case, our standard error is 0.0205, which means our slope could vary by that amount
- t value: A measure of how many standard deviations our coefficient estimate is away from 0. If it is far away from 0, then this might indicate we can reject the null hypothesis. In this case, the t-statistic values are relatively far from zero and large relative to the standard error. This indicates that a relationship likely exists
- Pr(<|t|): This relates to the probability of observing any value equal to or larger than t. 3 asterisks indicate a highly significant p-value
- Residual standard error: This measures the quality of the linear regression fit, or in other words, the average amount that the response will deviate from the true regression line. In this case, actual sales can deviate from the true regression line by 2.723
- Multiple R-squared: Provides a measure of how well the model is fitting the actual data. The result will fall between 0 and 1, with 1 representing that it is a perfect linear relationship. In this case, R-squared is equal to 0.07263, which means it is not the strongest linear relationship
- F-statistic: An indicator of whether there is a relationship between our predictor and response variables. The further away from 1, the better. In this case, the F-statistic is 31.17 which is somewhat large relative to our data size
(Quick Guide: Interpreting Simple Linear Model Output in R, n.d.)
summary(lm_model)
##
## Call:
## lm(formula = Carseats$Sales ~ Carseats$Advertising)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.3770 -1.9634 -0.1037 1.7222 8.3208
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.7370 0.1925 35.007 < 2e-16 ***
## Carseats$Advertising 0.1144 0.0205 5.583 4.38e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.723 on 398 degrees of freedom
## Multiple R-squared: 0.07263, Adjusted R-squared: 0.0703
## F-statistic: 31.17 on 1 and 398 DF, p-value: 4.378e-08
Since the p-value is less than 0.05, we must reject the null hypothesis that Advertising budget has no relationship with car seat sales. This means that the relationship between Advertising and Sales is statistically significant.
There is evidence that somewhat of a linear relationship exists between Advertising and Sales, however this linear relationship is not strong. We can gather this from the plots I shared above, as well as the R-squared statistic which measures the goodness of fit to the model. This is only equal to 0.07263. We would want to see this closer to 1 in order to help conclude that it is a strong linear relationship.
This does not mean that Advertising budget does not have a relationship with Sales; it simply means that it is not a strong linear relationship.
I did not find that the model was in violation of any assumptions. The model assumptions are:
- The sample is representative of the population for the inference prediction
- Independence: the value of each outcome variable is independent from each other (need to know how data were collected)
- Linearity: the relationship between predictor and outcome variable is a reasonably straight line. This can be detected by plotting the data between observed values and predicted values. The points should be distributed along a diagonal line
- Normality: for a fixed value of x,the response y varies according to a normal distribution. Non-normally distributed (e.g. highly skewed, kurtosis) can distort relationship. This can be examined by a histogram, which should be close to normal distribution. The Kolmogorov-Smirnov, Anderson-Darling, and Shapiro-Wilk test provides inference statistics test on normality
- Homoscedasticity: the prediction error should be spread with the same degree (or constant) for the entire data range. In other words, probability distribution of the errors has constant variance. Violation of homoscedasticity is known as Heteroscedasticity, where error spread with different degree and with many shapes such as fan or bow-tie. This assumption can be verified by plotting the residual against the predicted values, the residuals should randomly scattered around the horizontal line and distribute relatively consistent
- Independence and normality of error: the prediction error should be independent from each other error and should be normally distributed. Independence can be detected by plotting residuals (e) against the predicted value (). The error distributed normality can be checked by normal quantile plot of residuals (theoretical standardized error and actual standardized error). The points should line close to the diagonal reference line
(From the Expert: Simple Linear Regression - MSDS660_X70_Statistical Meth Exp Design, n.d.)
In this test we found that there is in fact a relationship between Advertising budget and car seats sales. Even though the relationship is statistically significant, it is not all that significant in reality. An increase in sales by $144 for every $1,000 of advertising is not a strong return on ad spend. In Advertising, return on ad spend (ROAS) is calculated by dividing revenue by ad spend. In this case, ROAS comes out to $0.14. That means for every dollar spent on advertising, we are only getting $0.14 back, so we are losing money.
Change plot size of pairs plot in R. (n.d.). Stack Overflow. Retrieved March 24, 2021, from https://stackoverflow.com/questions/22396322/change-plot-size-of-pairs-plot-in-r
From the Expert: Simple Linear Regression—MSDS660_X70_Statistical Meth Exp Design. (n.d.). Retrieved March 23, 2021, from https://worldclass.regis.edu/d2l/le/content/264031/viewContent/3829839/View
Quick Guide: Interpreting Simple Linear Model Output in R. (n.d.). Retrieved March 23, 2021, from https://feliperego.github.io/blog/2015/10/23/Interpreting-Model-Output-In-R
Understanding Diagnostic Plots for Linear Regression Analysis | University of Virginia Library Research Data Services + Sciences. (n.d.). Retrieved March 23, 2021, from https://data.library.virginia.edu/diagnostic-plots/