Whenever we have two variables, we can measure the “relationship” between them by computing their correlation coefficient. However, we can do a little more with linear regression, which can be used for causal analysis, for prediction, or for both. Linear regression is essentially a method that fits a straight line through a cloud of points, and its results have a very nice interpretation. (This is the message of this assignment.)
The general equation between two (or more) variables is:
\(outcome_i\) = model + \(error_i\)
which means that the outcome for person i can be modeled by whatever model we fit to the data plus some error.
The model we will focus on for now is the linear model:
model = \(\alpha\) + \(\beta * covariate_i\)
So the equation that models the linear relationship between the outcome for person i, call it \(Y_i\), and that person’s observed explanatory variable, call it \(X_i\), is:
\[Y_i = \alpha + \beta * X_i + error_i\]
What we observe are the outcomes and the covariates, and our goal is to estimate \(\alpha\) (the intercept) and \(\beta\) (the slope). We use a mathematical technique called the method of least squares to find the line that best describes our data according to the linear equation above. This method fits the data by finding the \(\alpha\) and \(\beta\) that minimize the sum of squared residuals:
\[\sum_{i=1}^n (Y_i - \alpha - \beta X_i)^2\]
The solution to this minimization problem is given by \(\hat{\alpha}\) and \(\hat{\beta}\) – these are our estimators.
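For a single covariate, the least squares problem has a well-known closed-form solution: \(\hat{\beta} = \sum_i (X_i - \bar{X})(Y_i - \bar{Y}) / \sum_i (X_i - \bar{X})^2\) and \(\hat{\alpha} = \bar{Y} - \hat{\beta}\bar{X}\). The sketch below checks this on simulated data. The variables `x` and `y`, the helper `ssr`, and the true values 2 and 0.5 are made up for illustration and are not part of the assignment data.

```r
# Simulated data (hypothetical; not the album sales data)
set.seed(1)
x <- runif(50, 0, 100)
y <- 2 + 0.5 * x + rnorm(50, sd = 5)

# Closed-form least squares estimates for one covariate
beta_hat  <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
alpha_hat <- mean(y) - beta_hat * mean(x)

# Sum of squared residuals as a function of candidate parameters (a, b)
ssr <- function(a, b) sum((y - a - b * x)^2)

c(alpha_hat, beta_hat)    # estimates computed by hand
coef(lm(y ~ x))           # lm() should give the same two numbers
ssr(alpha_hat, beta_hat)  # moving a or b away from the estimates increases this
```

Trying `ssr(alpha_hat + 0.1, beta_hat)` or `ssr(alpha_hat, beta_hat - 0.01)` is a quick way to see that the estimates sit at the minimum.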
This is quite abstract, so let’s look at an example.
Suppose that I would like to know how many albums I should expect to sell given how much money I spend on advertising them. If I had data on album sales, \(Y\), and on how much money is spent on advertising, \(X\), then I would run the regression:
\[Y_i = \alpha + \beta X_i + error_i\]
I would get \(\hat{\alpha}\) and \(\hat{\beta}\) and then I would be able to:
So let’s do this.
Go to Canvas and download the file called Album Sales 1.dat. This file contains the data set that we will use for this exercise. Save this file in a folder.
Open R and set your working directory to the folder where the file is saved. To do this, type:
setwd("C:/…/Folder name")
where “C:/…/” represents the place on your computer where the file is found. You can find this place by right-clicking on the file name and looking under Location.
One way to import the data into R is to type:
Album.Sales.1 <- read.delim("C:/…/Album Sales 1.dat")
where “C:/…/” stands for the place on your computer where the file is found.
Or you can import the data via Tools -> Import Dataset -> From Local File.
Note: The “Tools -> Import” option will not work when you use Markdown. So if you use Markdown, you have to import the data by typing Album.Sales.1 <- read.delim("C:/…/Album Sales 1.dat")
You should now have a data set called Album.Sales.1 with two columns called adverts and sales. Each column has 200 observations, and each observation represents one album.
Make a scatterplot of sales against adverts:
plot(Album.Sales.1$adverts,Album.Sales.1$sales,
main = "Scatterplot: Album sales vs Amount spent promoting the album",
xlab = "Amount Spent on Adverts (thousands of dollars)",
ylab = "Record Sales (thousands)")
What type of relationship is there between the two variables, sales and adverts? What do you expect the sign of the slope parameter to be?
First, let’s look at the correlation between the two variables. To find the correlation type:
cor(Album.Sales.1$adverts,Album.Sales.1$sales)
What is this number? Does it confirm your answer to (6) above?
The correlation will not tell us much beyond the fact that when adverts increase, the sales …increase/decrease…
\[Sales_i = \alpha + \beta Adverts_i + error_i, \quad i = 1, 2, \dots, n\]
where i represents an album (not an individual). How many albums do we have in the data set?
length(Album.Sales.1$sales)
newModel <- lm(outcome ~ predictor, data=dataFrame)
where
regSales <- lm(sales ~ adverts, data=Album.Sales.1)
The line above creates an object in R called regSales that contains the results of our regression. We can show this object by:
summary(regSales)
For these data, the R squared is 0.3346. Since we only have one covariate, this R squared is the square of the correlation between sales and adverts. Check that this is the case.
sqrt(0.3346)
Is this the same as the answer that you got for (7)? It should be.
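The link between R squared and the correlation holds for any simple regression with one covariate, not only for this data set. A minimal sketch on simulated data is shown here; the data frame `toy` and its columns `x` and `y` are hypothetical names introduced for illustration.

```r
# Simulated data (hypothetical; not the album sales data)
set.seed(2)
toy <- data.frame(x = rnorm(100))
toy$y <- 1 + 2 * toy$x + rnorm(100)

fit <- lm(y ~ x, data = toy)

r2 <- summary(fit)$r.squared  # R squared reported by the regression
r  <- cor(toy$x, toy$y)       # correlation between covariate and outcome

r^2       # should equal r2
r2
sqrt(r2)  # equals the absolute value of r; squaring discards the sign
```

Note that `sqrt()` always returns a non-negative number, so with a negative correlation the square root of R squared would match the correlation only in absolute value.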
What percentage of the variation in sales is explained by adverts? What percentage is not explained?
What other variables, besides adverts, do you think affect sales?
The adverts variable is statistically significant: R reports its p-value as smaller than 2.2e-16 = \(2.2 \times 10^{-16}\), which is less than 0.001. This result tells us that \(\hat{\beta}\) is significantly different from zero, so we conclude that the advertising budget makes a significant contribution to predicting album sales. More about this next lecture, but you can get a head start by reading Chapter 4 on this.
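The p-value in the regression summary comes from a t statistic: the estimate divided by its standard error, compared against a t distribution with the residual degrees of freedom. The sketch below reproduces it by hand on simulated data; `toy`, `x`, `y`, and the chosen slope of 0.3 are assumptions made for illustration.

```r
# Simulated data (hypothetical; not the album sales data)
set.seed(5)
toy <- data.frame(x = rnorm(60))
toy$y <- 1 + 0.3 * toy$x + rnorm(60)

fit <- lm(y ~ x, data = toy)
tab <- coef(summary(fit))  # matrix: Estimate, Std. Error, t value, Pr(>|t|)

# t statistic for the slope: estimate divided by its standard error
t_stat <- tab["x", "Estimate"] / tab["x", "Std. Error"]

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_val <- 2 * pt(abs(t_stat), df = df.residual(fit), lower.tail = FALSE)

c(t_stat, p_val)
tab["x", c("t value", "Pr(>|t|)")]  # should match the two numbers above
```

The value 2.2e-16 that R prints for very small p-values is the machine precision, `.Machine$double.eps`, so it should be read as an upper bound rather than as the exact p-value.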
The value of \(\hat{\alpha}\) is 134.14 and the value of \(\hat{\beta}\) is 0.096. This tells us that if we increase the value of adverts by $1000, our sales will increase on average by how much? Remember that adverts is measured in thousands.
Suppose now that we want to predict the average sales for one particular album if we spend $100,000 on adverts for that album. We would compute:
\[\hat{Y} = 134.14 + 0.096*100\]
Use R to compute this answer. You must include the code you used in your assignment.
newAds <- data.frame(adverts = c(0, 100,200,300,1000))
preds <- predict(regSales,newAds)
preds
What do you get when you invest 0 dollars? Is that equal to \(\hat{\alpha}\)? Do you get the same answer via predict.lm for investing $100,000 as you did in (12)?
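For a model with one covariate, predict() simply evaluates \(\hat{\alpha} + \hat{\beta} X\) at each new value of the covariate. The sketch below illustrates this on simulated data; the names `toy`, `new_x`, `by_predict`, and `by_hand` are hypothetical and chosen for this example only.

```r
# Simulated data (hypothetical; not the album sales data)
set.seed(3)
toy <- data.frame(x = runif(80, 0, 10))
toy$y <- 5 + 1.5 * toy$x + rnorm(80)

fit <- lm(y ~ x, data = toy)

# New covariate values; the column name must match the name used in the formula
new_x <- data.frame(x = c(0, 2, 4))

by_predict <- predict(fit, new_x)
by_hand    <- coef(fit)[["(Intercept)"]] + coef(fit)[["x"]] * new_x$x

by_predict  # predictions from predict()
by_hand     # the same numbers computed from the estimated line
```

The first prediction, at `x = 0`, is the estimated intercept itself.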
plot(Album.Sales.1$adverts,Album.Sales.1$sales,
main = "Linear regression results",
xlab = "Amount Spent on Adverts (thousands of dollars)",
ylab = "Record Sales (thousands)")
abline(regSales,lty=2)
And now we want to plot the confidence interval as well:
con <- predict(regSales, interval = "confidence", level = 0.95) ## 95% confidence intervals for the fitted values
ord <- order(Album.Sales.1$adverts) ## sort by adverts so the bands are drawn from left to right
plot(Album.Sales.1$adverts,Album.Sales.1$sales,
main = "Linear regression results",
xlab = "Amount Spent on Adverts (thousands of dollars)",
ylab = "Record Sales (thousands)")
abline(regSales,lty=1, col="red")
matlines(Album.Sales.1$adverts[ord],con[ord,c("lwr","upr")],col="blue",lty=1)
Let’s see the 95% confidence intervals for our parameters:
confint(regSales, level = 0.95)
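For a linear model, each of these intervals is the estimate plus or minus a t critical value times the standard error. The sketch below rebuilds the intervals by hand on simulated data; `toy`, `est`, `se`, `t_crit`, and `by_hand` are hypothetical names used for illustration.

```r
# Simulated data (hypothetical; not the album sales data)
set.seed(4)
toy <- data.frame(x = runif(40, 0, 10))
toy$y <- 3 + 0.8 * toy$x + rnorm(40)

fit <- lm(y ~ x, data = toy)

est <- coef(summary(fit))[, "Estimate"]
se  <- coef(summary(fit))[, "Std. Error"]

# Critical value for a 95% interval: 97.5th percentile of the t distribution
t_crit <- qt(0.975, df = df.residual(fit))

by_hand <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)

by_hand
confint(fit, level = 0.95)  # should match the matrix above
```

If the interval for the slope does not contain zero, the slope is significantly different from zero at the 5% level, which agrees with a p-value below 0.05 in the regression summary.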