Whenever we have two variables, we can measure the “relationship” between them by computing their correlation coefficient. However, we can do a little more with linear regression, which lets us carry out causal analysis, prediction, or both. Linear regression is essentially a method that fits a straight line through a cloud of points, and its results have a very nice interpretation. (This is the message of this assignment.)

The general equation relating two (or more) variables is:

\(outcome_i\) = model + \(error_i\)

which means that the outcome for person i can be modeled by whatever model we fit to the data plus some error.

The model we will focus on for now is the linear model:

model = \(\alpha\) + \(\beta * covariate_i\)

So the equation that models the linear relationship between the outcome for person i, call it \(Y_i\), and that person’s observed explanatory variable, call it \(X_i\), is:

\[Y_i = \alpha + \beta * X_i + error_i\]

What we observe are the outcomes and the covariates, and our goal is to estimate \(\alpha\) (the intercept) and \(\beta\) (the slope). We use a mathematical technique called the method of least squares to find the line that best describes our data according to the linear equation above. This method fits the data by finding the \(\alpha\) and \(\beta\) that minimize the sum of squared residuals:

\[\sum_{i=1}^n (Y_i - \alpha - \beta X_i)^2\]

The solution to this minimization problem is given by \(\hat{\alpha}\) and \(\hat{\beta}\) – these are our estimators.
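For reference, with a single covariate this minimization has a well-known closed-form solution. Setting the partial derivatives of the sum of squares with respect to \(\alpha\) and \(\beta\) to zero and solving gives the expressions below, where \(\bar{X}\) and \(\bar{Y}\) (notation introduced here, not used elsewhere in this assignment) denote the sample means of the covariate and the outcome:

```latex
\[
\hat{\beta} = \frac{\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^n (X_i - \bar{X})^2},
\qquad
\hat{\alpha} = \bar{Y} - \hat{\beta}\,\bar{X}
\]
```

In words, the slope is the sample covariance of \(X\) and \(Y\) divided by the sample variance of \(X\), and the fitted line always passes through the point of means.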

This is quite abstract, so let’s look at an example.


Suppose that I would like to know how many albums I should expect to sell given what I spend on advertising them. If I had data on album sales, \(Y\), and on how much money was spent on advertising, \(X\), then I would run the regression:

\[Y_i = \alpha + \beta X_i + error_i\]

I would get \(\hat{\alpha}\) and \(\hat{\beta}\) and then I would be able to:

  1. say that I would expect to sell \(\hat{\beta}\) more albums if I were to spend $1 more on advertising
  2. predict how many albums I would sell on average if I spent $10,000 on advertising, i.e. \[\hat{Y} = \hat{\alpha} + \hat{\beta}*10{,}000\]
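The two uses listed above can be sketched in R on simulated data. Everything in this block is an assumption made for illustration: the variable names `x` and `y`, the sample size, and the "true" intercept and slope used to generate the data are made up and have nothing to do with the album data set used later.

```r
# Simulated stand-in data (hypothetical values, NOT the album data set)
set.seed(1)
x <- runif(50, min = 0, max = 20000)        # advertising spend, in dollars
y <- 500 + 0.05 * x + rnorm(50, sd = 100)   # album sales

# Least-squares fit of y on x
fit <- lm(y ~ x)
alpha_hat <- coef(fit)[["(Intercept)"]]
beta_hat  <- coef(fit)[["x"]]

# 1. Expected change in sales for one extra dollar of advertising
beta_hat

# 2. Expected sales when spending $10,000 on advertising
y_hat <- alpha_hat + beta_hat * 10000
y_hat

# The fitted slope agrees with the closed form cov(x, y) / var(x)
beta_check <- cov(x, y) / var(x)
all.equal(beta_hat, beta_check)
```

Because the data are random draws, the estimates will be close to, but not exactly, the values 500 and 0.05 used to generate them.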

So let’s do this.

  1. Go to data and download the data set into a folder. The data set is called Album Sales 1.dat.

  2. Open R and set your working directory to the folder where the file is saved. To do this, type:

setwd("C:/…/Folder name")

where “C:/…/” represents the location on your computer where the file is found. You can find this location by right-clicking on the file name and looking under Location.

  1. Import the data set in R.

One way you can do this is to type:

Album.Sales.1 <- read.delim("C:/…/Album Sales 1.dat")

where “C:/…/” stands for the place on your computer where the file is found.

Or you can import the data via Tools -> Import Dataset -> From Local File.

Note: The “Tools -> Import” option will not work when you use Markdown. So if you use Markdown, you have to import the data by typing Album.Sales.1 <- read.delim("C:/…/Album Sales 1.dat")
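If the import misbehaves, it can help to see what read.delim() expects. The sketch below writes a tiny tab-delimited file to a temporary location and reads it back. The three rows are made-up values and the object name `toy` is hypothetical; the column names simply mirror the ones described for the album data. It also assumes the real file is tab-separated with a header row, which is what read.delim() expects by default.

```r
# Write a tiny tab-delimited file to a temporary location (made-up values)
tmp <- tempfile(fileext = ".dat")
writeLines(c("adverts\tsales",
             "10.5\t330",
             "985.7\t120",
             "1445.6\t360"), tmp)

# read.delim() assumes tab-separated columns and a header row by default
toy <- read.delim(tmp)

str(toy)    # column names and types
dim(toy)    # number of rows and columns
head(toy)   # first few rows
```

Running str(), dim(), and head() on Album.Sales.1 in the same way is a quick check that the columns and the number of observations look as expected.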

  1. You should now have a data set called Album.Sales.1 with two columns called adverts and sales. Each column has 200 observations, and each observation represents one album.

  2. Make a scatterplot of sales against adverts:

plot(Album.Sales.1$adverts,Album.Sales.1$sales, 
     main = "Scatterplot: Album sales vs Amount spent promoting the album", 
     xlab = "Amount Spent on Adverts (thousands of dollars)", 
     ylab = "Record Sales (thousands)")

  1. What type of relationship is there between the two variables, sales and adverts? What do you expect the sign of the slope parameter to be?

  2. First, let’s look at the correlation between the two variables. To find the correlation type:

cor(Album.Sales.1$adverts,Album.Sales.1$sales)
## [1] 0.5784877

What is this number? Does it confirm your answer to (6) above?

The correlation will not tell us much beyond the fact that when adverts increase, the sales …increase/decrease…
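To see what cor() computes, here is a minimal sketch on five made-up pairs of numbers (not the album data). The Pearson correlation is the covariance of the two variables divided by the product of their standard deviations, which is why it has no units and cannot say how much one variable changes per unit of the other.

```r
# Made-up numbers for illustration (not the album data)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 5)

# Pearson correlation: covariance scaled by both standard deviations
r_by_hand <- cov(x, y) / (sd(x) * sd(y))
r_builtin <- cor(x, y)

r_by_hand   # 0.8
r_builtin   # 0.8
```

The sign of the result gives the direction of the relationship, and the value is always between -1 and 1.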

  1. If we want to predict by how much sales will change on average when adverts change, we regress sales on adverts. The model is

\[Sales_i = \alpha + \beta Adverts_i + error_i, \quad i = 1, 2, \ldots, n\]

where i represents an album (not an individual). How many albums do we have in the data set?

length(Album.Sales.1$sales)
## [1] 200
  1. To run the regression, we use the function lm(), which stands for linear model. This function takes the general form:

newModel <- lm(outcome ~ predictor, data=dataFrame)

For our data, this becomes:

regSales <- lm(sales ~ adverts, data=Album.Sales.1)

The line above creates an object in R called regSales that contains the results of our regression. We can display a summary of this object with:

summary(regSales)
## 
## Call:
## lm(formula = sales ~ adverts, data = Album.Sales.1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -152.949  -43.796   -0.393   37.040  211.866 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.341e+02  7.537e+00  17.799   <2e-16 ***
## adverts     9.612e-02  9.632e-03   9.979   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 65.99 on 198 degrees of freedom
## Multiple R-squared:  0.3346, Adjusted R-squared:  0.3313 
## F-statistic: 99.59 on 1 and 198 DF,  p-value: < 2.2e-16
  1. To read this output let’s begin with R squared.

For this data, the R squared is 0.3346. Since we only have one covariate, this R squared is the square of the correlation between sales and adverts. Check that this is the case.

sqrt(0.3346)
## [1] 0.5784462
  1. Is this the same as the answer that you got for (7)? It should be, up to rounding, since 0.3346 is itself a rounded value.

  2. What percentage of the variation in sales is explained by adverts? What percentage is not explained?

  3. What other variables, besides adverts, do you think affect sales?
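The link between R squared, the correlation, and explained variation can be checked on any one-covariate regression. The sketch below uses R's built-in `cars` data (columns `speed` and `dist`) purely as a stand-in, so the numbers it prints are not the answers for the album data.

```r
# Stand-in example using R's built-in cars data (speed, dist)
fit <- lm(dist ~ speed, data = cars)

# R squared as reported by summary()
r2_summary <- summary(fit)$r.squared

# (a) Squared correlation between the outcome and the single covariate
r2_cor <- cor(cars$speed, cars$dist)^2

# (b) 1 - (residual sum of squares) / (total sum of squares)
ss_res <- sum(residuals(fit)^2)
ss_tot <- sum((cars$dist - mean(cars$dist))^2)
r2_ss  <- 1 - ss_res / ss_tot

# All three numbers agree
c(summary = r2_summary, cor_squared = r2_cor, from_sums = r2_ss)

# Share of the variation in the outcome left unexplained by the covariate
1 - r2_summary
```

Version (b) is the reason R squared is read as the share of variation explained: it compares the variation left in the residuals with the total variation in the outcome.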

The variable adverts is statistically significant: its p-value is reported as <2e-16, that is, smaller than \(2 \times 10^{-16}\), which is far below 0.001. This result tells us that \(\hat{\beta}\) is significantly different from zero, so we conclude that the advertising budget makes a significant contribution to predicting album sales. More about this next lecture – but you can get a head start by reading Chapter 4 on this.
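The numbers in the Coefficients table can also be pulled out of the fitted object instead of being read off the printout. The sketch below again uses the built-in `cars` data as a stand-in (not the album data) and recomputes the slope's t value and two-sided p-value by hand; the names `coef_table`, `t_by_hand`, and `p_by_hand` are chosen here for illustration.

```r
# Stand-in example using R's built-in cars data (speed, dist)
fit <- lm(dist ~ speed, data = cars)

# Coefficient table as a matrix: Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- coef(summary(fit))
coef_table

# p-value for the slope, and the comparison with 0.001
p_slope <- coef_table["speed", "Pr(>|t|)"]
p_slope
p_slope < 0.001

# The t value is the estimate divided by its standard error
t_by_hand <- coef_table["speed", "Estimate"] / coef_table["speed", "Std. Error"]

# Two-sided p-value from the t distribution with the residual degrees of freedom
p_by_hand <- 2 * pt(abs(t_by_hand), df = df.residual(fit), lower.tail = FALSE)
all.equal(p_by_hand, p_slope)
```

Extracting the p-value this way shows the stored number, which can be smaller than the floor that the printed summary displays for very small values.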

  1. The value of \(\hat{\alpha}\) is 134.14 and the value of \(\hat{\beta}\) is 0.096. This tells us that if we increase the value of adverts by $1000, our sales will increase on average by how much? Remember that adverts is measured in thousands.

  2. Suppose now that we want to predict our average sales for one particular album if we spend $100,000 on adverts for that album. We would compute:

\[\hat{Y} = 134.14 + 0.096*100\]

Use R to compute this answer. You must include the code you used in your assignment.

  1. We can use predict() (which calls predict.lm for our model) to predict sales for more values of adverts. That is, suppose we want to know how much we expect to sell if our advertising budget were $0, $100,000, $200,000, $300,000, or $1,000,000. Since adverts is measured in thousands, we would type:

newAds <- data.frame(adverts = c(0, 100,200,300,1000))
preds  <- predict(regSales,newAds)
preds
##        1        2        3        4        5 
## 134.1399 143.7524 153.3648 162.9773 230.2644

What do you get when you invest 0 dollars? Is that equal to \(\hat{\alpha}\)? Do you get the same answer via predict.lm for investing $100,000 as you did by hand in the previous step?
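To see what predict() does under the hood, the sketch below compares its output with the same predictions computed directly from the coefficients. It uses the built-in `cars` data as a stand-in, and the new covariate values 0, 10, and 20 are arbitrary choices for illustration.

```r
# Stand-in example using R's built-in cars data (speed, dist)
fit <- lm(dist ~ speed, data = cars)

# New covariate values; the column name must match the predictor in the formula
new_speeds <- data.frame(speed = c(0, 10, 20))
preds <- predict(fit, newdata = new_speeds)

# The same predictions computed directly from the estimated coefficients
by_hand <- coef(fit)[["(Intercept)"]] + coef(fit)[["speed"]] * new_speeds$speed

preds
by_hand
```

The prediction at a covariate value of 0 is exactly the estimated intercept. A by-hand calculation that uses rounded coefficients will differ slightly from predict(), which uses the full-precision estimates.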

  1. Finally, let’s plot the output of this linear regression.
plot(Album.Sales.1$adverts,Album.Sales.1$sales,
     main = "Linear regression results", 
     xlab = "Amount Spent on Adverts (thousands of dollars)", 
     ylab = "Record Sales (thousands)")
abline(regSales,lty=2)

And now we want to plot the confidence interval as well:

con <- predict(regSales, interval = "confidence", level = .95) ## 95% confidence intervals for the fitted line
ord <- order(Album.Sales.1$adverts) ## sort by adverts so the band is drawn from left to right
plot(Album.Sales.1$adverts,Album.Sales.1$sales,
     main = "Linear regression results", 
     xlab = "Amount Spent on Adverts (thousands of dollars)", 
     ylab = "Record Sales (thousands)")
abline(regSales,lty=1, col="red")
matlines(Album.Sales.1$adverts[ord],con[ord,c("lwr","upr")],col="blue",lty=1)
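It may help to look at what predict() returns when an interval is requested. The sketch below uses the built-in `cars` data as a stand-in (not the album data); the names `width` and `narrowest` are introduced here for illustration.

```r
# Stand-in example using R's built-in cars data (speed, dist)
fit <- lm(dist ~ speed, data = cars)

# One row per observation, in the original data order,
# with columns fit (fitted value), lwr and upr (interval limits)
con <- predict(fit, interval = "confidence", level = 0.95)
head(con)

# Width of the interval at each observation
width <- con[, "upr"] - con[, "lwr"]

# The band is narrowest at the observation closest to the mean of the covariate
narrowest <- which.min(width)
cars$speed[narrowest]
mean(cars$speed)
```

Because the rows follow the data order rather than the order of the covariate, sorting by the covariate before drawing the limits as lines gives a clean band instead of a zigzag.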

Let’s see the 95% confidence intervals for our parameters:

confint(regSales, level = 0.95)
##                    2.5 %      97.5 %
## (Intercept) 119.27768082 149.0021948
## adverts       0.07712929   0.1151197
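These intervals can be reproduced by hand as estimate ± (t critical value) × (standard error), using the residual degrees of freedom. The sketch below does this on the built-in `cars` data as a stand-in; the names `est`, `se`, `t_crit`, and `ci_by_hand` are chosen here for illustration.

```r
# Stand-in example using R's built-in cars data (speed, dist)
fit <- lm(dist ~ speed, data = cars)

est <- coef(summary(fit))[, "Estimate"]
se  <- coef(summary(fit))[, "Std. Error"]

# Critical value for a 95% interval: 2.5% in each tail of the t distribution
t_crit <- qt(0.975, df = df.residual(fit))

ci_by_hand <- cbind(lower = est - t_crit * se,
                    upper = est + t_crit * se)

ci_by_hand
confint(fit, level = 0.95)
```

In the album regression above, the interval for adverts runs from about 0.077 to 0.115 and does not contain zero, which is consistent with the very small p-value reported for that coefficient.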