Preliminaries: Open the data (available on Brightspace) and load the tidyverse and stargazer packages, as well as any other packages you plan to use.

library(tidyverse)
library(stargazer)

apples <- read_csv("apple.csv")
## Rows: 660 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (4): regprc, ecoprc, reglbs, ecolbs
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Below are some pointers about markdown syntax, which is useful for writing about your results.

Bold the things you really care about. Italicize the things you want to emphasize.

You can make a list:

  1. item 1
  2. item 2
  3. item 3

Or a bulleted list:

You can even write some math: \(Y_i = \beta_1 + \beta_2 X_i + u_i\).

OLS picks the \(\hat{\beta}_1\) and \(\hat{\beta}_2\) that minimize RSS.

Centered equations require two dollar signs on each side:

\[\hat{\beta}_2 = \frac{\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^n (X_i - \bar{X})^2}\]
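The slope formula above can be checked numerically. The following is a minimal sketch on simulated data (the variables `x` and `y` and the true coefficients are made up for illustration and are not the apple data); it computes the slope from the two sums in the formula and compares it with the slope that `lm()` reports.

```r
# Simulated data, for illustration only (not the apple survey data)
set.seed(1)
x <- runif(100, min = 0, max = 2)
y <- 2 - 0.8 * x + rnorm(100)

# Slope from the formula: sum of cross-deviations over sum of squared deviations
slope_by_hand <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Slope estimated by lm(), for comparison
slope_lm <- unname(coef(lm(y ~ x))["x"])

# TRUE if the two agree up to floating-point tolerance
all.equal(slope_by_hand, slope_lm)
```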

Data description:

The file contains data from an experimental survey. The survey presented participants with randomly determined prices for “eco-labeled” apples and regular apples and then asked how many eco-labeled and regular apples they would buy at those prices. For reference, eco-labeling helps consumers identify sustainably-produced (or “green”) products and helps firms command higher prices for their products. You will estimate the demand for eco-labeled and regular apples by running regressions of apple quantity on prices. The fact that the prices were randomly assigned means that the exogeneity assumption holds so long as both prices are included in the model.

Variables in the dataset:

- reglbs: Pounds of regular apples demanded
- ecolbs: Pounds of eco-labeled apples demanded
- regprc: Price of regular apples (per pound)
- ecoprc: Price of eco-labeled apples (per pound)

Exercise 1: Run a regression of reglbs on regprc. Interpret the slope coefficient. Is the sign of the slope consistent with what you know about demand curves?

model1 <- lm(reglbs ~ regprc, data = apples)
summary(model1)
## 
## Call:
## lm(formula = reglbs ~ regprc, data = apples)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.484 -1.277 -1.071  0.536 40.723 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.8896     0.4243   4.454 9.92e-06 ***
## regprc       -0.6879     0.4632  -1.485    0.138    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.907 on 658 degrees of freedom
## Multiple R-squared:  0.00334,    Adjusted R-squared:  0.001826 
## F-statistic: 2.205 on 1 and 658 DF,  p-value: 0.138

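A demand curve slopes downward: as price rises, quantity demanded falls. The estimated slope is -0.6879, meaning that a one-unit increase in regprc (the price per pound) is associated with a decrease of about 0.69 pounds in reglbs, on average. The negative sign is consistent with a downward-sloping demand curve, although the estimate is not statistically significant at conventional levels (p = 0.138).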
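The t value and p-value reported for regprc can be reproduced from the estimate and its standard error. This is a small sketch using numbers copied from the regression output above; the variable names are chosen here for illustration.

```r
# Values copied from the summary(model1) output
estimate  <- -0.6879
std_error <- 0.4632
df_resid  <- 658

# t statistic for the null hypothesis that the slope is zero
t_stat <- estimate / std_error

# Two-sided p-value from the t distribution with 658 degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = df_resid)

round(c(t = t_stat, p = p_value), 3)
```

Because the inputs are rounded to four decimals, the result should match the printed output only up to rounding.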

Exercise 2: Run a regression of ecolbs on ecoprc. Interpret the intercept coefficient.

model2 <- lm(ecolbs ~ ecoprc, data = apples)
summary(model2)
## 
## Call:
## lm(formula = ecolbs ~ ecoprc, data = apples)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.889 -1.298 -0.467  0.533 40.618 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.3881     0.3717   6.426 2.52e-10 ***
## ecoprc       -0.8452     0.3315  -2.550    0.011 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.515 on 658 degrees of freedom
## Multiple R-squared:  0.009783,   Adjusted R-squared:  0.008279 
## F-statistic: 6.501 on 1 and 658 DF,  p-value: 0.01101
# graphing for me
ggplot(apples, aes(ecoprc,ecolbs)) +
  geom_point() +
  geom_smooth(method = "lm")

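The intercept of 2.39 is the predicted value of ecolbs when ecoprc equals 0: at a price of zero, the model predicts that a consumer would demand about 2.39 pounds of eco-labeled apples. If no price of zero was actually offered in the survey, this is an extrapolation and should be read with caution. The slope for this model is also negative (-0.845) and thus also consistent with a downward-sloping demand curve.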
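To make the intercept and slope concrete, the fitted line can be evaluated at a few prices. The sketch below uses the coefficients copied from the output above; the prices are hypothetical values picked only for illustration.

```r
# Coefficients copied from the summary(model2) output
intercept <- 2.3881
slope     <- -0.8452

# Hypothetical prices per pound, chosen only for illustration
prices <- c(0, 0.5, 1, 1.5)

# Predicted pounds of eco-labeled apples at each price
predicted_lbs <- intercept + slope * prices

data.frame(ecoprc = prices, predicted_ecolbs = predicted_lbs)
```

The first row (price 0) reproduces the intercept, and each one-unit step in price lowers the prediction by the slope.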

Exercise 3: Run a regression of reglbs on regprc and ecoprc. How does the estimated coefficient on regprc change? What does this tell you about the correlation between regprc and ecoprc? Justify your answer and then use R to verify.

model3 <- lm(reglbs ~ regprc + ecoprc, data = apples)
summary(model3)
## 
## Call:
## lm(formula = reglbs ~ regprc + ecoprc, data = apples)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.661 -1.278 -0.895  0.546 40.897 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.7187     0.4448   3.864 0.000123 ***
## regprc       -1.5689     0.8318  -1.886 0.059723 .  
## ecoprc        0.8771     0.6880   1.275 0.202823    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.906 on 657 degrees of freedom
## Multiple R-squared:  0.0058, Adjusted R-squared:  0.002773 
## F-statistic: 1.916 on 2 and 657 DF,  p-value: 0.148

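Including both regprc and ecoprc in the regression of reglbs makes the estimated coefficient on regprc more negative: -1.5689, compared with -0.6879 in model 1. Because the coefficient on ecoprc is positive (0.8771), leaving ecoprc out pushes the regprc coefficient upward (toward zero here) only if the two prices are positively correlated. This suggests that the correlation between regprc and ecoprc is positive, which we can verify with the cor() function.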

cor(apples$regprc, apples$ecoprc)
## [1] 0.8307587

A correlation of 0.83 indicates a strong positive linear relationship (correlations range from -1 to 1): when one price is higher, the other tends to be higher as well.
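The link between the change in the regprc coefficient and the correlation of the two prices follows from the omitted variable identity: the slope from the short regression equals the slope from the long regression plus the coefficient on the omitted variable times the slope from regressing the omitted variable on the included one. A minimal sketch on simulated data (all names and coefficients below are made up for illustration, not taken from the apple data):

```r
# Simulated data, for illustration only: two positively correlated prices
set.seed(42)
n     <- 500
p_reg <- runif(n, 0.5, 1.5)
p_eco <- p_reg + rnorm(n, sd = 0.2)   # correlated with p_reg by construction
q_reg <- 2 - 1.5 * p_reg + 0.9 * p_eco + rnorm(n)

long  <- lm(q_reg ~ p_reg + p_eco)    # both prices included
short <- lm(q_reg ~ p_reg)            # p_eco omitted
aux   <- lm(p_eco ~ p_reg)            # omitted regressor on the included one

# Short slope = long slope + (coefficient on omitted) * (auxiliary slope)
implied_short <- coef(long)["p_reg"] + coef(long)["p_eco"] * coef(aux)["p_reg"]

# TRUE: the identity holds exactly in-sample
all.equal(unname(implied_short), unname(coef(short)["p_reg"]))
```

Applied to the reported estimates, the identity implies an auxiliary slope of about (-0.6879 + 1.5689) / 0.8771, which is roughly 1.0, so the positive sign matches the positive correlation returned by cor().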

Exercise 4: Run a regression of ecolbs on regprc and ecoprc. Identify and interpret r-squared.

model4 <- lm(ecolbs ~ regprc + ecoprc, data = apples)
summary(model4)
## 
## Call:
## lm(formula = ecolbs ~ regprc + ecoprc, data = apples)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.087 -1.087 -0.537  0.560 39.913 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.9653     0.3801   5.171 3.10e-07 ***
## regprc        3.0289     0.7108   4.261 2.33e-05 ***
## ecoprc       -2.9265     0.5879  -4.978 8.23e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.483 on 657 degrees of freedom
## Multiple R-squared:  0.03641,    Adjusted R-squared:  0.03348 
## F-statistic: 12.41 on 2 and 657 DF,  p-value: 5.107e-06

The \(R^2\) is 0.03641, meaning that regprc and ecoprc together explain only about 3.6% of the variation in ecolbs.
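For reference, \(R^2\) is one minus the ratio of the residual sum of squares to the total sum of squares of the dependent variable. The sketch below checks this against `summary()` on simulated data (the variables are invented for illustration and are not the apple data).

```r
# Simulated data, for illustration only
set.seed(7)
n  <- 200
x1 <- runif(n)
x2 <- runif(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n, sd = 3)

fit <- lm(y ~ x1 + x2)

rss <- sum(residuals(fit)^2)     # residual sum of squares
tss <- sum((y - mean(y))^2)      # total sum of squares of y
r2_by_hand <- 1 - rss / tss

# TRUE if the manual value matches the one reported by summary()
all.equal(r2_by_hand, summary(fit)$r.squared)
```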

Exercise 5: Summarize your regression results in a table. Which two of the four regressions have the highest R-squared? Why?

stargazer(model1, model2, model3, model4,
          type = "html")
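================================================================================
                                        Dependent variable:
                    ------------------------------------------------------------
                        reglbs         ecolbs         reglbs         ecolbs
                          (1)            (2)            (3)            (4)
--------------------------------------------------------------------------------
regprc                  -0.688                        -1.569*       3.029***
                        (0.463)                       (0.832)        (0.711)

ecoprc                                -0.845**         0.877        -2.926***
                                       (0.331)        (0.688)        (0.588)

Constant               1.890***       2.388***       1.719***       1.965***
                        (0.424)        (0.372)        (0.445)        (0.380)
--------------------------------------------------------------------------------
Observations              660            660            660            660
R2                       0.003          0.010          0.006          0.036
Adjusted R2              0.002          0.008          0.003          0.033
Residual Std. Error      2.907          2.515          2.906          2.483
                      (df = 658)     (df = 658)     (df = 657)     (df = 657)
F Statistic              2.205         6.501**         1.916        12.414***
                     (df = 1; 658)  (df = 1; 658)  (df = 2; 657)  (df = 2; 657)
================================================================================
Note:                                                *p<0.1; **p<0.05; ***p<0.01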

As the table shows, model 2 (ecolbs on ecoprc) and model 4 (ecolbs on regprc and ecoprc) have the highest \(R^2\), at 0.010 and 0.036 respectively. Both models use ecolbs as the dependent variable, with either ecoprc alone or both ecoprc and regprc as independent variables. The increase from model 2 to model 4 is expected: adding a regressor can never lower \(R^2\), and here regprc explains additional variation in ecolbs.

We should also note that regprc explains very little of the variation in reglbs: in model 1 its coefficient is not statistically significant at the 10% level (p = 0.138). Adding ecoprc in model 3 raised the \(R^2\) only slightly, from 0.003 to 0.006, which is still lower than in either model 2 or model 4.
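The comparison of \(R^2\) across nested models can be illustrated with a short simulation. In this sketch (simulated data with made-up names, not the apple data), a regressor that is unrelated to the outcome by construction is added to a model; \(R^2\) cannot fall, while adjusted \(R^2\) may move in either direction because it penalizes the extra parameter.

```r
# Simulated data, for illustration only
set.seed(3)
n     <- 300
price <- runif(n, 0.5, 1.5)
noise <- rnorm(n)                       # unrelated to the outcome by construction
qty   <- 2 - 0.8 * price + rnorm(n, sd = 2)

small <- lm(qty ~ price)
big   <- lm(qty ~ price + noise)        # same model plus the noise regressor

# R-squared is never lower in the larger (nested) model
c(small = summary(small)$r.squared, big = summary(big)$r.squared)

# Adjusted R-squared penalizes the extra regressor and can go up or down
c(small = summary(small)$adj.r.squared, big = summary(big)$adj.r.squared)
```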

Exercise 6: Construct a 99 percent confidence interval for the ecoprc coefficient from the regression described in exercise 4.

confint(model4, level = 0.99)
##                  0.5 %    99.5 %
## (Intercept)  0.9834318  2.947175
## regprc       1.1925995  4.865227
## ecoprc      -4.4452828 -1.407650

The output shows the lower and upper bounds for each coefficient. The 99% confidence interval for ecoprc is [-4.45, -1.41], and for regprc it is [1.19, 4.87]. The interpretation is that if we drew repeated samples and constructed an interval the same way each time, about 99% of those intervals would contain the true coefficient. None of the three intervals includes 0, so for each coefficient we can reject the null hypothesis that it equals zero at the 1% significance level.
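The ecoprc interval can also be built by hand as the estimate plus or minus the critical t value times the standard error. This sketch uses the estimate, standard error, and residual degrees of freedom copied from the model 4 output above, so it should match confint() only up to rounding.

```r
# Values copied from the summary(model4) output
estimate  <- -2.9265
std_error <- 0.5879
df_resid  <- 657

# Critical value for a two-sided 99% interval (0.5% in each tail)
t_crit <- qt(0.995, df = df_resid)

# Lower and upper bounds of the interval
ci <- estimate + c(-1, 1) * t_crit * std_error
round(ci, 2)
```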