Module1_Discussion1

Author

Will Brewster

Part I.

  • Data set #1:
    1. The data set “Guns” is a panel data set containing 1,173 observations on 13 variables, including the murder rate and robbery rate, for all 50 states plus the District of Columbia between 1977 and 1999 (51 units × 23 years = 1,173).
    2. The data set is a panel because it records the same units (states) repeatedly, once per year. For example, there are 23 observations of Alabama’s various crime rates, one for each year.
    3. We can show a brief overview of the entire data set and then focus on Massachusetts, including a scatter plot:
  year violent murder robbery prisoners     afam     cauc     male population
1 1977   414.4   14.2    96.8        83 8.384873 55.12291 18.17441   3.780403
2 1978   419.1   13.3    99.1        94 8.352101 55.14367 17.99408   3.831838
3 1979   413.3   13.2   109.5       144 8.329575 55.13586 17.83934   3.866248
4 1980   448.5   13.2   132.1       141 8.408386 54.91259 17.73420   3.900368
5 1981   470.5   11.9   126.5       149 8.483435 54.92513 17.67372   3.918531
6 1982   447.7   10.6   112.0       183 8.514000 54.89621 17.51052   3.925229
    income   density   state law
1 9563.148 0.0745524 Alabama  no
2 9932.000 0.0755667 Alabama  no
3 9877.028 0.0762453 Alabama  no
4 9541.428 0.0768288 Alabama  no
5 9548.351 0.0771866 Alabama  no
6 9478.919 0.0773185 Alabama  no

Interestingly, the Massachusetts murder rate drops fairly sharply in 1996, even though the data record no change in the law at that point.
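The overview and the Massachusetts plot described above could be produced roughly as follows. This is only a sketch: it assumes the AER package is installed, and the helper column `year_num` and the plot labels are illustrative names that are not part of the original analysis.

```r
# Assumes install.packages("AER") has been run; loads only the data set
data("Guns", package = "AER")

dim(Guns)    # number of rows and columns
head(Guns)   # first six rows (Alabama)

# Panel check: count how many yearly observations each state has
obs_per_state <- table(Guns$state)
print(range(obs_per_state))

# Massachusetts only; year is converted defensively in case it is stored as a factor
ma <- subset(Guns, state == "Massachusetts")
ma$year_num <- as.numeric(as.character(ma$year))

plot(ma$year_num, ma$murder,
     xlab = "Year", ylab = "Murder rate",
     main = "Massachusetts murder rate by year")
```

If every state shows the same count in `obs_per_state`, the panel is balanced.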

  • Data set #2:

    1. I picked the data set “SmokeBan”, also from the package AER.
    2. SmokeBan contains 10,000 observations on 7 variables and addresses the question of whether workplace smoking bans affect smoking among indoor workers. The variables are smoking status, ban status, age, education level, African-American and Hispanic indicators, and gender.
    3. The data are cross-sectional since they were collected at a specific point in time. The data date to 1991 and come from the National Health Interview Survey. Below we see that the share of smokers falls as education level rises. Using the same format, we can see that there are fewer smokers when a ban is in place:
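One way to check both comparisons numerically, rather than by plot, is to tabulate the share of smokers within each group. This is a minimal sketch that assumes the AER package is installed and that `smoker`, `ban`, and `education` are the column names in `SmokeBan`; the object names `smoke_by_educ` and `smoke_by_ban` are made up for illustration.

```r
# Assumes install.packages("AER") has been run; loads only the data set
data("SmokeBan", package = "AER")

dim(SmokeBan)   # rows and columns

# Share of smokers within each education level (each row sums to 1)
smoke_by_educ <- prop.table(table(SmokeBan$education, SmokeBan$smoker), margin = 1)
print(round(smoke_by_educ, 3))

# Share of smokers with and without a workplace ban (each row sums to 1)
smoke_by_ban <- prop.table(table(SmokeBan$ban, SmokeBan$smoker), margin = 1)
print(round(smoke_by_ban, 3))
```

Using `margin = 1` normalises within rows, so each row reads as "of the people in this group, what fraction smoke".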

Part II.

  1. Covariance is a measure of how two different variables move together. For example, if y tends to increase as x increases, then the covariance is positive. Variance, on the other hand, is an indicator of the dispersion of a single variable: it is the average squared deviation from the mean, which makes it the square of the standard deviation.

  2. We learned that the \(\beta_1\) coefficient (the slope of the line) can be written as \(\frac{cov(X,Y)}{var(X)}\). Thinking intuitively, we know that the slope can also be defined as \(\frac{y_1 - y_0}{x_1-x_0}\) for the points \((x_0,y_0)\) and \((x_1, y_1)\). If y is dependent on x, then the covariance can be thought of as the vertical movement of the two variables together, and the variance of x as the horizontal spread (similar to the bins of a histogram on the horizontal axis), so their ratio gives an average "rise over run", which is the slope. I’m not sure if that makes sense!

    If we expand these formulas, that gives:

    \[ \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{n}\div{\sum(x_i - \bar{x})^2p(x)} \]

    \[ =\frac{\sum x_i(y_i - \bar{y})}{{\sum x_i^2 - n\bar{x} ^ 2}} \]

    \[ =\frac{\sum x_iy_i - n\bar{x}\bar{y}}{{\sum x_i^2 - n\bar{x} ^ 2}} \]

It almost seems like the n and p(x) cancel each other? Not sure where to go from here… any tips?
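One possible way to see the cancellation, assuming every observation carries equal weight so that \(p(x_i) = \frac{1}{n}\) (an assumption, since the weights are not stated above):

\[ \beta_1 = \frac{\frac{1}{n}\sum(x_i - \bar{x})(y_i - \bar{y})}{\frac{1}{n}\sum(x_i - \bar{x})^2} = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2} \]

For the numerator, \(\sum(y_i - \bar{y}) = 0\), so the \(\bar{x}\) term drops out:

\[ \sum(x_i - \bar{x})(y_i - \bar{y}) = \sum x_i(y_i - \bar{y}) - \bar{x}\sum(y_i - \bar{y}) = \sum x_iy_i - n\bar{x}\bar{y} \]

For the denominator, using \(\sum x_i = n\bar{x}\):

\[ \sum(x_i - \bar{x})^2 = \sum x_i^2 - 2\bar{x}\sum x_i + n\bar{x}^2 = \sum x_i^2 - n\bar{x}^2 \]

The same cancellation happens with the sample versions that divide by \(n - 1\), because the factor appears in both the covariance and the variance.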

  3. Applying this to another data set, “Life Expectancy” (OpenIntro), we have a data frame with 3,142 observations of life expectancy and median income in U.S. counties:

Constructing a linear model, we have:


Call:
lm(formula = expectancy ~ income, data = life_exp)

Residuals:
     Min       1Q   Median       3Q      Max 
-11.4420  -1.1942   0.1866   1.3092   5.5884 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 7.298e+01  1.192e-01  612.35   <2e-16 ***
income      1.043e-04  2.832e-06   36.82   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.799 on 3080 degrees of freedom
  (60 observations deleted due to missingness)
Multiple R-squared:  0.3056,    Adjusted R-squared:  0.3054 
F-statistic:  1356 on 1 and 3080 DF,  p-value: < 2.2e-16

Using the covariance/variance formula:

cov(life_exp$income, life_exp$expectancy, use = "pairwise.complete.obs")/var(life_exp$income, use = "pairwise.complete.obs")
[1] 0.0001042713
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 60 rows containing missing values or values outside the scale range
(`geom_point()`).

`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 801 rows containing missing values or values outside the scale range
(`geom_point()`).

So, we see that the \(\beta_1\) value turns out to be 0.000104 in both cases. Because incomes span such a wide range, the fitted line looks steeper on the full plot than it does up close (zooming in on a narrower income range shows a much gentler incline).
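The agreement between the two calculations is not specific to this data set. The sketch below repeats the comparison on simulated data; the variable names mirror the ones above, but the values, the sample size, and the data frame `sim` are all hypothetical and are not the county data. It also restricts both calculations to the same complete rows, which is what `lm()` does by default.

```r
set.seed(1)

# Hypothetical stand-in data (NOT the county data)
n <- 200
income <- runif(n, min = 20000, max = 100000)
expectancy <- 73 + 1e-4 * income + rnorm(n, sd = 1.8)
expectancy[sample(n, 10)] <- NA   # add some missing values
sim <- data.frame(income, expectancy)

# Slope from lm(); rows with missing values are dropped by the default na.action
fit <- lm(expectancy ~ income, data = sim)
slope_lm <- coef(fit)[["income"]]

# Covariance / variance on the same complete rows
cc <- complete.cases(sim)
x <- sim$income[cc]
y <- sim$expectancy[cc]
slope_ratio <- cov(x, y) / var(x)

# Raw sums of deviations: the n - 1 factors in cov() and var() cancel
slope_sums <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

print(c(lm = slope_lm, ratio = slope_ratio, sums = slope_sums))
```

One design note: when `var()` is given a single vector with `use = "pairwise.complete.obs"`, it drops only that vector's own missing values, so it may use rows that `lm()` discards. Subsetting with `complete.cases()` first keeps the two calculations on identical rows.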