1) File Load

#Reading the Data File
data_table <- "C:\\Users\\Swoop\\Documents\\birthweight_sample.csv"
read_data_table <- read.csv(data_table)
head(read_data_table)

##   faminc cigtax cigprice bwght fatheduc motheduc parity male white cigs
## 1   13.5    2.0    107.6   143       12       12      1    1     1    0
## 2   27.5   21.0    127.7   135       12       12      1    0     1    0
## 3   32.5   25.0    133.0   138       11       11      2    0     1    0
## 4   65.0   18.0    122.8   114       18       16      1    0     1    0
## 5    8.5    2.5    109.4   160       12       13      1    1     1    0
## 6   22.5    2.0    107.6    94        8        8      2    0     1   30
##     lbwght bwghtlbs packs  lfaminc state
## 1 4.962845   8.9375   0.0 2.602690     2
## 2 4.905275   8.4375   0.0 3.314186    19
## 3 4.927254   8.6250   0.0 3.481240    25
## 4 4.736198   7.1250   0.0 4.174387    13
## 5 5.075174  10.0000   0.0 2.140066     3
## 6 4.543295   5.8750   1.5 3.113515     2

2) DataCleaning

The fatheredu and the motheredu variables have a few N/A’s or missing entries. This can result in biased entries and at times no entries. Here I have changed the n/a’s to 0 so that they have no effect to the regressions(or I could have deleted the entries, but I choose to do the 0 method).

3)Detailed Summary of Variables

summary(read_data_table)

##      faminc         cigtax        cigprice         bwght      
##  Min.   : 0.5   Min.   : 2.0   Min.   :103.8   Min.   : 23.0  
##  1st Qu.:14.5   1st Qu.:13.0   1st Qu.:122.8   1st Qu.:107.0  
##  Median :27.5   Median :20.0   Median :130.7   Median :120.0  
##  Mean   :29.4   Mean   :19.4   Mean   :130.4   Mean   :118.9  
##  3rd Qu.:42.5   3rd Qu.:26.0   3rd Qu.:137.0   3rd Qu.:132.0  
##  Max.   :65.0   Max.   :38.0   Max.   :152.5   Max.   :271.0  
##     fatheduc        motheduc         parity           male       
##  Min.   :-2.00   Min.   : 0.00   Min.   :1.000   Min.   :0.0000  
##  1st Qu.:11.00   1st Qu.:12.00   1st Qu.:1.000   1st Qu.:0.0000  
##  Median :12.00   Median :12.00   Median :1.000   Median :1.0000  
##  Mean   :11.32   Mean   :12.93   Mean   :1.637   Mean   :0.5142  
##  3rd Qu.:15.00   3rd Qu.:14.00   3rd Qu.:2.000   3rd Qu.:1.0000  
##  Max.   :18.00   Max.   :18.00   Max.   :6.000   Max.   :1.0000  
##      white             cigs            lbwght         bwghtlbs     
##  Min.   :0.0000   Min.   : 0.000   Min.   :3.135   Min.   : 1.438  
##  1st Qu.:1.0000   1st Qu.: 0.000   1st Qu.:4.673   1st Qu.: 6.688  
##  Median :1.0000   Median : 0.000   Median :4.787   Median : 7.500  
##  Mean   :0.7806   Mean   : 2.139   Mean   :4.761   Mean   : 7.429  
##  3rd Qu.:1.0000   3rd Qu.: 0.000   3rd Qu.:4.883   3rd Qu.: 8.250  
##  Max.   :1.0000   Max.   :50.000   Max.   :5.602   Max.   :16.938  
##      packs          lfaminc            state      
##  Min.   :0.000   Min.   :-0.6931   Min.   : 1.00  
##  1st Qu.:0.000   1st Qu.: 2.6741   1st Qu.:13.00  
##  Median :0.000   Median : 3.3142   Median :22.00  
##  Mean   :0.107   Mean   : 3.0778   Mean   :22.29  
##  3rd Qu.:0.000   3rd Qu.: 3.7495   3rd Qu.:32.00  
##  Max.   :2.500   Max.   : 4.1744   Max.   :44.00

4) Histogram

We see that histogram_bwgth has a normal distribution centered around 110.
We see that histogram_cigs follows a long right tail distribution. We can see that people either get more addicted to smoking or give it up completely (around cigs = 20).

histogram_bwght <- read_data_table %>% ggplot(aes(bwght)) + geom_histogram(binwidth = 10)
histogram_cigs <- read_data_table %>% ggplot(aes(cigs)) + geom_histogram(binwidth = 10)

histogram_bwght

histogram_cigs

3 A Simple Regression

1) Univariate Regression

The coefficent is -.4996 with an SD of .09537 meaning a one-unit increase in cig. smoking leads to approximately .5 oz, decrease in birth weight. Also, the p-value here is less than alpha at .05 (as we shall see in the table in the next R function) therefore the confidence interval is statistically signifcant.

uni_var_reg <- lm(bwght ~ cigs, data = read_data_table) 
uni_var_reg

## 
## Call:
## lm(formula = bwght ~ cigs, data = read_data_table)
## 
## Coefficients:
## (Intercept)         cigs  
##    119.9381      -0.4996

2) Table of Results and Summaries

summary(uni_var_reg)

## 
## Call:
## lm(formula = bwght ~ cigs, data = read_data_table)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -96.938 -11.814   0.062  13.062 151.062 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 119.93805    0.62191 192.854  < 2e-16 ***
## cigs         -0.49962    0.09537  -5.239 1.91e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.3 on 1192 degrees of freedom
## Multiple R-squared:  0.02251,    Adjusted R-squared:  0.02169 
## F-statistic: 27.45 on 1 and 1192 DF,  p-value: 1.909e-07

confint(uni_var_reg, level = .95)

##                   2.5 %      97.5 %
## (Intercept) 118.7178860 121.1582169
## cigs         -0.6867292  -0.3125138

uni_var_reg

## 
## Call:
## lm(formula = bwght ~ cigs, data = read_data_table)
## 
## Coefficients:
## (Intercept)         cigs  
##    119.9381      -0.4996

3) Reproducible Graph

Reproducible Graph The scatterplot itself shows no correlation, but a best-fit line shows the negative effects of smoking on birth weight.

uni_reg_model <- read_data_table %>% ggplot(aes(x = cigs, y = bwght)) + geom_point() + geom_smooth(method = lm) + labs(x = "cigs", y = "bwght", title = "Scatter Plot of Birth Weight on Maternal Smoking")
uni_reg_model

More Regressions

Multiple Regression The table shows that a cigarette smoked leads to a -.4398 decrease and its t-value indicates that it is statistically signifcant (you can also look at the ’*’s to check). Fatheredu and parity are statistically signficant and contribute to a .393 and 1.43 oz increase in birth weight. Motheredu is not statistically signifcant however.

mul_reg_model <- lm(bwght ~ cigs + motheduc + fatheduc + parity, data = read_data_table)
mul_reg_model

## 
## Call:
## lm(formula = bwght ~ cigs + motheduc + fatheduc + parity, data = read_data_table)
## 
## Coefficients:
## (Intercept)         cigs     motheduc     fatheduc       parity  
##   112.04339     -0.43982      0.07487      0.39266      1.43842

summary(mul_reg_model)

## 
## Call:
## lm(formula = bwght ~ cigs + motheduc + fatheduc + parity, data = read_data_table)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -96.09 -11.75   0.62  13.05 150.97 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 112.04339    3.61026  31.035  < 2e-16 ***
## cigs         -0.43982    0.09770  -4.502 7.41e-06 ***
## motheduc      0.07487    0.27750   0.270  0.78736    
## fatheduc      0.39266    0.12558   3.127  0.00181 ** 
## parity        1.43842    0.66187   2.173  0.02996 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.18 on 1189 degrees of freedom
## Multiple R-squared:  0.03622,    Adjusted R-squared:  0.03298 
## F-statistic: 11.17 on 4 and 1189 DF,  p-value: 6.71e-09

confidence_int_mul_reg <- confint(mul_reg_model, level = .95)

2) Dummy Variable Regression

Adding the dummy variables barely made a change (about a -.02 “increase”) in cigs coffeicent. But both variables show that they are statistically signifcant in terms of baby weight (white babies are 5.5 oz heavier and males are 3.37 oz heavier on average).Furthermore the R squared has jumped by approximately .01 which means its explains .1% more of the variations than earlier.

mul_reg_model_with_dummy <- lm(bwght ~ cigs + motheduc + fatheduc + parity + white + male, data = read_data_table)
mul_reg_model_with_dummy

## 
## Call:
## lm(formula = bwght ~ cigs + motheduc + fatheduc + parity + white + 
##     male, data = read_data_table)
## 
## Coefficients:
## (Intercept)         cigs     motheduc     fatheduc       parity  
##    107.1142      -0.4587       0.1186       0.2255       1.6154  
##       white         male  
##      5.4688       3.3778

summary(mul_reg_model_with_dummy)

## 
## Call:
## lm(formula = bwght ~ cigs + motheduc + fatheduc + parity + white + 
##     male, data = read_data_table)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -93.237 -11.938   0.521  12.801 151.984 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 107.11421    3.74089  28.633  < 2e-16 ***
## cigs         -0.45872    0.09705  -4.727 2.56e-06 ***
## motheduc      0.11858    0.27569   0.430  0.66717    
## fatheduc      0.22551    0.13323   1.693  0.09079 .  
## parity        1.61540    0.65851   2.453  0.01431 *  
## white         5.46877    1.51523   3.609  0.00032 ***
## male          3.37784    1.15974   2.913  0.00365 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.02 on 1187 degrees of freedom
## Multiple R-squared:  0.05305,    Adjusted R-squared:  0.04826 
## F-statistic: 11.08 on 6 and 1187 DF,  p-value: 4.715e-12

confidence_int_dummy_var <- confint(mul_reg_model_with_dummy, level = .95)

3) Creating a new variable + regression

Here the R squared has increased by .001 which is not very sginifcant of an increase. A one-unit increase is csissq leads to a .0123 increase in birth weight, but this is a flawed assumption. More smoking should not help in baby weight. This is why the cigssq is not statistically signifcant (no stars).

cigssq <- sapply(read_data_table$cigs, function(x) x^2)
new_read_data_table <- read_data_table %>% mutate(cigssq = cigssq)
head(new_read_data_table)

##   faminc cigtax cigprice bwght fatheduc motheduc parity male white cigs
## 1   13.5    2.0    107.6   143       12       12      1    1     1    0
## 2   27.5   21.0    127.7   135       12       12      1    0     1    0
## 3   32.5   25.0    133.0   138       11       11      2    0     1    0
## 4   65.0   18.0    122.8   114       18       16      1    0     1    0
## 5    8.5    2.5    109.4   160       12       13      1    1     1    0
## 6   22.5    2.0    107.6    94        8        8      2    0     1   30
##     lbwght bwghtlbs packs  lfaminc state cigssq
## 1 4.962845   8.9375   0.0 2.602690     2      0
## 2 4.905275   8.4375   0.0 3.314186    19      0
## 3 4.927254   8.6250   0.0 3.481240    25      0
## 4 4.736198   7.1250   0.0 4.174387    13      0
## 5 5.075174  10.0000   0.0 2.140066     3      0
## 6 4.543295   5.8750   1.5 3.113515     2    900

mul_reg_model_with_dummy_and_cigsq <- lm(bwght ~ cigs + motheduc + fatheduc + parity + white + male + cigssq, data = new_read_data_table)
mul_reg_model_with_dummy_and_cigsq

## 
## Call:
## lm(formula = bwght ~ cigs + motheduc + fatheduc + parity + white + 
##     male + cigssq, data = new_read_data_table)
## 
## Coefficients:
## (Intercept)         cigs     motheduc     fatheduc       parity  
##   107.76874     -0.79336      0.08447      0.22462      1.55634  
##       white         male       cigssq  
##     5.62268      3.28062      0.01268

summary(mul_reg_model_with_dummy_and_cigsq)

## 
## Call:
## lm(formula = bwght ~ cigs + motheduc + fatheduc + parity + white + 
##     male + cigssq, data = new_read_data_table)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -93.315 -12.069   0.688  12.829 151.725 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 107.768738   3.758583  28.673  < 2e-16 ***
## cigs         -0.793361   0.222696  -3.563 0.000382 ***
## motheduc      0.084471   0.276239   0.306 0.759819    
## fatheduc      0.224618   0.133134   1.687 0.091837 .  
## parity        1.556338   0.658967   2.362 0.018348 *  
## white         5.622675   1.516898   3.707 0.000220 ***
## male          3.280621   1.160335   2.827 0.004773 ** 
## cigssq        0.012679   0.007596   1.669 0.095328 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.01 on 1186 degrees of freedom
## Multiple R-squared:  0.05527,    Adjusted R-squared:  0.04969 
## F-statistic: 9.912 on 7 and 1186 DF,  p-value: 4.565e-12

confidence_int_cissq_var <- confint(mul_reg_model_with_dummy_and_cigsq, level = .95)

4) Rerun w/ lbwght

This model uses percentage change in birth weight rather than unit change. The R squared falls by .01 which shows that this model explains approximately 1% less than the initial regression. The coefficent of the cigs variable has fallen drastically from approximately -.5 to -.003 (near positive). This shows that the effect of cigarette smoking is much less on log basis then on actual basis. All other independent variables show this trend.

mul_reg_model_with_lbwght <- lm(lbwght ~ cigs + motheduc + fatheduc + parity, data = new_read_data_table)

summary(mul_reg_model_with_lbwght)

## 
## Call:
## lm(formula = lbwght ~ cigs + motheduc + fatheduc + parity, data = new_read_data_table)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.62687 -0.08691  0.02168  0.12039  0.83153 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.6994355  0.0341584 137.578  < 2e-16 ***
## cigs        -0.0038322  0.0009244  -4.145 3.63e-05 ***
## motheduc     0.0006611  0.0026256   0.252  0.80126    
## fatheduc     0.0034525  0.0011882   2.906  0.00373 ** 
## parity       0.0135633  0.0062623   2.166  0.03052 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.191 on 1189 degrees of freedom
## Multiple R-squared:  0.03151,    Adjusted R-squared:  0.02825 
## F-statistic:  9.67 on 4 and 1189 DF,  p-value: 1.07e-07

confidence_int_with_lbwght <- confint(mul_reg_model_with_lbwght, level = .95)

5) State Fixed Effects/Data Wrangling

The multiple R-squared increases by approximately double (I am suprised the R-squared is the same however). Nonetheless, factoring by states (or fixed effects) makes no difference as they are not statistically signifcant (none of the 44). States, thus don’t have a huge effect.

mul_reg_model_with_fixed_effects <- lm(bwght ~ cigs + parity + motheduc + fatheduc + factor(state), data = new_read_data_table)

summary(mul_reg_model_with_fixed_effects)

## 
## Call:
## lm(formula = bwght ~ cigs + parity + motheduc + fatheduc + factor(state), 
##     data = new_read_data_table)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -96.042 -11.164   0.097  12.866 147.984 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     109.05895    6.18619  17.629  < 2e-16 ***
## cigs             -0.42440    0.09976  -4.254 2.27e-05 ***
## parity            1.42823    0.67958   2.102  0.03580 *  
## motheduc          0.09445    0.28453   0.332  0.74000    
## fatheduc          0.35950    0.12738   2.822  0.00485 ** 
## factor(state)2    3.11073    6.53010   0.476  0.63390    
## factor(state)3    4.00046    6.48031   0.617  0.53714    
## factor(state)4   -0.74617    8.15419  -0.092  0.92711    
## factor(state)5    3.98982   12.71616   0.314  0.75376    
## factor(state)6    2.96769    7.26867   0.408  0.68314    
## factor(state)7   -0.28961    6.94358  -0.042  0.96674    
## factor(state)8    5.78445    6.58284   0.879  0.37974    
## factor(state)9    1.19472    5.57102   0.214  0.83023    
## factor(state)10   4.43278    6.41818   0.691  0.48992    
## factor(state)11   3.93176    6.42487   0.612  0.54069    
## factor(state)12   1.67577    7.03942   0.238  0.81188    
## factor(state)13   6.74629    5.66554   1.191  0.23399    
## factor(state)14   5.16202    8.43880   0.612  0.54086    
## factor(state)15 -20.47692   15.14547  -1.352  0.17664    
## factor(state)16   0.31180    6.06772   0.051  0.95903    
## factor(state)17   3.10737    5.32633   0.583  0.55974    
## factor(state)18   5.66794   11.29437   0.502  0.61588    
## factor(state)19   4.66666    6.37199   0.732  0.46409    
## factor(state)20  -0.10199   12.74940  -0.008  0.99362    
## factor(state)21   3.08353    9.15953   0.337  0.73644    
## factor(state)22  -3.95676    6.22016  -0.636  0.52483    
## factor(state)23  -3.27744    7.26005  -0.451  0.65176    
## factor(state)24   6.17358    5.93484   1.040  0.29845    
## factor(state)25  -5.54199    5.76594  -0.961  0.33667    
## factor(state)26   2.83710    6.52938   0.435  0.66400    
## factor(state)27   7.14660    9.16498   0.780  0.43569    
## factor(state)28  -4.61122   11.31596  -0.407  0.68372    
## factor(state)29   1.53428    6.72445   0.228  0.81956    
## factor(state)30  -2.14960    7.54433  -0.285  0.77575    
## factor(state)31   1.13434    5.60685   0.202  0.83971    
## factor(state)32   5.00373    5.39653   0.927  0.35401    
## factor(state)33   6.29543    5.96010   1.056  0.29107    
## factor(state)34   4.85738    6.19588   0.784  0.43322    
## factor(state)35   7.68724    6.41984   1.197  0.23139    
## factor(state)36   6.90507    7.39217   0.934  0.35045    
## factor(state)37  15.53786   15.14431   1.026  0.30511    
## factor(state)38  -1.17287    9.68775  -0.121  0.90366    
## factor(state)39   6.71025    7.55426   0.888  0.37458    
## factor(state)40  14.59239   10.34584   1.410  0.15868    
## factor(state)41   1.49220    6.05090   0.247  0.80526    
## factor(state)42 -19.45726   20.81176  -0.935  0.35003    
## factor(state)43   1.92450    7.40086   0.260  0.79488    
## factor(state)44  11.41853    5.98108   1.909  0.05650 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.19 on 1146 degrees of freedom
## Multiple R-squared:  0.07057,    Adjusted R-squared:  0.03245 
## F-statistic: 1.851 on 47 and 1146 DF,  p-value: 0.000508

Instrumental Variables

1) Motheredu regression

Motheredu is statistically signifcant and attributes a -.54 decrease in birth weight. It is possible that poorer women tend to have worse lifestyles that contribute to baby weight. Thus, this is a good instrumental variable to use.

first_stage <- lm(cigs ~ motheduc, data = new_read_data_table)
summary(first_stage)

## 
## Call:
## lm(formula = cigs ~ motheduc, data = new_read_data_table)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.168 -2.647 -2.103 -0.473 47.353 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.16814    0.94595   9.692   <2e-16 ***
## motheduc    -0.54347    0.07189  -7.560    8e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.023 on 1192 degrees of freedom
## Multiple R-squared:  0.04576,    Adjusted R-squared:  0.04496 
## F-statistic: 57.16 on 1 and 1192 DF,  p-value: 8.003e-14

2)TSLS

The coeffeicent of cigs has gone up signficantly to -1.19 which is nearly double what our earlier regressions indicated. Thus, there must be an effect of education on smoking…but is there?

tsls_reg <- tsls(formula = bwght ~ cigs, ~ motheduc, data = read_data_table)
summary(tsls_reg)

## 
##  2SLS Estimates
## 
## Model Formula: bwght ~ cigs
## 
## Instruments: ~motheduc
## 
## Residuals:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -98.4200 -11.5100   0.5775   0.0000  12.5800 149.6000 
## 
##                Estimate  Std. Error   t value Pr(>|t|)    
## (Intercept) 121.4225328   1.1447154 106.07225  < 2e-16 ***
## cigs         -1.1936195   0.4556352  -2.61968 0.008913 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.746851 on 1192 degrees of freedom

3) Explanation

However, this is a poor instrumental variable since we know that education has little to do with smoking. Both the illiterate and the highly successful smoke. Thus, correlation does not mean causation and thus we can’t use education as an IV.

Replication Excercise

Swarup Sahu

April 12, 2016