#Reading the Data File
data_table <- "C:\\Users\\Swoop\\Documents\\birthweight_sample.csv"
read_data_table <- read.csv(data_table)
head(read_data_table)
## faminc cigtax cigprice bwght fatheduc motheduc parity male white cigs
## 1 13.5 2.0 107.6 143 12 12 1 1 1 0
## 2 27.5 21.0 127.7 135 12 12 1 0 1 0
## 3 32.5 25.0 133.0 138 11 11 2 0 1 0
## 4 65.0 18.0 122.8 114 18 16 1 0 1 0
## 5 8.5 2.5 109.4 160 12 13 1 1 1 0
## 6 22.5 2.0 107.6 94 8 8 2 0 1 30
## lbwght bwghtlbs packs lfaminc state
## 1 4.962845 8.9375 0.0 2.602690 2
## 2 4.905275 8.4375 0.0 3.314186 19
## 3 4.927254 8.6250 0.0 3.481240 25
## 4 4.736198 7.1250 0.0 4.174387 13
## 5 5.075174 10.0000 0.0 2.140066 3
## 6 4.543295 5.8750 1.5 3.113515 2
The fatheredu and the motheredu variables have a few N/A’s or missing entries. This can result in biased entries and at times no entries. Here I have changed the n/a’s to 0 so that they have no effect to the regressions(or I could have deleted the entries, but I choose to do the 0 method).
summary(read_data_table)
## faminc cigtax cigprice bwght
## Min. : 0.5 Min. : 2.0 Min. :103.8 Min. : 23.0
## 1st Qu.:14.5 1st Qu.:13.0 1st Qu.:122.8 1st Qu.:107.0
## Median :27.5 Median :20.0 Median :130.7 Median :120.0
## Mean :29.4 Mean :19.4 Mean :130.4 Mean :118.9
## 3rd Qu.:42.5 3rd Qu.:26.0 3rd Qu.:137.0 3rd Qu.:132.0
## Max. :65.0 Max. :38.0 Max. :152.5 Max. :271.0
## fatheduc motheduc parity male
## Min. :-2.00 Min. : 0.00 Min. :1.000 Min. :0.0000
## 1st Qu.:11.00 1st Qu.:12.00 1st Qu.:1.000 1st Qu.:0.0000
## Median :12.00 Median :12.00 Median :1.000 Median :1.0000
## Mean :11.32 Mean :12.93 Mean :1.637 Mean :0.5142
## 3rd Qu.:15.00 3rd Qu.:14.00 3rd Qu.:2.000 3rd Qu.:1.0000
## Max. :18.00 Max. :18.00 Max. :6.000 Max. :1.0000
## white cigs lbwght bwghtlbs
## Min. :0.0000 Min. : 0.000 Min. :3.135 Min. : 1.438
## 1st Qu.:1.0000 1st Qu.: 0.000 1st Qu.:4.673 1st Qu.: 6.688
## Median :1.0000 Median : 0.000 Median :4.787 Median : 7.500
## Mean :0.7806 Mean : 2.139 Mean :4.761 Mean : 7.429
## 3rd Qu.:1.0000 3rd Qu.: 0.000 3rd Qu.:4.883 3rd Qu.: 8.250
## Max. :1.0000 Max. :50.000 Max. :5.602 Max. :16.938
## packs lfaminc state
## Min. :0.000 Min. :-0.6931 Min. : 1.00
## 1st Qu.:0.000 1st Qu.: 2.6741 1st Qu.:13.00
## Median :0.000 Median : 3.3142 Median :22.00
## Mean :0.107 Mean : 3.0778 Mean :22.29
## 3rd Qu.:0.000 3rd Qu.: 3.7495 3rd Qu.:32.00
## Max. :2.500 Max. : 4.1744 Max. :44.00
histogram_bwght <- read_data_table %>% ggplot(aes(bwght)) + geom_histogram(binwidth = 10)
histogram_cigs <- read_data_table %>% ggplot(aes(cigs)) + geom_histogram(binwidth = 10)
histogram_bwght
histogram_cigs
The coefficent is -.4996 with an SD of .09537 meaning a one-unit increase in cig. smoking leads to approximately .5 oz, decrease in birth weight. Also, the p-value here is less than alpha at .05 (as we shall see in the table in the next R function) therefore the confidence interval is statistically signifcant.
uni_var_reg <- lm(bwght ~ cigs, data = read_data_table)
uni_var_reg
##
## Call:
## lm(formula = bwght ~ cigs, data = read_data_table)
##
## Coefficients:
## (Intercept) cigs
## 119.9381 -0.4996
summary(uni_var_reg)
##
## Call:
## lm(formula = bwght ~ cigs, data = read_data_table)
##
## Residuals:
## Min 1Q Median 3Q Max
## -96.938 -11.814 0.062 13.062 151.062
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 119.93805 0.62191 192.854 < 2e-16 ***
## cigs -0.49962 0.09537 -5.239 1.91e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.3 on 1192 degrees of freedom
## Multiple R-squared: 0.02251, Adjusted R-squared: 0.02169
## F-statistic: 27.45 on 1 and 1192 DF, p-value: 1.909e-07
confint(uni_var_reg, level = .95)
## 2.5 % 97.5 %
## (Intercept) 118.7178860 121.1582169
## cigs -0.6867292 -0.3125138
uni_var_reg
##
## Call:
## lm(formula = bwght ~ cigs, data = read_data_table)
##
## Coefficients:
## (Intercept) cigs
## 119.9381 -0.4996
Reproducible Graph The scatterplot itself shows no correlation, but a best-fit line shows the negative effects of smoking on birth weight.
uni_reg_model <- read_data_table %>% ggplot(aes(x = cigs, y = bwght)) + geom_point() + geom_smooth(method = lm) + labs(x = "cigs", y = "bwght", title = "Scatter Plot of Birth Weight on Maternal Smoking")
uni_reg_model
mul_reg_model <- lm(bwght ~ cigs + motheduc + fatheduc + parity, data = read_data_table)
mul_reg_model
##
## Call:
## lm(formula = bwght ~ cigs + motheduc + fatheduc + parity, data = read_data_table)
##
## Coefficients:
## (Intercept) cigs motheduc fatheduc parity
## 112.04339 -0.43982 0.07487 0.39266 1.43842
summary(mul_reg_model)
##
## Call:
## lm(formula = bwght ~ cigs + motheduc + fatheduc + parity, data = read_data_table)
##
## Residuals:
## Min 1Q Median 3Q Max
## -96.09 -11.75 0.62 13.05 150.97
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 112.04339 3.61026 31.035 < 2e-16 ***
## cigs -0.43982 0.09770 -4.502 7.41e-06 ***
## motheduc 0.07487 0.27750 0.270 0.78736
## fatheduc 0.39266 0.12558 3.127 0.00181 **
## parity 1.43842 0.66187 2.173 0.02996 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.18 on 1189 degrees of freedom
## Multiple R-squared: 0.03622, Adjusted R-squared: 0.03298
## F-statistic: 11.17 on 4 and 1189 DF, p-value: 6.71e-09
confidence_int_mul_reg <- confint(mul_reg_model, level = .95)
Adding the dummy variables barely made a change (about a -.02 “increase”) in cigs coffeicent. But both variables show that they are statistically signifcant in terms of baby weight (white babies are 5.5 oz heavier and males are 3.37 oz heavier on average).Furthermore the R squared has jumped by approximately .01 which means its explains .1% more of the variations than earlier.
mul_reg_model_with_dummy <- lm(bwght ~ cigs + motheduc + fatheduc + parity + white + male, data = read_data_table)
mul_reg_model_with_dummy
##
## Call:
## lm(formula = bwght ~ cigs + motheduc + fatheduc + parity + white +
## male, data = read_data_table)
##
## Coefficients:
## (Intercept) cigs motheduc fatheduc parity
## 107.1142 -0.4587 0.1186 0.2255 1.6154
## white male
## 5.4688 3.3778
summary(mul_reg_model_with_dummy)
##
## Call:
## lm(formula = bwght ~ cigs + motheduc + fatheduc + parity + white +
## male, data = read_data_table)
##
## Residuals:
## Min 1Q Median 3Q Max
## -93.237 -11.938 0.521 12.801 151.984
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 107.11421 3.74089 28.633 < 2e-16 ***
## cigs -0.45872 0.09705 -4.727 2.56e-06 ***
## motheduc 0.11858 0.27569 0.430 0.66717
## fatheduc 0.22551 0.13323 1.693 0.09079 .
## parity 1.61540 0.65851 2.453 0.01431 *
## white 5.46877 1.51523 3.609 0.00032 ***
## male 3.37784 1.15974 2.913 0.00365 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.02 on 1187 degrees of freedom
## Multiple R-squared: 0.05305, Adjusted R-squared: 0.04826
## F-statistic: 11.08 on 6 and 1187 DF, p-value: 4.715e-12
confidence_int_dummy_var <- confint(mul_reg_model_with_dummy, level = .95)
Here the R squared has increased by .001 which is not very sginifcant of an increase. A one-unit increase is csissq leads to a .0123 increase in birth weight, but this is a flawed assumption. More smoking should not help in baby weight. This is why the cigssq is not statistically signifcant (no stars).
cigssq <- sapply(read_data_table$cigs, function(x) x^2)
new_read_data_table <- read_data_table %>% mutate(cigssq = cigssq)
head(new_read_data_table)
## faminc cigtax cigprice bwght fatheduc motheduc parity male white cigs
## 1 13.5 2.0 107.6 143 12 12 1 1 1 0
## 2 27.5 21.0 127.7 135 12 12 1 0 1 0
## 3 32.5 25.0 133.0 138 11 11 2 0 1 0
## 4 65.0 18.0 122.8 114 18 16 1 0 1 0
## 5 8.5 2.5 109.4 160 12 13 1 1 1 0
## 6 22.5 2.0 107.6 94 8 8 2 0 1 30
## lbwght bwghtlbs packs lfaminc state cigssq
## 1 4.962845 8.9375 0.0 2.602690 2 0
## 2 4.905275 8.4375 0.0 3.314186 19 0
## 3 4.927254 8.6250 0.0 3.481240 25 0
## 4 4.736198 7.1250 0.0 4.174387 13 0
## 5 5.075174 10.0000 0.0 2.140066 3 0
## 6 4.543295 5.8750 1.5 3.113515 2 900
mul_reg_model_with_dummy_and_cigsq <- lm(bwght ~ cigs + motheduc + fatheduc + parity + white + male + cigssq, data = new_read_data_table)
mul_reg_model_with_dummy_and_cigsq
##
## Call:
## lm(formula = bwght ~ cigs + motheduc + fatheduc + parity + white +
## male + cigssq, data = new_read_data_table)
##
## Coefficients:
## (Intercept) cigs motheduc fatheduc parity
## 107.76874 -0.79336 0.08447 0.22462 1.55634
## white male cigssq
## 5.62268 3.28062 0.01268
summary(mul_reg_model_with_dummy_and_cigsq)
##
## Call:
## lm(formula = bwght ~ cigs + motheduc + fatheduc + parity + white +
## male + cigssq, data = new_read_data_table)
##
## Residuals:
## Min 1Q Median 3Q Max
## -93.315 -12.069 0.688 12.829 151.725
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 107.768738 3.758583 28.673 < 2e-16 ***
## cigs -0.793361 0.222696 -3.563 0.000382 ***
## motheduc 0.084471 0.276239 0.306 0.759819
## fatheduc 0.224618 0.133134 1.687 0.091837 .
## parity 1.556338 0.658967 2.362 0.018348 *
## white 5.622675 1.516898 3.707 0.000220 ***
## male 3.280621 1.160335 2.827 0.004773 **
## cigssq 0.012679 0.007596 1.669 0.095328 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.01 on 1186 degrees of freedom
## Multiple R-squared: 0.05527, Adjusted R-squared: 0.04969
## F-statistic: 9.912 on 7 and 1186 DF, p-value: 4.565e-12
confidence_int_cissq_var <- confint(mul_reg_model_with_dummy_and_cigsq, level = .95)
This model uses percentage change in birth weight rather than unit change. The R squared falls by .01 which shows that this model explains approximately 1% less than the initial regression. The coefficent of the cigs variable has fallen drastically from approximately -.5 to -.003 (near positive). This shows that the effect of cigarette smoking is much less on log basis then on actual basis. All other independent variables show this trend.
mul_reg_model_with_lbwght <- lm(lbwght ~ cigs + motheduc + fatheduc + parity, data = new_read_data_table)
summary(mul_reg_model_with_lbwght)
##
## Call:
## lm(formula = lbwght ~ cigs + motheduc + fatheduc + parity, data = new_read_data_table)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.62687 -0.08691 0.02168 0.12039 0.83153
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.6994355 0.0341584 137.578 < 2e-16 ***
## cigs -0.0038322 0.0009244 -4.145 3.63e-05 ***
## motheduc 0.0006611 0.0026256 0.252 0.80126
## fatheduc 0.0034525 0.0011882 2.906 0.00373 **
## parity 0.0135633 0.0062623 2.166 0.03052 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.191 on 1189 degrees of freedom
## Multiple R-squared: 0.03151, Adjusted R-squared: 0.02825
## F-statistic: 9.67 on 4 and 1189 DF, p-value: 1.07e-07
confidence_int_with_lbwght <- confint(mul_reg_model_with_lbwght, level = .95)
The multiple R-squared increases by approximately double (I am suprised the R-squared is the same however). Nonetheless, factoring by states (or fixed effects) makes no difference as they are not statistically signifcant (none of the 44). States, thus don’t have a huge effect.
mul_reg_model_with_fixed_effects <- lm(bwght ~ cigs + parity + motheduc + fatheduc + factor(state), data = new_read_data_table)
summary(mul_reg_model_with_fixed_effects)
##
## Call:
## lm(formula = bwght ~ cigs + parity + motheduc + fatheduc + factor(state),
## data = new_read_data_table)
##
## Residuals:
## Min 1Q Median 3Q Max
## -96.042 -11.164 0.097 12.866 147.984
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 109.05895 6.18619 17.629 < 2e-16 ***
## cigs -0.42440 0.09976 -4.254 2.27e-05 ***
## parity 1.42823 0.67958 2.102 0.03580 *
## motheduc 0.09445 0.28453 0.332 0.74000
## fatheduc 0.35950 0.12738 2.822 0.00485 **
## factor(state)2 3.11073 6.53010 0.476 0.63390
## factor(state)3 4.00046 6.48031 0.617 0.53714
## factor(state)4 -0.74617 8.15419 -0.092 0.92711
## factor(state)5 3.98982 12.71616 0.314 0.75376
## factor(state)6 2.96769 7.26867 0.408 0.68314
## factor(state)7 -0.28961 6.94358 -0.042 0.96674
## factor(state)8 5.78445 6.58284 0.879 0.37974
## factor(state)9 1.19472 5.57102 0.214 0.83023
## factor(state)10 4.43278 6.41818 0.691 0.48992
## factor(state)11 3.93176 6.42487 0.612 0.54069
## factor(state)12 1.67577 7.03942 0.238 0.81188
## factor(state)13 6.74629 5.66554 1.191 0.23399
## factor(state)14 5.16202 8.43880 0.612 0.54086
## factor(state)15 -20.47692 15.14547 -1.352 0.17664
## factor(state)16 0.31180 6.06772 0.051 0.95903
## factor(state)17 3.10737 5.32633 0.583 0.55974
## factor(state)18 5.66794 11.29437 0.502 0.61588
## factor(state)19 4.66666 6.37199 0.732 0.46409
## factor(state)20 -0.10199 12.74940 -0.008 0.99362
## factor(state)21 3.08353 9.15953 0.337 0.73644
## factor(state)22 -3.95676 6.22016 -0.636 0.52483
## factor(state)23 -3.27744 7.26005 -0.451 0.65176
## factor(state)24 6.17358 5.93484 1.040 0.29845
## factor(state)25 -5.54199 5.76594 -0.961 0.33667
## factor(state)26 2.83710 6.52938 0.435 0.66400
## factor(state)27 7.14660 9.16498 0.780 0.43569
## factor(state)28 -4.61122 11.31596 -0.407 0.68372
## factor(state)29 1.53428 6.72445 0.228 0.81956
## factor(state)30 -2.14960 7.54433 -0.285 0.77575
## factor(state)31 1.13434 5.60685 0.202 0.83971
## factor(state)32 5.00373 5.39653 0.927 0.35401
## factor(state)33 6.29543 5.96010 1.056 0.29107
## factor(state)34 4.85738 6.19588 0.784 0.43322
## factor(state)35 7.68724 6.41984 1.197 0.23139
## factor(state)36 6.90507 7.39217 0.934 0.35045
## factor(state)37 15.53786 15.14431 1.026 0.30511
## factor(state)38 -1.17287 9.68775 -0.121 0.90366
## factor(state)39 6.71025 7.55426 0.888 0.37458
## factor(state)40 14.59239 10.34584 1.410 0.15868
## factor(state)41 1.49220 6.05090 0.247 0.80526
## factor(state)42 -19.45726 20.81176 -0.935 0.35003
## factor(state)43 1.92450 7.40086 0.260 0.79488
## factor(state)44 11.41853 5.98108 1.909 0.05650 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.19 on 1146 degrees of freedom
## Multiple R-squared: 0.07057, Adjusted R-squared: 0.03245
## F-statistic: 1.851 on 47 and 1146 DF, p-value: 0.000508
Motheredu is statistically signifcant and attributes a -.54 decrease in birth weight. It is possible that poorer women tend to have worse lifestyles that contribute to baby weight. Thus, this is a good instrumental variable to use.
first_stage <- lm(cigs ~ motheduc, data = new_read_data_table)
summary(first_stage)
##
## Call:
## lm(formula = cigs ~ motheduc, data = new_read_data_table)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.168 -2.647 -2.103 -0.473 47.353
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.16814 0.94595 9.692 <2e-16 ***
## motheduc -0.54347 0.07189 -7.560 8e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.023 on 1192 degrees of freedom
## Multiple R-squared: 0.04576, Adjusted R-squared: 0.04496
## F-statistic: 57.16 on 1 and 1192 DF, p-value: 8.003e-14
The coeffeicent of cigs has gone up signficantly to -1.19 which is nearly double what our earlier regressions indicated. Thus, there must be an effect of education on smoking…but is there?
tsls_reg <- tsls(formula = bwght ~ cigs, ~ motheduc, data = read_data_table)
summary(tsls_reg)
##
## 2SLS Estimates
##
## Model Formula: bwght ~ cigs
##
## Instruments: ~motheduc
##
## Residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -98.4200 -11.5100 0.5775 0.0000 12.5800 149.6000
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 121.4225328 1.1447154 106.07225 < 2e-16 ***
## cigs -1.1936195 0.4556352 -2.61968 0.008913 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.746851 on 1192 degrees of freedom
However, this is a poor instrumental variable since we know that education has little to do with smoking. Both the illiterate and the highly successful smoke. Thus, correlation does not mean causation and thus we can’t use education as an IV.