##Question 1 - Set WD and Load Data
library(car)
setwd("/Users/jaredmaslin/Documents/W271_Assignments")
load("birthweight_w271.rdata")
##There are 1388 observations and 14 variables in the dataset.
##Question 2 - Initial examination of the data
desc
## variable label
## 1 faminc 1988 family income, $1000s
## 2 cigtax cig. tax in home state, 1988
## 3 cigprice cig. price in home state, 1988
## 4 bwght birth weight, ounces
## 5 fatheduc father's yrs of educ
## 6 motheduc mother's yrs of educ
## 7 parity birth order of child
## 8 male =1 if male child
## 9 white =1 if white
## 10 cigs cigs smked per day while preg
## 11 lbwght log of bwght
## 12 bwghtlbs birth weight, pounds
## 13 packs packs smked per day while preg
## 14 lfaminc log(faminc)
str(data);
## 'data.frame': 1388 obs. of 14 variables:
## $ faminc : num 13.5 7.5 0.5 15.5 27.5 7.5 65 27.5 27.5 37.5 ...
## $ cigtax : num 16.5 16.5 16.5 16.5 16.5 16.5 16.5 16.5 16.5 16.5 ...
## $ cigprice: num 122 122 122 122 122 ...
## $ bwght : num 109 133 129 126 134 118 140 86 121 129 ...
## $ fatheduc: int 12 6 NA 12 14 12 16 12 12 16 ...
## $ motheduc: int 12 12 12 12 12 14 14 14 17 18 ...
## $ parity : int 1 2 2 2 2 6 2 2 2 2 ...
## $ male : int 1 1 0 1 1 1 0 0 0 0 ...
## $ white : int 1 0 0 0 1 0 1 0 1 1 ...
## $ cigs : int 0 0 0 0 0 0 0 0 0 0 ...
## $ lbwght : num 4.69 4.89 4.86 4.84 4.9 ...
## $ bwghtlbs: num 6.81 8.31 8.06 7.88 8.38 ...
## $ packs : num 0 0 0 0 0 0 0 0 0 0 ...
## $ lfaminc : num 2.603 2.015 -0.693 2.741 3.314 ...
## - attr(*, "datalabel")= chr ""
## - attr(*, "time.stamp")= chr "25 Jun 2011 23:03"
## - attr(*, "formats")= chr "%9.0g" "%9.0g" "%9.0g" "%8.0g" ...
## - attr(*, "types")= int 254 254 254 252 251 251 251 251 251 251 ...
## - attr(*, "val.labels")= chr "" "" "" "" ...
## - attr(*, "var.labels")= chr "1988 family income, $1000s" "cig. tax in home state, 1988" "cig. price in home state, 1988" "birth weight, ounces" ...
## - attr(*, "version")= int 10
summary(data);
## faminc cigtax cigprice bwght
## Min. : 0.50 Min. : 2.00 Min. :103.8 Min. : 0.0
## 1st Qu.:14.50 1st Qu.:15.00 1st Qu.:122.8 1st Qu.:106.0
## Median :27.50 Median :20.00 Median :130.8 Median :119.0
## Mean :29.03 Mean :19.55 Mean :130.6 Mean :117.9
## 3rd Qu.:37.50 3rd Qu.:26.00 3rd Qu.:137.0 3rd Qu.:132.0
## Max. :65.00 Max. :38.00 Max. :152.5 Max. :271.0
##
## fatheduc motheduc parity male
## Min. : 1.00 Min. : 2.00 Min. :1.000 Min. :0.0000
## 1st Qu.:12.00 1st Qu.:12.00 1st Qu.:1.000 1st Qu.:0.0000
## Median :12.00 Median :12.00 Median :1.000 Median :1.0000
## Mean :13.19 Mean :12.94 Mean :1.633 Mean :0.5209
## 3rd Qu.:16.00 3rd Qu.:14.00 3rd Qu.:2.000 3rd Qu.:1.0000
## Max. :18.00 Max. :18.00 Max. :6.000 Max. :1.0000
## NA's :196 NA's :1
## white cigs lbwght bwghtlbs
## Min. :0.0000 Min. : 0.000 Min. :0.000 Min. : 0.000
## 1st Qu.:1.0000 1st Qu.: 0.000 1st Qu.:4.663 1st Qu.: 6.625
## Median :1.0000 Median : 0.000 Median :4.779 Median : 7.438
## Mean :0.7846 Mean : 2.087 Mean :4.726 Mean : 7.366
## 3rd Qu.:1.0000 3rd Qu.: 0.000 3rd Qu.:4.883 3rd Qu.: 8.250
## Max. :1.0000 Max. :50.000 Max. :5.602 Max. :16.938
##
## packs lfaminc
## Min. :0.0000 Min. :-0.6931
## 1st Qu.:0.0000 1st Qu.: 2.6741
## Median :0.0000 Median : 3.3142
## Mean :0.1044 Mean : 3.0713
## 3rd Qu.:0.0000 3rd Qu.: 3.6243
## Max. :2.5000 Max. : 4.1744
##
##Question 3 - Diving into "bwght"
summary(data$bwght)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 106.0 119.0 117.9 132.0 271.0
quantile(data$bwght, probs = c(0.01,0.05,0.10,0.25,0.50,0.75,0.90,0.95,0.99), na.rm = TRUE)
## 1% 5% 10% 25% 50% 75% 90% 95% 99%
## 42.35 83.00 93.00 106.00 119.00 132.00 143.00 149.00 160.13
hist(data$bwght, main='Histogram of Birthweight Observations', xlab='Birthweight ("bwght")', ylab='Quantity of Births', breaks=4)

hist(data$bwght, main='Histogram of Birthweight Observations', xlab='Birthweight ("bwght")', ylab='Quantity of Births', breaks=10)

hist(data$bwght, main='Histogram of Birthweight Observations', xlab='Birthweight ("bwght")', ylab='Quantity of Births', breaks=16)

##(Part 4) Have you noticed anything "strange" with the bwght variable and the shape of histogram this variable? If so, please elaborate on your observations and investigate any issues you have identified.
##One item of note is that there appear to be observations for which the birthweight was 0, which of course, isn't a measurement considered either intuitive or even possible. It may be appropriate to account for this in our analysis, looking only at observations for which the birthweight is greater than 0.
##Question 4 - Diving into "cigs"
summary(data$cigs)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 2.087 0.000 50.000
quantile(data$cigs, probs = c(0.01,0.05,0.10,0.25,0.50,0.75,0.90,0.95,0.99), na.rm = TRUE)
## 1% 5% 10% 25% 50% 75% 90% 95% 99%
## 0 0 0 0 0 0 10 20 20
hist(data$cigs, main='Histogram of Cigarette Usage in Observations', xlab='Quantity of Cigarettes ("cigs")', ylab='Quantity of Users', breaks=4)

hist(data$cigs, main='Histogram of Cigarette Usage in Observations', xlab='Quantity of Cigarettes ("cigs")', ylab='Quantity of Users', breaks=10)

hist(data$cigs, main='Histogram of Cigarette Usage in Observations', xlab='Quantity of Cigarettes ("cigs")', ylab='Quantity of Users', breaks=16)

##In examining the "cigs" variable in the same manner as before (with "bwght"), we see that there exists a substantial positive skew in the distribution, with the majority of observations having no cigarette usage in birth mothers.
##Question 5 - Scatterplot of Birthweight vs. Cigarette Usage
scatterplot(data$bwght, data$cigs, main="Birthweight vs. Cigarette Usage", xlab='Birthweight ("bwght")', ylab='Quantity of Cigarettes Used')
## Warning in smoother(.x, .y, col = col[2], log.x = logged("x"), log.y =
## logged("y"), : could not fit smooth

##From looking at the scatterplot only, it doesn't appear to explain an overwhelming amount of the variation in bwght. Our regression line, while negative in slope, possesses a very gradual slope at that. Additional analysis will help us identify whether additional regressors (or substitutes) can offer a more powerful lens into the variation of bwght.
##Question 6 - Linear Regression
##Standard Model
Q6_Model1 = lm(bwght~cigs, data=data);
summary(Q6_Model1)
##
## Call:
## lm(formula = bwght ~ cigs, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -118.821 -11.179 1.179 14.179 152.179
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 118.8211 0.6401 185.638 < 2e-16 ***
## cigs -0.4642 0.1012 -4.587 4.91e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.51 on 1386 degrees of freedom
## Multiple R-squared: 0.01495, Adjusted R-squared: 0.01424
## F-statistic: 21.04 on 1 and 1386 DF, p-value: 4.911e-06
##The resulting coeffcient estimates and associated standard errors (as shown in the output above) include:
##Intercept -- coefficient of 118.8211 and s.e. of 0.6401
##"cigs" -- coefficient -0.4642 and s.e. of 0.1012.
##The results suggest to us that a reduction of 1 cigarette consumed by birth mothers will reduce the resulting birthweight of their child by approximately 0.4642 lbs. The p-values for our model are both statistically significant at a 1% level (being less than 0.01). We note, though, that this model allows for birthweights of 0 lbs. We should adjust for this abnormality and re-test below.
##Adjust for the occurrence of observations with 0.0 birthweight
Q6_Model2 = lm(bwght~cigs, data[data$bwght > 0, ]);
summary(Q6_Model2)
##
## Call:
## lm(formula = bwght ~ cigs, data = data[data$bwght > 0, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -96.790 -11.790 0.357 13.210 151.210
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 119.78960 0.57595 207.987 < 2e-16 ***
## cigs -0.51470 0.09073 -5.673 1.71e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.17 on 1376 degrees of freedom
## Multiple R-squared: 0.02285, Adjusted R-squared: 0.02214
## F-statistic: 32.18 on 1 and 1376 DF, p-value: 1.711e-08
##Now, we have adjusted the model to exclude observations with a birthweight of 0 lbs. The resulting coeffcient estimates and associated standard errors (as shown in the output above) include:
##Intercept -- coefficient of 119.7896 and s.e. of 0.57595
##"cigs" -- coefficient -0.51470 and s.e. of 0.09073.
##The p-values corresponding to our model are both statistically significant at a 1% level (being less than 0.01). The results of our "cigs" regressor also suggest to us that a reduction of 1 cigarette consumed by birth mothers will reduce the resulting birthweight of their child by approximately 0.5147 lbs., which is a larger (more negative) coefficient value than when we included observations with a birthweight of zero. In doing so, we also saw our Adjusted R-squared value increase from 0.01424 originally to 0.02214 now, which suggests that more variation in birthweight may be explained by "cigs" once zero-value birthweights are excluded.
##Question 7 - Scatterplot introducing "faminc", shown as a matrix
hist(data$cigs, main='Histogram of Family Income in Observations', xlab='Family Income (in thousands of dollars)', ylab='Quantity of Families', breaks=4)

##Similar to "cigs", the variable "faminc" (for family income) shows a positively skewed distribution, having a large percentage of observations under $5,000. To that end, we note that the dataset stems from the 1988 National Health Interview Survey, so we should not just to comparing family income in the same lens as in present day.
hist(data$cigs, main='Histogram of Family Income in Observations', xlab='Family Income (in thousands of dollars)', ylab='Quantity of Families', breaks=10)

hist(data$cigs, main='Histogram of Family Income in Observations', xlab='Family Income (in thousands of dollars)', ylab='Quantity of Families', breaks=16)

scatterplotMatrix(~bwght+cigs+faminc, data=data, main='Scatterplot Matrix: "bwght", "cigs", "faminc"')
## Warning in smoother(x, y, col = col[2], log.x = FALSE, log.y = FALSE,
## spread = spread, : could not fit smooth
## Warning in smoother(x, y, col = col[2], log.x = FALSE, log.y = FALSE,
## spread = spread, : could not fit smooth
## Warning in smoother(x, y, col = col[2], log.x = FALSE, log.y = FALSE,
## spread = spread, : could not fit smooth
## Warning in smoother(x, y, col = col[2], log.x = FALSE, log.y = FALSE,
## spread = spread, : could not fit smooth

##Question 8 - Regression of birthweight using cigarette usage and family income
Q8_Model = lm(bwght~cigs+faminc, data[data$bwght > 0, ])
summary(Q8_Model)
##
## Call:
## lm(formula = bwght ~ cigs + faminc, data = data[data$bwght >
## 0, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -96.075 -11.592 0.722 13.262 150.062
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 116.97933 1.05363 111.025 < 2e-16 ***
## cigs -0.46407 0.09182 -5.054 4.91e-07 ***
## faminc 0.09314 0.02928 3.181 0.0015 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.11 on 1375 degrees of freedom
## Multiple R-squared: 0.02999, Adjusted R-squared: 0.02858
## F-statistic: 21.25 on 2 and 1375 DF, p-value: 8.109e-10
##The resulting coeffcient estimates and associated standard errors for our model, which now includes both "cigs" and "faminc" as regressors, (as shown in the output above) include the following:
##Intercept -- coefficient of 116.97933 and s.e. of 1.05363
##"cigs" -- coefficient of -0.46407 and s.e. of 0.09182
##"faminc" -- coefficient of 0.09314 and s.e. of 0.02928
##Our resulting p-values for all three items are statistically significant at a 1% level and our resulting Adjusted R-squared value 0.02858, suggesting that more variation in birthweight is being explained in our new model than those done previously.
##Question 9 - Explain, in your own words, what the coeffcient on cigs in the multiple regression means, and how it is different than the coeffcient on cigs in the simple regression? Please provide the intuition to explain the difference, if any.
## The coefficient on "cigs" in the multiple regression represents the relationship between changes in cigarette usage and resulting birthweight, holding all other variables constant except for the additional variable "faminc". In our simple regression, though, we only look into "cigs" and not "faminc". As such, our simple regression points to a relationship between "cigs" and "bwght", with all else being held equal (an assumption for which "faminc" is held equal, as well).
##Question 10 - Which cofficient for cigs is more negative than the other? Suggest an explanation for why this is so.
##From the regressions considered in previous steps, the coefficient on "cigs" was most negative (at -0.51470) in our second model, which was our simple regression from Question 6 that excluded observations with birthweight entries of 0 lbs.
##A potential explanation for this would be that the two variables, being "cigs" and "faminc" have a negative correlation, where the correlation between "cigs" and "bwght" is negative, but the correlation between "faminc" and "bwght" is positive. If this is the case, our model excluding "faminc" could reflect a effect of higher magnitude than when "faminc" is also included.