##Question 1 - Set WD and Load Data
library(car)
setwd("/Users/jaredmaslin/Documents/W271_Assignments")
load("birthweight_w271.rdata")
##There are 1388 observations and 14 variables in the dataset.
##Question 2 - Initial examination of the data
desc
##    variable                          label
## 1    faminc     1988 family income, $1000s
## 2    cigtax   cig. tax in home state, 1988
## 3  cigprice cig. price in home state, 1988
## 4     bwght           birth weight, ounces
## 5  fatheduc           father's yrs of educ
## 6  motheduc           mother's yrs of educ
## 7    parity           birth order of child
## 8      male               =1 if male child
## 9     white                    =1 if white
## 10     cigs  cigs smked per day while preg
## 11   lbwght                   log of bwght
## 12 bwghtlbs           birth weight, pounds
## 13    packs packs smked per day while preg
## 14  lfaminc                    log(faminc)
str(data);
## 'data.frame':    1388 obs. of  14 variables:
##  $ faminc  : num  13.5 7.5 0.5 15.5 27.5 7.5 65 27.5 27.5 37.5 ...
##  $ cigtax  : num  16.5 16.5 16.5 16.5 16.5 16.5 16.5 16.5 16.5 16.5 ...
##  $ cigprice: num  122 122 122 122 122 ...
##  $ bwght   : num  109 133 129 126 134 118 140 86 121 129 ...
##  $ fatheduc: int  12 6 NA 12 14 12 16 12 12 16 ...
##  $ motheduc: int  12 12 12 12 12 14 14 14 17 18 ...
##  $ parity  : int  1 2 2 2 2 6 2 2 2 2 ...
##  $ male    : int  1 1 0 1 1 1 0 0 0 0 ...
##  $ white   : int  1 0 0 0 1 0 1 0 1 1 ...
##  $ cigs    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ lbwght  : num  4.69 4.89 4.86 4.84 4.9 ...
##  $ bwghtlbs: num  6.81 8.31 8.06 7.88 8.38 ...
##  $ packs   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ lfaminc : num  2.603 2.015 -0.693 2.741 3.314 ...
##  - attr(*, "datalabel")= chr ""
##  - attr(*, "time.stamp")= chr "25 Jun 2011 23:03"
##  - attr(*, "formats")= chr  "%9.0g" "%9.0g" "%9.0g" "%8.0g" ...
##  - attr(*, "types")= int  254 254 254 252 251 251 251 251 251 251 ...
##  - attr(*, "val.labels")= chr  "" "" "" "" ...
##  - attr(*, "var.labels")= chr  "1988 family income, $1000s" "cig. tax in home state, 1988" "cig. price in home state, 1988" "birth weight, ounces" ...
##  - attr(*, "version")= int 10
summary(data);
##      faminc          cigtax         cigprice         bwght      
##  Min.   : 0.50   Min.   : 2.00   Min.   :103.8   Min.   :  0.0  
##  1st Qu.:14.50   1st Qu.:15.00   1st Qu.:122.8   1st Qu.:106.0  
##  Median :27.50   Median :20.00   Median :130.8   Median :119.0  
##  Mean   :29.03   Mean   :19.55   Mean   :130.6   Mean   :117.9  
##  3rd Qu.:37.50   3rd Qu.:26.00   3rd Qu.:137.0   3rd Qu.:132.0  
##  Max.   :65.00   Max.   :38.00   Max.   :152.5   Max.   :271.0  
##                                                                 
##     fatheduc        motheduc         parity           male       
##  Min.   : 1.00   Min.   : 2.00   Min.   :1.000   Min.   :0.0000  
##  1st Qu.:12.00   1st Qu.:12.00   1st Qu.:1.000   1st Qu.:0.0000  
##  Median :12.00   Median :12.00   Median :1.000   Median :1.0000  
##  Mean   :13.19   Mean   :12.94   Mean   :1.633   Mean   :0.5209  
##  3rd Qu.:16.00   3rd Qu.:14.00   3rd Qu.:2.000   3rd Qu.:1.0000  
##  Max.   :18.00   Max.   :18.00   Max.   :6.000   Max.   :1.0000  
##  NA's   :196     NA's   :1                                       
##      white             cigs            lbwght         bwghtlbs     
##  Min.   :0.0000   Min.   : 0.000   Min.   :0.000   Min.   : 0.000  
##  1st Qu.:1.0000   1st Qu.: 0.000   1st Qu.:4.663   1st Qu.: 6.625  
##  Median :1.0000   Median : 0.000   Median :4.779   Median : 7.438  
##  Mean   :0.7846   Mean   : 2.087   Mean   :4.726   Mean   : 7.366  
##  3rd Qu.:1.0000   3rd Qu.: 0.000   3rd Qu.:4.883   3rd Qu.: 8.250  
##  Max.   :1.0000   Max.   :50.000   Max.   :5.602   Max.   :16.938  
##                                                                    
##      packs           lfaminc       
##  Min.   :0.0000   Min.   :-0.6931  
##  1st Qu.:0.0000   1st Qu.: 2.6741  
##  Median :0.0000   Median : 3.3142  
##  Mean   :0.1044   Mean   : 3.0713  
##  3rd Qu.:0.0000   3rd Qu.: 3.6243  
##  Max.   :2.5000   Max.   : 4.1744  
## 
##Question 3 - Diving into "bwght"
summary(data$bwght)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   106.0   119.0   117.9   132.0   271.0
quantile(data$bwght, probs = c(0.01,0.05,0.10,0.25,0.50,0.75,0.90,0.95,0.99), na.rm = TRUE)
##     1%     5%    10%    25%    50%    75%    90%    95%    99% 
##  42.35  83.00  93.00 106.00 119.00 132.00 143.00 149.00 160.13
hist(data$bwght, main='Histogram of Birthweight Observations', xlab='Birthweight ("bwght")', ylab='Quantity of Births', breaks=4)

hist(data$bwght, main='Histogram of Birthweight Observations', xlab='Birthweight ("bwght")', ylab='Quantity of Births', breaks=10)

hist(data$bwght, main='Histogram of Birthweight Observations', xlab='Birthweight ("bwght")', ylab='Quantity of Births', breaks=16)

##(Part 4) Have you noticed anything "strange" with the bwght variable and the shape of histogram this variable? If so, please elaborate on your observations and investigate any issues you have identified.

##One item of note is that there appear to be observations for which the birthweight was 0, which of course, isn't a measurement considered either intuitive or even possible. It may be appropriate to account for this in our analysis, looking only at observations for which the birthweight is greater than 0.
##Question 4 - Diving into "cigs"
summary(data$cigs)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   2.087   0.000  50.000
quantile(data$cigs, probs = c(0.01,0.05,0.10,0.25,0.50,0.75,0.90,0.95,0.99), na.rm = TRUE)
##  1%  5% 10% 25% 50% 75% 90% 95% 99% 
##   0   0   0   0   0   0  10  20  20
hist(data$cigs, main='Histogram of Cigarette Usage in Observations', xlab='Quantity of Cigarettes ("cigs")', ylab='Quantity of Users', breaks=4)

hist(data$cigs, main='Histogram of Cigarette Usage in Observations', xlab='Quantity of Cigarettes ("cigs")', ylab='Quantity of Users', breaks=10)

hist(data$cigs, main='Histogram of Cigarette Usage in Observations', xlab='Quantity of Cigarettes ("cigs")', ylab='Quantity of Users', breaks=16)

##In examining the "cigs" variable in the same manner as before (with "bwght"), we see that there exists a substantial positive skew in the distribution, with the majority of observations having no cigarette usage in birth mothers. 
##Question 5 - Scatterplot of Birthweight vs. Cigarette Usage
scatterplot(data$bwght, data$cigs, main="Birthweight vs. Cigarette Usage", xlab='Birthweight ("bwght")', ylab='Quantity of Cigarettes Used')
## Warning in smoother(.x, .y, col = col[2], log.x = logged("x"), log.y =
## logged("y"), : could not fit smooth

##From looking at the scatterplot only, it doesn't appear to explain an overwhelming amount of the variation in bwght. Our regression line, while negative in slope, possesses a very gradual slope at that. Additional analysis will help us identify whether additional regressors (or substitutes) can offer a more powerful lens into the variation of bwght.
##Question 6 - Linear Regression
##Standard Model
Q6_Model1 = lm(bwght~cigs, data=data);
summary(Q6_Model1)
## 
## Call:
## lm(formula = bwght ~ cigs, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -118.821  -11.179    1.179   14.179  152.179 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 118.8211     0.6401 185.638  < 2e-16 ***
## cigs         -0.4642     0.1012  -4.587 4.91e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.51 on 1386 degrees of freedom
## Multiple R-squared:  0.01495,    Adjusted R-squared:  0.01424 
## F-statistic: 21.04 on 1 and 1386 DF,  p-value: 4.911e-06
##The resulting coeffcient estimates and associated standard errors (as shown in the output above) include: 
##Intercept -- coefficient of 118.8211 and s.e. of 0.6401
##"cigs" -- coefficient -0.4642 and s.e. of 0.1012.
##The results suggest to us that a reduction of 1 cigarette consumed by birth mothers will reduce the resulting birthweight of their child by approximately 0.4642 lbs. The p-values for our model are both statistically significant at a 1% level (being less than 0.01). We note, though, that this model allows for birthweights of 0 lbs. We should adjust for this abnormality and re-test below.
##Adjust for the occurrence of observations with 0.0 birthweight
Q6_Model2 = lm(bwght~cigs, data[data$bwght > 0, ]);
summary(Q6_Model2)
## 
## Call:
## lm(formula = bwght ~ cigs, data = data[data$bwght > 0, ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -96.790 -11.790   0.357  13.210 151.210 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 119.78960    0.57595 207.987  < 2e-16 ***
## cigs         -0.51470    0.09073  -5.673 1.71e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.17 on 1376 degrees of freedom
## Multiple R-squared:  0.02285,    Adjusted R-squared:  0.02214 
## F-statistic: 32.18 on 1 and 1376 DF,  p-value: 1.711e-08
##Now, we have adjusted the model to exclude observations with a birthweight of 0 lbs. The resulting coeffcient estimates and associated standard errors (as shown in the output above) include: 
##Intercept -- coefficient of 119.7896 and s.e. of 0.57595
##"cigs" -- coefficient -0.51470 and s.e. of 0.09073.
##The p-values corresponding to our model are both statistically significant at a 1% level (being less than 0.01). The results of our "cigs" regressor also suggest to us that a reduction of 1 cigarette consumed by birth mothers will reduce the resulting birthweight of their child by approximately 0.5147 lbs., which is a larger (more negative) coefficient value than when we included observations with a birthweight of zero. In doing so, we also saw our Adjusted R-squared value increase from 0.01424 originally to 0.02214 now, which suggests that more variation in birthweight may be explained by "cigs" once zero-value birthweights are excluded.
##Question 7 - Scatterplot introducing "faminc", shown as a matrix
hist(data$cigs, main='Histogram of Family Income in Observations', xlab='Family Income (in thousands of dollars)', ylab='Quantity of Families', breaks=4)

##Similar to "cigs", the variable "faminc" (for family income) shows a positively skewed distribution, having a large percentage of observations under $5,000. To that end, we note that the dataset stems from the 1988 National Health Interview Survey, so we should not just to comparing family income in the same lens as in present day.
hist(data$cigs, main='Histogram of Family Income in Observations', xlab='Family Income (in thousands of dollars)', ylab='Quantity of Families', breaks=10)

hist(data$cigs, main='Histogram of Family Income in Observations', xlab='Family Income (in thousands of dollars)', ylab='Quantity of Families', breaks=16)

scatterplotMatrix(~bwght+cigs+faminc, data=data, main='Scatterplot Matrix: "bwght", "cigs", "faminc"')
## Warning in smoother(x, y, col = col[2], log.x = FALSE, log.y = FALSE,
## spread = spread, : could not fit smooth
## Warning in smoother(x, y, col = col[2], log.x = FALSE, log.y = FALSE,
## spread = spread, : could not fit smooth
## Warning in smoother(x, y, col = col[2], log.x = FALSE, log.y = FALSE,
## spread = spread, : could not fit smooth
## Warning in smoother(x, y, col = col[2], log.x = FALSE, log.y = FALSE,
## spread = spread, : could not fit smooth

##Question 8 - Regression of birthweight using cigarette usage and family income
Q8_Model = lm(bwght~cigs+faminc, data[data$bwght > 0, ])
summary(Q8_Model)
## 
## Call:
## lm(formula = bwght ~ cigs + faminc, data = data[data$bwght > 
##     0, ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -96.075 -11.592   0.722  13.262 150.062 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 116.97933    1.05363 111.025  < 2e-16 ***
## cigs         -0.46407    0.09182  -5.054 4.91e-07 ***
## faminc        0.09314    0.02928   3.181   0.0015 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.11 on 1375 degrees of freedom
## Multiple R-squared:  0.02999,    Adjusted R-squared:  0.02858 
## F-statistic: 21.25 on 2 and 1375 DF,  p-value: 8.109e-10
##The resulting coeffcient estimates and associated standard errors for our model, which now includes both "cigs" and "faminc" as regressors, (as shown in the output above) include the following: 
##Intercept -- coefficient of 116.97933 and s.e. of 1.05363
##"cigs" -- coefficient of -0.46407 and s.e. of 0.09182
##"faminc" -- coefficient of 0.09314 and s.e. of 0.02928
##Our resulting p-values for all three items are statistically significant at a 1% level and our resulting Adjusted R-squared value 0.02858, suggesting that more variation in birthweight is being explained in our new model than those done previously.
##Question 9 - Explain, in your own words, what the coeffcient on cigs in the multiple regression means, and how it is different than the coeffcient on cigs in the simple regression? Please provide the intuition to explain the difference, if any.
## The coefficient on "cigs" in the multiple regression represents the relationship between changes in cigarette usage and resulting birthweight, holding all other variables constant except for the additional variable "faminc". In our simple regression, though, we only look into "cigs" and not "faminc". As such, our simple regression points to a relationship between "cigs" and "bwght", with all else being held equal (an assumption for which "faminc" is held equal, as well).
##Question 10 - Which cofficient for cigs is more negative than the other? Suggest an explanation for why this is so.
##From the regressions considered in previous steps, the coefficient on "cigs" was most negative (at -0.51470) in our second model, which was our simple regression from Question 6 that excluded observations with birthweight entries of 0 lbs.
##A potential explanation for this would be that the two variables, being "cigs" and "faminc" have a negative correlation, where the correlation between "cigs" and "bwght" is negative, but the correlation between "faminc" and "bwght" is positive. If this is the case, our model excluding "faminc" could reflect a effect of higher magnitude than when "faminc" is also included.