## Part 1: Data Exploration

# Load the readr library
library(readr)

# Read in the data
bweight_data <- read_csv("bweight_nonecon.csv")
## Rows: 60 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (8): bweight, pairpoll, airpoll, alcohol, stillbirths, age, edu, nprenatal
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# View the first few observations
head(bweight_data)
## # A tibble: 6 × 8
##   bweight pairpoll airpoll alcohol stillbirths   age   edu nprenatal
##     <dbl>    <dbl>   <dbl>   <dbl>       <dbl> <dbl> <dbl>     <dbl>
## 1    3515        0       0       0           0    25    12        13
## 2    3600        0       0       0           0    22    12        13
## 3    3585        0       0       0           1    26    12        11
## 4    3459        0       0       0           0    21    12        11
## 5    3600        0       0       0           0    24     0         4
## 6    3430        0       0       0           0    19    12        15
# View the summary statistics of all variables
summary(bweight_data)
##     bweight        pairpoll       airpoll       alcohol     stillbirths    
##  Min.   :3090   Min.   :0.00   Min.   :0.0   Min.   :0.0   Min.   :0.0000  
##  1st Qu.:3260   1st Qu.:0.00   1st Qu.:0.0   1st Qu.:0.0   1st Qu.:0.0000  
##  Median :3374   Median :0.50   Median :0.5   Median :0.0   Median :0.0000  
##  Mean   :3395   Mean   :1.15   Mean   :0.5   Mean   :0.1   Mean   :0.3333  
##  3rd Qu.:3544   3rd Qu.:2.25   3rd Qu.:1.0   3rd Qu.:0.0   3rd Qu.:1.0000  
##  Max.   :3771   Max.   :3.00   Max.   :1.0   Max.   :1.0   Max.   :1.0000  
##       age             edu          nprenatal     
##  Min.   :14.00   Min.   : 0.00   Min.   : 2.000  
##  1st Qu.:21.00   1st Qu.:12.00   1st Qu.: 8.000  
##  Median :24.00   Median :12.00   Median :10.000  
##  Mean   :24.30   Mean   :11.85   Mean   : 9.667  
##  3rd Qu.:26.25   3rd Qu.:12.00   3rd Qu.:12.000  
##  Max.   :34.00   Max.   :16.00   Max.   :16.000
# Look at the distributions of birth weight and prenatal air pollution category
hist(bweight_data$bweight)

barplot(table(bweight_data$pairpoll))

# Do a scatter plot of birth weight versus prenatal air pollution category
plot(bweight_data$pairpoll, bweight_data$bweight,
     xlab = "Prenatal Air Pollution Category",
     ylab = "Birth Weight (grams)",
     main = "Birth Weight vs. Prenatal Air Pollution Category")

## Part 2: Regression Analysis

The sign of the coefficient 𝛽 in equation (1) gives the direction of the relationship between the independent variable (prenatal air pollution) and the dependent variable (birth weight).

If 𝛽 is positive, higher prenatal air pollution is associated with higher birth weight. If 𝛽 is negative, higher prenatal air pollution is associated with lower birth weight.

Higher levels of air pollution are generally expected to harm birth weight, so it is reasonable to expect the sign of 𝛽 to be negative.

library(lfe)
## Loading required package: Matrix
bweight_reg <- felm(bweight ~ pairpoll, data=bweight_data)
summary(bweight_reg)
## 
## Call:
##    felm(formula = bweight ~ pairpoll, data = bweight_data) 
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -334.55  -68.62    6.61   81.26  261.45 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3509.55      20.79 168.775  < 2e-16 ***
## pairpoll      -99.58      12.11  -8.225 2.57e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 119.6 on 58 degrees of freedom
## Multiple R-squared(full model): 0.5384   Adjusted R-squared: 0.5305 
## Multiple R-squared(proj model): 0.5384   Adjusted R-squared: 0.5305 
## F-statistic(full model):67.66 on 1 and 58 DF, p-value: 2.57e-11 
## F-statistic(proj model): 67.66 on 1 and 58 DF, p-value: 2.57e-11

(ii) The pairpoll coefficient 𝛽 is the estimated change in birth weight associated with a one-unit increase in the prenatal air pollution category. The estimate is -99.58, so each one-category increase in prenatal air pollution is associated with a birth weight that is about 99.58 grams lower on average. The sign is negative, as expected, and the estimate is statistically significant (t = -8.225, p = 2.57e-11).
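To make the magnitude concrete, the fitted line can be evaluated at each pollution category. This is a minimal sketch that hard-codes the two estimates reported in the felm() summary and assumes pairpoll takes the integer values 0 to 3, as the summary statistics suggest; the variable names are illustrative only.

```r
# Estimates copied from the felm() summary above
intercept <- 3509.55
beta      <- -99.58

# Assumption: pairpoll is coded 0, 1, 2, 3 (min 0, max 3 in the summary statistics)
pairpoll_levels <- 0:3

# Fitted birth weight (grams) for each category under the linear model
fitted_bweight <- intercept + beta * pairpoll_levels
names(fitted_bweight) <- paste0("pairpoll=", pairpoll_levels)
print(fitted_bweight)
```

Under these assumptions, moving from the lowest to the highest category corresponds to a fitted difference of three times the slope.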

  1. Whether the coefficient 𝛽 in equation (1) captures the causal effect of pairpoll on bweight depends on several factors. First, it is important to consider omitted variables that might affect bweight and also be correlated with pairpoll. For example, the mother’s health, the mother’s consumption of alcohol during pregnancy, the mother’s education, and the number of prenatal visits to a doctor might all be related to both pairpoll and bweight. If such variables are omitted from equation (1), the estimate of 𝛽 is biased and does not accurately reflect the causal effect of pairpoll on bweight.

The direction of the bias depends on two signs: the effect of the omitted variable on bweight and its correlation with pairpoll. If the two signs are the same (both positive or both negative), the estimate of 𝛽 is biased upward. If the signs differ, the estimate is biased downward. Because the true effect is expected to be negative, an upward bias pushes the estimate toward zero and understates the harm from pollution, while a downward bias makes the estimate more negative and overstates it.

Therefore, to judge whether 𝛽 captures the causal effect of pairpoll on bweight, it is necessary to control for the omitted variables or otherwise account for the biases they induce. Observed confounders can be added as controls. Unobserved ones call for methods such as instrumental variables estimation, fixed effects estimation, or a regression discontinuity design.
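The sign rule for omitted variable bias can be checked on simulated data. The sketch below does not use the birth-weight file; x, z, and y are hypothetical variables, and the coefficients 0.6, -2, and 3 are arbitrary choices made for illustration. It relies on the in-sample identity that the slope from the short regression equals the slope from the long regression plus the omitted variable's coefficient times the slope from regressing the omitted variable on x.

```r
set.seed(1)                        # simulated data, not bweight_nonecon.csv
n <- 500
z <- rnorm(n)                      # hypothetical omitted variable
x <- 0.6 * z + rnorm(n)            # regressor, positively correlated with z
y <- 1 - 2 * x + 3 * z + rnorm(n)  # z has a positive effect on the outcome

short <- lm(y ~ x)                 # omits z
long  <- lm(y ~ x + z)             # controls for z
aux   <- lm(z ~ x)                 # how z moves with x

# Bias term: (effect of z on y) times (slope of z on x)
bias <- unname(coef(long)["z"] * coef(aux)["x"])

# Short slope equals long slope plus the bias term
c(short = unname(coef(short)["x"]),
  long_plus_bias = unname(coef(long)["x"]) + bias)
```

Here both signs are positive, so the bias term is positive and the short regression's slope lies above the long regression's slope, even though the true slope is negative.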

bweight_data <- read.csv("bweight_nonecon.csv")
omitted_factors <- c("alcohol", "stillbirths", "age", "edu", "nprenatal")

# Add one candidate control at a time and check how the pairpoll coefficient moves
for (control in omitted_factors) {
  model_formula <- as.formula(paste("bweight ~ pairpoll +", control))
  model <- lm(model_formula, data = bweight_data)
  print(summary(model))
}
## 
## Call:
## lm(formula = model_formula, data = bweight_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -335.76  -74.99   10.18   84.50  260.24 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3510.76      20.91 167.875  < 2e-16 ***
## pairpoll      -96.90      12.60  -7.692 2.23e-10 ***
## alcohol       -42.99      53.57  -0.802    0.426    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 120 on 57 degrees of freedom
## Multiple R-squared:  0.5436, Adjusted R-squared:  0.5276 
## F-statistic: 33.94 on 2 and 57 DF,  p-value: 1.959e-10
## 
## 
## Call:
## lm(formula = model_formula, data = bweight_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -335.01  -68.26    6.83   81.84  262.46 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3510.008     23.348 150.336  < 2e-16 ***
## pairpoll     -99.553     12.231  -8.139 4.04e-11 ***
## stillbirths   -1.467     33.101  -0.044    0.965    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 120.7 on 57 degrees of freedom
## Multiple R-squared:  0.5384, Adjusted R-squared:  0.5222 
## F-statistic: 33.25 on 2 and 57 DF,  p-value: 2.695e-10
## 
## 
## Call:
## lm(formula = model_formula, data = bweight_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -305.14  -58.73   13.33   79.58  232.98 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3346.781     90.863  36.833  < 2e-16 ***
## pairpoll    -113.466     14.065  -8.067 5.32e-11 ***
## age            7.355      4.001   1.838   0.0712 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 117.3 on 57 degrees of freedom
## Multiple R-squared:  0.5643, Adjusted R-squared:  0.549 
## F-statistic: 36.91 on 2 and 57 DF,  p-value: 5.226e-11
## 
## 
## Call:
## lm(formula = model_formula, data = bweight_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -320.76  -77.13    5.24   78.12  245.12 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3449.110     88.313  39.055  < 2e-16 ***
## pairpoll     -99.759     12.162  -8.202 3.17e-11 ***
## edu            5.118      7.265   0.704    0.484    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 120.2 on 57 degrees of freedom
## Multiple R-squared:  0.5424, Adjusted R-squared:  0.5263 
## F-statistic: 33.78 on 2 and 57 DF,  p-value: 2.107e-10
## 
## 
## Call:
## lm(formula = model_formula, data = bweight_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -331.02  -79.20    7.83   71.06  236.46 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3454.691     47.929  72.079  < 2e-16 ***
## pairpoll     -99.817     12.045  -8.287  2.3e-11 ***
## nprenatal      5.703      4.495   1.269     0.21    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 119 on 57 degrees of freedom
## Multiple R-squared:  0.5511, Adjusted R-squared:  0.5354 
## F-statistic: 34.99 on 2 and 57 DF,  p-value: 1.22e-10
bweight_data <- read.csv("bweight_nonecon.csv")

explanatory_vars <- c("alcohol", "stillbirths", "age", "edu", "nprenatal")

model_formula <- as.formula(paste("bweight ~ pairpoll +", paste(explanatory_vars, collapse = " + ")))
model <- lm(model_formula, data = bweight_data)
summary(model)
## 
## Call:
## lm(formula = model_formula, data = bweight_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -307.91  -70.49   10.48   78.17  230.24 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3317.123    121.547  27.291  < 2e-16 ***
## pairpoll    -110.669     15.512  -7.135 2.74e-09 ***
## alcohol      -27.698     54.702  -0.506    0.615    
## stillbirths  -20.164     35.812  -0.563    0.576    
## age            6.918      4.669   1.482    0.144    
## edu            1.620      7.943   0.204    0.839    
## nprenatal      2.830      5.136   0.551    0.584    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 120.3 on 53 degrees of freedom
## Multiple R-squared:  0.5738, Adjusted R-squared:  0.5256 
## F-statistic: 11.89 on 6 and 53 DF,  p-value: 2.082e-08

Each coefficient is the estimated change in the outcome variable (birth weight in this case) associated with a one-unit increase in that variable, holding the other variables fixed. The intercept term represents the estimated birth weight when all predictor variables are equal to zero.

For the predictor variable “pairpoll,” the coefficient is -110.669, which means that for a one-unit increase in the prenatal air pollution category, the estimated decrease in birth weight is 110.669 grams. This coefficient is statistically significant, as indicated by the low p-value (2.74e-09) in the Pr(>|t|) column.

For the other predictor variables, the coefficients are not statistically significant at conventional levels, as indicated by their high p-values. This means that there is not enough evidence in this sample to conclude that these variables are related to birth weight once pairpoll is included.

The coefficient for “alcohol” is -27.698. Alcohol ranges only from 0 to 1 in the summary statistics, so this is the estimated difference in birth weight, 27.698 grams lower, when alcohol equals 1 rather than 0. However, this coefficient is not statistically significant, as indicated by the high p-value (0.615).

The coefficient for “stillbirths” is -20.164. This variable also ranges only from 0 to 1, so the estimate is the difference in birth weight, 20.164 grams lower, when stillbirths equals 1 rather than 0. This coefficient is also not statistically significant, as indicated by the high p-value (0.576).

The coefficient for “age” is 6.918, which means that for a one-unit increase in the mother’s age, the estimated increase in birth weight is 6.918 grams. This coefficient is also not statistically significant, as indicated by the high p-value (0.144).

The coefficient for “edu” is 1.620, which means that for a one-unit increase in the mother’s education, the estimated increase in birth weight is 1.620 grams. This coefficient is also not statistically significant, as indicated by the high p-value (0.839).

The coefficient for “nprenatal” is 2.830, which means that for a one-unit increase in the number of prenatal visits, the estimated increase in birth weight is 2.830 grams. This coefficient is also not statistically significant, as indicated by the high p-value (0.584).

In this case, the coefficient for “pairpoll” without any controls was -99.58, and with all possible explanatory variables included, it was -110.669. Therefore, the coefficient 𝛽 became about 11.1% larger in magnitude: (110.669 - 99.58) / 99.58 ≈ 0.111.
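The percentage change can be computed directly from the two reported estimates. This is a small sketch with the values copied from the summaries above; the variable names are illustrative, and the change is measured in magnitude relative to the no-controls estimate.

```r
beta_no_controls   <- -99.58     # from felm(bweight ~ pairpoll)
beta_full_controls <- -110.669   # from lm() with all five controls

# Change in magnitude relative to the no-controls estimate, in percent
pct_change <- (abs(beta_full_controls) - abs(beta_no_controls)) /
  abs(beta_no_controls) * 100
round(pct_change, 1)
```

A positive value here means the estimate moved further from zero once the controls were added.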

bweight_data <- read.csv("bweight_nonecon.csv")
model_formula <- as.formula("bweight ~ pairpoll + airpoll")
model <- lm(model_formula, data = bweight_data)
summary(model)
## 
## Call:
## lm(formula = model_formula, data = bweight_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -358.9  -50.1   10.1   66.1  237.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3533.900     19.352 182.611  < 2e-16 ***
## pairpoll      -7.787     24.778  -0.314 0.754465    
## airpoll     -259.823     63.220  -4.110 0.000128 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 106 on 57 degrees of freedom
## Multiple R-squared:  0.6439, Adjusted R-squared:  0.6314 
## F-statistic: 51.54 on 2 and 57 DF,  p-value: 1.654e-13
# Correlation between birth weight and the polluted-area variable
cor_result <- cor(bweight_data$bweight, bweight_data$airpoll)
cor_result

(vi)

The correlation of -0.8020706 between the dependent variable (bweight) and the independent variable airpoll (polluted area) indicates a strong negative linear relationship: observations with higher airpoll tend to have lower bweight. A correlation coefficient close to -1 indicates a strong negative linear relationship, while a value close to +1 indicates a strong positive linear relationship.

(vii)

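# Refit the full model: `model` was last overwritten by the pairpoll + airpoll regression
full_model <- lm(bweight ~ pairpoll + alcohol + stillbirths + age + edu + nprenatal, data = bweight_data)
# Leading 1 multiplies the intercept; the other values are assumed to follow the coefficient order
predictors <- c(1, 2, 1, 1, 18, 10, 0)
predicted_birth_weight <- sum(predictors * coef(full_model))
predicted_birth_weight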
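As a check on the arithmetic, the prediction can be reproduced from the coefficients reported in the full-model summary, without the data file. This sketch assumes the six values in (vii) map, in order, to pairpoll, alcohol, stillbirths, age, edu, and nprenatal; that mapping is an assumption, not something the output confirms. A 1 is added for the intercept.

```r
# Coefficients copied from the full-model summary above
coefs <- c("(Intercept)" = 3317.123, pairpoll = -110.669, alcohol = -27.698,
           stillbirths = -20.164, age = 6.918, edu = 1.620, nprenatal = 2.830)

# Assumed profile: the six values from (vii), mapped in coefficient order
profile <- c("(Intercept)" = 1, pairpoll = 2, alcohol = 1,
             stillbirths = 1, age = 18, edu = 10, nprenatal = 0)

# Align by name so a reordered vector cannot silently mismatch
predicted <- sum(coefs[names(profile)] * profile)
predicted
```

Matching by name guards against the silent recycling that occurs when a vector of predictor values is multiplied by a coefficient vector of a different length.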