R Markdown

library(readr)    # read_csv()
library(dplyr)    # %>%, group_by(), summarize(), filter()
library(ggplot2)  # plotting

setwd("C:/Users/16782/Downloads/AdvAnaProject")
df <- read_csv("us_births_2016_2021.csv")
## Rows: 5496 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): State, State Abbreviation, Gender, Education Level of Mother
## dbl (5): Year, Education Level Code, Number of Births, Average Age of Mother...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## 1) Scatter plot

ggplot(df, aes(x = `Education Level Code`,
               y = `Average Age of Mother (years)`)) +
  geom_jitter(width = 0.15, color = "steelblue", size = 2, alpha = 0.4) +
  # limits = c(1, 8) drops rows coded -9 and any jittered points that land outside 1 to 8
  scale_x_continuous(breaks = 1:8, limits = c(1,8)) +
  scale_y_continuous(breaks = 20:40) +
  labs(
    title = "Average Maternal Age by Education Level",
    x = "Education Level Code",
    y = "Average Maternal Age (years)"
  ) +
  theme_minimal()
## Warning: Removed 1226 rows containing missing values or values outside the scale range
## (`geom_point()`).

The scatter plot shows how average maternal age changes across education levels. Each point represents one group, defined by a state, a year, the baby's gender, and an education level code, so the graph shows patterns across many groups at once. At the lowest education level, average maternal ages span a wide range. The range narrows as education level increases, and average maternal age rises with education level.
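
The narrowing spread described above can also be checked numerically rather than by eye. The following is a minimal sketch, assuming a small hypothetical data frame `toy` with shortened column names `edu_code` and `avg_age`; neither the values nor the names come from the real data set.

```r
library(dplyr)

# Hypothetical toy data standing in for df; values are illustrative only
toy <- data.frame(
  edu_code = c(1, 1, 1, 2, 2, 2, 8, 8, 8),
  avg_age  = c(24, 29, 35, 24, 25, 26, 33, 34, 35)
)

# Minimum, maximum, and standard deviation of average maternal age per level
spread_by_level <- toy %>%
  group_by(edu_code) %>%
  summarize(
    min_age = min(avg_age, na.rm = TRUE),
    max_age = max(avg_age, na.rm = TRUE),
    sd_age  = sd(avg_age, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(age_range = max_age - min_age)  # width of the spread at each level

spread_by_level
```

Applied to the real data, the same pipeline would use `Education Level Code` and `Average Age of Mother (years)` in place of the toy column names.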

## 2) Calculating the average maternal age at each education level and creating a line plot to communicate the findings

mean_age <- df %>%
  group_by(`Education Level Code`) %>%
  summarize(mean_maternal_age = mean(`Average Age of Mother (years)`, na.rm = TRUE))

mean_age
## # A tibble: 9 × 2
##   `Education Level Code` mean_maternal_age
##                    <dbl>             <dbl>
## 1                     -9              29.4
## 2                      1              29.4
## 3                      2              25.1
## 4                      3              26.5
## 5                      4              28.1
## 6                      5              29.9
## 7                      6              31.2
## 8                      7              32.8
## 9                      8              33.7
ggplot(mean_age, aes(x = `Education Level Code`,
                     y = mean_maternal_age)) +
  geom_line(color = "darkred", linewidth = 1) +
  geom_point(color = "black", size = 3) +
  scale_x_continuous(breaks = 1:8, limits = c(1,8)) +
  labs(
    title = "Mean Maternal Age by Education Level",
    x = "Education Level Code (1 = lowest, 8 = highest)",
    y = "Mean Maternal Age (years)"
  ) +
  theme_minimal()
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).

From level 2 through level 8, mean maternal age rises steadily, from 25.1 to 33.7 years. Level 1 is the exception: its mean (29.4) is higher than the means for levels 2 through 4. Overall, the line plot supports the hypothesis that mothers with higher levels of education tend to have children at older ages, although the trend is not strictly increasing across every level.

## 3) Using linear regression to test for a statistically significant relationship between education level and maternal age

model <- lm(`Average Age of Mother (years)` ~ `Education Level Code`, data = df)
summary(model)
## 
## Call:
## lm(formula = `Average Age of Mother (years)` ~ `Education Level Code`, 
##     data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.2235 -2.0694  0.3077  2.1388  7.7290 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            28.87756    0.04118  701.31   <2e-16 ***
## `Education Level Code`  0.22296    0.00733   30.42   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.572 on 5494 degrees of freedom
## Multiple R-squared:  0.1441, Adjusted R-squared:  0.144 
## F-statistic: 925.3 on 1 and 5494 DF,  p-value: < 2.2e-16
ggplot(df, aes(x = `Education Level Code`,
               y = `Average Age of Mother (years)`)) +
  geom_jitter(width = 0.15, height = 0, alpha = 0.4, color = "steelblue") +
  scale_x_continuous(breaks = 1:8, limits = c(1,8)) +
  scale_y_continuous(breaks = 20:40) +
  geom_smooth(method = "lm", se = TRUE, color = "darkred") +
  labs(
    title = "Regression of Maternal Age on Education Level",
    x = "Education Level Code",
    y = "Average Maternal Age (years)"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 600 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 1225 rows containing missing values or values outside the scale range
## (`geom_point()`).

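The linear regression shows a statistically significant positive relationship between education level code and average maternal age: each one-unit increase in the code is associated with an increase of about 0.22 years in average maternal age. The p-value (< 2e-16) is very small, which means the relationship is highly unlikely to be due to random chance. The R^2 value of 0.144 shows that education level code explains about 14% of the variation in average maternal age. Note that the model was fit to all 5,496 rows (5,494 residual degrees of freedom), so it includes the rows coded -9, whereas the x-axis limits in the plot exclude them.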
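
Because the summary table lists an education code of -9, it may be worth checking how much those rows affect the slope. The following is a minimal sketch under two assumptions that the source does not confirm: that -9 is a placeholder for an unknown education level rather than a real level, and that a hypothetical data frame `toy` with columns `edu_code` and `avg_age` stands in for the real data.

```r
library(dplyr)

# Hypothetical toy data; -9 mimics the placeholder code seen in the summary table
toy <- data.frame(
  edu_code = c(-9, -9, 1, 2, 3, 4, 5, 6, 7, 8),
  avg_age  = c(29, 30, 29, 25, 26.5, 28, 30, 31, 33, 34)
)

# Keep only rows whose code is a real level (1 to 8)
toy_valid <- toy %>% filter(edu_code >= 1, edu_code <= 8)

# Fit the same model on all rows and on valid rows only
fit_all   <- lm(avg_age ~ edu_code, data = toy)
fit_valid <- lm(avg_age ~ edu_code, data = toy_valid)

# Compare the two slope estimates
coef(fit_all)["edu_code"]
coef(fit_valid)["edu_code"]
```

If the two slopes differ noticeably on the real data, the filtered fit is probably the better match for the regression line drawn in the plot, which already omits those rows.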

## 4) Creating a histogram to see how maternal age is distributed across all groups

ggplot(df, aes(x = `Average Age of Mother (years)`)) +
  geom_histogram(binwidth = 1, fill = "steelblue", alpha = 0.8) +
  labs(
    title = "Distribution of Average Maternal Age",
    x = "Average Age of Mother (years)",
    y = "Count of Groups"
  ) +
  theme_minimal()

The distribution is roughly bell-shaped, with the most frequent average maternal ages ranging from the late 20s to the early 30s. This indicates that average maternal ages are concentrated in the middle rather than evenly distributed. There are no very low or very high averages, so the distribution tapers off at the extremes.

## 5) Splitting the data into two groups (male and female babies) and comparing average maternal age between the groups using a t-test

group_male <- df %>% filter(`Gender` == "M")
group_female <- df %>% filter(`Gender` == "F")

t.test(group_male$`Average Age of Mother (years)`,
       group_female$`Average Age of Mother (years)`,
       var.equal = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  group_male$`Average Age of Mother (years)` and group_female$`Average Age of Mother (years)`
## t = -0.11938, df = 5494, p-value = 0.905
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1559793  0.1380720
## sample estimates:
## mean of x mean of y 
##  29.54780  29.55675

The average maternal age is almost identical for male and female babies (29.55 vs. 29.56 years). The test statistic is close to zero (t = -0.12) and the p-value is large (0.905), so there is no statistically significant difference between the two groups. The 95% confidence interval for the difference (-0.16 to 0.14 years) includes zero. The t-test therefore provides no evidence that the gender of the baby is related to average maternal age.
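
The same comparison can be written with the formula interface of `t.test()`, which avoids creating two separate data frames. This is a minimal sketch, assuming a hypothetical data frame `toy` with columns `gender` and `avg_age`; the values are illustrative and not taken from the real data.

```r
# Hypothetical toy data: average maternal age for groups of male and female babies
toy <- data.frame(
  gender  = rep(c("M", "F"), each = 4),
  avg_age = c(25.0, 28.0, 31.0, 34.0,   # male groups
              25.1, 28.1, 30.9, 34.1)   # female groups
)

# Welch two-sample t-test (var.equal = FALSE is the default)
res <- t.test(avg_age ~ gender, data = toy)

res$method    # name of the test that was run
res$p.value   # p-value for the difference in means
res$conf.int  # 95% confidence interval for the difference
```

With the real data, the equivalent call would be `t.test(`Average Age of Mother (years)` ~ Gender, data = df)`.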