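library(readr)    # read_csv()
library(dplyr)    # %>%, group_by(), summarize(), filter()
library(ggplot2)  # ggplot() and related plotting functions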
setwd("C:/Users/16782/Downloads/AdvAnaProject")
df <- read_csv("us_births_2016_2021.csv")
## Rows: 5496 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): State, State Abbreviation, Gender, Education Level of Mother
## dbl (5): Year, Education Level Code, Number of Births, Average Age of Mother...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
##1) Scatter Plot
ggplot(df, aes(x = `Education Level Code`,
y = `Average Age of Mother (years)`)) +
geom_jitter(width = 0.15, color = "steelblue", size = 2, alpha = 0.4) +
scale_x_continuous(breaks = 1:8, limits = c(1,8)) +
scale_y_continuous(breaks = 20:40) +
labs(
title = "Average Maternal Age by Education Level",
x = "Education Level Code",
y = "Average Maternal Age (years)"
) +
theme_minimal()
## Warning: Removed 1226 rows containing missing values or values outside the scale range
## (`geom_point()`).
The scatter plot shows how average maternal age varies across education levels. Each point represents one group, defined by a state, a year, an infant gender, and an education level code, so the graph shows patterns across many groups at once. Maternal ages span a wide range at the lowest education level. The range narrows as education level increases, and maternal age generally increases with education level.
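The narrowing spread described above can also be checked numerically rather than by eye. The following is a minimal sketch in base R on a small made-up data frame; the names `toy`, `edu_code`, `avg_age`, and `age_range` are hypothetical stand-ins, not columns of the births data.

```r
# Toy data, invented for illustration only (not the births data set).
toy <- data.frame(
  edu_code = c(1, 1, 1, 2, 2, 2, 8, 8, 8),
  avg_age  = c(24, 29, 35, 23, 25, 27, 33, 34, 35)
)

# Range (max - min) of the group averages within each education level.
spread <- aggregate(avg_age ~ edu_code, data = toy,
                    FUN = function(x) max(x) - min(x))
names(spread)[2] <- "age_range"

# One row per education level; a smaller age_range means a tighter cluster.
print(spread)
```

Applied to the real data, the same idea would group by `Education Level Code` and summarize `Average Age of Mother (years)`; the standard deviation (`sd`) could be used in place of the range if outliers are a concern.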
##2) Calculating the average maternal age across each education level and creating a line plot to communicate the findings
mean_age <- df %>%
group_by(`Education Level Code`) %>%
summarize(mean_maternal_age = mean(`Average Age of Mother (years)`, na.rm = TRUE))
mean_age
## # A tibble: 9 × 2
## `Education Level Code` mean_maternal_age
## <dbl> <dbl>
## 1 -9 29.4
## 2 1 29.4
## 3 2 25.1
## 4 3 26.5
## 5 4 28.1
## 6 5 29.9
## 7 6 31.2
## 8 7 32.8
## 9 8 33.7
ggplot(mean_age, aes(x = `Education Level Code`,
y = mean_maternal_age)) +
geom_line(color = "darkred", linewidth = 1) +
geom_point(color = "black", size = 3) +
scale_x_continuous(breaks = 1:8, limits = c(1,8)) +
labs(
title = "Mean Maternal Age by Education Level",
x = "Education Level Code (1 = lowest, 8 = highest)",
y = "Mean Maternal Age (years)"
) +
theme_minimal()
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
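From level 2 to level 8, the mean maternal age rises at every step, from 25.1 to 33.7 years. Level 1 is the exception: its mean (29.4 years) is higher than the means for levels 2 through 4. The row coded -9 falls outside the axis limits of 1 to 8, which is why one row is removed from the plot. Overall, the line plot supports the hypothesis that mothers with higher levels of education tend to have children at older ages.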
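The step-to-step change in the means can be checked directly from the table printed above. This sketch re-enters those rounded means by hand (the short column name `edu_code` is a stand-in for `Education Level Code`) and uses `diff()` to see where the mean rises or falls. It assumes that -9 is a placeholder code rather than a real education level, which the source does not state explicitly.

```r
# Means copied from the printed tibble above (rounded to one decimal).
mean_age <- data.frame(
  edu_code          = c(-9, 1:8),
  mean_maternal_age = c(29.4, 29.4, 25.1, 26.5, 28.1, 29.9, 31.2, 32.8, 33.7)
)

# Keep only codes 1 to 8 (assumption: -9 is not a real level).
valid <- mean_age[mean_age$edu_code > 0, ]

# Change in mean age from each level to the next (7 steps for 8 levels).
steps <- diff(valid$mean_maternal_age)
print(round(steps, 1))

# TRUE if every step after the first is an increase.
rises_from_level_2 <- all(steps[-1] > 0)
print(rises_from_level_2)
```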
##3) Using linear regression to see if there is a statistically significant relationship between education level and maternal age.
model <- lm(`Average Age of Mother (years)` ~ `Education Level Code`, data = df)
summary(model)
##
## Call:
## lm(formula = `Average Age of Mother (years)` ~ `Education Level Code`,
## data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.2235 -2.0694 0.3077 2.1388 7.7290
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.87756 0.04118 701.31 <2e-16 ***
## `Education Level Code` 0.22296 0.00733 30.42 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.572 on 5494 degrees of freedom
## Multiple R-squared: 0.1441, Adjusted R-squared: 0.144
## F-statistic: 925.3 on 1 and 5494 DF, p-value: < 2.2e-16
ggplot(df, aes(x = `Education Level Code`,
y = `Average Age of Mother (years)`)) +
geom_jitter(width = 0.15, height = 0, alpha = 0.4, color = "steelblue") +
scale_x_continuous(breaks = 1:8, limits = c(1,8)) +
scale_y_continuous(breaks = 20:40)+
geom_smooth(method = "lm", se = TRUE, color = "darkred") +
labs(
title = "Regression of Maternal Age on Education Level",
x = "Education Level Code",
y = "Average Maternal Age (years)"
) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 600 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 1225 rows containing missing values or values outside the scale range
## (`geom_point()`).
The linear regression shows a statistically significant positive
relationship between education level code and maternal age. The slope is
about 0.22 years per one-step increase in the code, and the p-value
(< 2e-16) means that a relationship this strong is highly unlikely to
arise by chance alone. The R^2 value of 0.144 means that education level
explains about 14% of the variation in average maternal age.
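One point worth checking: the residual degrees of freedom (5494) imply that all 5,496 rows entered the model, including the rows coded -9. If -9 is a placeholder for an unknown education level (an assumption; the source does not say), treating it as a number pulls on the fitted line. The sketch below uses made-up data with hypothetical names (`toy`, `edu_code`, `avg_age`) to show how the slope can change once such rows are filtered out. It is an illustration of the idea, not a re-analysis of the births data.

```r
# Toy data, invented for illustration: two rows carry the -9 placeholder.
toy <- data.frame(
  edu_code = c(-9, -9, 1, 2, 3, 4, 5, 6, 7, 8),
  avg_age  = c(29, 30, 24, 25.5, 26, 27.5, 28, 29.5, 30, 31.5)
)

# Fit once on every row, and once on rows with a real code (1 to 8).
fit_all   <- lm(avg_age ~ edu_code, data = toy)
fit_valid <- lm(avg_age ~ edu_code, data = subset(toy, edu_code > 0))

slope_all   <- unname(coef(fit_all)["edu_code"])
slope_valid <- unname(coef(fit_valid)["edu_code"])

# Compare slopes and R^2 between the two fits.
print(c(all = slope_all, valid = slope_valid))
print(c(all   = summary(fit_all)$r.squared,
        valid = summary(fit_valid)$r.squared))
```

In this toy example the placeholder rows sit far to the left of the real codes and flatten the slope, so the filtered fit is steeper. Whether the same happens with the births data would need to be checked by refitting on the filtered rows.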
##4) Creating a histogram to see how maternal age is distributed across all groups
ggplot(df, aes(x = `Average Age of Mother (years)`)) +
geom_histogram(binwidth = 1, fill = "steelblue", alpha = 0.8) +
labs(
title = "Distribution of Average Maternal Age",
x = "Average Age of Mother (years)",
y = "Count of Groups"
) +
theme_minimal()
The distribution is roughly bell-shaped, with the most frequent average
maternal ages falling between the late 20s and early 30s. The group
averages are concentrated in the middle rather than evenly distributed,
and the distribution tapers off at the extremes, where very low or very
high averages are rare.
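The counts behind a histogram with one-year bins can be reproduced by hand, which makes it easier to see where the averages concentrate. This is a minimal sketch in base R on made-up values; `avg_age` is a hypothetical vector, and the bin edges are chosen for simplicity and may not match where `geom_histogram()` places its bins by default.

```r
# Toy group averages, invented for illustration only.
avg_age <- c(24.2, 26.8, 27.5, 28.1, 28.9, 29.3, 29.7, 30.4, 31.6, 34.0)

# One-year bins that are closed on the left: [24,25), [25,26), ...
bins   <- cut(avg_age, breaks = seq(24, 35, by = 1), right = FALSE)
counts <- table(bins)
print(counts)

# Share of groups whose average falls between 27 and 32 years.
central_share <- mean(avg_age >= 27 & avg_age < 32)
print(central_share)
```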
##5) Splitting the data into two groups (male and female babies) and comparing average maternal age between the two groups using a t-test
group_male <- df %>% filter(`Gender` == "M")
group_female <- df %>% filter(`Gender` == "F")
t.test(group_male$`Average Age of Mother (years)`,
group_female$`Average Age of Mother (years)`,
var.equal = FALSE)
##
## Welch Two Sample t-test
##
## data: group_male$`Average Age of Mother (years)` and group_female$`Average Age of Mother (years)`
## t = -0.11938, df = 5494, p-value = 0.905
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.1559793 0.1380720
## sample estimates:
## mean of x mean of y
## 29.54780 29.55675
The average maternal age is almost identical for male and female babies (29.55 vs. 29.56 years). The test statistic is very close to zero (t = -0.12) and the p-value is large (0.905), so there is no statistically significant difference between the two groups. The 95% confidence interval for the difference in means (-0.16 to 0.14 years) includes zero. The t-test therefore gives no evidence that the gender of the baby is related to average maternal age.
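The numbers quoted from the t-test output can also be pulled out of the result object instead of being read off the printout. The sketch below runs a Welch t-test on two small made-up vectors (`male` and `female` are hypothetical, not the births data) and checks whether the 95% confidence interval contains zero, which corresponds to a two-sided p-value above 0.05.

```r
# Toy group averages, invented for illustration only.
male   <- c(29.1, 30.4, 28.7, 31.2, 29.9, 30.1)
female <- c(29.3, 30.2, 28.9, 31.0, 30.0, 29.8)

# Welch two-sample t-test (does not assume equal variances).
res <- t.test(male, female, var.equal = FALSE)

# Components of the returned "htest" object.
print(res$statistic)  # t value
print(res$p.value)    # two-sided p-value
print(res$conf.int)   # 95% CI for the difference in means

# TRUE when the interval spans zero.
ci_has_zero <- res$conf.int[1] < 0 && res$conf.int[2] > 0
print(ci_has_zero)
```

With the real data, the same fields would be read from the object returned by the `t.test()` call above, for example after assigning it to a variable.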