Final project data 101

Introduction:

Research question

Does mean birth weight differ across mother’s age groups (teens, adults, and mature mothers)?

The data set I am investigating is called “ncbirths.csv”. It comes from a large public database of births recorded in the state of North Carolina. The data have been of interest to medical researchers who study the relationship between the habits and practices of expectant mothers and the birth of their children. I found this data set at https://www.openintro.org/data/index.php?data=ncbirths . I chose this data set because, as a woman, I thought it would be interesting to know whether mean birth weight differs across mothers’ age groups. The data set has 1,000 observations and 13 variables.

Variables:

mage (quantitative variable) = mother’s age in years.

weight (continuous quantitative variable) = birth weight in pounds of the child.

Loading library

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.1
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Setting working directory and reading the data

setwd("~/Desktop/Data 101")
nc_births<-read_csv("ncbirths.csv")

Data analysis:

I started my analysis by checking the head and structure of the data using the “head” and “str” functions. I then checked for NA’s using ‘colSums’. The only NA’s were in variables I was not using, so I did not need to filter them out. Next, I used dplyr functions. First, I used select to keep the variables I was focused on, mage and weight. Second, I used the mutate function with ‘case_when’ to add a column for the different pregnancy age groups. Lastly, I used group_by to group by pregnancy age group and summarise to compute the minimum, maximum, and mean age in each group, to understand the age groups in the data set.

Finally, I plotted a box plot comparing the children’s birth weights across their mothers’ pregnancy age groups. I chose a box plot because it makes the center and spread of each group easy to compare.

Checking the structure and head of the columns

str(nc_births)

## spc_tbl_ [1,000 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ fage          : num [1:1000] NA NA 19 21 NA NA 18 17 NA 20 ...
##  $ mage          : num [1:1000] 13 14 15 15 15 15 15 15 16 16 ...
##  $ mature        : chr [1:1000] "younger mom" "younger mom" "younger mom" "younger mom" ...
##  $ weeks         : num [1:1000] 39 42 37 41 39 38 37 35 38 37 ...
##  $ premie        : chr [1:1000] "full term" "full term" "full term" "full term" ...
##  $ visits        : num [1:1000] 10 15 11 6 9 19 12 5 9 13 ...
##  $ marital       : chr [1:1000] "not married" "not married" "not married" "not married" ...
##  $ gained        : num [1:1000] 38 20 38 34 27 22 76 15 NA 52 ...
##  $ weight        : num [1:1000] 7.63 7.88 6.63 8 6.38 5.38 8.44 4.69 8.81 6.94 ...
##  $ lowbirthweight: chr [1:1000] "not low" "not low" "not low" "not low" ...
##  $ gender        : chr [1:1000] "male" "male" "female" "male" ...
##  $ habit         : chr [1:1000] "nonsmoker" "nonsmoker" "nonsmoker" "nonsmoker" ...
##  $ whitemom      : chr [1:1000] "not white" "not white" "white" "white" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   fage = col_double(),
##   ..   mage = col_double(),
##   ..   mature = col_character(),
##   ..   weeks = col_double(),
##   ..   premie = col_character(),
##   ..   visits = col_double(),
##   ..   marital = col_character(),
##   ..   gained = col_double(),
##   ..   weight = col_double(),
##   ..   lowbirthweight = col_character(),
##   ..   gender = col_character(),
##   ..   habit = col_character(),
##   ..   whitemom = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

head(nc_births)

## # A tibble: 6 × 13
##    fage  mage mature    weeks premie visits marital gained weight lowbirthweight
##   <dbl> <dbl> <chr>     <dbl> <chr>   <dbl> <chr>    <dbl>  <dbl> <chr>         
## 1    NA    13 younger …    39 full …     10 not ma…     38   7.63 not low       
## 2    NA    14 younger …    42 full …     15 not ma…     20   7.88 not low       
## 3    19    15 younger …    37 full …     11 not ma…     38   6.63 not low       
## 4    21    15 younger …    41 full …      6 not ma…     34   8    not low       
## 5    NA    15 younger …    39 full …      9 not ma…     27   6.38 not low       
## 6    NA    15 younger …    38 full …     19 not ma…     22   5.38 low           
## # ℹ 3 more variables: gender <chr>, habit <chr>, whitemom <chr>

Checking for NA’s

colSums(is.na(nc_births))

##           fage           mage         mature          weeks         premie 
##            171              0              0              2              2 
##         visits        marital         gained         weight lowbirthweight 
##              9              1             27              0              0 
##         gender          habit       whitemom 
##              0              1              2

I did not filter out NA’s since there were none in the variables I was using.
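The check can also be restricted to the two analysis variables, which makes the "no NA's" conclusion explicit. The sketch below is only an illustration: `toy_births` is a hypothetical stand-in with invented rows, and `analysis_vars` is a name introduced here. With the real data, `nc_births` would take the place of `toy_births`.

```r
# Hypothetical stand-in for nc_births: a few invented rows with the same
# column names, with NAs only in columns that the analysis does not use.
toy_births <- data.frame(
  fage   = c(NA, 19, 21, NA),
  mage   = c(13, 15, 15, 16),
  gained = c(38, NA, 34, 27),
  weight = c(7.63, 6.63, 8.00, 8.81)
)

analysis_vars <- c("mage", "weight")

# Count missing values only in the columns used for the analysis
na_counts <- colSums(is.na(toy_births[, analysis_vars]))
na_counts

# complete.cases() counts the rows that have no NA in these columns
rows_kept <- sum(complete.cases(toy_births[, analysis_vars]))
rows_kept
```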

Making the pregnancy groups

nc_births_clean<-nc_births |>
  select(mage, weight) |>
  mutate(pregnancy_age_class=case_when( 
    mage < 20 ~ "teenage pregnancy",
    mage >= 20 & mage <=34 ~ "common pregnancy",
    mage >=35 ~ "AMA pregnancy"))
nc_births_clean

## # A tibble: 1,000 × 3
##     mage weight pregnancy_age_class
##    <dbl>  <dbl> <chr>              
##  1    13   7.63 teenage pregnancy  
##  2    14   7.88 teenage pregnancy  
##  3    15   6.63 teenage pregnancy  
##  4    15   8    teenage pregnancy  
##  5    15   6.38 teenage pregnancy  
##  6    15   5.38 teenage pregnancy  
##  7    15   8.44 teenage pregnancy  
##  8    15   4.69 teenage pregnancy  
##  9    16   8.81 teenage pregnancy  
## 10    16   6.94 teenage pregnancy  
## # ℹ 990 more rows

I made the groups based on the types of pregnancy described in the links in the references.
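As an alternative to `case_when`, base R's `cut()` can build the same three groups as an ordered factor, so tables and plots list them as teenage, common, AMA instead of alphabetically. This is a swapped-in technique, not the method used in this report, and the ages below are hypothetical values chosen to sit on the group boundaries.

```r
# Hypothetical ages chosen to sit on the group boundaries
mage <- c(13, 19, 20, 34, 35, 50)

# right = FALSE closes each interval on the left:
# [-Inf, 20), [20, 35), [35, Inf)
age_class <- cut(
  mage,
  breaks = c(-Inf, 20, 35, Inf),
  labels = c("teenage pregnancy", "common pregnancy", "AMA pregnancy"),
  right  = FALSE
)

# Show each age next to its assigned class
data.frame(mage, age_class)

# Number of observations per class, in the order teenage, common, AMA
table(age_class)
```

For whole-year ages, as in this data set, these intervals give the same assignment as the `case_when` rules (under 20, 20 to 34, 35 and over).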

Summary Stats

Age in the table represents the mother’s age, and AMA = advanced maternal age.

nc_births_clean |>
  group_by(pregnancy_age_class) |>
  summarise(min_age=min(mage), 
            max_age=max(mage),
            mean_age=mean(mage))

## # A tibble: 3 × 4
##   pregnancy_age_class min_age max_age mean_age
##   <chr>                 <dbl>   <dbl>    <dbl>
## 1 AMA pregnancy            35      50     37.2
## 2 common pregnancy         20      34     26.6
## 3 teenage pregnancy        13      19     17.7

The summary statistics show the range and mean of mothers’ ages within each pregnancy age class. The teenage pregnancy class had the youngest mother, at age 13, which surprised me since that is very young to have a child. The AMA pregnancy class had the oldest mother, at age 50, which also surprised me since, according to Mayo Clinic (reference below), the average age of menopause is around 40-50 years old.
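Because the research question is about birth weight, a matching summary of weight per group may also be useful. The sketch below shows one way to get group size, mean, and standard deviation with base R's `aggregate()`. `toy_clean` is a hypothetical stand-in with invented values, so its output says nothing about the real data; `nc_births_clean` would replace it in practice.

```r
# Hypothetical stand-in for nc_births_clean (values invented for illustration)
toy_clean <- data.frame(
  weight = c(7.63, 6.63, 7.10, 7.50, 6.90, 7.40, 7.20, 6.80),
  pregnancy_age_class = c(
    "teenage pregnancy", "teenage pregnancy",
    "common pregnancy", "common pregnancy", "common pregnancy",
    "AMA pregnancy", "AMA pregnancy", "AMA pregnancy"
  )
)

# Group size, mean, and standard deviation of birth weight per class
weight_summary <- aggregate(
  weight ~ pregnancy_age_class,
  data = toy_clean,
  FUN  = function(w) c(n = length(w), mean = mean(w), sd = sd(w))
)
weight_summary
```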

Box plot visualization

ggplot(nc_births_clean,aes(x=pregnancy_age_class, y=weight, fill=pregnancy_age_class))+
  geom_boxplot()+
  labs(
    title= "Birth Weight by Mother's Age group",
    x="Mother's Age",
    y="Birth weight(lbs)" ) +
  theme_minimal()

The box plot shows the distribution of birth weights in pounds across the pregnancy age classes in the North Carolina birth data set. The three age groups have similar median birth weights, around 7 lbs each. Each group has low outliers, below the lower whisker, and the common pregnancy group has the most outliers, including some high ones. This could be due to the number of common pregnancies relative to the teenage and AMA pregnancies, since a group with more births has more chances to show extreme values.

Hypothesis

\(H_0\): There is no difference in mean birth weight across the pregnancy age groups. \(H_a\): At least one pregnancy age group has a different mean birth weight from the others.

\(H_0\): \(\mu_t\) = \(\mu_c\) = \(\mu_m\)

\(H_a\): not all \(\mu_i\) are equal

where: \(\mu_t\)= mean weight of a child born from a teen mother. \(\mu_c\)= mean weight of a child born from a common adult mother. \(\mu_m\)= mean weight of a child born from a mature mother.

ANOVA test

I selected an ANOVA test because my dependent variable, weight (birth weight), is continuous and quantitative, and the age groups are categorical with more than two groups. ANOVA tests whether mean birth weight differs across mothers’ age groups (teens, adults, and mature mothers) or whether there is no difference. I first fit the ANOVA model with the quantitative variable as the response and the age groups as the predictor, then used the p-value from the summary to judge whether the result is statistically significant. Next, I ran Tukey’s Honestly Significant Difference test to see which pairs of pregnancy age classes, if any, differ significantly in mean birth weight.

Performing ANOVA

anova_result <- aov(weight ~ pregnancy_age_class, data = nc_births_clean) #Quantitative response first, pregnancy age class (categorical predictor) second

anova_result

## Call:
##    aov(formula = weight ~ pregnancy_age_class, data = nc_births_clean)
## 
## Terms:
##                 pregnancy_age_class Residuals
## Sum of Squares              12.3531 2262.0295
## Deg. of Freedom                   2       997
## 
## Residual standard error: 1.506266
## Estimated effects may be unbalanced

summary(anova_result)

##                      Df Sum Sq Mean Sq F value Pr(>F)  
## pregnancy_age_class   2   12.4   6.177   2.722 0.0662 .
## Residuals           997 2262.0   2.269                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

There are two degrees of freedom for pregnancy age class because there are three groups (3 - 1 = 2). The ANOVA tested whether mean birth weight differs among the three pregnancy age classes: teenage pregnancy, common pregnancy, and AMA pregnancy. The p-value is 0.0662, which is above the 0.05 significance level, so the effect of pregnancy age group is not statistically significant. We therefore fail to reject the null hypothesis; there is no strong evidence that mean birth weight differs by pregnancy age class.
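The F value and p-value in the summary table can be reproduced by hand from the sums of squares and degrees of freedom printed by `aov()`. The sketch below uses only numbers copied from that output; the variable names are introduced here for illustration.

```r
# Values copied from the aov() output above
ss_between <- 12.3531    # Sum of Squares for pregnancy_age_class
ss_within  <- 2262.0295  # Sum of Squares for Residuals
df_between <- 2          # number of groups minus 1 (3 - 1)
df_within  <- 997        # observations minus number of groups (1000 - 3)

# Mean squares are sums of squares divided by their degrees of freedom
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within

# F statistic and its upper-tail p-value under the F(2, 997) distribution
f_value <- ms_between / ms_within
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

round(c(F = f_value, p = p_value), 4)

# The residual standard error is the square root of the residual mean square
sqrt(ms_within)
```

The rounded results should agree with the F value of 2.722 and Pr(>F) of 0.0662 in the summary table, and the last line with the residual standard error of 1.506266.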

Tukey’s Honestly Significant Difference (HSD) test on the ANOVA model

TukeyHSD(anova_result)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = weight ~ pregnancy_age_class, data = nc_births_clean)
## 
## $pregnancy_age_class
##                                           diff        lwr        upr     p adj
## common pregnancy-AMA pregnancy      0.01699884 -0.3154120 0.34940970 0.9920884
## teenage pregnancy-AMA pregnancy    -0.34029118 -0.7959449 0.11536256 0.1862368
## teenage pregnancy-common pregnancy -0.35729002 -0.7180507 0.00347069 0.0529134

Tukey’s Honestly Significant Difference test gives results consistent with the ANOVA model: none of the pairwise differences between age groups is significant at the 0.05 level. The largest difference was between teenage pregnancy and common pregnancy, at -0.357 lbs, with the lowest adjusted p-value, 0.0529, very close to the 0.05 threshold. Common pregnancy and AMA (advanced maternal age) pregnancy had the smallest difference in mean birth weight, 0.017 lbs, and the largest adjusted p-value, 0.992, indicating no evidence of a difference. Since none of these results is statistically significant, the data do not support the claim that mean birth weight depends on pregnancy age class. The differences observed in this data set could plausibly be due to random chance rather than a true difference in mean birth weight between age groups.
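Another way to read the Tukey table is through its confidence intervals: a pairwise difference is significant at the 5% family-wise level only if its interval excludes zero. The sketch below re-enters the numbers from the `TukeyHSD()` output and checks each interval; the data frame and column names are introduced here for illustration.

```r
# Values copied from the TukeyHSD() output above
tukey <- data.frame(
  comparison = c("common - AMA", "teenage - AMA", "teenage - common"),
  diff  = c(0.01699884, -0.34029118, -0.35729002),
  lwr   = c(-0.3154120, -0.7959449, -0.7180507),
  upr   = c(0.34940970, 0.11536256, 0.00347069),
  p_adj = c(0.9920884, 0.1862368, 0.0529134)
)

# An interval that contains 0 is consistent with no difference in means
tukey$contains_zero <- tukey$lwr < 0 & tukey$upr > 0

# Equivalent decision based on the adjusted p-value at the 0.05 level
tukey$significant <- tukey$p_adj < 0.05

tukey[, c("comparison", "diff", "contains_zero", "significant")]
```

The teenage versus common interval only just reaches past zero (upper bound 0.0035), which matches its adjusted p-value sitting just above 0.05.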

Conclusion and future directions

In conclusion, the ANOVA tested whether mean birth weight differs among the three pregnancy age classes: teenage pregnancy, common pregnancy, and AMA pregnancy. The p-value is 0.0662, which is above the 0.05 significance level, so the effect of pregnancy age group is not statistically significant. We therefore fail to reject the null hypothesis; there is no strong evidence that mean birth weight differs by pregnancy age class.

The implication of this result is that any differences in mean birth weight in this sample could plausibly be due to random chance rather than a true difference between age groups.

Potential Avenues

One potential avenue is to test the difference in mean birth weight across age groups for the entire U.S. Since this data set only covers North Carolina, results could differ on a national scale. Additionally, confounding variables, such as whether the mother smoked (the habit variable), could affect the results and should be accounted for.
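One possible way to account for smoking habit is to add it to the ANOVA model as a second factor. The sketch below is only an outline of that idea: `sim_births` is simulated data with no real effects built in, so its output says nothing about actual births. With the real data, the model would be fit to `nc_births` after adding the age class column.

```r
set.seed(101)  # arbitrary seed so the simulated data are reproducible

# Simulated stand-in data; the real analysis would use the habit column
# of nc_births together with pregnancy_age_class
n <- 120
sim_births <- data.frame(
  pregnancy_age_class = factor(rep(
    c("teenage pregnancy", "common pregnancy", "AMA pregnancy"),
    each = n / 3
  )),
  habit  = factor(sample(c("nonsmoker", "smoker"), n, replace = TRUE)),
  weight = rnorm(n, mean = 7, sd = 1.5)
)

# Additive model: habit enters first, so the age class row of the table
# tests the age effect after accounting for smoking habit
adjusted_fit <- aov(weight ~ habit + pregnancy_age_class, data = sim_births)
summary(adjusted_fit)
```

Note that `aov()` drops rows with missing values by default, which matters here because the habit column has one NA in the real data.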

References

https://www.ssmhealth.com/services/maternity-care/high-risk-pregnancy/age-pregnancy

https://americanpregnancy.org/unplanned-pregnancy/teenage-pregnancy/