Question for analysis:

How consistent is the lenght of pregnancy in humans and are there relationships between habits or demographics of pregnant mothers and the length of pregnancy (gestation period) or child birth?

I will attempt to take on the above questions using:

NCbirths: Data from births in North Carolina in 2001

This dataset contains data on a sample of 1450 birth records that statistician John Holcomb selected from the North Carolina State Center for Health and Environmental Statistics.

It contains 1450 observations on the following 15 variables.

  1. ID: Patient ID code
  2. Plural: 1=single birth, 2=twins, 3=triplets
  3. Sex: Sex of the baby 1=male 2=female
  4. MomAge: Mother’s age (in years)
  5. Weeks: Completed weeks of gestation
  6. Marital: Marital status: 1=married or 2=not married
  7. RaceMom: Mother’s race: 1=white, 2=black, 3=American Indian, 4=Chinese 5=Japanese, 6=Hawaiian, 7=Filipino, or 8=Other Asian or Pacific Islander
  8. HispMom: Hispanic origin of mother: C=Cuban, M=Mexican, N=not Hispanic O=Other Hispanic, P=Puerto Rico, S=Central/South America
  9. Gained: Weight gained during pregnancy (in pounds)
  10. Smoke: Smoker mom? 1=yes or 0=no
  11. BirthWeightOz: Birth weight in ounces
  12. BirthWeightGm: Birth weight in grams
  13. Low: Indicator for low birth weight, 1=2500 grams or less
  14. Premie: Indicator for premature birth, 1=36 weeks or sooner
  15. MomRace: Mother’s race: black, hispanic, other, or white

1. Data Exploration:

This should include summary statistics, means, medians, quartiles, or any other relevant information about the data set. Please include some conclusions in the R Markdown text.

ncbirths <- read.csv('NCbirths.csv', header = TRUE, sep=",")
summary(ncbirths)
##        X                ID             Plural           Sex       
##  Min.   :   1.0   Min.   :   1.0   Min.   :1.000   Min.   :1.000  
##  1st Qu.: 363.2   1st Qu.: 363.2   1st Qu.:1.000   1st Qu.:1.000  
##  Median : 725.5   Median : 725.5   Median :1.000   Median :1.000  
##  Mean   : 725.5   Mean   : 725.5   Mean   :1.037   Mean   :1.487  
##  3rd Qu.:1087.8   3rd Qu.:1087.8   3rd Qu.:1.000   3rd Qu.:2.000  
##  Max.   :1450.0   Max.   :1450.0   Max.   :3.000   Max.   :2.000  
##                                                                   
##      MomAge          Weeks          Marital         RaceMom      HispMom 
##  Min.   :13.00   Min.   :22.00   Min.   :1.000   Min.   :1.000   C:   2  
##  1st Qu.:22.00   1st Qu.:38.00   1st Qu.:1.000   1st Qu.:1.000   M: 128  
##  Median :26.00   Median :39.00   Median :1.000   Median :1.000   N:1283  
##  Mean   :26.76   Mean   :38.62   Mean   :1.345   Mean   :1.831   O:   3  
##  3rd Qu.:31.00   3rd Qu.:40.00   3rd Qu.:2.000   3rd Qu.:2.000   P:   9  
##  Max.   :43.00   Max.   :45.00   Max.   :2.000   Max.   :8.000   S:  25  
##                  NA's   :1                                               
##      Gained         Smoke        BirthWeightOz   BirthWeightGm   
##  Min.   : 0.0   Min.   :0.0000   Min.   : 12.0   Min.   : 340.2  
##  1st Qu.:20.0   1st Qu.:0.0000   1st Qu.:106.0   1st Qu.:3005.1  
##  Median :30.0   Median :0.0000   Median :118.0   Median :3345.3  
##  Mean   :30.6   Mean   :0.1446   Mean   :116.2   Mean   :3295.6  
##  3rd Qu.:40.0   3rd Qu.:0.0000   3rd Qu.:130.0   3rd Qu.:3685.5  
##  Max.   :95.0   Max.   :1.0000   Max.   :181.0   Max.   :5131.4  
##  NA's   :40     NA's   :5                                        
##       Low              Premie           MomRace   
##  Min.   :0.00000   Min.   :0.0000   black   :332  
##  1st Qu.:0.00000   1st Qu.:0.0000   hispanic:164  
##  Median :0.00000   Median :0.0000   other   : 48  
##  Mean   :0.08621   Mean   :0.1317   white   :906  
##  3rd Qu.:0.00000   3rd Qu.:0.0000                 
##  Max.   :1.00000   Max.   :1.0000                 
## 

It is apparent that majority of the mothers were young married white women with average age of 26.76 years, average gestation period of 38.62 weeks which is quite okay, with most of their babies delivered around the 36th week or lower on the average. It appears most of them don’t smoke as the median is 0 and the mean is around 14%

str(ncbirths)
## 'data.frame':    1450 obs. of  16 variables:
##  $ X            : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ ID           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Plural       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Sex          : int  1 2 1 1 1 1 2 2 2 2 ...
##  $ MomAge       : int  32 32 27 27 25 28 25 15 21 27 ...
##  $ Weeks        : int  40 37 39 39 39 43 39 42 39 40 ...
##  $ Marital      : int  1 1 1 1 1 1 1 2 1 2 ...
##  $ RaceMom      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ HispMom      : Factor w/ 6 levels "C","M","N","O",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Gained       : int  38 34 12 15 32 32 75 25 28 37 ...
##  $ Smoke        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ BirthWeightOz: int  111 116 138 136 121 117 143 113 120 124 ...
##  $ BirthWeightGm: num  3147 3289 3912 3856 3430 ...
##  $ Low          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Premie       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ MomRace      : Factor w/ 4 levels "black","hispanic",..: 4 4 4 4 4 4 4 4 4 4 ...

2. Data wrangling:

Please perform some basic transformations. They will need to make sense but could include column renaming, creating a subset of the data, replacing values, or creating new columns with derived data (for example - if it makes sense you could sum two columns together)

racesEval <- function(races)
  {
    mraces <- vector()
    
    for(race in races)
    {
       switch(race, `1` = {mraces <- c(mraces, "White")}, `2` = {mraces <- c(mraces, "Black")}, 
              `3` = {mraces <- c(mraces, "American Indian")}, `4` = {mraces <- c(mraces, "Chinese")}, 
             `5` = {mraces <- c(mraces, "Japanese")}, `6` = {mraces <- c(mraces, "Hawaiian")}, 
             `7` = {mraces <- c(mraces, "Filipino")}, `8` = {mraces <- c(mraces, "Other")})
      
    }
    
    return(mraces)
}
oldermothers <- na.omit(subset(ncbirths, MomAge >= 30, select = c("MomAge","Weeks", "MomRace")))
names(oldermothers)<- c("Age", "Gestation", "Race")

youngermothers <- na.omit(subset(ncbirths, MomAge <= 29, select = c("MomAge","Weeks", "MomRace")))
names(youngermothers)<- c("Age", "Gestation", "Race")

mumrace <- na.omit(ncbirths[ , c('MomRace', 'Weeks')])
names(mumrace)<- na.omit(c("Race", "Gestation"))

smokingHabit <- na.omit(ncbirths[ , c('Smoke', 'Weeks')])
names(smokingHabit)<- c("SmokingHabit", "Gestation")

plural <- na.omit(ncbirths[ , c('Plural', 'Weeks')])
names(plural)<- c("BabiesInOneBirth", "Gestation")

birthweight <- na.omit(ncbirths[ , c('BirthWeightGm', 'Weeks')])
names(birthweight)<- c("BabyWeigth", "Gestation")

birthweightonsmoke <- na.omit(ncbirths[ , c('BirthWeightGm', 'Weeks')])
names(birthweightonsmoke)<- c("SmokingHabit", "BabyWeight")


## To visualise percentage and frequency distributions of all the moms in all ages according to their Races

allmothers <- na.omit(subset(ncbirths, MomAge > 0, select = c("MomAge","Weeks", "RaceMom", "BirthWeightGm")))
names(allmothers)<- c("Age", "Gestation (weeks)", "Race", "Birth Weight (Gm)")
allmothers$Race <- racesEval(allmothers$Race)

mothers <- na.omit(subset(ncbirths, MomAge > 0, select = c("MomAge","Weeks", "MomRace")))
names(mothers)<- c("Age", "Gestation", "Race")
## using summarise
mothers %>% summarise(
          
          meanWhiteAge= round(mean(mothers$Age[mothers$Race == "white"]),2), 
          meanBlackAge= round(mean(mothers$Age[mothers$Race == "black"]),2), 
          meanHispanicAge= round(mean(mothers$Age[mothers$Race == "hispanic"]),2), 
          meanOtherAge= round(mean(mothers$Age[mothers$Race == "other"]),2),
          
          whiteMoms = length(mothers$Race[mothers$Race == "white"]), 
          blackMoms=length(mothers$Race[mothers$Race == "black"]),
          hispanicMoms=length(mothers$Race[mothers$Race == "hispanic"]),
          otherMoms=length(mothers$Race[mothers$Race == "other"]),
          
          meanWhiteGestation=round(mean(mothers$Gestation[mothers$Race == "white"]),2),
          meanBlackGestation=round(mean(mothers$Gestation[mothers$Race == "black"]),2),
          meanHispanicGestation=round(mean(mothers$Gestation[mothers$Race == "hispanic"]),2),
          meanOtherGestation=round(mean(mothers$Gestation[mothers$Race == "other"]),2),
          
          medianWhiteAge=round(median(mothers$Age[mothers$Race == "white"]),2), 
          medianBlackAge=round(mean(mothers$Age[mothers$Race == "black"]),2), 
          medianHispanicAge=round(mean(mothers$Age[mothers$Race == "hispanic"]),2),
          medianOtherAge=round(mean(mothers$Age[mothers$Race == "other"]),2))
##   meanWhiteAge meanBlackAge meanHispanicAge meanOtherAge whiteMoms
## 1        27.92        24.55           24.75        27.02       906
##   blackMoms hispanicMoms otherMoms meanWhiteGestation meanBlackGestation
## 1       331          164        48              38.65              38.31
##   meanHispanicGestation meanOtherGestation medianWhiteAge medianBlackAge
## 1                 38.99              38.94             28          24.55
##   medianHispanicAge medianOtherAge
## 1             24.75          27.02

Comparing gestation periods in mothers above 29 years of age and those 29 years and below

print(summary(oldermothers))
##       Age          Gestation           Race    
##  Min.   :30.00   Min.   :23.00   black   : 65  
##  1st Qu.:31.00   1st Qu.:38.00   hispanic: 33  
##  Median :33.00   Median :39.00   other   : 21  
##  Mean   :33.57   Mean   :38.66   white   :387  
##  3rd Qu.:35.00   3rd Qu.:40.00                 
##  Max.   :43.00   Max.   :45.00
head (oldermothers, 10)
##    Age Gestation     Race
## 1   32        40    white
## 2   32        37    white
## 18  36        37    white
## 22  30        38    white
## 23  32        43 hispanic
## 24  34        38    white
## 26  34        37    white
## 31  35        38    white
## 32  38        37    white
## 33  35        39 hispanic
print(summary(youngermothers))
##       Age          Gestation          Race    
##  Min.   :13.00   Min.   :22.0   black   :266  
##  1st Qu.:20.00   1st Qu.:38.0   hispanic:131  
##  Median :23.00   Median :39.0   other   : 27  
##  Mean   :23.11   Mean   :38.6   white   :519  
##  3rd Qu.:26.00   3rd Qu.:40.0                 
##  Max.   :29.00   Max.   :45.0
head (youngermothers, 10)
##    Age Gestation  Race
## 3   27        39 white
## 4   27        39 white
## 5   25        39 white
## 6   28        43 white
## 7   25        39 white
## 8   15        42 white
## 9   21        39 white
## 10  27        40 white
## 11  26        41 white
## 12  20        41 white

3. Graphics:

Please make sure to display at least one scatter plot, box plot and histogram. Don’t be limited to this. Please explore the many other options in R packages such as ggplot2.

## percentage distribution of all the mothers in all ages according to their Races

plot_ly(mothers, labels = ~Race, values = ~Age, type = 'pie',
        textposition = 'inside',
        textinfo = 'label+percent',
        insidetextfont = list(color = '#FFFFFF'),
        hoverinfo = 'text',
        text = ~paste(Race, ' mothers of all ages'),
                      
        showlegend = FALSE) %>%
  layout(title = 'Distribution of all mothers by Race',
         xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
         yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))

Age distribution of mothers:

## Younger Moms
plot_ly(youngermothers, x = ~Age, color = ~Age) %>% add_histogram()%>%
  layout(title = "Younger Moms' Frequency distribution",
         xaxis = list(title = "Age (years)", tickfont = list( size = 14, color = 'rgb(107, 107, 107)')),
         yaxis = list(title = 'Frequency', titlefont = list(size = 16, color = 'rgb(107, 107, 107)'),
         tickfont = list(size = 14,color = 'rgb(107, 107, 107)')))
## Warning in min(x, na.rm = na.rm): no non-missing arguments to min;
## returning Inf
## Warning in max(x, na.rm = na.rm): no non-missing arguments to max;
## returning -Inf
## Older moms Frequency distribution
plot_ly(oldermothers, x = ~Age, color = ~Age) %>% add_histogram()%>%
  layout(title = "Older moms' Frequency distribution",
         xaxis = list(title = "Age (years)", tickfont = list( size = 14, color = 'rgb(107, 107, 107)')),
         yaxis = list(title = 'Frequency', titlefont = list(size = 16, color = 'rgb(107, 107, 107)'),
         tickfont = list(size = 14,color = 'rgb(107, 107, 107)')))
## Warning in min(x, na.rm = na.rm): no non-missing arguments to min;
## returning Inf
## Warning in max(x, na.rm = na.rm): no non-missing arguments to max;
## returning -Inf
## Younger Moms Vs Gestation

plot_ly(youngermothers, type="bar", x = ~Gestation, y = ~Age, color = ~Age)%>%
  layout(title = "Younger Moms' Vs Gestation",
         xaxis = list(title = "Gestation (weeks)", tickfont = list( size = 14, color = 'rgb(107, 107, 107)')),
         yaxis = list(title = 'Age (years)', titlefont = list(size = 16, color = 'rgb(107, 107, 107)'),
         tickfont = list(size = 14,color = 'rgb(107, 107, 107)')))
## Warning: textfont.color doesn't (yet) support data arrays

## Warning: textfont.color doesn't (yet) support data arrays

Moms at ages 29 years had gestation periods that spanned 33 to 45 weeks

## Older Moms Vs Gestation

plot_ly(data=oldermothers, type="bar", x = ~Gestation, y = ~Age, color = ~Age)%>%
  layout(title = "Older Moms' Vs Gestation",
         xaxis = list(title = "Gestation (weeks)", tickfont = list( size = 14, color = 'rgb(107, 107, 107)')),
         yaxis = list(title = 'Age (years)', titlefont = list(size = 16, color = 'rgb(107, 107, 107)'),
         tickfont = list(size = 14,color = 'rgb(107, 107, 107)')))
## Warning: textfont.color doesn't (yet) support data arrays

## Warning: textfont.color doesn't (yet) support data arrays
ggplot(mumrace, aes(x=Race, y=Gestation)) + 
  geom_point(aes(col=Race, size=Gestation)) + 
  geom_smooth(method="loess", se=T) + 
  labs(title="Gestation Period VS Race", y="Gestation", x="Race", caption="North Carolina births (2001)")

Race doesn’t have any effect on gestation period

boxplot(plural$Gestation ~ plural$BabiesInOneBirth, data=plural, main="Gestation Period Vs. Number of Babies per Birth", font.main=3, cex.main=1.2, xlab="Number of Babies per birth", ylab="Gestation (weeks)", font.lab=3, col=(c("#2962ff","#00695c", "#ad1457")) , 
    ylab="disease" , xlab="- variety -")

It appears that the more the number of babies per birth, the lower the gestation period

## Mom's Age Vs Gestation

plot_ly(data = allmothers, x = ~`Gestation (weeks)`, y = ~Age, color = ~Age,
         type = "scatter", 
         mode="markers")%>%
  layout(title = "Mom's Age (years) Vs. Gestation (weeks)",
         xaxis = list(title = "Gestation (weeks)", tickfont = list( size = 14, color = 'rgb(107, 107, 107)')),
         yaxis = list(title = "Mom's Age (years)", titlefont = list(size = 16, color = 'rgb(107, 107, 107)'),
         tickfont = list(size = 14,color = 'rgb(107, 107, 107)')))

Suprisingly, the oldest moms didn’t have the longest gestation period

## baby's weight race Vs Mom's Race

plot_ly(data = allmothers, type="bar", x = ~Race, y = ~`Birth Weight (Gm)`, color = ~Race)%>%
  layout(title = "Baby's weight (gm) Vs. Mom's Race",
         xaxis = list(title = "Mom's Race", tickfont = list( size = 14, color = 'rgb(107, 107, 107)')),
         yaxis = list(title = "Baby's Weight (gm)", titlefont = list(size = 16, color = 'rgb(107, 107, 107)'),
         tickfont = list(size = 14,color = 'rgb(107, 107, 107)')))
## smokingHabit Vs Gestation
  
ggplot(smokingHabit, aes(Gestation, SmokingHabit))+ geom_point(colour = smokingHabit$Gestation, size = 4) + geom_line(colour ="#2962ff", size=0.8) +
  labs(y="Smoking Habit", 
       x="Gestation (weeks)", 
       title="SMoking Habit Vs Gestation(weeks)", 
       caption="North Carolina Births (2001)")

There appears to be no relationship between smoking habit and gestation period. Lenght of pregnancy is not affected by smoking habit.

## 3D plot to consider gestation periods in relation to Race and Age

plot_ly(mothers, x = ~Race, y = ~Age, z = ~Gestation, color = ~Race, colors = c('#BF382A', '#0C4B8E')) %>%
  add_markers() %>%
  layout(scene = list(xaxis = list(title = 'Race'),
                     yaxis = list(title = 'Age'),
                     zaxis = list(title = 'Gestation')))

Conclusion:

From the analysis and visualisations, it is evident that length of pregnancy is not affected by Race and Smoking Habit. It is affected slightly by Age with Moms at ages 29 years having gestation periods that spanned 33 to 45 weeks. Also the number of babies (single, twins, etc) per birth reduces gestation periods and also affects baby’s weight. A lot of factors like mother’s health (which is not measured in this dataset) can contribute a lot in determining lenght of pregnancy, baby’s weight, timely/delayed/premature delivery, or even miscarriage. The comparison between Baby’s weight and mother’s race shows a fraction of the white moms delivering babies that weighed around 5000gm. This can be as a result of adequate diet and medical care during pregnancy.

5. BONUS:

place the original .csv in a github file and have R read from the link. This will be a very useful skill as you progress in your data science education and career.

Ncbirths_git <- read.csv2("https://raw.githubusercontent.com/henryvalentine/MSDS2019/master/NCbirths.csv", header = TRUE, sep=",")
summary(Ncbirths_git) #Summary of the csv data from github.com
##        X                ID             Plural           Sex       
##  Min.   :   1.0   Min.   :   1.0   Min.   :1.000   Min.   :1.000  
##  1st Qu.: 363.2   1st Qu.: 363.2   1st Qu.:1.000   1st Qu.:1.000  
##  Median : 725.5   Median : 725.5   Median :1.000   Median :1.000  
##  Mean   : 725.5   Mean   : 725.5   Mean   :1.037   Mean   :1.487  
##  3rd Qu.:1087.8   3rd Qu.:1087.8   3rd Qu.:1.000   3rd Qu.:2.000  
##  Max.   :1450.0   Max.   :1450.0   Max.   :3.000   Max.   :2.000  
##                                                                   
##      MomAge          Weeks          Marital         RaceMom      HispMom 
##  Min.   :13.00   Min.   :22.00   Min.   :1.000   Min.   :1.000   C:   2  
##  1st Qu.:22.00   1st Qu.:38.00   1st Qu.:1.000   1st Qu.:1.000   M: 128  
##  Median :26.00   Median :39.00   Median :1.000   Median :1.000   N:1283  
##  Mean   :26.76   Mean   :38.62   Mean   :1.345   Mean   :1.831   O:   3  
##  3rd Qu.:31.00   3rd Qu.:40.00   3rd Qu.:2.000   3rd Qu.:2.000   P:   9  
##  Max.   :43.00   Max.   :45.00   Max.   :2.000   Max.   :8.000   S:  25  
##                  NA's   :1                                               
##      Gained         Smoke        BirthWeightOz   BirthWeightGm 
##  Min.   : 0.0   Min.   :0.0000   Min.   : 12.0   3316.95:  42  
##  1st Qu.:20.0   1st Qu.:0.0000   1st Qu.:106.0   3175.2 :  38  
##  Median :30.0   Median :0.0000   Median :118.0   3402   :  38  
##  Mean   :30.6   Mean   :0.1446   Mean   :116.2   3430.35:  38  
##  3rd Qu.:40.0   3rd Qu.:0.0000   3rd Qu.:130.0   3260.25:  37  
##  Max.   :95.0   Max.   :1.0000   Max.   :181.0   3345.3 :  35  
##  NA's   :40     NA's   :5                        (Other):1222  
##       Low              Premie           MomRace   
##  Min.   :0.00000   Min.   :0.0000   black   :332  
##  1st Qu.:0.00000   1st Qu.:0.0000   hispanic:164  
##  Median :0.00000   Median :0.0000   other   : 48  
##  Mean   :0.08621   Mean   :0.1317   white   :906  
##  3rd Qu.:0.00000   3rd Qu.:0.0000                 
##  Max.   :1.00000   Max.   :1.0000                 
##