How consistent is the lenght of pregnancy in humans and are there relationships between habits or demographics of pregnant mothers and the length of pregnancy (gestation period) or child birth?
NCbirths: Data from births in North Carolina in 2001
This dataset contains data on a sample of 1450 birth records that statistician John Holcomb selected from the North Carolina State Center for Health and Environmental Statistics.
It contains 1450 observations on the following 15 variables.
This should include summary statistics, means, medians, quartiles, or any other relevant information about the data set. Please include some conclusions in the R Markdown text.
ncbirths <- read.csv('NCbirths.csv', header = TRUE, sep=",")
summary(ncbirths)
## X ID Plural Sex
## Min. : 1.0 Min. : 1.0 Min. :1.000 Min. :1.000
## 1st Qu.: 363.2 1st Qu.: 363.2 1st Qu.:1.000 1st Qu.:1.000
## Median : 725.5 Median : 725.5 Median :1.000 Median :1.000
## Mean : 725.5 Mean : 725.5 Mean :1.037 Mean :1.487
## 3rd Qu.:1087.8 3rd Qu.:1087.8 3rd Qu.:1.000 3rd Qu.:2.000
## Max. :1450.0 Max. :1450.0 Max. :3.000 Max. :2.000
##
## MomAge Weeks Marital RaceMom HispMom
## Min. :13.00 Min. :22.00 Min. :1.000 Min. :1.000 C: 2
## 1st Qu.:22.00 1st Qu.:38.00 1st Qu.:1.000 1st Qu.:1.000 M: 128
## Median :26.00 Median :39.00 Median :1.000 Median :1.000 N:1283
## Mean :26.76 Mean :38.62 Mean :1.345 Mean :1.831 O: 3
## 3rd Qu.:31.00 3rd Qu.:40.00 3rd Qu.:2.000 3rd Qu.:2.000 P: 9
## Max. :43.00 Max. :45.00 Max. :2.000 Max. :8.000 S: 25
## NA's :1
## Gained Smoke BirthWeightOz BirthWeightGm
## Min. : 0.0 Min. :0.0000 Min. : 12.0 Min. : 340.2
## 1st Qu.:20.0 1st Qu.:0.0000 1st Qu.:106.0 1st Qu.:3005.1
## Median :30.0 Median :0.0000 Median :118.0 Median :3345.3
## Mean :30.6 Mean :0.1446 Mean :116.2 Mean :3295.6
## 3rd Qu.:40.0 3rd Qu.:0.0000 3rd Qu.:130.0 3rd Qu.:3685.5
## Max. :95.0 Max. :1.0000 Max. :181.0 Max. :5131.4
## NA's :40 NA's :5
## Low Premie MomRace
## Min. :0.00000 Min. :0.0000 black :332
## 1st Qu.:0.00000 1st Qu.:0.0000 hispanic:164
## Median :0.00000 Median :0.0000 other : 48
## Mean :0.08621 Mean :0.1317 white :906
## 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :1.00000 Max. :1.0000
##
It is apparent that majority of the mothers were young married white women with average age of 26.76 years, average gestation period of 38.62 weeks which is quite okay, with most of their babies delivered around the 36th week or lower on the average. It appears most of them don’t smoke as the median is 0 and the mean is around 14%
str(ncbirths)
## 'data.frame': 1450 obs. of 16 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Plural : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Sex : int 1 2 1 1 1 1 2 2 2 2 ...
## $ MomAge : int 32 32 27 27 25 28 25 15 21 27 ...
## $ Weeks : int 40 37 39 39 39 43 39 42 39 40 ...
## $ Marital : int 1 1 1 1 1 1 1 2 1 2 ...
## $ RaceMom : int 1 1 1 1 1 1 1 1 1 1 ...
## $ HispMom : Factor w/ 6 levels "C","M","N","O",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Gained : int 38 34 12 15 32 32 75 25 28 37 ...
## $ Smoke : int 0 0 0 0 0 0 0 0 0 0 ...
## $ BirthWeightOz: int 111 116 138 136 121 117 143 113 120 124 ...
## $ BirthWeightGm: num 3147 3289 3912 3856 3430 ...
## $ Low : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Premie : int 0 0 0 0 0 0 0 0 0 0 ...
## $ MomRace : Factor w/ 4 levels "black","hispanic",..: 4 4 4 4 4 4 4 4 4 4 ...
Please perform some basic transformations. They will need to make sense but could include column renaming, creating a subset of the data, replacing values, or creating new columns with derived data (for example - if it makes sense you could sum two columns together)
racesEval <- function(races)
{
mraces <- vector()
for(race in races)
{
switch(race, `1` = {mraces <- c(mraces, "White")}, `2` = {mraces <- c(mraces, "Black")},
`3` = {mraces <- c(mraces, "American Indian")}, `4` = {mraces <- c(mraces, "Chinese")},
`5` = {mraces <- c(mraces, "Japanese")}, `6` = {mraces <- c(mraces, "Hawaiian")},
`7` = {mraces <- c(mraces, "Filipino")}, `8` = {mraces <- c(mraces, "Other")})
}
return(mraces)
}
oldermothers <- na.omit(subset(ncbirths, MomAge >= 30, select = c("MomAge","Weeks", "MomRace")))
names(oldermothers)<- c("Age", "Gestation", "Race")
youngermothers <- na.omit(subset(ncbirths, MomAge <= 29, select = c("MomAge","Weeks", "MomRace")))
names(youngermothers)<- c("Age", "Gestation", "Race")
mumrace <- na.omit(ncbirths[ , c('MomRace', 'Weeks')])
names(mumrace)<- na.omit(c("Race", "Gestation"))
smokingHabit <- na.omit(ncbirths[ , c('Smoke', 'Weeks')])
names(smokingHabit)<- c("SmokingHabit", "Gestation")
plural <- na.omit(ncbirths[ , c('Plural', 'Weeks')])
names(plural)<- c("BabiesInOneBirth", "Gestation")
birthweight <- na.omit(ncbirths[ , c('BirthWeightGm', 'Weeks')])
names(birthweight)<- c("BabyWeigth", "Gestation")
birthweightonsmoke <- na.omit(ncbirths[ , c('BirthWeightGm', 'Weeks')])
names(birthweightonsmoke)<- c("SmokingHabit", "BabyWeight")
## To visualise percentage and frequency distributions of all the moms in all ages according to their Races
allmothers <- na.omit(subset(ncbirths, MomAge > 0, select = c("MomAge","Weeks", "RaceMom", "BirthWeightGm")))
names(allmothers)<- c("Age", "Gestation (weeks)", "Race", "Birth Weight (Gm)")
allmothers$Race <- racesEval(allmothers$Race)
mothers <- na.omit(subset(ncbirths, MomAge > 0, select = c("MomAge","Weeks", "MomRace")))
names(mothers)<- c("Age", "Gestation", "Race")
## using summarise
mothers %>% summarise(
meanWhiteAge= round(mean(mothers$Age[mothers$Race == "white"]),2),
meanBlackAge= round(mean(mothers$Age[mothers$Race == "black"]),2),
meanHispanicAge= round(mean(mothers$Age[mothers$Race == "hispanic"]),2),
meanOtherAge= round(mean(mothers$Age[mothers$Race == "other"]),2),
whiteMoms = length(mothers$Race[mothers$Race == "white"]),
blackMoms=length(mothers$Race[mothers$Race == "black"]),
hispanicMoms=length(mothers$Race[mothers$Race == "hispanic"]),
otherMoms=length(mothers$Race[mothers$Race == "other"]),
meanWhiteGestation=round(mean(mothers$Gestation[mothers$Race == "white"]),2),
meanBlackGestation=round(mean(mothers$Gestation[mothers$Race == "black"]),2),
meanHispanicGestation=round(mean(mothers$Gestation[mothers$Race == "hispanic"]),2),
meanOtherGestation=round(mean(mothers$Gestation[mothers$Race == "other"]),2),
medianWhiteAge=round(median(mothers$Age[mothers$Race == "white"]),2),
medianBlackAge=round(mean(mothers$Age[mothers$Race == "black"]),2),
medianHispanicAge=round(mean(mothers$Age[mothers$Race == "hispanic"]),2),
medianOtherAge=round(mean(mothers$Age[mothers$Race == "other"]),2))
## meanWhiteAge meanBlackAge meanHispanicAge meanOtherAge whiteMoms
## 1 27.92 24.55 24.75 27.02 906
## blackMoms hispanicMoms otherMoms meanWhiteGestation meanBlackGestation
## 1 331 164 48 38.65 38.31
## meanHispanicGestation meanOtherGestation medianWhiteAge medianBlackAge
## 1 38.99 38.94 28 24.55
## medianHispanicAge medianOtherAge
## 1 24.75 27.02
print(summary(oldermothers))
## Age Gestation Race
## Min. :30.00 Min. :23.00 black : 65
## 1st Qu.:31.00 1st Qu.:38.00 hispanic: 33
## Median :33.00 Median :39.00 other : 21
## Mean :33.57 Mean :38.66 white :387
## 3rd Qu.:35.00 3rd Qu.:40.00
## Max. :43.00 Max. :45.00
head (oldermothers, 10)
## Age Gestation Race
## 1 32 40 white
## 2 32 37 white
## 18 36 37 white
## 22 30 38 white
## 23 32 43 hispanic
## 24 34 38 white
## 26 34 37 white
## 31 35 38 white
## 32 38 37 white
## 33 35 39 hispanic
print(summary(youngermothers))
## Age Gestation Race
## Min. :13.00 Min. :22.0 black :266
## 1st Qu.:20.00 1st Qu.:38.0 hispanic:131
## Median :23.00 Median :39.0 other : 27
## Mean :23.11 Mean :38.6 white :519
## 3rd Qu.:26.00 3rd Qu.:40.0
## Max. :29.00 Max. :45.0
head (youngermothers, 10)
## Age Gestation Race
## 3 27 39 white
## 4 27 39 white
## 5 25 39 white
## 6 28 43 white
## 7 25 39 white
## 8 15 42 white
## 9 21 39 white
## 10 27 40 white
## 11 26 41 white
## 12 20 41 white
Please make sure to display at least one scatter plot, box plot and histogram. Don’t be limited to this. Please explore the many other options in R packages such as ggplot2.
## percentage distribution of all the mothers in all ages according to their Races
plot_ly(mothers, labels = ~Race, values = ~Age, type = 'pie',
textposition = 'inside',
textinfo = 'label+percent',
insidetextfont = list(color = '#FFFFFF'),
hoverinfo = 'text',
text = ~paste(Race, ' mothers of all ages'),
showlegend = FALSE) %>%
layout(title = 'Distribution of all mothers by Race',
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
## Younger Moms
plot_ly(youngermothers, x = ~Age, color = ~Age) %>% add_histogram()%>%
layout(title = "Younger Moms' Frequency distribution",
xaxis = list(title = "Age (years)", tickfont = list( size = 14, color = 'rgb(107, 107, 107)')),
yaxis = list(title = 'Frequency', titlefont = list(size = 16, color = 'rgb(107, 107, 107)'),
tickfont = list(size = 14,color = 'rgb(107, 107, 107)')))
## Warning in min(x, na.rm = na.rm): no non-missing arguments to min;
## returning Inf
## Warning in max(x, na.rm = na.rm): no non-missing arguments to max;
## returning -Inf
## Older moms Frequency distribution
plot_ly(oldermothers, x = ~Age, color = ~Age) %>% add_histogram()%>%
layout(title = "Older moms' Frequency distribution",
xaxis = list(title = "Age (years)", tickfont = list( size = 14, color = 'rgb(107, 107, 107)')),
yaxis = list(title = 'Frequency', titlefont = list(size = 16, color = 'rgb(107, 107, 107)'),
tickfont = list(size = 14,color = 'rgb(107, 107, 107)')))
## Warning in min(x, na.rm = na.rm): no non-missing arguments to min;
## returning Inf
## Warning in max(x, na.rm = na.rm): no non-missing arguments to max;
## returning -Inf
## Younger Moms Vs Gestation
plot_ly(youngermothers, type="bar", x = ~Gestation, y = ~Age, color = ~Age)%>%
layout(title = "Younger Moms' Vs Gestation",
xaxis = list(title = "Gestation (weeks)", tickfont = list( size = 14, color = 'rgb(107, 107, 107)')),
yaxis = list(title = 'Age (years)', titlefont = list(size = 16, color = 'rgb(107, 107, 107)'),
tickfont = list(size = 14,color = 'rgb(107, 107, 107)')))
## Warning: textfont.color doesn't (yet) support data arrays
## Warning: textfont.color doesn't (yet) support data arrays
## Older Moms Vs Gestation
plot_ly(data=oldermothers, type="bar", x = ~Gestation, y = ~Age, color = ~Age)%>%
layout(title = "Older Moms' Vs Gestation",
xaxis = list(title = "Gestation (weeks)", tickfont = list( size = 14, color = 'rgb(107, 107, 107)')),
yaxis = list(title = 'Age (years)', titlefont = list(size = 16, color = 'rgb(107, 107, 107)'),
tickfont = list(size = 14,color = 'rgb(107, 107, 107)')))
## Warning: textfont.color doesn't (yet) support data arrays
## Warning: textfont.color doesn't (yet) support data arrays
ggplot(mumrace, aes(x=Race, y=Gestation)) +
geom_point(aes(col=Race, size=Gestation)) +
geom_smooth(method="loess", se=T) +
labs(title="Gestation Period VS Race", y="Gestation", x="Race", caption="North Carolina births (2001)")
boxplot(plural$Gestation ~ plural$BabiesInOneBirth, data=plural, main="Gestation Period Vs. Number of Babies per Birth", font.main=3, cex.main=1.2, xlab="Number of Babies per birth", ylab="Gestation (weeks)", font.lab=3, col=(c("#2962ff","#00695c", "#ad1457")) ,
ylab="disease" , xlab="- variety -")
## Mom's Age Vs Gestation
plot_ly(data = allmothers, x = ~`Gestation (weeks)`, y = ~Age, color = ~Age,
type = "scatter",
mode="markers")%>%
layout(title = "Mom's Age (years) Vs. Gestation (weeks)",
xaxis = list(title = "Gestation (weeks)", tickfont = list( size = 14, color = 'rgb(107, 107, 107)')),
yaxis = list(title = "Mom's Age (years)", titlefont = list(size = 16, color = 'rgb(107, 107, 107)'),
tickfont = list(size = 14,color = 'rgb(107, 107, 107)')))
## baby's weight race Vs Mom's Race
plot_ly(data = allmothers, type="bar", x = ~Race, y = ~`Birth Weight (Gm)`, color = ~Race)%>%
layout(title = "Baby's weight (gm) Vs. Mom's Race",
xaxis = list(title = "Mom's Race", tickfont = list( size = 14, color = 'rgb(107, 107, 107)')),
yaxis = list(title = "Baby's Weight (gm)", titlefont = list(size = 16, color = 'rgb(107, 107, 107)'),
tickfont = list(size = 14,color = 'rgb(107, 107, 107)')))
## smokingHabit Vs Gestation
ggplot(smokingHabit, aes(Gestation, SmokingHabit))+ geom_point(colour = smokingHabit$Gestation, size = 4) + geom_line(colour ="#2962ff", size=0.8) +
labs(y="Smoking Habit",
x="Gestation (weeks)",
title="SMoking Habit Vs Gestation(weeks)",
caption="North Carolina Births (2001)")
## 3D plot to consider gestation periods in relation to Race and Age
plot_ly(mothers, x = ~Race, y = ~Age, z = ~Gestation, color = ~Race, colors = c('#BF382A', '#0C4B8E')) %>%
add_markers() %>%
layout(scene = list(xaxis = list(title = 'Race'),
yaxis = list(title = 'Age'),
zaxis = list(title = 'Gestation')))
From the analysis and visualisations, it is evident that length of pregnancy is not affected by Race and Smoking Habit. It is affected slightly by Age with Moms at ages 29 years having gestation periods that spanned 33 to 45 weeks. Also the number of babies (single, twins, etc) per birth reduces gestation periods and also affects baby’s weight. A lot of factors like mother’s health (which is not measured in this dataset) can contribute a lot in determining lenght of pregnancy, baby’s weight, timely/delayed/premature delivery, or even miscarriage. The comparison between Baby’s weight and mother’s race shows a fraction of the white moms delivering babies that weighed around 5000gm. This can be as a result of adequate diet and medical care during pregnancy.
place the original .csv in a github file and have R read from the link. This will be a very useful skill as you progress in your data science education and career.
Ncbirths_git <- read.csv2("https://raw.githubusercontent.com/henryvalentine/MSDS2019/master/NCbirths.csv", header = TRUE, sep=",")
summary(Ncbirths_git) #Summary of the csv data from github.com
## X ID Plural Sex
## Min. : 1.0 Min. : 1.0 Min. :1.000 Min. :1.000
## 1st Qu.: 363.2 1st Qu.: 363.2 1st Qu.:1.000 1st Qu.:1.000
## Median : 725.5 Median : 725.5 Median :1.000 Median :1.000
## Mean : 725.5 Mean : 725.5 Mean :1.037 Mean :1.487
## 3rd Qu.:1087.8 3rd Qu.:1087.8 3rd Qu.:1.000 3rd Qu.:2.000
## Max. :1450.0 Max. :1450.0 Max. :3.000 Max. :2.000
##
## MomAge Weeks Marital RaceMom HispMom
## Min. :13.00 Min. :22.00 Min. :1.000 Min. :1.000 C: 2
## 1st Qu.:22.00 1st Qu.:38.00 1st Qu.:1.000 1st Qu.:1.000 M: 128
## Median :26.00 Median :39.00 Median :1.000 Median :1.000 N:1283
## Mean :26.76 Mean :38.62 Mean :1.345 Mean :1.831 O: 3
## 3rd Qu.:31.00 3rd Qu.:40.00 3rd Qu.:2.000 3rd Qu.:2.000 P: 9
## Max. :43.00 Max. :45.00 Max. :2.000 Max. :8.000 S: 25
## NA's :1
## Gained Smoke BirthWeightOz BirthWeightGm
## Min. : 0.0 Min. :0.0000 Min. : 12.0 3316.95: 42
## 1st Qu.:20.0 1st Qu.:0.0000 1st Qu.:106.0 3175.2 : 38
## Median :30.0 Median :0.0000 Median :118.0 3402 : 38
## Mean :30.6 Mean :0.1446 Mean :116.2 3430.35: 38
## 3rd Qu.:40.0 3rd Qu.:0.0000 3rd Qu.:130.0 3260.25: 37
## Max. :95.0 Max. :1.0000 Max. :181.0 3345.3 : 35
## NA's :40 NA's :5 (Other):1222
## Low Premie MomRace
## Min. :0.00000 Min. :0.0000 black :332
## 1st Qu.:0.00000 1st Qu.:0.0000 hispanic:164
## Median :0.00000 Median :0.0000 other : 48
## Mean :0.08621 Mean :0.1317 white :906
## 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :1.00000 Max. :1.0000
##