Guannan Shi PID:A98051827 1st Year Graduate Electrical Engineering
Hao Jiang PID:A91426797 1st Year Graduate Electrical Engineering
Wenhao Sheng PID:A99033876 1st Year Graduate Electrical Engineering
Imam Raj PID:A12045687 3rd year Probability and Statistics major
Vincent Yeh PID:A11437295 4th year Electrical Engineering major
The Surgeon General’s warning suggests that smoking by pregnant women may have a harmful effect on newborns’ health. In this homework assignment, we analyze the “babies.txt” dataset to answer whether smoking during pregnancy affects a baby’s birth weight and whether a baby’s birth weight has an effect on its health. We parse the data with R to look for a relationship between smoking and birth weight and, from there, use background data from other studies to relate birth weight to health. This lets us link smoking during pregnancy to the health of the baby.
The data in the case study was collected from the births of 1,236 male babies between 1960 and 1967. The data was restricted to women who were enrolled in a Kaiser Health Plan and received care for their pregnancies at Kaiser hospitals in Oakland, California. To reduce the number of external variables that could factor into birth weight, only single births were included and the babies had to live for at least 28 days. Seven variables were recorded for each birth. They describe the baby’s birth weight, the gestation period in days, and whether or not the baby is first born, as well as the mother’s age, height, pre-pregnancy weight, and whether or not she smoked during pregnancy. For some mothers, it is unknown whether or not they smoked during their pregnancies, so those data points were removed from the analysis.
Our data was parsed using R code in RStudio. Some entries contained unknown values (coded as 999, 99, or 9) for gestation, age, height, weight, or smoking status, making them invalid for our purposes. There were 62 of these entries, and we removed them first, leaving a total of 1,174 usable entries. From there we separated the entries for nonsmoking mothers from those for smoking mothers. There were 459 entries from smoking mothers and 715 from nonsmoking mothers. We then graphed the birth weights of the babies born to nonsmoking mothers and did the same for the babies born to smoking mothers. This gave us an easy, visual way to understand and compare the two groups.
In the numerical summary, babies born to mothers who smoke have an average birth weight of 113.8 ounces, and babies born to nonsmoking mothers have an average birth weight of 123.1 ounces, a difference of 9.3 ounces. The first quartiles in the summary tables also indicate that 25% of the babies born to smoking mothers have birth weights below 101 ounces, while 25% of the babies born to nonsmoking mothers have birth weights below 113 ounces.
# Set Working directory
setwd("/Users/weshe/Desktop/Math 289C/HWs/HW1")
# Input data
data <- read.table("babies.txt", header=TRUE)
# observe and clean data
cleandata <- data[which(data['gestation'] != 999),] # clean gestation
cleandata <- cleandata[which(cleandata['age'] != 99),] # clean age
cleandata <- cleandata[which(cleandata['height'] != 99),] # clean height
cleandata <- cleandata[which(cleandata['weight'] != 999),] # clean weight
data <- cleandata[which(cleandata['smoke'] != 9),] # clean smoke
summary(data)
      bwt          gestation         parity            age            height          weight          smoke
 Min.   : 55.0   Min.   :148.0   Min.   :0.0000   Min.   :15.00   Min.   :53.00   Min.   : 87.0   Min.   :0.000
 1st Qu.:108.0   1st Qu.:272.0   1st Qu.:0.0000   1st Qu.:23.00   1st Qu.:62.00   1st Qu.:114.2   1st Qu.:0.000
 Median :120.0   Median :280.0   Median :0.0000   Median :26.00   Median :64.00   Median :125.0   Median :0.000
 Mean   :119.5   Mean   :279.1   Mean   :0.2624   Mean   :27.23   Mean   :64.05   Mean   :128.5   Mean   :0.391
 3rd Qu.:131.0   3rd Qu.:288.0   3rd Qu.:1.0000   3rd Qu.:31.00   3rd Qu.:66.00   3rd Qu.:139.0   3rd Qu.:1.000
 Max.   :176.0   Max.   :353.0   Max.   :1.0000   Max.   :45.00   Max.   :72.00   Max.   :250.0   Max.   :1.000
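The cleaning step above filters one column at a time. A minimal sketch of an equivalent approach, which recodes the "unknown" codes as NA and then drops incomplete rows, is shown here. The toy data frame and the name `sentinels` are made up for illustration; only the codes (999, 99, 9) are taken from the filtering code above.

```r
# Toy data frame standing in for babies.txt (values are made up).
toy <- data.frame(
  bwt       = c(120, 113, 128, 108, 136),
  gestation = c(284, 999, 279, 282, 286),
  age       = c(27, 33, 99, 23, 25),
  height    = c(62, 64, 64, 99, 62),
  weight    = c(100, 135, 115, 125, 999),
  smoke     = c(0, 0, 1, 1, 0)
)

# Codes meaning "unknown", taken from the filtering code above.
sentinels <- c(gestation = 999, age = 99, height = 99, weight = 999, smoke = 9)

# Recode each sentinel as NA, then keep only complete rows.
for (col in names(sentinels)) {
  toy[[col]][toy[[col]] == sentinels[[col]]] <- NA
}
clean <- toy[complete.cases(toy), ]
nrow(toy) - nrow(clean)  # number of rows dropped
```

Recoding to NA keeps the missing-value rules in one place and lets functions such as complete.cases() or na.omit() do the filtering.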
smoker.ind <- which(data['smoke'] == 1)
data.smoker <- data[smoker.ind,] # Sort out data labeled as 'Smoker'
nonsmoker.ind <- which(data['smoke'] == 0)
data.nonsmoker <- data[nonsmoker.ind,] # Sort out data labeled as 'NonSmoker'
summary(data.smoker$bwt)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   58.0   101.0   115.0   113.8   126.0   163.0
summary(data.nonsmoker$bwt)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   55.0   113.0   123.0   123.1   134.0   176.0
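The per-group summaries above can also be computed without splitting the data frame by hand. The sketch below uses tapply() on a small made-up data frame; the values are illustrative and are not taken from babies.txt.

```r
# Made-up birth weights (ounces) and smoking indicator (1 = smoker).
toy <- data.frame(
  bwt   = c(120, 113, 128, 108, 136, 101),
  smoke = c(0, 0, 0, 1, 0, 1)
)

# Mean birth weight per group; the result is named "0" and "1".
group.means <- tapply(toy$bwt, toy$smoke, mean)
group.means

# Difference in means, nonsmokers minus smokers.
mean.diff <- unname(group.means["0"] - group.means["1"])
mean.diff
```

Applied to the real data, the same subtraction on the reported means gives 123.1 - 113.8 = 9.3 ounces.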
The histograms below show the distribution of birth weight for babies of smoking and nonsmoking mothers. In the overlapping plot, the distribution for babies of smoking mothers sits slightly to the left of the distribution for babies of nonsmoking mothers, which indicates that babies of nonsmoking mothers tend to have higher birth weights in general. In addition, birth weights in the 110 - 130 ounce range occur very frequently among babies of nonsmoking mothers, while they are much less frequent among babies of smoking mothers.
hist(data.smoker$bwt, col=rgb(1,0,0,0.5), xlim = c(60,180), ylim = c(0,200), main = "Histogram of smokers", xlab = "birth weight(ounce)")
hist(data.nonsmoker$bwt, col=rgb(0,0,1,0.4), ylim = c(0,200), main = "Histogram of nonsmokers", xlab = 'birth weight(ounce)')
hist(data.smoker$bwt, col=rgb(1,0,0,0.5),xlim = c(60,180), ylim=c(0,200), main="Overlapping Histogram(red-smoker, blue-nonsmoker)", xlab="birth weight(ounce)")
hist(data.nonsmoker$bwt, col=rgb(0,0,1,0.4), add=TRUE)
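Because the two groups differ in size (459 smokers and 715 nonsmokers), count-based histograms partly reflect sample size. One possible variation, sketched here on synthetic data, draws both histograms on the density scale with shared breaks so that the shapes can be compared directly. The simulated vectors and the break points are assumptions for illustration only.

```r
set.seed(1)
# Synthetic stand-ins for the two groups (not the real data).
smoker.bwt    <- rnorm(459, mean = 114, sd = 18)
nonsmoker.bwt <- rnorm(715, mean = 123, sd = 17)

# Shared breaks wide enough to cover both samples.
breaks <- seq(floor(min(smoker.bwt, nonsmoker.bwt) / 10) * 10,
              ceiling(max(smoker.bwt, nonsmoker.bwt) / 10) * 10, by = 10)

# freq = FALSE rescales each histogram so its bars have total area 1.
hist(nonsmoker.bwt, breaks = breaks, freq = FALSE, col = rgb(0, 0, 1, 0.4),
     main = "Density histograms (red-smoker, blue-nonsmoker)",
     xlab = "birth weight(ounce)")
hist(smoker.bwt, breaks = breaks, freq = FALSE, col = rgb(1, 0, 0, 0.5),
     add = TRUE)
```

With real data, `smoker.bwt` and `nonsmoker.bwt` would be replaced by `data.smoker$bwt` and `data.nonsmoker$bwt`.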
Besides the histograms of birth weight, we also plotted histograms of gestation for smoking and nonsmoking mothers to show the relation between gestation and birth weight. From the scientific literature, we learned that the normal gestational period is about 40 weeks (280 days) and that, in the final weeks, a baby gains about 0.2 pounds (3.2 ounces) per week.
In the two histograms below, most babies of smoking mothers have a gestational period of 270 to 280 days, with a large variance. Babies of nonsmoking mothers, however, have a gestational period close to 300 days, with a very small variance. This result indicates that babies of smoking mothers tend to have a shorter gestational period, which helps explain why they tend to have a lower birth weight.
hist(data.smoker$gestation, main=('Histogram of gestation for smoker'), col=rgb(1,0,0,0.5), xlim = c(130,350), ylim = c(0,300), xlab = "gestation(days)")
hist(data.nonsmoker$gestation, main=('Histogram of gestation for nonsmoker'), col=rgb(0,0,1,0.4), xlim = c(130,350), xlab = "gestation(days)")
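To make the link between gestation and birth weight concrete, the figure quoted above (about 3.2 ounces gained per week late in pregnancy) can be turned into a rough conversion. The sketch below is a back-of-envelope calculation rather than a fitted model; the function name `weight.gap` and the one-week example are hypothetical.

```r
# Weekly weight gain in the final weeks, from the text: 0.2 lb = 3.2 oz.
oz.per.week <- 0.2 * 16

# Rough birth-weight gap (ounces) implied by a gestation gap in days,
# assuming the weekly gain stays constant over that period.
weight.gap <- function(gestation.gap.days) {
  gestation.gap.days / 7 * oz.per.week
}

weight.gap(7)  # a one-week gap corresponds to about 3.2 ounces
```

At this rate, the 9.3-ounce difference between the two group means would correspond to roughly 2.9 weeks of gestation.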
The box plots confirm our observation from the overlapping histogram that babies of nonsmoking mothers tend to have a higher birth weight than babies of smoking mothers, as shown by the higher median line. The upper and lower whiskers of the two box plots suggest that the birth weights of babies of smoking mothers are more spread out than those of babies of nonsmoking mothers.
boxplot(bwt~smoke,data, main = 'Boxplot of smoker and nonsmoker', names = c("nonsmoker", "smoker"), ylab=('birth weight(ounce)'))
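The spread comparison read off the box plots can be cross-checked against the quartiles already reported. A minimal sketch, using only the quartiles from the summary() output earlier in this report (the variable names are introduced here for illustration):

```r
# Quartiles copied from the summary() output earlier in the report.
smoker.q    <- c(q1 = 101, median = 115, q3 = 126)
nonsmoker.q <- c(q1 = 113, median = 123, q3 = 134)

# Interquartile range, approximately the height of each box.
smoker.iqr    <- unname(smoker.q["q3"] - smoker.q["q1"])
nonsmoker.iqr <- unname(nonsmoker.q["q3"] - nonsmoker.q["q1"])
c(smoker = smoker.iqr, nonsmoker = nonsmoker.iqr)
```

The interquartile range is 25 ounces for smokers and 21 ounces for nonsmokers, which agrees with the wider spread seen for smoking mothers.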
We also drew a Q-Q plot to check whether the birth weights for smoking and nonsmoking mothers come from populations with a common distribution. If they did, the points would lie close to the reference line. Instead, most of the points lie above the reference line, meaning that the nonsmoker quantiles exceed the corresponding smoker quantiles, so the two datasets do not appear to come from populations with the same distribution. This result is again consistent with our prior assumption that smoking during pregnancy affects babies’ birth weight.
qqplot(data.smoker$bwt,data.nonsmoker$bwt, xlab=('smoker birth weight(ounce)'), ylab=('nonsmoker birth weight(ounce)'))
abline(c(0,1)) # reference line
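A two-sample Q-Q plot pairs up the quantiles of the two samples at matching probabilities. The sketch below shows the same idea numerically on synthetic data, using deciles; the simulated vectors `x` and `y` are stand-ins and are not the real birth weights.

```r
set.seed(2)
x <- rnorm(459, mean = 114, sd = 18)  # synthetic "smoker" weights
y <- rnorm(715, mean = 123, sd = 17)  # synthetic "nonsmoker" weights

# Quantiles of both samples at the same probabilities.
probs <- seq(0.1, 0.9, by = 0.1)
qx <- quantile(x, probs)
qy <- quantile(y, probs)

# Share of deciles where the y-sample quantile exceeds the x-sample one;
# a point above the line y = x in the Q-Q plot corresponds to TRUE here.
mean(qy > qx)
```

A value near 1 means nearly all points lie above the reference line, which is the pattern described for the real data.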
Having analyzed the problem numerically and graphically, we now estimate the probability of a low-birth-weight baby from our dataset. From our research, we learned that a baby born weighing less than 5.5 lb (88 ounces) is often considered low birth weight (reference: http://www.stanfordchildrens.org/en/topic/default?id=low-birthweight-90-P02382), so we use 88 ounces as the threshold for assessing the probability of low-weight birth.
Out of all the 459 babies from smoking mothers, 36 of them are considered as low birth weight.
Out of all the 715 babies from nonsmoking mothers, 21 of them are considered as low birth weight.
# 5.5 lb == 88 ounce, which is considered as low weight birth
A = sum(data.smoker$bwt<88)
B = sum(data.smoker$bwt>=88)
C = sum(data.nonsmoker$bwt<88)
D = sum(data.nonsmoker$bwt>=88)
tbl <- matrix(c(A,C,B,D),ncol=2)
tbl
     [,1] [,2]
[1,]   36  423
[2,]   21  694
smoker.lbr = A/(A+B)
nonsmoker.lbr = C/(C+D)
The probability of low-weight birth from smoking mothers is 0.078
smoker.lbr
[1] 0.07843137
The probability of low-weight birth from nonsmoking mothers is 0.029
nonsmoker.lbr
[1] 0.02937063
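These estimates depend on how many babies are classified as low birth weight: if a higher cutoff placed more babies in that category, the estimated probability would increase, and a lower cutoff would have the opposite effect.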
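How strongly the estimate reacts to the classification cutoff can be checked by recomputing the rate at a few nearby thresholds. The sketch below does this on synthetic data; the vector `bwt`, the helper `low.rate`, and the alternative cutoffs of 80 and 96 ounces are assumptions for illustration.

```r
set.seed(3)
# Synthetic birth weights in ounces (illustration only).
bwt <- rnorm(500, mean = 120, sd = 18)

# Fraction of babies below a given cutoff.
low.rate <- function(x, threshold) mean(x < threshold)

thresholds <- c(80, 88, 96)
rates <- sapply(thresholds, function(t) low.rate(bwt, t))
rates  # non-decreasing as the threshold rises
```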
From our dataset, we estimate that the probability of low-weight birth for nonsmoking mothers (0.029) is lower than the probability for smoking mothers (0.078).
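The size of this gap can be summarized as a ratio of the two estimated probabilities, together with a rough measure of their sampling uncertainty. The sketch below uses only the counts from the 2x2 table above; the binomial standard error formula is a standard approximation added here, not part of the original analysis.

```r
# Counts from the 2x2 table above.
smoker.low    <- 36; smoker.n    <- 459
nonsmoker.low <- 21; nonsmoker.n <- 715

p.smoker    <- smoker.low / smoker.n        # about 0.078
p.nonsmoker <- nonsmoker.low / nonsmoker.n  # about 0.029

# Ratio of the two estimated probabilities (relative risk).
risk.ratio <- p.smoker / p.nonsmoker
risk.ratio

# Binomial standard errors of the two proportions.
se.smoker    <- sqrt(p.smoker * (1 - p.smoker) / smoker.n)
se.nonsmoker <- sqrt(p.nonsmoker * (1 - p.nonsmoker) / nonsmoker.n)
c(se.smoker, se.nonsmoker)
```

In this sample, the estimated probability of low-weight birth for smoking mothers is about 2.7 times that for nonsmoking mothers.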
To assess the reliability of our estimates of the low-weight rates, we fit our dataset to a statistical model and use the model to predict the probability of low-weight birth. Our initial assumption is that the data follow a normal distribution. To verify this assumption, we draw two normal Q-Q plots, one for the birth weights from smoking mothers and one for those from nonsmoking mothers. Despite minor deviations at the two tails, the Q-Q plots show that most of the data points fall on the reference line. We therefore treat a normal distribution as a reasonable statistical model for our data. To build the model, we use the mean and standard deviation from the numerical analysis. The mean and standard deviation for the smoking-mother dataset are 113.8 and 18.29 respectively, while those for the nonsmoking-mother dataset are 123.1 and 17.42.
qqnorm(data.smoker$bwt, main = ('Normal Q-Q Plot for smoker'))
qqline(data.smoker$bwt)
qqnorm(data.nonsmoker$bwt, main = ('Normal Q-Q Plot for nonsmoker'))
qqline(data.nonsmoker$bwt)
n <- seq(50, 200, length=3000)
simulated.smoker <- dnorm(n, mean = 113.8, sd = 18.29501)
simulated.nonsmoker <- dnorm(n, mean = 123.1 , sd = 17.4237)
plot(n,simulated.smoker,xlab = "birth weight", ylab = "probability density", main = "Distribution of birth weight from smoking mothers")
plot(n,simulated.nonsmoker,xlab = "birth weight", ylab = "probability density", main = "Distribution of birth weight from nonsmoking mothers")
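The model parameters above are typed in by hand. A minimal sketch of an alternative, which estimates them directly from a sample and overlays the fitted density on the histogram, is given here. The vector `bwt` is synthetic and stands in for `data.smoker$bwt`; the names `mu` and `sigma` are introduced for illustration.

```r
set.seed(4)
# Synthetic stand-in for data.smoker$bwt (not the real data).
bwt <- rnorm(459, mean = 114, sd = 18)

# Estimate the model parameters from the sample itself.
mu    <- mean(bwt)
sigma <- sd(bwt)

# Histogram on the density scale with the fitted normal curve on top.
hist(bwt, freq = FALSE, xlab = "birth weight(ounce)",
     main = "Histogram with fitted normal density")
curve(dnorm(x, mean = mu, sd = sigma), add = TRUE, lwd = 2)
```

Computing the parameters from the data avoids rounding the mean before it is passed to dnorm() or pnorm().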
We now assess our estimates by calculating the probability of a low-weight birth from the models above. To do that, we calculate the probability that the birth weight is 88 ounces or less. As shown below, our model predicts that the probability of a low-weight birth is 0.079 for a smoking mother and 0.022 for a nonsmoking mother. These values are close to the probabilities estimated directly from our dataset (0.078 and 0.029), with the closer agreement for smoking mothers.
simulated.birthrate.smoker <- pnorm(88, mean = 113.8, sd = 18.29501, lower.tail=TRUE)
simulated.birthrate.nonsmoker <- pnorm(88, mean = 123.1 , sd = 17.4237, lower.tail=TRUE)
simulated.birthrate.smoker
[1] 0.07923728
simulated.birthrate.nonsmoker
[1] 0.02197866
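The pnorm() calls above are equivalent to standardizing the threshold and looking up the standard normal lower tail. The sketch below works through that step with the same parameters and then compares the model-based probabilities with the empirical rates from the 2x2 table; the variable names are introduced for illustration.

```r
threshold <- 88  # ounces

# Standardize the threshold with the parameters used in the pnorm() calls.
z.smoker    <- (threshold - 113.8) / 18.29501
z.nonsmoker <- (threshold - 123.1) / 17.4237

# pnorm() on the standardized value gives the same lower-tail probability.
p.smoker    <- pnorm(z.smoker)
p.nonsmoker <- pnorm(z.nonsmoker)

# Model-based minus empirical estimates (counts from the 2x2 table).
c(smoker = p.smoker - 36 / 459, nonsmoker = p.nonsmoker - 21 / 715)
```

The two approaches nearly coincide for smoking mothers, while for nonsmoking mothers the normal model gives a somewhat lower value than the observed rate (about 0.022 versus 0.029).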
The findings of the numerical and graphical analysis indicate a clear association between the birth weights of the babies and whether or not their mothers smoked during pregnancy. Babies of smoking mothers have a lower average birth weight than babies of nonsmoking mothers (113.8 versus 123.1 ounces). Our graphical analysis also indicates that the variance of birth weight is lower for nonsmoking mothers than for smoking mothers. We conclude that babies born to nonsmokers have a higher birth weight in general. Our analysis of gestation also indicates that mothers who smoke tend to have a shorter gestational period, which likely contributes to the lower birth weight of their newborns. Lastly, the normal distribution model we built predicts that the probability of low-weight birth is 0.022 for nonsmoking mothers and 0.079 for smoking mothers. Therefore, we conclude that the difference in birth weight between the two groups is large enough to be important.
The code is embedded in the text.