StatsCoursework1

a) Exploratory analysis

setwd("C:/Users/user/Downloads")
BD <- read.csv(file="BirthData.csv", header=TRUE)
summary(BD)

##       bwt             age           race          smoke    
##  Min.   :0.709   Min.   :14.00   black:26   NonSmoker:115  
##  1st Qu.:2.414   1st Qu.:19.00   other:67   Smoker   : 74  
##  Median :2.977   Median :23.00   white:96                  
##  Mean   :2.945   Mean   :23.24                             
##  3rd Qu.:3.487   3rd Qu.:26.00                             
##  Max.   :4.990   Max.   :45.00                             
##       ptl           ht        ui           ftv              mwt        
##  Min.   :0.0000   No :177   No :161   Min.   :0.0000   Min.   : 36.29  
##  1st Qu.:0.0000   Yes: 12   Yes: 28   1st Qu.:0.0000   1st Qu.: 49.90  
##  Median :0.0000                       Median :0.0000   Median : 54.88  
##  Mean   :0.1958                       Mean   :0.7937   Mean   : 58.88  
##  3rd Qu.:0.0000                       3rd Qu.:1.0000   3rd Qu.: 63.50  
##  Max.   :3.0000                       Max.   :6.0000   Max.   :113.40

This is a summary of the birth data, BD, for the variables provided. It contains continuous variables (mwt, bwt), discrete variables (age, ftv, ptl) and categorical variables (race, smoke, ht, ui).

library(ggplot2)
qplot(BD$age,BD$bwt)

A scatterplot of birth weight against the mother's age. No clear correlation is visible in the plot. The birth weight of about 5 kg is an obvious outlier.

boxplot(bwt~race, data=BD, xlab="mother's race",ylab="baby's birth weight in kg", main="race in relation to birth weight")

The box plot of birth weight by race suggests that babies of white mothers are heavier than those of the other groups and also have a greater range of weights. The wider range could simply reflect the larger number of white women sampled (96, against 26 black and 67 other), so the difference in means should be tested formally.

 boxplot(bwt~smoke, data=BD, col=(c("white","grey")), xlab="smoker status",ylab="baby's birth weight in kg", main="smoker status in relation to birth weight")

These boxplots suggest that smokers have lighter babies. Both groups are reasonably large (115 non-smokers and 74 smokers), so the comparison should be reliable.

qplot(factor(ptl),bwt, data=BD, geom=c("boxplot","jitter"), xlab="no. of premature labours", ylab= "baby's birth weight in kg", main="premature labours in relation to birth weight" )

These boxplots categorise the data by number of previous premature labours. The jittered points show that most observations fall in the 0 and 1 categories; using only these, it would be easy to conclude that having had premature labours results in a lighter child at birth. But the other categories exist, and with so little data for 1, 2 or 3 premature labours we cannot draw that conclusion definitively.

qplot(ht,bwt, data=BD, geom=c("boxplot"), xlab="history of hypertension",ylab="baby's birth weight in kg")

The lower quartile of the "No" group is nearly equal to the median of the "Yes" group, so this plot suggests that a history of hypertension is associated with a lighter baby at birth, although only 12 mothers are in the "Yes" group.

boxplot(bwt~ui, data=BD, main="uterine irritability in relation to birth weight", xlab="uterine irritability",ylab="baby's birth weight in kg", col=c("red","green"))

The lower quartile of the "No" group is nearly equal to the median of the "Yes" group, so this plot suggests that the presence of uterine irritability is associated with a lighter baby at birth.

boxplot(bwt~ftv, data=BD, main="GP visits in relation to birth weight", xlab="GP visits in first trimester",ylab="baby's birth weight in kg",  col=c("darkgreen","green", "yellow",  "orange", "red", "blue"))

For GP visits in relation to the baby's birth weight, the median weight is similar regardless of the number of visits, so we cannot really conclude that the mean changes. However, if the group with 3 GP visits is treated as anomalous, the rest of the data show a small positive correlation between birth weight and GP visits.

qplot(mwt,bwt,data=BD,geom=c("point","smooth"),method=lm)

The scatterplot of birth weight against the mother's weight at last menstrual period, with a fitted line of best fit, shows a positive correlation: a lower weight of the mother appears to be associated with a lower weight of the baby.

b) Effect of smoking on birth weight

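A one-way ANOVA model that could be used to examine the effect of smoking on birth weight is: \[ y = \alpha + \beta x + \epsilon \] where y is the baby's birth weight, x is an indicator equal to 1 if the mother smokes and 0 otherwise, alpha is the mean birth weight for non-smokers, beta is the difference in mean birth weight between smokers and non-smokers, and epsilon is the random error. The null hypothesis, H_0, is that smoking has no effect on birth weight, i.e. beta = 0.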

fit1 <- lm(bwt~smoke,data=BD)
fit0 <- lm(bwt~1,data=BD)
summary(fit1)

## 
## Call:
## lm(formula = bwt ~ smoke, data = BD)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0629 -0.4759  0.0343  0.5451  1.9343 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.05570    0.06693  45.653  < 2e-16 ***
## smokeSmoker -0.28378    0.10697  -2.653  0.00867 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7178 on 187 degrees of freedom
## Multiple R-squared:  0.03627,    Adjusted R-squared:  0.03112 
## F-statistic: 7.038 on 1 and 187 DF,  p-value: 0.008667

The table above summarises the fitted ANOVA model. The estimated difference in mean birth weight between smoking and non-smoking mothers is -0.28378 kg, i.e. smokers' babies are about 0.28 kg lighter on average.
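
As a quick sanity check on how the coefficients map to group means, the sketch below uses only numbers copied from the printed output above (no data file is read). The variable names are illustrative, not part of the original analysis.

```r
# Values copied from the summary(fit1) and summary(BD) output above.
intercept   <- 3.05570   # estimated mean birth weight (kg) for non-smokers
smoke_coef  <- -0.28378  # estimated smoker minus non-smoker difference (kg)
n_nonsmoker <- 115       # group sizes from summary(BD)
n_smoker    <- 74

# With a single 0/1 predictor, the fitted values are just the two group means.
mean_nonsmoker <- intercept
mean_smoker    <- intercept + smoke_coef

# The size-weighted average of the group means should recover the overall mean.
overall_mean <- (n_nonsmoker * mean_nonsmoker + n_smoker * mean_smoker) /
  (n_nonsmoker + n_smoker)

print(c(nonsmoker = mean_nonsmoker, smoker = mean_smoker, overall = overall_mean))
```

The weighted average should agree, up to rounding, with the mean of bwt reported by summary(BD).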

anova(fit0,fit1)

## Analysis of Variance Table
## 
## Model 1: bwt ~ 1
## Model 2: bwt ~ smoke
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
## 1    188 99.970                                
## 2    187 96.344  1    3.6259 7.0378 0.008667 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The analysis of variance table above gives F = 7.0378 with a p-value of 0.008667. Since the p-value is below 0.01 we reject H_0 (beta = 0): there is strong evidence that smoking has an effect on birth weight.
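
To show where the F statistic comes from, here is a minimal sketch that recomputes it from the residual sums of squares printed in the table. The inputs are the rounded values from the output, so the result only matches to a few decimal places.

```r
# Residual sums of squares and degrees of freedom from anova(fit0, fit1).
rss0 <- 99.970; df0 <- 188   # null model: bwt ~ 1
rss1 <- 96.344; df1 <- 187   # alternative model: bwt ~ smoke

# F = (drop in RSS per extra parameter) / (residual variance of larger model).
f_stat  <- ((rss0 - rss1) / (df0 - df1)) / (rss1 / df1)

# Upper-tail probability of the F distribution gives the p-value.
p_value <- pf(f_stat, df0 - df1, df1, lower.tail = FALSE)

print(c(F = f_stat, p = p_value))
```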

confint(fit1)

##                  2.5 %      97.5 %
## (Intercept)  2.9236543  3.18773697
## smokeSmoker -0.4947973 -0.07275612

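The above gives a 95% confidence interval of (-0.4947973, -0.07275612) for the smoking effect. We are 95% confident that babies of smokers are, on average, between about 0.07 kg and 0.49 kg lighter than babies of non-smokers; the interval excludes zero, which agrees with the test above.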
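
The interval can be reproduced by hand from the estimate and standard error in the coefficient table. This is a sketch using the rounded printed values, so the last digits may differ slightly from confint().

```r
# Estimate, standard error and residual df for smokeSmoker from summary(fit1).
estimate  <- -0.28378
std_error <- 0.10697
df_resid  <- 187

# Two-sided 95% interval: estimate +/- t quantile * standard error.
t_crit <- qt(0.975, df_resid)
ci     <- estimate + c(-1, 1) * t_crit * std_error

print(ci)
```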

c) Effect of race on birth weight

A one-way ANOVA model that could be used to examine the effect of race on birth weight is \[ y = \alpha + \beta x_1 + \gamma x_2 + \epsilon \] where y is the baby's birth weight, x_1 and x_2 are indicators equal to 1 if the mother's race is white or other respectively (and 0 otherwise), alpha is the mean birth weight for the baseline group (black), beta and gamma are the differences from the baseline for white and other, and epsilon is the random error. The null hypothesis is that race has no effect on birth weight, i.e. beta = gamma = 0.

fit2 <- lm(bwt~race,data=BD)
anova(fit0,fit2)

## Analysis of Variance Table
## 
## Model 1: bwt ~ 1
## Model 2: bwt ~ race
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
## 1    188 99.970                                
## 2    186 94.954  2    5.0157 4.9125 0.008336 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value is 0.008336, so p-value<0.01. There is strong evidence that race has an effect on birth weight.

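The effect of race on birth weight while accounting for the mother's weight at last menstrual period can be modelled using the ANCOVA model: \[ y = \alpha + \beta x_1 + \gamma x_2 + \kappa w + \epsilon \] where y is the baby's birth weight, x_1 and x_2 are indicators for white and other race, w is the mother's weight at last menstrual period (mwt), alpha is the intercept for the baseline group (black), beta and gamma are the race effects, kappa is the slope for the mother's weight, and epsilon is the random error.

The comparison below adds mwt to the race-only model, so its null hypothesis is kappa = 0, or in other words the mother's weight adds nothing once race is in the model. (Testing beta = gamma = 0, i.e. no race effect given the mother's weight, would instead compare bwt ~ mwt with bwt ~ race + mwt.)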

fit3 <- lm(bwt~race + mwt,data=BD)
anova(fit2,fit3)

## Analysis of Variance Table
## 
## Model 1: bwt ~ race
## Model 2: bwt ~ race + mwt
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
## 1    186 94.954                                
## 2    185 91.445  1    3.5087 7.0984 0.008397 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value is 0.008397, so p-value<0.01. There is strong evidence against the race-only model, so a model that controls for the mother's weight at last menstrual period should be used. We can now examine whether the effects of race and pre-pregnancy weight on birth weight are simply additive or whether there is an interaction between them.

mothersweight <- cut(BD$mwt, 3, labels=c("low", "medium", "high"))
interaction.plot(BD$race, mothersweight, BD$bwt)

The plot suggests an interaction between race and the mother's weight in their effect on the baby's birth weight. It also seems that, regardless of race, a heavier mother is less likely to have a light baby.

d) Predicting birth weight

fit<-lm(bwt~age+race+smoke+ptl+ht+ui+ftv+mwt,data=BD)
coef(fit)

##  (Intercept)          age    raceother    racewhite  smokeSmoker 
##  2.439609654 -0.003569056  0.133330349  0.488416745 -0.352044717 
##          ptl        htYes        uiYes          ftv          mwt 
## -0.048410887 -0.592799869 -0.516083406 -0.014055959  0.009597505

These are the coefficients of a linear model fitted using all 8 explanatory variables in the data. To find the model's prediction of birth weight for a given mother, each coefficient is multiplied by the corresponding variable value and the products are summed, together with the intercept. For a categorical variable each indicator takes the value 1 or 0, e.g. for a white woman: 0 x 0.1333 (raceother) + 1 x 0.4884 (racewhite).
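
The multiply-and-sum rule can be sketched with the printed coefficients. The mother described here (age 25, white, smoker, no previous premature labours, no hypertension, no uterine irritability, 1 GP visit, 60 kg) is a hypothetical example, not a row from the data.

```r
# Coefficients copied from the coef(fit) output above.
coefs <- c("(Intercept)" = 2.439609654, age = -0.003569056,
           raceother = 0.133330349, racewhite = 0.488416745,
           smokeSmoker = -0.352044717, ptl = -0.048410887,
           htYes = -0.592799869, uiYes = -0.516083406,
           ftv = -0.014055959, mwt = 0.009597505)

# Hypothetical mother: indicators are 1 when the category applies, else 0.
x <- c("(Intercept)" = 1, age = 25,
       raceother = 0, racewhite = 1,
       smokeSmoker = 1, ptl = 0,
       htYes = 0, uiYes = 0,
       ftv = 1, mwt = 60)

# Prediction = sum of coefficient * value, matched by name.
predicted_bwt <- sum(coefs * x[names(coefs)])
print(predicted_bwt)
```

With the fitted model object available, predict(fit, newdata = ...) would do the same calculation directly from a data frame of covariates.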

library(visreg)
visreg(fit)

Visualising the model's predicted values alongside the data points shows that the mother's age, number of GP visits and number of premature labours have a negligible effect on the baby's birth weight. This agrees with the earlier findings and with the fact that their coefficients are close to zero, e.g. the magnitude of the age coefficient is less than 0.01.

summary(fit)

## 
## Call:
## lm(formula = bwt ~ age + race + smoke + ptl + ht + ui + ftv + 
##     mwt, data = BD)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.82529 -0.43522  0.05592  0.47346  1.70119 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.439610   0.327240   7.455 3.71e-12 ***
## age         -0.003569   0.009620  -0.371 0.711081    
## raceother    0.133330   0.159394   0.836 0.403998    
## racewhite    0.488417   0.149985   3.256 0.001350 ** 
## smokeSmoker -0.352045   0.106477  -3.306 0.001142 ** 
## ptl         -0.048411   0.101972  -0.475 0.635546    
## htYes       -0.592800   0.202322  -2.930 0.003831 ** 
## uiYes       -0.516083   0.138886  -3.716 0.000271 ***
## ftv         -0.014056   0.046468  -0.302 0.762634    
## mwt          0.009598   0.003826   2.508 0.013022 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6503 on 179 degrees of freedom
## Multiple R-squared:  0.2427, Adjusted R-squared:  0.2047 
## F-statistic: 6.375 on 9 and 179 DF,  p-value: 7.898e-08

The t-tests for each term give p-value>0.1 for ptl, age and ftv, so there is no evidence to reject H_0 for these variables. Therefore, once the other variables are included, we drop ptl, age and ftv, as there is no evidence that they affect the baby's weight.
Further fits are now summarised. They show that, so long as race, smoke, ht, ui and mwt are in the model, adding ptl, age or ftv individually does not make any of them significant.

fitcheck<-lm(bwt~race+smoke+ht+ui+mwt+ptl,data=BD)
summary(fitcheck)

## 
## Call:
## lm(formula = bwt ~ race + smoke + ht + ui + mwt + ptl, data = BD)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.86018 -0.43778  0.06402  0.46772  1.62728 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.377007   0.283550   8.383 1.41e-14 ***
## raceother    0.128818   0.157949   0.816  0.41582    
## racewhite    0.474477   0.145896   3.252  0.00137 ** 
## smokeSmoker -0.346390   0.105334  -3.288  0.00121 ** 
## htYes       -0.582258   0.200116  -2.910  0.00407 ** 
## uiYes       -0.510686   0.137826  -3.705  0.00028 ***
## mwt          0.009164   0.003718   2.465  0.01464 *  
## ptl         -0.053232   0.100553  -0.529  0.59718    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6472 on 181 degrees of freedom
## Multiple R-squared:  0.2416, Adjusted R-squared:  0.2122 
## F-statistic: 8.235 on 7 and 181 DF,  p-value: 1.044e-08

fitcheck<-lm(bwt~race+smoke+ht+ui+mwt+age,data=BD)
summary(fitcheck)

## 
## Call:
## lm(formula = bwt ~ race + smoke + ht + ui + mwt + age, data = BD)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.81988 -0.45286  0.05985  0.46043  1.72614 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.442903   0.324957   7.518 2.49e-12 ***
## raceother    0.134004   0.158559   0.845 0.399151    
## racewhite    0.490628   0.149189   3.289 0.001210 ** 
## smokeSmoker -0.360710   0.104028  -3.467 0.000656 ***
## htYes       -0.590003   0.200290  -2.946 0.003645 ** 
## uiYes       -0.528539   0.135088  -3.913 0.000129 ***
## mwt          0.009689   0.003763   2.575 0.010828 *  
## age         -0.004672   0.009337  -0.500 0.617449    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6473 on 181 degrees of freedom
## Multiple R-squared:  0.2414, Adjusted R-squared:  0.2121 
## F-statistic:  8.23 on 7 and 181 DF,  p-value: 1.059e-08

fitcheck<-lm(bwt~race+smoke+ht+ui+mwt+ftv,data=BD)
summary(fitcheck)

## 
## Call:
## lm(formula = bwt ~ race + smoke + ht + ui + mwt + ftv, data = BD)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.82258 -0.43234  0.05898  0.44628  1.63214 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.365138   0.282398   8.375 1.48e-14 ***
## raceother    0.125983   0.157987   0.797 0.426247    
## racewhite    0.477449   0.146091   3.268 0.001295 ** 
## smokeSmoker -0.358019   0.103789  -3.449 0.000699 ***
## htYes       -0.593118   0.201251  -2.947 0.003630 ** 
## uiYes       -0.527647   0.135115  -3.905 0.000133 ***
## mwt          0.009540   0.003738   2.553 0.011520 *  
## ftv         -0.016969   0.045510  -0.373 0.709678    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6475 on 181 degrees of freedom
## Multiple R-squared:  0.241,  Adjusted R-squared:  0.2116 
## F-statistic: 8.209 on 7 and 181 DF,  p-value: 1.114e-08

fitcheck<-lm(bwt~race+smoke+ht+ui+mwt,data=BD)
summary(fitcheck)

## 
## Call:
## lm(formula = bwt ~ race + smoke + ht + ui + mwt, data = BD)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8422 -0.4332  0.0671  0.4592  1.6310 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.362296   0.281626   8.388 1.33e-14 ***
## raceother    0.126888   0.157594   0.805 0.421781    
## racewhite    0.475050   0.145604   3.263 0.001318 ** 
## smokeSmoker -0.356324   0.103444  -3.445 0.000710 ***
## htYes       -0.585168   0.199645  -2.931 0.003812 ** 
## uiYes       -0.525530   0.134676  -3.902 0.000134 ***
## mwt          0.009350   0.003694   2.531 0.012212 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6459 on 182 degrees of freedom
## Multiple R-squared:  0.2404, Adjusted R-squared:  0.2153 
## F-statistic: 9.599 on 6 and 182 DF,  p-value: 3.604e-09

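Having checked that ptl, age and ftv indeed make no difference, the model without them suggests that race should also be collapsed into just white and non-white: the raceother coefficient is not significantly different from the baseline (p = 0.42), and merging the two groups should give a smaller p-value for the remaining race term => stronger evidence to reject H_0.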
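
The case for merging the black and other groups rests on the raceother t-test in the last summary. As a rough sketch, the p-value can be recomputed from the printed estimate and standard error; for a single coefficient, the squared t statistic is also the F statistic for dropping that term.

```r
# raceother row from the summary of bwt ~ race + smoke + ht + ui + mwt.
estimate  <- 0.126888
std_error <- 0.157594
df_resid  <- 182

t_value <- estimate / std_error

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_value), df_resid)

# Equivalent F statistic for removing this one coefficient.
f_equiv <- t_value^2

print(c(t = t_value, p = p_value, F = f_equiv))
```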

BD$racecat<-BD$race
levels(BD$racecat) <- c("Non-white","Non-white","White")
fit1<-lm(bwt~racecat+smoke+ht+ui+mwt,data=BD)
summary(fit1)

## 
## Call:
## lm(formula = bwt ~ racecat + smoke + ht + ui + mwt, data = BD)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.85304 -0.46025  0.02616  0.45070  1.62051 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.504373   0.219278  11.421  < 2e-16 ***
## racecatWhite  0.389700   0.099721   3.908 0.000131 ***
## smokeSmoker  -0.370290   0.101881  -3.635 0.000362 ***
## htYes        -0.584402   0.199451  -2.930 0.003821 ** 
## uiYes        -0.522546   0.134496  -3.885 0.000143 ***
## mwt           0.008521   0.003544   2.404 0.017198 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6453 on 183 degrees of freedom
## Multiple R-squared:  0.2377, Adjusted R-squared:  0.2169 
## F-statistic: 11.41 on 5 and 183 DF,  p-value: 1.347e-09

The new p-value for race is now <0.001, so there is definitely strong evidence to reject H_0 for race in the new model.

coef(fit1)

##  (Intercept) racecatWhite  smokeSmoker        htYes        uiYes 
##  2.504373370  0.389700191 -0.370290361 -0.584402246 -0.522546226 
##          mwt 
##  0.008521483

With a model intercept of about 2.5 kg, we treat babies predicted to be below this weight as light. By using the coefficients in the worst-case and best-case scenarios for each factor, we can create the advice for GPs.

“Dear GPs, please use the following for a baby’s expected birth weight: start with the mother’s weight at last menstrual period in kg; if she is white add 46; if she smokes subtract 43; if she suffers or has suffered from hypertension subtract 69; and subtract 61 if uterine irritability is present. A low total means the baby is more likely to be of low weight. We remind GPs to be sensitive, patient and polite with all patients, as a patient’s number of visits is found to be only for the mother’s peace of mind.”
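
The point values in the advice appear to be the model coefficients expressed in units of the mwt coefficient, i.e. in "kg of mother's weight". The sketch below reproduces them from the printed coef(fit1) output under that assumption; the example mother (white, smoker, 60 kg) is hypothetical.

```r
# Coefficients copied from the coef(fit1) output above.
intercept <- 2.504373370
coefs <- c(racecatWhite = 0.389700191, smokeSmoker = -0.370290361,
           htYes = -0.584402246, uiYes = -0.522546226, mwt = 0.008521483)

# Rescale every coefficient by the mwt coefficient and round to whole points.
points <- round(coefs / coefs["mwt"])
print(points)

# Hypothetical mother: 60 kg, white, smoker, no hypertension, no irritability.
score <- 60 + points["racecatWhite"] + points["smokeSmoker"]

# Converting the score back gives an approximate predicted birth weight (kg).
approx_bwt <- unname(intercept + coefs["mwt"] * score)
print(approx_bwt)
```

Because the points are rounded, the weight recovered from the score is only an approximation to the model's exact prediction.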

Through this advice GPs can advise expectant mothers on potential lifestyle changes, e.g. giving up smoking.

Stephen Pearce, 4203897

Sunday, March 14, 2015