1. Setting

Mothers and Length of Breastfeeding

The length of breastfeeding was recorded along with several factors pertaining to the mother's pregnancy and lifestyle. The data consist of 927 observations on 9 variables, plus a row index X:

1. duration: Number of weeks breastfeeding
2. race: Race of mother (1=white, 2=black, 3=other)
3. poverty: Mother in poverty (1=yes, 0=no)
4. smoke: Mother smoked at birth of child (1=yes, 0=no)
5. alcohol: Mother used alcohol at child birth (1=yes, 0=no)
6. agemth: Age of mother at child birth
7. ybirth: Year of birth
8. yschool: Education level of mother (years in school)
9. pc3mth: Prenatal care after 3rd month

bfeed = read.csv("bfeed.csv")

Below are the first 6 observations:

head(bfeed)

##   X duration race poverty smoke alcohol agemth ybirth yschool pc3mth
## 1 1       16    1       0     0       1     24     82      14      0
## 2 2        1    1       0     1       0     26     85      12      0
## 3 3        4    1       0     0       0     25     85      12      0
## 4 4        3    1       0     1       1     21     85       9      0
## 5 5       36    1       0     1       0     22     82      12      0
## 6 6       36    1       0     0       0     18     82      11      0

Factors and levels

The following four factors will be used in the model:

1. race: 3 levels
2. poverty: 2 levels
3. smoke: 2 levels
4. agemth: 14 levels

str(bfeed)

## 'data.frame':    927 obs. of  10 variables:
##  $ X       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ duration: int  16 1 4 3 36 36 16 8 20 44 ...
##  $ race    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ poverty : int  0 0 0 0 0 0 1 0 1 0 ...
##  $ smoke   : int  0 1 0 1 1 0 1 1 0 0 ...
##  $ alcohol : int  1 0 0 1 0 0 0 0 0 0 ...
##  $ agemth  : int  24 26 25 21 22 18 20 24 24 24 ...
##  $ ybirth  : int  82 85 85 85 82 82 81 85 85 82 ...
##  $ yschool : int  14 12 12 9 12 11 9 12 12 14 ...
##  $ pc3mth  : int  0 0 0 0 0 0 0 0 0 0 ...

As you can see above, the factors are currently defined as integers. We will change them to be factors before we start our analysis.

bfeed$race <- as.factor(bfeed$race)
str(bfeed$race)

##  Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...

bfeed$poverty <- as.factor(bfeed$poverty)
str(bfeed$poverty)

##  Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 2 1 ...

bfeed$smoke <- as.factor(bfeed$smoke)
str(bfeed$smoke)

##  Factor w/ 2 levels "0","1": 1 2 1 2 2 1 2 2 1 1 ...

bfeed$agemth <- as.factor(bfeed$agemth)
str(bfeed$agemth)

##  Factor w/ 14 levels "15","16","17",..: 10 12 11 7 8 4 6 10 10 10 ...
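The conversion above keeps the numeric codes as level names. As an optional variation, `factor()` can attach readable labels at the same time. The sketch below uses a small made-up data frame (`toy` is a stand-in, not the real `bfeed` data), with labels taken from the variable descriptions in the Setting section.

```r
# Toy data frame standing in for bfeed (values are made up for illustration)
toy <- data.frame(
  race    = c(1, 2, 3, 1),
  poverty = c(0, 1, 0, 0),
  smoke   = c(1, 0, 0, 1)
)

# Convert with explicit levels and human-readable labels
toy$race    <- factor(toy$race,    levels = 1:3, labels = c("white", "black", "other"))
toy$poverty <- factor(toy$poverty, levels = 0:1, labels = c("no", "yes"))
toy$smoke   <- factor(toy$smoke,   levels = 0:1, labels = c("no", "yes"))

# Inspect the result
str(toy)
```

With labels attached, box plots and ANOVA tables show "white", "black", "other" instead of 1, 2, 3, which removes the need to spell out the coding in plot titles.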

The levels are shown below.

levels(bfeed$race)

## [1] "1" "2" "3"

levels(bfeed$poverty)

## [1] "0" "1"

levels(bfeed$smoke)

## [1] "0" "1"

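levels(bfeed$alcohol)

## NULL

levels(bfeed$alcohol) returns NULL because alcohol was never converted to a factor, and it is not one of the four factors in the model. The fourth factor is agemth, whose 14 levels can be listed with levels(bfeed$agemth).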

Continuous Variables

The data set has three continuous variables:

* duration: number of weeks of breastfeeding
* agemth: age of the mother at the child's birth (treated as a 14-level factor in this analysis)
* yschool: years in school
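Because agemth is stored as a factor after the conversion earlier in this section, recovering it as a number takes some care: `as.numeric()` on a factor returns the internal level codes, not the ages. A minimal sketch using the first six agemth values from the `head(bfeed)` output (the variable names here are illustrative):

```r
# First six agemth values, stored as a factor
age_factor <- factor(c(24, 26, 25, 21, 22, 18))

# Internal level codes, NOT the ages: 4 6 5 2 3 1
codes <- as.integer(age_factor)

# Go through the character labels to recover the original ages
age_numeric <- as.numeric(as.character(age_factor))

print(codes)
print(age_numeric)
```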

Response Variable

The response variable for this experiment is duration, the number of weeks that the mother spent breastfeeding.

The Data: How is it organized and what does it look like?

As said above, the data consist of 9 variables and 927 observations (mothers). Again, here are the first six rows of data:

head(bfeed)

##   X duration race poverty smoke alcohol agemth ybirth yschool pc3mth
## 1 1       16    1       0     0       1     24     82      14      0
## 2 2        1    1       0     1       0     26     85      12      0
## 3 3        4    1       0     0       0     25     85      12      0
## 4 4        3    1       0     1       1     21     85       9      0
## 5 5       36    1       0     1       0     22     82      12      0
## 6 6       36    1       0     0       0     18     82      11      0

And here is a general summary of the data:

summary(bfeed)

##        X            duration      race    poverty smoke  
##  Min.   :  1.0   Min.   :  1.00   1:662   0:756   0:657  
##  1st Qu.:232.5   1st Qu.:  4.00   2:117   1:171   1:270  
##  Median :464.0   Median : 10.00   3:148                  
##  Mean   :464.0   Mean   : 16.18                          
##  3rd Qu.:695.5   3rd Qu.: 24.00                          
##  Max.   :927.0   Max.   :192.00                          
##                                                          
##     alcohol            agemth        ybirth         yschool     
##  Min.   :0.00000   21     :135   Min.   :78.00   Min.   : 3.00  
##  1st Qu.:0.00000   20     :123   1st Qu.:80.00   1st Qu.:12.00  
##  Median :0.00000   22     :120   Median :82.00   Median :12.00  
##  Mean   :0.08522   19     :112   Mean   :81.97   Mean   :12.21  
##  3rd Qu.:0.00000   23     : 99   3rd Qu.:84.00   3rd Qu.:13.00  
##  Max.   :1.00000   24     : 84   Max.   :86.00   Max.   :19.00  
##                    (Other):254                                  
##      pc3mth      
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.1769  
##  3rd Qu.:0.0000  
##  Max.   :1.0000  
##

2. (Experimental) Design

How will the experiment be organized and conducted to test the hypothesis?

This experiment uses four factors (race, poverty, smoke, and agemth), each with two or more levels, and examines their main effects and two-way interaction effects on the duration of breastfeeding. The null hypothesis is that the duration of breastfeeding is not affected by race, poverty, smoking, age of the mother, or any two-way interaction of these factors. This experiment is set up as a mixed-effects experiment. A factor is usually called fixed when its levels are the specific levels of interest, and random when its levels are a sample from a larger population of possible levels. By that convention, smoke and poverty are fixed factors (they are binary, so their two levels cover every possibility: you either smoke or do not, and you are in poverty or not), and race is also best treated as fixed. The age of the mother is the one factor whose 14 observed levels do not cover every possible age, so it could be argued to be random. The aov models fitted below treat all four factors as fixed.
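The null hypothesis above (no main effects and no two-way interactions) can also be examined in one model instead of one model per pair. The following is only a sketch on simulated data: the data frame `sim`, the grouped age factor `agegrp`, and the exponential durations are all assumptions made for illustration, not part of the original analysis.

```r
# Simulated stand-in for bfeed (all values are random, for illustration only)
set.seed(1)
n <- 200
sim <- data.frame(
  duration = rexp(n, rate = 1 / 16),
  race     = factor(sample(1:3, n, replace = TRUE)),
  poverty  = factor(sample(0:1, n, replace = TRUE)),
  smoke    = factor(sample(0:1, n, replace = TRUE)),
  agegrp   = factor(sample(c("15-19", "20-24", "25-28"), n, replace = TRUE))
)

# (a + b + c + d)^2 expands to all main effects plus all two-way interactions
fit_all <- aov(duration ~ (race + poverty + smoke + agegrp)^2, data = sim)
tab <- anova(fit_all)
print(tab)
```

The `data =` argument also keeps the row labels short (`race:poverty` rather than `bfeed$race:bfeed$poverty`). Note that `anova()` reports sequential sums of squares, so with unbalanced data the order of the terms affects the results.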

What is the rationale for this design?

This data set was retrieved as is from the internet, so the factors were chosen from those available in the data. Many other factors could have been used in such an experiment, including employment status, number of other children, or the mother's current health status (healthy vs. unhealthy). It is impossible for us to know why the variables that were included were chosen.

Randomize: What is the randomization scheme?

No information supplied with the data describes the randomization scheme. Inspecting the data, the durations do not appear in large groupings, which suggests the observations were collected in a somewhat random order.

Replicate: Are there replicates and/or repeated measures?

In this data set there are no replicates or repeated measures. The only variables that could plausibly have been repeated measures are alcohol and smoking, but each was recorded only once per mother.

Blocking: Did you use blocking?

No blocking was used in this analysis. The data set contains more variables than the four factors included here (alcohol, ybirth, yschool, and pc3mth), and these were simply left out of the models. Any variation in the response variable caused by one of the excluded variables makes that variable a nuisance factor; because it is ignored rather than blocked on, its effect ends up in the residual error.

3. (Statistical) Analysis

(Exploratory Data Analysis) Graphics and Descriptive Summary

Below is the summary of the data again, followed by a histogram of the response variable, duration, to give a better feel for its distribution.

summary(bfeed)

##        X            duration      race    poverty smoke  
##  Min.   :  1.0   Min.   :  1.00   1:662   0:756   0:657  
##  1st Qu.:232.5   1st Qu.:  4.00   2:117   1:171   1:270  
##  Median :464.0   Median : 10.00   3:148                  
##  Mean   :464.0   Mean   : 16.18                          
##  3rd Qu.:695.5   3rd Qu.: 24.00                          
##  Max.   :927.0   Max.   :192.00                          
##                                                          
##     alcohol            agemth        ybirth         yschool     
##  Min.   :0.00000   21     :135   Min.   :78.00   Min.   : 3.00  
##  1st Qu.:0.00000   20     :123   1st Qu.:80.00   1st Qu.:12.00  
##  Median :0.00000   22     :120   Median :82.00   Median :12.00  
##  Mean   :0.08522   19     :112   Mean   :81.97   Mean   :12.21  
##  3rd Qu.:0.00000   23     : 99   3rd Qu.:84.00   3rd Qu.:13.00  
##  Max.   :1.00000   24     : 84   Max.   :86.00   Max.   :19.00  
##                    (Other):254                                  
##      pc3mth      
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.1769  
##  3rd Qu.:0.0000  
##  Max.   :1.0000  
##

hist(bfeed$duration, breaks = 50, main = "Frequencies of Duration of Breast Feeding", xlab = "Duration")

After some trial and error I chose 50 breaks to make the distribution easier to see. Duration is clearly positively skewed, with a few outliers beyond 100 weeks.
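The skew can also be put into a number. The sketch below is not part of the original analysis: it simulates right-skewed durations (`dur` is made-up data) and uses a small hand-written moment-based skewness helper, `skew()`, to compare the raw values with their logarithms.

```r
# Simulated right-skewed durations (illustrative only, not the bfeed data)
set.seed(42)
dur <- round(rexp(500, rate = 1 / 16)) + 1   # +1 keeps every value >= 1 so log() is defined

# Moment-based sample skewness: positive values indicate a long right tail
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew_raw <- skew(dur)
skew_log <- skew(log(dur))

print(c(raw = skew_raw, log = skew_log))
```

On the real data the same helper could be applied to `bfeed$duration`. A log transform is one common way to reduce right skew before fitting an ANOVA, though whether it is appropriate here is a judgment call.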

I will use box plots to examine the main effects visually. The box plots suggest whether a factor may have an effect; statistical significance is tested with ANOVA in the Testing section.

boxplot(bfeed$duration~bfeed$race, xlab = "Race", ylab = "Duration of Breast Feeding", main = "Race: 1= white, 2= black, 3= other")

boxplot(bfeed$duration~bfeed$poverty, xlab = "Poverty", ylab = "Duration of Breast Feeding", main = "Mother in Poverty: 1= Yes, 0= No")

boxplot(bfeed$duration~bfeed$smoke, xlab = "Smoke", ylab = "Duration of Breast Feeding", main = "Mother Smoked at Birth of Child:1= Yes, 0= No")

boxplot(bfeed$duration~bfeed$agemth, xlab = "Age of Mother", ylab = "Duration of Breast Feeding", main = "Age of Mother at Birth of Child")

Based on these box plots, all four factors appear to have some effect on the duration of breastfeeding. The factor with the largest apparent difference between its levels is smoke.
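Group summaries give numbers to go with the box plots. The following sketch uses a made-up data frame (`toy`, with simulated durations and a two-level `smoke` factor) to show how per-level counts, means, and medians can be tabulated with `table()` and `tapply()`.

```r
# Made-up data: 60 non-smokers and 40 smokers with simulated durations
set.seed(7)
toy <- data.frame(
  duration = c(rexp(60, rate = 1 / 18), rexp(40, rate = 1 / 12)),
  smoke    = factor(rep(c(0, 1), times = c(60, 40)))
)

# Per-level sample size, mean, and median of the response
counts  <- table(toy$smoke)
means   <- tapply(toy$duration, toy$smoke, mean)
medians <- tapply(toy$duration, toy$smoke, median)

print(rbind(n = counts, mean = means, median = medians))
```

For a skewed response such as duration, the medians are often more informative than the means, since a few very long durations can pull a group mean upward.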

Testing

I will use ANOVA in this section to test whether each factor has a statistically significant effect on duration. ANOVA compares the variation explained by a factor with the unexplained (residual) variation.

ANOVA1 = aov(bfeed$duration~bfeed$race)
anova(ANOVA1)

## Analysis of Variance Table
## 
## Response: bfeed$duration
##             Df Sum Sq Mean Sq F value Pr(>F)  
## bfeed$race   2   1991  995.47  3.1139 0.0449 *
## Residuals  924 295394  319.69                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ANOVA2 = aov(bfeed$duration~bfeed$poverty)
anova(ANOVA2)

## Analysis of Variance Table
## 
## Response: bfeed$duration
##                Df Sum Sq Mean Sq F value Pr(>F)
## bfeed$poverty   1    378  378.50  1.1788 0.2779
## Residuals     925 297006  321.09

ANOVA3 = aov(bfeed$duration~bfeed$smoke)
anova(ANOVA3)

## Analysis of Variance Table
## 
## Response: bfeed$duration
##              Df Sum Sq Mean Sq F value   Pr(>F)   
## bfeed$smoke   1   2422 2421.85  7.5949 0.005968 **
## Residuals   925 294963  318.88                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ANOVA4 = aov(bfeed$duration~bfeed$agemth)
anova(ANOVA4)

## Analysis of Variance Table
## 
## Response: bfeed$duration
##               Df Sum Sq Mean Sq F value Pr(>F)
## bfeed$agemth  13   5206  400.46  1.2513 0.2374
## Residuals    913 292179  320.02

With these results, we cannot reject the null hypothesis for every factor. Race (p = 0.0449) and smoke (p = 0.005968) have p-values below 0.05, so there is evidence that each affects duration; poverty (p = 0.2779) and agemth (p = 0.2374) do not. The F values for agemth and poverty are not much larger than 1, which means the explained and unexplained variation are similar, so we cannot confidently say that these factors are significant.
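To make the link between the table entries concrete, the F value and p-value for race can be recomputed by hand from the numbers printed in the first ANOVA table. This is a worked check using the rounded table values, so small rounding differences are expected; the variable names are illustrative.

```r
# Values copied from the race ANOVA table above
ss_race  <- 1991
ss_resid <- 295394
ms_race  <- 995.47    # Sum Sq / Df for race
ms_resid <- 319.69    # Sum Sq / Df for residuals
df_race  <- 2
df_resid <- 924

# F is the ratio of explained to unexplained mean squares
f_value <- ms_race / ms_resid

# Upper-tail probability of the F distribution gives Pr(>F)
p_value <- pf(f_value, df_race, df_resid, lower.tail = FALSE)

# Share of the total variation in duration explained by race
eta_sq <- ss_race / (ss_race + ss_resid)

print(c(F = f_value, p = p_value, eta_sq = eta_sq))
```

The last quantity shows that even though race is significant at the 0.05 level, it accounts for well under 1% of the total variation in duration.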

Below are the ANOVA results for the interactions effects.

ANOVA_12 <- aov(bfeed$duration~bfeed$race*bfeed$poverty)
anova(ANOVA_12)

## Analysis of Variance Table
## 
## Response: bfeed$duration
##                           Df Sum Sq Mean Sq F value  Pr(>F)  
## bfeed$race                 2   1991  995.47  3.1119 0.04499 *
## bfeed$poverty              1    619  618.90  1.9347 0.16458  
## bfeed$race:bfeed$poverty   2    153   76.58  0.2394 0.78715  
## Residuals                921 294622  319.89                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

interaction.plot(bfeed$race, bfeed$poverty, bfeed$duration)

Race x Poverty: p > .05, no evidence of interaction effect. No interaction shown in plot.

ANOVA_13 <- aov(bfeed$duration~bfeed$race*bfeed$smoke)
anova(ANOVA_13)

## Analysis of Variance Table
## 
## Response: bfeed$duration
##                         Df Sum Sq Mean Sq F value    Pr(>F)    
## bfeed$race               2   1991   995.5  3.1473 0.0434299 *  
## bfeed$smoke              1   3633  3633.3 11.4871 0.0007306 ***
## bfeed$race:bfeed$smoke   2    455   227.5  0.7193 0.4873755    
## Residuals              921 291306   316.3                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

interaction.plot(bfeed$race, bfeed$smoke, bfeed$duration)

Race x Smoke: p > 0.05, no evidence of interaction effect.

ANOVA_14 <- aov(bfeed$duration~bfeed$race*bfeed$agemth)
anova(ANOVA_14)

## Analysis of Variance Table
## 
## Response: bfeed$duration
##                          Df Sum Sq Mean Sq F value  Pr(>F)  
## bfeed$race                2   1991  995.47  3.0960 0.04572 *
## bfeed$agemth             13   5543  426.38  1.3261 0.19125  
## bfeed$race:bfeed$agemth  23   4326  188.11  0.5850 0.94000  
## Residuals               888 285525  321.54                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

interaction.plot(bfeed$race, bfeed$agemth, bfeed$duration)

Race x Age Mother: p > 0.05, no evidence shown of interaction effects.

ANOVA_23 <- aov(bfeed$duration~bfeed$poverty*bfeed$smoke)
anova(ANOVA_23)

## Analysis of Variance Table
## 
## Response: bfeed$duration
##                            Df Sum Sq Mean Sq F value   Pr(>F)   
## bfeed$poverty               1    378  378.50  1.1870 0.276210   
## bfeed$smoke                 1   2626 2625.95  8.2355 0.004202 **
## bfeed$poverty:bfeed$smoke   1     76   75.90  0.2380 0.625737   
## Residuals                 923 294305  318.86                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

interaction.plot(bfeed$poverty, bfeed$smoke, bfeed$duration)

Poverty x Smoke: p > 0.05, no evidence shown of interaction effects. Plots are almost parallel.

ANOVA_24 <- aov(bfeed$duration~bfeed$poverty*bfeed$agemth)
anova(ANOVA_24)

## Analysis of Variance Table
## 
## Response: bfeed$duration
##                             Df Sum Sq Mean Sq F value Pr(>F)
## bfeed$poverty                1    378  378.50  1.1709 0.2795
## bfeed$agemth                13   5126  394.31  1.2199 0.2592
## bfeed$poverty:bfeed$agemth  13   1287   99.03  0.3064 0.9912
## Residuals                  899 290593  323.24

interaction.plot(bfeed$poverty, bfeed$agemth, bfeed$duration)

Poverty x Age Mother: p > 0.05, no interaction effect.

ANOVA_34 <- aov(bfeed$duration~bfeed$smoke*bfeed$agemth)
anova(ANOVA_34)

## Analysis of Variance Table
## 
## Response: bfeed$duration
##                           Df Sum Sq Mean Sq F value   Pr(>F)   
## bfeed$smoke                1   2422 2421.85  7.6008 0.005953 **
## bfeed$agemth              13   5359  412.20  1.2936 0.210336   
## bfeed$smoke:bfeed$agemth  13   3154  242.63  0.7615 0.701673   
## Residuals                899 286450  318.63                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

interaction.plot(bfeed$smoke, bfeed$agemth, bfeed$duration)

Smoke x Age Mother: p > 0.05, no evidence of interaction effect shown.
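Six interaction tests were run, so a multiple-comparison adjustment is a reasonable extra check. The sketch below applies base R's `p.adjust()` with the Holm method to the six interaction p-values copied from the tables above; the vector names are illustrative, and the adjustment itself is an addition rather than part of the original analysis.

```r
# Interaction p-values copied from the six ANOVA tables above
p_int <- c(
  race_poverty   = 0.78715,
  race_smoke     = 0.4873755,
  race_agemth    = 0.94000,
  poverty_smoke  = 0.625737,
  poverty_agemth = 0.9912,
  smoke_agemth   = 0.701673
)

# Holm adjustment controls the family-wise error rate across the six tests
p_holm <- p.adjust(p_int, method = "holm")

print(round(p_holm, 4))
```

Adjusted p-values are never smaller than the raw ones, so the conclusion of no evidence for any two-way interaction is unchanged.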

4. References to the Literature

http://vincentarelbundock.github.io/Rdatasets/doc/KMsurv/bfeed.html

Klein and Moeschberger (1997) Survival Analysis: Techniques for Censored and Truncated Data, Springer.

National Longitudinal Survey of Youth Handbook, The Ohio State University, 1995.

5. Appendices

Complete R Code:

#Read in Breast Feeding Data 
bfeed = read.csv("bfeed.csv")

# Display first 6 rows of data
head(bfeed)

#Display Summary of data
summary(bfeed)

#Display the structure of data
str(bfeed)

#Change the factors from integers to factors, display new structures
bfeed$race <- as.factor(bfeed$race)
str(bfeed$race)

bfeed$poverty <- as.factor(bfeed$poverty)
str(bfeed$poverty)

bfeed$smoke <- as.factor(bfeed$smoke)
str(bfeed$smoke)

bfeed$agemth <- as.factor(bfeed$agemth)
str(bfeed$agemth)

#Display the levels of four factors

levels(bfeed$race)

levels(bfeed$poverty)

levels(bfeed$smoke)

levels(bfeed$agemth)


#Display Histogram of Duration
hist(bfeed$duration, breaks = 50, main = "Frequencies of Duration of Breast Feeding", xlab = "Duration")

#Display boxplots of duration against each of the four factors
boxplot(bfeed$duration~bfeed$race, xlab = "Race", ylab = "Duration of Breast Feeding", main = "Race: 1= white, 2= black, 3= other")

boxplot(bfeed$duration~bfeed$poverty, xlab = "Poverty", ylab = "Duration of Breast Feeding", main = "Mother in Poverty: 1= Yes, 0= No")

boxplot(bfeed$duration~bfeed$smoke, xlab = "Smoke", ylab = "Duration of Breast Feeding", main = "Mother Smoked at Birth of Child:1= Yes, 0= No")

boxplot(bfeed$duration~bfeed$agemth, xlab = "Age of Mother", ylab = "Duration of Breast Feeding", main = "Age of Mother at Birth of Child")


#Perform Analysis of Variance test for Main Effects
ANOVA1 = aov(bfeed$duration~bfeed$race)
anova(ANOVA1)

ANOVA2 = aov(bfeed$duration~bfeed$poverty)
anova(ANOVA2)

ANOVA3 = aov(bfeed$duration~bfeed$smoke)
anova(ANOVA3)

ANOVA4 = aov(bfeed$duration~bfeed$agemth)
anova(ANOVA4)


#Perform ANOVA for Interaction Effects 
ANOVA_12 <- aov(bfeed$duration~bfeed$race*bfeed$poverty)
anova(ANOVA_12)

interaction.plot(bfeed$race, bfeed$poverty, bfeed$duration)


ANOVA_13 <- aov(bfeed$duration~bfeed$race*bfeed$smoke)
anova(ANOVA_13)

interaction.plot(bfeed$race, bfeed$smoke, bfeed$duration)


ANOVA_14 <- aov(bfeed$duration~bfeed$race*bfeed$agemth)
anova(ANOVA_14)

interaction.plot(bfeed$race, bfeed$agemth, bfeed$duration)


ANOVA_23 <- aov(bfeed$duration~bfeed$poverty*bfeed$smoke)
anova(ANOVA_23)

interaction.plot(bfeed$poverty, bfeed$smoke, bfeed$duration)


ANOVA_24 <- aov(bfeed$duration~bfeed$poverty*bfeed$agemth)
anova(ANOVA_24)

interaction.plot(bfeed$poverty, bfeed$agemth, bfeed$duration)


ANOVA_34 <- aov(bfeed$duration~bfeed$smoke*bfeed$agemth)
anova(ANOVA_34)

interaction.plot(bfeed$smoke, bfeed$agemth, bfeed$duration)

Recipes for the Design of Experiments

Kristen Cole, RPI

October 11, 2016