# Read the CSV file
data.bweight <- read.csv("birthweight.csv",header = TRUE) 

# Check data format
str(data.bweight)
## 'data.frame':    295 obs. of  6 variables:
##  $ Weight  : int  2891 3572 3827 4593 3940 2778 3544 3402 3147 3997 ...
##  $ Black   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Married : int  1 1 1 1 0 1 1 1 0 1 ...
##  $ Boy     : int  0 1 0 1 1 0 1 0 0 1 ...
##  $ MomSmoke: int  0 0 0 0 1 0 0 0 0 0 ...
##  $ Ed      : int  1 0 1 1 1 1 1 1 0 1 ...
data.bweight$Black=as.factor(data.bweight$Black)
data.bweight$Married=as.factor(data.bweight$Married)
data.bweight$Boy=as.factor(data.bweight$Boy)
data.bweight$MomSmoke=as.factor(data.bweight$MomSmoke)
data.bweight$Ed=as.factor(data.bweight$Ed)

str(data.bweight)
## 'data.frame':    295 obs. of  6 variables:
##  $ Weight  : int  2891 3572 3827 4593 3940 2778 3544 3402 3147 3997 ...
##  $ Black   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Married : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 2 2 1 2 ...
##  $ Boy     : Factor w/ 2 levels "0","1": 1 2 1 2 2 1 2 1 1 2 ...
##  $ MomSmoke: Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 1 1 1 ...
##  $ Ed      : Factor w/ 2 levels "0","1": 2 1 2 2 2 2 2 2 1 2 ...

Exercise 1:


Part (a):

Answer -:

boxplot(data.bweight$Weight, main="Box plot for Infant birth weight",xlab="Weight",ylab="g (gram)",col = "aquamarine3",border = "aquamarine4")
points(mean(data.bweight$Weight),col="red",pch=16)

Comment:

Mean is very closed to median, skewness is very closed to zero, Q1 and Q3 are almost symmetric, thus:
data looks very normal distribution.

qqnorm(data.bweight$Weight)
qqline(data.bweight$Weight,col="red")

print(shapiro.test(data.bweight$Weight))
## 
##  Shapiro-Wilk normality test
## 
## data:  data.bweight$Weight
## W = 0.99206, p-value = 0.1153
Comment:

As QQ plot, most amount of points are closed to red line, thus normal distribution is reasonable.

As the Shapiro-Wilk test result, the P-Value is large (>0.05), hence data follows normal distribution.

Conclusion:

Regarding to Box plot, QQ plot and Shapiro-Wilk test result, data for the “Weight” follows normal distribution.


Part (b):

Answer -:

boxplot(data.bweight$Weight ~ data.bweight$MomSmoke,main="Box plot for Weight by Momsoking",xlab = "MomSmoke",ylab="Weight (gram)",col="aquamarine3",border="aquamarine4") 

mean.aggregate <- aggregate(data.bweight$Weight,list(data.bweight$MomSmoke),mean)
points(x = 1:nrow(mean.aggregate),                             # Add points to plot
       y = mean.aggregate$x,
       col = "red",
       pch = 16)

text(x = 1:nrow(mean.aggregate),                               # Add text to plot
     y = mean.aggregate$x +200,
     labels = paste("Mean:", round(mean.aggregate$x, 1)),
     col = "red")

Comment:

The plot for moms who smoked (value=1), has lower min and max, lower Q1 and Q2 and lower mean and median, it seems the moms who smoked might have lower weight infants.

Regarding the plots, mean and median are very close together, it seems both groups follows normal distribution.


Part (c):

Answer -:

data.weight.momsnosmoked <- data.bweight[data.bweight$MomSmoke==0,]
print(shapiro.test(data.weight.momsnosmoked$Weight))
## 
##  Shapiro-Wilk normality test
## 
## data:  data.weight.momsnosmoked$Weight
## W = 0.99362, p-value = 0.3549
data.weight.momsmoked <- data.bweight[data.bweight$MomSmoke==1,]
print(shapiro.test(data.weight.momsmoked$Weight))
## 
##  Shapiro-Wilk normality test
## 
## data:  data.weight.momsmoked$Weight
## W = 0.96299, p-value = 0.2

Conclusion:

As the Shapiro-Wilk test results, the P-Value for both groups as large (> 0.05), hence both groups follows normal distribution.


Exercise 2:


Answer -:

Regarding to result of exercise 1, data follows a normal distribution and we have only one independent variable (MomSmoke) with two levels (0/1), thus we should choose a parametric Two-Sample T-Test.

  • H0: There is no Mom Smoking effect on infant’s weight
  • H1: There exist Mom smoking effect.
Step 1: Check variances equality
# Check the Variances
var.test(data.bweight$Weight ~ data.bweight$MomSmoke, alternative="two.sided")
## 
##  F test to compare two variances
## 
## data:  data.bweight$Weight by data.bweight$MomSmoke
## F = 1.0786, num df = 253, denom df = 40, p-value = 0.8009
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.6421109 1.6671729
## sample estimates:
## ratio of variances 
##           1.078555

P-Value >> 0.05 ===> Variances are equal.

Step 2: Perform T-Test
t.test(data.bweight$Weight ~ data.bweight$MomSmoke,alternative="two.sided",var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  data.bweight$Weight by data.bweight$MomSmoke
## t = 3.071, df = 293, p-value = 0.002334
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##   93.37931 426.65488
## sample estimates:
## mean in group 0 mean in group 1 
##        3422.724        3162.707

Conclusion:

P-Value << 0.05 ===> We don’t have any evidence to accept Null hypothesis.
Thus:
The mean of two groups are not equal, therefore, there exist significant effect on Mom smoking for weight of infants.


Exercise 3:

***

Part (a):

Answer -:

Step 1) Check Balancing
table(data.bweight$MomSmoke)
## 
##   0   1 
## 254  41

Data is unbalanced.

Step 2) Perform One-Way ANOVA
e3.aov.a <-  aov(data=data.bweight, Weight ~ MomSmoke)
summary(e3.aov.a)
##              Df   Sum Sq Mean Sq F value  Pr(>F)   
## MomSmoke      1  2386708 2386708   9.431 0.00233 **
## Residuals   293 74151291  253076                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Step 3) Chack variance assumption

We must check equality of variances with 2 ways:
1) Levente Test
2) Residuals diagram

leveneTest(e3.aov.a)
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)
## group   1  0.6767 0.4114
##       293

Based on levene test’s result. P-Value >> 0.05 ===> variances assumption is reasonable

par(mfrow=c(2,2))
plot(e3.aov.a)

Comment

Regarding to the Residuals plot, variances equality is reasonable.
Thus, based on levene test and residuals plot:
variances equality assumption for both groups are reasonable (Homogeneous variances), therefore it’s OK to perform one-way anova.


Part (b):

Answer -:

Calculate the R Square
e3.lm=lm(data=data.bweight, Weight ~ MomSmoke)
anova(e3.lm)
## Analysis of Variance Table
## 
## Response: Weight
##            Df   Sum Sq Mean Sq F value   Pr(>F)   
## MomSmoke    1  2386708 2386708  9.4308 0.002334 **
## Residuals 293 74151291  253076                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(e3.lm)$r.squared
## [1] 0.03118331

This R Square is the percentage of variation in a response variable that is explained by the model.


Final Result:

According the Part (a) - Step (2), P-Value << 0.05 ===> P-Value is statisticaly significant, thus we don’t have any evidence to accept Null, therefore we reject Null.

Conclussion:

Mom smoke groups have different means, it means there exist mom smoke effect on weight of infants.
This result is equal with the reult of Exercise 2 (Two-Way T-Test) and both methods have same results.


Exercise 4:


Part (a):

Answer -:

Step 1) Make full model

The full model with main effects is:

Weight ~ Black + Married + Boy + MomSmoke + Ed

Step 2) Perform ANOVA and Type III
e4.aov.a <- aov(data=data.bweight, Weight ~ Black + Married + Boy + MomSmoke + Ed)
Anova(e4.aov.a, type=3)
## Anova Table (Type III tests)
## 
## Response: Weight
##                Sum Sq  Df   F value    Pr(>F)    
## (Intercept) 470708892   1 1936.5547 < 2.2e-16 ***
## Black         2778525   1   11.4312 0.0008217 ***
## Married         53455   1    0.2199 0.6394546    
## Boy            190845   1    0.7852 0.3763046    
## MomSmoke      2225578   1    9.1563 0.0027017 ** 
## Ed               7295   1    0.0300 0.8625846    
## Residuals    70245819 289                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Step 3) Eliminate first variable

We must eliminate the least significant variable (the biggest P-value)

“ED” has the biggest P-Value, thus we will eliminate it.
The next model is:

Weight ~ Black + Married + Boy + MomSmoke

Step 4) Perform ANOVA and Type III
e4.aov.a <- aov(data=data.bweight, Weight ~ Black + Married + Boy + MomSmoke)
Anova(e4.aov.a, type=3)
## Anova Table (Type III tests)
## 
## Response: Weight
##                Sum Sq  Df   F value    Pr(>F)    
## (Intercept) 504241891   1 2081.4757 < 2.2e-16 ***
## Black         2794731   1   11.5365 0.0007778 ***
## Married         61146   1    0.2524 0.6157671    
## Boy            186656   1    0.7705 0.3807876    
## MomSmoke      2227433   1    9.1947 0.0026466 ** 
## Residuals    70253114 290                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Step 5) Eliminate next least significant variable

“Married” has the biggest P-Value, thus we will eliminate it.
The next model is:

Weight ~ Black + Boy + MomSmoke

Step 6) Perform ANOVA and Type III
e4.aov.a <- aov(data=data.bweight, Weight ~ Black + Boy + MomSmoke)
Anova(e4.aov.a, type=3)
## Anova Table (Type III tests)
## 
## Response: Weight
##                 Sum Sq  Df   F value    Pr(>F)    
## (Intercept) 1413353138   1 5849.2511 < 2.2e-16 ***
## Black          3664133   1   15.1642 0.0001223 ***
## Boy             179990   1    0.7449 0.3888071    
## MomSmoke       2498854   1   10.3417 0.0014471 ** 
## Residuals     70314260 291                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Step 7) Eliminate next least significant variable

“Boy” has the biggest P-Value, thus we will eliminate it.
The next model is:

Weight ~ Black + MomSmoke

Step 8) Perform ANOVA and Type III
e4.aov.a <- aov(data=data.bweight, Weight ~ Black + MomSmoke)
Anova(e4.aov.a, type=3)
## Anova Table (Type III tests)
## 
## Response: Weight
##                 Sum Sq  Df   F value    Pr(>F)    
## (Intercept) 2600800716   1 10772.989 < 2.2e-16 ***
## Black          3657042   1    15.148 0.0001232 ***
## MomSmoke       2513301   1    10.411 0.0013954 ** 
## Residuals     70494249 292                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Step 9) Cut-off

There exist no more insignificant variables, thus we will stop elimination and, now we have the final model.

Model with main effects is: Weight ~ Black + MomSmoke

Step 10) Adding interaction for main effects
e4.aov.a.2 <- aov(data=data.bweight, Weight ~ Black + MomSmoke + Black * MomSmoke)
Anova(e4.aov.a.2, type=3)
## Anova Table (Type III tests)
## 
## Response: Weight
##                    Sum Sq  Df    F value    Pr(>F)    
## (Intercept)    2546671287   1 10513.4642 < 2.2e-16 ***
## Black             3292713   1    13.5934 0.0002707 ***
## MomSmoke          2222570   1     9.1755 0.0026729 ** 
## Black:MomSmoke       5461   1     0.0225 0.8807474    
## Residuals        70488788 291                         
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

P-Value for interaction >> 0.05 ====> interaction is not significant, thus we don’t add interaction to final model.

Conclussion:

Final model will be:

Weight ~ Black + MomSmoke


Part (b):

Answer -:

Step 1) Calculate R Square
e4.lm <- lm(data=data.bweight, Weight ~ Black + MomSmoke)
summary(e4.lm)$r.squared
## [1] 0.07896405

This R Square is the percentage of variation in a response variable that is explained by the model.

Step 2) Diagnostic plot
par(mfrow=c(2,2))
plot(e4.aov.a)

##### Result:
According to QQ plot, normality assumption is reasonable, since, most points are closed to red line, thus we can judge data follows normality distribution.


Part (c):

Answer -:

We must perform a Post-hoc test to figure out about the differences.

TukeyHSD(e4.aov.a)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Weight ~ Black + MomSmoke, data = data.bweight)
## 
## $Black
##          diff       lwr       upr     p adj
## 1-0 -293.9412 -445.2216 -142.6608 0.0001605
## 
## $MomSmoke
##         diff       lwr       upr     p adj
## 1-0 -266.763 -429.5199 -104.0061 0.0013989
  • Diff of “Black = 1/Black = 0” is “Negative”, P-Value is small ==> µ(Black=1) < µ(Black=0)
  • Diff of “Momsmoke = 1/Momsmoke = 0” is “Negative”, P-Value is small ==> µ(Momsmoke = 1) < µ(Momsmoke = 0)

Conclussion:

Mean of infant’s weight of white moms is bigger than blacks.
Mean of infant’s weight of non-smoking moms is bigger than smoking moms.