# Read the CSV file
data.bweight <- read.csv("birthweight.csv",header = TRUE)
# Check data format
str(data.bweight)
## 'data.frame': 295 obs. of 6 variables:
## $ Weight : int 2891 3572 3827 4593 3940 2778 3544 3402 3147 3997 ...
## $ Black : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Married : int 1 1 1 1 0 1 1 1 0 1 ...
## $ Boy : int 0 1 0 1 1 0 1 0 0 1 ...
## $ MomSmoke: int 0 0 0 0 1 0 0 0 0 0 ...
## $ Ed : int 1 0 1 1 1 1 1 1 0 1 ...
data.bweight$Black=as.factor(data.bweight$Black)
data.bweight$Married=as.factor(data.bweight$Married)
data.bweight$Boy=as.factor(data.bweight$Boy)
data.bweight$MomSmoke=as.factor(data.bweight$MomSmoke)
data.bweight$Ed=as.factor(data.bweight$Ed)
str(data.bweight)
## 'data.frame': 295 obs. of 6 variables:
## $ Weight : int 2891 3572 3827 4593 3940 2778 3544 3402 3147 3997 ...
## $ Black : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Married : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 2 2 1 2 ...
## $ Boy : Factor w/ 2 levels "0","1": 1 2 1 2 2 1 2 1 1 2 ...
## $ MomSmoke: Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 1 1 1 ...
## $ Ed : Factor w/ 2 levels "0","1": 2 1 2 2 2 2 2 2 1 2 ...
boxplot(data.bweight$Weight, main="Box plot for Infant birth weight",xlab="Weight",ylab="g (gram)",col = "aquamarine3",border = "aquamarine4")
points(mean(data.bweight$Weight),col="red",pch=16)
Mean is very closed to median, skewness is very closed to zero, Q1 and Q3 are almost symmetric, thus:
data looks very normal distribution.
qqnorm(data.bweight$Weight)
qqline(data.bweight$Weight,col="red")
print(shapiro.test(data.bweight$Weight))
##
## Shapiro-Wilk normality test
##
## data: data.bweight$Weight
## W = 0.99206, p-value = 0.1153
Regarding to Box plot, QQ plot and Shapiro-Wilk test result, data for the “Weight” follows normal distribution.
boxplot(data.bweight$Weight ~ data.bweight$MomSmoke,main="Box plot for Weight by Momsoking",xlab = "MomSmoke",ylab="Weight (gram)",col="aquamarine3",border="aquamarine4")
mean.aggregate <- aggregate(data.bweight$Weight,list(data.bweight$MomSmoke),mean)
points(x = 1:nrow(mean.aggregate), # Add points to plot
y = mean.aggregate$x,
col = "red",
pch = 16)
text(x = 1:nrow(mean.aggregate), # Add text to plot
y = mean.aggregate$x +200,
labels = paste("Mean:", round(mean.aggregate$x, 1)),
col = "red")
The plot for moms who smoked (value=1), has lower min and max, lower Q1 and Q2 and lower mean and median, it seems the moms who smoked might have lower weight infants.
Regarding the plots, mean and median are very close together, it seems both groups follows normal distribution.
data.weight.momsnosmoked <- data.bweight[data.bweight$MomSmoke==0,]
print(shapiro.test(data.weight.momsnosmoked$Weight))
##
## Shapiro-Wilk normality test
##
## data: data.weight.momsnosmoked$Weight
## W = 0.99362, p-value = 0.3549
data.weight.momsmoked <- data.bweight[data.bweight$MomSmoke==1,]
print(shapiro.test(data.weight.momsmoked$Weight))
##
## Shapiro-Wilk normality test
##
## data: data.weight.momsmoked$Weight
## W = 0.96299, p-value = 0.2
As the Shapiro-Wilk test results, the P-Value for both groups as large (> 0.05), hence both groups follows normal distribution.
Regarding to result of exercise 1, data follows a normal distribution and we have only one independent variable (MomSmoke) with two levels (0/1), thus we should choose a parametric Two-Sample T-Test.
# Check the Variances
var.test(data.bweight$Weight ~ data.bweight$MomSmoke, alternative="two.sided")
##
## F test to compare two variances
##
## data: data.bweight$Weight by data.bweight$MomSmoke
## F = 1.0786, num df = 253, denom df = 40, p-value = 0.8009
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.6421109 1.6671729
## sample estimates:
## ratio of variances
## 1.078555
P-Value >> 0.05 ===> Variances are equal.
t.test(data.bweight$Weight ~ data.bweight$MomSmoke,alternative="two.sided",var.equal=TRUE)
##
## Two Sample t-test
##
## data: data.bweight$Weight by data.bweight$MomSmoke
## t = 3.071, df = 293, p-value = 0.002334
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## 93.37931 426.65488
## sample estimates:
## mean in group 0 mean in group 1
## 3422.724 3162.707
P-Value << 0.05 ===> We don’t have any evidence to accept Null hypothesis.
Thus:
The mean of two groups are not equal, therefore, there exist significant effect on Mom smoking for weight of infants.
***
table(data.bweight$MomSmoke)
##
## 0 1
## 254 41
Data is unbalanced.
e3.aov.a <- aov(data=data.bweight, Weight ~ MomSmoke)
summary(e3.aov.a)
## Df Sum Sq Mean Sq F value Pr(>F)
## MomSmoke 1 2386708 2386708 9.431 0.00233 **
## Residuals 293 74151291 253076
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We must check equality of variances with 2 ways:
1) Levente Test
2) Residuals diagram
leveneTest(e3.aov.a)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 0.6767 0.4114
## 293
Based on levene test’s result. P-Value >> 0.05 ===> variances assumption is reasonable
par(mfrow=c(2,2))
plot(e3.aov.a)
Regarding to the Residuals plot, variances equality is reasonable.
Thus, based on levene test and residuals plot:
variances equality assumption for both groups are reasonable (Homogeneous variances), therefore it’s OK to perform one-way anova.
e3.lm=lm(data=data.bweight, Weight ~ MomSmoke)
anova(e3.lm)
## Analysis of Variance Table
##
## Response: Weight
## Df Sum Sq Mean Sq F value Pr(>F)
## MomSmoke 1 2386708 2386708 9.4308 0.002334 **
## Residuals 293 74151291 253076
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(e3.lm)$r.squared
## [1] 0.03118331
This R Square is the percentage of variation in a response variable that is explained by the model.
According the Part (a) - Step (2), P-Value << 0.05 ===> P-Value is statisticaly significant, thus we don’t have any evidence to accept Null, therefore we reject Null.
Mom smoke groups have different means, it means there exist mom smoke effect on weight of infants.
This result is equal with the reult of Exercise 2 (Two-Way T-Test) and both methods have same results.
The full model with main effects is:
Weight ~ Black + Married + Boy + MomSmoke + Ed
e4.aov.a <- aov(data=data.bweight, Weight ~ Black + Married + Boy + MomSmoke + Ed)
Anova(e4.aov.a, type=3)
## Anova Table (Type III tests)
##
## Response: Weight
## Sum Sq Df F value Pr(>F)
## (Intercept) 470708892 1 1936.5547 < 2.2e-16 ***
## Black 2778525 1 11.4312 0.0008217 ***
## Married 53455 1 0.2199 0.6394546
## Boy 190845 1 0.7852 0.3763046
## MomSmoke 2225578 1 9.1563 0.0027017 **
## Ed 7295 1 0.0300 0.8625846
## Residuals 70245819 289
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We must eliminate the least significant variable (the biggest P-value)
“ED” has the biggest P-Value, thus we will eliminate it.
The next model is:
Weight ~ Black + Married + Boy + MomSmoke
e4.aov.a <- aov(data=data.bweight, Weight ~ Black + Married + Boy + MomSmoke)
Anova(e4.aov.a, type=3)
## Anova Table (Type III tests)
##
## Response: Weight
## Sum Sq Df F value Pr(>F)
## (Intercept) 504241891 1 2081.4757 < 2.2e-16 ***
## Black 2794731 1 11.5365 0.0007778 ***
## Married 61146 1 0.2524 0.6157671
## Boy 186656 1 0.7705 0.3807876
## MomSmoke 2227433 1 9.1947 0.0026466 **
## Residuals 70253114 290
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
“Married” has the biggest P-Value, thus we will eliminate it.
The next model is:
Weight ~ Black + Boy + MomSmoke
e4.aov.a <- aov(data=data.bweight, Weight ~ Black + Boy + MomSmoke)
Anova(e4.aov.a, type=3)
## Anova Table (Type III tests)
##
## Response: Weight
## Sum Sq Df F value Pr(>F)
## (Intercept) 1413353138 1 5849.2511 < 2.2e-16 ***
## Black 3664133 1 15.1642 0.0001223 ***
## Boy 179990 1 0.7449 0.3888071
## MomSmoke 2498854 1 10.3417 0.0014471 **
## Residuals 70314260 291
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
“Boy” has the biggest P-Value, thus we will eliminate it.
The next model is:
Weight ~ Black + MomSmoke
e4.aov.a <- aov(data=data.bweight, Weight ~ Black + MomSmoke)
Anova(e4.aov.a, type=3)
## Anova Table (Type III tests)
##
## Response: Weight
## Sum Sq Df F value Pr(>F)
## (Intercept) 2600800716 1 10772.989 < 2.2e-16 ***
## Black 3657042 1 15.148 0.0001232 ***
## MomSmoke 2513301 1 10.411 0.0013954 **
## Residuals 70494249 292
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
There exist no more insignificant variables, thus we will stop elimination and, now we have the final model.
Model with main effects is: Weight ~ Black + MomSmoke
e4.aov.a.2 <- aov(data=data.bweight, Weight ~ Black + MomSmoke + Black * MomSmoke)
Anova(e4.aov.a.2, type=3)
## Anova Table (Type III tests)
##
## Response: Weight
## Sum Sq Df F value Pr(>F)
## (Intercept) 2546671287 1 10513.4642 < 2.2e-16 ***
## Black 3292713 1 13.5934 0.0002707 ***
## MomSmoke 2222570 1 9.1755 0.0026729 **
## Black:MomSmoke 5461 1 0.0225 0.8807474
## Residuals 70488788 291
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
P-Value for interaction >> 0.05 ====> interaction is not significant, thus we don’t add interaction to final model.
Final model will be:
Weight ~ Black + MomSmoke
e4.lm <- lm(data=data.bweight, Weight ~ Black + MomSmoke)
summary(e4.lm)$r.squared
## [1] 0.07896405
This R Square is the percentage of variation in a response variable that is explained by the model.
par(mfrow=c(2,2))
plot(e4.aov.a)
##### Result:
According to QQ plot, normality assumption is reasonable, since, most points are closed to red line, thus we can judge data follows normality distribution.
We must perform a Post-hoc test to figure out about the differences.
TukeyHSD(e4.aov.a)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Weight ~ Black + MomSmoke, data = data.bweight)
##
## $Black
## diff lwr upr p adj
## 1-0 -293.9412 -445.2216 -142.6608 0.0001605
##
## $MomSmoke
## diff lwr upr p adj
## 1-0 -266.763 -429.5199 -104.0061 0.0013989
Mean of infant’s weight of white moms is bigger than blacks.
Mean of infant’s weight of non-smoking moms is bigger than smoking moms.
Comment:
As QQ plot, most amount of points are closed to red line, thus normal distribution is reasonable.
As the Shapiro-Wilk test result, the P-Value is large (>0.05), hence data follows normal distribution.