options(dplyr.summarise.inform = F)
library(dplyr)
prb<-read_csv("PRB2013.csv", col_names=T)
##
## ── Column specification ─────────────────────────────────
## cols(
## .default = col_double(),
## Country = col_character(),
## Continent = col_character(),
## Region = col_character()
## )
## ℹ Use `spec()` for the full column specifications.
names(prb)<-tolower(names(prb))
prb_new<-prb%>%
mutate(Africa=ifelse(prb$continent=="Africa",yes= "Africa",no= "Not Africa"))
table(prb_new$Africa)
##
## Africa Not Africa
## 55 153
#summary stats by group using group_by
options(dplyr.summarise.inform = F)
prb_new %>%
group_by(Africa) %>%
summarise(means=mean(tfr, na.rm=T), sds=sd (tfr, na.rm=T), n=n())
## # A tibble: 2 x 4
## Africa means sds n
## <chr> <dbl> <dbl> <int>
## 1 Africa 4.61 1.42 55
## 2 Not Africa 2.25 0.889 153
Q1. What can we conclude from the summary statistics?
#The mean fertility rate for Africa is 4.612727 with a standard deviation of 1.4. The total sample of countries from Africa is 55.
#The mean fertility rate for Non-African is 2.250980 with a standard deviation of 0.8. The total sample of non-African countries is 153. The means are very different between African and non-African countries. With Africa TRF of 4.6 and non-Africa TFR of 2.4, the difference is 2.4. That is a difference of 2.4 children per women on average.
Q2. What can we conclude from the significant t-test?
#significant t-test
t.test(tfr~Africa, data=prb_new)
##
## Welch Two Sample t-test
##
## data: tfr by Africa
## t = 11.559, df = 69.813, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1.954226 2.769268
## sample estimates:
## mean in group Africa mean in group Not Africa
## 4.612727 2.250980
#Based on the data from the two-sample t-test, we have evidence to reject the null hypothesis because the t-value of 11.59 is greater than 1.96, degrees of freedom 69.8, and the P-value ,< 2.2e-16, is less than 0.05. The mean for total fertility rate in African countries is 4.6 while the mean for non-African countries is 2.25. The total fertility rate between African and Non-African countries is significant.
Q3. Now that you see an example, now your turn. Please conduct a significant test to examine the difference in infant mortality (imr) between Asian countries and African countries. a) provide summary statistics by group
prb_imr <- prb %>%
transmute(
continent=ifelse(continent=="Africa", "Africa" ,
ifelse(continent=="Asia", "Asia", NA)),
imr=imr
)
View(prb_imr)
prb_imr <- na.omit(prb_imr)
prb_imr %>%
group_by(continent)%>%
summarise(mean=mean(imr,na.rm=TRUE), sd=sd(imr, na.rm = TRUE), n=n())
## # A tibble: 2 x 4
## continent mean sd n
## <chr> <dbl> <dbl> <int>
## 1 Africa 57.8 28.4 55
## 2 Asia 24.0 20.3 51
#The summary statistics suggest that the means for infant mortality rate (imr) is different between Asian and African countries. The mean imr for Asia is 23.0, standard deviation is 20.30 and the number sample is 51. The mean imr for African countries is 57.83, standard deviation 29.4 and the number sample is 55. With Africa with an imr of 57.83 and Asia with an imr of 23.0, the difference is 34.8 in imr.
b)provide boxplots by group
boxplot(imr~continent, data=prb)
e <- ggplot(prb, aes(x = continent, y = imr))
e + geom_boxplot() +
scale_x_discrete(limits=c("Africa", "Asia"))
## Warning: Removed 102 rows containing missing values (stat_boxplot).
t.test(imr~continent, data=prb_imr)
##
## Welch Two Sample t-test
##
## data: imr by continent
## t = 7.0987, df = 97.84, p-value = 2.02e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 24.38592 43.31119
## sample estimates:
## mean in group Africa mean in group Asia
## 57.83091 23.98235
#Based on the data from the two-tailed t-test, the t value is 7.1, degrees of freedom is 97.84 and the p value is 2.02e-10. The infant mortality rate mean for the African group is 57.83 and the infant mortality rate mean for the Asia is 23.98. The difference between the means is statistically significant. There is enough evidence to reject the null hypothesis. H₀: μ₁ - μ₂ = 0 Ha: μ₁ - μ₂ ≠ 0
Moving from two groups to multiple groups.
#summary statistics by group
prb%>%
group_by(continent)%>%
summarise(means=mean(tfr, na.rm=T), sds=sd(tfr, na.rm=T), n=n())
## # A tibble: 6 x 4
## continent means sds n
## <chr> <dbl> <dbl> <int>
## 1 Africa 4.61 1.42 55
## 2 Asia 2.52 1.03 51
## 3 Europe 1.55 0.228 45
## 4 North America 2.21 0.546 27
## 5 Oceania 3.18 0.901 17
## 6 South America 2.5 0.476 13
Q6. Based on the output above,how many groups are there? Describe the table briefly.
#There are six groups: Africa, Asia, Europe, North America, Oceania and South America. The table provides means, standard deviation and the sample number for each group. The group/continent with the highest fertility rate is Africa with a mean of 4.612727; the group/continent with the lowest fertility rate is Europe with a mean of 1.553333.
Q7. Conduct the Anova test, and explain how did we reach to the F-value of 57.379.
at1 <- aov(tfr~continent, data=prb_new)
anova(at1)
## Analysis of Variance Table
##
## Response: tfr
## Df Sum Sq Mean Sq F value Pr(>F)
## continent 5 266.61 53.322 57.379 < 2.2e-16 ***
## Residuals 202 187.72 0.929
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Anova compares 'Between Variance' to 'Within Variance' through a ratio called F. F=Msb/Msw, based on the F ratio formula we divided the mean square value of the between variance by the within variance, 53.322/0.929 = 57.379. P value is < 2.2e-16.
Q8. Interpret the F-test (ANOVA test) results. Make sure you state the null and research hypotheses.
# Null hypothesis: There is no difference in mean among the six groups.
# Alternative hypothesis: There is a difference in mean in at least one of the six groups.F value is 57.379. P value is < 2.2e-16. We reject the null hypothesis. H₀: μ₁ - μ₂ = 0 Ha: μ₁ - μ₂ ≠ 0
Q9. What’s the mean value for education and family income, respectively?
library(haven)
stata_PSID_w1 <- read_dta("stata_PSID_w1.dta")
View(stata_PSID_w1)
mean(stata_PSID_w1$educ, na.rm=T)
## [1] 13.03642
mean(stata_PSID_w1$adjfinc, na.rm=T)
## [1] 60.39013
#The mean value for education is 13.03 and the mean value for family income it is 60.39.
Q10. Estimate the relationship between education (X) and family income (Y).
reg1 <- lm(adjfinc~educ, data = stata_PSID_w1)
summary(reg1)
##
## Call:
## lm(formula = adjfinc ~ educ, data = stata_PSID_w1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -982.3 -31.8 -11.7 14.8 4962.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -36.7707 0.9331 -39.41 <2e-16 ***
## educ 7.4568 0.0696 107.14 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 78.13 on 128843 degrees of freedom
## (2516 observations deleted due to missingness)
## Multiple R-squared: 0.0818, Adjusted R-squared: 0.0818
## F-statistic: 1.148e+04 on 1 and 128843 DF, p-value: < 2.2e-16
with(stata_PSID_w1,plot(educ, adjfinc))
abline(reg1, col="purple", lwd=2)
10-1) How would you write the linear regression equation?
#y=-36.77 + 4.4568x + E
10-2) Do you have any concerns that this model violates the regression assumptions?
#No concerns.
10-3) What’s the R output of the regression analysis?
reg1 <- lm(adjfinc~educ, data = stata_PSID_w1)
summary(reg1)
##
## Call:
## lm(formula = adjfinc ~ educ, data = stata_PSID_w1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -982.3 -31.8 -11.7 14.8 4962.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -36.7707 0.9331 -39.41 <2e-16 ***
## educ 7.4568 0.0696 107.14 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 78.13 on 128843 degrees of freedom
## (2516 observations deleted due to missingness)
## Multiple R-squared: 0.0818, Adjusted R-squared: 0.0818
## F-statistic: 1.148e+04 on 1 and 128843 DF, p-value: < 2.2e-16
10-4) How would you interpret the coefficient of education?
#The value for the education coefficient is 7.4568, for every unit increase in education the adjusted family income increases by $7,457.
10-5) Show the analysis of variance table from this regression analysis.
anova(reg1)
## Analysis of Variance Table
##
## Response: adjfinc
## Df Sum Sq Mean Sq F value Pr(>F)
## educ 1 70070052 70070052 11479 < 2.2e-16 ***
## Residuals 128843 786481569 6104
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
10-6) What’s the value of SSE? What does it mean?
# The error is the difference between the observed value and the predicted value. The SSE is 786481569, its a measure of the total variances.
Bonus Question 12 Estimate the relationship between education levels (more than high school>12 years versus equal or less than high school<=12) and family income (Y).
educb <-stata_PSID_w1 %>%
mutate (educc= ifelse(educ<=8,
c("less highschool"), c("more highschool")))
summary(educb)
## year sex age marpi
## Min. :2001 Length:131361 Min. : 1.00 Min. :0.0000
## 1st Qu.:2003 Class :character 1st Qu.: 14.00 1st Qu.:0.0000
## Median :2007 Mode :character Median : 29.00 Median :0.0000
## Mean :2006 Mean : 32.03 Mean :0.4178
## 3rd Qu.:2009 3rd Qu.: 47.00 3rd Qu.:1.0000
## Max. :2011 Max. :999.00 Max. :4.0000
## NA's :28
## educ adjfinc pubhs rnthlp
## Min. : 0.00 Min. :-929.60 Min. :0.00000 Min. :0.00000
## 1st Qu.:12.00 1st Qu.: 24.04 1st Qu.:0.00000 1st Qu.:0.00000
## Median :12.00 Median : 45.18 Median :0.00000 Median :0.00000
## Mean :13.04 Mean : 60.39 Mean :0.05301 Mean :0.02409
## 3rd Qu.:15.00 3rd Qu.: 75.31 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :20.00 Max. :5044.84 Max. :1.00000 Max. :1.00000
## NA's :2496 NA's :48 NA's :34 NA's :48
## adjwlth1 adjwlth2 h_race_ethnic_new id
## Min. :-2467.18 Min. :-2304.98 Length:131361 Min. : 4003
## 1st Qu.: 0.01 1st Qu.: 1.91 Class :character 1st Qu.:1269033
## Median : 9.98 Median : 32.80 Mode :character Median :2464171
## Mean : 129.48 Mean : 187.17 Mean :3014466
## 3rd Qu.: 58.05 3rd Qu.: 143.55 3rd Qu.:5381175
## Max. :80199.41 Max. :80303.23 Max. :6872185
## NA's :48 NA's :48
## race5 educc
## Min. :1.000 Length:131361
## 1st Qu.:3.000 Class :character
## Median :5.000 Mode :character
## Mean :3.927
## 3rd Qu.:5.000
## Max. :5.000
##
table(educb$educc)
##
## less highschool more highschool
## 6594 122271
reg2 <- lm(adjfinc~educc, data=educb)
summary(reg2)
##
## Call:
## lm(formula = adjfinc ~ educc, data = educb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -991.7 -36.0 -14.2 14.6 4982.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.769 1.001 29.75 <2e-16 ***
## educcmore highschool 32.326 1.027 31.47 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 81.22 on 128843 degrees of freedom
## (2516 observations deleted due to missingness)
## Multiple R-squared: 0.007627, Adjusted R-squared: 0.007619
## F-statistic: 990.3 on 1 and 128843 DF, p-value: < 2.2e-16
#y=29.769 + 32.326x + E
reg2 <- lm(adjfinc~educc, data=educb)
summary(reg2)
##
## Call:
## lm(formula = adjfinc ~ educc, data = educb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -991.7 -36.0 -14.2 14.6 4982.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.769 1.001 29.75 <2e-16 ***
## educcmore highschool 32.326 1.027 31.47 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 81.22 on 128843 degrees of freedom
## (2516 observations deleted due to missingness)
## Multiple R-squared: 0.007627, Adjusted R-squared: 0.007619
## F-statistic: 990.3 on 1 and 128843 DF, p-value: < 2.2e-16
# On average, a person with more than high school education will earn about $32,326 more in family income than their counterparts with less than high school education.
# ($62,095 = 29.769+32.326)
# A person with less than or equal to high school education, will earn on average $29,769.
Please type your answers, including my original code in a R-markdown file. When you submit, please submit the Rpub link.
#bet