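##load packages to use
library(readr)  # read_csv()
library(dplyr)  # %>%, mutate(), filter(), group_by(), summarise()
library(haven)  # read_dta()
#Conducting a t-test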
setwd("~/Documents/R_programming")
prb<-read_csv("PRB2013.csv", col_names=T)
## Parsed with column specification:
## cols(
## .default = col_double(),
## Country = col_character(),
## Continent = col_character(),
## Region = col_character()
## )
## See spec(...) for full column specifications.
names(prb)<-tolower(names(prb))
prb_new<-prb%>% #we are creating a new dataframe called prb_new
mutate(Africa=ifelse(continent=="Africa", yes="Africa", no="Not Africa")) #prb_new contains an additional dummy variable, Africa; inside mutate() the column is referenced directly, without prb$
#summary statistics by group using group_by
prb_new%>%
group_by(Africa)%>%
summarise(means=mean(tfr, na.rm=T), sds=sd(tfr, na.rm=T), n=n())
## `summarise()` ungrouping output (override with `.groups` argument)
#significance test (t-test)
t.test(tfr~Africa, data=prb_new)
##
## Welch Two Sample t-test
##
## data: tfr by Africa
## t = 11.559, df = 69.813, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1.954226 2.769268
## sample estimates:
## mean in group Africa mean in group Not Africa
## 4.612727 2.250980
Q1. What can we conclude from the summary statistics?
On average, the 55 African countries have a total fertility rate (TFR) of 4.61, while the 153 countries on other continents have an average TFR of 2.25. The standard deviation measures how far the data points typically lie from their group mean: a value close to 0 means the points cluster tightly around the mean, and a larger value means they are more spread out. The standard deviation of TFR is larger for African countries (1.42) than for non-African countries (0.89), so fertility rates vary more from country to country within Africa, while the rates of non-African countries lie closer to their mean.
Q2. What can we conclude from the significant t-test?
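Since the p-value (< 2.2e-16) is less than 0.05, we reject the null hypothesis of equal means at the 5% significance level and conclude that the mean total fertility rate differs significantly between African and non-African countries. The 95% confidence interval for the difference in means (1.95 to 2.77) excludes zero, which is consistent with this conclusion.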
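To show where the t statistic and the fractional degrees of freedom come from, the sketch below recomputes a Welch two-sample t-test by hand and checks it against `t.test()`. The data are simulated purely for illustration: the group sizes, means, and standard deviations are assumptions, not the PRB values.

```r
# Simulated data (hypothetical; not the PRB values)
set.seed(1)
x <- rnorm(55, mean = 4.6, sd = 1.4)    # stand-in for one group
y <- rnorm(150, mean = 2.3, sd = 0.9)   # stand-in for the other group

# Squared standard error of each group mean
vx <- var(x) / length(x)
vy <- var(y) / length(y)

# Welch t statistic: difference in means over its standard error
t_manual <- (mean(x) - mean(y)) / sqrt(vx + vy)

# Welch-Satterthwaite degrees of freedom (usually not an integer)
df_manual <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

# Two-sided p-value
p_manual <- 2 * pt(-abs(t_manual), df_manual)

# t.test() performs the Welch test by default (var.equal = FALSE)
res <- t.test(x, y)
stopifnot(
  isTRUE(all.equal(res$statistic[["t"]], t_manual)),
  isTRUE(all.equal(res$parameter[["df"]], df_manual)),
  isTRUE(all.equal(res$p.value, p_manual))
)
```

The checks pass silently, which indicates that the hand calculation and `t.test()` agree on these simulated data.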
Q3. Now that you have seen an example, it is your turn. Please conduct a significance test to examine the difference in infant mortality (imr) between Asian countries and African countries.
#a) provide summary statistics by group
PRB_AA <- prb%>%
filter(continent%in%c("Africa","Asia"))
PRB_AA %>%
group_by(continent) %>%
summarise(means=mean(imr, na.rm=T), sds=sd(imr, na.rm=T), n=n())
## `summarise()` ungrouping output (override with `.groups` argument)
#b) provide boxplots by group
boxplot(imr ~ continent, data = PRB_AA,
main = "Boxplot of Infant Mortality Rates in Africa and Asia",
ylab = "Infant Mortality",
xlab = "Continent")
#c) conduct t-test. Make sure you interpret the results thoroughly
t.test(imr ~ continent, data = PRB_AA) #pass the data frame directly instead of using attach()/detach()
##
## Welch Two Sample t-test
##
## data: imr by continent
## t = 7.0987, df = 97.84, p-value = 2.02e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 24.38592 43.31119
## sample estimates:
## mean in group Africa mean in group Asia
## 57.83091 23.98235
#Since the p-value (2.02e-10, i.e. 0.000000000202) is less than the significance level of 0.05, we reject the null hypothesis of no difference in mean infant mortality rate between Africa and Asia. The 95% confidence interval for the difference in means (24.39 to 43.31) does not contain zero, which leads to the same conclusion. The mean infant mortality rate is 57.83 in African countries and 23.98 in Asian countries, so we conclude at the 5% significance level that the African mean is significantly higher, by an estimated 33.85.
##Moving from two groups to multiple groups
#summary statistics by group
prb%>%
group_by(continent)%>%
summarise(means=mean(tfr, na.rm=T), sds=sd(tfr, na.rm=T), n=n())
## `summarise()` ungrouping output (override with `.groups` argument)
#Q6. Based on the output above, how many groups are there? Describe the table briefly.
#There are six groups, one per continent. The table shows, for each continent, the mean total fertility rate (means), its standard deviation (sds), and the number of countries (n). Africa has the most countries (55) and the highest mean total fertility rate (4.6). Europe, with 45 countries, has a mean total fertility rate of 1.5 and the smallest standard deviation (0.23) of all continents.
#Q7. Conduct the ANOVA test, and explain how we arrive at the F-value of 57.379.
mm <- aov(tfr~continent,prb)
anova(mm)
# The F-value is the ratio of the between-groups variance (the mean square for continent) to the within-groups variance (the mean square for the residuals). Dividing the rounded mean squares, 53.322 / 0.929, gives about 57.4; the reported F-value of 57.379 is the same ratio computed from the unrounded mean squares.
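The ratio described above can be reproduced from any ANOVA table. The sketch below uses R's built-in `PlantGrowth` data set as a stand-in (an assumption for illustration; it is not the PRB data) and recomputes the F-value and its p-value from the mean squares.

```r
# One-way ANOVA on a built-in data set (stand-in for tfr ~ continent)
fit_pg <- aov(weight ~ group, data = PlantGrowth)
tab <- anova(fit_pg)

# Row 1 is the grouping factor (between groups), row 2 the residuals (within groups)
ms_between <- tab[["Mean Sq"]][1]
ms_within  <- tab[["Mean Sq"]][2]

# F is the ratio of the two mean squares
f_manual <- ms_between / ms_within

# The p-value is the upper tail of the F distribution with the table's degrees of freedom
p_manual <- pf(f_manual, tab[["Df"]][1], tab[["Df"]][2], lower.tail = FALSE)

stopifnot(
  isTRUE(all.equal(f_manual, tab[["F value"]][1])),
  isTRUE(all.equal(p_manual, tab[["Pr(>F)"]][1]))
)
```

Because the division uses the unrounded mean squares stored in the table, it matches the printed F-value exactly, whereas dividing the printed (rounded) mean squares can differ in the later digits.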
# Q8. Interpret the F-test (ANOVA test) results. Make sure you state the null and research hypotheses.
# H0: μ1 = μ2 = μ3 ... = μ6
# H1: Means are not all equal
# Because the p-value is less than 0.05, we reject the null hypothesis at the 5% significance level and conclude that at least one continent's mean total fertility rate differs from the others.
##Simple Regression Analysis.
#Using PSID data to examine the relationship between education (educ) and family income (adjfinc).
##Q9. What’s the mean value for education and family income, respectively?
PSIDD <-read_dta("psid2013.dta")
PSID <- PSIDD
mean(PSID$educ, na.rm=T)
## [1] 13.37999
mean(PSID$adjfinc,na.rm=T)
## [1] 54.36537
#Q10. Estimate the relationship between education (X) and family income (Y).
#1) How would you write the linear regression equation?
# Family income(Y) = β0 + β1*X(Education) + ϵ
#2) Do you have any concerns that this model violates the regression assumptions?
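# Yes. The residual summary in the regression output below is strongly right-skewed (median -10.36, maximum 1751.25), so the assumptions of normally distributed errors and constant error variance are doubtful.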
#3) What’s the R output of the regression analysis?
fit<-lm(adjfinc~educ, data=PSIDD)
summary(fit)
##
## Call:
## lm(formula = adjfinc ~ educ, data = PSIDD)
##
## Residuals:
## Min 1Q Median 3Q Max
## -101.74 -28.94 -10.36 15.13 1751.25
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -41.1346 1.7674 -23.27 <2e-16 ***
## educ 7.1438 0.1285 55.59 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 62.07 on 22959 degrees of freedom
## (173 observations deleted due to missingness)
## Multiple R-squared: 0.1186, Adjusted R-squared: 0.1186
## F-statistic: 3091 on 1 and 22959 DF, p-value: < 2.2e-16
#4) How would you interpret the coefficient of education?
# For each additional year of education completed, predicted family income increases by about 7.14 on average (in the units of adjfinc).
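The "one more year" reading of a slope can be checked directly: the predictions for two education levels one year apart differ by exactly the estimated slope. The sketch below uses simulated data; the variable names `educ_sim` and `income_sim` and the generating values are hypothetical, not taken from the PSID file.

```r
# Simulated data (hypothetical; not the PSID values)
set.seed(42)
educ_sim   <- sample(8:17, 200, replace = TRUE)
income_sim <- -40 + 7 * educ_sim + rnorm(200, sd = 20)

fit_sim <- lm(income_sim ~ educ_sim)

# Predicted income at 12 and at 13 years of education
pred <- predict(fit_sim, newdata = data.frame(educ_sim = c(12, 13)))

# The gap between the two predictions equals the estimated slope
slope <- coef(fit_sim)[["educ_sim"]]
stopifnot(isTRUE(all.equal(unname(pred[2] - pred[1]), slope)))
```

The same holds for any pair of values one year apart, because the fitted model is a straight line.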
#5) Show the analysis of variance table from this regression analysis.
anova(fit)
#6) What’s the value of SSE? What does it mean?
# 88450801
# SSE is the sum of squared errors: the sum of the squared differences between the observed family incomes and the values predicted by the regression model. It measures the variation in family income that education does not explain.
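As a small check on the definition of SSE, the sketch below computes it three ways on R's built-in `cars` data set (a stand-in chosen for illustration, not the PSID data): from the residuals, from the ANOVA table, and from the residual standard error.

```r
fit_cars <- lm(dist ~ speed, data = cars)

# 1) Directly from the definition: sum of squared residuals
sse_manual <- sum(residuals(fit_cars)^2)

# 2) From the "Residuals" row of the analysis of variance table
sse_table <- anova(fit_cars)["Residuals", "Sum Sq"]

# 3) From the residual standard error: SSE = sigma^2 * residual df
sse_sigma <- sigma(fit_cars)^2 * df.residual(fit_cars)

stopifnot(
  isTRUE(all.equal(sse_manual, sse_table)),
  isTRUE(all.equal(sse_manual, sse_sigma))
)
```

The third identity also ties together the numbers reported for the income regression: 62.07^2 x 22959 is roughly 88.45 million, in line with the SSE of 88450801.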
#Q11. (Bonus). Estimate the relationship between education level (more than high school, >12 years, versus high school or less, <=12 years) and family income (Y).
#1) How would you write the equation?
# Family income(Y) = β0 + β1*X + ϵ, where X = 1 for more than 12 years of education and X = 0 for 12 years or fewer
#2) What’s the R output of the regression analysis?
Edu_more <- PSID %>%
mutate(new_edu =factor( ifelse(educ <= 12 & educ >= 1, 0,
ifelse(educ > 12, 1, NA))))
Mid<-lm(adjfinc~new_edu, data=Edu_more)
summary(Mid)
##
## Call:
## lm(formula = adjfinc ~ new_edu, data = Edu_more)
##
## Residuals:
## Min 1Q Median 3Q Max
## -90.01 -28.64 -10.76 15.32 1781.19
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.4079 0.6045 60.23 <2e-16 ***
## new_edu1 35.3897 0.8437 41.94 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 63.79 on 22877 degrees of freedom
## (255 observations deleted due to missingness)
## Multiple R-squared: 0.07141, Adjusted R-squared: 0.07137
## F-statistic: 1759 on 1 and 22877 DF, p-value: < 2.2e-16
#3) How would you interpret the coefficient of education?
# Familyincome = 36.41 + (35.39 x 1) = 71.80 (more than high school education)
# Familyincome = 36.41 + (35.39 x 0) = 36.41 (high school education or less)
# On average, respondents with more than a high school education reported a family income 35.39 higher (in the units of adjfinc) than respondents with a high school education or less.
#4) How would you interpret the intercept?
# The intercept (36.41) is the predicted mean family income of the reference group (new_edu = 0), that is, respondents with a high school education or less. It is not the income at zero years of education, because the predictor here is a dummy variable.
# The p-value for the new_edu1 coefficient (<2e-16) shows that the coefficient is significantly different from zero, so the difference in mean family income between the two education groups is statistically significant.
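With a single 0/1 predictor, the regression simply reproduces the two group means: the intercept is the mean of the reference group and the dummy coefficient is the difference between the group means. The sketch below illustrates this on simulated data; the names `group` and `income` and the generating values are hypothetical, not the PSID values.

```r
# Simulated data (hypothetical; not the PSID values)
set.seed(7)
group  <- factor(rep(c(0, 1), times = c(120, 80)))   # 0 = reference group
income <- ifelse(group == 1, 70, 36) + rnorm(200, sd = 15)

fit_dummy   <- lm(income ~ group)
group_means <- tapply(income, group, mean)

# Intercept = mean of the reference group (group 0)
stopifnot(isTRUE(all.equal(coef(fit_dummy)[["(Intercept)"]],
                           group_means[["0"]])))

# Dummy coefficient = difference between the two group means
stopifnot(isTRUE(all.equal(coef(fit_dummy)[["group1"]],
                           group_means[["1"]] - group_means[["0"]])))
```

This is why the coefficient is read as a difference in average income between the groups rather than as a multiplicative factor.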