options(dplyr.summarise.inform = F)
library(dplyr)
prb<-read_csv("PRB2013.csv", col_names=T)
## 
## ── Column specification ─────────────────────────────────
## cols(
##   .default = col_double(),
##   Country = col_character(),
##   Continent = col_character(),
##   Region = col_character()
## )
## ℹ Use `spec()` for the full column specifications.
names(prb)<-tolower(names(prb))   
prb_new<-prb%>% 
 mutate(Africa=ifelse(prb$continent=="Africa",yes= "Africa",no= "Not Africa")) 
table(prb_new$Africa)
## 
##     Africa Not Africa 
##         55        153
#summary stats by group using group_by
options(dplyr.summarise.inform = F)
prb_new %>% 
  group_by(Africa) %>% 
summarise(means=mean(tfr, na.rm=T), sds=sd (tfr, na.rm=T), n=n())
## # A tibble: 2 x 4
##   Africa     means   sds     n
##   <chr>      <dbl> <dbl> <int>
## 1 Africa      4.61 1.42     55
## 2 Not Africa  2.25 0.889   153

Q1. What can we conclude from the summary statistics?

#The mean fertility rate for Africa is 4.612727  with a standard deviation of 1.4. The total sample of countries from Africa is 55. 
#The mean fertility rate for Non-African is 2.250980 with a standard deviation of 0.8. The total sample of non-African countries is 153. The means are very different between African and non-African countries. With Africa TRF of 4.6 and non-Africa TFR of 2.4, the difference is 2.4. That is a difference of 2.4 children per women on average. 

Q2. What can we conclude from the significant t-test?

#significant t-test
t.test(tfr~Africa, data=prb_new) 
## 
##  Welch Two Sample t-test
## 
## data:  tfr by Africa
## t = 11.559, df = 69.813, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.954226 2.769268
## sample estimates:
##     mean in group Africa mean in group Not Africa 
##                 4.612727                 2.250980
#Based on the data from the two-sample t-test, we have evidence to reject the null hypothesis because the t-value of 11.59 is greater than 1.96, degrees of freedom 69.8,  and the P-value ,< 2.2e-16, is less than 0.05. The mean for total fertility rate in African countries is 4.6 while the mean for non-African countries is 2.25.  The total fertility rate between African and Non-African countries is significant.

Q3. Now that you see an example, now your turn. Please conduct a significant test to examine the difference in infant mortality (imr) between Asian countries and African countries. a) provide summary statistics by group

prb_imr <- prb %>%
  transmute(
    continent=ifelse(continent=="Africa", "Africa" ,
                     ifelse(continent=="Asia", "Asia", NA)), 
    imr=imr
  )
View(prb_imr)
prb_imr <- na.omit(prb_imr)
prb_imr %>% 
  group_by(continent)%>%
  summarise(mean=mean(imr,na.rm=TRUE), sd=sd(imr, na.rm = TRUE), n=n())
## # A tibble: 2 x 4
##   continent  mean    sd     n
##   <chr>     <dbl> <dbl> <int>
## 1 Africa     57.8  28.4    55
## 2 Asia       24.0  20.3    51
#The summary statistics suggest that the means for infant mortality rate (imr) is different between Asian and African countries. The mean imr for Asia is 23.0, standard deviation is 20.30 and the number sample is 51. The mean imr for African countries is 57.83, standard deviation 29.4 and the number sample is 55. With Africa with an imr of 57.83 and Asia with an imr of 23.0, the difference is 34.8 in imr. 

b)provide boxplots by group

boxplot(imr~continent, data=prb)

e <- ggplot(prb, aes(x = continent, y = imr))
e + geom_boxplot() + 
  scale_x_discrete(limits=c("Africa", "Asia"))
## Warning: Removed 102 rows containing missing values (stat_boxplot).

  1. conduct t-test. Make sure you interpret the results thoroughly
t.test(imr~continent, data=prb_imr)
## 
##  Welch Two Sample t-test
## 
## data:  imr by continent
## t = 7.0987, df = 97.84, p-value = 2.02e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  24.38592 43.31119
## sample estimates:
## mean in group Africa   mean in group Asia 
##             57.83091             23.98235
#Based on the data from the two-tailed t-test, the t value is 7.1, degrees of freedom is 97.84 and the p value is 2.02e-10. The infant mortality rate mean for the African group is 57.83 and the infant mortality rate mean for the Asia is 23.98. The difference between the means is statistically significant. There is enough evidence to reject the null hypothesis.  H₀: μ₁ - μ₂ = 0     Ha: μ₁ - μ₂ ≠ 0 

Moving from two groups to multiple groups.

#summary statistics by group
prb%>%
 group_by(continent)%>%
 summarise(means=mean(tfr, na.rm=T), sds=sd(tfr, na.rm=T), n=n())
## # A tibble: 6 x 4
##   continent     means   sds     n
##   <chr>         <dbl> <dbl> <int>
## 1 Africa         4.61 1.42     55
## 2 Asia           2.52 1.03     51
## 3 Europe         1.55 0.228    45
## 4 North America  2.21 0.546    27
## 5 Oceania        3.18 0.901    17
## 6 South America  2.5  0.476    13

Q6. Based on the output above,how many groups are there? Describe the table briefly.

#There are six groups: Africa, Asia, Europe, North America, Oceania and South America. The table provides means, standard deviation and the sample number for each group. The group/continent with the highest fertility rate is Africa with a mean of 4.612727; the group/continent with the lowest fertility rate is Europe with a mean of 1.553333. 

Q7. Conduct the Anova test, and explain how did we reach to the F-value of 57.379.

at1 <- aov(tfr~continent, data=prb_new)
anova(at1)
## Analysis of Variance Table
## 
## Response: tfr
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## continent   5 266.61  53.322  57.379 < 2.2e-16 ***
## Residuals 202 187.72   0.929                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Anova compares 'Between Variance' to 'Within Variance' through a ratio called F. F=Msb/Msw, based on the F ratio formula we divided the mean square value of the between variance by the within variance, 53.322/0.929 = 57.379. P value is < 2.2e-16. 

Q8. Interpret the F-test (ANOVA test) results. Make sure you state the null and research hypotheses.

# Null hypothesis: There is no difference in mean among the six groups. 
# Alternative hypothesis: There is a difference in mean in at least one of the six groups.F value is  57.379. P value is < 2.2e-16. We reject the null hypothesis.              H₀: μ₁ - μ₂ = 0     Ha: μ₁ - μ₂ ≠ 0 

Q9. What’s the mean value for education and family income, respectively?

library(haven)
stata_PSID_w1 <- read_dta("stata_PSID_w1.dta")
View(stata_PSID_w1)

mean(stata_PSID_w1$educ, na.rm=T)
## [1] 13.03642
mean(stata_PSID_w1$adjfinc, na.rm=T)
## [1] 60.39013
#The mean value for education is 13.03 and the mean value for family income it is 60.39. 

Q10. Estimate the relationship between education (X) and family income (Y).

reg1 <- lm(adjfinc~educ, data = stata_PSID_w1)
summary(reg1)
## 
## Call:
## lm(formula = adjfinc ~ educ, data = stata_PSID_w1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -982.3  -31.8  -11.7   14.8 4962.3 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -36.7707     0.9331  -39.41   <2e-16 ***
## educ          7.4568     0.0696  107.14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 78.13 on 128843 degrees of freedom
##   (2516 observations deleted due to missingness)
## Multiple R-squared:  0.0818, Adjusted R-squared:  0.0818 
## F-statistic: 1.148e+04 on 1 and 128843 DF,  p-value: < 2.2e-16
with(stata_PSID_w1,plot(educ, adjfinc))
abline(reg1, col="purple", lwd=2)

10-1) How would you write the linear regression equation?

#y=-36.77 + 4.4568x + E 

10-2) Do you have any concerns that this model violates the regression assumptions?

#No concerns. 

10-3) What’s the R output of the regression analysis?

reg1 <- lm(adjfinc~educ, data = stata_PSID_w1)
summary(reg1)
## 
## Call:
## lm(formula = adjfinc ~ educ, data = stata_PSID_w1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -982.3  -31.8  -11.7   14.8 4962.3 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -36.7707     0.9331  -39.41   <2e-16 ***
## educ          7.4568     0.0696  107.14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 78.13 on 128843 degrees of freedom
##   (2516 observations deleted due to missingness)
## Multiple R-squared:  0.0818, Adjusted R-squared:  0.0818 
## F-statistic: 1.148e+04 on 1 and 128843 DF,  p-value: < 2.2e-16

10-4) How would you interpret the coefficient of education?

#The value for the education coefficient is 7.4568, for every unit increase in education the adjusted family income increases by $7,457. 

10-5) Show the analysis of variance table from this regression analysis.

anova(reg1)
## Analysis of Variance Table
## 
## Response: adjfinc
##               Df    Sum Sq  Mean Sq F value    Pr(>F)    
## educ           1  70070052 70070052   11479 < 2.2e-16 ***
## Residuals 128843 786481569     6104                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

10-6) What’s the value of SSE? What does it mean?

# The error is the difference between the observed value and the predicted value. The SSE is 786481569, its a measure of the total variances. 

Bonus Question 12 Estimate the relationship between education levels (more than high school>12 years versus equal or less than high school<=12) and family income (Y).

educb <-stata_PSID_w1 %>%   
  mutate (educc= ifelse(educ<=8,
                        c("less highschool"), c("more highschool")))
summary(educb)
##       year          sex                 age             marpi       
##  Min.   :2001   Length:131361      Min.   :  1.00   Min.   :0.0000  
##  1st Qu.:2003   Class :character   1st Qu.: 14.00   1st Qu.:0.0000  
##  Median :2007   Mode  :character   Median : 29.00   Median :0.0000  
##  Mean   :2006                      Mean   : 32.03   Mean   :0.4178  
##  3rd Qu.:2009                      3rd Qu.: 47.00   3rd Qu.:1.0000  
##  Max.   :2011                      Max.   :999.00   Max.   :4.0000  
##                                                     NA's   :28      
##       educ          adjfinc            pubhs             rnthlp       
##  Min.   : 0.00   Min.   :-929.60   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:12.00   1st Qu.:  24.04   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :12.00   Median :  45.18   Median :0.00000   Median :0.00000  
##  Mean   :13.04   Mean   :  60.39   Mean   :0.05301   Mean   :0.02409  
##  3rd Qu.:15.00   3rd Qu.:  75.31   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :20.00   Max.   :5044.84   Max.   :1.00000   Max.   :1.00000  
##  NA's   :2496    NA's   :48        NA's   :34        NA's   :48       
##     adjwlth1           adjwlth2        h_race_ethnic_new        id         
##  Min.   :-2467.18   Min.   :-2304.98   Length:131361      Min.   :   4003  
##  1st Qu.:    0.01   1st Qu.:    1.91   Class :character   1st Qu.:1269033  
##  Median :    9.98   Median :   32.80   Mode  :character   Median :2464171  
##  Mean   :  129.48   Mean   :  187.17                      Mean   :3014466  
##  3rd Qu.:   58.05   3rd Qu.:  143.55                      3rd Qu.:5381175  
##  Max.   :80199.41   Max.   :80303.23                      Max.   :6872185  
##  NA's   :48         NA's   :48                                             
##      race5          educc          
##  Min.   :1.000   Length:131361     
##  1st Qu.:3.000   Class :character  
##  Median :5.000   Mode  :character  
##  Mean   :3.927                     
##  3rd Qu.:5.000                     
##  Max.   :5.000                     
## 
table(educb$educc)
## 
## less highschool more highschool 
##            6594          122271
reg2 <- lm(adjfinc~educc, data=educb)
summary(reg2)
## 
## Call:
## lm(formula = adjfinc ~ educc, data = educb)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -991.7  -36.0  -14.2   14.6 4982.7 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            29.769      1.001   29.75   <2e-16 ***
## educcmore highschool   32.326      1.027   31.47   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 81.22 on 128843 degrees of freedom
##   (2516 observations deleted due to missingness)
## Multiple R-squared:  0.007627,   Adjusted R-squared:  0.007619 
## F-statistic: 990.3 on 1 and 128843 DF,  p-value: < 2.2e-16
  1. How would you write the equation?
#y=29.769 + 32.326x + E 
  1. What’s the R output of the regression analysis?
reg2 <- lm(adjfinc~educc, data=educb)
summary(reg2)
## 
## Call:
## lm(formula = adjfinc ~ educc, data = educb)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -991.7  -36.0  -14.2   14.6 4982.7 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            29.769      1.001   29.75   <2e-16 ***
## educcmore highschool   32.326      1.027   31.47   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 81.22 on 128843 degrees of freedom
##   (2516 observations deleted due to missingness)
## Multiple R-squared:  0.007627,   Adjusted R-squared:  0.007619 
## F-statistic: 990.3 on 1 and 128843 DF,  p-value: < 2.2e-16
  1. How would you interpret the coefficient of education?
# On average, a person with more than high school education will earn about $32,326 more in family income than their counterparts with less than high school education.
# ($62,095 = 29.769+32.326)
  1. How would you interpret the intercept?
#  A person with less than or equal to high school education, will earn on average $29,769.

Please type your answers, including my original code in a R-markdown file. When you submit, please submit the Rpub link.

#bet