#T-test
> library(tidyverse) # provides read_csv(), %>%, mutate(), group_by(), summarise()
> prb<-read_csv("PRB2013.csv", col_names=T)
Parsed with column specification:
cols(
.default = col_double(),
Country = col_character(),
Continent = col_character(),
Region = col_character()
)
See spec(...) for full column specifications.
> names(prb)<-tolower(names(prb))
> prb_new<-prb%>% #we are creating a new dataframe called prb_new
+ mutate(Africa=ifelse(continent=="Africa",yes= "Africa",no= "Not Africa")) # this new data frame contains an additional dummy variable, Africa
> #summary statistics by group using group_by
> prb_new%>%
+ group_by(Africa)%>%
+ summarise(means=mean(tfr, na.rm=T), sds=sd(tfr, na.rm=T), n=n())
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 2 x 4
Africa means sds n
<chr> <dbl> <dbl> <int>
1 Africa 4.61 1.42 55
2 Not Africa 2.25 0.889 153
> #significant t-test
> t.test(tfr~Africa, data=prb_new)
Welch Two Sample t-test
data: tfr by Africa
t = 11.559, df = 69.813, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
1.954226 2.769268
sample estimates:
mean in group Africa mean in group Not Africa
4.612727 2.250980
> #Q1. The summary statistics show that there are more 'Not Africa' countries (153) than 'Africa' countries (55). The 'Africa' group has a much higher mean TFR (4.61 vs. 2.25) and a larger SD (1.42 vs. 0.889), which indicates that TFR varies more across African countries.
> #Q2. Based on the t-test, the mean TFR differs significantly between the two groups: t = 11.56 on 69.8 df gives p < 2.2e-16, and the 95% confidence interval for the difference (1.95 to 2.77) excludes 0.
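> # The Welch statistic reported above can be approximately rebuilt from the group summaries. This is a sketch only: the SDs are the rounded values from the table, so the result matches the t.test() output only to about two digits. All variable names are illustrative.

```r
# Summary statistics copied from the summarise() and t.test() output above
mean_africa <- 4.612727; sd_africa <- 1.42;  n_africa <- 55
mean_other  <- 2.250980; sd_other  <- 0.889; n_other  <- 153

# Squared standard error contributed by each group
v_africa <- sd_africa^2 / n_africa
v_other  <- sd_other^2 / n_other

# Welch t statistic and Welch-Satterthwaite degrees of freedom
t_stat   <- (mean_africa - mean_other) / sqrt(v_africa + v_other)
df_welch <- (v_africa + v_other)^2 /
  (v_africa^2 / (n_africa - 1) + v_other^2 / (n_other - 1))

# Two-sided p-value from the t distribution
p_value <- 2 * pt(abs(t_stat), df = df_welch, lower.tail = FALSE)

print(c(t = t_stat, df = df_welch, p = p_value))
```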
> prb_new2<-prb%>%
+ mutate(
+ Af_As=ifelse(continent=="Africa","Africa",
+ ifelse(continent=="Asia","Asia",NA))
+
+ )
> prb_new2 <-prb_new2 %>% na.omit(prb_new2$Af_As)
> prb_new2%>%
+ group_by(Af_As)%>%
+ summarise(means=mean(imr, na.rm=T), sds=sd(imr, na.rm=T), n=n())
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 2 x 4
Af_As means sds n
<chr> <dbl> <dbl> <int>
1 Africa 63.6 25.2 40
2 Asia 29.0 21.9 22
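> # Note on the group sizes: na.omit() on a data frame drops every row that has an NA in any column, and extra arguments such as prb_new2$Af_As are ignored. That appears to be why only 40 African countries remain here, against 55 in the TFR table. The sketch below shows the difference on a made-up toy data frame (not the PRB data); the object names are illustrative.

```r
# Toy data, not the PRB file: one NA in the grouping column, one NA elsewhere
toy <- data.frame(
  group = c("Africa", "Asia", NA, "Asia"),
  imr   = c(60, NA, 15, 30)
)

# Same pattern as the call above: the second argument has no effect,
# so rows with an NA in ANY column are removed
dropped_any <- na.omit(toy, toy$group)

# Keeps every row whose grouping value is known, even if imr is missing
dropped_group_only <- toy[!is.na(toy$group), ]

print(nrow(dropped_any))         # rows with no NA in any column
print(nrow(dropped_group_only))  # rows with a non-missing group
```

> # With dplyr, filter(!is.na(Af_As)) would be the equivalent of the second approach; the counts and means in this section would then differ from those shown.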
> boxplot(prb_new2$imr~prb_new2$Af_As)
> t.test(imr~Af_As, data=prb_new2)
Welch Two Sample t-test
data: imr by Af_As
t = 5.6515, df = 48.856, p-value = 8.087e-07
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
22.32233 46.95948
sample estimates:
mean in group Africa mean in group Asia
63.65000 29.00909
> #Q3.a The summary statistics show that, among the countries left after na.omit(), there are more African (40) than Asian (22) countries. African countries have a much higher mean IMR (63.6 vs. 29.0) and a slightly larger SD (25.2 vs. 21.9), indicating somewhat more variability in IMR.
>
> #Q3.b The boxplots show that Africa has a higher median IMR and a greater spread, consistent with its higher mean and SD. The data for Asian countries appear right-skewed.
>
> #Q3.c Based on the t-test, the mean IMR differs significantly between the two groups: t = 5.65 on 48.86 df gives p = 8.087e-07, and the 95% confidence interval for the difference (22.3 to 47.0) excludes 0.
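> # As a check on the interval reported by t.test(), the 95% confidence interval can be rebuilt from the printed means, t statistic, and degrees of freedom. This is a sketch using the rounded console values, so agreement is only to about two decimals; the variable names are illustrative.

```r
# Values copied from the t.test() output above
mean_africa <- 63.65000
mean_asia   <- 29.00909
t_stat      <- 5.6515
df_welch    <- 48.856

diff_means <- mean_africa - mean_asia   # estimated difference in mean IMR
se_diff    <- diff_means / t_stat       # standard error implied by t = diff / se
t_crit     <- qt(0.975, df = df_welch)  # critical value for a 95% interval

# Interval: estimate plus or minus critical value times standard error
ci <- diff_means + c(-1, 1) * t_crit * se_diff
print(round(ci, 2))
```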
#Moving from two groups to multiple groups
> #summary statistics by group
> prb%>%
+ group_by(continent)%>%
+ summarise(means=mean(tfr, na.rm=T), sds=sd(tfr, na.rm=T), n=n())
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 6 x 4
continent means sds n
<chr> <dbl> <dbl> <int>
1 Africa 4.61 1.42 55
2 Asia 2.52 1.03 51
3 Europe 1.55 0.228 45
4 North America 2.21 0.546 27
5 Oceania 3.18 0.901 17
6 South America 2.5 0.476 13
> #Q6. There are six groups. The table reports summary statistics (n, mean, and standard deviation) of the total fertility rate (TFR) by continent. Africa has the highest mean TFR (4.61) and Europe has the lowest (1.55).
#ANOVA
> m1<-aov(tfr~continent, data=prb)
> m1
Call:
aov(formula = tfr ~ continent, data = prb)
Terms:
continent Residuals
Sum of Squares 266.6086 187.7163
Deg. of Freedom 5 202
Residual standard error: 0.9639962
Estimated effects may be unbalanced
> anova(m1)
Analysis of Variance Table
Response: tfr
Df Sum Sq Mean Sq F value Pr(>F)
continent 5 266.61 53.322 57.379 < 2.2e-16 ***
Residuals 202 187.72 0.929
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> #Q7. The F-ratio is the between-groups mean square divided by the within-groups mean square. Each mean square is a sum of squares divided by its degrees of freedom: 266.61/5 = 53.322 and 187.72/202 = 0.929, so F is about 57.4.
> #Q8. Null: mu1=mu2=...=mu6 (all six continents have the same mean TFR) --- Alt: at least one continent's mean TFR differs from the others.
> #With F = 57.38 on 5 and 202 df and p < 2.2e-16, we reject the null in favor of the alternative.
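> # The F-ratio calculation can be reproduced from the sums of squares printed by aov(). A minimal sketch, with illustrative variable names:

```r
# Sums of squares and degrees of freedom copied from the aov() output above
ss_between <- 266.6086; df_between <- 5
ss_within  <- 187.7163; df_within  <- 202

# Mean square = sum of squares / degrees of freedom
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within

# F-ratio and its upper-tail p-value
f_ratio <- ms_between / ms_within
p_value <- pf(f_ratio, df_between, df_within, lower.tail = FALSE)

# The residual standard error printed by aov() is the square root of ms_within
resid_se <- sqrt(ms_within)

print(c(MS_between = ms_between, MS_within = ms_within,
        F = f_ratio, resid_se = resid_se))
```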
#Simple Regression Analysis
> library(haven) # provides read_dta()
> psid<- read_dta("psid2013.dta")
> names(psid)<-tolower(names(psid))
> #summary statistics by group
> psid%>%
+ summarise(educ_mean=mean(educ, na.rm=T),adjfinc_mean=mean(adjfinc, na.rm=T), n=n())
# A tibble: 1 x 3
educ_mean adjfinc_mean n
<dbl> <dbl> <int>
1 13.4 54.4 23134
> #Q9. mean education=13.4(years) --- mean family income=54.4(thousands)
> library(broom)
> fit<-lm(adjfinc~educ, data=psid)
> summary(fit)
Call:
lm(formula = adjfinc ~ educ, data = psid)
Residuals:
Min 1Q Median 3Q Max
-101.74 -28.94 -10.36 15.13 1751.25
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -41.1346 1.7674 -23.27 <2e-16 ***
educ 7.1438 0.1285 55.59 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 62.07 on 22959 degrees of freedom
(173 observations deleted due to missingness)
Multiple R-squared: 0.1186, Adjusted R-squared: 0.1186
F-statistic: 3091 on 1 and 22959 DF, p-value: < 2.2e-16
> anova(fit)
Analysis of Variance Table
Response: adjfinc
Df Sum Sq Mean Sq F value Pr(>F)
educ 1 11907473 11907473 3090.8 < 2.2e-16 ***
Residuals 22959 88450801 3853
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> #Q10.1 y=a+bx+e
> #Q10.2 Yes, the educ variable is not continuous.
> #Q10.3 See markdown file output --- Equation: family income=-41.1+7.1(educ)
> #Q10.4 Each additional year of education is associated with an estimated increase of 7.14 thousand (about 7,144 dollars) in family income.
> #Q10.5 See markdown file output
> #Q10.6 SSE=88450801 (the Residuals row of the ANOVA table; 11907473 is the sum of squares explained by educ) --- The sum of squares error (SSE) is the sum of squared differences between the observed values and the predicted values.
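> # The two rows of the anova(fit) table are enough to recover R-squared, the residual standard error, and the F statistic shown by summary(fit). A minimal sketch with illustrative variable names:

```r
# Sums of squares and degrees of freedom copied from anova(fit) above
ss_model <- 11907473; df_model <- 1      # explained by educ
ss_resid <- 88450801; df_resid <- 22959  # residual (error) sum of squares

# Share of total variation in adjfinc explained by educ
r_squared <- ss_model / (ss_model + ss_resid)

# Residual standard error: square root of the residual mean square
resid_se <- sqrt(ss_resid / df_resid)

# F statistic: model mean square over residual mean square
f_stat <- (ss_model / df_model) / (ss_resid / df_resid)

print(c(R2 = r_squared, resid_se = resid_se, F = f_stat))
```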
#Bonus attempt 1
> psid2<- psid%>%
+ mutate(
+ educ_gt12=ifelse(educ > 12,educ,0),
+ educ_le12=ifelse(educ <= 12,educ,0)
+ )
>
> fit2<-lm(adjfinc~educ_gt12+educ_le12, data=psid2)
> summary(fit2)
Call:
lm(formula = adjfinc ~ educ_gt12 + educ_le12, data = psid2)
Residuals:
Min 1Q Median 3Q Max
-100.46 -28.32 -9.86 15.56 1752.52
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -26.7218 2.5168 -10.62 <2e-16 ***
educ_gt12 6.3592 0.1613 39.44 <2e-16 ***
educ_le12 5.6301 0.2280 24.70 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 61.98 on 22958 degrees of freedom
(173 observations deleted due to missingness)
Multiple R-squared: 0.1211, Adjusted R-squared: 0.121
F-statistic: 1582 on 2 and 22958 DF, p-value: < 2.2e-16
> anova(fit2)
Analysis of Variance Table
Response: adjfinc
Df Sum Sq Mean Sq F value Pr(>F)
educ_gt12 1 9812082 9812082 2553.95 < 2.2e-16 ***
educ_le12 1 2343318 2343318 609.93 < 2.2e-16 ***
Residuals 22958 88202875 3842
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> #Q12.1 y=a+b1x1+b2x2+e
> #Q12.2 See markdown file output
> #Q12.3 Above 12 years, each additional year of education is associated with an estimated 6,359 dollar increase in family income; up to 12 years, each additional year is associated with an estimated 5,630 dollar increase.
> #Q12.4 The amount of family income when education is 0
>
> # I don't believe this attempt is correct.
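> # One possible reason for that doubt: because each variable is set to 0 outside its range, the fitted line is not continuous at 12 years. The sketch below plugs the coefficients from summary(fit2) into the fitted equation; predict_income() is a hypothetical helper written for this illustration only.

```r
# Coefficients copied from summary(fit2) above
b0     <- -26.7218
b_gt12 <- 6.3592
b_le12 <- 5.6301

# Hypothetical helper: fitted income (in thousands) under the attempt-1 coding
predict_income <- function(educ) {
  b0 +
    b_gt12 * ifelse(educ > 12, educ, 0) +
    b_le12 * ifelse(educ <= 12, educ, 0)
}

pred <- predict_income(c(11, 12, 13, 14))
print(round(pred, 2))        # fitted values at 11, 12, 13, 14 years
print(round(diff(pred), 2))  # year-to-year changes; the 12-to-13 step is a jump
```

> # The change from 12 to 13 years is far larger than either slope, so the slopes alone do not describe the step across the 12-year threshold.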
#Bonus attempt 2
> psid3<- psid%>%
+ mutate(
+ educ2=ifelse(educ > 12,"gtHS","leHS")
+ )
>
> fit3<-lm(adjfinc~educ2, data=psid3)
> summary(fit3)
Call:
lm(formula = adjfinc ~ educ2, data = psid3)
Residuals:
Min 1Q Median 3Q Max
-89.89 -28.62 -10.72 15.16 1781.19
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 71.7976 0.5877 122.16 <2e-16 ***
educ2leHS -35.5063 0.8408 -42.23 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 63.69 on 22959 degrees of freedom
(173 observations deleted due to missingness)
Multiple R-squared: 0.07207, Adjusted R-squared: 0.07203
F-statistic: 1783 on 1 and 22959 DF, p-value: < 2.2e-16
> #Q12.1 y=a+bx+e
> #Q12.2 See markdown file output
> #Q12.3 Families with 12 or fewer years of education are estimated to have, on average, 35,506 dollars less family income than those with more than 12 years.
> #Q12.4 The intercept (71.8, about 71,800 dollars) is the predicted family income when the dummy educ2leHS equals 0. Because educ2 is binary, that is the mean income of the reference group with more than 12 years of education, not the income at 0 years of education.
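> # With a single binary predictor, the two fitted values are simply the two group means. A short sketch using the coefficients printed by summary(fit3); the variable names are illustrative.

```r
# Coefficients copied from summary(fit3) above (income in thousands)
intercept <- 71.7976   # reference group: gtHS (more than 12 years)
coef_leHS <- -35.5063  # shift for the leHS group (12 years or fewer)

mean_gtHS <- intercept              # dummy educ2leHS = 0
mean_leHS <- intercept + coef_leHS  # dummy educ2leHS = 1

print(c(gtHS = mean_gtHS, leHS = mean_leHS))
```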