Title: "Statistics"
Author:"Bryan Solomon"
Date: "08/12/2021"
library(dplyr)
library(janitor)
library(haven)
library(ggplot2)
library(scales)
library(haven, quietly = TRUE)
income = read_dta("C:\\Users\\bryan\\Downloads\\acs2019exam (4).dta")
names(income) = tolower(names(income))
head(income)
## # A tibble: 6 x 4
## sex inctot fmarry degree
## <dbl+lbl> <dbl> <dbl> <dbl>
## 1 1 [male] 800 0 1
## 2 1 [male] 14400 0 3
## 3 1 [male] 19500 0 1
## 4 1 [male] 29500 0 2
## 5 1 [male] 19500 0 1
## 6 2 [female] 34000 0 1
#1. Plotting histograms for income by gender
library(ggplot2)
library(scales)
library(psych)
## Warning: package 'psych' was built under R version 4.1.2
##
## Attaching package: 'psych'
## The following objects are masked from 'package:scales':
##
## alpha, rescale
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
ggplot(data=income, mapping=aes(x = inctot)) +
geom_histogram(fill = "white", colour = "black") +
facet_grid(sex ~ .)+
geom_histogram(binwidth=25000)+
ggtitle(label="Distribution of Personal Income by Gender")+
xlab(label="Personal Income")+
ylab(label="Frequency")+
scale_x_continuous(labels=label_dollar())
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
describeBy(income$inctot,income$sex)
##
## Descriptive statistics by group
## group: 1
## vars n mean sd median trimmed mad min max range skew
## X1 1 48067 47771.3 51484.64 36000 39993.73 28169.4 4 874000 873996 5.12
## kurtosis se
## X1 44.91 234.83
## ------------------------------------------------------------
## group: 2
## vars n mean sd median trimmed mad min max range
## X1 1 40703 40917.57 41655.08 32000 34995.57 26686.8 4 1100000 1099996
## skew kurtosis se
## X1 5.48 62.48 206.47
Results of the histogram plots show that the distribution of personal income was higher for men at different income groups of $25000 than that of women. Particulartly, the distribution for income between $25000 and $50000 for men is around 17000 while that of women is around 16000, between $50000 and $75000 for men was 12500 while that of women was 10000, between $75000 and $100000 men recorded around 5500 while women were around 4000. Overall, the distribution of the personal income for both groups shows a positive skew with a longer right tail. #2.Distribution of log trasnformed personal income by gender
library(ggplot2)
library(scales)
library(psych)
income$logincome<-log(income$inctot)
ggplot(data=income, mapping=aes(x = logincome)) +
geom_histogram(fill = "white", colour = "black") +
facet_grid(sex ~ .)+
ggtitle(label="Distribution of Personal Income by Gender")+
xlab(label="Personal Income")+
ylab(label="Frequency")+
scale_x_continuous(labels=label_dollar())
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
the distribution of log transformed personal income shows a change in the shape of the histogram. The shape changes to a symmetric one with both groups showing normal distributions. The presentation is better than when raw personal income is used to map the distribution. #3. Distribution of the square root of the personal income by gender
library(ggplot2)
library(scales)
library(psych)
income$sqrtincome<-sqrt(income$inctot)
ggplot(data=income, mapping=aes(x = sqrtincome)) +
geom_histogram(fill = "white", colour = "black") +
facet_grid(sex ~ .)+
ggtitle(label="Distribution of Personal Income by Gender")+
xlab(label="Personal Income")+
ylab(label="Frequency")+
scale_x_continuous(labels=label_dollar())
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Distribution of the square root of income presents a clear visual with frequencies in the men personal income showing higher numbers than that of women in each income class. These histograms have better visuals than those of raw values of personal income.
#4. Association between salary class and fmarry
income$inccat<-cut(income$inctot, breaks = c(0,25000,50000,75000, 100000,125000,150000, Inf), right = FALSE)
chiassoc<-chisq.test(income$inccat,income$fmarry)
chiassoc$observed
## income$fmarry
## income$inccat 0 1
## [0,2.5e+04) 29272 1098
## [2.5e+04,5e+04) 27484 1850
## [5e+04,7.5e+04) 14293 1647
## [7.5e+04,1e+05) 5651 814
## [1e+05,1.25e+05) 2826 447
## [1.25e+05,1.5e+05) 1026 215
## [1.5e+05,Inf) 1801 346
comparing using confidence intervals
library(dplyr)
table(income$inccat,income$fmarry)
##
## 0 1
## [0,2.5e+04) 29272 1098
## [2.5e+04,5e+04) 27484 1850
## [5e+04,7.5e+04) 14293 1647
## [7.5e+04,1e+05) 5651 814
## [1e+05,1.25e+05) 2826 447
## [1.25e+05,1.5e+05) 1026 215
## [1.5e+05,Inf) 1801 346
prop.test(c(29272,1098),c(82353,6417))
##
## 2-sample test for equality of proportions with continuity correction
##
## data: c(29272, 1098) out of c(82353, 6417)
## X-squared = 897.95, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## 0.1744764 0.1941986
## sample estimates:
## prop 1 prop 2
## 0.3554455 0.1711080
prop.test(c(27484,1850),c(82353,6417))
##
## 2-sample test for equality of proportions with continuity correction
##
## data: c(27484, 1850) out of c(82353, 6417)
## X-squared = 55.345, df = 1, p-value = 1.011e-13
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## 0.03381205 0.05706262
## sample estimates:
## prop 1 prop 2
## 0.3337340 0.2882967
prop.test(c(14293,1647),c(82353,6417))
##
## 2-sample test for equality of proportions with continuity correction
##
## data: c(14293, 1647) out of c(82353, 6417)
## X-squared = 278.51, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.09418383 -0.07202469
## sample estimates:
## prop 1 prop 2
## 0.1735577 0.2566620
prop.test(c(5651,814),c(82353,6417))
##
## 2-sample test for equality of proportions with continuity correction
##
## data: c(5651, 814) out of c(82353, 6417)
## X-squared = 298.09, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.06663913 -0.04982351
## sample estimates:
## prop 1 prop 2
## 0.06861924 0.12685055
chisq.test(income$inccat,income$fmarry)
##
## Pearson's Chi-squared test
##
## data: income$inccat and income$fmarry
## X-squared = 1777.8, df = 6, p-value < 2.2e-16
The shared distribution for income in married and the unmarried groups is different. Particularly, the proportion of those who are not married is higher in each income category than that of married persons. This shows that the income groups has a significant dependence or association with marriage,X-squared(6)=1777.8,p<.05. #5.Association of Income and gender for men and women
gendersplit=split(income, f=income$sex)
chfmen=chisq.test(gendersplit$`1`$inccat,gendersplit$`1`$fmarry)
chfmen
##
## Pearson's Chi-squared test
##
## data: gendersplit$`1`$inccat and gendersplit$`1`$fmarry
## X-squared = 1094.3, df = 6, p-value < 2.2e-16
chfmen$parameter
## df
## 6
chfmen$observed
## gendersplit$`1`$fmarry
## gendersplit$`1`$inccat 0 1
## [0,2.5e+04) 14747 504
## [2.5e+04,5e+04) 14610 997
## [5e+04,7.5e+04) 8140 915
## [7.5e+04,1e+05) 3368 505
## [1e+05,1.25e+05) 1786 279
## [1.25e+05,1.5e+05) 622 149
## [1.5e+05,Inf) 1214 231
chfwomen=chisq.test(gendersplit$`2`$inccat,gendersplit$`2`$fmarry)
chfwomen
##
## Pearson's Chi-squared test
##
## data: gendersplit$`2`$inccat and gendersplit$`2`$fmarry
## X-squared = 689.94, df = 6, p-value < 2.2e-16
chfwomen$parameter
## df
## 6
chfwomen$observed
## gendersplit$`2`$fmarry
## gendersplit$`2`$inccat 0 1
## [0,2.5e+04) 14525 594
## [2.5e+04,5e+04) 12874 853
## [5e+04,7.5e+04) 6153 732
## [7.5e+04,1e+05) 2283 309
## [1e+05,1.25e+05) 1040 168
## [1.25e+05,1.5e+05) 404 66
## [1.5e+05,Inf) 587 115
Gender seems to moderate the association between income groups and marriage because the chi-square value for association between marriage and income for men (X-sq(6)= 1094.3,p<.05) shows a higher value than the chisquare value of association between marriage and income groups for women (X-sq(6)=689.94,p<.05). However, the trend of association remains with higher records of income for the unmarried men and women than the married ones. #6. Association betweem Degree and marriage for men and women
library(dplyr)
library(sjPlot)
## Warning: package 'sjPlot' was built under R version 4.1.2
library(sjlabelled)
## Warning: package 'sjlabelled' was built under R version 4.1.2
##
## Attaching package: 'sjlabelled'
## The following object is masked from 'package:ggplot2':
##
## as_label
## The following objects are masked from 'package:haven':
##
## as_factor, read_sas, read_spss, read_stata, write_sas, zap_labels
## The following object is masked from 'package:dplyr':
##
## as_label
library(emmeans)
## Warning: package 'emmeans' was built under R version 4.1.2
library(sjmisc)
## Warning: package 'sjmisc' was built under R version 4.1.2
##
## Attaching package: 'sjmisc'
## The following objects are masked from 'package:janitor':
##
## remove_empty_cols, remove_empty_rows
library(effectsize)
## Warning: package 'effectsize' was built under R version 4.1.2
##
## Attaching package: 'effectsize'
## The following object is masked from 'package:psych':
##
## phi
library(bayestestR)
## Warning: package 'bayestestR' was built under R version 4.1.2
library(bayesplot)
## Warning: package 'bayesplot' was built under R version 4.1.2
## This is bayesplot version 1.8.1
## - Online documentation and vignettes at mc-stan.org/bayesplot
## - bayesplot theme set to bayesplot::theme_default()
## * Does _not_ affect other ggplot2 plots
## * See ?bayesplot_theme_set for details on theme setting
library(ggeffects)
## Warning: package 'ggeffects' was built under R version 4.1.2
library(datawizard)
## Warning: package 'datawizard' was built under R version 4.1.2
##
## Attaching package: 'datawizard'
## The following objects are masked from 'package:sjmisc':
##
## center, reshape_longer, to_numeric
## The following object is masked from 'package:sjlabelled':
##
## to_numeric
library(insight)
## Warning: package 'insight' was built under R version 4.1.2
##
## Attaching package: 'insight'
## The following objects are masked from 'package:datawizard':
##
## data_match, data_relocate, data_restoretype, data_to_long,
## data_to_wide, reshape_ci, reshape_longer, reshape_wider, to_numeric
## The following object is masked from 'package:bayestestR':
##
## reshape_ci
## The following objects are masked from 'package:sjmisc':
##
## reshape_longer, to_numeric
## The following object is masked from 'package:sjlabelled':
##
## to_numeric
## The following object is masked from 'package:janitor':
##
## clean_names
sjPlot::tab_xtab(var.row = gendersplit$`1`$degree, var.col = gendersplit$`1`$fmarry, title = "Education achievement and marriage for men", show.row.prc = TRUE)
1\(degree</th> <th style="border-top:double; text-align:center; font-style:italic; font-weight:normal;" colspan="2">`1`\)fmarry
|
Total | ||
|---|---|---|---|
| 0 | 1 | ||
| 1 |
4195 95.8 % |
184 4.2 % |
4379 100 % |
| 2 |
12651 95.4 % |
616 4.6 % |
13267 100 % |
| 3 |
13040 93.3 % |
936 6.7 % |
13976 100 % |
| 4 |
10150 89.5 % |
1191 10.5 % |
11341 100 % |
| 5 |
4451 87.2 % |
653 12.8 % |
5104 100 % |
| Total |
44487 92.6 % |
3580 7.4 % |
48067 100 % |
χ2=594.818 · df=4 · Cramer’s V=0.111 · p=0.000 |
sjPlot::tab_xtab(var.row=gendersplit$`2`$degree,var.col = gendersplit$`2`$fmarry,title= "Education achievement and marriage for women",show.row.prc=TRUE)
2\(degree</th> <th style="border-top:double; text-align:center; font-style:italic; font-weight:normal;" colspan="2">`2`\)fmarry
|
Total | ||
|---|---|---|---|
| 0 | 1 | ||
| 1 |
2620 97.2 % |
76 2.8 % |
2696 100 % |
| 2 |
7866 96.7 % |
271 3.3 % |
8137 100 % |
| 3 |
12011 95 % |
637 5 % |
12648 100 % |
| 4 |
9738 90.3 % |
1047 9.7 % |
10785 100 % |
| 5 |
5631 87.5 % |
806 12.5 % |
6437 100 % |
| Total |
37866 93 % |
2837 7 % |
40703 100 % |
χ2=741.417 · df=4 · Cramer’s V=0.135 · p=0.000 |
degfmen=chisq.test(gendersplit$`1`$degree,gendersplit$`1`$fmarry)
degfmen
##
## Pearson's Chi-squared test
##
## data: gendersplit$`1`$degree and gendersplit$`1`$fmarry
## X-squared = 594.82, df = 4, p-value < 2.2e-16
degfwomen=chisq.test(gendersplit$`2`$degree,gendersplit$`2`$fmarry)
Results show that there is significant association between educational achievement and marriage across men and women. However, the association is stronger for men (V=13.5%)than women(V=11.1%) as supported by V-Cramers value of the explained variation. #7. Effect of education or degree achievement on the chances of being married or not. A logistics regression approach
library(ggplot2)
gendersplit$`1`$degree=as.factor(gendersplit$`1`$degree)
gendersplit$`2`$degree=as.factor(gendersplit$`2`$degree)
fmenlogit=glm(fmarry~degree,data=gendersplit$`1`, family = "binomial",)
fmenlogit
##
## Call: glm(formula = fmarry ~ degree, family = "binomial", data = gendersplit$`1`)
##
## Coefficients:
## (Intercept) degree2 degree3 degree4 degree5
## -3.1267 0.1045 0.4926 0.9840 1.2074
##
## Degrees of Freedom: 48066 Total (i.e. Null); 48062 Residual
## Null Deviance: 25480
## Residual Deviance: 24900 AIC: 24910
fwomenlogit=glm(fmarry~degree, data= gendersplit$`2`,family = "binomial")
fwomenlogit
##
## Call: glm(formula = fmarry ~ degree, family = "binomial", data = gendersplit$`2`)
##
## Coefficients:
## (Intercept) degree2 degree3 degree4 degree5
## -3.5402 0.1720 0.6034 1.3101 1.5962
##
## Degrees of Freedom: 40702 Total (i.e. Null); 40698 Residual
## Null Deviance: 20580
## Residual Deviance: 19850 AIC: 19860
#8. Effects of income levels on the chances of being married or not. A logistics approach
incmenlogit=glm(fmarry~inctot,data=gendersplit$`1`,family = "binomial")
summary(incmenlogit)
##
## Call:
## glm(formula = fmarry ~ inctot, family = "binomial", data = gendersplit$`1`)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9788 -0.3934 -0.3701 -0.3527 2.4002
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.823e+00 2.251e-02 -125.4 <2e-16 ***
## inctot 5.511e-06 2.296e-07 24.0 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 25483 on 48066 degrees of freedom
## Residual deviance: 24961 on 48065 degrees of freedom
## AIC: 24965
##
## Number of Fisher Scoring iterations: 5
incwomenlogit=glm(fmarry~inctot,data=gendersplit$`2`,family="binomial")
summary(incwomenlogit)
##
## Call:
## glm(formula = fmarry ~ inctot, family = "binomial", data = gendersplit$`2`)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7294 -0.3845 -0.3629 -0.3445 2.4172
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.866e+00 2.568e-02 -111.60 <2e-16 ***
## inctot 5.970e-06 3.279e-07 18.21 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 20585 on 40702 degrees of freedom
## Residual deviance: 20280 on 40701 degrees of freedom
## AIC: 20284
##
## Number of Fisher Scoring iterations: 5
For women a unit percentage increase in income causes a increase in chances of being married while a unit percentage rise in income for men causes a increase in chances of being married #9.Effects of personal income level on the probability of being married. A logistics approach
lgincmenlogit=glm(fmarry~logincome,data=gendersplit$`1`,family = "binomial")
lgincwomenlogit=glm(fmarry~logincome,data=gendersplit$`2`,family = "binomial")
summary(lgincmenlogit)
##
## Call:
## glm(formula = fmarry ~ logincome, family = "binomial", data = gendersplit$`1`)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9287 -0.4338 -0.3689 -0.2871 3.5742
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.11565 0.22879 -39.84 <2e-16 ***
## logincome 0.62297 0.02112 29.50 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 25483 on 48066 degrees of freedom
## Residual deviance: 24446 on 48065 degrees of freedom
## AIC: 24450
##
## Number of Fisher Scoring iterations: 6
summary(lgincwomenlogit)
##
## Call:
## glm(formula = fmarry ~ logincome, family = "binomial", data = gendersplit$`2`)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8131 -0.4186 -0.3729 -0.3025 3.4848
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.47834 0.24292 -30.79 <2e-16 ***
## logincome 0.47022 0.02292 20.52 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 20585 on 40702 degrees of freedom
## Residual deviance: 20094 on 40701 degrees of freedom
## AIC: 20098
##
## Number of Fisher Scoring iterations: 6
#10. Role of income in the association between marriage and educational achievement
rolinclogitmen=glm(fmarry~degree+logincome,data=gendersplit$`1`,family = "binomial")
rolinclogitwomen=glm(fmarry~degree+logincome,data=gendersplit$`2`,family = "binomial")
summary(rolinclogitmen)
##
## Call:
## glm(formula = fmarry ~ degree + logincome, family = "binomial",
## data = gendersplit$`1`)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8854 -0.4357 -0.3570 -0.2821 3.4612
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.17931 0.24341 -33.603 < 2e-16 ***
## degree2 -0.04838 0.08649 -0.559 0.5759
## degree3 0.20136 0.08376 2.404 0.0162 *
## degree4 0.44902 0.08462 5.306 1.12e-07 ***
## degree5 0.55188 0.09089 6.072 1.26e-09 ***
## logincome 0.51125 0.02291 22.313 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 25483 on 48066 degrees of freedom
## Residual deviance: 24318 on 48061 degrees of freedom
## AIC: 24330
##
## Number of Fisher Scoring iterations: 6
summary(rolinclogitwomen)
##
## Call:
## glm(formula = fmarry ~ degree + logincome, family = "binomial",
## data = gendersplit$`2`)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.6749 -0.4532 -0.3300 -0.2659 3.0404
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.77977 0.26237 -22.029 < 2e-16 ***
## degree2 0.10229 0.13202 0.775 0.438482
## degree3 0.45927 0.12416 3.699 0.000217 ***
## degree4 1.02216 0.12422 8.229 < 2e-16 ***
## degree5 1.24618 0.12722 9.795 < 2e-16 ***
## logincome 0.23643 0.02451 9.646 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 20585 on 40702 degrees of freedom
## Residual deviance: 19744 on 40697 degrees of freedom
## AIC: 19756
##
## Number of Fisher Scoring iterations: 6
Lower educational achievement for both men and women does not result to significant effect when income level is accounted for. The other educational achievements are significant with reduced magnitude for predicting a marginal change in the probability of being married for both men and women. Since, income effect is significant in predicting probability of marriage for men and women, it plays a moderating effect; increasing the probability of marriage for men and reducing probability of marriage for marriage at different educational achievement levels.