Title: "Statistics"
Author:"Bryan Solomon"
Date: "08/12/2021"
library(dplyr)
library(janitor)
library(haven)
library(ggplot2)
library(scales)
library(haven, quietly = TRUE)
income = read_dta("C:\\Users\\bryan\\Downloads\\acs2019exam (4).dta")
names(income) = tolower(names(income))
head(income)
## # A tibble: 6 x 4
##          sex inctot fmarry degree
##    <dbl+lbl>  <dbl>  <dbl>  <dbl>
## 1 1 [male]      800      0      1
## 2 1 [male]    14400      0      3
## 3 1 [male]    19500      0      1
## 4 1 [male]    29500      0      2
## 5 1 [male]    19500      0      1
## 6 2 [female]  34000      0      1

#1. Plotting histograms for income by gender

library(ggplot2)
library(scales)
library(psych)
## Warning: package 'psych' was built under R version 4.1.2
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:scales':
## 
##     alpha, rescale
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
ggplot(data=income, mapping=aes(x = inctot)) +
  geom_histogram(fill = "white", colour = "black") +
  facet_grid(sex ~ .)+
  geom_histogram(binwidth=25000)+
ggtitle(label="Distribution of Personal Income by Gender")+
xlab(label="Personal Income")+
ylab(label="Frequency")+
scale_x_continuous(labels=label_dollar())
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

describeBy(income$inctot,income$sex)
## 
##  Descriptive statistics by group 
## group: 1
##    vars     n    mean       sd median  trimmed     mad min    max  range skew
## X1    1 48067 47771.3 51484.64  36000 39993.73 28169.4   4 874000 873996 5.12
##    kurtosis     se
## X1    44.91 234.83
## ------------------------------------------------------------ 
## group: 2
##    vars     n     mean       sd median  trimmed     mad min     max   range
## X1    1 40703 40917.57 41655.08  32000 34995.57 26686.8   4 1100000 1099996
##    skew kurtosis     se
## X1 5.48    62.48 206.47

Results of the histogram plots show that the distribution of personal income was higher for men at different income groups of $25000 than that of women. Particulartly, the distribution for income between $25000 and $50000 for men is around 17000 while that of women is around 16000, between $50000 and $75000 for men was 12500 while that of women was 10000, between $75000 and $100000 men recorded around 5500 while women were around 4000. Overall, the distribution of the personal income for both groups shows a positive skew with a longer right tail. #2.Distribution of log trasnformed personal income by gender

library(ggplot2)
library(scales)
library(psych)
income$logincome<-log(income$inctot)
ggplot(data=income, mapping=aes(x = logincome)) +
  geom_histogram(fill = "white", colour = "black") +
  facet_grid(sex ~ .)+
ggtitle(label="Distribution of Personal Income by Gender")+
xlab(label="Personal Income")+
ylab(label="Frequency")+
scale_x_continuous(labels=label_dollar())
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

the distribution of log transformed personal income shows a change in the shape of the histogram. The shape changes to a symmetric one with both groups showing normal distributions. The presentation is better than when raw personal income is used to map the distribution. #3. Distribution of the square root of the personal income by gender

library(ggplot2)
library(scales)
library(psych)
income$sqrtincome<-sqrt(income$inctot)
ggplot(data=income, mapping=aes(x = sqrtincome)) +
  geom_histogram(fill = "white", colour = "black") +
  facet_grid(sex ~ .)+
ggtitle(label="Distribution of Personal Income by Gender")+
xlab(label="Personal Income")+
ylab(label="Frequency")+
scale_x_continuous(labels=label_dollar())
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Distribution of the square root of income presents a clear visual with frequencies in the men personal income showing higher numbers than that of women in each income class. These histograms have better visuals than those of raw values of personal income.

#4. Association between salary class and fmarry

income$inccat<-cut(income$inctot, breaks = c(0,25000,50000,75000, 100000,125000,150000, Inf), right = FALSE)
chiassoc<-chisq.test(income$inccat,income$fmarry)
chiassoc$observed
##                     income$fmarry
## income$inccat            0     1
##   [0,2.5e+04)        29272  1098
##   [2.5e+04,5e+04)    27484  1850
##   [5e+04,7.5e+04)    14293  1647
##   [7.5e+04,1e+05)     5651   814
##   [1e+05,1.25e+05)    2826   447
##   [1.25e+05,1.5e+05)  1026   215
##   [1.5e+05,Inf)       1801   346

comparing using confidence intervals

library(dplyr)
table(income$inccat,income$fmarry)
##                     
##                          0     1
##   [0,2.5e+04)        29272  1098
##   [2.5e+04,5e+04)    27484  1850
##   [5e+04,7.5e+04)    14293  1647
##   [7.5e+04,1e+05)     5651   814
##   [1e+05,1.25e+05)    2826   447
##   [1.25e+05,1.5e+05)  1026   215
##   [1.5e+05,Inf)       1801   346
prop.test(c(29272,1098),c(82353,6417))
## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(29272, 1098) out of c(82353, 6417)
## X-squared = 897.95, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  0.1744764 0.1941986
## sample estimates:
##    prop 1    prop 2 
## 0.3554455 0.1711080
prop.test(c(27484,1850),c(82353,6417))
## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(27484, 1850) out of c(82353, 6417)
## X-squared = 55.345, df = 1, p-value = 1.011e-13
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  0.03381205 0.05706262
## sample estimates:
##    prop 1    prop 2 
## 0.3337340 0.2882967
prop.test(c(14293,1647),c(82353,6417))
## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(14293, 1647) out of c(82353, 6417)
## X-squared = 278.51, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.09418383 -0.07202469
## sample estimates:
##    prop 1    prop 2 
## 0.1735577 0.2566620
prop.test(c(5651,814),c(82353,6417))
## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(5651, 814) out of c(82353, 6417)
## X-squared = 298.09, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.06663913 -0.04982351
## sample estimates:
##     prop 1     prop 2 
## 0.06861924 0.12685055
chisq.test(income$inccat,income$fmarry)
## 
##  Pearson's Chi-squared test
## 
## data:  income$inccat and income$fmarry
## X-squared = 1777.8, df = 6, p-value < 2.2e-16

The shared distribution for income in married and the unmarried groups is different. Particularly, the proportion of those who are not married is higher in each income category than that of married persons. This shows that the income groups has a significant dependence or association with marriage,X-squared(6)=1777.8,p<.05. #5.Association of Income and gender for men and women

gendersplit=split(income, f=income$sex)
chfmen=chisq.test(gendersplit$`1`$inccat,gendersplit$`1`$fmarry)
chfmen
## 
##  Pearson's Chi-squared test
## 
## data:  gendersplit$`1`$inccat and gendersplit$`1`$fmarry
## X-squared = 1094.3, df = 6, p-value < 2.2e-16
chfmen$parameter
## df 
##  6
chfmen$observed
##                       gendersplit$`1`$fmarry
## gendersplit$`1`$inccat     0     1
##     [0,2.5e+04)        14747   504
##     [2.5e+04,5e+04)    14610   997
##     [5e+04,7.5e+04)     8140   915
##     [7.5e+04,1e+05)     3368   505
##     [1e+05,1.25e+05)    1786   279
##     [1.25e+05,1.5e+05)   622   149
##     [1.5e+05,Inf)       1214   231
chfwomen=chisq.test(gendersplit$`2`$inccat,gendersplit$`2`$fmarry)
chfwomen
## 
##  Pearson's Chi-squared test
## 
## data:  gendersplit$`2`$inccat and gendersplit$`2`$fmarry
## X-squared = 689.94, df = 6, p-value < 2.2e-16
chfwomen$parameter
## df 
##  6
chfwomen$observed
##                       gendersplit$`2`$fmarry
## gendersplit$`2`$inccat     0     1
##     [0,2.5e+04)        14525   594
##     [2.5e+04,5e+04)    12874   853
##     [5e+04,7.5e+04)     6153   732
##     [7.5e+04,1e+05)     2283   309
##     [1e+05,1.25e+05)    1040   168
##     [1.25e+05,1.5e+05)   404    66
##     [1.5e+05,Inf)        587   115

Gender seems to moderate the association between income groups and marriage because the chi-square value for association between marriage and income for men (X-sq(6)= 1094.3,p<.05) shows a higher value than the chisquare value of association between marriage and income groups for women (X-sq(6)=689.94,p<.05). However, the trend of association remains with higher records of income for the unmarried men and women than the married ones. #6. Association betweem Degree and marriage for men and women

library(dplyr)
library(sjPlot)
## Warning: package 'sjPlot' was built under R version 4.1.2
library(sjlabelled)
## Warning: package 'sjlabelled' was built under R version 4.1.2
## 
## Attaching package: 'sjlabelled'
## The following object is masked from 'package:ggplot2':
## 
##     as_label
## The following objects are masked from 'package:haven':
## 
##     as_factor, read_sas, read_spss, read_stata, write_sas, zap_labels
## The following object is masked from 'package:dplyr':
## 
##     as_label
library(emmeans)
## Warning: package 'emmeans' was built under R version 4.1.2
library(sjmisc)
## Warning: package 'sjmisc' was built under R version 4.1.2
## 
## Attaching package: 'sjmisc'
## The following objects are masked from 'package:janitor':
## 
##     remove_empty_cols, remove_empty_rows
library(effectsize)
## Warning: package 'effectsize' was built under R version 4.1.2
## 
## Attaching package: 'effectsize'
## The following object is masked from 'package:psych':
## 
##     phi
library(bayestestR)
## Warning: package 'bayestestR' was built under R version 4.1.2
library(bayesplot)
## Warning: package 'bayesplot' was built under R version 4.1.2
## This is bayesplot version 1.8.1
## - Online documentation and vignettes at mc-stan.org/bayesplot
## - bayesplot theme set to bayesplot::theme_default()
##    * Does _not_ affect other ggplot2 plots
##    * See ?bayesplot_theme_set for details on theme setting
library(ggeffects)
## Warning: package 'ggeffects' was built under R version 4.1.2
library(datawizard)
## Warning: package 'datawizard' was built under R version 4.1.2
## 
## Attaching package: 'datawizard'
## The following objects are masked from 'package:sjmisc':
## 
##     center, reshape_longer, to_numeric
## The following object is masked from 'package:sjlabelled':
## 
##     to_numeric
library(insight)
## Warning: package 'insight' was built under R version 4.1.2
## 
## Attaching package: 'insight'
## The following objects are masked from 'package:datawizard':
## 
##     data_match, data_relocate, data_restoretype, data_to_long,
##     data_to_wide, reshape_ci, reshape_longer, reshape_wider, to_numeric
## The following object is masked from 'package:bayestestR':
## 
##     reshape_ci
## The following objects are masked from 'package:sjmisc':
## 
##     reshape_longer, to_numeric
## The following object is masked from 'package:sjlabelled':
## 
##     to_numeric
## The following object is masked from 'package:janitor':
## 
##     clean_names
sjPlot::tab_xtab(var.row = gendersplit$`1`$degree, var.col = gendersplit$`1`$fmarry, title = "Education achievement and marriage for men", show.row.prc = TRUE)
Education achievement and marriage for men
1\(degree</th> <th style="border-top:double; text-align:center; font-style:italic; font-weight:normal;" colspan="2">`1`\)fmarry Total
0 1
1 4195
95.8 %
184
4.2 %
4379
100 %
2 12651
95.4 %
616
4.6 %
13267
100 %
3 13040
93.3 %
936
6.7 %
13976
100 %
4 10150
89.5 %
1191
10.5 %
11341
100 %
5 4451
87.2 %
653
12.8 %
5104
100 %
Total 44487
92.6 %
3580
7.4 %
48067
100 %
χ2=594.818 · df=4 · Cramer’s V=0.111 · p=0.000
sjPlot::tab_xtab(var.row=gendersplit$`2`$degree,var.col = gendersplit$`2`$fmarry,title= "Education achievement and marriage for women",show.row.prc=TRUE)
Education achievement and marriage for women
2\(degree</th> <th style="border-top:double; text-align:center; font-style:italic; font-weight:normal;" colspan="2">`2`\)fmarry Total
0 1
1 2620
97.2 %
76
2.8 %
2696
100 %
2 7866
96.7 %
271
3.3 %
8137
100 %
3 12011
95 %
637
5 %
12648
100 %
4 9738
90.3 %
1047
9.7 %
10785
100 %
5 5631
87.5 %
806
12.5 %
6437
100 %
Total 37866
93 %
2837
7 %
40703
100 %
χ2=741.417 · df=4 · Cramer’s V=0.135 · p=0.000
degfmen=chisq.test(gendersplit$`1`$degree,gendersplit$`1`$fmarry)
degfmen
## 
##  Pearson's Chi-squared test
## 
## data:  gendersplit$`1`$degree and gendersplit$`1`$fmarry
## X-squared = 594.82, df = 4, p-value < 2.2e-16
degfwomen=chisq.test(gendersplit$`2`$degree,gendersplit$`2`$fmarry)

Results show that there is significant association between educational achievement and marriage across men and women. However, the association is stronger for men (V=13.5%)than women(V=11.1%) as supported by V-Cramers value of the explained variation. #7. Effect of education or degree achievement on the chances of being married or not. A logistics regression approach

library(ggplot2)
gendersplit$`1`$degree=as.factor(gendersplit$`1`$degree)
gendersplit$`2`$degree=as.factor(gendersplit$`2`$degree)
fmenlogit=glm(fmarry~degree,data=gendersplit$`1`, family = "binomial",)
fmenlogit
## 
## Call:  glm(formula = fmarry ~ degree, family = "binomial", data = gendersplit$`1`)
## 
## Coefficients:
## (Intercept)      degree2      degree3      degree4      degree5  
##     -3.1267       0.1045       0.4926       0.9840       1.2074  
## 
## Degrees of Freedom: 48066 Total (i.e. Null);  48062 Residual
## Null Deviance:       25480 
## Residual Deviance: 24900     AIC: 24910
fwomenlogit=glm(fmarry~degree, data= gendersplit$`2`,family = "binomial")
fwomenlogit
## 
## Call:  glm(formula = fmarry ~ degree, family = "binomial", data = gendersplit$`2`)
## 
## Coefficients:
## (Intercept)      degree2      degree3      degree4      degree5  
##     -3.5402       0.1720       0.6034       1.3101       1.5962  
## 
## Degrees of Freedom: 40702 Total (i.e. Null);  40698 Residual
## Null Deviance:       20580 
## Residual Deviance: 19850     AIC: 19860

#8. Effects of income levels on the chances of being married or not. A logistics approach

incmenlogit=glm(fmarry~inctot,data=gendersplit$`1`,family = "binomial")
summary(incmenlogit)
## 
## Call:
## glm(formula = fmarry ~ inctot, family = "binomial", data = gendersplit$`1`)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9788  -0.3934  -0.3701  -0.3527   2.4002  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.823e+00  2.251e-02  -125.4   <2e-16 ***
## inctot       5.511e-06  2.296e-07    24.0   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 25483  on 48066  degrees of freedom
## Residual deviance: 24961  on 48065  degrees of freedom
## AIC: 24965
## 
## Number of Fisher Scoring iterations: 5
incwomenlogit=glm(fmarry~inctot,data=gendersplit$`2`,family="binomial")
summary(incwomenlogit)
## 
## Call:
## glm(formula = fmarry ~ inctot, family = "binomial", data = gendersplit$`2`)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7294  -0.3845  -0.3629  -0.3445   2.4172  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.866e+00  2.568e-02 -111.60   <2e-16 ***
## inctot       5.970e-06  3.279e-07   18.21   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 20585  on 40702  degrees of freedom
## Residual deviance: 20280  on 40701  degrees of freedom
## AIC: 20284
## 
## Number of Fisher Scoring iterations: 5

For women a unit percentage increase in income causes a increase in chances of being married while a unit percentage rise in income for men causes a increase in chances of being married #9.Effects of personal income level on the probability of being married. A logistics approach

lgincmenlogit=glm(fmarry~logincome,data=gendersplit$`1`,family = "binomial")
lgincwomenlogit=glm(fmarry~logincome,data=gendersplit$`2`,family = "binomial")
summary(lgincmenlogit)
## 
## Call:
## glm(formula = fmarry ~ logincome, family = "binomial", data = gendersplit$`1`)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.9287  -0.4338  -0.3689  -0.2871   3.5742  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -9.11565    0.22879  -39.84   <2e-16 ***
## logincome    0.62297    0.02112   29.50   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 25483  on 48066  degrees of freedom
## Residual deviance: 24446  on 48065  degrees of freedom
## AIC: 24450
## 
## Number of Fisher Scoring iterations: 6
summary(lgincwomenlogit)
## 
## Call:
## glm(formula = fmarry ~ logincome, family = "binomial", data = gendersplit$`2`)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8131  -0.4186  -0.3729  -0.3025   3.4848  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -7.47834    0.24292  -30.79   <2e-16 ***
## logincome    0.47022    0.02292   20.52   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 20585  on 40702  degrees of freedom
## Residual deviance: 20094  on 40701  degrees of freedom
## AIC: 20098
## 
## Number of Fisher Scoring iterations: 6

#10. Role of income in the association between marriage and educational achievement

rolinclogitmen=glm(fmarry~degree+logincome,data=gendersplit$`1`,family = "binomial")
rolinclogitwomen=glm(fmarry~degree+logincome,data=gendersplit$`2`,family = "binomial")
summary(rolinclogitmen)
## 
## Call:
## glm(formula = fmarry ~ degree + logincome, family = "binomial", 
##     data = gendersplit$`1`)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8854  -0.4357  -0.3570  -0.2821   3.4612  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -8.17931    0.24341 -33.603  < 2e-16 ***
## degree2     -0.04838    0.08649  -0.559   0.5759    
## degree3      0.20136    0.08376   2.404   0.0162 *  
## degree4      0.44902    0.08462   5.306 1.12e-07 ***
## degree5      0.55188    0.09089   6.072 1.26e-09 ***
## logincome    0.51125    0.02291  22.313  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 25483  on 48066  degrees of freedom
## Residual deviance: 24318  on 48061  degrees of freedom
## AIC: 24330
## 
## Number of Fisher Scoring iterations: 6
summary(rolinclogitwomen)
## 
## Call:
## glm(formula = fmarry ~ degree + logincome, family = "binomial", 
##     data = gendersplit$`2`)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.6749  -0.4532  -0.3300  -0.2659   3.0404  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -5.77977    0.26237 -22.029  < 2e-16 ***
## degree2      0.10229    0.13202   0.775 0.438482    
## degree3      0.45927    0.12416   3.699 0.000217 ***
## degree4      1.02216    0.12422   8.229  < 2e-16 ***
## degree5      1.24618    0.12722   9.795  < 2e-16 ***
## logincome    0.23643    0.02451   9.646  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 20585  on 40702  degrees of freedom
## Residual deviance: 19744  on 40697  degrees of freedom
## AIC: 19756
## 
## Number of Fisher Scoring iterations: 6

Lower educational achievement for both men and women does not result to significant effect when income level is accounted for. The other educational achievements are significant with reduced magnitude for predicting a marginal change in the probability of being married for both men and women. Since, income effect is significant in predicting probability of marriage for men and women, it plays a moderating effect; increasing the probability of marriage for men and reducing probability of marriage for marriage at different educational achievement levels.