final-exam-stats.knit

Title: "Statistics"
Author:"Bryan Solomon"
Date: "08/12/2021"

library(dplyr)
library(janitor)
library(haven)
library(ggplot2)
library(scales)

library(haven, quietly = TRUE)
income = read_dta("C:\\Users\\bryan\\Downloads\\acs2019exam (4).dta")
names(income) = tolower(names(income))
head(income)

## # A tibble: 6 x 4
##          sex inctot fmarry degree
##    <dbl+lbl>  <dbl>  <dbl>  <dbl>
## 1 1 [male]      800      0      1
## 2 1 [male]    14400      0      3
## 3 1 [male]    19500      0      1
## 4 1 [male]    29500      0      2
## 5 1 [male]    19500      0      1
## 6 2 [female]  34000      0      1

#1. Plotting histograms for income by gender

library(ggplot2)
library(scales)
library(psych)

## Warning: package 'psych' was built under R version 4.1.2

## 
## Attaching package: 'psych'

## The following objects are masked from 'package:scales':
## 
##     alpha, rescale

## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

ggplot(data=income, mapping=aes(x = inctot)) +
  geom_histogram(fill = "white", colour = "black") +
  facet_grid(sex ~ .)+
  geom_histogram(binwidth=25000)+
ggtitle(label="Distribution of Personal Income by Gender")+
xlab(label="Personal Income")+
ylab(label="Frequency")+
scale_x_continuous(labels=label_dollar())

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

describeBy(income$inctot,income$sex)

## 
##  Descriptive statistics by group 
## group: 1
##    vars     n    mean       sd median  trimmed     mad min    max  range skew
## X1    1 48067 47771.3 51484.64  36000 39993.73 28169.4   4 874000 873996 5.12
##    kurtosis     se
## X1    44.91 234.83
## ------------------------------------------------------------ 
## group: 2
##    vars     n     mean       sd median  trimmed     mad min     max   range
## X1    1 40703 40917.57 41655.08  32000 34995.57 26686.8   4 1100000 1099996
##    skew kurtosis     se
## X1 5.48    62.48 206.47

Results of the histogram plots show that the distribution of personal income was higher for men at different income groups of $25000 than that of women. Particulartly, the distribution for income between $25000 and $50000 for men is around 17000 while that of women is around 16000, between $50000 and $75000 for men was 12500 while that of women was 10000, between $75000 and $100000 men recorded around 5500 while women were around 4000. Overall, the distribution of the personal income for both groups shows a positive skew with a longer right tail. #2.Distribution of log trasnformed personal income by gender

library(ggplot2)
library(scales)
library(psych)
income$logincome<-log(income$inctot)
ggplot(data=income, mapping=aes(x = logincome)) +
  geom_histogram(fill = "white", colour = "black") +
  facet_grid(sex ~ .)+
ggtitle(label="Distribution of Personal Income by Gender")+
xlab(label="Personal Income")+
ylab(label="Frequency")+
scale_x_continuous(labels=label_dollar())

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

the distribution of log transformed personal income shows a change in the shape of the histogram. The shape changes to a symmetric one with both groups showing normal distributions. The presentation is better than when raw personal income is used to map the distribution. #3. Distribution of the square root of the personal income by gender

library(ggplot2)
library(scales)
library(psych)
income$sqrtincome<-sqrt(income$inctot)
ggplot(data=income, mapping=aes(x = sqrtincome)) +
  geom_histogram(fill = "white", colour = "black") +
  facet_grid(sex ~ .)+
ggtitle(label="Distribution of Personal Income by Gender")+
xlab(label="Personal Income")+
ylab(label="Frequency")+
scale_x_continuous(labels=label_dollar())

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Distribution of the square root of income presents a clear visual with frequencies in the men personal income showing higher numbers than that of women in each income class. These histograms have better visuals than those of raw values of personal income.

#4. Association between salary class and fmarry

income$inccat<-cut(income$inctot, breaks = c(0,25000,50000,75000, 100000,125000,150000, Inf), right = FALSE)
chiassoc<-chisq.test(income$inccat,income$fmarry)
chiassoc$observed

##                     income$fmarry
## income$inccat            0     1
##   [0,2.5e+04)        29272  1098
##   [2.5e+04,5e+04)    27484  1850
##   [5e+04,7.5e+04)    14293  1647
##   [7.5e+04,1e+05)     5651   814
##   [1e+05,1.25e+05)    2826   447
##   [1.25e+05,1.5e+05)  1026   215
##   [1.5e+05,Inf)       1801   346

comparing using confidence intervals

library(dplyr)
table(income$inccat,income$fmarry)

##                     
##                          0     1
##   [0,2.5e+04)        29272  1098
##   [2.5e+04,5e+04)    27484  1850
##   [5e+04,7.5e+04)    14293  1647
##   [7.5e+04,1e+05)     5651   814
##   [1e+05,1.25e+05)    2826   447
##   [1.25e+05,1.5e+05)  1026   215
##   [1.5e+05,Inf)       1801   346

prop.test(c(29272,1098),c(82353,6417))

## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(29272, 1098) out of c(82353, 6417)
## X-squared = 897.95, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  0.1744764 0.1941986
## sample estimates:
##    prop 1    prop 2 
## 0.3554455 0.1711080

prop.test(c(27484,1850),c(82353,6417))

## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(27484, 1850) out of c(82353, 6417)
## X-squared = 55.345, df = 1, p-value = 1.011e-13
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  0.03381205 0.05706262
## sample estimates:
##    prop 1    prop 2 
## 0.3337340 0.2882967

prop.test(c(14293,1647),c(82353,6417))

## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(14293, 1647) out of c(82353, 6417)
## X-squared = 278.51, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.09418383 -0.07202469
## sample estimates:
##    prop 1    prop 2 
## 0.1735577 0.2566620

prop.test(c(5651,814),c(82353,6417))

## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(5651, 814) out of c(82353, 6417)
## X-squared = 298.09, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.06663913 -0.04982351
## sample estimates:
##     prop 1     prop 2 
## 0.06861924 0.12685055

chisq.test(income$inccat,income$fmarry)

## 
##  Pearson's Chi-squared test
## 
## data:  income$inccat and income$fmarry
## X-squared = 1777.8, df = 6, p-value < 2.2e-16

The shared distribution for income in married and the unmarried groups is different. Particularly, the proportion of those who are not married is higher in each income category than that of married persons. This shows that the income groups has a significant dependence or association with marriage,X-squared(6)=1777.8,p<.05. #5.Association of Income and gender for men and women

gendersplit=split(income, f=income$sex)
chfmen=chisq.test(gendersplit$`1`$inccat,gendersplit$`1`$fmarry)
chfmen

## 
##  Pearson's Chi-squared test
## 
## data:  gendersplit$`1`$inccat and gendersplit$`1`$fmarry
## X-squared = 1094.3, df = 6, p-value < 2.2e-16

chfmen$parameter

## df 
##  6

chfmen$observed

##                       gendersplit$`1`$fmarry
## gendersplit$`1`$inccat     0     1
##     [0,2.5e+04)        14747   504
##     [2.5e+04,5e+04)    14610   997
##     [5e+04,7.5e+04)     8140   915
##     [7.5e+04,1e+05)     3368   505
##     [1e+05,1.25e+05)    1786   279
##     [1.25e+05,1.5e+05)   622   149
##     [1.5e+05,Inf)       1214   231

chfwomen=chisq.test(gendersplit$`2`$inccat,gendersplit$`2`$fmarry)
chfwomen

## 
##  Pearson's Chi-squared test
## 
## data:  gendersplit$`2`$inccat and gendersplit$`2`$fmarry
## X-squared = 689.94, df = 6, p-value < 2.2e-16

chfwomen$parameter

## df 
##  6

chfwomen$observed

##                       gendersplit$`2`$fmarry
## gendersplit$`2`$inccat     0     1
##     [0,2.5e+04)        14525   594
##     [2.5e+04,5e+04)    12874   853
##     [5e+04,7.5e+04)     6153   732
##     [7.5e+04,1e+05)     2283   309
##     [1e+05,1.25e+05)    1040   168
##     [1.25e+05,1.5e+05)   404    66
##     [1.5e+05,Inf)        587   115

Gender seems to moderate the association between income groups and marriage because the chi-square value for association between marriage and income for men (X-sq(6)= 1094.3,p<.05) shows a higher value than the chisquare value of association between marriage and income groups for women (X-sq(6)=689.94,p<.05). However, the trend of association remains with higher records of income for the unmarried men and women than the married ones. #6. Association betweem Degree and marriage for men and women

library(dplyr)
library(sjPlot)

## Warning: package 'sjPlot' was built under R version 4.1.2

library(sjlabelled)

## Warning: package 'sjlabelled' was built under R version 4.1.2

## 
## Attaching package: 'sjlabelled'

## The following object is masked from 'package:ggplot2':
## 
##     as_label

## The following objects are masked from 'package:haven':
## 
##     as_factor, read_sas, read_spss, read_stata, write_sas, zap_labels

## The following object is masked from 'package:dplyr':
## 
##     as_label

library(emmeans)

## Warning: package 'emmeans' was built under R version 4.1.2

library(sjmisc)

## Warning: package 'sjmisc' was built under R version 4.1.2

## 
## Attaching package: 'sjmisc'

## The following objects are masked from 'package:janitor':
## 
##     remove_empty_cols, remove_empty_rows

library(effectsize)

## Warning: package 'effectsize' was built under R version 4.1.2

## 
## Attaching package: 'effectsize'

## The following object is masked from 'package:psych':
## 
##     phi

library(bayestestR)

## Warning: package 'bayestestR' was built under R version 4.1.2

library(bayesplot)

## Warning: package 'bayesplot' was built under R version 4.1.2

## This is bayesplot version 1.8.1

## - Online documentation and vignettes at mc-stan.org/bayesplot

## - bayesplot theme set to bayesplot::theme_default()

##    * Does _not_ affect other ggplot2 plots

##    * See ?bayesplot_theme_set for details on theme setting

library(ggeffects)

## Warning: package 'ggeffects' was built under R version 4.1.2

library(datawizard)

## Warning: package 'datawizard' was built under R version 4.1.2

## 
## Attaching package: 'datawizard'

## The following objects are masked from 'package:sjmisc':
## 
##     center, reshape_longer, to_numeric

## The following object is masked from 'package:sjlabelled':
## 
##     to_numeric

library(insight)

## Warning: package 'insight' was built under R version 4.1.2

## 
## Attaching package: 'insight'

## The following objects are masked from 'package:datawizard':
## 
##     data_match, data_relocate, data_restoretype, data_to_long,
##     data_to_wide, reshape_ci, reshape_longer, reshape_wider, to_numeric

## The following object is masked from 'package:bayestestR':
## 
##     reshape_ci

## The following objects are masked from 'package:sjmisc':
## 
##     reshape_longer, to_numeric

## The following object is masked from 'package:sjlabelled':
## 
##     to_numeric

## The following object is masked from 'package:janitor':
## 
##     clean_names

sjPlot::tab_xtab(var.row = gendersplit$`1`$degree, var.col = gendersplit$`1`$fmarry, title = "Education achievement and marriage for men", show.row.prc = TRUE)

Education achievement and marriage for men
`1`$degree</th> <th style="border-top:double; text-align:center; font-style:italic; font-weight:normal;" colspan="2">`1`$fmarry	Total
	Total	0	1
1	4195 95.8 %	184 4.2 %	4379 100 %
2	12651 95.4 %	616 4.6 %	13267 100 %
3	13040 93.3 %	936 6.7 %	13976 100 %
4	10150 89.5 %	1191 10.5 %	11341 100 %
5	4451 87.2 %	653 12.8 %	5104 100 %
Total	44487 92.6 %	3580 7.4 %	48067 100 %
χ²=594.818 · df=4 · Cramer’s V=0.111 · p=0.000

sjPlot::tab_xtab(var.row=gendersplit$`2`$degree,var.col = gendersplit$`2`$fmarry,title= "Education achievement and marriage for women",show.row.prc=TRUE)

Education achievement and marriage for women
`2`$degree</th> <th style="border-top:double; text-align:center; font-style:italic; font-weight:normal;" colspan="2">`2`$fmarry	Total
	Total	0	1
1	2620 97.2 %	76 2.8 %	2696 100 %
2	7866 96.7 %	271 3.3 %	8137 100 %
3	12011 95 %	637 5 %	12648 100 %
4	9738 90.3 %	1047 9.7 %	10785 100 %
5	5631 87.5 %	806 12.5 %	6437 100 %
Total	37866 93 %	2837 7 %	40703 100 %
χ²=741.417 · df=4 · Cramer’s V=0.135 · p=0.000

degfmen=chisq.test(gendersplit$`1`$degree,gendersplit$`1`$fmarry)
degfmen

## 
##  Pearson's Chi-squared test
## 
## data:  gendersplit$`1`$degree and gendersplit$`1`$fmarry
## X-squared = 594.82, df = 4, p-value < 2.2e-16

degfwomen=chisq.test(gendersplit$`2`$degree,gendersplit$`2`$fmarry)

Results show that there is significant association between educational achievement and marriage across men and women. However, the association is stronger for men (V=13.5%)than women(V=11.1%) as supported by V-Cramers value of the explained variation. #7. Effect of education or degree achievement on the chances of being married or not. A logistics regression approach

library(ggplot2)
gendersplit$`1`$degree=as.factor(gendersplit$`1`$degree)
gendersplit$`2`$degree=as.factor(gendersplit$`2`$degree)
fmenlogit=glm(fmarry~degree,data=gendersplit$`1`, family = "binomial",)
fmenlogit

## 
## Call:  glm(formula = fmarry ~ degree, family = "binomial", data = gendersplit$`1`)
## 
## Coefficients:
## (Intercept)      degree2      degree3      degree4      degree5  
##     -3.1267       0.1045       0.4926       0.9840       1.2074  
## 
## Degrees of Freedom: 48066 Total (i.e. Null);  48062 Residual
## Null Deviance:       25480 
## Residual Deviance: 24900     AIC: 24910

fwomenlogit=glm(fmarry~degree, data= gendersplit$`2`,family = "binomial")
fwomenlogit

## 
## Call:  glm(formula = fmarry ~ degree, family = "binomial", data = gendersplit$`2`)
## 
## Coefficients:
## (Intercept)      degree2      degree3      degree4      degree5  
##     -3.5402       0.1720       0.6034       1.3101       1.5962  
## 
## Degrees of Freedom: 40702 Total (i.e. Null);  40698 Residual
## Null Deviance:       20580 
## Residual Deviance: 19850     AIC: 19860

#8. Effects of income levels on the chances of being married or not. A logistics approach

incmenlogit=glm(fmarry~inctot,data=gendersplit$`1`,family = "binomial")
summary(incmenlogit)

## 
## Call:
## glm(formula = fmarry ~ inctot, family = "binomial", data = gendersplit$`1`)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9788  -0.3934  -0.3701  -0.3527   2.4002  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.823e+00  2.251e-02  -125.4   <2e-16 ***
## inctot       5.511e-06  2.296e-07    24.0   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 25483  on 48066  degrees of freedom
## Residual deviance: 24961  on 48065  degrees of freedom
## AIC: 24965
## 
## Number of Fisher Scoring iterations: 5

incwomenlogit=glm(fmarry~inctot,data=gendersplit$`2`,family="binomial")
summary(incwomenlogit)

## 
## Call:
## glm(formula = fmarry ~ inctot, family = "binomial", data = gendersplit$`2`)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7294  -0.3845  -0.3629  -0.3445   2.4172  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.866e+00  2.568e-02 -111.60   <2e-16 ***
## inctot       5.970e-06  3.279e-07   18.21   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 20585  on 40702  degrees of freedom
## Residual deviance: 20280  on 40701  degrees of freedom
## AIC: 20284
## 
## Number of Fisher Scoring iterations: 5

For women a unit percentage increase in income causes a increase in chances of being married while a unit percentage rise in income for men causes a increase in chances of being married #9.Effects of personal income level on the probability of being married. A logistics approach

lgincmenlogit=glm(fmarry~logincome,data=gendersplit$`1`,family = "binomial")
lgincwomenlogit=glm(fmarry~logincome,data=gendersplit$`2`,family = "binomial")
summary(lgincmenlogit)

## 
## Call:
## glm(formula = fmarry ~ logincome, family = "binomial", data = gendersplit$`1`)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.9287  -0.4338  -0.3689  -0.2871   3.5742  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -9.11565    0.22879  -39.84   <2e-16 ***
## logincome    0.62297    0.02112   29.50   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 25483  on 48066  degrees of freedom
## Residual deviance: 24446  on 48065  degrees of freedom
## AIC: 24450
## 
## Number of Fisher Scoring iterations: 6

summary(lgincwomenlogit)

## 
## Call:
## glm(formula = fmarry ~ logincome, family = "binomial", data = gendersplit$`2`)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8131  -0.4186  -0.3729  -0.3025   3.4848  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -7.47834    0.24292  -30.79   <2e-16 ***
## logincome    0.47022    0.02292   20.52   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 20585  on 40702  degrees of freedom
## Residual deviance: 20094  on 40701  degrees of freedom
## AIC: 20098
## 
## Number of Fisher Scoring iterations: 6

#10. Role of income in the association between marriage and educational achievement

rolinclogitmen=glm(fmarry~degree+logincome,data=gendersplit$`1`,family = "binomial")
rolinclogitwomen=glm(fmarry~degree+logincome,data=gendersplit$`2`,family = "binomial")
summary(rolinclogitmen)

## 
## Call:
## glm(formula = fmarry ~ degree + logincome, family = "binomial", 
##     data = gendersplit$`1`)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8854  -0.4357  -0.3570  -0.2821   3.4612  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -8.17931    0.24341 -33.603  < 2e-16 ***
## degree2     -0.04838    0.08649  -0.559   0.5759    
## degree3      0.20136    0.08376   2.404   0.0162 *  
## degree4      0.44902    0.08462   5.306 1.12e-07 ***
## degree5      0.55188    0.09089   6.072 1.26e-09 ***
## logincome    0.51125    0.02291  22.313  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 25483  on 48066  degrees of freedom
## Residual deviance: 24318  on 48061  degrees of freedom
## AIC: 24330
## 
## Number of Fisher Scoring iterations: 6

summary(rolinclogitwomen)

## 
## Call:
## glm(formula = fmarry ~ degree + logincome, family = "binomial", 
##     data = gendersplit$`2`)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.6749  -0.4532  -0.3300  -0.2659   3.0404  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -5.77977    0.26237 -22.029  < 2e-16 ***
## degree2      0.10229    0.13202   0.775 0.438482    
## degree3      0.45927    0.12416   3.699 0.000217 ***
## degree4      1.02216    0.12422   8.229  < 2e-16 ***
## degree5      1.24618    0.12722   9.795  < 2e-16 ***
## logincome    0.23643    0.02451   9.646  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 20585  on 40702  degrees of freedom
## Residual deviance: 19744  on 40697  degrees of freedom
## AIC: 19756
## 
## Number of Fisher Scoring iterations: 6

Lower educational achievement for both men and women does not result to significant effect when income level is accounted for. The other educational achievements are significant with reduced magnitude for predicting a marginal change in the probability of being married for both men and women. Since, income effect is significant in predicting probability of marriage for men and women, it plays a moderating effect; increasing the probability of marriage for men and reducing probability of marriage for marriage at different educational achievement levels.