MTH245 Group-A Analysis Outline
Elizabeth Yun, Mohini Banerjee, Kim Chi Ngo

Title:
Entering the Labor Force: What Are My Chances?

require(mosaic)

Main Questions:
What effects did age have on unemployment status in the U.S. from 2010-2011 in relation to one’s education level and one’s sex?
What was the probability of being unemployed in the U.S. from 2010-2011 given one’s education level?

Subquestion:
What was the relationship between age and education level in the U.S. from 2010-2011?

After first describing our univariate variables of employment status (Empstat), education level (Educ), age (Age), and sex (Sex) we will use the descriptive statistics and graphs outlined here to provide a picture of how these explanatory variables are distributed. In order to describe relationships between the explanatory variables we will begin with the distribution of Educ and Age alongside one another. Then we will fit two logistic regression models, first looking at Empstat and Age controlling for Educ. The second will look at Empstat and Age, controlling for Sex. Finally we plan to look at a two-way table comparing Empstat and Educ in order to calculate conditional probabilities of being unemployed given certain education levels.

file1 = "Group-A Unemployment Updated.csv"
emp = read.csv(file1)
emp10 = subset(emp, Year == 2011)
emp11 = subset(emp, Year == 2010)

Single-Variable Distributions:

  1. Employment status (categorical response)

Visual: Bar Graphs
Numerical: Frequency Tables

bargraph(~Empstat, data = emp10)

plot of chunk unnamed-chunk-3

bargraph(~Empstat, data = emp11)

plot of chunk unnamed-chunk-3

tally(~Empstat, data = emp10)
## 
##   employed unemployed      Total 
##      90653       5528      96181
tally(~Empstat, data = emp11)
## 
##   employed unemployed      Total 
##      92507       6251      98758
  1. Age (quantitative explanatory)

Visual: density plots
Numerical: descriptive statistics

densityplot(~Age, data = emp10)

plot of chunk unnamed-chunk-4

densityplot(~Age, data = emp11)

plot of chunk unnamed-chunk-4

favstats(~Age, data = emp10)
##  min Q1 median Q3 max  mean    sd     n missing
##   15 30     41 51  85 41.29 13.55 96181       0
favstats(~Age, data = emp11)
##  min Q1 median Q3 max  mean    sd     n missing
##   15 30     41 51  85 40.99 13.45 98758       0
  1. Education (categorical explanatory) visual: bar graph numerical: frequency table
bargraph(~Educ, data = emp10)

plot of chunk unnamed-chunk-5

bargraph(~Educ, data = emp11)

plot of chunk unnamed-chunk-5

tally(~Educ, data = emp10)
## 
##     assoc        ba   college doctorate    hsgrad    jrhigh        ma 
##      9923     20155     18005      1602     26764      1135      7877 
##      none   primary      prof    somehs     Total 
##       201      1821      1525      7173     96181
tally(~Educ, data = emp11)
## 
##     assoc        ba   college doctorate    hsgrad    jrhigh        ma 
##      9930     20362     18544      1507     28127      1276      7576 
##      none   primary      prof    somehs     Total 
##       185      1813      1551      7887     98758
  1. Sex (categorical explanatory) visual: bar graph numerical: frequency table
bargraph(~Sex, data = emp10)

plot of chunk unnamed-chunk-6

bargraph(~Sex, data = emp11)

plot of chunk unnamed-chunk-6

tally(~Sex, data = emp10)
## 
## Female   Male  Total 
##  46126  50055  96181
tally(~Sex, data = emp11)
## 
## Female   Male  Total 
##  47396  51362  98758

Variable Relationship Analyses:

1. Distribution of Age (quantitative) and Education (categorical)
visual: boxplot
numerical: descriptive statistics
inferential: maximum, Q1, mean, Q3, and, minimum of age by each education level

2010

bwplot(Age ~ Educ, data = emp10)

plot of chunk unnamed-chunk-7

favstats(Age ~ Educ, data = emp10)
##           min Q1 median Q3 max  mean    sd     n missing
## assoc      15 32     42 51  85 42.20 11.97  9923       0
## ba         15 32     41 51  85 42.21 12.11 20155       0
## college    15 26     38 50  85 39.02 14.18 18005       0
## doctorate  18 38     47 57  85 47.54 12.21  1602       0
## hsgrad     15 31     42 52  85 41.70 13.62 26764       0
## jrhigh     15 30     41 53  85 41.88 16.12  1135       0
## ma         16 36     45 54  85 45.36 11.64  7877       0
## none       16 36     48 55  80 46.41 13.81   201       0
## primary    15 34     44 53  85 43.76 12.47  1821       0
## prof       22 37     46 56  85 46.96 12.49  1525       0
## somehs     15 18     31 46  85 33.59 15.54  7173       0

2011

bwplot(Age ~ Educ, data = emp11)

plot of chunk unnamed-chunk-8

favstats(Age ~ Educ, data = emp11)
##           min Q1 median    Q3 max  mean    sd     n missing
## assoc      15 33     42 51.00  85 42.16 11.77  9930       0
## ba         15 32     41 51.00  85 41.96 12.00 20362       0
## college    15 27     39 50.00  85 39.11 13.97 18544       0
## doctorate  18 38     47 56.00  85 47.72 12.02  1507       0
## hsgrad     15 30     42 51.00  85 41.42 13.46 28127       0
## jrhigh     15 29     40 51.25  85 40.67 15.66  1276       0
## ma         15 36     44 54.00  85 45.07 11.57  7576       0
## none       16 32     45 55.00  80 43.98 14.46   185       0
## primary    15 34     43 52.00  85 43.51 12.62  1813       0
## prof       19 37     46 56.00  80 46.78 12.05  1551       0
## somehs     15 18     30 45.00  85 32.92 15.38  7887       0

2. Relationship between Employment (categorical) and Age (quantitative), controlling for Education (categorical)
visual: boxplot
numerical: descriptive statistics, summary and anova table
inferential: logistic regression of employment and age, controlling for education level

2010

y = ifelse(emp10$Empstat == "employed", 1, 0)
mod1a = glm(y ~ Age + Educ, data = emp10, family = "binomial")
summary(mod1a)
## 
## Call:
## glm(formula = y ~ Age + Educ, family = "binomial", data = emp10)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -3.181   0.226   0.295   0.382   0.715  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    1.96205    0.06768   28.99  < 2e-16 ***
## Age            0.03147    0.00112   28.16  < 2e-16 ***
## Educba         0.21013    0.06637    3.17   0.0015 ** 
## Educcollege   -0.32663    0.06164   -5.30  1.2e-07 ***
## Educdoctorate  1.03368    0.23659    4.37  1.2e-05 ***
## Educhsgrad    -0.64482    0.05765  -11.19  < 2e-16 ***
## Educjrhigh    -0.86442    0.11721   -7.37  1.6e-13 ***
## Educma         0.57416    0.09656    5.95  2.8e-09 ***
## Educnone      -0.75026    0.28373   -2.64   0.0082 ** 
## Educprimary   -0.87358    0.10030   -8.71  < 2e-16 ***
## Educprof       0.90353    0.22602    4.00  6.4e-05 ***
## Educsomehs    -1.20062    0.06319  -19.00  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 42312  on 96180  degrees of freedom
## Residual deviance: 39701  on 96169  degrees of freedom
## AIC: 39725
## 
## Number of Fisher Scoring iterations: 6
anova(mod1a, test = "Chisq")
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: y
## 
## Terms added sequentially (first to last)
## 
## 
##      Df Deviance Resid. Df Resid. Dev Pr(>Chi)    
## NULL                 96180      42312             
## Age   1     1385     96179      40927   <2e-16 ***
## Educ 10     1226     96169      39701   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plotPoints(jitter(y) ~ Age, data = emp10, col = "gray")
fit.empstat = makeFun(mod1a)
plotFun(fit.empstat(Age = x, Educ = "none") ~ x, add = TRUE, col = "dark red")
plotFun(fit.empstat(Age = x, Educ = "primary") ~ x, add = TRUE, col = "red")
plotFun(fit.empstat(Age = x, Educ = "jrhigh") ~ x, add = TRUE, col = "orange")
plotFun(fit.empstat(Age = x, Educ = "somehs") ~ x, add = TRUE, col = "gold")
plotFun(fit.empstat(Age = x, Educ = "hsgrad") ~ x, add = TRUE, col = "yellow")
plotFun(fit.empstat(Age = x, Educ = "college") ~ x, add = TRUE, col = "light green")
plotFun(fit.empstat(Age = x, Educ = "prof") ~ x, add = TRUE, col = "dark green")
plotFun(fit.empstat(Age = x, Educ = "assoc") ~ x, add = TRUE, col = "light blue")
plotFun(fit.empstat(Age = x, Educ = "ba") ~ x, add = TRUE, col = "dark blue")
plotFun(fit.empstat(Age = x, Educ = "ma") ~ x, add = TRUE, col = "purple")
plotFun(fit.empstat(Age = x, Educ = "doctorate") ~ x, add = TRUE, col = "magenta")

plot of chunk unnamed-chunk-9

2011

y = ifelse(emp10$Empstat == "employed", 1, 0)
mod1a = glm(y ~ Age + Educ, data = emp11, family = "binomial")
## Error: variable lengths differ (found for 'Age')
summary(mod1a)
## 
## Call:
## glm(formula = y ~ Age + Educ, family = "binomial", data = emp10)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -3.181   0.226   0.295   0.382   0.715  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    1.96205    0.06768   28.99  < 2e-16 ***
## Age            0.03147    0.00112   28.16  < 2e-16 ***
## Educba         0.21013    0.06637    3.17   0.0015 ** 
## Educcollege   -0.32663    0.06164   -5.30  1.2e-07 ***
## Educdoctorate  1.03368    0.23659    4.37  1.2e-05 ***
## Educhsgrad    -0.64482    0.05765  -11.19  < 2e-16 ***
## Educjrhigh    -0.86442    0.11721   -7.37  1.6e-13 ***
## Educma         0.57416    0.09656    5.95  2.8e-09 ***
## Educnone      -0.75026    0.28373   -2.64   0.0082 ** 
## Educprimary   -0.87358    0.10030   -8.71  < 2e-16 ***
## Educprof       0.90353    0.22602    4.00  6.4e-05 ***
## Educsomehs    -1.20062    0.06319  -19.00  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 42312  on 96180  degrees of freedom
## Residual deviance: 39701  on 96169  degrees of freedom
## AIC: 39725
## 
## Number of Fisher Scoring iterations: 6
anova(mod1a, test = "Chisq")
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: y
## 
## Terms added sequentially (first to last)
## 
## 
##      Df Deviance Resid. Df Resid. Dev Pr(>Chi)    
## NULL                 96180      42312             
## Age   1     1385     96179      40927   <2e-16 ***
## Educ 10     1226     96169      39701   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plotPoints(jitter(y) ~ Age, data = emp10, col = "gray")
fit.empstat = makeFun(mod1a)
plotFun(fit.empstat(Age = x, Educ = "none") ~ x, add = TRUE, col = "dark red")
plotFun(fit.empstat(Age = x, Educ = "primary") ~ x, add = TRUE, col = "red")
plotFun(fit.empstat(Age = x, Educ = "jrhigh") ~ x, add = TRUE, col = "orange")
plotFun(fit.empstat(Age = x, Educ = "somehs") ~ x, add = TRUE, col = "gold")
plotFun(fit.empstat(Age = x, Educ = "hsgrad") ~ x, add = TRUE, col = "yellow")
plotFun(fit.empstat(Age = x, Educ = "college") ~ x, add = TRUE, col = "light green")
plotFun(fit.empstat(Age = x, Educ = "prof") ~ x, add = TRUE, col = "dark green")
plotFun(fit.empstat(Age = x, Educ = "assoc") ~ x, add = TRUE, col = "light blue")
plotFun(fit.empstat(Age = x, Educ = "ba") ~ x, add = TRUE, col = "dark blue")
plotFun(fit.empstat(Age = x, Educ = "ma") ~ x, add = TRUE, col = "purple")
plotFun(fit.empstat(Age = x, Educ = "doctorate") ~ x, add = TRUE, col = "magenta")

plot of chunk unnamed-chunk-10

3. Relationship between Employment (categorical) and Age (quantitative), controlling for Sex (categorical)
visual: plot of logistic model with fitted lines for sexes
numerical: descriptive statistics, summary and anova table
inferential: logistic regression of employment and age, controlling for sex

2010

y = ifelse(emp10$Empstat == "employed", 1, 0)
mod2a = glm(y ~ Age + Sex, data = emp10, family = "binomial")
summary(mod2a)
## 
## Call:
## glm(formula = y ~ Age + Sex, family = "binomial", data = emp10)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -3.055   0.251   0.311   0.391   0.573  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.42642    0.04489    31.8   <2e-16 ***
## Age          0.04038    0.00113    35.9   <2e-16 ***
## SexMale     -0.30791    0.02833   -10.9   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 42312  on 96180  degrees of freedom
## Residual deviance: 40808  on 96178  degrees of freedom
## AIC: 40814
## 
## Number of Fisher Scoring iterations: 6
anova(mod2a, test = "Chisq")
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: y
## 
## Terms added sequentially (first to last)
## 
## 
##      Df Deviance Resid. Df Resid. Dev Pr(>Chi)    
## NULL                 96180      42312             
## Age   1     1385     96179      40927   <2e-16 ***
## Sex   1      120     96178      40808   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plotPoints(jitter(y) ~ Age, data = emp10, groups = Sex)
fit.empstat = makeFun(mod2a)
plotFun(fit.empstat(Age = x, Sex = "Female") ~ x, add = TRUE, col = "red")
plotFun(fit.empstat(Age = x, Sex = "Male") ~ x, add = TRUE)

plot of chunk unnamed-chunk-11

2011

y = ifelse(emp11$Empstat == "employed", 1, 0)
mod2b = glm(Empstat ~ Age + Sex, data = emp11, family = "binomial")
summary(mod2b)
## 
## Call:
## glm(formula = Empstat ~ Age + Sex, family = "binomial", data = emp11)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -0.602  -0.406  -0.330  -0.262   3.089  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.43422    0.04273   -33.6   <2e-16 ***
## Age         -0.03916    0.00106   -36.8   <2e-16 ***
## SexMale      0.40693    0.02697    15.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 46602  on 98757  degrees of freedom
## Residual deviance: 44914  on 98755  degrees of freedom
## AIC: 44920
## 
## Number of Fisher Scoring iterations: 6
anova(mod2b, test = "Chisq")
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: Empstat
## 
## Terms added sequentially (first to last)
## 
## 
##      Df Deviance Resid. Df Resid. Dev Pr(>Chi)    
## NULL                 98757      46602             
## Age   1     1456     98756      45146   <2e-16 ***
## Sex   1      232     98755      44914   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plotPoints(jitter(y) ~ Age, data = emp11, groups = Sex)
fit.empstat = makeFun(mod2b)
plotFun(fit.empstat(Age = x, Sex = "Female") ~ x, add = TRUE, col = "red")
plotFun(fit.empstat(Age = x, Sex = "Male") ~ x, add = TRUE)

plot of chunk unnamed-chunk-12

4. Relationship between Employment (categorical) and Education Level (categorical)
visual: mosaic plot
numerical: descriptive statistics and conditional probabilities
inferential: two-way table comparing employment status and education level

2010

library(vcd)
## Error: there is no package called 'vcd'

mosaic(x, condvar=Educ, data= emp10, shade=TRUE, legend=TRUE)
*Need to write a formula or make a table for x that will compare Empstat to Educ
2011

library(vcd)
## Error: there is no package called 'vcd'

mosaic(x, condvar=Educ, data= emp11, shade=TRUE, legend=TRUE)