MTH245 Group-A Analysis Outline
Elizabeth Yun, Mohini Banerjee, Kim Chi Ngo
Title:
Entering the Labor Force: What Are My Chances?
require(mosaic)
Main Questions:
What effects did age have on unemployment status in the U.S. from 2010-2011 in relation to one’s education level and one’s sex?
What was the probability of being unemployed in the U.S. from 2010-2011 given one’s education level?
Subquestion:
What was the relationship between age and education level in the U.S. from 2010-2011?
After first describing our univariate variables of employment status (Empstat), education level (Educ), age (Age), and sex (Sex) we will use the descriptive statistics and graphs outlined here to provide a picture of how these explanatory variables are distributed. In order to describe relationships between the explanatory variables we will begin with the distribution of Educ and Age alongside one another. Then we will fit two logistic regression models, first looking at Empstat and Age controlling for Educ. The second will look at Empstat and Age, controlling for Sex. Finally we plan to look at a two-way table comparing Empstat and Educ in order to calculate conditional probabilities of being unemployed given certain education levels.
file1 = "Group-A Unemployment Updated.csv"
emp = read.csv(file1)
emp10 = subset(emp, Year == 2011)
emp11 = subset(emp, Year == 2010)
Single-Variable Distributions:
Visual: Bar Graphs
Numerical: Frequency Tables
bargraph(~Empstat, data = emp10)
bargraph(~Empstat, data = emp11)
tally(~Empstat, data = emp10)
##
## employed unemployed Total
## 90653 5528 96181
tally(~Empstat, data = emp11)
##
## employed unemployed Total
## 92507 6251 98758
Visual: density plots
Numerical: descriptive statistics
densityplot(~Age, data = emp10)
densityplot(~Age, data = emp11)
favstats(~Age, data = emp10)
## min Q1 median Q3 max mean sd n missing
## 15 30 41 51 85 41.29 13.55 96181 0
favstats(~Age, data = emp11)
## min Q1 median Q3 max mean sd n missing
## 15 30 41 51 85 40.99 13.45 98758 0
bargraph(~Educ, data = emp10)
bargraph(~Educ, data = emp11)
tally(~Educ, data = emp10)
##
## assoc ba college doctorate hsgrad jrhigh ma
## 9923 20155 18005 1602 26764 1135 7877
## none primary prof somehs Total
## 201 1821 1525 7173 96181
tally(~Educ, data = emp11)
##
## assoc ba college doctorate hsgrad jrhigh ma
## 9930 20362 18544 1507 28127 1276 7576
## none primary prof somehs Total
## 185 1813 1551 7887 98758
bargraph(~Sex, data = emp10)
bargraph(~Sex, data = emp11)
tally(~Sex, data = emp10)
##
## Female Male Total
## 46126 50055 96181
tally(~Sex, data = emp11)
##
## Female Male Total
## 47396 51362 98758
Variable Relationship Analyses:
1. Distribution of Age (quantitative) and Education (categorical)
visual: boxplot
numerical: descriptive statistics
inferential: maximum, Q1, mean, Q3, and, minimum of age by each education level
2010
bwplot(Age ~ Educ, data = emp10)
favstats(Age ~ Educ, data = emp10)
## min Q1 median Q3 max mean sd n missing
## assoc 15 32 42 51 85 42.20 11.97 9923 0
## ba 15 32 41 51 85 42.21 12.11 20155 0
## college 15 26 38 50 85 39.02 14.18 18005 0
## doctorate 18 38 47 57 85 47.54 12.21 1602 0
## hsgrad 15 31 42 52 85 41.70 13.62 26764 0
## jrhigh 15 30 41 53 85 41.88 16.12 1135 0
## ma 16 36 45 54 85 45.36 11.64 7877 0
## none 16 36 48 55 80 46.41 13.81 201 0
## primary 15 34 44 53 85 43.76 12.47 1821 0
## prof 22 37 46 56 85 46.96 12.49 1525 0
## somehs 15 18 31 46 85 33.59 15.54 7173 0
2011
bwplot(Age ~ Educ, data = emp11)
favstats(Age ~ Educ, data = emp11)
## min Q1 median Q3 max mean sd n missing
## assoc 15 33 42 51.00 85 42.16 11.77 9930 0
## ba 15 32 41 51.00 85 41.96 12.00 20362 0
## college 15 27 39 50.00 85 39.11 13.97 18544 0
## doctorate 18 38 47 56.00 85 47.72 12.02 1507 0
## hsgrad 15 30 42 51.00 85 41.42 13.46 28127 0
## jrhigh 15 29 40 51.25 85 40.67 15.66 1276 0
## ma 15 36 44 54.00 85 45.07 11.57 7576 0
## none 16 32 45 55.00 80 43.98 14.46 185 0
## primary 15 34 43 52.00 85 43.51 12.62 1813 0
## prof 19 37 46 56.00 80 46.78 12.05 1551 0
## somehs 15 18 30 45.00 85 32.92 15.38 7887 0
2. Relationship between Employment (categorical) and Age (quantitative), controlling for Education (categorical)
visual: boxplot
numerical: descriptive statistics, summary and anova table
inferential: logistic regression of employment and age, controlling for education level
2010
y = ifelse(emp10$Empstat == "employed", 1, 0)
mod1a = glm(y ~ Age + Educ, data = emp10, family = "binomial")
summary(mod1a)
##
## Call:
## glm(formula = y ~ Age + Educ, family = "binomial", data = emp10)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.181 0.226 0.295 0.382 0.715
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.96205 0.06768 28.99 < 2e-16 ***
## Age 0.03147 0.00112 28.16 < 2e-16 ***
## Educba 0.21013 0.06637 3.17 0.0015 **
## Educcollege -0.32663 0.06164 -5.30 1.2e-07 ***
## Educdoctorate 1.03368 0.23659 4.37 1.2e-05 ***
## Educhsgrad -0.64482 0.05765 -11.19 < 2e-16 ***
## Educjrhigh -0.86442 0.11721 -7.37 1.6e-13 ***
## Educma 0.57416 0.09656 5.95 2.8e-09 ***
## Educnone -0.75026 0.28373 -2.64 0.0082 **
## Educprimary -0.87358 0.10030 -8.71 < 2e-16 ***
## Educprof 0.90353 0.22602 4.00 6.4e-05 ***
## Educsomehs -1.20062 0.06319 -19.00 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 42312 on 96180 degrees of freedom
## Residual deviance: 39701 on 96169 degrees of freedom
## AIC: 39725
##
## Number of Fisher Scoring iterations: 6
anova(mod1a, test = "Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: y
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 96180 42312
## Age 1 1385 96179 40927 <2e-16 ***
## Educ 10 1226 96169 39701 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plotPoints(jitter(y) ~ Age, data = emp10, col = "gray")
fit.empstat = makeFun(mod1a)
plotFun(fit.empstat(Age = x, Educ = "none") ~ x, add = TRUE, col = "dark red")
plotFun(fit.empstat(Age = x, Educ = "primary") ~ x, add = TRUE, col = "red")
plotFun(fit.empstat(Age = x, Educ = "jrhigh") ~ x, add = TRUE, col = "orange")
plotFun(fit.empstat(Age = x, Educ = "somehs") ~ x, add = TRUE, col = "gold")
plotFun(fit.empstat(Age = x, Educ = "hsgrad") ~ x, add = TRUE, col = "yellow")
plotFun(fit.empstat(Age = x, Educ = "college") ~ x, add = TRUE, col = "light green")
plotFun(fit.empstat(Age = x, Educ = "prof") ~ x, add = TRUE, col = "dark green")
plotFun(fit.empstat(Age = x, Educ = "assoc") ~ x, add = TRUE, col = "light blue")
plotFun(fit.empstat(Age = x, Educ = "ba") ~ x, add = TRUE, col = "dark blue")
plotFun(fit.empstat(Age = x, Educ = "ma") ~ x, add = TRUE, col = "purple")
plotFun(fit.empstat(Age = x, Educ = "doctorate") ~ x, add = TRUE, col = "magenta")
2011
y = ifelse(emp10$Empstat == "employed", 1, 0)
mod1a = glm(y ~ Age + Educ, data = emp11, family = "binomial")
## Error: variable lengths differ (found for 'Age')
summary(mod1a)
##
## Call:
## glm(formula = y ~ Age + Educ, family = "binomial", data = emp10)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.181 0.226 0.295 0.382 0.715
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.96205 0.06768 28.99 < 2e-16 ***
## Age 0.03147 0.00112 28.16 < 2e-16 ***
## Educba 0.21013 0.06637 3.17 0.0015 **
## Educcollege -0.32663 0.06164 -5.30 1.2e-07 ***
## Educdoctorate 1.03368 0.23659 4.37 1.2e-05 ***
## Educhsgrad -0.64482 0.05765 -11.19 < 2e-16 ***
## Educjrhigh -0.86442 0.11721 -7.37 1.6e-13 ***
## Educma 0.57416 0.09656 5.95 2.8e-09 ***
## Educnone -0.75026 0.28373 -2.64 0.0082 **
## Educprimary -0.87358 0.10030 -8.71 < 2e-16 ***
## Educprof 0.90353 0.22602 4.00 6.4e-05 ***
## Educsomehs -1.20062 0.06319 -19.00 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 42312 on 96180 degrees of freedom
## Residual deviance: 39701 on 96169 degrees of freedom
## AIC: 39725
##
## Number of Fisher Scoring iterations: 6
anova(mod1a, test = "Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: y
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 96180 42312
## Age 1 1385 96179 40927 <2e-16 ***
## Educ 10 1226 96169 39701 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plotPoints(jitter(y) ~ Age, data = emp10, col = "gray")
fit.empstat = makeFun(mod1a)
plotFun(fit.empstat(Age = x, Educ = "none") ~ x, add = TRUE, col = "dark red")
plotFun(fit.empstat(Age = x, Educ = "primary") ~ x, add = TRUE, col = "red")
plotFun(fit.empstat(Age = x, Educ = "jrhigh") ~ x, add = TRUE, col = "orange")
plotFun(fit.empstat(Age = x, Educ = "somehs") ~ x, add = TRUE, col = "gold")
plotFun(fit.empstat(Age = x, Educ = "hsgrad") ~ x, add = TRUE, col = "yellow")
plotFun(fit.empstat(Age = x, Educ = "college") ~ x, add = TRUE, col = "light green")
plotFun(fit.empstat(Age = x, Educ = "prof") ~ x, add = TRUE, col = "dark green")
plotFun(fit.empstat(Age = x, Educ = "assoc") ~ x, add = TRUE, col = "light blue")
plotFun(fit.empstat(Age = x, Educ = "ba") ~ x, add = TRUE, col = "dark blue")
plotFun(fit.empstat(Age = x, Educ = "ma") ~ x, add = TRUE, col = "purple")
plotFun(fit.empstat(Age = x, Educ = "doctorate") ~ x, add = TRUE, col = "magenta")
3. Relationship between Employment (categorical) and Age (quantitative), controlling for Sex (categorical)
visual: plot of logistic model with fitted lines for sexes
numerical: descriptive statistics, summary and anova table
inferential: logistic regression of employment and age, controlling for sex
2010
y = ifelse(emp10$Empstat == "employed", 1, 0)
mod2a = glm(y ~ Age + Sex, data = emp10, family = "binomial")
summary(mod2a)
##
## Call:
## glm(formula = y ~ Age + Sex, family = "binomial", data = emp10)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.055 0.251 0.311 0.391 0.573
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.42642 0.04489 31.8 <2e-16 ***
## Age 0.04038 0.00113 35.9 <2e-16 ***
## SexMale -0.30791 0.02833 -10.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 42312 on 96180 degrees of freedom
## Residual deviance: 40808 on 96178 degrees of freedom
## AIC: 40814
##
## Number of Fisher Scoring iterations: 6
anova(mod2a, test = "Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: y
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 96180 42312
## Age 1 1385 96179 40927 <2e-16 ***
## Sex 1 120 96178 40808 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plotPoints(jitter(y) ~ Age, data = emp10, groups = Sex)
fit.empstat = makeFun(mod2a)
plotFun(fit.empstat(Age = x, Sex = "Female") ~ x, add = TRUE, col = "red")
plotFun(fit.empstat(Age = x, Sex = "Male") ~ x, add = TRUE)
2011
y = ifelse(emp11$Empstat == "employed", 1, 0)
mod2b = glm(Empstat ~ Age + Sex, data = emp11, family = "binomial")
summary(mod2b)
##
## Call:
## glm(formula = Empstat ~ Age + Sex, family = "binomial", data = emp11)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.602 -0.406 -0.330 -0.262 3.089
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.43422 0.04273 -33.6 <2e-16 ***
## Age -0.03916 0.00106 -36.8 <2e-16 ***
## SexMale 0.40693 0.02697 15.1 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 46602 on 98757 degrees of freedom
## Residual deviance: 44914 on 98755 degrees of freedom
## AIC: 44920
##
## Number of Fisher Scoring iterations: 6
anova(mod2b, test = "Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: Empstat
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 98757 46602
## Age 1 1456 98756 45146 <2e-16 ***
## Sex 1 232 98755 44914 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plotPoints(jitter(y) ~ Age, data = emp11, groups = Sex)
fit.empstat = makeFun(mod2b)
plotFun(fit.empstat(Age = x, Sex = "Female") ~ x, add = TRUE, col = "red")
plotFun(fit.empstat(Age = x, Sex = "Male") ~ x, add = TRUE)
4. Relationship between Employment (categorical) and Education Level (categorical)
visual: mosaic plot
numerical: descriptive statistics and conditional probabilities
inferential: two-way table comparing employment status and education level
2010
library(vcd)
## Error: there is no package called 'vcd'
mosaic(x, condvar=Educ, data= emp10, shade=TRUE, legend=TRUE)
*Need to write a formula or make a table for x that will compare Empstat to Educ
2011
library(vcd)
## Error: there is no package called 'vcd'
mosaic(x, condvar=Educ, data= emp11, shade=TRUE, legend=TRUE)