The data was originally collected for a prediction task: determining whether a person makes over 50K a year.
Instead, I will be using this data to look at the following dimensions:
age
: continuous.
education-num
: 1-16 covering pre-school through Doctoral.
fnlwgt
: continuous - see Description.
sex
: Female, Male.
capital-gain
: continuous.
capital-loss
: continuous.
hours-per-week
: continuous.
makes under/over 50k/yr
: 0/1 (0 = 50K or less, 1 = over 50K).
Description of fnlwgt:
The weights on the CPS files are controlled to independent estimates of the civilian noninstitutional population of the US. These are prepared monthly for us by Population Division here at the Census Bureau. We use 3 sets of controls. These are: 1. A single cell estimate of the population 16+ for each state. 2. Controls for Hispanic Origin by age and sex. 3. Controls by Race, age and sex.
We use all three sets of controls in our weighting program and “rake” through them 6 times so that by the end we come back to all the controls we used.
The term estimate refers to population totals derived from CPS by creating “weighted tallies” of any specified socio-economic characteristics of the population.
People with similar demographic characteristics should have similar weights. There is one important caveat to remember about this statement. That is that since the CPS sample is actually a collection of 51 state samples, each with its own probability of selection, the statement only applies within state.
This tells me that this data is not controlled for country-wide demographics: the fnlwgt weights are only comparable within a state.
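To make the "raking" in the description a little more concrete, here is a minimal base R sketch of the general idea (iterative proportional fitting): weights are rescaled to match one set of control totals, then the next, and the loop is repeated. The toy sample, the control totals, and the helper `rake_margin()` are all made up for illustration; this is not the Census Bureau's actual weighting program.

```r
# Toy sample: one row per respondent, all starting with weight 1.
# Every value here is hypothetical.
samp <- data.frame(
  sex     = c("F", "F", "M", "M", "M", "F"),
  age_grp = c("young", "old", "young", "old", "old", "young"),
  w       = rep(1, 6),
  stringsAsFactors = FALSE
)

# Hypothetical population control totals for each margin.
ctrl_sex <- c(F = 60, M = 40)
ctrl_age <- c(young = 55, old = 45)

# Rescale weights so the weighted total of each group matches its control.
rake_margin <- function(w, group, control) {
  totals <- tapply(w, group, sum)
  as.numeric(w * control[group] / totals[group])
}

# "Rake" through both sets of controls 6 times.
for (pass in 1:6) {
  samp$w <- rake_margin(samp$w, samp$sex, ctrl_sex)
  samp$w <- rake_margin(samp$w, samp$age_grp, ctrl_age)
}

# "Weighted tallies": population estimates are sums of the weights.
est_sex <- tapply(samp$w, samp$sex, sum)
est_age <- tapply(samp$w, samp$age_grp, sum)
```

After the loop, the weighted tallies by sex and by age group both sit very close to their control totals, which is the sense in which the procedure "comes back to all the controls".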
library(RCurl)
library(tidyverse)
For this analysis I’ll see whether edu_num can be estimated from the other dimensions.
Load, clean, and review the data:
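# adult.data has no header row, so read.csv() uses the first record as column
# names (X39, X77516, ...), which also drops that record from the data.
# The mutate() below maps those generated names to readable ones.
adults <- read.csv("adult.data", stringsAsFactors = FALSE) %>%
  mutate(age = X39, fnlwgt = X77516, education = Bachelors, edu_num = X13,
         `marital-status` = Never.married, occupation = Adm.clerical,
         relationship = Not.in.family, race = White, sex = Male, cap_gain = X2174,
         cap_loss = X0, hrs_per_week = X40, cntry = United.States,
         over_under_50k = X..50K) %>%
  select(age:over_under_50k)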
#View(adults)
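Because adult.data has no header row, `read.csv()` with its default `header = TRUE` turns the first record into column names (hence `X39`, `X77516`, and so on), and that record never reaches the data frame. Below is a minimal sketch of one alternative, assuming you would rather keep the first record: pass `header = FALSE` and supply the names directly. The two inline lines stand in for the file so the example is self-contained; the names `workclass` and `marital_status` are my own choices, not ones used elsewhere in this analysis.

```r
# Two lines in the layout of adult.data, used as an inline stand-in for the file.
raw_txt <- "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K
50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K"

# Column names matching the ones used in this analysis where possible;
# `workclass` and `marital_status` are assumed names.
col_nms <- c("age", "workclass", "fnlwgt", "education", "edu_num",
             "marital_status", "occupation", "relationship", "race", "sex",
             "cap_gain", "cap_loss", "hrs_per_week", "cntry", "over_under_50k")

# Default behaviour: the first line becomes the header, leaving one data row.
with_default <- read.csv(text = raw_txt, stringsAsFactors = FALSE)

# Alternative: no header, explicit names, and whitespace stripped from fields.
adults_alt <- read.csv(text = raw_txt, header = FALSE, col.names = col_nms,
                       strip.white = TRUE, stringsAsFactors = FALSE)

nrow(with_default)  # 1
nrow(adults_alt)    # 2
```

With the real file, replacing `text = raw_txt` with the file path should behave the same way. Note that this would add one row, so the outputs shown in this post (which were produced with the default header handling) would shift slightly.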
# Encode income as 1 for '>50K' and 0 otherwise, and sex as 0 = Male, 1 = Female.
adlts_dat <- adults %>%
  mutate(o_u_50k = ifelse(grepl('<=50K', over_under_50k), 0,
                          ifelse(grepl('>50K', over_under_50k), 1, 0)),
         m_f = ifelse(grepl('Male', sex), 0, 1)) %>%
  select(fnlwgt, age, edu_num, cap_gain, cap_loss, hrs_per_week, m_f, o_u_50k)
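One detail of the encoding step is easy to misread: `grepl('Male', sex)` looks like it should also match "Female", but `grepl()` is case-sensitive by default and "Female" only contains a lowercase "male". A small sketch on made-up values (with the leading space the raw fields carry) shows how the two indicators come out:

```r
# Hypothetical raw values, including the leading space found after each comma.
sex_raw <- c(" Male", " Female", " Male")
inc_raw <- c(" <=50K", " >50K", " <=50K")

# 'Male' does not match ' Female' because the match is case-sensitive.
m_f <- ifelse(grepl("Male", sex_raw), 0, 1)       # 0 1 0

# '>50K' does not occur inside ' <=50K', so a single test is enough here.
o_u_50k <- ifelse(grepl(">50K", inc_raw), 1, 0)   # 0 1 0
```

So in the cleaned data `m_f` is 1 for Female and `o_u_50k` is 1 for incomes over 50K.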
Here’s a preview of the cleaned data:
head(adlts_dat)
##   fnlwgt age edu_num cap_gain cap_loss hrs_per_week m_f o_u_50k
## 1  83311  50      13        0        0           13   0       0
## 2 215646  38       9        0        0           40   0       0
## 3 234721  53       7        0        0           40   0       0
## 4 338409  28      13        0        0           40   1       0
## 5 284582  37      14        0        0           40   1       0
## 6 160187  49       5        0        0           16   1       0
#unique(adults$`marital-status`)
Let’s plot the different dimensions:
#View(adlts_dat)
pairs(adlts_dat)
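A scatterplot matrix over tens of thousands of rows can be dense, so a correlation matrix is one compact numeric companion to it. The sketch below only uses the six preview rows shown earlier (leaving out the columns that are constant in those rows), so the numbers it produces are illustrative and say nothing reliable about the full dataset; on the real data, something like `cor(adlts_dat)` would be the equivalent call.

```r
# The six preview rows from the cleaned data; cap_gain, cap_loss and o_u_50k
# are omitted because they are all zero in these rows.
preview <- data.frame(
  fnlwgt       = c(83311, 215646, 234721, 338409, 284582, 160187),
  age          = c(50, 38, 53, 28, 37, 49),
  edu_num      = c(13, 9, 7, 13, 14, 5),
  hrs_per_week = c(13, 40, 40, 40, 40, 16),
  m_f          = c(0, 0, 0, 1, 1, 1)
)

# Pairwise Pearson correlations, rounded for readability.
cor_mat <- round(cor(preview), 2)

# Correlation of every column with edu_num.
cor_mat["edu_num", ]
```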
It’s a bit too hard to see here, so I’ll dive deeper and take a look at the lm:
adlt_lm <- lm(edu_num ~ fnlwgt + age + cap_gain + cap_loss +
                hrs_per_week + m_f + o_u_50k, data = adlts_dat)
summary(adlt_lm)
##
## Call:
## lm(formula = edu_num ~ fnlwgt + age + cap_gain + cap_loss + hrs_per_week +
##     m_f + o_u_50k, data = adlts_dat)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -9.928 -0.984 -0.117  1.436  7.534
##
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)   9.226e+00  6.816e-02 135.355  < 2e-16 ***
## fnlwgt       -9.679e-07  1.265e-07  -7.653 2.02e-14 ***
## age          -9.041e-03  1.007e-03  -8.975  < 2e-16 ***
## cap_gain      1.794e-05  1.853e-06   9.680  < 2e-16 ***
## cap_loss      2.146e-04  3.348e-05   6.410 1.48e-10 ***
## hrs_per_week  1.825e-02  1.128e-03  16.180  < 2e-16 ***
## m_f           4.207e-01  2.952e-02  14.250  < 2e-16 ***
## o_u_50k       1.961e+00  3.425e-02  57.265  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.4 on 32552 degrees of freedom
## Multiple R-squared: 0.1303, Adjusted R-squared: 0.1301
## F-statistic: 696.4 on 7 and 32552 DF, p-value: < 2.2e-16
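My \(R^2\) is not very high: at about 0.13, the model explains only around 13% of the variance in edu_num, with a residual standard error of 2.4 on a 1-16 scale. Every coefficient is statistically significant, but that doesn’t make me confident a linear model can predict edu_num in this particular dataset.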
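As a sanity check on how these figures relate, here is a small sketch. The first part recomputes the adjusted \(R^2\) from the numbers reported in the summary output above; the second part uses made-up toy data to show that \(R^2\) is one minus the ratio of the residual sum of squares to the total sum of squares. The variable names are mine.

```r
# Figures reported in the summary output above.
r2     <- 0.1303
n_obs  <- 32552 + 8   # residual degrees of freedom plus 8 estimated coefficients
n_pred <- 7           # predictors, excluding the intercept

# Adjusted R^2 penalises R^2 for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_pred - 1)
round(adj_r2, 4)   # 0.1301, matching the reported value

# Toy data (hypothetical) to show where R^2 comes from.
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 2.9, 4.2, 4.8, 6.1, 6.9)
fit <- lm(y ~ x)

rss <- sum(residuals(fit)^2)   # variation left unexplained by the model
tss <- sum((y - mean(y))^2)    # total variation around the mean
manual_r2 <- 1 - rss / tss

all.equal(manual_r2, summary(fit)$r.squared)
```

With more than 32,000 rows and only 7 predictors, the adjustment is tiny, which is why the two \(R^2\) values in the output are nearly identical.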
Notes:
From the data source:
Extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1)&& (HRSWK>0))
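The extraction condition quoted above is written in a C-like style with `&&`. For readers who think in R, here is a rough sketch of the same filter using the vectorised `&` operator. The data frame and its values are entirely hypothetical; only the column names and thresholds come from the quoted condition.

```r
# Hypothetical raw records; column names follow the quoted condition.
raw <- data.frame(
  AAGE    = c(15, 30, 45, 60, 25),
  AGI     = c(500, 50, 20000, 35000, 1000),
  AFNLWGT = c(1200, 900, 1, 2500, 2000),
  HRSWK   = c(10, 40, 40, 0, 35)
)

# `&` compares row by row; `&&` would only accept single TRUE/FALSE values.
keep <- raw$AAGE > 16 & raw$AGI > 100 & raw$AFNLWGT > 1 & raw$HRSWK > 0

# Only rows meeting all four conditions survive.
clean <- raw[keep, ]
nrow(clean)  # 1
```

In this toy data each of the first four rows fails exactly one condition, so only the last row is kept.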