MITx: 15.071x The Analytics Edge, Week 1, Demographics and Employment in the US

Tarek Dib

Introduction

In the wake of the Great Recession of 2009, there has been a good deal of focus on employment statistics, one of the most important metrics policymakers use to gauge the overall strength of the economy. In the United States, the government measures unemployment using the Current Population Survey (CPS), which collects demographic and employment information from a wide range of Americans each month. In this exercise, we use the September 2013 version of this rich, nationally representative dataset (available online).

The observations in the dataset represent people surveyed in the September 2013 CPS who actually completed a survey.

Variables

PeopleInHousehold: The number of people in the interviewee's household.
Region: The census region where the interviewee lives.
State: The state where the interviewee lives.
MetroAreaCode: A code that identifies the metropolitan area in which the interviewee lives (missing if the interviewee does not live in a metropolitan area). The mapping from codes to names of metropolitan areas is provided in the file MetroAreaCodes.csv.
Age: The age, in years, of the interviewee. 80 represents people aged 80-84, and 85 represents people aged 85 and higher.
Married: The marriage status of the interviewee.
Sex: The sex of the interviewee.
Education: The maximum level of education obtained by the interviewee.
Race: The race of the interviewee.
Hispanic: Whether the interviewee is of Hispanic ethnicity.
CountryOfBirthCode: A code identifying the country of birth of the interviewee. The mapping from codes to names of countries is provided in the file CountryCodes.csv.
Citizenship: The United States citizenship status of the interviewee.
EmploymentStatus: The status of employment of the interviewee.
Industry: The industry of employment of the interviewee (only available if they are employed).

Loading Data and Descriptive Statistics

# Set the working directory to the folder that contains the data
setwd("/home/tarek/Analytics/Week1/Rlectures/Data")
# Read the Data
CPS <- read.csv("CPSData.csv")
MetroAreaMap <- read.csv("MetroAreaCodes.csv")
CountryMap <- read.csv("CountryCodes.csv")
str(CPS)
## 'data.frame':    131302 obs. of  14 variables:
##  $ PeopleInHousehold : int  1 3 3 3 3 3 3 2 2 2 ...
##  $ Region            : Factor w/ 4 levels "Midwest","Northeast",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ State             : Factor w/ 51 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ MetroAreaCode     : int  26620 13820 13820 13820 26620 26620 26620 33660 33660 26620 ...
##  $ Age               : int  85 21 37 18 52 24 26 71 43 52 ...
##  $ Married           : Factor w/ 5 levels "Divorced","Married",..: 5 3 3 3 5 3 3 1 1 3 ...
##  $ Sex               : Factor w/ 2 levels "Female","Male": 1 2 1 2 1 2 2 1 2 2 ...
##  $ Education         : Factor w/ 8 levels "Associate degree",..: 1 4 4 6 1 2 4 4 4 2 ...
##  $ Race              : Factor w/ 6 levels "American Indian",..: 6 3 3 3 6 6 6 6 6 6 ...
##  $ Hispanic          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ CountryOfBirthCode: int  57 57 57 57 57 57 57 57 57 57 ...
##  $ Citizenship       : Factor w/ 3 levels "Citizen, Native",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ EmploymentStatus  : Factor w/ 5 levels "Disabled","Employed",..: 4 5 1 3 2 2 2 2 3 2 ...
##  $ Industry          : Factor w/ 14 levels "Agriculture, forestry, fishing, and hunting",..: NA 11 NA NA 11 4 14 4 NA 12 ...
summary(CPS)
##  PeopleInHousehold       Region               State       MetroAreaCode  
##  Min.   : 1.00     Midwest  :30684   California  :11570   Min.   :10420  
##  1st Qu.: 2.00     Northeast:25939   Texas       : 7077   1st Qu.:21780  
##  Median : 3.00     South    :41502   New York    : 5595   Median :34740  
##  Mean   : 3.28     West     :33177   Florida     : 5149   Mean   :35075  
##  3rd Qu.: 4.00                       Pennsylvania: 3930   3rd Qu.:41860  
##  Max.   :15.00                       Illinois    : 3912   Max.   :79600  
##                                      (Other)     :94069   NA's   :34238  
##       Age                Married          Sex       
##  Min.   : 0.0   Divorced     :11151   Female:67481  
##  1st Qu.:19.0   Married      :55509   Male  :63821  
##  Median :39.0   Never Married:30772                 
##  Mean   :38.8   Separated    : 2027                 
##  3rd Qu.:57.0   Widowed      : 6505                 
##  Max.   :85.0   NA's         :25338                 
##                                                     
##                    Education                   Race           Hispanic    
##  High school            :30906   American Indian :  1433   Min.   :0.000  
##  Bachelor's degree      :19443   Asian           :  6520   1st Qu.:0.000  
##  Some college, no degree:18863   Black           : 13913   Median :0.000  
##  No high school diploma :16095   Multiracial     :  2897   Mean   :0.139  
##  Associate degree       : 9913   Pacific Islander:   618   3rd Qu.:0.000  
##  (Other)                :10744   White           :105921   Max.   :1.000  
##  NA's                   :25338                                            
##  CountryOfBirthCode               Citizenship    
##  Min.   : 57.0      Citizen, Native     :116639  
##  1st Qu.: 57.0      Citizen, Naturalized:  7073  
##  Median : 57.0      Non-Citizen         :  7590  
##  Mean   : 82.7                                   
##  3rd Qu.: 57.0                                   
##  Max.   :555.0                                   
##                                                  
##            EmploymentStatus                               Industry    
##  Disabled          : 5712   Educational and health services   :15017  
##  Employed          :61733   Trade                             : 8933  
##  Not in Labor Force:15246   Professional and business services: 7519  
##  Retired           :18619   Manufacturing                     : 6791  
##  Unemployed        : 4203   Leisure and hospitality           : 6364  
##  NA's              :25789   (Other)                           :21618  
##                             NA's                              :65060
# The most common industry of employment (the number printed under the label
# is the position of that level in the factor, not a count)
which.max(table(CPS$Industry))
## Educational and health services 
##                               4
# The states with the fewest and the most interviewees
which.min(table(CPS$State))
## New Mexico 
##         32
which.max(table(CPS$State))
## California 
##          5
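The which.max and which.min calls above print a level name together with that level's position in the factor, so the 4, 32 and 5 are indices rather than numbers of interviewees. The following is a minimal sketch on made-up data (the `industry` vector is a hypothetical stand-in for CPS$Industry) showing one way to retrieve both the label and the count.

```r
# Hypothetical toy factor standing in for CPS$Industry
industry <- factor(c("Trade", "Manufacturing", "Trade", "Mining", "Trade", NA))

counts <- table(industry)             # NA values are dropped by default
top_name <- names(which.max(counts))  # label of the most frequent level
top_count <- max(counts)              # number of observations with that label

top_name
top_count
sort(counts, decreasing = TRUE)       # full ranking, most common first
```

Sorting the whole table is often more informative than which.max alone, because it also shows how close the runners-up are.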
# proportion of interviewees who are citizens of the United States
citizen = CPS[CPS$Citizenship == "Citizen, Native" | CPS$Citizenship == "Citizen, Naturalized", 
    ]
nrow(citizen)/nrow(CPS)
## [1] 0.9422
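The same proportion can be computed without building a subset, because the mean of a logical vector is the share of TRUE values. This is a small sketch on invented data (the `citizenship` vector is hypothetical and only mimics the levels of CPS$Citizenship).

```r
# Hypothetical toy vector standing in for CPS$Citizenship
citizenship <- factor(c("Citizen, Native", "Non-Citizen", "Citizen, Naturalized",
                        "Citizen, Native", "Non-Citizen"))

# TRUE for either citizen category; %in% never returns NA
is_citizen <- citizenship %in% c("Citizen, Native", "Citizen, Naturalized")

# The mean of a logical vector is the proportion of TRUE values
citizen_share <- mean(is_citizen)
citizen_share
```

Using %in% instead of two == comparisons joined by | also avoids NA rows appearing in the result if the variable ever contains missing values.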
# Race and Hispanic ethnicity
table(CPS$Race, CPS$Hispanic)
##                   
##                        0     1
##   American Indian   1129   304
##   Asian             6407   113
##   Black            13292   621
##   Multiracial       2449   448
##   Pacific Islander   541    77
##   White            89190 16731
# We can test the relationship between each of these four variables (Region,
# Sex, Age, Citizenship) and whether the Married variable is missing:
# table(CPS$Region, is.na(CPS$Married))
# table(CPS$Sex, is.na(CPS$Married))
# table(CPS$Age, is.na(CPS$Married))
# table(CPS$Citizenship, is.na(CPS$Married))
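To illustrate what such a cross-tabulation against is.na() reveals, here is a sketch on toy data. The data frame below is invented, and the assumption that Married is missing exactly for people under 15 is made up for the example; it is not a claim about the CPS file.

```r
# Hypothetical toy data: Married is missing for everyone under 15 (assumption)
toy <- data.frame(
  Age     = c(3, 10, 14, 15, 30, 45, 70),
  Married = c(NA, NA, NA, "Never Married", "Married", "Divorced", "Widowed")
)

# Rows are ages; columns say whether Married is missing
missing_by_age <- table(toy$Age, is.na(toy$Married))
missing_by_age

# Oldest age with a missing value and youngest age with a recorded value
max_missing_age <- max(toy$Age[is.na(toy$Married)])
min_recorded_age <- min(toy$Age[!is.na(toy$Married)])
max_missing_age
min_recorded_age
```

When the missing values all sit in a contiguous block of one variable, as in this toy table, the missingness is explained by that variable rather than being random.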

# States in which every interviewee lives in a non-metropolitan area, and
# states in which every interviewee lives in a metropolitan area.
# Alaska and Wyoming have no interviewees living in a metropolitan area; the
# District of Columbia, New Jersey, and Rhode Island have all of their
# interviewees living in a metropolitan area.
table(CPS$State, is.na(CPS$MetroAreaCode))
##                       
##                        FALSE  TRUE
##   Alabama               1020   356
##   Alaska                   0  1590
##   Arizona               1327   201
##   Arkansas               724   697
##   California           11333   237
##   Colorado              2545   380
##   Connecticut           2593   243
##   Delaware              1696   518
##   District of Columbia  1791     0
##   Florida               4947   202
##   Georgia               2250   557
##   Hawaii                1576   523
##   Idaho                  761   757
##   Illinois              3473   439
##   Indiana               1420   584
##   Iowa                  1297  1231
##   Kansas                1234   701
##   Kentucky               908   933
##   Louisiana             1216   234
##   Maine                  909  1354
##   Maryland              2978   222
##   Massachusetts         1858   129
##   Michigan              2517   546
##   Minnesota             2150   989
##   Mississippi            376   854
##   Missouri              1440   705
##   Montana                199  1015
##   Nebraska               816  1133
##   Nevada                1609   247
##   New Hampshire         1148  1514
##   New Jersey            2567     0
##   New Mexico             832   270
##   New York              5144   451
##   North Carolina        1642   977
##   North Dakota           432  1213
##   Ohio                  2754   924
##   Oklahoma              1024   499
##   Oregon                1519   424
##   Pennsylvania          3245   685
##   Rhode Island          2209     0
##   South Carolina        1139   519
##   South Dakota           595  1405
##   Tennessee             1149   635
##   Texas                 6060  1017
##   Utah                  1455   387
##   Vermont                657  1233
##   Virginia              2367   586
##   Washington            1937   429
##   West Virginia          344  1065
##   Wisconsin             1882   804
##   Wyoming                  0  1624

# Which region of the United States has the largest proportion of
# interviewees living in a non-metropolitan area?
table(CPS$Region, is.na(CPS$MetroAreaCode))
##            
##             FALSE  TRUE
##   Midwest   20010 10674
##   Northeast 20330  5609
##   South     31631  9871
##   West      25093  8084
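The table above gives counts, so answering the question requires turning each row into proportions. One possible way, sketched here with the counts copied from that output into a matrix (the object names are introduced only for this example), is prop.table with margin = 1.

```r
# Counts copied from the table(CPS$Region, is.na(CPS$MetroAreaCode)) output
region_counts <- matrix(
  c(20010, 10674,
    20330,  5609,
    31631,  9871,
    25093,  8084),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("Midwest", "Northeast", "South", "West"), c("FALSE", "TRUE"))
)

# Row-wise proportions: each row sums to 1
region_props <- prop.table(region_counts, margin = 1)
round(region_props, 3)

# Region with the largest non-metropolitan share (the "TRUE" column)
top_region <- names(which.max(region_props[, "TRUE"]))
top_region
```

On these counts the Midwest has the largest share, with roughly 35% of its interviewees living outside a metropolitan area.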
# Proportion of interviewees living in a non-metropolitan area, by state,
# sorted from lowest to highest
sort(round(tapply(is.na(CPS$MetroAreaCode), CPS$State, mean), 2))
## District of Columbia           New Jersey         Rhode Island 
##                 0.00                 0.00                 0.00 
##           California              Florida        Massachusetts 
##                 0.02                 0.04                 0.06 
##             Maryland             New York          Connecticut 
##                 0.07                 0.08                 0.09 
##             Illinois              Arizona             Colorado 
##                 0.11                 0.13                 0.13 
##               Nevada                Texas            Louisiana 
##                 0.13                 0.14                 0.16 
##         Pennsylvania             Michigan           Washington 
##                 0.17                 0.18                 0.18 
##              Georgia             Virginia                 Utah 
##                 0.20                 0.20                 0.21 
##               Oregon             Delaware               Hawaii 
##                 0.22                 0.23                 0.25 
##           New Mexico                 Ohio              Alabama 
##                 0.25                 0.25                 0.26 
##              Indiana            Wisconsin       South Carolina 
##                 0.29                 0.30                 0.31 
##            Minnesota             Missouri             Oklahoma 
##                 0.32                 0.33                 0.33 
##               Kansas            Tennessee       North Carolina 
##                 0.36                 0.36                 0.37 
##             Arkansas                 Iowa                Idaho 
##                 0.49                 0.49                 0.50 
##             Kentucky        New Hampshire             Nebraska 
##                 0.51                 0.57                 0.58 
##                Maine              Vermont          Mississippi 
##                 0.60                 0.65                 0.69 
##         South Dakota         North Dakota        West Virginia 
##                 0.70                 0.74                 0.76 
##              Montana               Alaska              Wyoming 
##                 0.84                 1.00                 1.00

Integrating Metropolitan Area Data

To merge in the metropolitan areas, we want to connect the field MetroAreaCode from the CPS data frame with the field Code in MetroAreaMap.

# Merges the two data frames on these columns, overwriting the CPS data
# frame with the result
CPS = merge(CPS, MetroAreaMap, by.x = "MetroAreaCode", by.y = "Code", all.x = TRUE)
summary(CPS)
##  MetroAreaCode   PeopleInHousehold       Region               State      
##  Min.   :10420   Min.   : 1.00     Midwest  :30684   California  :11570  
##  1st Qu.:21780   1st Qu.: 2.00     Northeast:25939   Texas       : 7077  
##  Median :34740   Median : 3.00     South    :41502   New York    : 5595  
##  Mean   :35075   Mean   : 3.28     West     :33177   Florida     : 5149  
##  3rd Qu.:41860   3rd Qu.: 4.00                       Pennsylvania: 3930  
##  Max.   :79600   Max.   :15.00                       Illinois    : 3912  
##  NA's   :34238                                       (Other)     :94069  
##       Age                Married          Sex       
##  Min.   : 0.0   Divorced     :11151   Female:67481  
##  1st Qu.:19.0   Married      :55509   Male  :63821  
##  Median :39.0   Never Married:30772                 
##  Mean   :38.8   Separated    : 2027                 
##  3rd Qu.:57.0   Widowed      : 6505                 
##  Max.   :85.0   NA's         :25338                 
##                                                     
##                    Education                   Race           Hispanic    
##  High school            :30906   American Indian :  1433   Min.   :0.000  
##  Bachelor's degree      :19443   Asian           :  6520   1st Qu.:0.000  
##  Some college, no degree:18863   Black           : 13913   Median :0.000  
##  No high school diploma :16095   Multiracial     :  2897   Mean   :0.139  
##  Associate degree       : 9913   Pacific Islander:   618   3rd Qu.:0.000  
##  (Other)                :10744   White           :105921   Max.   :1.000  
##  NA's                   :25338                                            
##  CountryOfBirthCode               Citizenship    
##  Min.   : 57.0      Citizen, Native     :116639  
##  1st Qu.: 57.0      Citizen, Naturalized:  7073  
##  Median : 57.0      Non-Citizen         :  7590  
##  Mean   : 82.7                                   
##  3rd Qu.: 57.0                                   
##  Max.   :555.0                                   
##                                                  
##            EmploymentStatus                               Industry    
##  Disabled          : 5712   Educational and health services   :15017  
##  Employed          :61733   Trade                             : 8933  
##  Not in Labor Force:15246   Professional and business services: 7519  
##  Retired           :18619   Manufacturing                     : 6791  
##  Unemployed        : 4203   Leisure and hospitality           : 6364  
##  NA's              :25789   (Other)                           :21618  
##                             NA's                              :65060  
##                                               MetroArea    
##  New York-Northern New Jersey-Long Island, NY-NJ-PA: 5409  
##  Washington-Arlington-Alexandria, DC-VA-MD-WV      : 4177  
##  Los Angeles-Long Beach-Santa Ana, CA              : 4102  
##  Philadelphia-Camden-Wilmington, PA-NJ-DE          : 2855  
##  Chicago-Naperville-Joliet, IN-IN-WI               : 2772  
##  (Other)                                           :77749  
##  NA's                                              :34238
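The role of all.x = TRUE in the merge above can be checked on a small example. The two data frames below are hypothetical toy tables whose column names mirror CPS and MetroAreaMap; the codes and names in them are illustrative values, not a reference list.

```r
# Hypothetical toy tables; column names mirror those used above
people <- data.frame(Person = c("a", "b", "c", "d"),
                     MetroAreaCode = c(10420, NA, 99999, 10420))
codes  <- data.frame(Code = c(10420, 10500),
                     MetroArea = c("Akron, OH", "Albany, GA"))

# all.x = TRUE keeps every row of `people`; MetroArea is NA where no code matches
joined <- merge(people, codes, by.x = "MetroAreaCode", by.y = "Code", all.x = TRUE)
joined

# Without all.x, rows with a missing or unmatched code are dropped
inner <- merge(people, codes, by.x = "MetroAreaCode", by.y = "Code")
nrow(joined)
nrow(inner)
```

The key column comes first in the merged result, which is why MetroAreaCode leads the summary output after the merge. The number of missing MetroArea values matching the number of missing MetroAreaCode values (34238 in both summaries) is consistent with every non-missing code having found a match.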
# Metropolitan area with the largest number of interviewees
which.max(table(CPS$MetroArea))
## New York-Northern New Jersey-Long Island, NY-NJ-PA 
##                                                173
# Metropolitan area that has the highest proportion of interviewees of
# Hispanic ethnicity
which.max(tapply(CPS$Hispanic, CPS$MetroArea, mean))
## Laredo, TX 
##        136
# Metropolitan areas in the United States from which at least 20% of
# interviewees are Asian.
which(tapply(CPS$Race == "Asian", CPS$MetroArea, mean) > 0.2)
##                       Honolulu, HI  San Francisco-Oakland-Fremont, CA 
##                                107                                220 
## San Jose-Sunnyvale-Santa Clara, CA              Vallejo-Fairfield, CA 
##                                221                                254
# Metropolitan area with the smallest proportion of interviewees who have
# no high school diploma (missing Education values are ignored)
which.min(tapply(CPS$Education == "No high school diploma", CPS$MetroArea, mean, 
    na.rm = TRUE))
## Iowa City, IA 
##           112
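The na.rm argument matters in the call above because Education contains missing values, and a single NA in a group makes that group's mean NA. Here is a minimal sketch on invented data (the `toy` data frame and its two metro labels are hypothetical).

```r
# Hypothetical toy data: Education is NA for one person in metro "A"
toy <- data.frame(
  MetroArea = c("A", "A", "A", "B", "B", "B"),
  Education = c("No high school diploma", "High school", NA,
                "Bachelor's degree", "High school", "High school")
)

no_diploma <- toy$Education == "No high school diploma"  # NA where Education is NA

# Without na.rm the NA propagates into the mean for metro "A"
share_raw <- tapply(no_diploma, toy$MetroArea, mean)
share_raw

# With na.rm = TRUE the share is taken over people with a recorded education
share <- tapply(no_diploma, toy$MetroArea, mean, na.rm = TRUE)
share
names(which.min(share))
```

Note that which.min and which.max skip NA entries, so forgetting na.rm would not raise an error; it would silently exclude every metropolitan area that has at least one missing value.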

Integrating Country of Birth Data

CPS = merge(CPS, CountryMap, by.x = "CountryOfBirthCode", by.y = "Code", all.x = TRUE)
summary(CPS)
##  CountryOfBirthCode MetroAreaCode   PeopleInHousehold       Region     
##  Min.   : 57.0      Min.   :10420   Min.   : 1.00     Midwest  :30684  
##  1st Qu.: 57.0      1st Qu.:21780   1st Qu.: 2.00     Northeast:25939  
##  Median : 57.0      Median :34740   Median : 3.00     South    :41502  
##  Mean   : 82.7      Mean   :35075   Mean   : 3.28     West     :33177  
##  3rd Qu.: 57.0      3rd Qu.:41860   3rd Qu.: 4.00                      
##  Max.   :555.0      Max.   :79600   Max.   :15.00                      
##                     NA's   :34238                                      
##           State            Age                Married          Sex       
##  California  :11570   Min.   : 0.0   Divorced     :11151   Female:67481  
##  Texas       : 7077   1st Qu.:19.0   Married      :55509   Male  :63821  
##  New York    : 5595   Median :39.0   Never Married:30772                 
##  Florida     : 5149   Mean   :38.8   Separated    : 2027                 
##  Pennsylvania: 3930   3rd Qu.:57.0   Widowed      : 6505                 
##  Illinois    : 3912   Max.   :85.0   NA's         :25338                 
##  (Other)     :94069                                                      
##                    Education                   Race           Hispanic    
##  High school            :30906   American Indian :  1433   Min.   :0.000  
##  Bachelor's degree      :19443   Asian           :  6520   1st Qu.:0.000  
##  Some college, no degree:18863   Black           : 13913   Median :0.000  
##  No high school diploma :16095   Multiracial     :  2897   Mean   :0.139  
##  Associate degree       : 9913   Pacific Islander:   618   3rd Qu.:0.000  
##  (Other)                :10744   White           :105921   Max.   :1.000  
##  NA's                   :25338                                            
##                Citizenship               EmploymentStatus
##  Citizen, Native     :116639   Disabled          : 5712  
##  Citizen, Naturalized:  7073   Employed          :61733  
##  Non-Citizen         :  7590   Not in Labor Force:15246  
##                                Retired           :18619  
##                                Unemployed        : 4203  
##                                NA's              :25789  
##                                                          
##                                Industry    
##  Educational and health services   :15017  
##  Trade                             : 8933  
##  Professional and business services: 7519  
##  Manufacturing                     : 6791  
##  Leisure and hospitality           : 6364  
##  (Other)                           :21618  
##  NA's                              :65060  
##                                               MetroArea    
##  New York-Northern New Jersey-Long Island, NY-NJ-PA: 5409  
##  Washington-Arlington-Alexandria, DC-VA-MD-WV      : 4177  
##  Los Angeles-Long Beach-Santa Ana, CA              : 4102  
##  Philadelphia-Camden-Wilmington, PA-NJ-DE          : 2855  
##  Chicago-Naperville-Joliet, IN-IN-WI               : 2772  
##  (Other)                                           :77749  
##  NA's                                              :34238  
##           Country      
##  United States:115063  
##  Mexico       :  3921  
##  Philippines  :   839  
##  India        :   770  
##  China        :   581  
##  (Other)      :  9952  
##  NA's         :   176
# Counts by country of birth, sorted in increasing order, to find the most
# common place of birth outside North America
sort(table(CPS$Country))
## 
##                         Cyprus                         Kosovo 
##                              0                              0 
##         Oceania, not specified       Other U. S. Island Areas 
##                              0                              0 
##                          Wales               Northern Ireland 
##                              0                              2 
##                       Tanzania                     Azerbaijan 
##                              2                              3 
##                 Czechoslovakia               St. Kitts--Nevis 
##                              3                              3 
##                        Georgia                       Barbados 
##                              5                              6 
##                        Denmark                         Latvia 
##                              6                              6 
##                          Samoa                        Senegal 
##                              6                              6 
##                      Singapore                       Slovakia 
##                              6                              6 
##                          Tonga                       Zimbabwe 
##                              6                              6 
##   South America, not specified                      St. Lucia 
##                              7                              7 
##                        Algeria        Americas, not specified 
##                              9                              9 
##                         Belize                           Fiji 
##                              9                              9 
## St. Vincent and the Grenadines                        Bahamas 
##                              9                             10 
##                        Finland                         Kuwait 
##                             10                             10 
##                      Lithuania                 Czech Republic 
##                             10                             11 
##                       Dominica                       Paraguay 
##                             11                             11 
##                        Croatia                      Macedonia 
##                             12                             12 
##                        Moldova            Antigua and Barbuda 
##                             12                             13 
##                        Belgium                        Bermuda 
##                             13                             13 
##                        Bolivia                        Grenada 
##                             13                             13 
##                          Sudan                     Cape Verde 
##                             13                             15 
##                        Eritrea                   Sierra Leone 
##                             15                             15 
##                         Uganda                        Austria 
##                             15                             17 
##                        Morocco                      Sri Lanka 
##                             17                             17 
##                        Uruguay           U. S. Virgin Islands 
##                             17                             17 
##                        Albania                         Norway 
##                             18                             18 
##          Europe, not specified                     Uzbekistan 
##                             19                             19 
##     West Indies, not specified                       Malaysia 
##                             19                             20 
##                         Serbia                         Azores 
##                             20                             22 
##                           USSR                    New Zealand 
##                             22                             23 
##                    Switzerland                          Yemen 
##                             23                             23 
##                        Belarus                       Scotland 
##                             24                             24 
##                     Yugoslavia                        Hungary 
##                             24                             25 
##                    Afghanistan                      Indonesia 
##                             26                             26 
##                    Netherlands                         Sweden 
##                             28                             28 
##                       Bulgaria                     Costa Rica 
##                             29                             29 
##                   Saudi Arabia                           Guam 
##                             29                             31 
##                       Cameroon                          Syria 
##                             32                             32 
##                        Armenia                         Jordan 
##                             35                             36 
##                          Chile            Asia, not specified 
##                             37                             39 
##                        Ireland                          Spain 
##                             39                             41 
##                     Bangladesh                      Australia 
##                             42                             43 
##                          Nepal                         Panama 
##                             44                             44 
##                        Lebanon                Myanmar (Burma) 
##                             45                             45 
##                   South Africa                         Turkey 
##                             48                             48 
##                       Cambodia                        Liberia 
##                             49                             52 
##                          Kenya                        Romania 
##                             55                             55 
##                         Greece                         Israel 
##                             56                             57 
##            Trinidad and Tobago           Bosnia & Herzegovina 
##                             60                             61 
##                      Venezuela                      Argentina 
##                             61                             64 
##                      Hong Kong                       Portugal 
##                             64                             64 
##                          Egypt                        Somalia 
##                             65                             72 
##                         France                    South Korea 
##                             73                             73 
##                          Ghana                      Nicaragua 
##                             76                             76 
##                       Ethiopia                      Elsewhere 
##                             80                             81 
##                        Nigeria                           Iraq 
##                             85                             97 
##                           Laos                         Taiwan 
##                             98                            102 
##                        Ukraine                         Guyana 
##                            104                            109 
##                       Pakistan                 United Kingdom 
##                            109                            111 
##                       Thailand          Africa, not specified 
##                            128                            129 
##                        Ecuador                           Peru 
##                            136                            136 
##                           Iran                          Italy 
##                            144                            149 
##                         Brazil                         Poland 
##                            159                            162 
##                          Haiti                         Russia 
##                            167                            173 
##                        England                          Japan 
##                            179                            187 
##                       Honduras                       Columbia 
##                            189                            206 
##                        Jamaica                      Guatemala 
##                            217                            309 
##             Dominican Republic                          Korea 
##                            330                            334 
##                         Canada                           Cuba 
##                            410                            426 
##                        Germany                        Vietnam 
##                            438                            458 
##                    El Salvador                    Puerto Rico 
##                            477                            518 
##                          China                          India 
##                            581                            770 
##                    Philippines                         Mexico 
##                            839                           3921 
##                  United States 
##                         115063
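Reading the answer off a long sorted table is error-prone, so it may help to filter and rank programmatically. The sketch below uses an invented `country` vector, and the set of names treated as North American is an assumption chosen for the toy data only.

```r
# Hypothetical toy vector standing in for CPS$Country
country <- c("United States", "United States", "Mexico", "Mexico", "Canada",
             "Philippines", "Philippines", "India", NA)

counts <- table(country)  # NA is dropped by default

# Drop the entries treated as North American in this toy example (assumption)
north_america <- c("United States", "Mexico", "Canada")
outside <- counts[!(names(counts) %in% north_america)]

# Rank what remains, most common first
ranked <- sort(outside, decreasing = TRUE)
head(ranked, 3)
top_outside <- names(ranked)[1]
top_outside
```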
# Proportion of the interviewees from the 'New York-Northern New
# Jersey-Long Island, NY-NJ-PA' metropolitan area whose country of
# birth is not the United States
t = table(CPS$MetroArea == "New York-Northern New Jersey-Long Island, NY-NJ-PA", 
    CPS$Country != "United States")
t
##        
##         FALSE  TRUE
##   FALSE 78757 12744
##   TRUE   3736  1668
# Share of the New York row (first index "TRUE") that falls in the
# foreign-born column (second index "TRUE")
t["TRUE", "TRUE"]/sum(t["TRUE", ])
## [1] 0.3087
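The figure above excludes interviewees whose country of birth is unknown, because table() drops NA by default. An equivalent route, sketched here on hypothetical toy data (the `toy` data frame and its short metro labels are invented), is to take the mean of a logical vector with na.rm = TRUE.

```r
# Hypothetical toy data; NA marks an unknown country of birth
toy <- data.frame(
  MetroArea = c("NY", "NY", "NY", "NY", "LA", "LA"),
  Country   = c("United States", "India", NA, "United States",
                "Mexico", "United States")
)

in_ny <- toy$MetroArea == "NY"

# Share of NY interviewees with a known, non-US country of birth
foreign_share <- mean(toy$Country[in_ny] != "United States", na.rm = TRUE)
foreign_share

# The same arithmetic applied to the counts printed in the table above
ny_share <- 1668 / (3736 + 1668)
round(ny_share, 4)
```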
# Which metropolitan area has the largest number (note -- not proportion)
# of interviewees with a country of birth in India, Brazil and Somalia,
# respectively?
which.max(tapply(CPS$Country == "India", CPS$MetroArea, sum, na.rm = TRUE))
## New York-Northern New Jersey-Long Island, NY-NJ-PA 
##                                                173
which.max(tapply(CPS$Country == "Brazil", CPS$MetroArea, sum, na.rm = TRUE))
## Boston-Cambridge-Quincy, MA-NH 
##                             34
which.max(tapply(CPS$Country == "Somalia", CPS$MetroArea, sum, na.rm = TRUE))
## Minneapolis-St Paul-Bloomington, MN-WI 
##                                    160
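As with the earlier which.max calls, the numbers printed above are positions in the MetroArea factor, not counts of interviewees. One way to see the actual counts, sketched on invented data (the `toy` data frame is hypothetical), is to tabulate the metropolitan areas of the matching rows.

```r
# Hypothetical toy data standing in for the merged CPS data frame
toy <- data.frame(
  MetroArea = c("Boston", "Boston", "New York", "New York", "New York", NA),
  Country   = c("Brazil", "India", "India", "India", "Brazil", "India")
)

# which() ignores NA comparisons, so only confirmed matches are selected
from_india <- which(toy$Country == "India")

# Count those rows by metropolitan area (rows with NA MetroArea are dropped)
india_by_metro <- sort(table(toy$MetroArea[from_india]), decreasing = TRUE)
india_by_metro
```

The first entry of the sorted table gives both the metropolitan area and its count, which answers the "largest number" question directly.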