In the wake of the Great Recession of 2009, there has been a good deal of focus on employment statistics, one of the most important metrics policymakers use to gauge the overall strength of the economy. In the United States, the government measures unemployment using the Current Population Survey (CPS), which collects demographic and employment information from a wide range of Americans each month. In this exercise, we use the September 2013 version of this rich, nationally representative dataset (available online).
The observations in the dataset represent people surveyed in the September 2013 CPS who actually completed a survey. The data file, CPSData.csv, has the following 14 variables:
PeopleInHousehold: The number of people in the interviewee's household.
Region: The census region where the interviewee lives.
State: The state where the interviewee lives.
MetroAreaCode: A code that identifies the metropolitan area in which the interviewee lives (missing if the interviewee does not live in a metropolitan area). The mapping from codes to names of metropolitan areas is provided in the file MetroAreaCodes.csv.
Age: The age, in years, of the interviewee. 80 represents people aged 80-84, and 85 represents people aged 85 and higher.
Married: The marriage status of the interviewee.
Sex: The sex of the interviewee.
Education: The maximum level of education obtained by the interviewee.
Race: The race of the interviewee.
Hispanic: Whether the interviewee is of Hispanic ethnicity.
CountryOfBirthCode: A code identifying the country of birth of the interviewee. The mapping from codes to names of countries is provided in the file CountryCodes.csv.
Citizenship: The United States citizenship status of the interviewee.
EmploymentStatus: The status of employment of the interviewee.
Industry: The industry of employment of the interviewee (only available if they are employed).
# Set the working directory to where the data is located
setwd("/home/tarek/Analytics/Week1/Rlectures/Data")
# Read the Data
CPS <- read.csv("CPSData.csv")
MetroAreaMap <- read.csv("MetroAreaCodes.csv")
CountryMap <- read.csv("CountryCodes.csv")
str(CPS)
## 'data.frame': 131302 obs. of 14 variables:
## $ PeopleInHousehold : int 1 3 3 3 3 3 3 2 2 2 ...
## $ Region : Factor w/ 4 levels "Midwest","Northeast",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ State : Factor w/ 51 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ MetroAreaCode : int 26620 13820 13820 13820 26620 26620 26620 33660 33660 26620 ...
## $ Age : int 85 21 37 18 52 24 26 71 43 52 ...
## $ Married : Factor w/ 5 levels "Divorced","Married",..: 5 3 3 3 5 3 3 1 1 3 ...
## $ Sex : Factor w/ 2 levels "Female","Male": 1 2 1 2 1 2 2 1 2 2 ...
## $ Education : Factor w/ 8 levels "Associate degree",..: 1 4 4 6 1 2 4 4 4 2 ...
## $ Race : Factor w/ 6 levels "American Indian",..: 6 3 3 3 6 6 6 6 6 6 ...
## $ Hispanic : int 0 0 0 0 0 0 0 0 0 0 ...
## $ CountryOfBirthCode: int 57 57 57 57 57 57 57 57 57 57 ...
## $ Citizenship : Factor w/ 3 levels "Citizen, Native",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ EmploymentStatus : Factor w/ 5 levels "Disabled","Employed",..: 4 5 1 3 2 2 2 2 3 2 ...
## $ Industry : Factor w/ 14 levels "Agriculture, forestry, fishing, and hunting",..: NA 11 NA NA 11 4 14 4 NA 12 ...
summary(CPS)
## PeopleInHousehold Region State MetroAreaCode
## Min. : 1.00 Midwest :30684 California :11570 Min. :10420
## 1st Qu.: 2.00 Northeast:25939 Texas : 7077 1st Qu.:21780
## Median : 3.00 South :41502 New York : 5595 Median :34740
## Mean : 3.28 West :33177 Florida : 5149 Mean :35075
## 3rd Qu.: 4.00 Pennsylvania: 3930 3rd Qu.:41860
## Max. :15.00 Illinois : 3912 Max. :79600
## (Other) :94069 NA's :34238
## Age Married Sex
## Min. : 0.0 Divorced :11151 Female:67481
## 1st Qu.:19.0 Married :55509 Male :63821
## Median :39.0 Never Married:30772
## Mean :38.8 Separated : 2027
## 3rd Qu.:57.0 Widowed : 6505
## Max. :85.0 NA's :25338
##
## Education Race Hispanic
## High school :30906 American Indian : 1433 Min. :0.000
## Bachelor's degree :19443 Asian : 6520 1st Qu.:0.000
## Some college, no degree:18863 Black : 13913 Median :0.000
## No high school diploma :16095 Multiracial : 2897 Mean :0.139
## Associate degree : 9913 Pacific Islander: 618 3rd Qu.:0.000
## (Other) :10744 White :105921 Max. :1.000
## NA's :25338
## CountryOfBirthCode Citizenship
## Min. : 57.0 Citizen, Native :116639
## 1st Qu.: 57.0 Citizen, Naturalized: 7073
## Median : 57.0 Non-Citizen : 7590
## Mean : 82.7
## 3rd Qu.: 57.0
## Max. :555.0
##
## EmploymentStatus Industry
## Disabled : 5712 Educational and health services :15017
## Employed :61733 Trade : 8933
## Not in Labor Force:15246 Professional and business services: 7519
## Retired :18619 Manufacturing : 6791
## Unemployed : 4203 Leisure and hospitality : 6364
## NA's :25789 (Other) :21618
## NA's :65060
# the most common industry of employment
which.max(table(CPS$Industry))
## Educational and health services
## 4
# The states with the fewest and the most interviewees
which.min(table(CPS$State))
## New Mexico
## 32
which.max(table(CPS$State))
## California
## 5
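Note that which.min() and which.max() return the position of the extreme value in the table, labelled with its level name, rather than the count itself. A minimal sketch on a small made-up vector (the `state` data below is hypothetical, not taken from CPSData.csv) shows one way to recover both the name and the count:

```r
# Hypothetical toy data standing in for CPS$State
state <- factor(c("Texas", "Ohio", "Texas", "Utah", "Texas", "Ohio"))
counts <- table(state)

# which.max() returns the index of the largest count, named by its level
idx <- which.max(counts)
top_state <- names(idx)               # level name with the most rows
top_count <- as.integer(counts[idx])  # the count itself

# sort() shows the full ranking in one call
ranked <- sort(counts, decreasing = TRUE)
print(ranked)
```

On the real data, sort(table(CPS$State)) would give the same kind of ranking for every state at once.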
# proportion of interviewees who are citizens of the United States
citizen = CPS[CPS$Citizenship == "Citizen, Native" | CPS$Citizenship == "Citizen, Naturalized",
]
nrow(citizen)/nrow(CPS)
## [1] 0.9422
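The same proportion can probably be obtained without building an intermediate data frame, because the mean of a logical vector is the share of TRUE values. The sketch below uses a made-up `citizenship` vector (hypothetical data), then cross-checks the result above using only the counts printed by summary(CPS):

```r
# Hypothetical toy data standing in for CPS$Citizenship
citizenship <- factor(c("Citizen, Native", "Non-Citizen", "Citizen, Naturalized",
                        "Citizen, Native", "Citizen, Native"))

# %in% yields a logical vector; its mean is the fraction of TRUE values
is_citizen <- citizenship %in% c("Citizen, Native", "Citizen, Naturalized")
prop_citizen <- mean(is_citizen)
print(prop_citizen)

# Cross-check with the counts reported by summary(CPS):
# native + naturalized citizens, divided by all observations
source_prop <- (116639 + 7073) / 131302
print(round(source_prop, 4))
```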
# Race and Hispanic ethnicity
table(CPS$Race, CPS$Hispanic)
##
## 0 1
## American Indian 1129 304
## Asian 6407 113
## Black 13292 621
## Multiracial 2449 448
## Pacific Islander 541 77
## White 89190 16731
# We can test the relationship between these four variables and
# whether the Married variable is missing:
# table(CPS$Region, is.na(CPS$Married))
# table(CPS$Sex, is.na(CPS$Married))
# table(CPS$Age, is.na(CPS$Married))
# table(CPS$Citizenship, is.na(CPS$Married))
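To illustrate how such a cross-tabulation can expose a missingness pattern, here is a minimal sketch on a made-up data frame. The `toy` data is hypothetical and deliberately constructed so that Married is missing for the youngest rows; it does not reproduce the CPS results.

```r
# Hypothetical toy data: Married is NA for the two youngest rows
toy <- data.frame(
  Age = c(5, 12, 30, 45, 70),
  Sex = c("Female", "Male", "Female", "Male", "Female"),
  Married = c(NA, NA, "Married", "Divorced", "Widowed")
)

# Cross-tabulate each candidate variable against the missingness indicator
by_age <- table(toy$Age, is.na(toy$Married))
by_sex <- table(toy$Sex, is.na(toy$Married))
print(by_age)
print(by_sex)

# The share of missing values per group can be easier to read than raw counts
missing_by_sex <- tapply(is.na(toy$Married), toy$Sex, mean)
print(missing_by_sex)
```

A variable is a plausible driver of the missingness when the TRUE column is concentrated in a few of its values, as it is for Age in this toy example.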
# States where every interviewee lives in a non-metropolitan area, and
# states where every interviewee lives in a metropolitan area.
# Alaska and Wyoming have no interviewees living in a metropolitan area;
# the District of Columbia, New Jersey, and Rhode Island have all
# interviewees living in a metropolitan area.
table(CPS$State, is.na(CPS$MetroAreaCode))
##
## FALSE TRUE
## Alabama 1020 356
## Alaska 0 1590
## Arizona 1327 201
## Arkansas 724 697
## California 11333 237
## Colorado 2545 380
## Connecticut 2593 243
## Delaware 1696 518
## District of Columbia 1791 0
## Florida 4947 202
## Georgia 2250 557
## Hawaii 1576 523
## Idaho 761 757
## Illinois 3473 439
## Indiana 1420 584
## Iowa 1297 1231
## Kansas 1234 701
## Kentucky 908 933
## Louisiana 1216 234
## Maine 909 1354
## Maryland 2978 222
## Massachusetts 1858 129
## Michigan 2517 546
## Minnesota 2150 989
## Mississippi 376 854
## Missouri 1440 705
## Montana 199 1015
## Nebraska 816 1133
## Nevada 1609 247
## New Hampshire 1148 1514
## New Jersey 2567 0
## New Mexico 832 270
## New York 5144 451
## North Carolina 1642 977
## North Dakota 432 1213
## Ohio 2754 924
## Oklahoma 1024 499
## Oregon 1519 424
## Pennsylvania 3245 685
## Rhode Island 2209 0
## South Carolina 1139 519
## South Dakota 595 1405
## Tennessee 1149 635
## Texas 6060 1017
## Utah 1455 387
## Vermont 657 1233
## Virginia 2367 586
## Washington 1937 429
## West Virginia 344 1065
## Wisconsin 1882 804
## Wyoming 0 1624
# Which region of the United States has the largest proportion of
# interviewees living in a non-metropolitan area?
table(CPS$Region, is.na(CPS$MetroAreaCode))
##
## FALSE TRUE
## Midwest 20010 10674
## Northeast 20330 5609
## South 31631 9871
## West 25093 8084
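The table above reports counts, so the proportions still have to be worked out. A minimal sketch, re-entering the printed counts by hand and using prop.table() to normalise each row:

```r
# Counts copied from the table above (FALSE = metro, TRUE = non-metro)
region_counts <- matrix(
  c(20010, 10674,
    20330,  5609,
    31631,  9871,
    25093,  8084),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("Midwest", "Northeast", "South", "West"),
                  c("FALSE", "TRUE"))
)

# Row-wise proportions: each row sums to 1
region_props <- prop.table(region_counts, margin = 1)
print(round(region_props, 3))

# Region with the largest non-metropolitan share
top_region <- names(which.max(region_props[, "TRUE"]))
print(top_region)
```

With the full data frame loaded, tapply(is.na(CPS$MetroAreaCode), CPS$Region, mean) should give the TRUE column directly.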
# Proportion of interviewees living in non-metropolitan areas, by state,
# sorted from lowest to highest
sort(round(tapply(is.na(CPS$MetroAreaCode), CPS$State, mean), 2))
## District of Columbia New Jersey Rhode Island
## 0.00 0.00 0.00
## California Florida Massachusetts
## 0.02 0.04 0.06
## Maryland New York Connecticut
## 0.07 0.08 0.09
## Illinois Arizona Colorado
## 0.11 0.13 0.13
## Nevada Texas Louisiana
## 0.13 0.14 0.16
## Pennsylvania Michigan Washington
## 0.17 0.18 0.18
## Georgia Virginia Utah
## 0.20 0.20 0.21
## Oregon Delaware Hawaii
## 0.22 0.23 0.25
## New Mexico Ohio Alabama
## 0.25 0.25 0.26
## Indiana Wisconsin South Carolina
## 0.29 0.30 0.31
## Minnesota Missouri Oklahoma
## 0.32 0.33 0.33
## Kansas Tennessee North Carolina
## 0.36 0.36 0.37
## Arkansas Iowa Idaho
## 0.49 0.49 0.50
## Kentucky New Hampshire Nebraska
## 0.51 0.57 0.58
## Maine Vermont Mississippi
## 0.60 0.65 0.69
## South Dakota North Dakota West Virginia
## 0.70 0.74 0.76
## Montana Alaska Wyoming
## 0.84 1.00 1.00
To merge in the metropolitan areas, we want to connect the field MetroAreaCode from the CPS data frame with the field Code in MetroAreaMap.
# Merge the two data frames on these columns, overwriting the CPS data
# frame with the result; all.x = TRUE keeps CPS rows with no matching code
CPS = merge(CPS, MetroAreaMap, by.x = "MetroAreaCode", by.y = "Code", all.x = TRUE)
summary(CPS)
## MetroAreaCode PeopleInHousehold Region State
## Min. :10420 Min. : 1.00 Midwest :30684 California :11570
## 1st Qu.:21780 1st Qu.: 2.00 Northeast:25939 Texas : 7077
## Median :34740 Median : 3.00 South :41502 New York : 5595
## Mean :35075 Mean : 3.28 West :33177 Florida : 5149
## 3rd Qu.:41860 3rd Qu.: 4.00 Pennsylvania: 3930
## Max. :79600 Max. :15.00 Illinois : 3912
## NA's :34238 (Other) :94069
## Age Married Sex
## Min. : 0.0 Divorced :11151 Female:67481
## 1st Qu.:19.0 Married :55509 Male :63821
## Median :39.0 Never Married:30772
## Mean :38.8 Separated : 2027
## 3rd Qu.:57.0 Widowed : 6505
## Max. :85.0 NA's :25338
##
## Education Race Hispanic
## High school :30906 American Indian : 1433 Min. :0.000
## Bachelor's degree :19443 Asian : 6520 1st Qu.:0.000
## Some college, no degree:18863 Black : 13913 Median :0.000
## No high school diploma :16095 Multiracial : 2897 Mean :0.139
## Associate degree : 9913 Pacific Islander: 618 3rd Qu.:0.000
## (Other) :10744 White :105921 Max. :1.000
## NA's :25338
## CountryOfBirthCode Citizenship
## Min. : 57.0 Citizen, Native :116639
## 1st Qu.: 57.0 Citizen, Naturalized: 7073
## Median : 57.0 Non-Citizen : 7590
## Mean : 82.7
## 3rd Qu.: 57.0
## Max. :555.0
##
## EmploymentStatus Industry
## Disabled : 5712 Educational and health services :15017
## Employed :61733 Trade : 8933
## Not in Labor Force:15246 Professional and business services: 7519
## Retired :18619 Manufacturing : 6791
## Unemployed : 4203 Leisure and hospitality : 6364
## NA's :25789 (Other) :21618
## NA's :65060
## MetroArea
## New York-Northern New Jersey-Long Island, NY-NJ-PA: 5409
## Washington-Arlington-Alexandria, DC-VA-MD-WV : 4177
## Los Angeles-Long Beach-Santa Ana, CA : 4102
## Philadelphia-Camden-Wilmington, PA-NJ-DE : 2855
## Chicago-Naperville-Joliet, IN-IN-WI : 2772
## (Other) :77749
## NA's :34238
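The effect of all.x = TRUE in the merge above can be seen on a small example. The sketch below uses two made-up tables (`people` and `code_map` are hypothetical names and data) that mirror the CPS / MetroAreaMap layout:

```r
# Hypothetical toy tables mirroring the CPS / MetroAreaMap layout
people <- data.frame(Id = 1:4, MetroAreaCode = c(100, 200, NA, 100))
code_map <- data.frame(Code = c(100, 200, 300),
                       MetroArea = c("Alpha", "Beta", "Gamma"))

# all.x = TRUE keeps every row of `people`, even those with no match;
# unmatched rows get NA in the columns that come from `code_map`
merged <- merge(people, code_map,
                by.x = "MetroAreaCode", by.y = "Code", all.x = TRUE)
print(merged)
```

The key column moves to the first position after the merge, which matches the summary above, where MetroAreaCode is now listed first and MetroArea has the same number of NA's as MetroAreaCode.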
# Metropolitan area with the largest number of interviewees
which.max(table(CPS$MetroArea))
## New York-Northern New Jersey-Long Island, NY-NJ-PA
## 173
# Metropolitan area that has the highest proportion of interviewees of
# Hispanic ethnicity
which.max(tapply(CPS$Hispanic, CPS$MetroArea, mean))
## Laredo, TX
## 136
# Metropolitan areas in the United States from which at least 20% of
# interviewees are Asian.
which(tapply(CPS$Race == "Asian", CPS$MetroArea, mean) > 0.2)
## Honolulu, HI San Francisco-Oakland-Fremont, CA
## 107 220
## San Jose-Sunnyvale-Santa Clara, CA Vallejo-Fairfield, CA
## 221 254
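The pattern used here, taking the mean of a logical comparison within each group and then filtering on a threshold, can be sketched on made-up data (the `race` and `metro` vectors below are hypothetical):

```r
# Hypothetical toy data standing in for CPS$Race and CPS$MetroArea
race  <- c("Asian", "White", "Asian", "White", "White", "Black", "White", "White")
metro <- c("A", "A", "A", "B", "B", "B", "B", "B")

# The mean of a logical vector is the fraction of TRUE values per group
asian_share <- tapply(race == "Asian", metro, mean)

# Keep only the groups above the 20% threshold, sorted for readability
above <- sort(asian_share[asian_share > 0.2], decreasing = TRUE)
print(above)
```

Unlike which(), which returns positions, indexing the shares themselves keeps the actual proportions visible.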
# Metropolitan area with the smallest proportion of interviewees who have
# no high school diploma
which.min(tapply(CPS$Education == "No high school diploma", CPS$MetroArea, mean,
    na.rm = TRUE))
## Iowa City, IA
## 112
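The na.rm argument matters here because Education has missing values (25338 NA's in the summary above). A minimal sketch on a made-up vector (hypothetical data) shows what happens with and without it:

```r
# Hypothetical toy data: one education value is missing
education <- c("No high school diploma", NA, "High school", "Bachelor's degree")

# Comparing NA to a string yields NA, which makes the mean NA as well
with_na <- mean(education == "No high school diploma")

# na.rm = TRUE drops the NA before averaging
without_na <- mean(education == "No high school diploma", na.rm = TRUE)
print(c(with_na, without_na))
```

Note that with na.rm = TRUE the proportion is taken over interviewees with a recorded education level only, not over all interviewees.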
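# Merge in the country names by connecting CountryOfBirthCode in CPS with
# the field Code in CountryMap
CPS = merge(CPS, CountryMap, by.x = "CountryOfBirthCode", by.y = "Code", all.x = TRUE)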
summary(CPS)
## CountryOfBirthCode MetroAreaCode PeopleInHousehold Region
## Min. : 57.0 Min. :10420 Min. : 1.00 Midwest :30684
## 1st Qu.: 57.0 1st Qu.:21780 1st Qu.: 2.00 Northeast:25939
## Median : 57.0 Median :34740 Median : 3.00 South :41502
## Mean : 82.7 Mean :35075 Mean : 3.28 West :33177
## 3rd Qu.: 57.0 3rd Qu.:41860 3rd Qu.: 4.00
## Max. :555.0 Max. :79600 Max. :15.00
## NA's :34238
## State Age Married Sex
## California :11570 Min. : 0.0 Divorced :11151 Female:67481
## Texas : 7077 1st Qu.:19.0 Married :55509 Male :63821
## New York : 5595 Median :39.0 Never Married:30772
## Florida : 5149 Mean :38.8 Separated : 2027
## Pennsylvania: 3930 3rd Qu.:57.0 Widowed : 6505
## Illinois : 3912 Max. :85.0 NA's :25338
## (Other) :94069
## Education Race Hispanic
## High school :30906 American Indian : 1433 Min. :0.000
## Bachelor's degree :19443 Asian : 6520 1st Qu.:0.000
## Some college, no degree:18863 Black : 13913 Median :0.000
## No high school diploma :16095 Multiracial : 2897 Mean :0.139
## Associate degree : 9913 Pacific Islander: 618 3rd Qu.:0.000
## (Other) :10744 White :105921 Max. :1.000
## NA's :25338
## Citizenship EmploymentStatus
## Citizen, Native :116639 Disabled : 5712
## Citizen, Naturalized: 7073 Employed :61733
## Non-Citizen : 7590 Not in Labor Force:15246
## Retired :18619
## Unemployed : 4203
## NA's :25789
##
## Industry
## Educational and health services :15017
## Trade : 8933
## Professional and business services: 7519
## Manufacturing : 6791
## Leisure and hospitality : 6364
## (Other) :21618
## NA's :65060
## MetroArea
## New York-Northern New Jersey-Long Island, NY-NJ-PA: 5409
## Washington-Arlington-Alexandria, DC-VA-MD-WV : 4177
## Los Angeles-Long Beach-Santa Ana, CA : 4102
## Philadelphia-Camden-Wilmington, PA-NJ-DE : 2855
## Chicago-Naperville-Joliet, IN-IN-WI : 2772
## (Other) :77749
## NA's :34238
## Country
## United States:115063
## Mexico : 3921
## Philippines : 839
## India : 770
## China : 581
## (Other) : 9952
## NA's : 176
# Country counts in ascending order, used to find the most common place of
# birth outside North America
sort(table(CPS$Country))
##
## Cyprus Kosovo
## 0 0
## Oceania, not specified Other U. S. Island Areas
## 0 0
## Wales Northern Ireland
## 0 2
## Tanzania Azerbaijan
## 2 3
## Czechoslovakia St. Kitts--Nevis
## 3 3
## Georgia Barbados
## 5 6
## Denmark Latvia
## 6 6
## Samoa Senegal
## 6 6
## Singapore Slovakia
## 6 6
## Tonga Zimbabwe
## 6 6
## South America, not specified St. Lucia
## 7 7
## Algeria Americas, not specified
## 9 9
## Belize Fiji
## 9 9
## St. Vincent and the Grenadines Bahamas
## 9 10
## Finland Kuwait
## 10 10
## Lithuania Czech Republic
## 10 11
## Dominica Paraguay
## 11 11
## Croatia Macedonia
## 12 12
## Moldova Antigua and Barbuda
## 12 13
## Belgium Bermuda
## 13 13
## Bolivia Grenada
## 13 13
## Sudan Cape Verde
## 13 15
## Eritrea Sierra Leone
## 15 15
## Uganda Austria
## 15 17
## Morocco Sri Lanka
## 17 17
## Uruguay U. S. Virgin Islands
## 17 17
## Albania Norway
## 18 18
## Europe, not specified Uzbekistan
## 19 19
## West Indies, not specified Malaysia
## 19 20
## Serbia Azores
## 20 22
## USSR New Zealand
## 22 23
## Switzerland Yemen
## 23 23
## Belarus Scotland
## 24 24
## Yugoslavia Hungary
## 24 25
## Afghanistan Indonesia
## 26 26
## Netherlands Sweden
## 28 28
## Bulgaria Costa Rica
## 29 29
## Saudi Arabia Guam
## 29 31
## Cameroon Syria
## 32 32
## Armenia Jordan
## 35 36
## Chile Asia, not specified
## 37 39
## Ireland Spain
## 39 41
## Bangladesh Australia
## 42 43
## Nepal Panama
## 44 44
## Lebanon Myanmar (Burma)
## 45 45
## South Africa Turkey
## 48 48
## Cambodia Liberia
## 49 52
## Kenya Romania
## 55 55
## Greece Israel
## 56 57
## Trinidad and Tobago Bosnia & Herzegovina
## 60 61
## Venezuela Argentina
## 61 64
## Hong Kong Portugal
## 64 64
## Egypt Somalia
## 65 72
## France South Korea
## 73 73
## Ghana Nicaragua
## 76 76
## Ethiopia Elsewhere
## 80 81
## Nigeria Iraq
## 85 97
## Laos Taiwan
## 98 102
## Ukraine Guyana
## 104 109
## Pakistan United Kingdom
## 109 111
## Thailand Africa, not specified
## 128 129
## Ecuador Peru
## 136 136
## Iran Italy
## 144 149
## Brazil Poland
## 159 162
## Haiti Russia
## 167 173
## England Japan
## 179 187
## Honduras Columbia
## 189 206
## Jamaica Guatemala
## 217 309
## Dominican Republic Korea
## 330 334
## Canada Cuba
## 410 426
## Germany Vietnam
## 438 458
## El Salvador Puerto Rico
## 477 518
## China India
## 581 770
## Philippines Mexico
## 839 3921
## United States
## 115063
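Reading the answer off the end of a long sorted table is error-prone, so the filtering step can be made explicit. The sketch below re-enters the eleven largest counts from the table above by hand; the `north_america` vector is an assumption, a hand-written list of which of those entries lie in North America (including Central America and the Caribbean).

```r
# Largest counts copied from the sorted table above
birth_counts <- c("United States" = 115063, "Mexico" = 3921, "Philippines" = 839,
                  "India" = 770, "China" = 581, "Puerto Rico" = 518,
                  "El Salvador" = 477, "Vietnam" = 458, "Germany" = 438,
                  "Cuba" = 426, "Canada" = 410)

# Assumption: hand-written list of the North American entries among these
north_america <- c("United States", "Mexico", "Puerto Rico",
                   "El Salvador", "Cuba", "Canada")

# Drop the North American entries, then take the largest remaining count
outside <- birth_counts[!(names(birth_counts) %in% north_america)]
top_outside <- names(which.max(outside))
print(top_outside)
```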
# Proportion of the interviewees from the 'New York-Northern New
# Jersey-Long Island, NY-NJ-PA' metropolitan area whose country of
# birth is not the United States
t = table(CPS$MetroArea == "New York-Northern New Jersey-Long Island, NY-NJ-PA",
CPS$Country != "United States")
t
##
## FALSE TRUE
## FALSE 78757 12744
## TRUE 3736 1668
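# Convert the table into a data frame
df = as.data.frame(t)
# Among New York interviewees (Var1 == TRUE), divide the count born outside
# the United States by the total count with a known country of birth
df$Freq[df$Var1 == TRUE][2]/(df$Freq[df$Var1 == TRUE][1] + df$Freq[df$Var1 ==
    TRUE][2])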
## [1] 0.3087
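The same figure can be checked from the printed counts with prop.table(), which avoids indexing into the data frame by position. A minimal sketch, with the counts re-entered by hand from the table above (the dimension names `NewYork` and `ForeignBorn` are labels introduced here for readability):

```r
# Counts copied from the table above:
# rows = in the New York metro area?, columns = born outside the US?
counts <- matrix(c(78757, 12744,
                    3736,  1668),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(NewYork = c("FALSE", "TRUE"),
                                 ForeignBorn = c("FALSE", "TRUE")))

# Row-wise proportions; the ["TRUE", "TRUE"] cell is the share we want
props <- prop.table(counts, margin = 1)
ny_foreign_share <- props["TRUE", "TRUE"]
print(round(ny_foreign_share, 4))
```

Interviewees with a missing country of birth are excluded from both the numerator and the denominator, as in the original calculation.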
# Which metropolitan area has the largest number (note -- not proportion)
# of interviewees with a country of birth in India, Brazil and Somalia,
# respectively?
which.max(tapply(CPS$Country == "India", CPS$MetroArea, sum, na.rm = TRUE))
## New York-Northern New Jersey-Long Island, NY-NJ-PA
## 173
which.max(tapply(CPS$Country == "Brazil", CPS$MetroArea, sum, na.rm = TRUE))
## Boston-Cambridge-Quincy, MA-NH
## 34
which.max(tapply(CPS$Country == "Somalia", CPS$MetroArea, sum, na.rm = TRUE))
## Minneapolis-St Paul-Bloomington, MN-WI
## 160
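Since which.max() reports only the winner, it can help to look at the full ranking of counts. A minimal sketch on made-up data (the `country` and `metro` vectors are hypothetical):

```r
# Hypothetical toy data standing in for CPS$Country and CPS$MetroArea
country <- c("India", "India", NA, "Brazil", "India", "Brazil")
metro   <- c("A", "A", "A", "B", "B", "B")

# sum() of a logical vector counts TRUE values; na.rm = TRUE skips unknowns
india_counts <- tapply(country == "India", metro, sum, na.rm = TRUE)

# Sorting shows the runners-up as well as the winner
print(sort(india_counts, decreasing = TRUE))
top_metro <- names(which.max(india_counts))
print(top_metro)
```

When two areas tie, which.max() returns only the first one, so the sorted view is the safer way to spot ties.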