In the wake of the Great Recession of 2009, employment statistics have drawn a good deal of attention, since they are among the most important metrics policymakers use to gauge the overall strength of the economy. In the United States, the government measures unemployment using the Current Population Survey (CPS), which collects demographic and employment information from a wide range of Americans each month. In this exercise, we will apply the topics reviewed in the lectures, along with a few new techniques, to the September 2013 version of this rich, nationally representative dataset (available online).
The observations in the dataset represent people surveyed in the September 2013 CPS who actually completed a survey. While the full dataset has 385 variables, in this exercise we will use a more compact version of the dataset, CPSData.csv, which has the following variables:
PeopleInHousehold: The number of people in the interviewee’s household.
Region: The census region where the interviewee lives.
State: The state where the interviewee lives.
MetroAreaCode: A code that identifies the metropolitan area in which the interviewee lives (missing if the interviewee does not live in a metropolitan area). The mapping from codes to names of metropolitan areas is provided in the file MetroAreaCodes.csv.
Age: The age, in years, of the interviewee. 80 represents people aged 80-84, and 85 represents people aged 85 and higher.
Married: The marriage status of the interviewee.
Sex: The sex of the interviewee.
Education: The maximum level of education obtained by the interviewee.
Race: The race of the interviewee.
Hispanic: Whether the interviewee is of Hispanic ethnicity.
CountryOfBirthCode: A code identifying the country of birth of the interviewee. The mapping from codes to names of countries is provided in the file CountryCodes.csv.
Citizenship: The United States citizenship status of the interviewee.
EmploymentStatus: The status of employment of the interviewee.
Industry: The industry of employment of the interviewee (only available if they are employed).
Load the dataset from CPSData.csv into a data frame called CPS, and view the dataset with the summary() and str() commands.
CPS <- read.csv("CPSData.csv")
summary(CPS)
## PeopleInHousehold Region State MetroAreaCode
## Min. : 1.000 Midwest :30684 California :11570 Min. :10420
## 1st Qu.: 2.000 Northeast:25939 Texas : 7077 1st Qu.:21780
## Median : 3.000 South :41502 New York : 5595 Median :34740
## Mean : 3.284 West :33177 Florida : 5149 Mean :35075
## 3rd Qu.: 4.000 Pennsylvania: 3930 3rd Qu.:41860
## Max. :15.000 Illinois : 3912 Max. :79600
## (Other) :94069 NA's :34238
## Age Married Sex
## Min. : 0.00 Divorced :11151 Female:67481
## 1st Qu.:19.00 Married :55509 Male :63821
## Median :39.00 Never Married:30772
## Mean :38.83 Separated : 2027
## 3rd Qu.:57.00 Widowed : 6505
## Max. :85.00 NA's :25338
##
## Education Race
## High school :30906 American Indian : 1433
## Bachelor's degree :19443 Asian : 6520
## Some college, no degree:18863 Black : 13913
## No high school diploma :16095 Multiracial : 2897
## Associate degree : 9913 Pacific Islander: 618
## (Other) :10744 White :105921
## NA's :25338
## Hispanic CountryOfBirthCode Citizenship
## Min. :0.0000 Min. : 57.00 Citizen, Native :116639
## 1st Qu.:0.0000 1st Qu.: 57.00 Citizen, Naturalized: 7073
## Median :0.0000 Median : 57.00 Non-Citizen : 7590
## Mean :0.1393 Mean : 82.68
## 3rd Qu.:0.0000 3rd Qu.: 57.00
## Max. :1.0000 Max. :555.00
##
## EmploymentStatus Industry
## Disabled : 5712 Educational and health services :15017
## Employed :61733 Trade : 8933
## Not in Labor Force:15246 Professional and business services: 7519
## Retired :18619 Manufacturing : 6791
## Unemployed : 4203 Leisure and hospitality : 6364
## NA's :25789 (Other) :21618
## NA's :65060
str(CPS)
## 'data.frame': 131302 obs. of 14 variables:
## $ PeopleInHousehold : int 1 3 3 3 3 3 3 2 2 2 ...
## $ Region : Factor w/ 4 levels "Midwest","Northeast",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ State : Factor w/ 51 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ MetroAreaCode : int 26620 13820 13820 13820 26620 26620 26620 33660 33660 26620 ...
## $ Age : int 85 21 37 18 52 24 26 71 43 52 ...
## $ Married : Factor w/ 5 levels "Divorced","Married",..: 5 3 3 3 5 3 3 1 1 3 ...
## $ Sex : Factor w/ 2 levels "Female","Male": 1 2 1 2 1 2 2 1 2 2 ...
## $ Education : Factor w/ 8 levels "Associate degree",..: 1 4 4 6 1 2 4 4 4 2 ...
## $ Race : Factor w/ 6 levels "American Indian",..: 6 3 3 3 6 6 6 6 6 6 ...
## $ Hispanic : int 0 0 0 0 0 0 0 0 0 0 ...
## $ CountryOfBirthCode: int 57 57 57 57 57 57 57 57 57 57 ...
## $ Citizenship : Factor w/ 3 levels "Citizen, Native",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ EmploymentStatus : Factor w/ 5 levels "Disabled","Employed",..: 4 5 1 3 2 2 2 2 3 2 ...
## $ Industry : Factor w/ 14 levels "Agriculture, forestry, fishing, and hunting",..: NA 11 NA NA 11 4 14 4 NA 12 ...
Among the interviewees with a value reported for the Industry variable, what is the most common industry of employment? Please enter the name exactly as it appears in the output.
table(CPS$Industry)
##
## Agriculture, forestry, fishing, and hunting
## 1307
## Armed forces
## 29
## Construction
## 4387
## Educational and health services
## 15017
## Financial
## 4347
## Information
## 1328
## Leisure and hospitality
## 6364
## Manufacturing
## 6791
## Mining
## 550
## Other services
## 3224
## Professional and business services
## 7519
## Public administration
## 3186
## Trade
## 8933
## Transportation and utilities
## 3260
max(table(CPS$Industry))
## [1] 15017
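max() returns only the largest count, not the label that goes with it. As a minimal sketch, the label can be read off with which.max(). The industry vector below is a made-up stand-in for CPS$Industry, not data from the survey.

```r
# Hypothetical toy vector standing in for CPS$Industry (illustrative values only)
industry <- c("Trade", "Mining", "Trade", NA, "Construction", "Trade", "Mining")

counts <- table(industry)        # table() drops NA values by default
top <- names(which.max(counts))  # label of the most frequent value
top_n <- max(counts)             # the corresponding count

print(top)
print(top_n)
```

On the real data, the same two calls applied to table(CPS$Industry) should give the industry name alongside its count.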
Recall from the homework assignment “The Analytical Detective” that you can call the sort() function on the output of the table() function to obtain a sorted breakdown of a variable. For instance, sort(table(CPS$Region)) sorts the regions by the number of interviewees from that region.
Which state has the fewest interviewees?
table(CPS$State)
##
## Alabama Alaska Arizona
## 1376 1590 1528
## Arkansas California Colorado
## 1421 11570 2925
## Connecticut Delaware District of Columbia
## 2836 2214 1791
## Florida Georgia Hawaii
## 5149 2807 2099
## Idaho Illinois Indiana
## 1518 3912 2004
## Iowa Kansas Kentucky
## 2528 1935 1841
## Louisiana Maine Maryland
## 1450 2263 3200
## Massachusetts Michigan Minnesota
## 1987 3063 3139
## Mississippi Missouri Montana
## 1230 2145 1214
## Nebraska Nevada New Hampshire
## 1949 1856 2662
## New Jersey New Mexico New York
## 2567 1102 5595
## North Carolina North Dakota Ohio
## 2619 1645 3678
## Oklahoma Oregon Pennsylvania
## 1523 1943 3930
## Rhode Island South Carolina South Dakota
## 2209 1658 2000
## Tennessee Texas Utah
## 1784 7077 1842
## Vermont Virginia Washington
## 1890 2953 2366
## West Virginia Wisconsin Wyoming
## 1409 2686 1624
sort(table(CPS$State))
##
## New Mexico Montana Mississippi
## 1102 1214 1230
## Alabama West Virginia Arkansas
## 1376 1409 1421
## Louisiana Idaho Oklahoma
## 1450 1518 1523
## Arizona Alaska Wyoming
## 1528 1590 1624
## North Dakota South Carolina Tennessee
## 1645 1658 1784
## District of Columbia Kentucky Utah
## 1791 1841 1842
## Nevada Vermont Kansas
## 1856 1890 1935
## Oregon Nebraska Massachusetts
## 1943 1949 1987
## South Dakota Indiana Hawaii
## 2000 2004 2099
## Missouri Rhode Island Delaware
## 2145 2209 2214
## Maine Washington Iowa
## 2263 2366 2528
## New Jersey North Carolina New Hampshire
## 2567 2619 2662
## Wisconsin Georgia Connecticut
## 2686 2807 2836
## Colorado Virginia Michigan
## 2925 2953 3063
## Minnesota Maryland Ohio
## 3139 3200 3678
## Illinois Pennsylvania Florida
## 3912 3930 5149
## New York Texas California
## 5595 7077 11570
Which state has the largest number of interviewees?
sort(table(CPS$State))
##
## New Mexico Montana Mississippi
## 1102 1214 1230
## Alabama West Virginia Arkansas
## 1376 1409 1421
## Louisiana Idaho Oklahoma
## 1450 1518 1523
## Arizona Alaska Wyoming
## 1528 1590 1624
## North Dakota South Carolina Tennessee
## 1645 1658 1784
## District of Columbia Kentucky Utah
## 1791 1841 1842
## Nevada Vermont Kansas
## 1856 1890 1935
## Oregon Nebraska Massachusetts
## 1943 1949 1987
## South Dakota Indiana Hawaii
## 2000 2004 2099
## Missouri Rhode Island Delaware
## 2145 2209 2214
## Maine Washington Iowa
## 2263 2366 2528
## New Jersey North Carolina New Hampshire
## 2567 2619 2662
## Wisconsin Georgia Connecticut
## 2686 2807 2836
## Colorado Virginia Michigan
## 2925 2953 3063
## Minnesota Maryland Ohio
## 3139 3200 3678
## Illinois Pennsylvania Florida
## 3912 3930 5149
## New York Texas California
## 5595 7077 11570
What proportion of interviewees are citizens of the United States?
table(CPS$Citizenship)
##
## Citizen, Native Citizen, Naturalized Non-Citizen
## 116639 7073 7590
# Native and naturalized citizens both count as citizens
(116639+7073)/(116639+7073+7590)
## [1] 0.9421943
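The proportion can also be computed without retyping the counts. The sketch below uses a hypothetical four-element vector in place of CPS$Citizenship and counts both native and naturalized citizens as citizens. It shows two roughly equivalent routes: the mean of a logical vector, and prop.table() on the table of counts.

```r
# Hypothetical toy vector standing in for CPS$Citizenship
citizenship <- c("Citizen, Native", "Non-Citizen",
                 "Citizen, Naturalized", "Citizen, Native")

# Route 1: the mean of a logical vector is the share of TRUE values
share <- mean(citizenship != "Non-Citizen")

# Route 2: convert counts to proportions, then add the two citizen categories
shares <- prop.table(table(citizenship))
share2 <- sum(shares[c("Citizen, Native", "Citizen, Naturalized")])

print(share)
print(share2)
```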
The CPS differentiates between race (with possible values American Indian, Asian, Black, Pacific Islander, White, or Multiracial) and ethnicity. A number of interviewees are of Hispanic ethnicity, as captured by the Hispanic variable. For which races are there at least 250 interviewees in the CPS dataset of Hispanic ethnicity? (Select all that apply.)
table(CPS$Race, CPS$Hispanic)
##
## 0 1
## American Indian 1129 304
## Asian 6407 113
## Black 13292 621
## Multiracial 2449 448
## Pacific Islander 541 77
## White 89190 16731
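Rather than scanning the "1" column by eye, the threshold can be applied directly to that column of the table. This is a sketch on made-up vectors. The threshold of 2 is chosen only to suit the toy data; the question above uses 250.

```r
# Hypothetical toy vectors standing in for CPS$Race and CPS$Hispanic
race <- c("White", "White", "Black", "Asian", "White", "Black")
hispanic <- c(1, 1, 0, 0, 1, 1)

tab <- table(race, hispanic)
hisp_counts <- tab[, "1"]        # counts of Hispanic interviewees per race

threshold <- 2                   # toy threshold; the exercise uses 250
qualifying <- names(hisp_counts[hisp_counts >= threshold])
print(qualifying)
```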
Which variables have at least one interviewee with a missing (NA) value? (Select all that apply.)
summary(CPS)
## PeopleInHousehold Region State MetroAreaCode
## Min. : 1.000 Midwest :30684 California :11570 Min. :10420
## 1st Qu.: 2.000 Northeast:25939 Texas : 7077 1st Qu.:21780
## Median : 3.000 South :41502 New York : 5595 Median :34740
## Mean : 3.284 West :33177 Florida : 5149 Mean :35075
## 3rd Qu.: 4.000 Pennsylvania: 3930 3rd Qu.:41860
## Max. :15.000 Illinois : 3912 Max. :79600
## (Other) :94069 NA's :34238
## Age Married Sex
## Min. : 0.00 Divorced :11151 Female:67481
## 1st Qu.:19.00 Married :55509 Male :63821
## Median :39.00 Never Married:30772
## Mean :38.83 Separated : 2027
## 3rd Qu.:57.00 Widowed : 6505
## Max. :85.00 NA's :25338
##
## Education Race
## High school :30906 American Indian : 1433
## Bachelor's degree :19443 Asian : 6520
## Some college, no degree:18863 Black : 13913
## No high school diploma :16095 Multiracial : 2897
## Associate degree : 9913 Pacific Islander: 618
## (Other) :10744 White :105921
## NA's :25338
## Hispanic CountryOfBirthCode Citizenship
## Min. :0.0000 Min. : 57.00 Citizen, Native :116639
## 1st Qu.:0.0000 1st Qu.: 57.00 Citizen, Naturalized: 7073
## Median :0.0000 Median : 57.00 Non-Citizen : 7590
## Mean :0.1393 Mean : 82.68
## 3rd Qu.:0.0000 3rd Qu.: 57.00
## Max. :1.0000 Max. :555.00
##
## EmploymentStatus Industry
## Disabled : 5712 Educational and health services :15017
## Employed :61733 Trade : 8933
## Not in Labor Force:15246 Professional and business services: 7519
## Retired :18619 Manufacturing : 6791
## Unemployed : 4203 Leisure and hospitality : 6364
## NA's :25789 (Other) :21618
## NA's :65060
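Reading the NA's rows off summary() works, but with many columns a programmatic check may be easier. The sketch below counts missing values per column in a hypothetical three-column data frame. The column names are borrowed from the dataset, the values are invented.

```r
# Hypothetical toy data frame; values are illustrative only
toy <- data.frame(
  Age = c(10, 40, 62),
  Married = c(NA, "Married", "Widowed"),
  Industry = c(NA, "Trade", NA)
)

na_counts <- colSums(is.na(toy))            # number of NA values per column
with_na <- names(na_counts[na_counts > 0])  # columns with at least one NA

print(na_counts)
print(with_na)
```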
When evaluating a new dataset, we often try to identify whether there is a pattern in the missing values. Here we will try to determine whether there is a pattern in the missing values of the Married variable. The function
is.na(CPS$Married)
returns a vector of TRUE/FALSE values indicating whether the Married variable is missing for each interviewee. We can see the breakdown of whether Married is missing by the reported value of the Region variable with the command
table(CPS$Region, is.na(CPS$Married))
Which statement is most accurate? Compare the breakdowns by Region, Sex, Age, and Citizenship:
table(CPS$Region, is.na(CPS$Married))
##
## FALSE TRUE
## Midwest 24609 6075
## Northeast 21432 4507
## South 33535 7967
## West 26388 6789
table(CPS$Sex, is.na(CPS$Married))
##
## FALSE TRUE
## Female 55264 12217
## Male 50700 13121
table(CPS$Age, is.na(CPS$Married))
##
## FALSE TRUE
## 0 0 1283
## 1 0 1559
## 2 0 1574
## 3 0 1693
## 4 0 1695
## 5 0 1795
## 6 0 1721
## 7 0 1681
## 8 0 1729
## 9 0 1748
## 10 0 1750
## 11 0 1721
## 12 0 1797
## 13 0 1802
## 14 0 1790
## 15 1795 0
## 16 1751 0
## 17 1764 0
## 18 1596 0
## 19 1517 0
## 20 1398 0
## 21 1525 0
## 22 1536 0
## 23 1638 0
## 24 1627 0
## 25 1604 0
## 26 1643 0
## 27 1657 0
## 28 1736 0
## 29 1645 0
## 30 1854 0
## 31 1762 0
## 32 1790 0
## 33 1804 0
## 34 1653 0
## 35 1716 0
## 36 1663 0
## 37 1531 0
## 38 1530 0
## 39 1542 0
## 40 1571 0
## 41 1673 0
## 42 1711 0
## 43 1819 0
## 44 1764 0
## 45 1749 0
## 46 1665 0
## 47 1647 0
## 48 1791 0
## 49 1989 0
## 50 1966 0
## 51 1931 0
## 52 1935 0
## 53 1994 0
## 54 1912 0
## 55 1895 0
## 56 1935 0
## 57 1827 0
## 58 1874 0
## 59 1758 0
## 60 1746 0
## 61 1735 0
## 62 1595 0
## 63 1596 0
## 64 1519 0
## 65 1569 0
## 66 1577 0
## 67 1227 0
## 68 1130 0
## 69 1062 0
## 70 1195 0
## 71 1031 0
## 72 941 0
## 73 896 0
## 74 842 0
## 75 763 0
## 76 729 0
## 77 698 0
## 78 659 0
## 79 661 0
## 80 2664 0
## 85 2446 0
table(CPS$Citizenship, is.na(CPS$Married))
##
## FALSE TRUE
## Citizen, Native 91956 24683
## Citizen, Naturalized 6910 163
## Non-Citizen 7098 492
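In the Age table above, Married is missing for every age from 0 to 14 and for no age from 15 upward. One way to confirm such a cutoff compactly is to compare the missing rate on either side of it. The sketch below does this on invented values; the cutoff of 15 is read from the output above.

```r
# Hypothetical toy vectors standing in for CPS$Age and CPS$Married
age <- c(3, 9, 14, 15, 30, 70)
married <- c(NA, NA, NA, "Never Married", "Married", "Widowed")

under15 <- age < 15
# Share of missing Married values within each group (FALSE = 15+, TRUE = under 15)
miss_by_group <- tapply(is.na(married), under15, mean)
print(miss_by_group)
```

A result of 1 for the under-15 group and 0 for the other would indicate that age fully determines whether Married is missing.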
As mentioned in the variable descriptions, MetroAreaCode is missing if an interviewee does not live in a metropolitan area. Using the same technique as in the previous question, answer the following questions about people who live in non-metropolitan areas.
How many states had all interviewees living in a non-metropolitan area (that is, every interviewee has a missing MetroAreaCode value)? For this question, treat the District of Columbia as a state (even though it is not technically a state).
"2"
## [1] "2"
How many states had all interviewees living in a metropolitan area? Again, treat the District of Columbia as a state.
table(CPS$State, is.na(CPS$MetroAreaCode))
##
## FALSE TRUE
## Alabama 1020 356
## Alaska 0 1590
## Arizona 1327 201
## Arkansas 724 697
## California 11333 237
## Colorado 2545 380
## Connecticut 2593 243
## Delaware 1696 518
## District of Columbia 1791 0
## Florida 4947 202
## Georgia 2250 557
## Hawaii 1576 523
## Idaho 761 757
## Illinois 3473 439
## Indiana 1420 584
## Iowa 1297 1231
## Kansas 1234 701
## Kentucky 908 933
## Louisiana 1216 234
## Maine 909 1354
## Maryland 2978 222
## Massachusetts 1858 129
## Michigan 2517 546
## Minnesota 2150 989
## Mississippi 376 854
## Missouri 1440 705
## Montana 199 1015
## Nebraska 816 1133
## Nevada 1609 247
## New Hampshire 1148 1514
## New Jersey 2567 0
## New Mexico 832 270
## New York 5144 451
## North Carolina 1642 977
## North Dakota 432 1213
## Ohio 2754 924
## Oklahoma 1024 499
## Oregon 1519 424
## Pennsylvania 3245 685
## Rhode Island 2209 0
## South Carolina 1139 519
## South Dakota 595 1405
## Tennessee 1149 635
## Texas 6060 1017
## Utah 1455 387
## Vermont 657 1233
## Virginia 2367 586
## Washington 1937 429
## West Virginia 344 1065
## Wisconsin 1882 804
## Wyoming 0 1624
"3"
## [1] "3"
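Both counts can be derived rather than read off the table by looking for zeros. The sketch below is an assumption about how one might do this, not the method used above. It applies all() within each state to a made-up vector of codes.

```r
# Hypothetical toy vectors standing in for CPS$State and CPS$MetroAreaCode
state <- c("A", "A", "B", "B", "C", "C")
metro_code <- c(NA, NA, 100, 200, 300, NA)

# TRUE for a state only if every one of its rows satisfies the condition
all_nonmetro <- tapply(is.na(metro_code), state, all)
all_metro <- tapply(!is.na(metro_code), state, all)

n_all_nonmetro <- sum(all_nonmetro)   # states with no metropolitan interviewees
n_all_metro <- sum(all_metro)         # states with only metropolitan interviewees

print(names(all_nonmetro)[all_nonmetro])
print(names(all_metro)[all_metro])
```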
Which region of the United States has the largest proportion of interviewees living in a non-metropolitan area?
table(CPS$Region, is.na(CPS$MetroAreaCode))
##
## FALSE TRUE
## Midwest 20010 10674
## Northeast 20330 5609
## South 31631 9871
## West 25093 8084
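The counts above still have to be turned into proportions for each region. One option is prop.table() with margin = 1, which divides each row by its row total. The sketch below uses invented data, so its result says nothing about the real regions.

```r
# Hypothetical toy vectors standing in for CPS$Region and CPS$MetroAreaCode
region <- c("Midwest", "Midwest", "South", "South", "South", "West")
metro_code <- c(NA, 100, 200, NA, NA, 300)

tab <- table(region, is.na(metro_code))
row_props <- prop.table(tab, margin = 1)  # each row now sums to 1
nonmetro_share <- row_props[, "TRUE"]     # share with a missing code, per region
largest <- names(which.max(nonmetro_share))

print(nonmetro_share)
print(largest)
```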
While we were able to use the table() command to compute the proportion of interviewees from each region not living in a metropolitan area, it was somewhat tedious (it involved manually computing the proportion for each region) and isn’t something you would want to do if there were a larger number of options. It turns out there is a less tedious way to compute the proportion of values that are TRUE. The mean() function, which takes the average of the values passed to it, will treat TRUE as 1 and FALSE as 0, meaning it returns the proportion of values that are true. For instance, mean(c(TRUE, FALSE, TRUE, TRUE)) returns 0.75. Knowing this, use tapply() with the mean function to answer the following questions:
Which state has a proportion of interviewees living in a non-metropolitan area closest to 30%?
tapply(is.na(CPS$MetroAreaCode), CPS$State, mean)
## Alabama Alaska Arizona
## 0.25872093 1.00000000 0.13154450
## Arkansas California Colorado
## 0.49049965 0.02048401 0.12991453
## Connecticut Delaware District of Columbia
## 0.08568406 0.23396567 0.00000000
## Florida Georgia Hawaii
## 0.03923092 0.19843249 0.24916627
## Idaho Illinois Indiana
## 0.49868248 0.11221881 0.29141717
## Iowa Kansas Kentucky
## 0.48694620 0.36227390 0.50678979
## Louisiana Maine Maryland
## 0.16137931 0.59832081 0.06937500
## Massachusetts Michigan Minnesota
## 0.06492199 0.17825661 0.31506849
## Mississippi Missouri Montana
## 0.69430894 0.32867133 0.83607908
## Nebraska Nevada New Hampshire
## 0.58132376 0.13308190 0.56874530
## New Jersey New Mexico New York
## 0.00000000 0.24500907 0.08060769
## North Carolina North Dakota Ohio
## 0.37304315 0.73738602 0.25122349
## Oklahoma Oregon Pennsylvania
## 0.32764281 0.21821925 0.17430025
## Rhode Island South Carolina South Dakota
## 0.00000000 0.31302774 0.70250000
## Tennessee Texas Utah
## 0.35594170 0.14370496 0.21009772
## Vermont Virginia Washington
## 0.65238095 0.19844226 0.18131868
## West Virginia Wisconsin Wyoming
## 0.75585522 0.29932986 1.00000000
"Wisconsin"
## [1] "Wisconsin"
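The closest state can also be found by minimizing the absolute distance to 0.30. The sketch below uses three of the proportions printed above (Indiana, Wisconsin, South Carolina). On the full result one would pass the whole tapply() output instead.

```r
# Three proportions copied from the tapply() output above
props <- c(Indiana = 0.29141717,
           Wisconsin = 0.29932986,
           "South Carolina" = 0.31302774)

# Name of the entry whose value is nearest to 0.30
closest <- names(which.min(abs(props - 0.30)))
print(closest)
```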
Which state has the largest proportion of non-metropolitan interviewees, ignoring states where all interviewees were non-metropolitan?
sort(tapply(is.na(CPS$MetroAreaCode), CPS$State, mean))
## District of Columbia New Jersey Rhode Island
## 0.00000000 0.00000000 0.00000000
## California Florida Massachusetts
## 0.02048401 0.03923092 0.06492199
## Maryland New York Connecticut
## 0.06937500 0.08060769 0.08568406
## Illinois Colorado Arizona
## 0.11221881 0.12991453 0.13154450
## Nevada Texas Louisiana
## 0.13308190 0.14370496 0.16137931
## Pennsylvania Michigan Washington
## 0.17430025 0.17825661 0.18131868
## Georgia Virginia Utah
## 0.19843249 0.19844226 0.21009772
## Oregon Delaware New Mexico
## 0.21821925 0.23396567 0.24500907
## Hawaii Ohio Alabama
## 0.24916627 0.25122349 0.25872093
## Indiana Wisconsin South Carolina
## 0.29141717 0.29932986 0.31302774
## Minnesota Oklahoma Missouri
## 0.31506849 0.32764281 0.32867133
## Tennessee Kansas North Carolina
## 0.35594170 0.36227390 0.37304315
## Iowa Arkansas Idaho
## 0.48694620 0.49049965 0.49868248
## Kentucky New Hampshire Nebraska
## 0.50678979 0.56874530 0.58132376
## Maine Vermont Mississippi
## 0.59832081 0.65238095 0.69430894
## South Dakota North Dakota West Virginia
## 0.70250000 0.73738602 0.75585522
## Montana Alaska Wyoming
## 0.83607908 1.00000000 1.00000000
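To skip the states where every interviewee is non-metropolitan, the proportions can be filtered before taking the maximum. A minimal sketch, using four values copied from the sorted output above:

```r
# Four proportions copied from the sorted tapply() output above
props <- c(Montana = 0.83607908, Alaska = 1, Wyoming = 1,
           "West Virginia" = 0.75585522)

partial <- props[props < 1]        # drop states that are entirely non-metropolitan
top <- names(which.max(partial))   # largest remaining proportion
print(top)
```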
Codes like MetroAreaCode and CountryOfBirthCode are a compact way to encode factor variables with text as their possible values, and they are therefore quite common in survey datasets. In fact, all but one of the variables in this dataset were actually stored by a numeric code in the original CPS datafile.
When analyzing a variable stored by a numeric code, we will often want to convert it into the values the codes represent. To do this, we will use a dictionary, which maps the code to the actual value of the variable. We have provided dictionaries MetroAreaCodes.csv and CountryCodes.csv, which respectively map MetroAreaCode and CountryOfBirthCode into their true values. Read these two dictionaries into data frames MetroAreaMap and CountryMap.
How many observations (codes for metropolitan areas) are there in MetroAreaMap?
MetroAreaMap <- read.csv("MetroAreaCodes.csv")
CountryMap <- read.csv("CountryCodes.csv")
str(MetroAreaMap)
## 'data.frame': 271 obs. of 2 variables:
## $ Code : int 460 3000 3160 3610 3720 6450 10420 10500 10580 10740 ...
## $ MetroArea: Factor w/ 271 levels "Akron, OH","Albany-Schenectady-Troy, NY",..: 12 92 97 117 122 195 1 3 2 4 ...
How many observations (codes for countries) are there in CountryMap?
str(CountryMap)
## 'data.frame': 149 obs. of 2 variables:
## $ Code : int 57 66 73 78 96 100 102 103 104 105 ...
## $ Country: Factor w/ 149 levels "Afghanistan",..: 139 57 105 135 97 3 11 18 24 37 ...
To merge in the metropolitan areas, we want to connect the field MetroAreaCode from the CPS data frame with the field Code in MetroAreaMap. The following command merges the two data frames on these columns, overwriting the CPS data frame with the result:
CPS = merge(CPS, MetroAreaMap, by.x="MetroAreaCode", by.y="Code", all.x=TRUE)
The first two arguments determine the data frames to be merged (they are called "x" and "y", respectively, in the subsequent parameters to the merge function). by.x="MetroAreaCode" means we are matching on the MetroAreaCode variable from the "x" data frame (CPS), while by.y="Code" means we are matching on the Code variable from the "y" data frame (MetroAreaMap). Finally, all.x=TRUE means we want to keep all rows from the "x" data frame (CPS), even if a row's MetroAreaCode does not match any code in MetroAreaMap (for those familiar with database terminology, this parameter makes the operation a left outer join instead of an inner join).
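The difference between the left outer join and the inner join can be seen on a small example. The people data frame and its Name column below are hypothetical. The two code-to-name pairs are taken from the str(MetroAreaMap) output above.

```r
# Hypothetical three-row stand-in for CPS; the middle row has no metro code
people <- data.frame(Name = c("a", "b", "c"),
                     MetroAreaCode = c(10420, NA, 10580))

# Two dictionary rows, as they appear in MetroAreaMap
lookup <- data.frame(Code = c(10420, 10580),
                     MetroArea = c("Akron, OH", "Albany-Schenectady-Troy, NY"))

# Left outer join: unmatched rows of people are kept, with MetroArea set to NA
left <- merge(people, lookup, by.x = "MetroAreaCode", by.y = "Code", all.x = TRUE)

# Inner join (the default): unmatched rows of people are dropped
inner <- merge(people, lookup, by.x = "MetroAreaCode", by.y = "Code")

print(left)    # note that merge() sorts the result by the key column by default
print(inner)
```

Here the left join keeps all three rows and adds a MetroArea column, while the inner join returns only the two rows whose code appears in the dictionary.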
Review the new version of the CPS data frame with the summary() and str() functions. What is the name of the variable that was added to the data frame by the merge() operation?
str(CPS)
## 'data.frame': 131302 obs. of 14 variables:
## $ PeopleInHousehold : int 1 3 3 3 3 3 3 2 2 2 ...
## $ Region : Factor w/ 4 levels "Midwest","Northeast",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ State : Factor w/ 51 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ MetroAreaCode : int 26620 13820 13820 13820 26620 26620 26620 33660 33660 26620 ...
## $ Age : int 85 21 37 18 52 24 26 71 43 52 ...
## $ Married : Factor w/ 5 levels "Divorced","Married",..: 5 3 3 3 5 3 3 1 1 3 ...
## $ Sex : Factor w/ 2 levels "Female","Male": 1 2 1 2 1 2 2 1 2 2 ...
## $ Education : Factor w/ 8 levels "Associate degree",..: 1 4 4 6 1 2 4 4 4 2 ...
## $ Race : Factor w/ 6 levels "American Indian",..: 6 3 3 3 6 6 6 6 6 6 ...
## $ Hispanic : int 0 0 0 0 0 0 0 0 0 0 ...
## $ CountryOfBirthCode: int 57 57 57 57 57 57 57 57 57 57 ...
## $ Citizenship : Factor w/ 3 levels "Citizen, Native",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ EmploymentStatus : Factor w/ 5 levels "Disabled","Employed",..: 4 5 1 3 2 2 2 2 3 2 ...
## $ Industry : Factor w/ 14 levels "Agriculture, forestry, fishing, and hunting",..: NA 11 NA NA 11 4 14 4 NA 12 ...
How many interviewees have a missing value for the new metropolitan area variable? Note that all of these interviewees would have been removed from the merged data frame if we did not include the all.x=TRUE parameter.
summary(CPS)
## PeopleInHousehold Region State MetroAreaCode
## Min. : 1.000 Midwest :30684 California :11570 Min. :10420
## 1st Qu.: 2.000 Northeast:25939 Texas : 7077 1st Qu.:21780
## Median : 3.000 South :41502 New York : 5595 Median :34740
## Mean : 3.284 West :33177 Florida : 5149 Mean :35075
## 3rd Qu.: 4.000 Pennsylvania: 3930 3rd Qu.:41860
## Max. :15.000 Illinois : 3912 Max. :79600
## (Other) :94069 NA's :34238
## Age Married Sex
## Min. : 0.00 Divorced :11151 Female:67481
## 1st Qu.:19.00 Married :55509 Male :63821
## Median :39.00 Never Married:30772
## Mean :38.83 Separated : 2027
## 3rd Qu.:57.00 Widowed : 6505
## Max. :85.00 NA's :25338
##
## Education Race
## High school :30906 American Indian : 1433
## Bachelor's degree :19443 Asian : 6520
## Some college, no degree:18863 Black : 13913
## No high school diploma :16095 Multiracial : 2897
## Associate degree : 9913 Pacific Islander: 618
## (Other) :10744 White :105921
## NA's :25338
## Hispanic CountryOfBirthCode Citizenship
## Min. :0.0000 Min. : 57.00 Citizen, Native :116639
## 1st Qu.:0.0000 1st Qu.: 57.00 Citizen, Naturalized: 7073
## Median :0.0000 Median : 57.00 Non-Citizen : 7590
## Mean :0.1393 Mean : 82.68
## 3rd Qu.:0.0000 3rd Qu.: 57.00
## Max. :1.0000 Max. :555.00
##
## EmploymentStatus Industry
## Disabled : 5712 Educational and health services :15017
## Employed :61733 Trade : 8933
## Not in Labor Force:15246 Professional and business services: 7519
## Retired :18619 Manufacturing : 6791
## Unemployed : 4203 Leisure and hospitality : 6364
## NA's :25789 (Other) :21618
## NA's :65060
Which of the following metropolitan areas has the largest number of interviewees?
table(CPS$MetroArea)
##
## 10420 10500 10580 10740 10900 11020 11100 11300 11340 11460 11500 11540
## 231 68 268 609 334 82 88 62 64 85 61 125
## 11700 12020 12060 12100 12260 12420 12540 12580 12940 13140 13380 13460
## 116 65 1552 111 161 516 245 1483 262 123 70 140
## 13740 13780 13820 14020 14060 14260 14500 14540 14740 15180 15380 15940
## 199 73 392 104 40 644 171 29 87 79 344 118
## 15980 16300 16580 16620 16700 16740 16860 16980 17020 17140 17460 17660
## 146 196 122 262 232 517 167 2772 60 719 681 117
## 17820 17860 17900 17980 18140 18580 19100 19340 19380 19460 19500 19660
## 372 47 291 59 551 132 1863 240 268 96 81 140
## 19740 19780 19820 20100 20260 20500 20740 20940 21340 21500 21660 21780
## 1504 501 1354 456 126 189 110 99 244 87 196 99
## 22020 22140 22180 22220 22420 22460 22660 22900 23020 23060 23420 23540
## 432 64 77 215 102 63 206 105 80 136 303 70
## 24340 24540 24580 24660 24860 25060 25180 25420 25500 25860 26100 26180
## 304 162 136 251 185 65 86 174 90 57 78 1576
## 26420 26580 26620 26900 26980 27100 27140 27260 27340 27500 27740 27780
## 1649 82 117 570 131 70 222 393 63 99 52 63
## 27900 28020 28100 28140 28660 28700 28740 28940 29100 29180 29340 29460
## 59 127 87 962 101 67 87 168 114 181 81 149
## 29540 29620 29700 29740 29820 29940 30020 30460 30780 30980 31100 31140
## 156 119 89 107 1299 98 97 198 404 65 4102 519
## 31180 31340 31420 31460 31540 32580 32780 32820 32900 33100 33140 33260
## 63 73 65 57 284 195 82 348 106 1554 77 51
## 33340 33460 33660 33700 33740 33780 33860 34740 34820 34900 34940 34980
## 714 1942 110 158 179 63 103 90 102 61 82 505
## 35380 35620 35660 36100 36140 36260 36420 36500 36540 36740 36780 37100
## 367 5409 51 76 30 423 604 99 957 610 85 267
## 37340 37460 37860 37900 37980 38060 38300 38900 38940 39100 39140 39340
## 168 59 107 112 2855 971 732 1089 109 201 54 309
## 39380 39460 39540 39580 39740 39900 40060 40140 40220 40380 40420 40900
## 130 48 119 336 142 310 490 1290 66 307 114 667
## 40980 41060 41180 41420 41500 41540 41620 41700 41740 41860 41940 42020
## 74 82 956 170 104 74 723 607 907 1386 670 77
## 42060 42100 42140 42220 42260 42340 42540 42660 43340 43620 43780 43900
## 132 66 52 129 192 202 176 1255 146 595 81 99
## 44060 44100 44180 44220 44700 45060 45220 45300 45780 45820 45940 46060
## 156 76 161 34 193 223 43 842 235 182 91 302
## 46140 46220 46540 46660 46700 46940 47020 47220 47260 47300 47380 47580
## 323 78 80 42 133 79 116 54 597 121 79 42
## 47900 47940 48140 48620 49180 49420 49620 49660 70750 70900 71650 71950
## 4177 156 96 427 127 112 117 153 208 75 2229 730
## 72400 72850 73450 74500 75700 76450 76750 77200 77350 78100 78700 79600
## 657 112 885 66 506 203 701 2284 262 155 157 144
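The labels in the table above are numeric codes rather than area names. One possible explanation, offered here as an assumption rather than something the source states, is R's partial matching with `$`: if a data frame has a MetroAreaCode column but no MetroArea column, CPS$MetroArea silently returns MetroAreaCode. The sketch below reproduces that behavior on a hypothetical two-row data frame. Exact matching with [[ ]] returns NULL instead, which makes a missing column easier to notice.

```r
# Hypothetical data frame that has MetroAreaCode but no MetroArea column
df <- data.frame(MetroAreaCode = c(10420, 10580))

partial <- df$MetroArea      # `$` partially matches to MetroAreaCode
exact <- df[["MetroArea"]]   # `[[` matches exactly, so this is NULL

print(partial)
print(is.null(exact))
```

Checking names(CPS) after the merge is a quick way to confirm that the MetroArea column is present before tabulating it.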
Which metropolitan area has the highest proportion of interviewees of Hispanic ethnicity? Hint: Use tapply() with mean, as in the previous subproblem. Calling sort() on the output of tapply() could also be helpful here.
tapply(CPS$Hispanic, CPS$MetroArea, mean)
## 10420 10500 10580 10740 10900 11020
## 0.012987013 0.044117647 0.041044776 0.441707718 0.086826347 0.012195122
## 11100 11300 11340 11460 11500 11540
## 0.261363636 0.064516129 0.000000000 0.000000000 0.049180328 0.008000000
## 11700 12020 12060 12100 12260 12420
## 0.060344828 0.123076923 0.085695876 0.090090090 0.093167702 0.310077519
## 12540 12580 12940 13140 13380 13460
## 0.489795918 0.082265678 0.038167939 0.227642276 0.014285714 0.021428571
## 13740 13780 13820 14020 14060 14260
## 0.030150754 0.041095890 0.053571429 0.000000000 0.000000000 0.093167702
## 14500 14540 14740 15180 15380 15940
## 0.146198830 0.000000000 0.057471264 0.797468354 0.017441860 0.076271186
## 15980 16300 16580 16620 16700 16740
## 0.438356164 0.015306122 0.032786885 0.007633588 0.017241379 0.117988395
## 16860 16980 17020 17140 17460 17660
## 0.077844311 0.167388167 0.116666667 0.040333797 0.060205580 0.025641026
## 17820 17860 17900 17980 18140 18580
## 0.120967742 0.042553191 0.079037801 0.203389831 0.043557169 0.606060606
## 19100 19340 19380 19460 19500 19660
## 0.283950617 0.091666667 0.003731343 0.052083333 0.000000000 0.100000000
## 19740 19780 19820 20100 20260 20500
## 0.232047872 0.073852295 0.037666174 0.057017544 0.015873016 0.111111111
## 20740 20940 21340 21500 21660 21780
## 0.000000000 0.686868687 0.790983607 0.022988506 0.076530612 0.020202020
## 22020 22140 22180 22220 22420 22460
## 0.025462963 0.234375000 0.155844156 0.148837209 0.039215686 0.000000000
## 22660 22900 23020 23060 23420 23540
## 0.121359223 0.085714286 0.112500000 0.036764706 0.409240924 0.042857143
## 24340 24540 24580 24660 24860 25060
## 0.138157895 0.160493827 0.125000000 0.075697211 0.037837838 0.015384615
## 25180 25420 25500 25860 26100 26180
## 0.000000000 0.022988506 0.000000000 0.087719298 0.012820513 0.059644670
## 26420 26580 26620 26900 26980 27100
## 0.359005458 0.000000000 0.000000000 0.071929825 0.030534351 0.000000000
## 27140 27260 27340 27500 27740 27780
## 0.009009009 0.091603053 0.126984127 0.030303030 0.038461538 0.000000000
## 27900 28020 28100 28140 28660 28700
## 0.016949153 0.031496063 0.114942529 0.121621622 0.386138614 0.014925373
## 28740 28940 29100 29180 29340 29460
## 0.068965517 0.005952381 0.017543860 0.060773481 0.024691358 0.134228188
## 29540 29620 29700 29740 29820 29940
## 0.102564103 0.084033613 0.966292135 0.542056075 0.251732102 0.040816327
## 30020 30460 30780 30980 31100 31140
## 0.123711340 0.040404040 0.037128713 0.292307692 0.460263286 0.038535645
## 31180 31340 31420 31460 31540 32580
## 0.285714286 0.027397260 0.000000000 0.614035088 0.024647887 0.948717949
## 32780 32820 32900 33100 33140 33260
## 0.085365854 0.028735632 0.566037736 0.467824968 0.038961039 0.352941176
## 33340 33460 33660 33700 33740 33780
## 0.085434174 0.052008239 0.000000000 0.341772152 0.005586592 0.063492063
## 33860 34740 34820 34900 34940 34980
## 0.009708738 0.022222222 0.058823529 0.229508197 0.182926829 0.069306931
## 35380 35620 35660 36100 36140 36260
## 0.111716621 0.228508042 0.019607843 0.157894737 0.066666667 0.144208038
## 36420 36500 36540 36740 36780 37100
## 0.107615894 0.121212121 0.070010449 0.213114754 0.011764706 0.359550562
## 37340 37460 37860 37900 37980 38060
## 0.053571429 0.067796610 0.028037383 0.062500000 0.078458844 0.254376931
## 38300 38900 38940 39100 39140 39340
## 0.016393443 0.094582185 0.100917431 0.273631841 0.129629630 0.064724919
## 39380 39460 39540 39580 39740 39900
## 0.307692308 0.041666667 0.058823529 0.119047619 0.211267606 0.196774194
## 40060 40140 40220 40380 40420 40900
## 0.042857143 0.502325581 0.030303030 0.058631922 0.043859649 0.263868066
## 40980 41060 41180 41420 41500 41540
## 0.027027027 0.012195122 0.030334728 0.211764706 0.557692308 0.000000000
## 41620 41700 41740 41860 41940 42020
## 0.154910097 0.644151565 0.269018743 0.199855700 0.316417910 0.246753247
## 42060 42100 42140 42220 42260 42340
## 0.401515152 0.151515152 0.461538462 0.232558140 0.046875000 0.000000000
## 42540 42660 43340 43620 43780 43900
## 0.136363636 0.088446215 0.082191781 0.042016807 0.049382716 0.020202020
## 44060 44100 44180 44220 44700 45060
## 0.025641026 0.013157895 0.043478261 0.029411765 0.321243523 0.080717489
## 45220 45300 45780 45820 45940 46060
## 0.069767442 0.159144893 0.034042553 0.093406593 0.131868132 0.506622517
## 46140 46220 46540 46660 46700 46940
## 0.114551084 0.102564103 0.075000000 0.047619048 0.210526316 0.075949367
## 47020 47220 47260 47300 47380 47580
## 0.465517241 0.407407407 0.050251256 0.438016529 0.329113924 0.000000000
## 47900 47940 48140 48620 49180 49420
## 0.121378980 0.108974359 0.010416667 0.133489461 0.055118110 0.357142857
## 49620 49660 70750 70900 71650 71950
## 0.042735043 0.032679739 0.014423077 0.000000000 0.069537909 0.112328767
## 72400 72850 73450 74500 75700 76450
## 0.009132420 0.339285714 0.105084746 0.090909091 0.073122530 0.103448276
## 76750 77200 77350 78100 78700 79600
## 0.011412268 0.114273205 0.030534351 0.219354839 0.248407643 0.083333333
sort(tapply(CPS$Hispanic, CPS$MetroArea, mean))
## 11340 11460 14020 14060 14540 19500
## 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
## 20740 22460 25180 25500 26580 26620
## 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
## 27100 27780 31420 33660 41540 42340
## 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
## 47580 70900 19380 33740 28940 16620
## 0.000000000 0.000000000 0.003731343 0.005586592 0.005952381 0.007633588
## 11540 27140 72400 33860 48140 76750
## 0.008000000 0.009009009 0.009132420 0.009708738 0.010416667 0.011412268
## 36780 11020 41060 26100 10420 44100
## 0.011764706 0.012195122 0.012195122 0.012820513 0.012987013 0.013157895
## 13380 70750 28700 16300 25060 20260
## 0.014285714 0.014423077 0.014925373 0.015306122 0.015384615 0.015873016
## 38300 27900 16700 15380 29100 35660
## 0.016393443 0.016949153 0.017241379 0.017441860 0.017543860 0.019607843
## 21780 43900 13460 34740 21500 25420
## 0.020202020 0.020202020 0.021428571 0.022222222 0.022988506 0.022988506
## 31540 29340 22020 17660 44060 40980
## 0.024647887 0.024691358 0.025462963 0.025641026 0.025641026 0.027027027
## 31340 37860 32820 44220 13740 27500
## 0.027397260 0.028037383 0.028735632 0.029411765 0.030150754 0.030303030
## 40220 41180 26980 77350 28020 49660
## 0.030303030 0.030334728 0.030534351 0.030534351 0.031496063 0.032679739
## 16580 45780 23060 30780 19820 24860
## 0.032786885 0.034042553 0.036764706 0.037128713 0.037666174 0.037837838
## 12940 27740 31140 33140 22420 17140
## 0.038167939 0.038461538 0.038535645 0.038961039 0.039215686 0.040333797
## 30460 29940 10580 13780 39460 43620
## 0.040404040 0.040816327 0.041044776 0.041095890 0.041666667 0.042016807
## 17860 49620 23540 40060 44180 18140
## 0.042553191 0.042735043 0.042857143 0.042857143 0.043478261 0.043557169
## 40420 10500 42260 46660 11500 43780
## 0.043859649 0.044117647 0.046875000 0.047619048 0.049180328 0.049382716
## 47260 33460 19460 13820 37340 49180
## 0.050251256 0.052008239 0.052083333 0.053571429 0.053571429 0.055118110
## 20100 14740 40380 34820 39540 26180
## 0.057017544 0.057471264 0.058631922 0.058823529 0.058823529 0.059644670
## 17460 11700 29180 37900 33780 11300
## 0.060205580 0.060344828 0.060773481 0.062500000 0.063492063 0.064516129
## 39340 36140 37460 28740 34980 71650
## 0.064724919 0.066666667 0.067796610 0.068965517 0.069306931 0.069537909
## 45220 36540 26900 75700 19780 46540
## 0.069767442 0.070010449 0.071929825 0.073122530 0.073852295 0.075000000
## 24660 46940 15940 21660 16860 37980
## 0.075697211 0.075949367 0.076271186 0.076530612 0.077844311 0.078458844
## 17900 45060 43340 12580 79600 29620
## 0.079037801 0.080717489 0.082191781 0.082265678 0.083333333 0.084033613
## 32780 33340 12060 22900 10900 25860
## 0.085365854 0.085434174 0.085695876 0.085714286 0.086826347 0.087719298
## 42660 12100 74500 27260 19340 12260
## 0.088446215 0.090090090 0.090909091 0.091603053 0.091666667 0.093167702
## 14260 45820 38900 19660 38940 29540
## 0.093167702 0.093406593 0.094582185 0.100000000 0.100917431 0.102564103
## 46220 76450 73450 36420 47940 20500
## 0.102564103 0.103448276 0.105084746 0.107615894 0.108974359 0.111111111
## 35380 71950 23020 77200 46140 28100
## 0.111716621 0.112328767 0.112500000 0.114273205 0.114551084 0.114942529
## 17020 16740 39580 17820 36500 22660
## 0.116666667 0.117988395 0.119047619 0.120967742 0.121212121 0.121359223
## 47900 28140 12020 30020 24580 27340
## 0.121378980 0.121621622 0.123076923 0.123711340 0.125000000 0.126984127
## 39140 45940 48620 29460 42540 24340
## 0.129629630 0.131868132 0.133489461 0.134228188 0.136363636 0.138157895
## 36260 14500 22220 42100 41620 22180
## 0.144208038 0.146198830 0.148837209 0.151515152 0.154910097 0.155844156
## 36100 45300 24540 16980 34940 39900
## 0.157894737 0.159144893 0.160493827 0.167388167 0.182926829 0.196774194
## 41860 17980 46700 39740 41420 36740
## 0.199855700 0.203389831 0.210526316 0.211267606 0.211764706 0.213114754
## 78100 13140 35620 34900 19740 42220
## 0.219354839 0.227642276 0.228508042 0.229508197 0.232047872 0.232558140
## 22140 42020 78700 29820 38060 11100
## 0.234375000 0.246753247 0.248407643 0.251732102 0.254376931 0.261363636
## 40900 41740 39100 19100 31180 30980
## 0.263868066 0.269018743 0.273631841 0.283950617 0.285714286 0.292307692
## 39380 12420 41940 44700 47380 72850
## 0.307692308 0.310077519 0.316417910 0.321243523 0.329113924 0.339285714
## 33700 33260 49420 26420 37100 28660
## 0.341772152 0.352941176 0.357142857 0.359005458 0.359550562 0.386138614
## 42060 47220 23420 47300 15980 10740
## 0.401515152 0.407407407 0.409240924 0.438016529 0.438356164 0.441707718
## 31100 42140 47020 33100 12540 40140
## 0.460263286 0.461538462 0.465517241 0.467824968 0.489795918 0.502325581
## 46060 29740 41500 32900 18580 31460
## 0.506622517 0.542056075 0.557692308 0.566037736 0.606060606 0.614035088
## 41700 20940 21340 15180 32580 29700
## 0.644151565 0.686868687 0.790983607 0.797468354 0.948717949 0.966292135
Remembering that CPS$Race == "Asian" returns a TRUE/FALSE vector of whether an interviewee is Asian, determine the number of metropolitan areas in the United States from which at least 20% of interviewees are Asian.
sort(tapply(CPS$Race == "Asian", CPS$MetroArea, mean))
## 10500 11020 11100 11300 11540 11700
## 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
## 13140 13740 13780 14020 14540 15940
## 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
## 16620 17020 17980 19500 20500 20740
## 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
## 21340 21500 22140 22460 25180 26620
## 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
## 27100 27140 27500 27740 27900 28100
## 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
## 28660 28700 28940 29180 29620 29700
## 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
## 30980 31180 31340 31420 31460 32580
## 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
## 33140 33260 33780 34740 34820 35660
## 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
## 36140 36780 38940 39100 39380 39460
## 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
## 39540 39740 40220 40420 40980 41060
## 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
## 41420 41540 42100 42140 42540 43340
## 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
## 43780 43900 44220 45220 46220 46540
## 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
## 46660 46940 47020 47220 47380 48140
## 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
## 70900 74500 78100 78700 41180 35380
## 0.000000000 0.000000000 0.000000000 0.000000000 0.002092050 0.002724796
## 41700 16700 33740 16860 33700 13460
## 0.003294893 0.004310345 0.005586592 0.005988024 0.006329114 0.007142857
## 19380 42060 42220 45780 17660 49620
## 0.007462687 0.007575758 0.007751938 0.008510638 0.008547009 0.008547009
## 49420 24340 43620 21780 29940 17460
## 0.008928571 0.009868421 0.010084034 0.010101010 0.010204082 0.010279001
## 30020 14260 25420 28740 31140 32780
## 0.010309278 0.010869565 0.011494253 0.011494253 0.011560694 0.012195122
## 24540 44180 13820 47940 39340 49660
## 0.012345679 0.012422360 0.012755102 0.012820513 0.012944984 0.013071895
## 36100 10900 18580 20100 16740 42260
## 0.013157895 0.014970060 0.015151515 0.015350877 0.015473888 0.015625000
## 28020 49180 27780 17820 16580 34900
## 0.015748031 0.015748031 0.015873016 0.016129032 0.016393443 0.016393443
## 37460 32820 18140 39140 29740 37860
## 0.016949153 0.017241379 0.018148820 0.018518519 0.018691589 0.018691589
## 44060 22660 22420 42340 46060 20940
## 0.019230769 0.019417476 0.019607843 0.019801980 0.019867550 0.020202020
## 21660 19340 19660 45820 17140 30780
## 0.020408163 0.020833333 0.021428571 0.021978022 0.022253129 0.022277228
## 10580 12940 14740 70750 34940 26900
## 0.022388060 0.022900763 0.022988506 0.024038462 0.024390244 0.024561404
## 12260 26100 22180 36260 77350 47260
## 0.024844720 0.025641026 0.025974026 0.026004728 0.026717557 0.026800670
## 29460 17900 22020 13380 33860 36540
## 0.026845638 0.027491409 0.027777778 0.028571429 0.029126214 0.029258098
## 10420 48620 12020 25060 11340 19740
## 0.030303030 0.030444965 0.030769231 0.030769231 0.031250000 0.031914894
## 24860 37980 25500 15980 28140 79600
## 0.032432432 0.032924694 0.033333333 0.034246575 0.034303534 0.034722222
## 36420 25860 30460 33100 37340 41620
## 0.034768212 0.035087719 0.035353535 0.035392535 0.035714286 0.035961272
## 33660 26580 40060 23060 23020 19780
## 0.036363636 0.036585366 0.036734694 0.036764706 0.037500000 0.037924152
## 38060 38300 71950 77200 45300 20260
## 0.038105046 0.038251366 0.038356164 0.038966725 0.039192399 0.039682540
## 45060 10740 19460 76750 23540 19820
## 0.040358744 0.041050903 0.041666667 0.042796006 0.042857143 0.043574594
## 45940 75700 27340 33340 27260 72400
## 0.043956044 0.047430830 0.047619048 0.047619048 0.048346056 0.048706240
## 11500 46140 39580 36740 22220 42020
## 0.049180328 0.049535604 0.050595238 0.050819672 0.051162791 0.051948052
## 71650 12420 15380 44100 26980 37900
## 0.052041274 0.052325581 0.052325581 0.052631579 0.053435115 0.053571429
## 31540 32900 22900 34980 29540 12580
## 0.056338028 0.056603774 0.057142857 0.057425743 0.057692308 0.057990560
## 39900 16980 14500 26420 40140 72850
## 0.058064516 0.058441558 0.058479532 0.061249242 0.062015504 0.062500000
## 19100 17860 40380 16300 73450 38900
## 0.062801932 0.063829787 0.065146580 0.066326531 0.066666667 0.069788797
## 47900 12060 76450 29340 37100 14060
## 0.070624850 0.072809278 0.073891626 0.074074074 0.074906367 0.075000000
## 15180 33460 29820 24660 12540 11460
## 0.075949367 0.076725026 0.078521940 0.079681275 0.081632653 0.082352941
## 29100 24580 47300 42660 35620 41500
## 0.087719298 0.088235294 0.090909091 0.099601594 0.104270660 0.125000000
## 36500 31100 41740 40900 12100 44700
## 0.131313131 0.135056070 0.142227122 0.142428786 0.144144144 0.155440415
## 47580 23420 46700 41940 41860 26180
## 0.166666667 0.184818482 0.203007519 0.241791045 0.246753247 0.501903553
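Rather than counting entries at the end of the sorted output by eye, the number of areas above the threshold can be computed directly. The following is a minimal sketch on a made-up data frame; `toy` and its values are illustrative assumptions, not CPS data.

```r
# Hypothetical toy data standing in for CPS; all values are made up.
toy <- data.frame(
  MetroArea = c("A", "A", "A", "A", "B", "B", "B", "B", "B", "C", "C"),
  Race      = c("Asian", "White", "White", "White",
                "Asian", "Asian", "Black", "White", "White",
                "White", "Black"),
  stringsAsFactors = FALSE
)

# Proportion of Asian interviewees in each metropolitan area.
asian_share <- tapply(toy$Race == "Asian", toy$MetroArea, mean)

# A comparison yields TRUE/FALSE per area; sum() counts the TRUE values.
n_areas <- sum(asian_share >= 0.2)

print(asian_share)
print(n_areas)
```

Applied to the real data, the same pattern would be `sum(tapply(CPS$Race == "Asian", CPS$MetroArea, mean) >= 0.2)`.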
Normally, we would look at the sorted proportion of interviewees from each metropolitan area who have not received a high school diploma with the command:
sort(tapply(CPS$Education == "No high school diploma", CPS$MetroArea, mean))
However, none of the interviewees aged 14 and younger have an education value reported, so the mean value is reported as NA for each metropolitan area. To get mean (and related functions, like sum) to ignore missing values, you can pass the parameter na.rm=TRUE. Passing na.rm=TRUE to the tapply function, determine which metropolitan area has the smallest proportion of interviewees who have received no high school diploma.
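The effect of na.rm=TRUE can be seen on a small example. This is a sketch under stated assumptions: the `toy` data frame below is invented for illustration, and the extra argument is simply forwarded by tapply to mean.

```r
# Hypothetical toy data; the NA entries mimic interviewees with no
# education value reported.
toy <- data.frame(
  MetroArea = c("A", "A", "A", "B", "B", "B"),
  Education = c("No high school diploma", "High school", NA,
                "Bachelor's degree", "High school", NA),
  stringsAsFactors = FALSE
)

no_diploma <- toy$Education == "No high school diploma"

# Without na.rm, a single NA in a group makes that group's mean NA.
without_na_rm <- tapply(no_diploma, toy$MetroArea, mean)

# Arguments after the function are passed on to it, so mean() drops NAs.
with_na_rm <- tapply(no_diploma, toy$MetroArea, mean, na.rm = TRUE)

# Name of the area with the smallest proportion.
smallest <- names(which.min(with_na_rm))

print(without_na_rm)
print(with_na_rm)
print(smallest)
```

Using `which.min` avoids scanning the sorted output for the first entry.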
Just as we did with the metropolitan area information, merge in the country of birth information from the CountryMap data frame, replacing the CPS data frame with the result. If you accidentally overwrite CPS with the wrong values, remember that you can restore it by re-loading the data frame from CPSData.csv and then merging in the metropolitan area information using the command provided in the previous subproblem.
What is the name of the variable added to the CPS data frame by this merge operation?
"Country"
## [1] "Country"
How many interviewees have a missing value for the new country of birth variable?
CPS = merge(CPS, CountryMap, by.x="CountryOfBirthCode", by.y="Code", all.x=TRUE)
summary(CPS)
## CountryOfBirthCode PeopleInHousehold Region
## Min. : 57.00 Min. : 1.000 Midwest :30684
## 1st Qu.: 57.00 1st Qu.: 2.000 Northeast:25939
## Median : 57.00 Median : 3.000 South :41502
## Mean : 82.68 Mean : 3.284 West :33177
## 3rd Qu.: 57.00 3rd Qu.: 4.000
## Max. :555.00 Max. :15.000
##
## State MetroAreaCode Age
## California :11570 Min. :10420 Min. : 0.00
## Texas : 7077 1st Qu.:21780 1st Qu.:19.00
## New York : 5595 Median :34740 Median :39.00
## Florida : 5149 Mean :35075 Mean :38.83
## Pennsylvania: 3930 3rd Qu.:41860 3rd Qu.:57.00
## Illinois : 3912 Max. :79600 Max. :85.00
## (Other) :94069 NA's :34238
## Married Sex Education
## Divorced :11151 Female:67481 High school :30906
## Married :55509 Male :63821 Bachelor's degree :19443
## Never Married:30772 Some college, no degree:18863
## Separated : 2027 No high school diploma :16095
## Widowed : 6505 Associate degree : 9913
## NA's :25338 (Other) :10744
## NA's :25338
## Race Hispanic Citizenship
## American Indian : 1433 Min. :0.0000 Citizen, Native :116639
## Asian : 6520 1st Qu.:0.0000 Citizen, Naturalized: 7073
## Black : 13913 Median :0.0000 Non-Citizen : 7590
## Multiracial : 2897 Mean :0.1393
## Pacific Islander: 618 3rd Qu.:0.0000
## White :105921 Max. :1.0000
##
## EmploymentStatus Industry
## Disabled : 5712 Educational and health services :15017
## Employed :61733 Trade : 8933
## Not in Labor Force:15246 Professional and business services: 7519
## Retired :18619 Manufacturing : 6791
## Unemployed : 4203 Leisure and hospitality : 6364
## NA's :25789 (Other) :21618
## NA's :65060
## Country
## United States:115063
## Mexico : 3921
## Philippines : 839
## India : 770
## China : 581
## (Other) : 9952
## NA's : 176
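The merge and the missing-value count can be illustrated on a small scale. The sketch below uses two invented data frames (`people` and `country_map`, with made-up codes and values) to show how all.x=TRUE keeps unmatched rows and how the new column and its NA count can be found without reading the full summary.

```r
# Hypothetical stand-ins for CPS and CountryMap; codes and values are made up.
people <- data.frame(
  CountryOfBirthCode = c(57, 57, 303, 999),
  Age                = c(30, 41, 25, 60)
)
country_map <- data.frame(
  Code    = c(57, 303),
  Country = c("United States", "Mexico"),
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps every row of `people`, even when its code has no
# match in `country_map`; the unmatched row gets NA for Country.
merged <- merge(people, country_map,
                by.x = "CountryOfBirthCode", by.y = "Code", all.x = TRUE)

# Column(s) present after the merge that were not there before.
added <- setdiff(names(merged), names(people))

# Number of rows with a missing value in the new column.
n_missing <- sum(is.na(merged$Country))

print(added)
print(n_missing)
```

On the real data, `sum(is.na(CPS$Country))` should agree with the NA's count reported for Country in the summary output.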
Among all interviewees born outside of North America, which country was the most common place of birth?
sort(table(CPS$Country))
##
## Cyprus Kosovo
## 0 0
## Oceania, not specified Other U. S. Island Areas
## 0 0
## Wales Northern Ireland
## 0 2
## Tanzania Azerbaijan
## 2 3
## Czechoslovakia St. Kitts--Nevis
## 3 3
## Georgia Barbados
## 5 6
## Denmark Latvia
## 6 6
## Samoa Senegal
## 6 6
## Singapore Slovakia
## 6 6
## Tonga Zimbabwe
## 6 6
## South America, not specified St. Lucia
## 7 7
## Algeria Americas, not specified
## 9 9
## Belize Fiji
## 9 9
## St. Vincent and the Grenadines Bahamas
## 9 10
## Finland Kuwait
## 10 10
## Lithuania Czech Republic
## 10 11
## Dominica Paraguay
## 11 11
## Croatia Macedonia
## 12 12
## Moldova Antigua and Barbuda
## 12 13
## Belgium Bermuda
## 13 13
## Bolivia Grenada
## 13 13
## Sudan Cape Verde
## 13 15
## Eritrea Sierra Leone
## 15 15
## Uganda Austria
## 15 17
## Morocco Sri Lanka
## 17 17
## U. S. Virgin Islands Uruguay
## 17 17
## Albania Norway
## 18 18
## Europe, not specified Uzbekistan
## 19 19
## West Indies, not specified Malaysia
## 19 20
## Serbia Azores
## 20 22
## USSR New Zealand
## 22 23
## Switzerland Yemen
## 23 23
## Belarus Scotland
## 24 24
## Yugoslavia Hungary
## 24 25
## Afghanistan Indonesia
## 26 26
## Netherlands Sweden
## 28 28
## Bulgaria Costa Rica
## 29 29
## Saudi Arabia Guam
## 29 31
## Cameroon Syria
## 32 32
## Armenia Jordan
## 35 36
## Chile Asia, not specified
## 37 39
## Ireland Spain
## 39 41
## Bangladesh Australia
## 42 43
## Nepal Panama
## 44 44
## Lebanon Myanmar (Burma)
## 45 45
## South Africa Turkey
## 48 48
## Cambodia Liberia
## 49 52
## Kenya Romania
## 55 55
## Greece Israel
## 56 57
## Trinidad and Tobago Bosnia & Herzegovina
## 60 61
## Venezuela Argentina
## 61 64
## Hong Kong Portugal
## 64 64
## Egypt Somalia
## 65 72
## France South Korea
## 73 73
## Ghana Nicaragua
## 76 76
## Ethiopia Elsewhere
## 80 81
## Nigeria Iraq
## 85 97
## Laos Taiwan
## 98 102
## Ukraine Guyana
## 104 109
## Pakistan United Kingdom
## 109 111
## Thailand Africa, not specified
## 128 129
## Ecuador Peru
## 136 136
## Iran Italy
## 144 149
## Brazil Poland
## 159 162
## Haiti Russia
## 167 173
## England Japan
## 179 187
## Honduras Columbia
## 189 206
## Jamaica Guatemala
## 217 309
## Dominican Republic Korea
## 330 334
## Canada Cuba
## 410 426
## Germany Vietnam
## 438 458
## El Salvador Puerto Rico
## 477 518
## China India
## 581 770
## Philippines Mexico
## 839 3921
## United States
## 115063
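With a long table, it can be easier to sort in decreasing order and drop the countries that do not qualify before reading off the top entry. A minimal sketch, assuming a made-up vector of birth countries and a hypothetical, deliberately incomplete list of North American countries:

```r
# Hypothetical birth-country values; counts are made up.
country <- c("United States", "United States", "Mexico", "Mexico", "Mexico",
             "Philippines", "Philippines", "India", "Canada", NA)

# table() ignores NA by default; sort the counts from largest to smallest.
counts <- sort(table(country), decreasing = TRUE)

# Illustrative (incomplete) list of North American countries to exclude.
north_america <- c("United States", "Mexico", "Canada")

# Keep only countries outside the exclusion list, preserving the order.
outside <- counts[!(names(counts) %in% north_america)]

# The first remaining entry is the most common birthplace outside the list.
top_outside <- names(outside)[1]

print(outside)
print(top_outside)
```

For the CPS table, the exclusion list would need to cover every North American entry that appears in the output.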
What proportion of the interviewees from the “New York-Northern New Jersey-Long Island, NY-NJ-PA” metropolitan area have a country of birth that is not the United States? For this computation, don’t include people from this metropolitan area who have a missing country of birth.
table(CPS$MetroArea == "New York-Northern New Jersey-Long Island, NY-NJ-PA", CPS$Country != "United States")
##
## FALSE TRUE
## FALSE 82493 14412
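The table above has only a FALSE row, so the requested proportion cannot be read from it. The summary output earlier lists no MetroArea column, so CPS$MetroArea appears to be resolved to MetroAreaCode through R's partial matching of names after `$`. Comparing numeric codes with an area name is never TRUE. The sketch below reproduces this behavior on invented data and shows one way to compute the proportion by code instead. The `toy` data frame is hypothetical, and 35620 is assumed to be the New York area code only because it is the top code in the India output further down.

```r
# Hypothetical toy data with a code column but no MetroArea column.
toy <- data.frame(
  MetroAreaCode = c(35620, 35620, 35620, 35620, 71650, 71650, NA),
  Country       = c("United States", "India", "Mexico", NA,
                    "United States", "Brazil", "United States"),
  stringsAsFactors = FALSE
)

# `$` falls back to partial matching, so this silently returns MetroAreaCode.
same_column <- identical(toy$MetroArea, toy$MetroAreaCode)

# Codes compared with a name string never match.
never_true <- any(
  toy$MetroArea == "New York-Northern New Jersey-Long Island, NY-NJ-PA",
  na.rm = TRUE
)

# Select the area by code and drop rows with a missing country of birth.
in_area <- !is.na(toy$MetroAreaCode) & toy$MetroAreaCode == 35620
ny <- toy[in_area & !is.na(toy$Country), ]

# Proportion born outside the United States.
prop_foreign <- mean(ny$Country != "United States")

print(same_column)
print(never_true)
print(prop_foreign)
```

Setting `options(warnPartialMatchDollar = TRUE)` makes R warn whenever `$` resolves a name by partial matching, which helps catch this kind of slip.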
Which metropolitan area has the largest number (note – not proportion) of interviewees with a country of birth in India? Hint – remember to include na.rm=TRUE if you are using tapply() to answer this question.
sort(tapply(CPS$Country == "India", CPS$MetroArea, sum, na.rm=TRUE))
## 10420 10500 10580 10900 11020 11100 11300 11460 11500 11540 11700 12020
## 0 0 0 0 0 0 0 0 0 0 0 0
## 12260 12940 13140 13380 13460 13740 13780 14020 14500 14540 14740 15380
## 0 0 0 0 0 0 0 0 0 0 0 0
## 15940 15980 16300 16580 16620 16860 17020 17660 17820 17860 17980 18140
## 0 0 0 0 0 0 0 0 0 0 0 0
## 18580 19340 19380 19460 19500 19740 20100 20260 20500 20740 20940 21340
## 0 0 0 0 0 0 0 0 0 0 0 0
## 21500 21660 21780 22020 22140 22180 22420 22460 22660 22900 23020 23540
## 0 0 0 0 0 0 0 0 0 0 0 0
## 24340 24540 24580 24660 25060 25180 25500 25860 26100 26580 26620 27100
## 0 0 0 0 0 0 0 0 0 0 0 0
## 27140 27340 27500 27740 27780 27900 28020 28100 28660 28700 28740 28940
## 0 0 0 0 0 0 0 0 0 0 0 0
## 29100 29180 29340 29460 29540 29620 29700 29740 30020 30460 30980 31140
## 0 0 0 0 0 0 0 0 0 0 0 0
## 31180 31340 31420 31460 32580 32780 32900 33140 33260 33660 33700 33740
## 0 0 0 0 0 0 0 0 0 0 0 0
## 33780 33860 34740 34820 34900 35660 36100 36140 36780 37340 37460 37860
## 0 0 0 0 0 0 0 0 0 0 0 0
## 38940 39100 39140 39380 39460 39540 39580 39740 40060 40140 40220 40420
## 0 0 0 0 0 0 0 0 0 0 0 0
## 40980 41060 41180 41420 41500 41540 41700 42020 42060 42100 42140 42220
## 0 0 0 0 0 0 0 0 0 0 0 0
## 42260 42340 42540 43340 43620 43780 43900 44060 44180 44220 44700 45220
## 0 0 0 0 0 0 0 0 0 0 0 0
## 45780 45820 46220 46540 46660 46700 46940 47020 47220 47260 47380 47940
## 0 0 0 0 0 0 0 0 0 0 0 0
## 48140 48620 49420 49620 49660 70750 70900 72850 74500 76750 78100 78700
## 0 0 0 0 0 0 0 0 0 0 0 0
## 79600 11340 14060 14260 17140 17900 24860 25420 27260 29940 34940 35380
## 0 1 1 1 1 1 1 1 1 1 1 1
## 36500 39340 45060 46060 12100 12540 13820 16700 17460 19660 23060 29820
## 1 1 1 1 2 2 2 2 2 2 2 2
## 32820 33100 34980 36260 36420 37100 38060 40380 41620 44100 49180 72400
## 2 2 2 2 2 2 2 2 2 2 2 2
## 10740 26980 31540 39900 47300 76450 16740 26900 36540 37900 41740 45940
## 3 3 3 3 3 3 4 4 4 4 4 4
## 46140 77350 36740 42660 12420 15180 19780 30780 38900 47580 75700 45300
## 4 4 5 5 6 6 6 6 6 6 6 7
## 22220 40900 26180 28140 71650 33340 71950 77200 26420 12580 23420 38300
## 8 8 9 11 11 12 12 14 15 16 16 16
## 19100 31100 41940 33460 73450 12060 41860 19820 16980 37980 47900 35620
## 18 19 19 23 26 27 27 30 31 32 50 96
sort(tapply(CPS$Country == "Brazil", CPS$MetroArea, sum, na.rm=TRUE))
## 10500 10580 10900 11020 11100 11300 11340 11460 11500 11540 11700 12020
## 0 0 0 0 0 0 0 0 0 0 0 0
## 12100 12260 12420 12540 12580 12940 13140 13380 13460 13740 13780 13820
## 0 0 0 0 0 0 0 0 0 0 0 0
## 14020 14060 14260 14500 14540 15180 15380 16300 16580 16620 16700 16860
## 0 0 0 0 0 0 0 0 0 0 0 0
## 17460 17660 17820 17860 17980 18140 18580 19380 19460 19500 19660 19780
## 0 0 0 0 0 0 0 0 0 0 0 0
## 19820 20100 20260 20500 20740 20940 21340 21500 21660 21780 22020 22140
## 0 0 0 0 0 0 0 0 0 0 0 0
## 22180 22220 22420 22460 22660 22900 23020 23060 23420 23540 24340 24540
## 0 0 0 0 0 0 0 0 0 0 0 0
## 24580 24660 24860 25060 25180 25420 25500 25860 26100 26180 26420 26580
## 0 0 0 0 0 0 0 0 0 0 0 0
## 26620 26900 26980 27100 27140 27340 27500 27740 27780 27900 28020 28100
## 0 0 0 0 0 0 0 0 0 0 0 0
## 28660 28700 28740 28940 29100 29180 29340 29460 29540 29620 29700 29740
## 0 0 0 0 0 0 0 0 0 0 0 0
## 29820 29940 30020 30460 30780 30980 31180 31340 31420 31460 31540 32580
## 0 0 0 0 0 0 0 0 0 0 0 0
## 32780 32820 32900 33140 33260 33340 33660 33700 33780 34740 34820 34900
## 0 0 0 0 0 0 0 0 0 0 0 0
## 34940 34980 35380 35660 36100 36140 36260 36420 36500 36540 36780 37340
## 0 0 0 0 0 0 0 0 0 0 0 0
## 37460 37900 38300 38900 38940 39100 39140 39340 39380 39460 39580 39740
## 0 0 0 0 0 0 0 0 0 0 0 0
## 39900 40060 40140 40220 40420 40980 41060 41180 41500 41540 41700 41740
## 0 0 0 0 0 0 0 0 0 0 0 0
## 42020 42060 42100 42140 42220 42260 42340 42540 43340 43620 43780 43900
## 0 0 0 0 0 0 0 0 0 0 0 0
## 44060 44100 44180 44220 44700 45060 45220 45780 45820 46060 46140 46220
## 0 0 0 0 0 0 0 0 0 0 0 0
## 46540 46660 46700 46940 47020 47220 47300 47380 47580 47940 48140 49180
## 0 0 0 0 0 0 0 0 0 0 0 0
## 49420 49620 49660 70750 72400 75700 76450 76750 77350 78100 79600 10420
## 0 0 0 0 0 0 0 0 0 0 0 1
## 10740 12060 14740 15980 17020 17140 19740 28140 31140 33460 33740 33860
## 1 1 1 1 1 1 1 1 1 1 1 1
## 37100 37860 39540 40380 41420 41940 42660 45300 45940 47260 48620 73450
## 1 1 1 1 1 1 1 1 1 1 1 1
## 74500 78700 16740 16980 17900 19100 27260 36740 40900 70900 15940 38060
## 1 1 2 2 2 2 2 2 2 2 3 3
## 41620 77200 19340 37980 72850 41860 35620 71950 47900 31100 33100 71650
## 3 3 4 4 5 6 7 7 8 9 16 18
sort(tapply(CPS$Country == "Somalia", CPS$MetroArea, sum, na.rm=TRUE))
## 10420 10500 10580 10740 10900 11020 11100 11300 11340 11460 11500 11540
## 0 0 0 0 0 0 0 0 0 0 0 0
## 11700 12020 12060 12100 12260 12420 12540 12580 12940 13140 13380 13460
## 0 0 0 0 0 0 0 0 0 0 0 0
## 13740 13780 13820 14020 14060 14260 14500 14540 14740 15180 15380 15940
## 0 0 0 0 0 0 0 0 0 0 0 0
## 15980 16300 16580 16620 16700 16740 16860 16980 17020 17140 17460 17660
## 0 0 0 0 0 0 0 0 0 0 0 0
## 17820 17860 17900 17980 18580 19100 19340 19460 19500 19660 19740 19780
## 0 0 0 0 0 0 0 0 0 0 0 0
## 19820 20100 20260 20500 20740 20940 21340 21500 21660 21780 22140 22180
## 0 0 0 0 0 0 0 0 0 0 0 0
## 22220 22420 22460 22660 22900 23020 23060 23420 23540 24340 24540 24580
## 0 0 0 0 0 0 0 0 0 0 0 0
## 24660 24860 25060 25180 25420 25500 25860 26100 26180 26580 26620 26900
## 0 0 0 0 0 0 0 0 0 0 0 0
## 26980 27100 27140 27260 27340 27500 27740 27780 27900 28020 28100 28140
## 0 0 0 0 0 0 0 0 0 0 0 0
## 28660 28700 28740 28940 29100 29180 29340 29460 29540 29620 29700 29740
## 0 0 0 0 0 0 0 0 0 0 0 0
## 29820 29940 30020 30460 30780 30980 31100 31140 31180 31340 31420 31460
## 0 0 0 0 0 0 0 0 0 0 0 0
## 31540 32580 32780 32820 32900 33100 33140 33260 33340 33660 33700 33740
## 0 0 0 0 0 0 0 0 0 0 0 0
## 33780 33860 34740 34820 34900 34940 34980 35380 35620 35660 36100 36140
## 0 0 0 0 0 0 0 0 0 0 0 0
## 36260 36420 36500 36540 36740 36780 37100 37340 37460 37860 37900 37980
## 0 0 0 0 0 0 0 0 0 0 0 0
## 38300 38940 39100 39140 39340 39380 39460 39540 39580 39740 39900 40140
## 0 0 0 0 0 0 0 0 0 0 0 0
## 40220 40380 40420 40900 40980 41180 41420 41500 41540 41620 41700 41740
## 0 0 0 0 0 0 0 0 0 0 0 0
## 41860 41940 42020 42060 42100 42140 42220 42260 42340 42540 43340 43780
## 0 0 0 0 0 0 0 0 0 0 0 0
## 43900 44060 44100 44180 44220 44700 45060 45220 45300 45780 45820 45940
## 0 0 0 0 0 0 0 0 0 0 0 0
## 46060 46140 46220 46540 46660 46700 46940 47020 47220 47260 47300 47380
## 0 0 0 0 0 0 0 0 0 0 0 0
## 47580 47900 47940 48140 48620 49180 49420 49620 49660 70750 70900 71650
## 0 0 0 0 0 0 0 0 0 0 0 0
## 71950 72850 73450 74500 75700 76450 77200 77350 78100 78700 79600 19380
## 0 0 0 0 0 0 0 0 0 0 0 1
## 40060 26420 43620 38900 72400 76750 18140 22020 38060 41060 42660 33460
## 1 2 2 3 3 3 5 5 7 7 7 17
"New York-Northern New Jersey-Long Island, NY-NJ-PA"
## [1] "New York-Northern New Jersey-Long Island, NY-NJ-PA"
In Brazil?
"Boston-Cambridge-Quincy, MA-NH"
## [1] "Boston-Cambridge-Quincy, MA-NH"
In Somalia?
"Minneapolis-St Paul-Bloomington, MN-WI"
## [1] "Minneapolis-St Paul-Bloomington, MN-WI"
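The tapply outputs above are labeled with area codes, while the answers are area names. One way to go from the largest count to a name is `which.max` followed by a lookup in the code-to-name map. This is a sketch on invented data: `toy`, `metro_map`, the codes, and the area names are all hypothetical.

```r
# Hypothetical interviewee data and code-to-name map; all values are made up.
toy <- data.frame(
  MetroAreaCode = c(101, 101, 202, 202, 202, 303, NA),
  Country       = c("India", "United States", "India", "India", NA,
                    "Brazil", "India"),
  stringsAsFactors = FALSE
)
metro_map <- data.frame(
  Code      = c(101, 202, 303),
  MetroArea = c("Area One", "Area Two", "Area Three"),
  stringsAsFactors = FALSE
)

# Count interviewees born in India per area; na.rm = TRUE skips
# interviewees whose country of birth is missing.
india_counts <- tapply(toy$Country == "India", toy$MetroAreaCode,
                       sum, na.rm = TRUE)

# which.max() returns the position of the largest count, named by code.
top_code <- as.numeric(names(which.max(india_counts)))

# Translate the code into the area name.
top_name <- metro_map$MetroArea[match(top_code, metro_map$Code)]

print(india_counts)
print(top_name)
```

Swapping "India" for "Brazil" or "Somalia" answers the other two questions with the same pattern.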