In the wake of the Great Recession of 2009, there has been a good deal of focus on employment statistics, one of the most important metrics policymakers use to gauge the overall strength of the economy. In the United States, the government measures unemployment using the Current Population Survey (CPS), which collects demographic and employment information from a wide range of Americans each month. In this exercise, we will employ the topics reviewed in the lectures as well as a few new techniques using the September 2013 version of this rich, nationally representative dataset (available online).
The observations in the dataset represent people surveyed in the September 2013 CPS who actually completed a survey. While the full dataset has 385 variables, in this exercise we will use a more compact version of the dataset, CPSData.csv, which has the following variables:
PeopleInHousehold: The number of people in the interviewee’s household.
Region: The census region where the interviewee lives.
State: The state where the interviewee lives.
MetroAreaCode: A code that identifies the metropolitan area in which the interviewee lives (missing if the interviewee does not live in a metropolitan area). The mapping from codes to names of metropolitan areas is provided in the file MetroAreaCodes.csv.
Age: The age, in years, of the interviewee. 80 represents people aged 80-84, and 85 represents people aged 85 and higher.
Married: The marriage status of the interviewee.
Sex: The sex of the interviewee.
Education: The maximum level of education obtained by the interviewee.
Race: The race of the interviewee.
Hispanic: Whether the interviewee is of Hispanic ethnicity.
CountryOfBirthCode: A code identifying the country of birth of the interviewee. The mapping from codes to names of countries is provided in the file CountryCodes.csv.
Citizenship: The United States citizenship status of the interviewee.
EmploymentStatus: The status of employment of the interviewee.
Industry: The industry of employment of the interviewee (only available if they are employed).
Load the dataset from CPSData.csv into a data frame called CPS, and view the dataset with the summary() and str() commands. How many interviewees are in the dataset? 131302
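The session that follows assigns CPS from an object named CPSData that was already imported. As a minimal sketch of loading a CSV file directly and counting its rows, the example below writes a tiny stand-in file first so that it runs anywhere; the file contents and the name CPS_toy are hypothetical, and with the real file in the working directory the call would be read.csv("CPSData.csv").

```r
# Toy stand-in for CPSData.csv (hypothetical values) so the sketch is runnable.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "PeopleInHousehold,Region,State,Age",
  "1,South,Alabama,85",
  "3,South,Alabama,21",
  "2,West,Alaska,37"
), csv_path)

# Read the file into a data frame; text columns stay as character vectors.
CPS_toy <- read.csv(csv_path, stringsAsFactors = FALSE)

# Each row is one interviewee, so nrow() answers "how many interviewees".
n_rows <- nrow(CPS_toy)

str(CPS_toy)      # column types and first values
summary(CPS_toy)  # per-column summaries
```

The row count also appears in the first line of the str() output as the number of "obs.".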
CPS=CPSData
summary(CPS)
 PeopleInHousehold   Region             State              MetroAreaCode   Age
 Min.   : 1.000      Length:131302      Length:131302      Min.   :10420   Min.   : 0.00
 1st Qu.: 2.000      Class :character   Class :character   1st Qu.:21780   1st Qu.:19.00
 Median : 3.000      Mode  :character   Mode  :character   Median :34740   Median :39.00
 Mean   : 3.284                                            Mean   :35075   Mean   :38.83
 3rd Qu.: 4.000                                            3rd Qu.:41860   3rd Qu.:57.00
 Max.   :15.000                                            Max.   :79600   Max.   :85.00
                                                           NA's   :34238
 Married            Sex                Education          Race
 Length:131302      Length:131302      Length:131302      Length:131302
 Class :character   Class :character   Class :character   Class :character
 Mode  :character   Mode  :character   Mode  :character   Mode  :character
 Hispanic         CountryOfBirthCode   Citizenship        EmploymentStatus
 Min.   :0.0000   Min.   : 57.00       Length:131302      Length:131302
 1st Qu.:0.0000   1st Qu.: 57.00       Class :character   Class :character
 Median :0.0000   Median : 57.00       Mode  :character   Mode  :character
 Mean   :0.1393   Mean   : 82.68
 3rd Qu.:0.0000   3rd Qu.: 57.00
 Max.   :1.0000   Max.   :555.00
 Industry
 Length:131302
 Class :character
 Mode  :character
str(CPS)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 131302 obs. of 14 variables:
$ PeopleInHousehold : int 1 3 3 3 3 3 3 2 2 2 ...
$ Region : chr "South" "South" "South" "South" ...
$ State : chr "Alabama" "Alabama" "Alabama" "Alabama" ...
$ MetroAreaCode : int 26620 13820 13820 13820 26620 26620 26620 33660 33660 26620 ...
$ Age : int 85 21 37 18 52 24 26 71 43 52 ...
$ Married : chr "Widowed" "Never Married" "Never Married" "Never Married" ...
$ Sex : chr "Female" "Male" "Female" "Male" ...
$ Education : chr "Associate degree" "High school" "High school" "No high school diploma" ...
$ Race : chr "White" "Black" "Black" "Black" ...
$ Hispanic : int 0 0 0 0 0 0 0 0 0 0 ...
$ CountryOfBirthCode: int 57 57 57 57 57 57 57 57 57 57 ...
$ Citizenship : chr "Citizen, Native" "Citizen, Native" "Citizen, Native" "Citizen, Native" ...
$ EmploymentStatus : chr "Retired" "Unemployed" "Disabled" "Not in Labor Force" ...
$ Industry : chr NA "Professional and business services" NA NA ...
- attr(*, "spec")=List of 2
..$ cols :List of 14
.. ..$ PeopleInHousehold : list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
.. ..$ Region : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ State : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ MetroAreaCode : list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
.. ..$ Age : list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
.. ..$ Married : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ Sex : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ Education : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ Race : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ Hispanic : list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
.. ..$ CountryOfBirthCode: list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
.. ..$ Citizenship : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ EmploymentStatus : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ Industry : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
..$ default: list()
.. ..- attr(*, "class")= chr "collector_guess" "collector"
..- attr(*, "class")= chr "col_spec"
Among the interviewees with a value reported for the Industry variable, what is the most common industry of employment? Please enter the name exactly how you see it.
table(CPS$Industry)
Agriculture, forestry, fishing, and hunting Armed forces
1307 29
Construction Educational and health services
4387 15017
Financial Information
4347 1328
Leisure and hospitality Manufacturing
6364 6791
Mining Other services
550 3224
Professional and business services Public administration
7519 3186
Trade Transportation and utilities
8933 3260
Recall from the homework assignment “The Analytical Detective” that you can call the sort() function on the output of the table() function to obtain a sorted breakdown of a variable. For instance, sort(table(CPS$Region)) sorts the regions by the number of interviewees from that region.
Which state has the fewest interviewees? New Mexico
sort(table(CPS$State))
New Mexico Montana Mississippi Alabama
1102 1214 1230 1376
West Virginia Arkansas Louisiana Idaho
1409 1421 1450 1518
Oklahoma Arizona Alaska Wyoming
1523 1528 1590 1624
North Dakota South Carolina Tennessee District of Columbia
1645 1658 1784 1791
Kentucky Utah Nevada Vermont
1841 1842 1856 1890
Kansas Oregon Nebraska Massachusetts
1935 1943 1949 1987
South Dakota Indiana Hawaii Missouri
2000 2004 2099 2145
Rhode Island Delaware Maine Washington
2209 2214 2263 2366
Iowa New Jersey North Carolina New Hampshire
2528 2567 2619 2662
Wisconsin Georgia Connecticut Colorado
2686 2807 2836 2925
Virginia Michigan Minnesota Maryland
2953 3063 3139 3200
Ohio Illinois Pennsylvania Florida
3678 3912 3930 5149
New York Texas California
5595 7077 11570
Which state has the largest number of interviewees? California
What proportion of interviewees are citizens of the United States?
116639/131302
[1] 0.8883261
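The str() output shows that Citizenship has at least the values "Citizen, Native" and "Citizen, Naturalized". If the numerator in the division above is the count for a single category, naturalized citizens are left out of the proportion. A minimal sketch of counting both citizen categories follows; the toy vector is hypothetical, and "Non-Citizen" is an assumed label for the remaining category.

```r
# Hypothetical toy data using the category labels seen in the str() output.
citizenship <- c("Citizen, Native", "Citizen, Native", "Citizen, Naturalized",
                 "Non-Citizen", "Citizen, Native")

# Counts per category, as table(CPS$Citizenship) would give on the real data.
table(citizenship)

# TRUE for either kind of citizen; mean() of a logical vector is the share of TRUEs.
is_citizen <- citizenship %in% c("Citizen, Native", "Citizen, Naturalized")
prop_citizen <- mean(is_citizen)
prop_citizen
```

On the real data frame the same idea would be mean(CPS$Citizenship %in% c("Citizen, Native", "Citizen, Naturalized")), assuming Citizenship has no missing values.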
The CPS differentiates between race (with possible values American Indian, Asian, Black, Pacific Islander, White, or Multiracial) and ethnicity. A number of interviewees are of Hispanic ethnicity, as captured by the Hispanic variable. For which races are there at least 250 interviewees in the CPS dataset of Hispanic ethnicity? (Select all that apply.)
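One possible way to approach this question is a two-way table of Race against Hispanic. The sketch below uses a small hypothetical sample and a threshold of 2 in place of 250; on the real data the table would be table(CPS$Race, CPS$Hispanic).

```r
# Hypothetical toy sample: race labels and a 0/1 Hispanic indicator.
race     <- c("White", "White", "Black", "Asian", "White", "Black")
hispanic <- c(1, 0, 0, 0, 1, 1)

# Rows are races, columns are the Hispanic indicator ("0" and "1").
counts <- table(race, hispanic)
counts

# Keep the Hispanic == 1 column and filter by a threshold
# (2 for this toy sample; the exercise uses 250).
hispanic_by_race <- counts[, "1"]
hispanic_by_race[hispanic_by_race >= 2]
```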
Which variables have at least one interviewee with a missing (NA) value? (Select all that apply.)
summary(CPS)
 PeopleInHousehold   Region             State              MetroAreaCode   Age
 Min.   : 1.000      Length:131302      Length:131302      Min.   :10420   Min.   : 0.00
 1st Qu.: 2.000      Class :character   Class :character   1st Qu.:21780   1st Qu.:19.00
 Median : 3.000      Mode  :character   Mode  :character   Median :34740   Median :39.00
 Mean   : 3.284                                            Mean   :35075   Mean   :38.83
 3rd Qu.: 4.000                                            3rd Qu.:41860   3rd Qu.:57.00
 Max.   :15.000                                            Max.   :79600   Max.   :85.00
                                                           NA's   :34238
 Married            Sex                Education          Race
 Length:131302      Length:131302      Length:131302      Length:131302
 Class :character   Class :character   Class :character   Class :character
 Mode  :character   Mode  :character   Mode  :character   Mode  :character
 Hispanic         CountryOfBirthCode   Citizenship        EmploymentStatus
 Min.   :0.0000   Min.   : 57.00       Length:131302      Length:131302
 1st Qu.:0.0000   1st Qu.: 57.00       Class :character   Class :character
 Median :0.0000   Median : 57.00       Mode  :character   Mode  :character
 Mean   :0.1393   Mean   : 82.68
 3rd Qu.:0.0000   3rd Qu.: 57.00
 Max.   :1.0000   Max.   :555.00
 Industry
 Length:131302
 Class :character
 Mode  :character
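In the summary() output above, an NA count appears only for MetroAreaCode. For character columns, summary() prints just Length, Class, and Mode, so missing values in columns such as Married or Industry would not show up there even though the str() output displays NA entries for Industry. A minimal sketch of counting missing values per column directly, on a hypothetical toy data frame:

```r
# Hypothetical toy data frame mixing integer and character columns.
toy <- data.frame(
  Age           = c(85L, 21L, 3L),
  Married       = c("Widowed", "Never Married", NA),
  MetroAreaCode = c(26620L, NA, NA),
  stringsAsFactors = FALSE
)

# is.na() on a data frame gives a logical matrix; colSums() counts TRUEs per column.
na_counts <- colSums(is.na(toy))
na_counts

# Names of the columns with at least one missing value.
cols_with_na <- names(na_counts)[na_counts > 0]
cols_with_na
```

On the real data the equivalent call would be colSums(is.na(CPS)).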
Often when evaluating a new dataset, we try to identify if there is a pattern in the missing values in the dataset. We will try to determine if there is a pattern in the missing values of the Married variable. The function
is.na(CPS$Married)
returns a vector of TRUE/FALSE values for whether the Married variable is missing. We can see the breakdown of whether Married is missing based on the reported value of the Region variable with the function
table(CPS$Region, is.na(CPS$Married))
Which is the most accurate:
is.na(CPS$Married)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
[15] TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
[29] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[43] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
[57] TRUE TRUE FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[71] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[85] TRUE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[99] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[113] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
[127] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
[141] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
[155] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[169] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[183] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE TRUE FALSE FALSE
[197] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
[211] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[225] FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
[239] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[253] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[267] FALSE FALSE TRUE TRUE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE TRUE TRUE
[281] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
[295] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
[309] FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
[323] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
[337] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
[351] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE TRUE FALSE
[365] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[379] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE TRUE TRUE TRUE
[393] FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[407] TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[421] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[435] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[449] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE
[463] FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
[477] FALSE TRUE TRUE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
[491] TRUE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[505] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
[519] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
[533] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[547] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE
[561] TRUE TRUE FALSE FALSE TRUE FALSE FALSE TRUE TRUE FALSE TRUE FALSE FALSE FALSE
[575] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[589] FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[603] TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[617] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE
[631] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE
[645] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[659] FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[673] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
[687] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
[701] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE
[715] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[729] TRUE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[743] FALSE FALSE TRUE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[757] FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[771] FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE TRUE
[785] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
[799] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
[813] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[827] FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE
[841] TRUE FALSE FALSE TRUE TRUE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[855] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[869] FALSE FALSE TRUE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[883] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[897] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[911] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
[925] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[939] FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[953] FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[967] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
[981] FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[995] FALSE TRUE FALSE FALSE FALSE TRUE
[ reached getOption("max.print") -- omitted 130302 entries ]
table(CPS$Region, is.na(CPS$Married))
FALSE TRUE
Midwest 24609 6075
Northeast 21432 4507
South 33535 7967
West 26388 6789
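Raw counts are hard to compare because the regions differ in size. Dividing each TRUE count above by its row total gives missing shares of roughly 17% to 20% in every region. A minimal sketch of getting row-wise shares with prop.table(), using hypothetical toy vectors:

```r
# Hypothetical toy data: region and marital status with some missing values.
region  <- c("South", "South", "West", "West", "West", "Midwest")
married <- c("Married", NA, "Divorced", NA, NA, "Married")

# Same shape as table(CPS$Region, is.na(CPS$Married)).
tab <- table(region, is.na(married))

# margin = 1 divides each row by its row total, so each row sums to 1.
row_share <- prop.table(tab, margin = 1)
row_share

# Share of missing values in one region.
west_missing <- as.numeric(row_share["West", "TRUE"])
```

The same cross-tabulation can be repeated with other variables in place of Region (for instance Sex or Age) to look for a pattern in the missing values.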
As mentioned in the variable descriptions, MetroAreaCode is missing if an interviewee does not live in a metropolitan area. Using the same technique as in the previous question, answer the following questions about people who live in non-metropolitan areas.
How many states had all interviewees living in a non-metropolitan area (that is, every interviewee from the state has a missing MetroAreaCode value)? For this question, treat the District of Columbia as a state (even though it is not technically a state). 2 (Alaska and Wyoming)
table(CPS$State, is.na(CPS$MetroAreaCode))
FALSE TRUE
Alabama 1020 356
Alaska 0 1590
Arizona 1327 201
Arkansas 724 697
California 11333 237
Colorado 2545 380
Connecticut 2593 243
Delaware 1696 518
District of Columbia 1791 0
Florida 4947 202
Georgia 2250 557
Hawaii 1576 523
Idaho 761 757
Illinois 3473 439
Indiana 1420 584
Iowa 1297 1231
Kansas 1234 701
Kentucky 908 933
Louisiana 1216 234
Maine 909 1354
Maryland 2978 222
Massachusetts 1858 129
Michigan 2517 546
Minnesota 2150 989
Mississippi 376 854
Missouri 1440 705
Montana 199 1015
Nebraska 816 1133
Nevada 1609 247
New Hampshire 1148 1514
New Jersey 2567 0
New Mexico 832 270
New York 5144 451
North Carolina 1642 977
North Dakota 432 1213
Ohio 2754 924
Oklahoma 1024 499
Oregon 1519 424
Pennsylvania 3245 685
Rhode Island 2209 0
South Carolina 1139 519
South Dakota 595 1405
Tennessee 1149 635
Texas 6060 1017
Utah 1455 387
Vermont 657 1233
Virginia 2367 586
Washington 1937 429
West Virginia 344 1065
Wisconsin 1882 804
Wyoming 0 1624
How many states had all interviewees living in a metropolitan area? Again, treat the District of Columbia as a state. 3 (District of Columbia, New Jersey, and Rhode Island)
Which region of the United States has the largest proportion of interviewees living in a non-metropolitan area?
While we were able to use the table() command to compute the proportion of interviewees from each region not living in a metropolitan area, it was somewhat tedious (it involved manually computing the proportion for each region) and isn’t something you would want to do if there were a larger number of options. It turns out there is a less tedious way to compute the proportion of values that are TRUE. The mean() function, which takes the average of the values passed to it, will treat TRUE as 1 and FALSE as 0, meaning it returns the proportion of values that are true. For instance, mean(c(TRUE, FALSE, TRUE, TRUE)) returns 0.75. Knowing this, use tapply() with the mean function to answer the following questions:
Which state has a proportion of interviewees living in a non-metropolitan area closest to 30%?
Which state has the largest proportion of non-metropolitan interviewees, ignoring states where all interviewees were non-metropolitan?
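The tapply() approach described above can be sketched as follows. The state and code vectors are a hypothetical toy sample; on the real data the call would be tapply(is.na(CPS$MetroAreaCode), CPS$State, mean).

```r
# Hypothetical toy sample: state of residence and metropolitan area code.
state      <- c("Alaska", "Alaska", "Ohio", "Ohio", "Ohio", "Ohio", "Texas", "Texas")
metro_code <- c(NA, NA, 10420L, NA, 10420L, 10420L, 19100L, 19100L)

# Group the TRUE/FALSE "is missing" values by state and average each group:
# the result is the non-metropolitan share per state.
non_metro_share <- tapply(is.na(metro_code), state, mean)
sort(non_metro_share)

# State whose share is closest to 30%.
closest <- names(which.min(abs(non_metro_share - 0.30)))

# Largest share among states that are not entirely non-metropolitan.
partial <- non_metro_share[non_metro_share < 1]
largest_partial <- names(which.max(partial))
```

Replacing the state vector with a region vector gives the per-region proportions without any manual division.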
Codes like MetroAreaCode and CountryOfBirthCode are a compact way to encode factor variables with text as their possible values, and they are therefore quite common in survey datasets. In fact, all but one of the variables in this dataset were actually stored by a numeric code in the original CPS datafile.
When analyzing a variable stored by a numeric code, we will often want to convert it into the values the codes represent. To do this, we will use a dictionary, which maps the code to the actual value of the variable. We have provided dictionaries MetroAreaCodes.csv and CountryCodes.csv, which respectively map MetroAreaCode and CountryOfBirthCode into their true values. Read these two dictionaries into data frames MetroAreaMap and CountryMap.
How many observations (codes for metropolitan areas) are there in MetroAreaMap? 271
str(MetroAreaMap)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 271 obs. of 2 variables:
$ Code : chr "00460" "03000" "03160" "03610" ...
$ MetroArea: chr "Appleton-Oshkosh-Neenah, WI" "Grand Rapids-Muskegon-Holland, MI" "Greenville-Spartanburg-Anderson, SC" "Jamestown, NY" ...
- attr(*, "spec")=List of 2
..$ cols :List of 2
.. ..$ Code : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ MetroArea: list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
..$ default: list()
.. ..- attr(*, "class")= chr "collector_guess" "collector"
..- attr(*, "class")= chr "col_spec"
How many observations (codes for countries) are there in CountryMap? 149
str(CountryMap)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 149 obs. of 2 variables:
$ Code : int 57 66 73 78 96 100 102 103 104 105 ...
$ Country: chr "United States" "Guam" "Puerto Rico" "U. S. Virgin Islands" ...
- attr(*, "spec")=List of 2
..$ cols :List of 2
.. ..$ Code : list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
.. ..$ Country: list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
..$ default: list()
.. ..- attr(*, "class")= chr "collector_guess" "collector"
..- attr(*, "class")= chr "col_spec"
To merge in the metropolitan areas, we want to connect the field MetroAreaCode from the CPS data frame with the field Code in MetroAreaMap. The following command merges the two data frames on these columns, overwriting the CPS data frame with the result:
CPS = merge(CPS, MetroAreaMap, by.x="MetroAreaCode", by.y="Code", all.x=TRUE)
The first two arguments determine the data frames to be merged (they are called “x” and “y”, respectively, in the subsequent parameters to the merge function). by.x="MetroAreaCode" means we’re matching on the MetroAreaCode variable from the “x” data frame (CPS), while by.y="Code" means we’re matching on the Code variable from the “y” data frame (MetroAreaMap). Finally, all.x=TRUE means we want to keep all rows from the “x” data frame (CPS), even if some of the rows’ MetroAreaCode doesn’t match any codes in MetroAreaMap (for those familiar with database terminology, this parameter makes the operation a left outer join instead of an inner join).
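To make the effect of all.x=TRUE concrete, here is a minimal sketch on two small hypothetical data frames. The names people and area_map and the label "Hypothetical area" are illustrative only; the code 10420 with the name "Akron, OH" is taken from the output in this exercise.

```r
# Hypothetical "x" data frame: one row has no metropolitan code.
people <- data.frame(
  MetroAreaCode = c(10420L, 13820L, NA),
  State         = c("Ohio", "Alabama", "Alaska"),
  stringsAsFactors = FALSE
)

# Hypothetical "y" dictionary mapping codes to names.
area_map <- data.frame(
  Code      = c(10420L, 13820L),
  MetroArea = c("Akron, OH", "Hypothetical area"),
  stringsAsFactors = FALSE
)

# Left outer join: every row of people is kept, unmatched rows get NA.
left <- merge(people, area_map, by.x = "MetroAreaCode", by.y = "Code", all.x = TRUE)

# Inner join (the default): rows without a match are dropped.
inner <- merge(people, area_map, by.x = "MetroAreaCode", by.y = "Code")

n_left    <- nrow(left)                  # all 3 rows survive
n_inner   <- nrow(inner)                 # only the 2 matched rows survive
n_missing <- sum(is.na(left$MetroArea))  # rows that had no match
```

merge() also moves the key column to the front and sorts the result by it, which is why MetroAreaCode appears first in the merged data frame.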
Review the new version of the CPS data frame with the summary() and str() functions. What is the name of the variable that was added to the data frame by the merge() operation?
CPS = merge(CPS, MetroAreaMap, by.x="MetroAreaCode", by.y="Code", all.x=TRUE)
summary(CPS)
 MetroAreaCode   PeopleInHousehold   Region             State              Age
 Min.   :10420   Min.   : 1.000      Length:131302      Length:131302      Min.   : 0.00
 1st Qu.:21780   1st Qu.: 2.000      Class :character   Class :character   1st Qu.:19.00
 Median :34740   Median : 3.000      Mode  :character   Mode  :character   Median :39.00
 Mean   :35075   Mean   : 3.284                                            Mean   :38.83
 3rd Qu.:41860   3rd Qu.: 4.000                                            3rd Qu.:57.00
 Max.   :79600   Max.   :15.000                                            Max.   :85.00
 NA's   :34238
 Married            Sex                Education          Race
 Length:131302      Length:131302      Length:131302      Length:131302
 Class :character   Class :character   Class :character   Class :character
 Mode  :character   Mode  :character   Mode  :character   Mode  :character
 Hispanic         CountryOfBirthCode   Citizenship        EmploymentStatus
 Min.   :0.0000   Min.   : 57.00       Length:131302      Length:131302
 1st Qu.:0.0000   1st Qu.: 57.00       Class :character   Class :character
 Median :0.0000   Median : 57.00       Mode  :character   Mode  :character
 Mean   :0.1393   Mean   : 82.68
 3rd Qu.:0.0000   3rd Qu.: 57.00
 Max.   :1.0000   Max.   :555.00
 Industry           MetroArea
 Length:131302      Length:131302
 Class :character   Class :character
 Mode  :character   Mode  :character
str(CPS)
'data.frame': 131302 obs. of 15 variables:
$ MetroAreaCode : int 10420 10420 10420 10420 10420 10420 10420 10420 10420 10420 ...
$ PeopleInHousehold : int 4 4 2 4 1 3 4 4 2 3 ...
$ Region : chr "Midwest" "Midwest" "Midwest" "Midwest" ...
$ State : chr "Ohio" "Ohio" "Ohio" "Ohio" ...
$ Age : int 2 9 73 40 63 19 30 6 60 32 ...
$ Married : chr NA NA "Married" "Married" ...
$ Sex : chr "Male" "Male" "Female" "Female" ...
$ Education : chr NA NA "Some college, no degree" "High school" ...
$ Race : chr "White" "White" "White" "White" ...
$ Hispanic : int 0 0 0 0 0 0 0 1 0 0 ...
$ CountryOfBirthCode: int 57 57 57 362 57 57 203 57 57 57 ...
$ Citizenship : chr "Citizen, Native" "Citizen, Native" "Citizen, Native" "Citizen, Naturalized" ...
$ EmploymentStatus : chr NA NA "Retired" "Not in Labor Force" ...
$ Industry : chr NA NA NA NA ...
$ MetroArea : chr "Akron, OH" "Akron, OH" "Akron, OH" "Akron, OH" ...
How many interviewees have a missing value for the new metropolitan area variable? Note that all of these interviewees would have been removed from the merged data frame if we did not include the all.x=TRUE parameter.
Which of the following metropolitan areas has the largest number of interviewees?
table(CPS$MetroArea)
Akron, OH
231
Albany-Schenectady-Troy, NY
268
Albany, GA
68
Albuquerque, NM
609
Allentown-Bethlehem-Easton, PA-NJ
334
Altoona, PA
82
Amarillo, TX
88
Anderson, IN
62
Anderson, SC
64
Ann Arbor, MI
85
Anniston-Oxford, AL
61
Appleton,WI
125
Asheville, NC
116
Athens-Clark County, GA
65
Atlanta-Sandy Springs-Marietta, GA
1552
Atlantic City, NJ
111
Augusta-Richmond County, GA-SC
161
Austin-Round Rock, TX
516
Bakersfield, CA
245
Baltimore-Towson, MD
1483
Bangor, ME
208
Barnstable Town, MA
75
Baton Rouge, LA
262
Beaumont-Port Author, TX
123
Bellingham, WA
70
Bend, OR
140
Billings, MT
199
Binghamton, NY
73
Birmingham-Hoover, AL
392
Bloomington-Normal IL
40
Bloomington, IN
104
Boise City-Nampa, ID
644
Boston-Cambridge-Quincy, MA-NH
2229
Boulder, CO
171
Bowling Green, KY
29
Bremerton-Silverdale, WA
87
Bridgeport-Stamford-Norwalk, CT
730
Brownsville-Harlingen, TX
79
Buffalo-Niagara Falls, NY
344
Burlington-South Burlington, VT
657
Canton-Massillon, OH
118
Cape Coral-Fort Myers, FL
146
Cedar Rapids, IA
196
Champaign-Urbana, IL
122
Charleston-North Charleston, SC
232
Charleston, WV
262
Charlotte-Gastonia-Concord, NC-SC
517
Chattanooga, TN-GA
167
Chicago-Naperville-Joliet, IN-IN-WI
2772
Chico, CA
60
Cincinnati-Middletown, OH-KY-IN
719
Cleveland-Elyria-Mentor, OH
681
Coeur d'Alene, ID
117
Colorado Springs, CO
372
Columbia, MO
47
Columbia, SC
291
Columbus, GA-AL
59
Columbus, OH
551
Corpus Christi, TX
132
Dallas-Fort Worth-Arlington, TX
1863
Danbury, CT
112
Davenport-Moline-Rock Island, IA-IL
240
Dayton, OH
268
Decatur, Al
96
Decatur, IL
81
Deltona-Daytona Beach-Ormond Beach, FL
140
Denver-Aurora, CO
1504
Des Moines, IA
501
Detroit-Warren-Livonia, MI
1354
Dover, DE
456
Duluth, MN-WI
126
Durham, NC
189
Eau Claire, WI
110
El Centro, CA
99
El Paso, TX
244
Erie, PA
87
Eugene-Springfield, OR
196
Evansville, IN-KY
99
Fargo, ND-MN
432
Farmington, NM
64
Fayetteville-Springdale-Rogers, AR-MO
215
Fayetteville, NC
77
Flint, MI
102
Florence, AL
63
Fort Collins-Loveland, CO
206
Fort Smith, AR-OK
105
Fort Walton Beach-Crestview-Destin, FL
80
Fort Wayne, IN
136
Fresno, CA
303
Gainesville, FL
70
Grand Rapids-Wyoming, MI
304
Greeley, CO
162
Green Bay, WI
136
Greensboro-High Point, NC
251
Greenville, SC
185
Gulfport-Biloxi, MS
65
Hagerstown-Martinsburg, MD-WV
86
Harrisburg-Carlisle, PA
174
Harrisonburg, VA
90
Hartford-West Hartford-East Hartford, CT
885
Hickory-Morgantown-Lenoir, NC
57
Holland-Grand Haven, MI
78
Honolulu, HI
1576
Houston-Baytown-Sugar Land, TX
1649
Huntington-Ashland, WV-KY-OH
82
Huntsville, AL
117
Indianapolis, IN
570
Iowa City, IA
131
Jackson, MI
70
Jackson, MS
222
Jacksonville, FL
393
Jacksonville, NC
63
Janesville, WI
99
Johnson City, TN
52
Johnstown, PA
63
Joplin, MO
59
Kalamazoo-Portage, MI
127
Kankakee-Bradley, IL
87
Kansas City, MO-KS
962
Killeen-Temple-Fort Hood, TX
101
Kingsport-Bristol, TN-VA
67
Kingston, NY
87
Knoxville, TN
168
La Crosse, WI
114
Lafayette, LA
181
Lake Charles, LA
81
Lakeland-Winter Haven, FL
149
Lancaster, PA
156
Lansing-East Lansing, MI
119
Laredo, TX
89
Las Cruses, NM
107
Las Vegas-Paradise, NV
1299
Lawrence, KS
98
Lawton, OK
97
Leominster-Fitchburg-Gardner, MA
66
Lexington-Fayette, KY
198
Little Rock-North Little Rock, AR
404
Longview, TX
65
Los Angeles-Long Beach-Santa Ana, CA
4102
Louisville, KY-IN
519
Lubbock, TX
63
Lynchburg, VA
73
Macon, GA
65
Madera, CA
57
Madison, WI
284
McAllen-Edinburg-Pharr, TX
195
Medford, OR
82
Memphis, TN-MS-AR
348
Merced, CA
106
Miami-Fort Lauderdale-Miami Beach, FL
1554
Michigan City-La Porte, IN
77
Midland, TX
51
Milwaukee-Waukesha-West Allis, WI
714
Minneapolis-St Paul-Bloomington, MN-WI
1942
Mobile, AL
110
Modesto, CA
158
Monroe, LA
179
Monroe, MI
63
Montgomery, AL
103
Muskegon-Norton Shores, MI
90
Myrtle Beach-Conway-North Myrtle Beach, SC
102
Napa, CA
61
Naples-Marco Island, FL
82
Nashville-Davidson-Murfreesboro, TN
505
New Haven, CT
506
New Orleans-Metairie-Kenner, LA
367
New York-Northern New Jersey-Long Island, NY-NJ-PA
5409
Niles-Benton Harbor, MI
51
Norwich-New London, CT-RI
203
Ocala, FL
76
Ocean City, NJ
30
Ogden-Clearfield, UT
423
Oklahoma City, OK
604
Olympia, WA
99
Omaha-Council Bluffs, NE-IA
957
Orlando, FL
610
Oshkosh-Neenah, WI
85
Oxnard-Thousand Oaks-Ventura, CA
267
Palm Bay-Melbourne-Titusville, FL
168
Panama City-Lynn Haven, FL
59
Pensacola-Ferry Pass-Brent, FL
107
Peoria, IL
112
Philadelphia-Camden-Wilmington, PA-NJ-DE
2855
Phoenix-Mesa-Scottsdale, AZ
971
Pittsburgh, PA
732
Port St. Lucie-Fort Pierce, FL
109
Portland-South Portland, ME
701
Portland-Vancouver-Beaverton, OR-WA
1089
Poughkeepsie-Newburgh-Middletown, NY
201
Prescott, AZ
54
Providence-Fall River-Warwick, MA-RI
2284
Provo-Orem, UT
309
Pueblo, CO
130
Punta Gorda, FL
48
Racine, WI
119
Raleigh-Cary, NC
336
Reading, PA
142
Reno-Sparks, NV
310
Richmond, VA
490
Riverside-San Bernardino, CA
1290
Roanoke, VA
66
Rochester-Dover, NH-ME
262
Rochester, NY
307
Rockford, IL
114
Sacramento-Arden-Arcade-Roseville, CA
667
Saginaw-Saginaw Township North, MI
74
Salem, OR
170
Salinas, CA
104
Salisbury, MD
74
Salt Lake City, UT
723
San Antonio, TX
607
San Diego-Carlsbad-San Marcos, CA
907
San Francisco-Oakland-Fremont, CA
1386
San Jose-Sunnyvale-Santa Clara, CA
670
San Luis Obispo-Paso Robles, CA
77
Santa Barbara-Santa Maria-Goleta, CA
132
Santa Fe, NM
52
Santa Rosa-Petaluma, CA
129
Santa-Cruz-Watsonville, CA
66
Sarasota-Bradenton-Venice, FL
192
Savannah, GA
202
Scranton-Wilkes Barre, PA
176
Seattle-Tacoma-Bellevue, WA
1255
Shreveport-Bossier City, LA
146
Sioux Falls, SD
595
South Bend-Mishawaka, IN-MI
81
Spartanburg, SC
99
Spokane, WA
156
Springfield, IL
76
Springfield, MA-CT
155
Springfield, MO
161
Springfield, OH
34
St. Cloud, MN
82
St. Louis, MO-IL
956
Stockton, CA
193
Syracuse, NY
223
Tallahassee, FL
43
Tampa-St. Petersburg-Clearwater, FL
842
Toledo, OH
235
Topeka, KS
182
Trenton-Ewing, NJ
91
Tucson, AZ
302
Tulsa, OK
323
Tuscaloosa, AL
78
Utica-Rome, NY
80
Valdosta, GA
42
Vallejo-Fairfield, CA
133
Vero Beach, FL
79
Victoria, TX
116
Vineland-Millville-Bridgeton, NJ
54
Virginia Beach-Norfolk-Newport News, VA-NC
597
Visalia-Porterville, CA
121
Waco, TX
79
Warner Robins, GA
42
Washington-Arlington-Alexandria, DC-VA-MD-WV
4177
Waterbury, CT
157
Waterloo-Cedar Falls, IA
156
Wausau, WI
96
Wichita, KS
427
Winston-Salem, NC
127
Worcester, MA-CT
144
Yakima, WA
112
York-Hanover, PA
117
Youngstown-Warren-Boardman, OH
153
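With this many categories, scanning the unsorted table by eye is error-prone. A minimal sketch of picking out the largest groups programmatically, on a hypothetical toy vector of area names:

```r
# Hypothetical toy vector of metropolitan area names.
metro <- c("Akron, OH", "Dayton, OH", "Akron, OH",
           "Toledo, OH", "Akron, OH", "Dayton, OH")

counts <- table(metro)

# Largest groups first; head() keeps only the top few.
top_counts <- head(sort(counts, decreasing = TRUE), 2)
top_counts

# Name of the single most frequent area.
top_area <- names(which.max(counts))
```

On the real data the first call would be sort(table(CPS$MetroArea), decreasing = TRUE).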
Which metropolitan area has the highest proportion of interviewees of Hispanic ethnicity? Hint: Use tapply() with mean, as in the previous subproblem. Calling sort() on the output of tapply() could also be helpful here.
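Because Hispanic is stored as 0/1, its mean within a group is the proportion of Hispanic interviewees in that group. A minimal sketch of the hint, using hypothetical toy vectors (the area names "A" and "B" are placeholders):

```r
# Hypothetical toy data: area of residence (NA = non-metropolitan) and 0/1 indicator.
metro    <- c("A", "A", "B", "B", "B", NA)
hispanic <- c(1, 0, 1, 1, 0, 1)

# Mean of the 0/1 indicator per area; rows with a missing area are left out.
share <- tapply(hispanic, metro, mean)

# Ascending order, so the highest proportion is printed last.
sort(share)

# Name of the area with the highest proportion.
top_area <- names(which.max(share))
```

On the real data this corresponds to sort(tapply(CPS$Hispanic, CPS$MetroArea, mean)).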
Remembering that CPS$Race == "Asian" returns a TRUE/FALSE vector of whether an interviewee is Asian, determine the number of metropolitan areas in the United States from which at least 20% of interviewees are Asian.
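Sorting the proportions, as the command that follows does, means reading the count off a long printout. A possible shortcut is to compare the proportions with the threshold and sum the resulting TRUE values. The sketch below uses hypothetical toy vectors with placeholder area names.

```r
# Hypothetical toy data: race and area of residence.
race  <- c("Asian", "White",                    # area X: 1 of 2 Asian
           "Asian", "White", "White", "Black",  # area Y: 1 of 4 Asian
           "White", "White")                    # area Z: 0 of 2 Asian
metro <- c("X", "X", "Y", "Y", "Y", "Y", "Z", "Z")

# Proportion of Asian interviewees per area.
asian_share <- tapply(race == "Asian", metro, mean)

# TRUE counts as 1, so sum() gives the number of areas at or above 20%.
n_areas <- sum(asian_share >= 0.20)

# Names of those areas, for checking against the sorted output.
names(asian_share)[asian_share >= 0.20]
```

On the real data this would be sum(tapply(CPS$Race == "Asian", CPS$MetroArea, mean) >= 0.2), assuming Race has no missing values.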
sort(tapply(CPS$Race == "Asian", CPS$MetroArea, mean))
Albany, GA
0.000000000
Altoona, PA
0.000000000
Amarillo, TX
0.000000000
Anderson, IN
0.000000000
Appleton,WI
0.000000000
Asheville, NC
0.000000000
Barnstable Town, MA
0.000000000
Beaumont-Port Author, TX
0.000000000
Billings, MT
0.000000000
Binghamton, NY
0.000000000
Bloomington, IN
0.000000000
Bowling Green, KY
0.000000000
Canton-Massillon, OH
0.000000000
Charleston, WV
0.000000000
Chico, CA
0.000000000
Columbus, GA-AL
0.000000000
Decatur, IL
0.000000000
Durham, NC
0.000000000
Eau Claire, WI
0.000000000
El Paso, TX
0.000000000
Erie, PA
0.000000000
Farmington, NM
0.000000000
Florence, AL
0.000000000
Hagerstown-Martinsburg, MD-WV
0.000000000
Huntsville, AL
0.000000000
Jackson, MI
0.000000000
Jackson, MS
0.000000000
Janesville, WI
0.000000000
Johnson City, TN
0.000000000
Joplin, MO
0.000000000
Kankakee-Bradley, IL
0.000000000
Killeen-Temple-Fort Hood, TX
0.000000000
Kingsport-Bristol, TN-VA
0.000000000
Knoxville, TN
0.000000000
Lafayette, LA
0.000000000
Lansing-East Lansing, MI
0.000000000
Laredo, TX
0.000000000
Leominster-Fitchburg-Gardner, MA
0.000000000
Longview, TX
0.000000000
Lubbock, TX
0.000000000
Lynchburg, VA
0.000000000
Macon, GA
0.000000000
Madera, CA
0.000000000
McAllen-Edinburg-Pharr, TX
0.000000000
Michigan City-La Porte, IN
0.000000000
Midland, TX
0.000000000
Monroe, MI
0.000000000
Muskegon-Norton Shores, MI
0.000000000
Myrtle Beach-Conway-North Myrtle Beach, SC
0.000000000
Niles-Benton Harbor, MI
0.000000000
Ocean City, NJ
0.000000000
Oshkosh-Neenah, WI
0.000000000
Port St. Lucie-Fort Pierce, FL
0.000000000
Poughkeepsie-Newburgh-Middletown, NY
0.000000000
Pueblo, CO
0.000000000
Punta Gorda, FL
0.000000000
Racine, WI
0.000000000
Reading, PA
0.000000000
Roanoke, VA
0.000000000
Rockford, IL
0.000000000
Saginaw-Saginaw Township North, MI
0.000000000
Salem, OR
0.000000000
Salisbury, MD
0.000000000
Santa Fe, NM
0.000000000
Santa-Cruz-Watsonville, CA
0.000000000
Scranton-Wilkes Barre, PA
0.000000000
Shreveport-Bossier City, LA
0.000000000
South Bend-Mishawaka, IN-MI
0.000000000
Spartanburg, SC
0.000000000
Springfield, MA-CT
0.000000000
Springfield, OH
0.000000000
St. Cloud, MN
0.000000000
Tallahassee, FL
0.000000000
Tuscaloosa, AL
0.000000000
Utica-Rome, NY
0.000000000
Valdosta, GA
0.000000000
Vero Beach, FL
0.000000000
Victoria, TX
0.000000000
Vineland-Millville-Bridgeton, NJ
0.000000000
Waco, TX
0.000000000
Waterbury, CT
0.000000000
Wausau, WI
0.000000000
St. Louis, MO-IL
0.002092050
New Orleans-Metairie-Kenner, LA
0.002724796
San Antonio, TX
0.003294893
Charleston-North Charleston, SC
0.004310345
Monroe, LA
0.005586592
Chattanooga, TN-GA
0.005988024
Modesto, CA
0.006329114
Bend, OR
0.007142857
Dayton, OH
0.007462687
Santa Barbara-Santa Maria-Goleta, CA
0.007575758
Santa Rosa-Petaluma, CA
0.007751938
Toledo, OH
0.008510638
Coeur d'Alene, ID
0.008547009
York-Hanover, PA
0.008547009
Yakima, WA
0.008928571
Grand Rapids-Wyoming, MI
0.009868421
Sioux Falls, SD
0.010084034
Evansville, IN-KY
0.010101010
Lawrence, KS
0.010204082
Cleveland-Elyria-Mentor, OH
0.010279001
Lawton, OK
0.010309278
Boise City-Nampa, ID
0.010869565
Harrisburg-Carlisle, PA
0.011494253
Kingston, NY
0.011494253
Louisville, KY-IN
0.011560694
Medford, OR
0.012195122
Greeley, CO
0.012345679
Springfield, MO
0.012422360
Birmingham-Hoover, AL
0.012755102
Waterloo-Cedar Falls, IA
0.012820513
Provo-Orem, UT
0.012944984
Youngstown-Warren-Boardman, OH
0.013071895
Ocala, FL
0.013157895
Allentown-Bethlehem-Easton, PA-NJ
0.014970060
Corpus Christi, TX
0.015151515
Dover, DE
0.015350877
Charlotte-Gastonia-Concord, NC-SC
0.015473888
Sarasota-Bradenton-Venice, FL
0.015625000
Kalamazoo-Portage, MI
0.015748031
Winston-Salem, NC
0.015748031
Johnstown, PA
0.015873016
Colorado Springs, CO
0.016129032
Champaign-Urbana, IL
0.016393443
Napa, CA
0.016393443
Panama City-Lynn Haven, FL
0.016949153
Memphis, TN-MS-AR
0.017241379
Columbus, OH
0.018148820
Prescott, AZ
0.018518519
Las Cruses, NM
0.018691589
Pensacola-Ferry Pass-Brent, FL
0.018691589
Spokane, WA
0.019230769
Fort Collins-Loveland, CO
0.019417476
Flint, MI
0.019607843
Savannah, GA
0.019801980
Tucson, AZ
0.019867550
El Centro, CA
0.020202020
Eugene-Springfield, OR
0.020408163
Davenport-Moline-Rock Island, IA-IL
0.020833333
Deltona-Daytona Beach-Ormond Beach, FL
0.021428571
Topeka, KS
0.021978022
Cincinnati-Middletown, OH-KY-IN
0.022253129
Little Rock-North Little Rock, AR
0.022277228
Albany-Schenectady-Troy, NY
0.022388060
Baton Rouge, LA
0.022900763
Bremerton-Silverdale, WA
0.022988506
Bangor, ME
0.024038462
Naples-Marco Island, FL
0.024390244
Indianapolis, IN
0.024561404
Augusta-Richmond County, GA-SC
0.024844720
Holland-Grand Haven, MI
0.025641026
Fayetteville, NC
0.025974026
Ogden-Clearfield, UT
0.026004728
Rochester-Dover, NH-ME
0.026717557
Virginia Beach-Norfolk-Newport News, VA-NC
0.026800670
Lakeland-Winter Haven, FL
0.026845638
Columbia, SC
0.027491409
Fargo, ND-MN
0.027777778
Bellingham, WA
0.028571429
Montgomery, AL
0.029126214
Omaha-Council Bluffs, NE-IA
0.029258098
Akron, OH
0.030303030
Wichita, KS
0.030444965
Athens-Clark County, GA
0.030769231
Gulfport-Biloxi, MS
0.030769231
Anderson, SC
0.031250000
Denver-Aurora, CO
0.031914894
Greenville, SC
0.032432432
Philadelphia-Camden-Wilmington, PA-NJ-DE
0.032924694
Harrisonburg, VA
0.033333333
Cape Coral-Fort Myers, FL
0.034246575
Kansas City, MO-KS
0.034303534
Worcester, MA-CT
0.034722222
Oklahoma City, OK
0.034768212
Hickory-Morgantown-Lenoir, NC
0.035087719
Lexington-Fayette, KY
0.035353535
Miami-Fort Lauderdale-Miami Beach, FL
0.035392535
Palm Bay-Melbourne-Titusville, FL
0.035714286
Salt Lake City, UT
0.035961272
Mobile, AL
0.036363636
Huntington-Ashland, WV-KY-OH
0.036585366
Richmond, VA
0.036734694
Fort Wayne, IN
0.036764706
Fort Walton Beach-Crestview-Destin, FL
0.037500000
Des Moines, IA
0.037924152
Phoenix-Mesa-Scottsdale, AZ
0.038105046
Pittsburgh, PA
0.038251366
Bridgeport-Stamford-Norwalk, CT
0.038356164
Providence-Fall River-Warwick, MA-RI
0.038966725
Tampa-St. Petersburg-Clearwater, FL
0.039192399
Duluth, MN-WI
0.039682540
Syracuse, NY
0.040358744
Albuquerque, NM
0.041050903
Decatur, Al
0.041666667
Portland-South Portland, ME
0.042796006
Gainesville, FL
0.042857143
Detroit-Warren-Livonia, MI
0.043574594
Trenton-Ewing, NJ
0.043956044
New Haven, CT
0.047430830
Jacksonville, NC
0.047619048
Milwaukee-Waukesha-West Allis, WI
0.047619048
Jacksonville, FL
0.048346056
Burlington-South Burlington, VT
0.048706240
Anniston-Oxford, AL
0.049180328
Tulsa, OK
0.049535604
Raleigh-Cary, NC
0.050595238
Orlando, FL
0.050819672
Fayetteville-Springdale-Rogers, AR-MO
0.051162791
San Luis Obispo-Paso Robles, CA
0.051948052
Boston-Cambridge-Quincy, MA-NH
0.052041274
Austin-Round Rock, TX
0.052325581
Buffalo-Niagara Falls, NY
0.052325581
Springfield, IL
0.052631579
Iowa City, IA
0.053435115
Peoria, IL
0.053571429
Madison, WI
0.056338028
Merced, CA
0.056603774
Fort Smith, AR-OK
0.057142857
Nashville-Davidson-Murfreesboro, TN
0.057425743
Lancaster, PA
0.057692308
Baltimore-Towson, MD
0.057990560
Reno-Sparks, NV
0.058064516
Chicago-Naperville-Joliet, IN-IN-WI
0.058441558
Boulder, CO
0.058479532
Houston-Baytown-Sugar Land, TX
0.061249242
Riverside-San Bernardino, CA
0.062015504
Danbury, CT
0.062500000
Dallas-Fort Worth-Arlington, TX
0.062801932
Columbia, MO
0.063829787
Rochester, NY
0.065146580
Cedar Rapids, IA
0.066326531
Hartford-West Hartford-East Hartford, CT
0.066666667
Portland-Vancouver-Beaverton, OR-WA
0.069788797
Washington-Arlington-Alexandria, DC-VA-MD-WV
0.070624850
Atlanta-Sandy Springs-Marietta, GA
0.072809278
Norwich-New London, CT-RI
0.073891626
Lake Charles, LA
0.074074074
Oxnard-Thousand Oaks-Ventura, CA
0.074906367
Bloomington-Normal IL
0.075000000
Brownsville-Harlingen, TX
0.075949367
Minneapolis-St Paul-Bloomington, MN-WI
0.076725026
Las Vegas-Paradise, NV
0.078521940
Greensboro-High Point, NC
0.079681275
Bakersfield, CA
0.081632653
Ann Arbor, MI
0.082352941
La Crosse, WI
0.087719298
Green Bay, WI
0.088235294
Visalia-Porterville, CA
0.090909091
Seattle-Tacoma-Bellevue, WA
0.099601594
New York-Northern New Jersey-Long Island, NY-NJ-PA
0.104270660
Salinas, CA
0.125000000
Olympia, WA
0.131313131
Los Angeles-Long Beach-Santa Ana, CA
0.135056070
San Diego-Carlsbad-San Marcos, CA
0.142227122
Sacramento-Arden-Arcade-Roseville, CA
0.142428786
Atlantic City, NJ
0.144144144
Stockton, CA
0.155440415
Warner Robins, GA
0.166666667
Fresno, CA
0.184818482
Vallejo-Fairfield, CA
0.203007519
San Jose-Sunnyvale-Santa Clara, CA
0.241791045
San Francisco-Oakland-Fremont, CA
0.246753247
Honolulu, HI
0.501903553
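Rather than reading the count off a long sorted listing, the threshold can be applied directly to the tapply() result. This is a minimal sketch on invented data; the race labels other than "Asian" are placeholders, not values confirmed from the dataset.

```r
# Toy data: names and labels are invented for illustration only.
toy <- data.frame(
  MetroArea = c("A", "A", "B", "B", "B", "B", "C", "C"),
  Race      = c("Asian", "White", "Asian", "White", "White", "White",
                "Black", "White")
)

# Share of interviewees per area for whom the comparison is TRUE.
asian_share <- tapply(toy$Race == "Asian", toy$MetroArea, mean)

# Summing a logical vector counts its TRUE entries,
# i.e. the number of areas at or above the 20% threshold.
n_areas <- sum(asian_share >= 0.2)
print(n_areas)

# The matching area names, if you want to see which ones qualify.
print(names(asian_share)[asian_share >= 0.2])
```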
Normally, we would look at the sorted proportion of interviewees from each metropolitan area who have not received a high school diploma with the command:
sort(tapply(CPS$Education == "No high school diploma", CPS$MetroArea, mean))
However, none of the interviewees aged 14 and younger have an education value reported, so the mean value is reported as NA for each metropolitan area. To get mean (and related functions, like sum) to ignore missing values, you can pass the parameter na.rm=TRUE. Passing na.rm=TRUE to the tapply function, determine which metropolitan area has the smallest proportion of interviewees who have received no high school diploma.
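To see the effect of na.rm=TRUE, compare the two calls on a small made-up data frame. Extra arguments given to tapply() after the function name are passed through to that function, so na.rm=TRUE reaches mean. The education labels other than "No high school diploma" are placeholders.

```r
# Toy data: two areas, each with one missing Education value.
toy <- data.frame(
  MetroArea = c("A", "A", "A", "B", "B"),
  Education = c("No high school diploma", NA, "Bachelor's degree",
                "High school", NA)
)

no_diploma <- toy$Education == "No high school diploma"  # NA where Education is NA

# Without na.rm, a single NA in a group makes that group's mean NA.
with_na <- tapply(no_diploma, toy$MetroArea, mean)
print(with_na)

# With na.rm = TRUE forwarded to mean(), the NA rows are dropped per group.
without_na <- tapply(no_diploma, toy$MetroArea, mean, na.rm = TRUE)
print(sort(without_na))
```

Note that dropping the missing values changes the denominator: each proportion is taken over interviewees with a reported education level only.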
Just as we did with the metropolitan area information, merge in the country of birth information from the CountryMap data frame, replacing the CPS data frame with the result. If you accidentally overwrite CPS with the wrong values, remember that you can restore it by re-loading the data frame from CPSData.csv and then merging in the metropolitan area information using the command provided in the previous subproblem.
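As a rough sketch of this kind of merge, the example below joins a toy survey table to a toy lookup table. The lookup column names Key and Label and all code values are hypothetical; inspect the real CountryMap data frame with str() or names() to find its actual column names before adapting the call.

```r
# Toy survey rows; the codes are invented for illustration.
cps_toy <- data.frame(
  Age = c(30, 45, 52),
  CountryOfBirthCode = c(1, 2, 99)
)

# Toy lookup table; "Key" and "Label" are hypothetical column names.
map_toy <- data.frame(
  Key   = c(1, 2),
  Label = c("Country One", "Country Two")
)

# by.x / by.y name the matching column in each table.
# all.x = TRUE keeps survey rows with no match, filling Label with NA.
merged <- merge(cps_toy, map_toy,
                by.x = "CountryOfBirthCode", by.y = "Key",
                all.x = TRUE)
print(merged)

# Count rows that did not find a match in the lookup table.
print(sum(is.na(merged$Label)))
```

Without all.x=TRUE, merge() would silently drop the unmatched row, which is why the row count is worth checking before and after.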
What is the name of the variable added to the CPS data frame by this merge operation?
How many interviewees have a missing value for the new country of birth variable?
Among all interviewees born outside of North America, which country was the most common place of birth?
What proportion of the interviewees from the “New York-Northern New Jersey-Long Island, NY-NJ-PA” metropolitan area have a country of birth that is not the United States? For this computation, don’t include people from this metropolitan area who have a missing country of birth.
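One way to compute a proportion within a single area while excluding missing values is to subset first and then take a mean with na.rm=TRUE. The sketch below uses invented data and a hypothetical column name, BirthCountry, for the merged-in variable.

```r
# Toy data: "BirthCountry" is a hypothetical name for the merged-in column.
toy <- data.frame(
  MetroArea    = c("X", "X", "X", "X", "Y", NA),
  BirthCountry = c("United States", "India", NA, "United States",
                   "Brazil", "India")
)

# %in% returns FALSE (not NA) for rows with a missing MetroArea,
# so those rows are cleanly excluded from the subset.
in_x <- toy$MetroArea %in% "X"

# TRUE where the country differs from the United States, NA where it is missing;
# na.rm = TRUE removes the missing rows from numerator and denominator alike.
foreign_share <- mean(toy$BirthCountry[in_x] != "United States", na.rm = TRUE)
print(foreign_share)
```

Here area X has three interviewees with a known country, one of them born elsewhere, so the share is one third.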
Which metropolitan area has the largest number (note – not proportion) of interviewees with a country of birth in India? Hint – remember to include na.rm=TRUE if you are using tapply() to answer this question.
In Brazil?
In Somalia?
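Since the same count is needed for several countries, it may help to wrap the tapply() call in a small helper. This is only a sketch on made-up data: the helper name top_area_for and the column name BirthCountry are hypothetical. Using sum instead of mean gives counts rather than proportions.

```r
# Toy data: "BirthCountry" is a hypothetical name for the merged-in column.
toy <- data.frame(
  MetroArea    = c("X", "X", "Y", "Y", "Y", "Z"),
  BirthCountry = c("India", NA, "India", "India", "Brazil", "Somalia")
)

# Hypothetical helper: the area with the most interviewees born in `country`.
top_area_for <- function(country) {
  # sum() of a logical vector counts TRUE values; na.rm = TRUE skips
  # interviewees whose country of birth is missing.
  counts <- tapply(toy$BirthCountry == country, toy$MetroArea, sum, na.rm = TRUE)
  # which.max() returns the position of the first maximum, keeping its name.
  names(which.max(counts))
}

print(top_area_for("India"))
print(top_area_for("Brazil"))
print(top_area_for("Somalia"))
```

If two areas tie, which.max() reports only the first, so sorting the counts and inspecting the tail is a safer check on real data.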