In the wake of the Great Recession of 2009, there has been a good deal of focus on employment statistics, one of the most important metrics policymakers use to gauge the overall strength of the economy. In the United States, the government measures unemployment using the Current Population Survey (CPS), which collects demographic and employment information from a wide range of Americans each month. In this exercise, we will employ the topics reviewed in the lectures as well as a few new techniques using the September 2013 version of this rich, nationally representative dataset (available online).
The observations in the dataset represent people surveyed in the September 2013 CPS who actually completed a survey. While the full dataset has 385 variables, in this exercise we will use a more compact version of the dataset, CPSData.csv, which has the following variables:
PeopleInHousehold: The number of people in the interviewee’s household.
Region: The census region where the interviewee lives.
State: The state where the interviewee lives.
MetroAreaCode: A code that identifies the metropolitan area in which the interviewee lives (missing if the interviewee does not live in a metropolitan area). The mapping from codes to names of metropolitan areas is provided in the file MetroAreaCodes.csv.
Age: The age, in years, of the interviewee. 80 represents people aged 80-84, and 85 represents people aged 85 and higher.
Married: The marriage status of the interviewee.
Sex: The sex of the interviewee.
Education: The maximum level of education obtained by the interviewee.
Race: The race of the interviewee.
Hispanic: Whether the interviewee is of Hispanic ethnicity.
CountryOfBirthCode: A code identifying the country of birth of the interviewee. The mapping from codes to names of countries is provided in the file CountryCodes.csv.
Citizenship: The United States citizenship status of the interviewee.
EmploymentStatus: The status of employment of the interviewee.
Industry: The industry of employment of the interviewee (only available if they are employed).
Load the dataset from CPSData.csv into a data frame called CPS, and view the dataset with the summary() and str() commands.
setwd("/Users/Tommy/Documents/BAhw/Unit1")
CPS <- read.csv("CPSData.csv")
summary(CPS)
PeopleInHousehold Region State MetroAreaCode Age
Min. : 1.000 Midwest :30684 California :11570 Min. :10420 Min. : 0.00
1st Qu.: 2.000 Northeast:25939 Texas : 7077 1st Qu.:21780 1st Qu.:19.00
Median : 3.000 South :41502 New York : 5595 Median :34740 Median :39.00
Mean : 3.284 West :33177 Florida : 5149 Mean :35075 Mean :38.83
3rd Qu.: 4.000 Pennsylvania: 3930 3rd Qu.:41860 3rd Qu.:57.00
Max. :15.000 Illinois : 3912 Max. :79600 Max. :85.00
(Other) :94069 NA's :34238
Married Sex Education Race
Divorced :11151 Female:67481 High school :30906 American Indian : 1433
Married :55509 Male :63821 Bachelor's degree :19443 Asian : 6520
Never Married:30772 Some college, no degree:18863 Black : 13913
Separated : 2027 No high school diploma :16095 Multiracial : 2897
Widowed : 6505 Associate degree : 9913 Pacific Islander: 618
NA's :25338 (Other) :10744 White :105921
NA's :25338
Hispanic CountryOfBirthCode Citizenship EmploymentStatus
Min. :0.0000 Min. : 57.00 Citizen, Native :116639 Disabled : 5712
1st Qu.:0.0000 1st Qu.: 57.00 Citizen, Naturalized: 7073 Employed :61733
Median :0.0000 Median : 57.00 Non-Citizen : 7590 Not in Labor Force:15246
Mean :0.1393 Mean : 82.68 Retired :18619
3rd Qu.:0.0000 3rd Qu.: 57.00 Unemployed : 4203
Max. :1.0000 Max. :555.00 NA's :25789
Industry
Educational and health services :15017
Trade : 8933
Professional and business services: 7519
Manufacturing : 6791
Leisure and hospitality : 6364
(Other) :21618
NA's :65060
str(CPS)
'data.frame': 131302 obs. of 14 variables:
$ PeopleInHousehold : int 1 3 3 3 3 3 3 2 2 2 ...
$ Region : Factor w/ 4 levels "Midwest","Northeast",..: 3 3 3 3 3 3 3 3 3 3 ...
$ State : Factor w/ 51 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
$ MetroAreaCode : int 26620 13820 13820 13820 26620 26620 26620 33660 33660 26620 ...
$ Age : int 85 21 37 18 52 24 26 71 43 52 ...
$ Married : Factor w/ 5 levels "Divorced","Married",..: 5 3 3 3 5 3 3 1 1 3 ...
$ Sex : Factor w/ 2 levels "Female","Male": 1 2 1 2 1 2 2 1 2 2 ...
$ Education : Factor w/ 8 levels "Associate degree",..: 1 4 4 6 1 2 4 4 4 2 ...
$ Race : Factor w/ 6 levels "American Indian",..: 6 3 3 3 6 6 6 6 6 6 ...
$ Hispanic : int 0 0 0 0 0 0 0 0 0 0 ...
$ CountryOfBirthCode: int 57 57 57 57 57 57 57 57 57 57 ...
$ Citizenship : Factor w/ 3 levels "Citizen, Native",..: 1 1 1 1 1 1 1 1 1 1 ...
$ EmploymentStatus : Factor w/ 5 levels "Disabled","Employed",..: 4 5 1 3 2 2 2 2 3 2 ...
$ Industry : Factor w/ 14 levels "Agriculture, forestry, fishing, and hunting",..: NA 11 NA NA 11 4 14 4 NA 12 ...
nrow(CPS)
[1] 131302
Among the interviewees with a value reported for the Industry variable, what is the most common industry of employment? Please enter the name exactly how you see it.
print("Educational and health services",quote = FALSE)
[1] Educational and health services
Recall from the homework assignment “The Analytical Detective” that you can call the sort() function on the output of the table() function to obtain a sorted breakdown of a variable. For instance, sort(table(CPS$Region)) sorts the regions by the number of interviewees from that region.
Which state has the fewest interviewees?
noquote(names(sort(table(CPS$State)))[1])
[1] New Mexico
Which state has the largest number of interviewees?
noquote(names(sort(table(CPS$State)))[51])
[1] California
What proportion of interviewees are citizens of the United States?
sum(table(CPS$Citizenship)[1:2]) / nrow(CPS)
[1] 0.9421943
The CPS differentiates between race (with possible values American Indian, Asian, Black, Pacific Islander, White, or Multiracial) and ethnicity. A number of interviewees are of Hispanic ethnicity, as captured by the Hispanic variable. For which races are there at least 250 interviewees in the CPS dataset of Hispanic ethnicity? (Select all that apply.)
table(CPS$Race,CPS$Hispanic)
0 1
American Indian 1129 304
Asian 6407 113
Black 13292 621
Multiracial 2449 448
Pacific Islander 541 77
White 89190 16731
print("American Indian",quote = FALSE)
[1] American Indian
print("Black",quote = FALSE)
[1] Black
print("Multiracial",quote = FALSE)
[1] Multiracial
print("White",quote = FALSE)
[1] White
Which variables have at least one interviewee with a missing (NA) value? (Select all that apply.)
summary(CPS)
PeopleInHousehold Region State MetroAreaCode Age
Min. : 1.000 Midwest :30684 California :11570 Min. :10420 Min. : 0.00
1st Qu.: 2.000 Northeast:25939 Texas : 7077 1st Qu.:21780 1st Qu.:19.00
Median : 3.000 South :41502 New York : 5595 Median :34740 Median :39.00
Mean : 3.284 West :33177 Florida : 5149 Mean :35075 Mean :38.83
3rd Qu.: 4.000 Pennsylvania: 3930 3rd Qu.:41860 3rd Qu.:57.00
Max. :15.000 Illinois : 3912 Max. :79600 Max. :85.00
(Other) :94069 NA's :34238
Married Sex Education Race
Divorced :11151 Female:67481 High school :30906 American Indian : 1433
Married :55509 Male :63821 Bachelor's degree :19443 Asian : 6520
Never Married:30772 Some college, no degree:18863 Black : 13913
Separated : 2027 No high school diploma :16095 Multiracial : 2897
Widowed : 6505 Associate degree : 9913 Pacific Islander: 618
NA's :25338 (Other) :10744 White :105921
NA's :25338
Hispanic CountryOfBirthCode Citizenship EmploymentStatus
Min. :0.0000 Min. : 57.00 Citizen, Native :116639 Disabled : 5712
1st Qu.:0.0000 1st Qu.: 57.00 Citizen, Naturalized: 7073 Employed :61733
Median :0.0000 Median : 57.00 Non-Citizen : 7590 Not in Labor Force:15246
Mean :0.1393 Mean : 82.68 Retired :18619
3rd Qu.:0.0000 3rd Qu.: 57.00 Unemployed : 4203
Max. :1.0000 Max. :555.00 NA's :25789
Industry
Educational and health services :15017
Trade : 8933
Professional and business services: 7519
Manufacturing : 6791
Leisure and hospitality : 6364
(Other) :21618
NA's :65060
print("MetroAreaCode",quote = FALSE)
[1] MetroAreaCode
print("Married",quote = FALSE)
[1] Married
print("Education",quote = FALSE)
[1] Education
print("EmploymentStatus",quote = FALSE)
[1] EmploymentStatus
print("Industry",quote = FALSE)
[1] Industry
Often when evaluating a new dataset, we try to identify if there is a pattern in the missing values in the dataset. We will try to determine if there is a pattern in the missing values of the Married variable. The function
is.na(CPS$Married)
returns a vector of TRUE/FALSE values for whether the Married variable is missing. We can see the breakdown of whether Married is missing based on the reported value of the Region variable with the function
table(CPS$Region, is.na(CPS$Married))
Which is the most accurate:
table(CPS$Region, is.na(CPS$Married))
FALSE TRUE
Midwest 24609 6075
Northeast 21432 4507
South 33535 7967
West 26388 6789
table(CPS$Sex, is.na(CPS$Married))
FALSE TRUE
Female 55264 12217
Male 50700 13121
table(CPS$Age, is.na(CPS$Married))
FALSE TRUE
0 0 1283
1 0 1559
2 0 1574
3 0 1693
4 0 1695
5 0 1795
6 0 1721
7 0 1681
8 0 1729
9 0 1748
10 0 1750
11 0 1721
12 0 1797
13 0 1802
14 0 1790
15 1795 0
16 1751 0
17 1764 0
18 1596 0
19 1517 0
20 1398 0
21 1525 0
22 1536 0
23 1638 0
24 1627 0
25 1604 0
26 1643 0
27 1657 0
28 1736 0
29 1645 0
30 1854 0
31 1762 0
32 1790 0
33 1804 0
34 1653 0
35 1716 0
36 1663 0
37 1531 0
38 1530 0
39 1542 0
40 1571 0
41 1673 0
42 1711 0
43 1819 0
44 1764 0
45 1749 0
46 1665 0
47 1647 0
48 1791 0
49 1989 0
50 1966 0
51 1931 0
52 1935 0
53 1994 0
54 1912 0
55 1895 0
56 1935 0
57 1827 0
58 1874 0
59 1758 0
60 1746 0
61 1735 0
62 1595 0
63 1596 0
64 1519 0
65 1569 0
66 1577 0
67 1227 0
68 1130 0
69 1062 0
70 1195 0
71 1031 0
72 941 0
73 896 0
74 842 0
75 763 0
76 729 0
77 698 0
78 659 0
79 661 0
80 2664 0
85 2446 0
table(CPS$Citizenship, is.na(CPS$Married))
FALSE TRUE
Citizen, Native 91956 24683
Citizen, Naturalized 6910 163
Non-Citizen 7098 492
print("The Married variable being missing is related to the Age value for the interviewee.",quote = FALSE)
[1] The Married variable being missing is related to the Age value for the interviewee.
As mentioned in the variable descriptions, MetroAreaCode is missing if an interviewee does not live in a metropolitan area. Using the same technique as in the previous question, answer the following questions about people who live in non-metropolitan areas.
How many states had all interviewees living in a non-metropolitan area (aka they have a missing MetroAreaCode value)? For this question, treat the District of Columbia as a state (even though it is not technically a state).
sum(table(CPS$State,is.na(CPS$MetroAreaCode))[,1]==0)
[1] 2
How many states had all interviewees living in a metropolitan area? Again, treat the District of Columbia as a state.
sum(table(CPS$State,is.na(CPS$MetroAreaCode))[,2]==0)
[1] 3
Which region of the United States has the largest proportion of interviewees living in a non-metropolitan area?
temp <- table(CPS$Region,is.na(CPS$MetroAreaCode))
prop <- temp[,2]/(temp[,1]+temp[,2])
noquote(names(prop)[prop == max(prop)])
[1] Midwest
While we were able to use the table() command to compute the proportion of interviewees from each region not living in a metropolitan area, it was somewhat tedious (it involved manually computing the proportion for each region) and isn’t something you would want to do if there were a larger number of options. It turns out there is a less tedious way to compute the proportion of values that are TRUE. The mean() function, which takes the average of the values passed to it, will treat TRUE as 1 and FALSE as 0, meaning it returns the proportion of values that are true. For instance, mean(c(TRUE, FALSE, TRUE, TRUE)) returns 0.75. Knowing this, use tapply() with the mean function to answer the following questions:
Which state has a proportion of interviewees living in a non-metropolitan area closest to 30%?
temp <- abs(sort(tapply(is.na(CPS$MetroAreaCode), CPS$State, mean))-0.3)
noquote(names(temp)[temp == min(temp)])
[1] Wisconsin
Which state has the largest proportion of non-metropolitan interviewees, ignoring states where all interviewees were non-metropolitan?
temp <- sort(tapply(is.na(CPS$MetroAreaCode), CPS$State, mean),decreasing = TRUE)
noquote(names(temp)[3])
[1] Montana
Codes like MetroAreaCode and CountryOfBirthCode are a compact way to encode factor variables with text as their possible values, and they are therefore quite common in survey datasets. In fact, all but one of the variables in this dataset were actually stored by a numeric code in the original CPS datafile.
When analyzing a variable stored by a numeric code, we will often want to convert it into the values the codes represent. To do this, we will use a dictionary, which maps the the code to the actual value of the variable. We have provided dictionaries MetroAreaCodes.csv and CountryCodes.csv, which respectively map MetroAreaCode and CountryOfBirthCode into their true values. Read these two dictionaries into data frames MetroAreaMap and CountryMap.
How many observations (codes for metropolitan areas) are there in MetroAreaMap?
MetroAreaMap <- read.csv("MetroAreaCodes.csv")
nrow(MetroAreaMap)
[1] 271
How many observations (codes for countries) are there in CountryMap?
CountryMap <- read.csv("CountryCodes.csv")
nrow(CountryMap)
[1] 149
To merge in the metropolitan areas, we want to connect the field MetroAreaCode from the CPS data frame with the field Code in MetroAreaMap. The following command merges the two data frames on these columns, overwriting the CPS data frame with the result:
CPS = merge(CPS, MetroAreaMap, by.x="MetroAreaCode", by.y="Code", all.x=TRUE)
The first two arguments determine the data frames to be merged (they are called “x” and “y”, respectively, in the subsequent parameters to the merge function). by.x=“MetroAreaCode” means we’re matching on the MetroAreaCode variable from the “x” data frame (CPS), while by.y=“Code” means we’re matching on the Code variable from the “y” data frame (MetroAreaMap). Finally, all.x=TRUE means we want to keep all rows from the “x” data frame (CPS), even if some of the rows’ MetroAreaCode doesn’t match any codes in MetroAreaMap (for those familiar with database terminology, this parameter makes the operation a left outer join instead of an inner join).
Review the new version of the CPS data frame with the summary() and str() functions. What is the name of the variable that was added to the data frame by the merge() operation?
CPS <- merge(CPS, MetroAreaMap, by.x="MetroAreaCode", by.y="Code", all.x=TRUE)
summary(CPS)
MetroAreaCode PeopleInHousehold Region State Age
Min. :10420 Min. : 1.000 Midwest :30684 California :11570 Min. : 0.00
1st Qu.:21780 1st Qu.: 2.000 Northeast:25939 Texas : 7077 1st Qu.:19.00
Median :34740 Median : 3.000 South :41502 New York : 5595 Median :39.00
Mean :35075 Mean : 3.284 West :33177 Florida : 5149 Mean :38.83
3rd Qu.:41860 3rd Qu.: 4.000 Pennsylvania: 3930 3rd Qu.:57.00
Max. :79600 Max. :15.000 Illinois : 3912 Max. :85.00
NA's :34238 (Other) :94069
Married Sex Education Race
Divorced :11151 Female:67481 High school :30906 American Indian : 1433
Married :55509 Male :63821 Bachelor's degree :19443 Asian : 6520
Never Married:30772 Some college, no degree:18863 Black : 13913
Separated : 2027 No high school diploma :16095 Multiracial : 2897
Widowed : 6505 Associate degree : 9913 Pacific Islander: 618
NA's :25338 (Other) :10744 White :105921
NA's :25338
Hispanic CountryOfBirthCode Citizenship EmploymentStatus
Min. :0.0000 Min. : 57.00 Citizen, Native :116639 Disabled : 5712
1st Qu.:0.0000 1st Qu.: 57.00 Citizen, Naturalized: 7073 Employed :61733
Median :0.0000 Median : 57.00 Non-Citizen : 7590 Not in Labor Force:15246
Mean :0.1393 Mean : 82.68 Retired :18619
3rd Qu.:0.0000 3rd Qu.: 57.00 Unemployed : 4203
Max. :1.0000 Max. :555.00 NA's :25789
Industry MetroArea
Educational and health services :15017 New York-Northern New Jersey-Long Island, NY-NJ-PA: 5409
Trade : 8933 Washington-Arlington-Alexandria, DC-VA-MD-WV : 4177
Professional and business services: 7519 Los Angeles-Long Beach-Santa Ana, CA : 4102
Manufacturing : 6791 Philadelphia-Camden-Wilmington, PA-NJ-DE : 2855
Leisure and hospitality : 6364 Chicago-Naperville-Joliet, IN-IN-WI : 2772
(Other) :21618 (Other) :77749
NA's :65060 NA's :34238
str(CPS)
'data.frame': 131302 obs. of 15 variables:
$ MetroAreaCode : int 10420 10420 10420 10420 10420 10420 10420 10420 10420 10420 ...
$ PeopleInHousehold : int 4 4 2 4 1 3 4 4 2 3 ...
$ Region : Factor w/ 4 levels "Midwest","Northeast",..: 1 1 1 1 1 1 1 1 1 1 ...
$ State : Factor w/ 51 levels "Alabama","Alaska",..: 36 36 36 36 36 36 36 36 36 36 ...
$ Age : int 2 9 73 40 63 19 30 6 60 32 ...
$ Married : Factor w/ 5 levels "Divorced","Married",..: NA NA 2 2 3 3 2 NA 2 2 ...
$ Sex : Factor w/ 2 levels "Female","Male": 2 2 1 1 2 1 1 1 1 2 ...
$ Education : Factor w/ 8 levels "Associate degree",..: NA NA 8 4 6 4 2 NA 4 4 ...
$ Race : Factor w/ 6 levels "American Indian",..: 6 6 6 6 6 6 2 6 6 6 ...
$ Hispanic : int 0 0 0 0 0 0 0 1 0 0 ...
$ CountryOfBirthCode: int 57 57 57 362 57 57 203 57 57 57 ...
$ Citizenship : Factor w/ 3 levels "Citizen, Native",..: 1 1 1 2 1 1 3 1 1 1 ...
$ EmploymentStatus : Factor w/ 5 levels "Disabled","Employed",..: NA NA 4 3 1 2 3 NA 2 2 ...
$ Industry : Factor w/ 14 levels "Agriculture, forestry, fishing, and hunting",..: NA NA NA NA NA 7 NA NA 4 13 ...
$ MetroArea : Factor w/ 271 levels "Akron, OH","Albany-Schenectady-Troy, NY",..: 1 1 1 1 1 1 1 1 1 1 ...
print("MetroArea",quote = FALSE)
[1] MetroArea
How many interviewees have a missing value for the new metropolitan area variable? Note that all of these interviewees would have been removed from the merged data frame if we did not include the all.x=TRUE parameter.
sum(is.na(CPS$MetroArea))
[1] 34238
Which of the following metropolitan areas has the largest number of interviewees?
sort(table(CPS$MetroArea),decreasing = TRUE)
New York-Northern New Jersey-Long Island, NY-NJ-PA Washington-Arlington-Alexandria, DC-VA-MD-WV
5409 4177
Los Angeles-Long Beach-Santa Ana, CA Philadelphia-Camden-Wilmington, PA-NJ-DE
4102 2855
Chicago-Naperville-Joliet, IN-IN-WI Providence-Fall River-Warwick, MA-RI
2772 2284
Boston-Cambridge-Quincy, MA-NH Minneapolis-St Paul-Bloomington, MN-WI
2229 1942
Dallas-Fort Worth-Arlington, TX Houston-Baytown-Sugar Land, TX
1863 1649
Honolulu, HI Miami-Fort Lauderdale-Miami Beach, FL
1576 1554
Atlanta-Sandy Springs-Marietta, GA Denver-Aurora, CO
1552 1504
Baltimore-Towson, MD San Francisco-Oakland-Fremont, CA
1483 1386
Detroit-Warren-Livonia, MI Las Vegas-Paradise, NV
1354 1299
Riverside-San Bernardino, CA Seattle-Tacoma-Bellevue, WA
1290 1255
Portland-Vancouver-Beaverton, OR-WA Phoenix-Mesa-Scottsdale, AZ
1089 971
Kansas City, MO-KS Omaha-Council Bluffs, NE-IA
962 957
St. Louis, MO-IL San Diego-Carlsbad-San Marcos, CA
956 907
Hartford-West Hartford-East Hartford, CT Tampa-St. Petersburg-Clearwater, FL
885 842
Pittsburgh, PA Bridgeport-Stamford-Norwalk, CT
732 730
Salt Lake City, UT Cincinnati-Middletown, OH-KY-IN
723 719
Milwaukee-Waukesha-West Allis, WI Portland-South Portland, ME
714 701
Cleveland-Elyria-Mentor, OH San Jose-Sunnyvale-Santa Clara, CA
681 670
Sacramento-Arden-Arcade-Roseville, CA Burlington-South Burlington, VT
667 657
Boise City-Nampa, ID Orlando, FL
644 610
Albuquerque, NM San Antonio, TX
609 607
Oklahoma City, OK Virginia Beach-Norfolk-Newport News, VA-NC
604 597
Sioux Falls, SD Indianapolis, IN
595 570
Columbus, OH Louisville, KY-IN
551 519
Charlotte-Gastonia-Concord, NC-SC Austin-Round Rock, TX
517 516
New Haven, CT Nashville-Davidson-Murfreesboro, TN
506 505
Des Moines, IA Richmond, VA
501 490
Dover, DE Fargo, ND-MN
456 432
Wichita, KS Ogden-Clearfield, UT
427 423
Little Rock-North Little Rock, AR Jacksonville, FL
404 393
Birmingham-Hoover, AL Colorado Springs, CO
392 372
New Orleans-Metairie-Kenner, LA Memphis, TN-MS-AR
367 348
Buffalo-Niagara Falls, NY Raleigh-Cary, NC
344 336
Allentown-Bethlehem-Easton, PA-NJ Tulsa, OK
334 323
Reno-Sparks, NV Provo-Orem, UT
310 309
Rochester, NY Grand Rapids-Wyoming, MI
307 304
Fresno, CA Tucson, AZ
303 302
Columbia, SC Madison, WI
291 284
Albany-Schenectady-Troy, NY Dayton, OH
268 268
Oxnard-Thousand Oaks-Ventura, CA Baton Rouge, LA
267 262
Charleston, WV Rochester-Dover, NH-ME
262 262
Greensboro-High Point, NC Bakersfield, CA
251 245
El Paso, TX Davenport-Moline-Rock Island, IA-IL
244 240
Toledo, OH Charleston-North Charleston, SC
235 232
Akron, OH Syracuse, NY
231 223
Jackson, MS Fayetteville-Springdale-Rogers, AR-MO
222 215
Bangor, ME Fort Collins-Loveland, CO
208 206
Norwich-New London, CT-RI Savannah, GA
203 202
Poughkeepsie-Newburgh-Middletown, NY Billings, MT
201 199
Lexington-Fayette, KY Cedar Rapids, IA
198 196
Eugene-Springfield, OR McAllen-Edinburg-Pharr, TX
196 195
Stockton, CA Sarasota-Bradenton-Venice, FL
193 192
Durham, NC Greenville, SC
189 185
Topeka, KS Lafayette, LA
182 181
Monroe, LA Scranton-Wilkes Barre, PA
179 176
Harrisburg-Carlisle, PA Boulder, CO
174 171
Salem, OR Knoxville, TN
170 168
Palm Bay-Melbourne-Titusville, FL Chattanooga, TN-GA
168 167
Greeley, CO Augusta-Richmond County, GA-SC
162 161
Springfield, MO Modesto, CA
161 158
Waterbury, CT Lancaster, PA
157 156
Spokane, WA Waterloo-Cedar Falls, IA
156 156
Springfield, MA-CT Youngstown-Warren-Boardman, OH
155 153
Lakeland-Winter Haven, FL Cape Coral-Fort Myers, FL
149 146
Shreveport-Bossier City, LA Worcester, MA-CT
146 144
Reading, PA Bend, OR
142 140
Deltona-Daytona Beach-Ormond Beach, FL Fort Wayne, IN
140 136
Green Bay, WI Vallejo-Fairfield, CA
136 133
Corpus Christi, TX Santa Barbara-Santa Maria-Goleta, CA
132 132
Iowa City, IA Pueblo, CO
131 130
Santa Rosa-Petaluma, CA Kalamazoo-Portage, MI
129 127
Winston-Salem, NC Duluth, MN-WI
127 126
Appleton,WI Beaumont-Port Author, TX
125 123
Champaign-Urbana, IL Visalia-Porterville, CA
122 121
Lansing-East Lansing, MI Racine, WI
119 119
Canton-Massillon, OH Coeur d'Alene, ID
118 117
Huntsville, AL York-Hanover, PA
117 117
Asheville, NC Victoria, TX
116 116
La Crosse, WI Rockford, IL
114 114
Danbury, CT Peoria, IL
112 112
Yakima, WA Atlantic City, NJ
112 111
Eau Claire, WI Mobile, AL
110 110
Port St. Lucie-Fort Pierce, FL Las Cruses, NM
109 107
Pensacola-Ferry Pass-Brent, FL Merced, CA
107 106
Fort Smith, AR-OK Bloomington, IN
105 104
Salinas, CA Montgomery, AL
104 103
Flint, MI Myrtle Beach-Conway-North Myrtle Beach, SC
102 102
Killeen-Temple-Fort Hood, TX El Centro, CA
101 99
Evansville, IN-KY Janesville, WI
99 99
Olympia, WA Spartanburg, SC
99 99
Lawrence, KS Lawton, OK
98 97
Decatur, Al Wausau, WI
96 96
Trenton-Ewing, NJ Harrisonburg, VA
91 90
Muskegon-Norton Shores, MI Laredo, TX
90 89
Amarillo, TX Bremerton-Silverdale, WA
88 87
Erie, PA Kankakee-Bradley, IL
87 87
Kingston, NY Hagerstown-Martinsburg, MD-WV
87 86
Ann Arbor, MI Oshkosh-Neenah, WI
85 85
Altoona, PA Huntington-Ashland, WV-KY-OH
82 82
Medford, OR Naples-Marco Island, FL
82 82
St. Cloud, MN Decatur, IL
82 81
Lake Charles, LA South Bend-Mishawaka, IN-MI
81 81
Fort Walton Beach-Crestview-Destin, FL Utica-Rome, NY
80 80
Brownsville-Harlingen, TX Vero Beach, FL
79 79
Waco, TX Holland-Grand Haven, MI
79 78
Tuscaloosa, AL Fayetteville, NC
78 77
Michigan City-La Porte, IN San Luis Obispo-Paso Robles, CA
77 77
Ocala, FL Springfield, IL
76 76
Barnstable Town, MA Saginaw-Saginaw Township North, MI
75 74
Salisbury, MD Binghamton, NY
74 73
Lynchburg, VA Bellingham, WA
73 70
Gainesville, FL Jackson, MI
70 70
Albany, GA Kingsport-Bristol, TN-VA
68 67
Leominster-Fitchburg-Gardner, MA Roanoke, VA
66 66
Santa-Cruz-Watsonville, CA Athens-Clark County, GA
66 65
Gulfport-Biloxi, MS Longview, TX
65 65
Macon, GA Anderson, SC
65 64
Farmington, NM Florence, AL
64 63
Jacksonville, NC Johnstown, PA
63 63
Lubbock, TX Monroe, MI
63 63
Anderson, IN Anniston-Oxford, AL
62 61
Napa, CA Chico, CA
61 60
Columbus, GA-AL Joplin, MO
59 59
Panama City-Lynn Haven, FL Hickory-Morgantown-Lenoir, NC
59 57
Madera, CA Prescott, AZ
57 54
Vineland-Millville-Bridgeton, NJ Johnson City, TN
54 52
Santa Fe, NM Midland, TX
52 51
Niles-Benton Harbor, MI Punta Gorda, FL
51 48
Columbia, MO Tallahassee, FL
47 43
Valdosta, GA Warner Robins, GA
42 42
Bloomington-Normal IL Springfield, OH
40 34
Ocean City, NJ Bowling Green, KY
30 29
Appleton-Oshkosh-Neenah, WI Grand Rapids-Muskegon-Holland, MI
0 0
Greenville-Spartanburg-Anderson, SC Hinesville-Fort Stewart, GA
0 0
Jamestown, NY Kalamazoo-Battle Creek, MI
0 0
Portsmouth-Rochester, NH-ME
0
print("Boston-Cambridge-Quincy, MA-NH",quote = FALSE)
[1] Boston-Cambridge-Quincy, MA-NH
Which metropolitan area has the highest proportion of interviewees of Hispanic ethnicity? Hint: Use tapply() with mean, as in the previous subproblem. Calling sort() on the output of tapply() could also be helpful here.
temp <- sort(tapply(CPS$Hispanic,CPS$MetroArea,mean),decreasing = TRUE)
noquote(names(temp)[1])
[1] Laredo, TX
Remembering that CPS$Race == “Asian” returns a TRUE/FALSE vector of whether an interviewee is Asian, determine the number of metropolitan areas in the United States from which at least 20% of interviewees are Asian.
sum(tapply(CPS$Race == "Asian", CPS$MetroArea, mean)>=0.2,na.rm = TRUE)
[1] 4
Normally, we would look at the sorted proportion of interviewees from each metropolitan area who have not received a high school diploma with the command:
sort(tapply(CPS$Education == "No high school diploma", CPS$MetroArea, mean))
However, none of the interviewees aged 14 and younger have an education value reported, so the mean value is reported as NA for each metropolitan area. To get mean (and related functions, like sum) to ignore missing values, you can pass the parameter na.rm=TRUE. Passing na.rm=TRUE to the tapply function, determine which metropolitan area has the smallest proportion of interviewees who have received no high school diploma.
temp <- sort(tapply(CPS$Education == "No high school diploma", CPS$MetroArea, mean,na.rm = TRUE))
noquote(names(temp)[1])
[1] Iowa City, IA
Just as we did with the metropolitan area information, merge in the country of birth information from the CountryMap data frame, replacing the CPS data frame with the result. If you accidentally overwrite CPS with the wrong values, remember that you can restore it by re-loading the data frame from CPSData.csv and then merging in the metropolitan area information using the command provided in the previous subproblem.
What is the name of the variable added to the CPS data frame by this merge operation?
CPS <- merge(CPS, CountryMap, by.x="CountryOfBirthCode", by.y="Code", all.x=TRUE)
str(CPS)
'data.frame': 131302 obs. of 16 variables:
$ CountryOfBirthCode: int 57 57 57 57 57 57 57 57 57 57 ...
$ MetroAreaCode : int 10420 71650 10420 10420 10420 10420 10420 10420 10420 10420 ...
$ PeopleInHousehold : int 2 4 5 2 2 3 1 3 4 4 ...
$ Region : Factor w/ 4 levels "Midwest","Northeast",..: 1 2 1 1 1 1 1 1 1 1 ...
$ State : Factor w/ 51 levels "Alabama","Alaska",..: 36 30 36 36 36 36 36 36 36 36 ...
$ Age : int 73 5 10 30 30 0 34 32 6 9 ...
$ Married : Factor w/ 5 levels "Divorced","Married",..: 2 NA NA 2 2 NA 1 2 NA NA ...
$ Sex : Factor w/ 2 levels "Female","Male": 1 1 1 1 1 2 2 2 1 2 ...
$ Education : Factor w/ 8 levels "Associate degree",..: 8 NA NA 1 2 NA 4 4 NA NA ...
$ Race : Factor w/ 6 levels "American Indian",..: 6 6 6 6 6 6 6 6 6 6 ...
$ Hispanic : int 0 0 0 0 0 0 0 0 1 0 ...
$ Citizenship : Factor w/ 3 levels "Citizen, Native",..: 1 1 1 1 1 1 1 1 1 1 ...
$ EmploymentStatus : Factor w/ 5 levels "Disabled","Employed",..: 4 NA NA 2 2 NA 2 2 NA NA ...
$ Industry : Factor w/ 14 levels "Agriculture, forestry, fishing, and hunting",..: NA NA NA 13 9 NA 3 13 NA NA ...
$ MetroArea : Factor w/ 271 levels "Akron, OH","Albany-Schenectady-Troy, NY",..: 1 34 1 1 1 1 1 1 1 1 ...
$ Country : Factor w/ 149 levels "Afghanistan",..: 139 139 139 139 139 139 139 139 139 139 ...
print("Country",quote = FALSE)
[1] Country
How many interviewees have a missing value for the new country of birth variable?
sum(is.na(CPS$Country))
[1] 176
Among all interviewees born outside of North America, which country was the most common place of birth?
temp <- sort(table(CPS$Country),decreasing = TRUE)
noquote(names(temp)[3])
[1] Philippines
What proportion of the interviewees from the “New York-Northern New Jersey-Long Island, NY-NJ-PA” metropolitan area have a country of birth that is not the United States? For this computation, don’t include people from this metropolitan area who have a missing country of birth.
temp <- table(CPS$MetroArea == "New York-Northern New Jersey-Long Island, NY-NJ-PA", CPS$Country != "United States")
temp[2,2]/(temp[2,1]+temp[2,2])
[1] 0.3086603
Which metropolitan area has the largest number (note – not proportion) of interviewees with a country of birth in India? Hint – remember to include na.rm=TRUE if you are using tapply() to answer this question.
noquote(names(sort(table(CPS$MetroArea,CPS$Country == "India")[,2],decreasing = TRUE))[1])
[1] New York-Northern New Jersey-Long Island, NY-NJ-PA
In Brazil?
noquote(names(sort(table(CPS$MetroArea,CPS$Country == "Brazil")[,2],decreasing = TRUE))[1])
[1] Boston-Cambridge-Quincy, MA-NH
In Somalia?
noquote(names(sort(table(CPS$MetroArea,CPS$Country == "Somalia")[,2],decreasing = TRUE))[1])
[1] Minneapolis-St Paul-Bloomington, MN-WI