In the wake of the Great Recession of 2009, there has been a good deal of focus on employment statistics, one of the most important metrics policymakers use to gauge the overall strength of the economy. In the United States, the government measures unemployment using the Current Population Survey (CPS), which collects demographic and employment information from a wide range of Americans each month. In this exercise, we will employ the topics reviewed in the lectures as well as a few new techniques using the September 2013 version of this rich, nationally representative dataset (available online).
The observations in the dataset represent people surveyed in the September 2013 CPS who actually completed a survey. While the full dataset has 385 variables, in this exercise we will use a more compact version of the dataset, CPSData.csv, which has the following variables:
PeopleInHousehold: The number of people in the interviewee’s household.
Region: The census region where the interviewee lives.
State: The state where the interviewee lives.
MetroAreaCode: A code that identifies the metropolitan area in which the interviewee lives (missing if the interviewee does not live in a metropolitan area). The mapping from codes to names of metropolitan areas is provided in the file MetroAreaCodes.csv.
Age: The age, in years, of the interviewee. 80 represents people aged 80-84, and 85 represents people aged 85 and higher.
Married: The marriage status of the interviewee.
Sex: The sex of the interviewee.
Education: The maximum level of education obtained by the interviewee.
Race: The race of the interviewee.
Hispanic: Whether the interviewee is of Hispanic ethnicity.
CountryOfBirthCode: A code identifying the country of birth of the interviewee. The mapping from codes to names of countries is provided in the file CountryCodes.csv.
Citizenship: The United States citizenship status of the interviewee.
EmploymentStatus: The status of employment of the interviewee.
Industry: The industry of employment of the interviewee (only available if they are employed).
library(dplyr)
package 'dplyr' was built under R version 3.5.1
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
Load the dataset from CPSData.csv into a data frame called CPS, and view the dataset with the summary() and str() commands.
CPS=read.csv("C:/bussiness analytics/data/CPSData.csv")
nrow(CPS)
[1] 131302
Among the interviewees with a value reported for the Industry variable, what is the most common industry of employment? Please enter the name exactly how you see it.
sort(table(CPS$Industry))%>% tail
Construction Leisure and hospitality
4387 6364
Manufacturing Professional and business services
6791 7519
Trade Educational and health services
8933 15017
Recall from the homework assignment “The Analytical Detective” that you can call the sort() function on the output of the table() function to obtain a sorted breakdown of a variable. For instance, sort(table(CPS$Region)) sorts the regions by the number of interviewees from that region.
Which state has the fewest interviewees?
sort(table(CPS$State)) %>% head
New Mexico Montana Mississippi Alabama West Virginia Arkansas
1102 1214 1230 1376 1409 1421
Which state has the largest number of interviewees?
sort(table(CPS$State)) %>% tail()
Illinois Pennsylvania Florida New York Texas California
3912 3930 5149 5595 7077 11570
What proportion of interviewees are citizens of the United States?
table(CPS$Citizenship) %>% prop.table
Citizen, Native Citizen, Naturalized Non-Citizen
0.88832615 0.05386818 0.05780567
table(CPS$Citizenship == "Non-Citizen") %>% prop.table
FALSE TRUE
0.94219433 0.05780567
prop.table()計算表格內的百分比
table(CPS$Citizenship == "Non-Citizen") %>% prop.table # 0.942194
FALSE TRUE
0.94219433 0.05780567
The CPS differentiates between race (with possible values American Indian, Asian, Black, Pacific Islander, White, or Multiracial) and ethnicity. A number of interviewees are of Hispanic ethnicity, as captured by the Hispanic variable. For which races are there at least 250 interviewees in the CPS dataset of Hispanic ethnicity? (Select all that apply.)
table(CPS$Race, CPS$Hispanic)
0 1
American Indian 1129 304
Asian 6407 113
Black 13292 621
Multiracial 2449 448
Pacific Islander 541 77
White 89190 16731
Which variables have at least one interviewee with a missing (NA) value? (Select all that apply.)
summary(CPS)
PeopleInHousehold Region State MetroAreaCode
Min. : 1.000 Midwest :30684 California :11570 Min. :10420
1st Qu.: 2.000 Northeast:25939 Texas : 7077 1st Qu.:21780
Median : 3.000 South :41502 New York : 5595 Median :34740
Mean : 3.284 West :33177 Florida : 5149 Mean :35075
3rd Qu.: 4.000 Pennsylvania: 3930 3rd Qu.:41860
Max. :15.000 Illinois : 3912 Max. :79600
(Other) :94069 NA's :34238
Age Married Sex
Min. : 0.00 Divorced :11151 Female:67481
1st Qu.:19.00 Married :55509 Male :63821
Median :39.00 Never Married:30772
Mean :38.83 Separated : 2027
3rd Qu.:57.00 Widowed : 6505
Max. :85.00 NA's :25338
Education Race Hispanic
High school :30906 American Indian : 1433 Min. :0.0000
Bachelor's degree :19443 Asian : 6520 1st Qu.:0.0000
Some college, no degree:18863 Black : 13913 Median :0.0000
No high school diploma :16095 Multiracial : 2897 Mean :0.1393
Associate degree : 9913 Pacific Islander: 618 3rd Qu.:0.0000
(Other) :10744 White :105921 Max. :1.0000
NA's :25338
CountryOfBirthCode Citizenship EmploymentStatus
Min. : 57.00 Citizen, Native :116639 Disabled : 5712
1st Qu.: 57.00 Citizen, Naturalized: 7073 Employed :61733
Median : 57.00 Non-Citizen : 7590 Not in Labor Force:15246
Mean : 82.68 Retired :18619
3rd Qu.: 57.00 Unemployed : 4203
Max. :555.00 NA's :25789
Industry
Educational and health services :15017
Trade : 8933
Professional and business services: 7519
Manufacturing : 6791
Leisure and hospitality : 6364
(Other) :21618
NA's :65060
#Married #Education #EmploymentStatus #Industry #
Often when evaluating a new dataset, we try to identify if there is a pattern in the missing values in the dataset. We will try to determine if there is a pattern in the missing values of the Married variable. The function
is.na(CPS$Married)
returns a vector of TRUE/FALSE values for whether the Married variable is missing. We can see the breakdown of whether Married is missing based on the reported value of the Region variable with the function
table(CPS$Region, is.na(CPS$Married))
Which is the most accurate:
is.na(CPS$Married)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[14] TRUE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[27] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE
[40] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE
[53] FALSE FALSE FALSE FALSE TRUE TRUE FALSE TRUE TRUE TRUE FALSE FALSE FALSE
[66] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[79] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
[92] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
[105] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[118] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[131] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[144] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[157] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[170] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[183] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE TRUE FALSE
[196] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE
[209] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[222] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE TRUE FALSE
[235] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
[248] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
[261] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE TRUE FALSE
[274] FALSE FALSE TRUE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[287] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE
[300] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE TRUE FALSE FALSE
[313] FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[326] FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[339] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE
[352] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE TRUE FALSE
[365] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[378] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE TRUE
[391] TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
[404] FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE
[417] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[430] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
[443] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
[456] FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE
[469] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE
[482] FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE
[495] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[508] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE
[521] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
[534] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[547] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE
[560] TRUE TRUE TRUE FALSE FALSE TRUE FALSE FALSE TRUE TRUE FALSE TRUE FALSE
[573] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[586] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
[599] FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
[612] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[625] FALSE FALSE TRUE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[638] FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[651] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE
[664] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[677] FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[690] FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[703] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE
[716] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[729] TRUE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[742] FALSE FALSE FALSE TRUE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[755] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
[768] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
[781] TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[794] FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[807] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
[820] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
[833] FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE TRUE FALSE FALSE TRUE TRUE
[846] TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[859] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
[872] TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[885] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[898] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[911] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
[924] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[937] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
[950] FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[963] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[976] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE
[989] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE
[ reached getOption("max.print") -- omitted 130302 entries ]
table(CPS$Region, is.na(CPS$Married))
FALSE TRUE
Midwest 24609 6075
Northeast 21432 4507
South 33535 7967
West 26388 6789
# The Married variable being missing is not related to the Region, Sex, Age, or Citizenship value for the interviewee.
is.na()遺漏值的處理
As mentioned in the variable descriptions, MetroAreaCode is missing if an interviewee does not live in a metropolitan area. Using the same technique as in the previous question, answer the following questions about people who live in non-metropolitan areas.
How many states had all interviewees living in a non-metropolitan area (aka they have a missing MetroAreaCode value)? For this question, treat the District of Columbia as a state (even though it is not technically a state).
table(is.na(CPS$MetroAreaCode), CPS$State)
Alabama Alaska Arizona Arkansas California Colorado Connecticut Delaware
FALSE 1020 0 1327 724 11333 2545 2593 1696
TRUE 356 1590 201 697 237 380 243 518
District of Columbia Florida Georgia Hawaii Idaho Illinois Indiana Iowa
FALSE 1791 4947 2250 1576 761 3473 1420 1297
TRUE 0 202 557 523 757 439 584 1231
Kansas Kentucky Louisiana Maine Maryland Massachusetts Michigan Minnesota
FALSE 1234 908 1216 909 2978 1858 2517 2150
TRUE 701 933 234 1354 222 129 546 989
Mississippi Missouri Montana Nebraska Nevada New Hampshire New Jersey
FALSE 376 1440 199 816 1609 1148 2567
TRUE 854 705 1015 1133 247 1514 0
New Mexico New York North Carolina North Dakota Ohio Oklahoma Oregon
FALSE 832 5144 1642 432 2754 1024 1519
TRUE 270 451 977 1213 924 499 424
Pennsylvania Rhode Island South Carolina South Dakota Tennessee Texas Utah
FALSE 3245 2209 1139 595 1149 6060 1455
TRUE 685 0 519 1405 635 1017 387
Vermont Virginia Washington West Virginia Wisconsin Wyoming
FALSE 657 2367 1937 344 1882 0
TRUE 1233 586 429 1065 804 1624
How many states had all interviewees living in a metropolitan area? Again, treat the District of Columbia as a state.
# 3
Which region of the United States has the largest proportion of interviewees living in a non-metropolitan area?
tapply(is.na(CPS$MetroAreaCode), CPS$Region, mean) %>% sort
Northeast South West Midwest
0.2162381 0.2378440 0.2436628 0.3478686
# Midwest
While we were able to use the table() command to compute the proportion of interviewees from each region not living in a metropolitan area, it was somewhat tedious (it involved manually computing the proportion for each region) and isn’t something you would want to do if there were a larger number of options. It turns out there is a less tedious way to compute the proportion of values that are TRUE. The mean() function, which takes the average of the values passed to it, will treat TRUE as 1 and FALSE as 0, meaning it returns the proportion of values that are true. For instance, mean(c(TRUE, FALSE, TRUE, TRUE)) returns 0.75. Knowing this, use tapply() with the mean function to answer the following questions:
Which state has a proportion of interviewees living in a non-metropolitan area closest to 30%?
tapply(is.na(CPS$MetroAreaCode), CPS$State, mean)
Alabama Alaska Arizona Arkansas
0.25872093 1.00000000 0.13154450 0.49049965
California Colorado Connecticut Delaware
0.02048401 0.12991453 0.08568406 0.23396567
District of Columbia Florida Georgia Hawaii
0.00000000 0.03923092 0.19843249 0.24916627
Idaho Illinois Indiana Iowa
0.49868248 0.11221881 0.29141717 0.48694620
Kansas Kentucky Louisiana Maine
0.36227390 0.50678979 0.16137931 0.59832081
Maryland Massachusetts Michigan Minnesota
0.06937500 0.06492199 0.17825661 0.31506849
Mississippi Missouri Montana Nebraska
0.69430894 0.32867133 0.83607908 0.58132376
Nevada New Hampshire New Jersey New Mexico
0.13308190 0.56874530 0.00000000 0.24500907
New York North Carolina North Dakota Ohio
0.08060769 0.37304315 0.73738602 0.25122349
Oklahoma Oregon Pennsylvania Rhode Island
0.32764281 0.21821925 0.17430025 0.00000000
South Carolina South Dakota Tennessee Texas
0.31302774 0.70250000 0.35594170 0.14370496
Utah Vermont Virginia Washington
0.21009772 0.65238095 0.19844226 0.18131868
West Virginia Wisconsin Wyoming
0.75585522 0.29932986 1.00000000
abs(0.3 - tapply(is.na(CPS$MetroAreaCode), CPS$State, mean)) %>%
sort %>% head #Wisconsin
Wisconsin Indiana South Carolina Minnesota Oklahoma
0.0006701415 0.0085828343 0.0130277443 0.0150684932 0.0276428102
Missouri
0.0286713287
abs(x)用于计算x的绝对值, sqrt(x)用于计算x的(正则)平方根 √{x}
Which state has the largest proportion of non-metropolitan interviewees, ignoring states where all interviewees were non-metropolitan?
tapply(is.na(CPS$MetroAreaCode), CPS$State, mean) %>% sort %>% tail
South Dakota North Dakota West Virginia Montana Alaska Wyoming
0.7025000 0.7373860 0.7558552 0.8360791 1.0000000 1.0000000
# Montana
Codes like MetroAreaCode and CountryOfBirthCode are a compact way to encode factor variables with text as their possible values, and they are therefore quite common in survey datasets. In fact, all but one of the variables in this dataset were actually stored by a numeric code in the original CPS datafile.
When analyzing a variable stored by a numeric code, we will often want to convert it into the values the codes represent. To do this, we will use a dictionary, which maps the the code to the actual value of the variable. We have provided dictionaries MetroAreaCodes.csv and CountryCodes.csv, which respectively map MetroAreaCode and CountryOfBirthCode into their true values. Read these two dictionaries into data frames MetroAreaMap and CountryMap.
How many observations (codes for metropolitan areas) are there in MetroAreaMap?
Me=read.csv("C:/bussiness analytics/data/MetroAreaCodes.csv")
nrow(Me)
[1] 271
How many observations (codes for countries) are there in CountryMap?
Cc=read.csv("C:/bussiness analytics/data/CountryCodes.csv")
nrow(Cc)
[1] 149
To merge in the metropolitan areas, we want to connect the field MetroAreaCode from the CPS data frame with the field Code in MetroAreaMap. The following command merges the two data frames on these columns, overwriting the CPS data frame with the result:
CPS = merge(CPS, MetroAreaMap, by.x="MetroAreaCode", by.y="Code", all.x=TRUE)
The first two arguments determine the data frames to be merged (they are called “x” and “y”, respectively, in the subsequent parameters to the merge function). by.x=“MetroAreaCode” means we’re matching on the MetroAreaCode variable from the “x” data frame (CPS), while by.y=“Code” means we’re matching on the Code variable from the “y” data frame (MetroAreaMap). Finally, all.x=TRUE means we want to keep all rows from the “x” data frame (CPS), even if some of the rows’ MetroAreaCode doesn’t match any codes in MetroAreaMap (for those familiar with database terminology, this parameter makes the operation a left outer join instead of an inner join).
Review the new version of the CPS data frame with the summary() and str() functions. What is the name of the variable that was added to the data frame by the merge() operation?
CPS = merge(CPS, Me, by.x="MetroAreaCode", by.y="Code", all.x=TRUE)
column names 'MetroArea.x', 'MetroArea.y', 'MetroArea.x', 'MetroArea.y', 'MetroArea.x', 'MetroArea.y' are duplicated in the result
CPS = merge(CPS, Me, by.x=“MetroAreaCode”, by.y=“Code”, all.x=TRUE)
How many interviewees have a missing value for the new metropolitan area variable? Note that all of these interviewees would have been removed from the merged data frame if we did not include the all.x=TRUE parameter.
XX = merge(CPS, Me, by.x="MetroAreaCode", by.y="Code")
column names 'MetroArea.x', 'MetroArea.y', 'MetroArea.x', 'MetroArea.y', 'MetroArea.x', 'MetroArea.y', 'MetroArea.x', 'MetroArea.y' are duplicated in the result
merge()數據合併
Which of the following metropolitan areas has the largest number of interviewees?
table(CPS$MetroArea) %>% sort %>% tail(10) # Boston-Cambridge-Quincy, MA-NH
Houston-Baytown-Sugar Land, TX
1649
Dallas-Fort Worth-Arlington, TX
1863
Minneapolis-St Paul-Bloomington, MN-WI
1942
Boston-Cambridge-Quincy, MA-NH
2229
Providence-Fall River-Warwick, MA-RI
2284
Chicago-Naperville-Joliet, IN-IN-WI
2772
Philadelphia-Camden-Wilmington, PA-NJ-DE
2855
Los Angeles-Long Beach-Santa Ana, CA
4102
Washington-Arlington-Alexandria, DC-VA-MD-WV
4177
New York-Northern New Jersey-Long Island, NY-NJ-PA
5409
Which metropolitan area has the highest proportion of interviewees of Hispanic ethnicity? Hint: Use tapply() with mean, as in the previous subproblem. Calling sort() on the output of tapply() could also be helpful here.
tapply(CPS$Hispanic, CPS$MetroArea, mean) %>% sort %>% tail #Laredo, TX
San Antonio, TX El Centro, CA El Paso, TX
0.6441516 0.6868687 0.7909836
Brownsville-Harlingen, TX McAllen-Edinburg-Pharr, TX Laredo, TX
0.7974684 0.9487179 0.9662921
Remembering that CPS$Race == “Asian” returns a TRUE/FALSE vector of whether an interviewee is Asian, determine the number of metropolitan areas in the United States from which at least 20% of interviewees are Asian.
tapply(CPS$Race == "Asian", CPS$MetroArea, mean) %>% sort %>% tail #4
Warner Robins, GA Fresno, CA
0.1666667 0.1848185
Vallejo-Fairfield, CA San Jose-Sunnyvale-Santa Clara, CA
0.2030075 0.2417910
San Francisco-Oakland-Fremont, CA Honolulu, HI
0.2467532 0.5019036
Normally, we would look at the sorted proportion of interviewees from each metropolitan area who have not received a high school diploma with the command:
sort(tapply(CPS$Education == "No high school diploma", CPS$MetroArea, mean))
However, none of the interviewees aged 14 and younger have an education value reported, so the mean value is reported as NA for each metropolitan area. To get mean (and related functions, like sum) to ignore missing values, you can pass the parameter na.rm=TRUE. Passing na.rm=TRUE to the tapply function, determine which metropolitan area has the smallest proportion of interviewees who have received no high school diploma.
tapply(CPS$Education == "No high school diploma", CPS$MetroArea, mean, na.rm=T) %>%
sort %>% head
Iowa City, IA Bowling Green, KY Kalamazoo-Portage, MI
0.02912621 0.03703704 0.05050505
Champaign-Urbana, IL Bremerton-Silverdale, WA Lawrence, KS
0.05154639 0.05405405 0.05952381
# Iowa City, IA
Just as we did with the metropolitan area information, merge in the country of birth information from the CountryMap data frame, replacing the CPS data frame with the result. If you accidentally overwrite CPS with the wrong values, remember that you can restore it by re-loading the data frame from CPSData.csv and then merging in the metropolitan area information using the command provided in the previous subproblem.
What is the name of the variable added to the CPS data frame by this merge operation?
CPS = merge(CPS, Cc, by.x="CountryOfBirthCode", by.y="Code", all.x=TRUE)
# Country
merge(CPS ,Country ,by.x= ,by.y= all.x ) How many interviewees have a missing value for the new country of birth variable?
sum(is.na(CPS$CountryOfBirthCode))
[1] 0
Among all interviewees born outside of North America, which country was the most common place of birth?
table(CPS$Country) %>% sort %>% tail
integer(0)
What proportion of the interviewees from the “New York-Northern New Jersey-Long Island, NY-NJ-PA” metropolitan area have a country of birth that is not the United States? For this computation, don’t include people from this metropolitan area who have a missing country of birth.
area="New York-Northern New Jersey-Long Island, NY-NJ-PA"
mean(Cc$Country[Me$MetroArea]==area)
[1] NA
area=“New York-Northern New Jersey-Long Island, NY-NJ-PA” mean(Cc\(Country[Me\)MetroArea]==area)
Which metropolitan area has the largest number (note – not proportion) of interviewees with a country of birth in India? Hint – remember to include na.rm=TRUE if you are using tapply() to answer this question.
tapply(CPS$Country == 'India', CPS$MetroArea, sum, na.rm=T) %>%
sort %>% tail
New York-Northern New Jersey-Long Island, NY-NJ-PA
In Brazil?
tapply(CPS$Country == 'Brazil', CPS$MetroArea, sum, na.rm=T) %>% sort %>% tail
Error in tapply(CPS$Country == "Brazil", CPS$MetroArea, sum, na.rm = T) :
arguments must have same length
Boston-Cambridge-Quincy, MA-NH
In Somalia?
tapply(CPS$Country == 'Somalia', CPS$MetroArea, sum, na.rm=T) %>% sort %>% tail
Minneapolis-St Paul-Bloomington, MN-WI