Basic notebook setup: install and load a few basic packages.
rm(list=ls(all=T))
knitr::opts_chunk$set(comment = NA)
knitr::opts_knit$set(global.par = TRUE)
par(cex=0.8); options(scipen=20, digits=4, width=90)
if(!require(pacman)) install.packages("pacman")
pacman::p_load(dplyr)
Please do not modify the code above.
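Employment statistics are among the most important indicators policymakers use to gauge the overall strength of the economy. In the United States, the government measures unemployment using the Current Population Survey (CPS), which collects demographic and employment information from a wide range of Americans each month. In this exercise, we will apply the topics reviewed in the lecture to the nationally representative September 2013 edition of the survey. The observations in the dataset represent people who actually completed the survey in the September 2013 CPS. The full dataset has 385 variables, but in this exercise we will use the CPSData.csv version of the dataset, which has the following variables:
PeopleInHousehold: The number of people in the interviewee's household.
Region: The census region where the interviewee lives.
State: The state where the interviewee lives.
MetroAreaCode: A code identifying the metropolitan area, or NA if the interviewee does not live in a metropolitan area. The mapping from codes to metropolitan area names is provided in MetroAreaCodes.csv.
Age: The age of the interviewee, in years. 80 represents people aged 80-84, and 85 represents people aged 85 and older.
Married: The marital status of the interviewee.
Sex: The sex of the interviewee.
Education: The highest level of education obtained by the interviewee.
Race: The race of the interviewee.
Hispanic: Whether the interviewee is of Hispanic ethnicity.
CountryOfBirthCode: A code identifying the interviewee's country of birth. The mapping from codes to country names is provided in CountryCodes.csv.
Citizenship: The citizenship status of the interviewee.
EmploymentStatus: The employment status of the interviewee.
Industry: The industry of employment of the interviewee (only available if they are employed).
§ 1.1 How many interviewees are in the dataset?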
A=read.csv('data/CPSData.csv')
MetroAreaMap=read.csv('data/MetroAreaCodes.csv')
CountryMap = read.csv('data/CountryCodes.csv')
nrow(A)
[1] 131302
§ 1.2 Among the interviewees with a value reported for the Industry variable, what is the most common industry of employment? Please enter the name exactly how you see it.
table(A$Industry) %>% sort
Armed forces Mining
29 550
Agriculture, forestry, fishing, and hunting Information
1307 1328
Public administration Other services
3186 3224
Transportation and utilities Financial
3260 4347
Construction Leisure and hospitality
4387 6364
Manufacturing Professional and business services
6791 7519
Trade Educational and health services
8933 15017
#Educational and health services
§ 1.3 Which state has the fewest interviewees?
table(A$State) %>% sort %>% head
New Mexico Montana Mississippi Alabama West Virginia Arkansas
1102 1214 1230 1376 1409 1421
#New Mexico
Which state has the largest number of interviewees?
table(A$State) %>% sort %>% tail
Illinois Pennsylvania Florida New York Texas California
3912 3930 5149 5595 7077 11570
#California
§ 1.4 What proportion of interviewees are citizens of the United States?
table(A$Citizenship) %>% prop.table
Citizen, Native Citizen, Naturalized Non-Citizen
0.88833 0.05387 0.05781
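The question asks for a single proportion, while the table above splits citizens into two groups. A minimal sketch of the final step, using the shares copied from the output above (the variable names are illustrative, not part of the original notebook):

```r
# Shares copied from the prop.table output above
citizenship_share <- c("Citizen, Native"      = 0.88833,
                       "Citizen, Naturalized" = 0.05387,
                       "Non-Citizen"          = 0.05781)

# Citizens are the native and naturalized groups combined
citizen_share <- sum(citizenship_share[c("Citizen, Native", "Citizen, Naturalized")])
round(citizen_share, 4)

# On the full data frame, assuming Citizenship has no missing values,
# the same figure could be obtained with:
# mean(A$Citizenship != "Non-Citizen")
```

With the shares shown above, the two citizen groups add up to roughly 0.9422.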
§ 1.5 For which races are there at least 250 interviewees in the CPS dataset of Hispanic ethnicity? (Select all that apply.)
tapply(A$Hispanic, A$Race, sum) %>% sort
Pacific Islander Asian American Indian Multiracial Black
77 113 304 448 621
White
16731
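To read the answer off the counts rather than by eye, the sorted result can be filtered against the 250 threshold. A small sketch using the counts printed above (the vector name is an illustrative assumption):

```r
# Hispanic interviewee counts by race, copied from the output above
hispanic_by_race <- c("Pacific Islander" = 77, "Asian" = 113,
                      "American Indian" = 304, "Multiracial" = 448,
                      "Black" = 621, "White" = 16731)

# Keep only the races with at least 250 Hispanic interviewees
qualifying <- names(hispanic_by_race)[hispanic_by_race >= 250]
qualifying
```

On the real data the same filter can be applied directly to the `tapply()` result.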
§ 2.1 Which variables have at least one interviewee with a missing (NA) value? (Select all that apply.)
summary(A)
PeopleInHousehold Region State MetroAreaCode Age
Min. : 1.00 Length:131302 Length:131302 Min. :10420 Min. : 0.0
1st Qu.: 2.00 Class :character Class :character 1st Qu.:21780 1st Qu.:19.0
Median : 3.00 Mode :character Mode :character Median :34740 Median :39.0
Mean : 3.28 Mean :35075 Mean :38.8
3rd Qu.: 4.00 3rd Qu.:41860 3rd Qu.:57.0
Max. :15.00 Max. :79600 Max. :85.0
NA's :34238
Married Sex Education Race
Length:131302 Length:131302 Length:131302 Length:131302
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
Hispanic CountryOfBirthCode Citizenship EmploymentStatus
Min. :0.000 Min. : 57.0 Length:131302 Length:131302
1st Qu.:0.000 1st Qu.: 57.0 Class :character Class :character
Median :0.000 Median : 57.0 Mode :character Mode :character
Mean :0.139 Mean : 82.7
3rd Qu.:0.000 3rd Qu.: 57.0
Max. :1.000 Max. :555.0
Industry
Length:131302
Class :character
Mode :character
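Note that `summary()` reports only Length, Class, and Mode for character columns, so in the output above an NA count is visible only for the numeric MetroAreaCode column. The later `is.na(A$Married)` breakdown indicates that Married also contains missing values. A more direct check is to count NAs per column; the sketch below uses a small made-up data frame standing in for `A`:

```r
# Toy data standing in for A; the values are made up for illustration
toy <- data.frame(
  Age           = c(30, 12, 45),
  Married       = c("Married", NA, "Divorced"),
  MetroAreaCode = c(NA, 10420, NA),
  stringsAsFactors = FALSE
)

# Number of missing values in each column, regardless of column type
na_counts <- colSums(is.na(toy))
na_counts

# Names of the columns with at least one missing value
cols_with_na <- names(na_counts)[na_counts > 0]
cols_with_na
```

Running `colSums(is.na(A))` on the real data would list the missing-value count for every variable, including the character ones.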
§ 2.2 Which is the most accurate:
tapply(is.na(A$Married), A$Region, mean)
Midwest Northeast South West
0.1980 0.1738 0.1920 0.2046
tapply(is.na(A$Married), A$Sex, mean)
Female Male
0.1810 0.2056
tapply(is.na(A$Married), A$Age, mean)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 85
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
tapply(is.na(A$Married), A$Citizenship, mean)
Citizen, Native Citizen, Naturalized Non-Citizen
0.21162 0.02305 0.06482
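Of the four breakdowns above, only the one by Age is all-or-nothing: the share of missing Married values is 1 for ages 0 to 14 and 0 from age 15 on, which suggests the value is missing because of age. A sketch of how that pattern could be checked programmatically, on a made-up data frame standing in for `A`:

```r
# Toy data standing in for A; the values are made up for illustration
toy <- data.frame(
  Age     = c(3, 10, 14, 15, 30, 62),
  Married = c(NA, NA, NA, "Never Married", "Married", "Widowed")
)

# Share of missing Married values at each age
p <- tapply(is.na(toy$Married), toy$Age, mean)

# TRUE when every age group is either entirely missing or entirely present
all(p %in% c(0, 1))

# Oldest age at which Married is missing
max(toy$Age[is.na(toy$Married)])
```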
§ 2.3 How many states had all interviewees living in a non-metropolitan area (aka they have a missing MetroAreaCode value)? For this question, treat the District of Columbia as a state (even though it is not technically a state).
tapply(is.na(A$MetroAreaCode), A$State, mean) %>% sort
District of Columbia New Jersey Rhode Island California
0.00000 0.00000 0.00000 0.02048
Florida Massachusetts Maryland New York
0.03923 0.06492 0.06938 0.08061
Connecticut Illinois Colorado Arizona
0.08568 0.11222 0.12991 0.13154
Nevada Texas Louisiana Pennsylvania
0.13308 0.14370 0.16138 0.17430
Michigan Washington Georgia Virginia
0.17826 0.18132 0.19843 0.19844
Utah Oregon Delaware New Mexico
0.21010 0.21822 0.23397 0.24501
Hawaii Ohio Alabama Indiana
0.24917 0.25122 0.25872 0.29142
Wisconsin South Carolina Minnesota Oklahoma
0.29933 0.31303 0.31507 0.32764
Missouri Tennessee Kansas North Carolina
0.32867 0.35594 0.36227 0.37304
Iowa Arkansas Idaho Kentucky
0.48695 0.49050 0.49868 0.50679
New Hampshire Nebraska Maine Vermont
0.56875 0.58132 0.59832 0.65238
Mississippi South Dakota North Dakota West Virginia
0.69431 0.70250 0.73739 0.75586
Montana Alaska Wyoming
0.83608 1.00000 1.00000
How many states had all interviewees living in a metropolitan area? Again, treat the District of Columbia as a state.
tapply(is.na(A$MetroAreaCode), A$State, mean) %>% sort
District of Columbia New Jersey Rhode Island California
0.00000 0.00000 0.00000 0.02048
Florida Massachusetts Maryland New York
0.03923 0.06492 0.06938 0.08061
Connecticut Illinois Colorado Arizona
0.08568 0.11222 0.12991 0.13154
Nevada Texas Louisiana Pennsylvania
0.13308 0.14370 0.16138 0.17430
Michigan Washington Georgia Virginia
0.17826 0.18132 0.19843 0.19844
Utah Oregon Delaware New Mexico
0.21010 0.21822 0.23397 0.24501
Hawaii Ohio Alabama Indiana
0.24917 0.25122 0.25872 0.29142
Wisconsin South Carolina Minnesota Oklahoma
0.29933 0.31303 0.31507 0.32764
Missouri Tennessee Kansas North Carolina
0.32867 0.35594 0.36227 0.37304
Iowa Arkansas Idaho Kentucky
0.48695 0.49050 0.49868 0.50679
New Hampshire Nebraska Maine Vermont
0.56875 0.58132 0.59832 0.65238
Mississippi South Dakota North Dakota West Virginia
0.69431 0.70250 0.73739 0.75586
Montana Alaska Wyoming
0.83608 1.00000 1.00000
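Both counts in this question can be taken from the same vector of proportions: a value of 1 means every interviewee in the state is non-metropolitan, and a value of 0 means every interviewee is metropolitan. A sketch using the two ends of the sorted output above (the vector name is an illustrative assumption):

```r
# Lowest and highest proportions copied from the sorted output above
nonmetro_share <- c("District of Columbia" = 0, "New Jersey" = 0,
                    "Rhode Island" = 0, "California" = 0.02048,
                    "Montana" = 0.83608, "Alaska" = 1, "Wyoming" = 1)

# States where all interviewees are non-metropolitan
all_nonmetro <- names(nonmetro_share)[nonmetro_share == 1]
length(all_nonmetro)

# States where all interviewees are metropolitan
all_metro <- names(nonmetro_share)[nonmetro_share == 0]
length(all_metro)
```

Because the output is sorted, no state outside this excerpt can have a proportion of exactly 0 or 1.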
§ 2.4 Which region of the United States has the largest proportion of interviewees living in a non-metropolitan area?
tapply(is.na(A$MetroAreaCode), A$Region, mean) %>% sort
Northeast South West Midwest
0.2162 0.2378 0.2437 0.3479
§ 2.5 Which state has a proportion of interviewees living in a non-metropolitan area closest to 30%?
tapply(is.na(A$MetroAreaCode), A$State, mean) %>% sort
District of Columbia New Jersey Rhode Island California
0.00000 0.00000 0.00000 0.02048
Florida Massachusetts Maryland New York
0.03923 0.06492 0.06938 0.08061
Connecticut Illinois Colorado Arizona
0.08568 0.11222 0.12991 0.13154
Nevada Texas Louisiana Pennsylvania
0.13308 0.14370 0.16138 0.17430
Michigan Washington Georgia Virginia
0.17826 0.18132 0.19843 0.19844
Utah Oregon Delaware New Mexico
0.21010 0.21822 0.23397 0.24501
Hawaii Ohio Alabama Indiana
0.24917 0.25122 0.25872 0.29142
Wisconsin South Carolina Minnesota Oklahoma
0.29933 0.31303 0.31507 0.32764
Missouri Tennessee Kansas North Carolina
0.32867 0.35594 0.36227 0.37304
Iowa Arkansas Idaho Kentucky
0.48695 0.49050 0.49868 0.50679
New Hampshire Nebraska Maine Vermont
0.56875 0.58132 0.59832 0.65238
Mississippi South Dakota North Dakota West Virginia
0.69431 0.70250 0.73739 0.75586
Montana Alaska Wyoming
0.83608 1.00000 1.00000
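Rather than scanning the sorted list for the value nearest 0.30, the distance to 0.30 can be minimised directly. A sketch using the four states around that value in the output above (the vector name is an illustrative assumption):

```r
# Proportions near 0.30, copied from the output above
nonmetro_share <- c("Indiana" = 0.29142, "Wisconsin" = 0.29933,
                    "South Carolina" = 0.31303, "Minnesota" = 0.31507)

# State whose proportion has the smallest absolute distance from 0.30
closest <- names(which.min(abs(nonmetro_share - 0.30)))
closest
```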
Which state has the largest proportion of non-metropolitan interviewees, ignoring states where all interviewees were non-metropolitan?
tapply(is.na(A$MetroAreaCode), A$State, mean) %>% sort
District of Columbia New Jersey Rhode Island California
0.00000 0.00000 0.00000 0.02048
Florida Massachusetts Maryland New York
0.03923 0.06492 0.06938 0.08061
Connecticut Illinois Colorado Arizona
0.08568 0.11222 0.12991 0.13154
Nevada Texas Louisiana Pennsylvania
0.13308 0.14370 0.16138 0.17430
Michigan Washington Georgia Virginia
0.17826 0.18132 0.19843 0.19844
Utah Oregon Delaware New Mexico
0.21010 0.21822 0.23397 0.24501
Hawaii Ohio Alabama Indiana
0.24917 0.25122 0.25872 0.29142
Wisconsin South Carolina Minnesota Oklahoma
0.29933 0.31303 0.31507 0.32764
Missouri Tennessee Kansas North Carolina
0.32867 0.35594 0.36227 0.37304
Iowa Arkansas Idaho Kentucky
0.48695 0.49050 0.49868 0.50679
New Hampshire Nebraska Maine Vermont
0.56875 0.58132 0.59832 0.65238
Mississippi South Dakota North Dakota West Virginia
0.69431 0.70250 0.73739 0.75586
Montana Alaska Wyoming
0.83608 1.00000 1.00000
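To ignore the states where every interviewee is non-metropolitan, the proportions equal to 1 can be filtered out before taking the maximum. A sketch using the top of the sorted output above (the vector name is an illustrative assumption):

```r
# Highest proportions copied from the sorted output above
nonmetro_share <- c("North Dakota" = 0.73739, "West Virginia" = 0.75586,
                    "Montana" = 0.83608, "Alaska" = 1, "Wyoming" = 1)

# Drop the all-non-metropolitan states, then take the largest remaining value
largest_partial <- names(which.max(nonmetro_share[nonmetro_share < 1]))
largest_partial
```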
§ 3.1 How many observations (codes for metropolitan areas) are there in MetroAreaMap?
nrow(MetroAreaMap)
[1] 271
How many observations (codes for countries) are there in CountryMap?
nrow(CountryMap)
[1] 149
§ 3.2 What is the name of the variable that was added to the data frame by the merge() operation?
A = merge(A,CountryMap, by.x= "CountryOfBirthCode", by.y="Code", all.x=TRUE)
A = merge(A,MetroAreaMap, by.x= "MetroAreaCode", by.y="Code", all.x=TRUE)
How many interviewees have a missing value for the new metropolitan area variable?
sum(is.na(A$MetroArea))
[1] 34238
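The count of 34238 matches the number of NAs reported for MetroAreaCode in the earlier summary, which is what a left join produces: `all.x=TRUE` keeps every row of the first data frame and fills the new column with NA where no code matches. A small sketch of that behaviour; the data frames and area names below are made up for illustration:

```r
# Toy interviewee table; one row has no code, one has a code with no match
people <- data.frame(MetroAreaCode = c(1, NA, 9), Age = c(30, 41, 52))

# Toy lookup table from codes to (hypothetical) metropolitan area names
codes <- data.frame(Code = c(1, 2), MetroArea = c("Metro X", "Metro Y"))

# Left join: keep every row of people, add MetroArea where the code matches
merged <- merge(people, codes, by.x = "MetroAreaCode", by.y = "Code", all.x = TRUE)

# The column added by the merge
added <- setdiff(names(merged), names(people))
added

# All rows are kept; rows without a match get NA in the new column
nrow(merged)
sum(is.na(merged$MetroArea))
```

The same `setdiff()` comparison of column names before and after the merge identifies the variable each merge adds to `A`.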
§ 3.3 Which of the following metropolitan areas has the largest number of interviewees?
table(A$MetroArea) %>% sort %>% tail
Providence-Fall River-Warwick, MA-RI
2284
Chicago-Naperville-Joliet, IN-IN-WI
2772
Philadelphia-Camden-Wilmington, PA-NJ-DE
2855
Los Angeles-Long Beach-Santa Ana, CA
4102
Washington-Arlington-Alexandria, DC-VA-MD-WV
4177
New York-Northern New Jersey-Long Island, NY-NJ-PA
5409
§ 3.4 Which metropolitan area has the highest proportion of interviewees of Hispanic ethnicity?
tapply(A$Hispanic,A$MetroArea,mean) %>% sort %>% tail
San Antonio, TX El Centro, CA El Paso, TX
0.6442 0.6869 0.7910
Brownsville-Harlingen, TX McAllen-Edinburg-Pharr, TX Laredo, TX
0.7975 0.9487 0.9663
§ 3.5 Determine the number of metropolitan areas in the United States from which at least 20% of interviewees are Asian.
tapply(A$Race=="Asian",A$MetroArea,mean) %>% sort %>% tail
Warner Robins, GA Fresno, CA
0.1667 0.1848
Vallejo-Fairfield, CA San Jose-Sunnyvale-Santa Clara, CA
0.2030 0.2418
San Francisco-Oakland-Fremont, CA Honolulu, HI
0.2468 0.5019
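Since the output is sorted in ascending order and Fresno is already below 0.20, only areas in this tail can meet the threshold. A sketch that counts them from the values printed above (the vector name is an illustrative assumption):

```r
# Six largest Asian shares, copied from the output above
asian_share <- c("Warner Robins, GA" = 0.1667,
                 "Fresno, CA" = 0.1848,
                 "Vallejo-Fairfield, CA" = 0.2030,
                 "San Jose-Sunnyvale-Santa Clara, CA" = 0.2418,
                 "San Francisco-Oakland-Fremont, CA" = 0.2468,
                 "Honolulu, HI" = 0.5019)

# Number of metropolitan areas where at least 20% of interviewees are Asian
n_at_least_20pct <- sum(asian_share >= 0.20)
n_at_least_20pct
```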
§ 3.6 Passing na.rm=TRUE to the tapply function, determine which metropolitan area has the smallest proportion of interviewees who have received no high school diploma.
tapply(A$Education=="No high school diploma",A$MetroArea,mean,na.rm=T) %>% sort %>% head
Iowa City, IA Bowling Green, KY Kalamazoo-Portage, MI
0.02913 0.03704 0.05051
Champaign-Urbana, IL Bremerton-Silverdale, WA Lawrence, KS
0.05155 0.05405 0.05952
§ 4.1 What is the name of the variable added to the CPS data frame by this merge operation?
# Unsure about this one.
# How many interviewees have a missing value for the new metropolitan area variable?
sum(is.na(A$MetroArea))
[1] 34238
§ 4.2 Among all interviewees born outside of North America, which country was the most common place of birth?
table(A$Country) %>% sort %>% tail(10)
Cuba Germany Vietnam El Salvador Puerto Rico China
426 438 458 477 518 581
India Philippines Mexico United States
770 839 3921 115063
§ 4.3 What proportion of the interviewees from the “New York-Northern New Jersey-Long Island, NY-NJ-PA” metropolitan area have a country of birth that is not the United States?
table(A$Country[A$MetroArea=="New York-Northern New Jersey-Long Island, NY-NJ-PA"]=="United States")%>% prop.table
FALSE TRUE
0.3087 0.6913
§ 4.4 Which metropolitan area has the largest number (note – not proportion) of interviewees with a country of birth in India?
tapply(A$Country=="India",A$MetroArea,sum) %>% sort %>% tail
Kansas City, MO-KS Milwaukee-Waukesha-West Allis, WI
11 12
Fresno, CA San Jose-Sunnyvale-Santa Clara, CA
16 19
Hartford-West Hartford-East Hartford, CT Detroit-Warren-Livonia, MI
26 30
In Brazil?
tapply(A$Country=="Brazil", A$MetroArea,sum) %>% sort %>% tail
Sacramento-Arden-Arcade-Roseville, CA Canton-Massillon, OH
2 3
Phoenix-Mesa-Scottsdale, AZ Davenport-Moline-Rock Island, IA-IL
3 4
Miami-Fort Lauderdale-Miami Beach, FL Boston-Cambridge-Quincy, MA-NH
16 18
In Somalia?
tapply(A$Country=="Somalia", A$MetroArea,sum) %>% sort %>% tail
York-Hanover, PA Youngstown-Warren-Boardman, OH
0 0
Dayton, OH Richmond, VA
1 1
Phoenix-Mesa-Scottsdale, AZ St. Cloud, MN
7 7
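One caveat applies to the three counts above: they call `sum` without `na.rm=TRUE`. If any interviewee in a metropolitan area has a missing Country after the merge, the sum for that area is NA, and `sort()` drops NA values by default, so the area disappears from the listing. Whether this affects the real data is not shown here; the sketch below only demonstrates the mechanism on a made-up data frame:

```r
# Toy data standing in for A; the values and area names are made up
toy <- data.frame(
  MetroArea = c("Metro X", "Metro X", "Metro X", "Metro Y", "Metro Y"),
  Country   = c("India", "India", NA, "India", "Brazil")
)

# Without na.rm, Metro X sums to NA because one Country value is missing
without_na_rm <- tapply(toy$Country == "India", toy$MetroArea, sum)
sort(without_na_rm)   # Metro X is silently dropped by sort()

# With na.rm = TRUE, the missing value is skipped and Metro X is counted
with_na_rm <- tapply(toy$Country == "India", toy$MetroArea, sum, na.rm = TRUE)
sort(with_na_rm)
```

Re-running the three queries with `na.rm=TRUE` would confirm whether any metropolitan area was omitted from the rankings above.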