The major data is in data/MetroAreaCodes.csv. We read it into a data frame D.
D = read.csv("data/CPSData.csv", stringsAsFactors = TRUE)
# stringsAsFactors = TRUE代表將data中的'character字串'轉為'factor類別'D 之中每一筆紀錄是一個受測對象,每一個欄位是一個調查問卷題項:
PeopleInHousehold: 受訪者家庭中的人數。Region: 受訪者居住的人口普查區域。State: 受訪者居住的州。MetroAreaCode: 都會區代碼,如受訪者不住都會區,則為NA;從代碼到都會區名稱的對應在MetroAreaCodes.csv中提供。Age: 受訪者的年齡,以年為單位。 80代表80-84歲的人,85代表85歲及以上的人。Married: 受訪者的婚姻狀況。Sex: 受訪者的性別。Education: 受訪者獲得的最高教育程度。Race: 受訪者的種族。Hispanic: 受訪者是否屬於西班牙裔。CountryOfBirthcode: 識別受訪者出生國家的代碼。從代碼到國家名稱的映射在CountryCodes.csv文件中提供。Citizenship: 受訪者的公民身份。EmploymentStatus: 受訪者的就業狀況。Industry: 受訪者的就業行業(僅在受僱的情況下可用)。🌷 這份作業的學習重點是:
NA) 的觀念和處理方法summary(D) PeopleInHousehold Region State MetroAreaCode Age
Min. : 1.00 Midwest :30684 California :11570 Min. :10420 Min. : 0.0
1st Qu.: 2.00 Northeast:25939 Texas : 7077 1st Qu.:21780 1st Qu.:19.0
Median : 3.00 South :41502 New York : 5595 Median :34740 Median :39.0
Mean : 3.28 West :33177 Florida : 5149 Mean :35075 Mean :38.8
3rd Qu.: 4.00 Pennsylvania: 3930 3rd Qu.:41860 3rd Qu.:57.0
Max. :15.00 Illinois : 3912 Max. :79600 Max. :85.0
(Other) :94069 NA's :34238
Married Sex Education
Divorced :11151 Female:67481 High school :30906
Married :55509 Male :63821 Bachelor's degree :19443
Never Married:30772 Some college, no degree:18863
Separated : 2027 No high school diploma :16095
Widowed : 6505 Associate degree : 9913
NA's :25338 (Other) :10744
NA's :25338
Race Hispanic CountryOfBirthCode
American Indian : 1433 Min. :0.000 Min. : 57.0
Asian : 6520 1st Qu.:0.000 1st Qu.: 57.0
Black : 13913 Median :0.000 Median : 57.0
Multiracial : 2897 Mean :0.139 Mean : 82.7
Pacific Islander: 618 3rd Qu.:0.000 3rd Qu.: 57.0
White :105921 Max. :1.000 Max. :555.0
Citizenship EmploymentStatus
Citizen, Native :116639 Disabled : 5712
Citizen, Naturalized: 7073 Employed :61733
Non-Citizen : 7590 Not in Labor Force:15246
Retired :18619
Unemployed : 4203
NA's :25789
Industry
Educational and health services :15017
Trade : 8933
Professional and business services: 7519
Manufacturing : 6791
Leisure and hospitality : 6364
(Other) :21618
NA's :65060
# summary只能針對'類別'or'數值'變數分別計算其統計值,字串是無法計算的!!In the summary we can see that there are 34,238 NA’s in MetroAreaCode. Yet be careful …
🌷 summary does not shows the numbers of NA in the character columns.
The best way to examine NA in a data frame is to pipe is.na() into colSums().
is.na(D) %>% colSums # 計算每個column中有多少個NA值 PeopleInHousehold Region State MetroAreaCode
0 0 0 34238
Age Married Sex Education
0 25338 0 25338
Race Hispanic CountryOfBirthCode Citizenship
0 0 0 0
EmploymentStatus Industry
25789 65060
now we see there are NA’s in Married, Education, EmploymentStatus and Industry. In the case of census, NA may occur when question items are …
§ 1.1 How many interviewees are in the dataset?
nrow(D)[1] 131302
§ 1.2 Among the interviewees with a value reported for the Industry variable, what is the most common industry of employment? Please enter the name exactly how you see it.
table(D$Industry) %>% sort(dec=T) %>% head(1)
Educational and health services
15017
summary(D$Industry) %>% sort(dec=T) # 計算所有變數之總數(包含NA's) NA's Educational and health services
65060 15017
Trade Professional and business services
8933 7519
Manufacturing Leisure and hospitality
6791 6364
Construction Financial
4387 4347
Transportation and utilities Other services
3260 3224
Public administration Information
3186 1328
Agriculture, forestry, fishing, and hunting Mining
1307 550
Armed forces
29
§ 1.3 Which state has the fewest interviewees?
table(D$State) %>% sort() %>% head(1)
New Mexico
1102
Which state has the largest number of interviewees?
table(D$State) %>% sort() %>% tail(1)
California
11570
# 或是 table(D$State) %>% sort(dec=T) %>% head(1)§ 1.4 What proportion of interviewees are citizens of the United States?
table(D$Citizenship) %>% prop.table()
Citizen, Native Citizen, Naturalized Non-Citizen
0.88833 0.05387 0.05781
§ 1.5 For which races are there at least 250 interviewees in the CPS dataset of Hispanic ethnicity?
tapply(D$Hispanic, D$Race, sum) %>% sort() # 此方法缺少>250人的條件Pacific Islander Asian American Indian Multiracial Black
77 113 304 448 621
White
16731
table(D$Race, D$Hispanic=='1')[ ,'TRUE'][table(D$Race, D$Hispanic=='1')[ ,'TRUE'] > 250 ]American Indian Black Multiracial White
304 621 448 16731
§ 2.1 Which variables have at least one interviewee with a missing (NA) value?
summary(D) PeopleInHousehold Region State MetroAreaCode Age
Min. : 1.00 Midwest :30684 California :11570 Min. :10420 Min. : 0.0
1st Qu.: 2.00 Northeast:25939 Texas : 7077 1st Qu.:21780 1st Qu.:19.0
Median : 3.00 South :41502 New York : 5595 Median :34740 Median :39.0
Mean : 3.28 West :33177 Florida : 5149 Mean :35075 Mean :38.8
3rd Qu.: 4.00 Pennsylvania: 3930 3rd Qu.:41860 3rd Qu.:57.0
Max. :15.00 Illinois : 3912 Max. :79600 Max. :85.0
(Other) :94069 NA's :34238
Married Sex Education
Divorced :11151 Female:67481 High school :30906
Married :55509 Male :63821 Bachelor's degree :19443
Never Married:30772 Some college, no degree:18863
Separated : 2027 No high school diploma :16095
Widowed : 6505 Associate degree : 9913
NA's :25338 (Other) :10744
NA's :25338
Race Hispanic CountryOfBirthCode
American Indian : 1433 Min. :0.000 Min. : 57.0
Asian : 6520 1st Qu.:0.000 1st Qu.: 57.0
Black : 13913 Median :0.000 Median : 57.0
Multiracial : 2897 Mean :0.139 Mean : 82.7
Pacific Islander: 618 3rd Qu.:0.000 3rd Qu.: 57.0
White :105921 Max. :1.000 Max. :555.0
Citizenship EmploymentStatus
Citizen, Native :116639 Disabled : 5712
Citizen, Naturalized: 7073 Employed :61733
Non-Citizen : 7590 Not in Labor Force:15246
Retired :18619
Unemployed : 4203
NA's :25789
Industry
Educational and health services :15017
Trade : 8933
Professional and business services: 7519
Manufacturing : 6791
Leisure and hospitality : 6364
(Other) :21618
NA's :65060
§ 2.2 Which statements below is the most accurate:
tapply(is.na(D$Married), D$Region, mean) Midwest Northeast South West
0.1980 0.1738 0.1920 0.2046
# 依照'區域'計算具有NA值的比率。[is.na()用來判斷是否有遺漏值NA存在!]
tapply(is.na(D$Married), D$Sex, mean)Female Male
0.1810 0.2056
# 依照'性別'計算具有NA值的比率。
tapply(is.na(D$Married), D$Age, mean) 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 85
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 依照'年齡'計算具有NA值的比率。
tapply(is.na(D$Married), D$Citizenship, mean) Citizen, Native Citizen, Naturalized Non-Citizen
0.21162 0.02305 0.06482
# 依照'公民身分'計算具有NA值的比率。🌻 Married is not applicable for interviewees who is younger than 15 years old. This type of NA occurs systematically. They are not random.
§ 2.3 How many states had all interviewees living in a non-metropolitan area (aka they have a missing MetroAreaCode value)? For this question, treat the District of Columbia as a state (even though it is not technically a state).
t1 <- tapply(is.na(D$MetroAreaCode), D$State, sum)
sum(table(D$State) == t1) # 我自己寫的code[1] 2
tapply(is.na(D$MetroAreaCode), D$State, mean) %>% sortDistrict of Columbia New Jersey Rhode Island California
0.00000 0.00000 0.00000 0.02048
Florida Massachusetts Maryland New York
0.03923 0.06492 0.06938 0.08061
Connecticut Illinois Colorado Arizona
0.08568 0.11222 0.12991 0.13154
Nevada Texas Louisiana Pennsylvania
0.13308 0.14370 0.16138 0.17430
Michigan Washington Georgia Virginia
0.17826 0.18132 0.19843 0.19844
Utah Oregon Delaware New Mexico
0.21010 0.21822 0.23397 0.24501
Hawaii Ohio Alabama Indiana
0.24917 0.25122 0.25872 0.29142
Wisconsin South Carolina Minnesota Oklahoma
0.29933 0.31303 0.31507 0.32764
Missouri Tennessee Kansas North Carolina
0.32867 0.35594 0.36227 0.37304
Iowa Arkansas Idaho Kentucky
0.48695 0.49050 0.49868 0.50679
New Hampshire Nebraska Maine Vermont
0.56875 0.58132 0.59832 0.65238
Mississippi South Dakota North Dakota West Virginia
0.69431 0.70250 0.73739 0.75586
Montana Alaska Wyoming
0.83608 1.00000 1.00000
# 老師用mean是因為若人數相同則比率會=1,此方法較快。但結果是呈現出所有的值,並非計算出加總結果
tapply(is.na(D$MetroAreaCode), D$State, mean)[tapply(is.na(D$MetroAreaCode), D$State, mean)=='1'] Alaska Wyoming
1 1
# 子紘的寫法How many states had all interviewees living in a metropolitan area? Again, treat the District of Columbia as a state.
tapply(!is.na(D$MetroAreaCode), D$State, mean) %>% sort(dec=T) %>% head(3)District of Columbia New Jersey Rhode Island
1 1 1
#按照上一題老師的寫法寫的,抓出!NA值做降冪排列,這寫法的值會=1
tapply(is.na(D$MetroAreaCode), D$State, mean)[tapply(is.na(D$MetroAreaCode), D$State, mean)=='0']District of Columbia New Jersey Rhode Island
0 0 0
# 子紘的寫法,要抓出!NA值所以此寫法的值要=0§ 2.4 Which region of the United States has the largest proportion of interviewees living in a non-metropolitan area?
tapply(is.na(D$MetroAreaCode), D$Region, mean) %>% sort() %>% tail(1)Midwest
0.3479
§ 2.5 Which state has a proportion of interviewees living in a non-metropolitan area closest to 30%?
t3 <- tapply(is.na(D$MetroAreaCode), D$State, mean)
abs(t3*100-30) %>% sort() %>% head(1)Wisconsin
0.06701
# 將t3換成%然後-30,再取abs,排序後得到的最小值=最靠近30%Which state has the largest proportion of non-metropolitan interviewees, ignoring states where all interviewees were non-metropolitan?
tapply(is.na(D$MetroAreaCode), D$State, mean) %>% sort(dec=T) Alaska Wyoming Montana West Virginia
1.00000 1.00000 0.83608 0.75586
North Dakota South Dakota Mississippi Vermont
0.73739 0.70250 0.69431 0.65238
Maine Nebraska New Hampshire Kentucky
0.59832 0.58132 0.56875 0.50679
Idaho Arkansas Iowa North Carolina
0.49868 0.49050 0.48695 0.37304
Kansas Tennessee Missouri Oklahoma
0.36227 0.35594 0.32867 0.32764
Minnesota South Carolina Wisconsin Indiana
0.31507 0.31303 0.29933 0.29142
Alabama Ohio Hawaii New Mexico
0.25872 0.25122 0.24917 0.24501
Delaware Oregon Utah Virginia
0.23397 0.21822 0.21010 0.19844
Georgia Washington Michigan Pennsylvania
0.19843 0.18132 0.17826 0.17430
Louisiana Texas Nevada Arizona
0.16138 0.14370 0.13308 0.13154
Colorado Illinois Connecticut New York
0.12991 0.11222 0.08568 0.08061
Maryland Massachusetts Florida California
0.06938 0.06492 0.03923 0.02048
District of Columbia New Jersey Rhode Island
0.00000 0.00000 0.00000
# 這題答案會是Montana。因為題目說要忽略面試者皆為non-metropolitan的情形🌻 As we can see in the questions above, NA’s are not completely useless. Sometimes they carry useful information.
In this exercise, we have two more data files
data/MetroAreaCodes.csv maps D$MetroAreaCode to the names of the Metro Area’sdata/CountryCodes.csv maps D$CountryCode to the names of the Countriesmetro <- read.csv("data/MetroAreaCodes.csv")
country <- read.csv("data/CountryCodes.csv")§ 3.1 How many observations (codes for metropolitan areas) are there in MetroAreaMap?
nrow(metro)[1] 271
How many observations (codes for countries) are there in CountryMap?
nrow(country)[1] 149
🌻 merge(x, y, by.x, by.y, all.x, all.y) merges two data frames x and y
by.x and by.y specify the merging keys in x and y respectively.all.x and all.y specify whether all rows in x and y should be kept. The defaults are FALSE which implies rows that do not match are removed. When set to TRUE, the unmatched rows stay with the merged columns set to NA.§ 3.2 What is the name of the variable that was added to the data frame by the merge() operation?
D = merge(D, metro, by.x="MetroAreaCode", by.y="Code", all.x=TRUE)
# 用 D這個df中的MetroAreaCode 與 metro中的Code 合併,
# all.x=TRUE代表保留所有D含有的項目(預設值為False=移除掉)How many interviewees have a missing value for the new metropolitan area variable?
sum(is.na(D$MetroArea))[1] 34238
§ 3.3 Which of the following metropolitan areas has the largest number of interviewees?
table(D$MetroArea) %>% sort() %>% tail(1)
New York-Northern New Jersey-Long Island, NY-NJ-PA
5409
§ 3.4 Which metropolitan area has the highest proportion of interviewees of Hispanic ethnicity?
tapply(D$Hispanic, D$MetroArea, mean) %>% sort(dec=T) %>% head(1)Laredo, TX
0.9663
§ 3.5 Determine the number of metropolitan areas in the United States from which at least 20% of interviewees are Asian.
tapply(D$Race=='Asian', D$MetroArea, mean) %>% sort(dec=T) %>% head() Honolulu, HI San Francisco-Oakland-Fremont, CA
0.5019 0.2468
San Jose-Sunnyvale-Santa Clara, CA Vallejo-Fairfield, CA
0.2418 0.2030
Fresno, CA Warner Robins, GA
0.1848 0.1667
🌻 All mathematical and logical function return NA if there is any NA in their input vector(s). As an remedy, the na.rm=TRUE argument can exclude the NA’s from the input before calculation.
§ 3.6 Passing na.rm=TRUE to the tapply function, determine which metropolitan area has the smallest proportion of interviewees who have received no high school diploma.
tapply(D$Education=='No high school diploma', D$MetroArea, mean, na.rm=TRUE ) %>% sort() %>% head(1)Iowa City, IA
0.02913
§ 4.1 What is the name of the variable added to the CPS data frame by this merge operation?
D = merge(D, country, by.x="CountryOfBirthCode", by.y="Code", all.x=TRUE)How many interviewees have a missing value for the new Country variable?
sum(is.na(D$Country))[1] 176
§ 4.2 Among all interviewees born outside of North America, which country was the most common place of birth?
table(D$Country) %>% sort(dec=T) %>% head()
United States Mexico Philippines India China Puerto Rico
115063 3921 839 770 581 518
# 這題題目指定要出生在北美以外的地方,所以答案應該是Mexico§ 4.3 What proportion of the interviewees from the “New York-Northern New Jersey-Long Island, NY-NJ-PA” metropolitan area have a country of birth that is not the United States?
table(D$Country[D$MetroArea=='New York-Northern New Jersey-Long Island, NY-NJ-PA'] == 'United States') %>% prop.table()
FALSE TRUE
0.3087 0.6913
§ 4.4 Which metropolitan area has the largest number (note – not proportion) of interviewees with a country of birth in India?
tapply(D$Country=='India', D$MetroArea, sum) %>% sort(dec=T) %>% head(1)Detroit-Warren-Livonia, MI
30
In Brazil?
tapply(D$Country=='Brazil', D$MetroArea, sum) %>% sort(dec=T) %>% head(1)Boston-Cambridge-Quincy, MA-NH
18
In Somalia?
tapply(D$Country=='Somalia', D$MetroArea, sum) %>% sort(dec=T) %>% head(1)Phoenix-Mesa-Scottsdale, AZ
7