The major data is in data/MetroAreaCodes.csv. We read it into a data frame D.

D = read.csv("data/CPSData.csv", stringsAsFactors = TRUE)
# stringsAsFactors = TRUE代表將data中的'character字串'轉為'factor類別'

D 之中每一筆紀錄是一個受測對象,每一個欄位是一個調查問卷題項:


🌷 這份作業的學習重點是:


summary(D)
 PeopleInHousehold       Region               State       MetroAreaCode        Age      
 Min.   : 1.00     Midwest  :30684   California  :11570   Min.   :10420   Min.   : 0.0  
 1st Qu.: 2.00     Northeast:25939   Texas       : 7077   1st Qu.:21780   1st Qu.:19.0  
 Median : 3.00     South    :41502   New York    : 5595   Median :34740   Median :39.0  
 Mean   : 3.28     West     :33177   Florida     : 5149   Mean   :35075   Mean   :38.8  
 3rd Qu.: 4.00                       Pennsylvania: 3930   3rd Qu.:41860   3rd Qu.:57.0  
 Max.   :15.00                       Illinois    : 3912   Max.   :79600   Max.   :85.0  
                                     (Other)     :94069   NA's   :34238                 
          Married          Sex                          Education    
 Divorced     :11151   Female:67481   High school            :30906  
 Married      :55509   Male  :63821   Bachelor's degree      :19443  
 Never Married:30772                  Some college, no degree:18863  
 Separated    : 2027                  No high school diploma :16095  
 Widowed      : 6505                  Associate degree       : 9913  
 NA's         :25338                  (Other)                :10744  
                                      NA's                   :25338  
               Race           Hispanic     CountryOfBirthCode
 American Indian :  1433   Min.   :0.000   Min.   : 57.0     
 Asian           :  6520   1st Qu.:0.000   1st Qu.: 57.0     
 Black           : 13913   Median :0.000   Median : 57.0     
 Multiracial     :  2897   Mean   :0.139   Mean   : 82.7     
 Pacific Islander:   618   3rd Qu.:0.000   3rd Qu.: 57.0     
 White           :105921   Max.   :1.000   Max.   :555.0     
                                                             
               Citizenship               EmploymentStatus
 Citizen, Native     :116639   Disabled          : 5712  
 Citizen, Naturalized:  7073   Employed          :61733  
 Non-Citizen         :  7590   Not in Labor Force:15246  
                               Retired           :18619  
                               Unemployed        : 4203  
                               NA's              :25789  
                                                         
                               Industry    
 Educational and health services   :15017  
 Trade                             : 8933  
 Professional and business services: 7519  
 Manufacturing                     : 6791  
 Leisure and hospitality           : 6364  
 (Other)                           :21618  
 NA's                              :65060  
# summary只能針對'類別'or'數值'變數分別計算其統計值,字串是無法計算的!!

In the summary we can see that there are 34,238 NA’s in MetroAreaCode. Yet be careful …

🌷 summary does not shows the numbers of NA in the character columns.

The best way to examine NA in a data frame is to pipe is.na() into colSums().

is.na(D) %>% colSums  # 計算每個column中有多少個NA值
 PeopleInHousehold             Region              State      MetroAreaCode 
                 0                  0                  0              34238 
               Age            Married                Sex          Education 
                 0              25338                  0              25338 
              Race           Hispanic CountryOfBirthCode        Citizenship 
                 0                  0                  0                  0 
  EmploymentStatus           Industry 
             25789              65060 

now we see there are NA’s in Married, Education, EmploymentStatus and Industry. In the case of census, NA may occur when question items are …



Section-1 Loading and Summarizing the Dataset

§ 1.1 How many interviewees are in the dataset?

nrow(D)
[1] 131302

§ 1.2 Among the interviewees with a value reported for the Industry variable, what is the most common industry of employment? Please enter the name exactly how you see it.

table(D$Industry) %>% sort(dec=T) %>% head(1)

Educational and health services 
                          15017 
summary(D$Industry) %>% sort(dec=T) # 計算所有變數之總數(包含NA's)
                                       NA's             Educational and health services 
                                      65060                                       15017 
                                      Trade          Professional and business services 
                                       8933                                        7519 
                              Manufacturing                     Leisure and hospitality 
                                       6791                                        6364 
                               Construction                                   Financial 
                                       4387                                        4347 
               Transportation and utilities                              Other services 
                                       3260                                        3224 
                      Public administration                                 Information 
                                       3186                                        1328 
Agriculture, forestry, fishing, and hunting                                      Mining 
                                       1307                                         550 
                               Armed forces 
                                         29 

§ 1.3 Which state has the fewest interviewees?

table(D$State) %>% sort() %>% head(1)

New Mexico 
      1102 

Which state has the largest number of interviewees?

table(D$State) %>% sort() %>% tail(1)

California 
     11570 
# 或是 table(D$State) %>% sort(dec=T) %>% head(1)

§ 1.4 What proportion of interviewees are citizens of the United States?

table(D$Citizenship) %>% prop.table()

     Citizen, Native Citizen, Naturalized          Non-Citizen 
             0.88833              0.05387              0.05781 

§ 1.5 For which races are there at least 250 interviewees in the CPS dataset of Hispanic ethnicity?

tapply(D$Hispanic, D$Race, sum) %>% sort() # 此方法缺少>250人的條件
Pacific Islander            Asian  American Indian      Multiracial            Black 
              77              113              304              448              621 
           White 
           16731 
table(D$Race, D$Hispanic=='1')[ ,'TRUE'][table(D$Race, D$Hispanic=='1')[ ,'TRUE'] > 250 ]
American Indian           Black     Multiracial           White 
            304             621             448           16731 



Section-2 Evaluating Missing Values

§ 2.1 Which variables have at least one interviewee with a missing (NA) value?

summary(D)
 PeopleInHousehold       Region               State       MetroAreaCode        Age      
 Min.   : 1.00     Midwest  :30684   California  :11570   Min.   :10420   Min.   : 0.0  
 1st Qu.: 2.00     Northeast:25939   Texas       : 7077   1st Qu.:21780   1st Qu.:19.0  
 Median : 3.00     South    :41502   New York    : 5595   Median :34740   Median :39.0  
 Mean   : 3.28     West     :33177   Florida     : 5149   Mean   :35075   Mean   :38.8  
 3rd Qu.: 4.00                       Pennsylvania: 3930   3rd Qu.:41860   3rd Qu.:57.0  
 Max.   :15.00                       Illinois    : 3912   Max.   :79600   Max.   :85.0  
                                     (Other)     :94069   NA's   :34238                 
          Married          Sex                          Education    
 Divorced     :11151   Female:67481   High school            :30906  
 Married      :55509   Male  :63821   Bachelor's degree      :19443  
 Never Married:30772                  Some college, no degree:18863  
 Separated    : 2027                  No high school diploma :16095  
 Widowed      : 6505                  Associate degree       : 9913  
 NA's         :25338                  (Other)                :10744  
                                      NA's                   :25338  
               Race           Hispanic     CountryOfBirthCode
 American Indian :  1433   Min.   :0.000   Min.   : 57.0     
 Asian           :  6520   1st Qu.:0.000   1st Qu.: 57.0     
 Black           : 13913   Median :0.000   Median : 57.0     
 Multiracial     :  2897   Mean   :0.139   Mean   : 82.7     
 Pacific Islander:   618   3rd Qu.:0.000   3rd Qu.: 57.0     
 White           :105921   Max.   :1.000   Max.   :555.0     
                                                             
               Citizenship               EmploymentStatus
 Citizen, Native     :116639   Disabled          : 5712  
 Citizen, Naturalized:  7073   Employed          :61733  
 Non-Citizen         :  7590   Not in Labor Force:15246  
                               Retired           :18619  
                               Unemployed        : 4203  
                               NA's              :25789  
                                                         
                               Industry    
 Educational and health services   :15017  
 Trade                             : 8933  
 Professional and business services: 7519  
 Manufacturing                     : 6791  
 Leisure and hospitality           : 6364  
 (Other)                           :21618  
 NA's                              :65060  

§ 2.2 Which statements below is the most accurate:

tapply(is.na(D$Married), D$Region, mean)
  Midwest Northeast     South      West 
   0.1980    0.1738    0.1920    0.2046 
# 依照'區域'計算具有NA值的比率。[is.na()用來判斷是否有遺漏值NA存在!]
tapply(is.na(D$Married), D$Sex, mean)
Female   Male 
0.1810 0.2056 
# 依照'性別'計算具有NA值的比率。
tapply(is.na(D$Married), D$Age, mean)
 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 
 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 
 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 85 
 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
# 依照'年齡'計算具有NA值的比率。
tapply(is.na(D$Married), D$Citizenship, mean)
     Citizen, Native Citizen, Naturalized          Non-Citizen 
             0.21162              0.02305              0.06482 
# 依照'公民身分'計算具有NA值的比率。

🌻 Married is not applicable for interviewees who is younger than 15 years old. This type of NA occurs systematically. They are not random.

§ 2.3 How many states had all interviewees living in a non-metropolitan area (aka they have a missing MetroAreaCode value)? For this question, treat the District of Columbia as a state (even though it is not technically a state).

t1 <- tapply(is.na(D$MetroAreaCode), D$State, sum)
sum(table(D$State) == t1)  # 我自己寫的code
[1] 2
tapply(is.na(D$MetroAreaCode), D$State, mean) %>% sort
District of Columbia           New Jersey         Rhode Island           California 
             0.00000              0.00000              0.00000              0.02048 
             Florida        Massachusetts             Maryland             New York 
             0.03923              0.06492              0.06938              0.08061 
         Connecticut             Illinois             Colorado              Arizona 
             0.08568              0.11222              0.12991              0.13154 
              Nevada                Texas            Louisiana         Pennsylvania 
             0.13308              0.14370              0.16138              0.17430 
            Michigan           Washington              Georgia             Virginia 
             0.17826              0.18132              0.19843              0.19844 
                Utah               Oregon             Delaware           New Mexico 
             0.21010              0.21822              0.23397              0.24501 
              Hawaii                 Ohio              Alabama              Indiana 
             0.24917              0.25122              0.25872              0.29142 
           Wisconsin       South Carolina            Minnesota             Oklahoma 
             0.29933              0.31303              0.31507              0.32764 
            Missouri            Tennessee               Kansas       North Carolina 
             0.32867              0.35594              0.36227              0.37304 
                Iowa             Arkansas                Idaho             Kentucky 
             0.48695              0.49050              0.49868              0.50679 
       New Hampshire             Nebraska                Maine              Vermont 
             0.56875              0.58132              0.59832              0.65238 
         Mississippi         South Dakota         North Dakota        West Virginia 
             0.69431              0.70250              0.73739              0.75586 
             Montana               Alaska              Wyoming 
             0.83608              1.00000              1.00000 
# 老師用mean是因為若人數相同則比率會=1,此方法較快。但結果是呈現出所有的值,並非計算出加總結果

tapply(is.na(D$MetroAreaCode), D$State, mean)[tapply(is.na(D$MetroAreaCode), D$State, mean)=='1']
 Alaska Wyoming 
      1       1 
# 子紘的寫法

How many states had all interviewees living in a metropolitan area? Again, treat the District of Columbia as a state.

tapply(!is.na(D$MetroAreaCode), D$State, mean) %>% sort(dec=T) %>% head(3)
District of Columbia           New Jersey         Rhode Island 
                   1                    1                    1 
#按照上一題老師的寫法寫的,抓出!NA值做降冪排列,這寫法的值會=1

tapply(is.na(D$MetroAreaCode), D$State, mean)[tapply(is.na(D$MetroAreaCode), D$State, mean)=='0']
District of Columbia           New Jersey         Rhode Island 
                   0                    0                    0 
# 子紘的寫法,要抓出!NA值所以此寫法的值要=0

§ 2.4 Which region of the United States has the largest proportion of interviewees living in a non-metropolitan area?

tapply(is.na(D$MetroAreaCode), D$Region, mean) %>% sort() %>% tail(1)
Midwest 
 0.3479 

§ 2.5 Which state has a proportion of interviewees living in a non-metropolitan area closest to 30%?

t3 <- tapply(is.na(D$MetroAreaCode), D$State, mean)
abs(t3*100-30) %>% sort() %>% head(1)
Wisconsin 
  0.06701 
# 將t3換成%然後-30,再取abs,排序後得到的最小值=最靠近30%

Which state has the largest proportion of non-metropolitan interviewees, ignoring states where all interviewees were non-metropolitan?

tapply(is.na(D$MetroAreaCode), D$State, mean) %>% sort(dec=T)
              Alaska              Wyoming              Montana        West Virginia 
             1.00000              1.00000              0.83608              0.75586 
        North Dakota         South Dakota          Mississippi              Vermont 
             0.73739              0.70250              0.69431              0.65238 
               Maine             Nebraska        New Hampshire             Kentucky 
             0.59832              0.58132              0.56875              0.50679 
               Idaho             Arkansas                 Iowa       North Carolina 
             0.49868              0.49050              0.48695              0.37304 
              Kansas            Tennessee             Missouri             Oklahoma 
             0.36227              0.35594              0.32867              0.32764 
           Minnesota       South Carolina            Wisconsin              Indiana 
             0.31507              0.31303              0.29933              0.29142 
             Alabama                 Ohio               Hawaii           New Mexico 
             0.25872              0.25122              0.24917              0.24501 
            Delaware               Oregon                 Utah             Virginia 
             0.23397              0.21822              0.21010              0.19844 
             Georgia           Washington             Michigan         Pennsylvania 
             0.19843              0.18132              0.17826              0.17430 
           Louisiana                Texas               Nevada              Arizona 
             0.16138              0.14370              0.13308              0.13154 
            Colorado             Illinois          Connecticut             New York 
             0.12991              0.11222              0.08568              0.08061 
            Maryland        Massachusetts              Florida           California 
             0.06938              0.06492              0.03923              0.02048 
District of Columbia           New Jersey         Rhode Island 
             0.00000              0.00000              0.00000 
# 這題答案會是Montana。因為題目說要忽略面試者皆為non-metropolitan的情形

🌻 As we can see in the questions above, NA’s are not completely useless. Sometimes they carry useful information.



Section-3 Integrating Metropolitan Area Data

In this exercise, we have two more data files

metro <- read.csv("data/MetroAreaCodes.csv")
country <- read.csv("data/CountryCodes.csv")

§ 3.1 How many observations (codes for metropolitan areas) are there in MetroAreaMap?

nrow(metro)
[1] 271

How many observations (codes for countries) are there in CountryMap?

nrow(country)
[1] 149

🌻 merge(x, y, by.x, by.y, all.x, all.y) merges two data frames x and y

§ 3.2 What is the name of the variable that was added to the data frame by the merge() operation?

D = merge(D, metro, by.x="MetroAreaCode", by.y="Code", all.x=TRUE)
# 用 D這個df中的MetroAreaCode 與 metro中的Code 合併,
# all.x=TRUE代表保留所有D含有的項目(預設值為False=移除掉)

How many interviewees have a missing value for the new metropolitan area variable?

sum(is.na(D$MetroArea))
[1] 34238

§ 3.3 Which of the following metropolitan areas has the largest number of interviewees?

table(D$MetroArea) %>% sort() %>% tail(1)

New York-Northern New Jersey-Long Island, NY-NJ-PA 
                                              5409 

§ 3.4 Which metropolitan area has the highest proportion of interviewees of Hispanic ethnicity?

tapply(D$Hispanic, D$MetroArea, mean) %>% sort(dec=T) %>% head(1)
Laredo, TX 
    0.9663 

§ 3.5 Determine the number of metropolitan areas in the United States from which at least 20% of interviewees are Asian.

tapply(D$Race=='Asian', D$MetroArea, mean) %>% sort(dec=T) %>% head()
                      Honolulu, HI  San Francisco-Oakland-Fremont, CA 
                            0.5019                             0.2468 
San Jose-Sunnyvale-Santa Clara, CA              Vallejo-Fairfield, CA 
                            0.2418                             0.2030 
                        Fresno, CA                  Warner Robins, GA 
                            0.1848                             0.1667 

🌻 All mathematical and logical function return NA if there is any NA in their input vector(s). As an remedy, the na.rm=TRUE argument can exclude the NA’s from the input before calculation.

§ 3.6 Passing na.rm=TRUE to the tapply function, determine which metropolitan area has the smallest proportion of interviewees who have received no high school diploma.

tapply(D$Education=='No high school diploma', D$MetroArea, mean,   na.rm=TRUE ) %>% sort() %>% head(1)
Iowa City, IA 
      0.02913 



Section-4 Integrating Country of Birth Data

§ 4.1 What is the name of the variable added to the CPS data frame by this merge operation?

D = merge(D, country, by.x="CountryOfBirthCode", by.y="Code", all.x=TRUE)

How many interviewees have a missing value for the new Country variable?

sum(is.na(D$Country))
[1] 176

§ 4.2 Among all interviewees born outside of North America, which country was the most common place of birth?

table(D$Country) %>% sort(dec=T) %>% head()

United States        Mexico   Philippines         India         China   Puerto Rico 
       115063          3921           839           770           581           518 
# 這題題目指定要出生在北美以外的地方,所以答案應該是Mexico

§ 4.3 What proportion of the interviewees from the “New York-Northern New Jersey-Long Island, NY-NJ-PA” metropolitan area have a country of birth that is not the United States?

table(D$Country[D$MetroArea=='New York-Northern New Jersey-Long Island, NY-NJ-PA'] == 'United States') %>% prop.table()

 FALSE   TRUE 
0.3087 0.6913 

§ 4.4 Which metropolitan area has the largest number (note – not proportion) of interviewees with a country of birth in India?

tapply(D$Country=='India', D$MetroArea, sum) %>% sort(dec=T) %>% head(1)
Detroit-Warren-Livonia, MI 
                        30 

In Brazil?

tapply(D$Country=='Brazil', D$MetroArea, sum) %>% sort(dec=T) %>% head(1)
Boston-Cambridge-Quincy, MA-NH 
                            18 

In Somalia?

tapply(D$Country=='Somalia', D$MetroArea, sum) %>% sort(dec=T) %>% head(1)
Phoenix-Mesa-Scottsdale, AZ 
                          7