Basic notebook setup: install and load the packages used below.

rm(list=ls(all=T))
knitr::opts_chunk$set(comment = NA)
knitr::opts_knit$set(global.par = TRUE)
par(cex=0.8); options(scipen=20, digits=4, width=90)
if(!require(pacman)) install.packages("pacman")
pacman::p_load(dplyr)

Please do not modify the setup code above.


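Employment statistics are among the most important indicators policymakers use to gauge the overall strength of the economy. In the United States, the government measures unemployment with the Current Population Survey (CPS), which collects demographic and employment information from a wide range of Americans every month. In this exercise we apply the topics reviewed in the lectures to the September 2013 edition of this nationally representative dataset. Each observation represents a person who actually completed the September 2013 CPS. The full dataset has 385 fields, but this exercise uses a reduced version, CPSData.csv, which has the following fields: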




Section-1 Loading and Summarizing the Dataset

§ 1.1 How many interviewees are in the dataset?

A=read.csv('data/CPSData.csv')
MetroAreaMap=read.csv('data/MetroAreaCodes.csv')
CountryMap = read.csv('data/CountryCodes.csv')
nrow(A)
[1] 131302

§ 1.2 Among the interviewees with a value reported for the Industry variable, what is the most common industry of employment? Please enter the name exactly how you see it.

table(A$Industry) %>% sort

                               Armed forces                                      Mining 
                                         29                                         550 
Agriculture, forestry, fishing, and hunting                                 Information 
                                       1307                                        1328 
                      Public administration                              Other services 
                                       3186                                        3224 
               Transportation and utilities                                   Financial 
                                       3260                                        4347 
                               Construction                     Leisure and hospitality 
                                       4387                                        6364 
                              Manufacturing          Professional and business services 
                                       6791                                        7519 
                                      Trade             Educational and health services 
                                       8933                                       15017 
#Educational and health services

§ 1.3 Which state has the fewest interviewees?

table(A$State) %>% sort %>% head

   New Mexico       Montana   Mississippi       Alabama West Virginia      Arkansas 
         1102          1214          1230          1376          1409          1421 
#New Mexico 

Which state has the largest number of interviewees?

table(A$State) %>% sort %>% tail

    Illinois Pennsylvania      Florida     New York        Texas   California 
        3912         3930         5149         5595         7077        11570 
#California 

§ 1.4 What proportion of interviewees are citizens of the United States?

table(A$Citizenship) %>% prop.table

     Citizen, Native Citizen, Naturalized          Non-Citizen 
             0.88833              0.05387              0.05781 
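
The question asks about citizens as a whole, so the native and naturalized shares have to be added together. A minimal sketch using the proportions printed above (the names `p` and `citizen_share` are hypothetical, introduced only for this illustration):

```r
# Proportions copied from the prop.table output above
p <- c("Citizen, Native"      = 0.88833,
       "Citizen, Naturalized" = 0.05387,
       "Non-Citizen"          = 0.05781)

# Share of interviewees who are citizens (native or naturalized)
citizen_share <- sum(p[c("Citizen, Native", "Citizen, Naturalized")])
citizen_share
```

The same value can be obtained as one minus the Non-Citizen share, up to rounding in the printed output.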

§ 1.5 For which races are there at least 250 interviewees in the CPS dataset of Hispanic ethnicity? (Select all that apply.)

tapply(A$Hispanic, A$Race,sum) %>% sort
Pacific Islander            Asian  American Indian      Multiracial            Black 
              77              113              304              448              621 
           White 
           16731 




Section-2 Evaluating Missing Values

§ 2.1 Which variables have at least one interviewee with a missing (NA) value? (Select all that apply.)

summary(A)
 PeopleInHousehold    Region             State           MetroAreaCode        Age      
 Min.   : 1.00     Length:131302      Length:131302      Min.   :10420   Min.   : 0.0  
 1st Qu.: 2.00     Class :character   Class :character   1st Qu.:21780   1st Qu.:19.0  
 Median : 3.00     Mode  :character   Mode  :character   Median :34740   Median :39.0  
 Mean   : 3.28                                           Mean   :35075   Mean   :38.8  
 3rd Qu.: 4.00                                           3rd Qu.:41860   3rd Qu.:57.0  
 Max.   :15.00                                           Max.   :79600   Max.   :85.0  
                                                         NA's   :34238                 
   Married              Sex             Education             Race          
 Length:131302      Length:131302      Length:131302      Length:131302     
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
    Hispanic     CountryOfBirthCode Citizenship        EmploymentStatus  
 Min.   :0.000   Min.   : 57.0      Length:131302      Length:131302     
 1st Qu.:0.000   1st Qu.: 57.0      Class :character   Class :character  
 Median :0.000   Median : 57.0      Mode  :character   Mode  :character  
 Mean   :0.139   Mean   : 82.7                                           
 3rd Qu.:0.000   3rd Qu.: 57.0                                           
 Max.   :1.000   Max.   :555.0                                           
                                                                         
   Industry        
 Length:131302     
 Class :character  
 Mode  :character  
                   
                   
                   
                   

§ 2.2 Which is the most accurate:

tapply(is.na(A$Married), A$Region, mean)
  Midwest Northeast     South      West 
   0.1980    0.1738    0.1920    0.2046 
tapply(is.na(A$Married), A$Sex, mean)
Female   Male 
0.1810 0.2056 
tapply(is.na(A$Married), A$Age, mean)
 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 
 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 
 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 85 
 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
tapply(is.na(A$Married), A$Citizenship, mean)
     Citizen, Native Citizen, Naturalized          Non-Citizen 
             0.21162              0.02305              0.06482 
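
The nonzero shares above show that Married contains missing values, yet the summary(A) output in § 2.1 lists NA's only for MetroAreaCode. For character columns that output shows just Length, Class, and Mode, so missing values in them are not visible there. A minimal sketch of counting NAs per column directly, on a made-up data frame (`toy`, `na_counts`, and `cols_with_na` are hypothetical names, and the values are invented for illustration):

```r
# Toy data frame standing in for A; values are made up
toy <- data.frame(
  Age     = c(10, 35, 62, 48),
  Married = c(NA, "Married", "Divorced", NA),
  Sex     = c("Male", "Female", "Female", "Male"),
  stringsAsFactors = FALSE
)

# is.na() on a data frame gives a logical matrix; colSums() counts TRUE per column
na_counts <- colSums(is.na(toy))
na_counts

# Names of the columns that contain at least one NA
cols_with_na <- names(na_counts)[na_counts > 0]
cols_with_na
```

Applied to the real data, the equivalent call would be `colSums(is.na(A))`.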

§ 2.3 How many states had all interviewees living in a non-metropolitan area (aka they have a missing MetroAreaCode value)? For this question, treat the District of Columbia as a state (even though it is not technically a state).

tapply(is.na(A$MetroAreaCode), A$State,mean) %>% sort
District of Columbia           New Jersey         Rhode Island           California 
             0.00000              0.00000              0.00000              0.02048 
             Florida        Massachusetts             Maryland             New York 
             0.03923              0.06492              0.06938              0.08061 
         Connecticut             Illinois             Colorado              Arizona 
             0.08568              0.11222              0.12991              0.13154 
              Nevada                Texas            Louisiana         Pennsylvania 
             0.13308              0.14370              0.16138              0.17430 
            Michigan           Washington              Georgia             Virginia 
             0.17826              0.18132              0.19843              0.19844 
                Utah               Oregon             Delaware           New Mexico 
             0.21010              0.21822              0.23397              0.24501 
              Hawaii                 Ohio              Alabama              Indiana 
             0.24917              0.25122              0.25872              0.29142 
           Wisconsin       South Carolina            Minnesota             Oklahoma 
             0.29933              0.31303              0.31507              0.32764 
            Missouri            Tennessee               Kansas       North Carolina 
             0.32867              0.35594              0.36227              0.37304 
                Iowa             Arkansas                Idaho             Kentucky 
             0.48695              0.49050              0.49868              0.50679 
       New Hampshire             Nebraska                Maine              Vermont 
             0.56875              0.58132              0.59832              0.65238 
         Mississippi         South Dakota         North Dakota        West Virginia 
             0.69431              0.70250              0.73739              0.75586 
             Montana               Alaska              Wyoming 
             0.83608              1.00000              1.00000 

How many states had all interviewees living in a metropolitan area? Again, treat the District of Columbia as a state.

tapply(is.na(A$MetroAreaCode), A$State, mean) %>% sort
District of Columbia           New Jersey         Rhode Island           California 
             0.00000              0.00000              0.00000              0.02048 
             Florida        Massachusetts             Maryland             New York 
             0.03923              0.06492              0.06938              0.08061 
         Connecticut             Illinois             Colorado              Arizona 
             0.08568              0.11222              0.12991              0.13154 
              Nevada                Texas            Louisiana         Pennsylvania 
             0.13308              0.14370              0.16138              0.17430 
            Michigan           Washington              Georgia             Virginia 
             0.17826              0.18132              0.19843              0.19844 
                Utah               Oregon             Delaware           New Mexico 
             0.21010              0.21822              0.23397              0.24501 
              Hawaii                 Ohio              Alabama              Indiana 
             0.24917              0.25122              0.25872              0.29142 
           Wisconsin       South Carolina            Minnesota             Oklahoma 
             0.29933              0.31303              0.31507              0.32764 
            Missouri            Tennessee               Kansas       North Carolina 
             0.32867              0.35594              0.36227              0.37304 
                Iowa             Arkansas                Idaho             Kentucky 
             0.48695              0.49050              0.49868              0.50679 
       New Hampshire             Nebraska                Maine              Vermont 
             0.56875              0.58132              0.59832              0.65238 
         Mississippi         South Dakota         North Dakota        West Virginia 
             0.69431              0.70250              0.73739              0.75586 
             Montana               Alaska              Wyoming 
             0.83608              1.00000              1.00000 
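
Both counts in § 2.3 can be read off the sorted table (the entries at exactly 0 and exactly 1), or computed directly. A small sketch on made-up proportions (the vector `p_missing` and the state labels are hypothetical):

```r
# Hypothetical per-state share of interviewees with a missing MetroAreaCode
p_missing <- c(StateA = 0, StateB = 0.25, StateC = 1, StateD = 0, StateE = 0.6)

# States where every interviewee is non-metropolitan (share is exactly 1)
n_all_nonmetro <- sum(p_missing == 1)

# States where every interviewee is metropolitan (share is exactly 0)
n_all_metro <- sum(p_missing == 0)

c(all_nonmetro = n_all_nonmetro, all_metro = n_all_metro)
```

With the real data, `p_missing` would be the result of `tapply(is.na(A$MetroAreaCode), A$State, mean)`. Exact comparison with 0 and 1 is safe here because the mean of an all-FALSE or all-TRUE vector is exactly 0 or 1.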

§ 2.4 Which region of the United States has the largest proportion of interviewees living in a non-metropolitan area?

tapply(is.na(A$MetroAreaCode), A$Region, mean) %>% sort
Northeast     South      West   Midwest 
   0.2162    0.2378    0.2437    0.3479 

§ 2.5 Which state has a proportion of interviewees living in a non-metropolitan area closest to 30%?

tapply(is.na(A$MetroAreaCode),A$State,mean) %>% sort
District of Columbia           New Jersey         Rhode Island           California 
             0.00000              0.00000              0.00000              0.02048 
             Florida        Massachusetts             Maryland             New York 
             0.03923              0.06492              0.06938              0.08061 
         Connecticut             Illinois             Colorado              Arizona 
             0.08568              0.11222              0.12991              0.13154 
              Nevada                Texas            Louisiana         Pennsylvania 
             0.13308              0.14370              0.16138              0.17430 
            Michigan           Washington              Georgia             Virginia 
             0.17826              0.18132              0.19843              0.19844 
                Utah               Oregon             Delaware           New Mexico 
             0.21010              0.21822              0.23397              0.24501 
              Hawaii                 Ohio              Alabama              Indiana 
             0.24917              0.25122              0.25872              0.29142 
           Wisconsin       South Carolina            Minnesota             Oklahoma 
             0.29933              0.31303              0.31507              0.32764 
            Missouri            Tennessee               Kansas       North Carolina 
             0.32867              0.35594              0.36227              0.37304 
                Iowa             Arkansas                Idaho             Kentucky 
             0.48695              0.49050              0.49868              0.50679 
       New Hampshire             Nebraska                Maine              Vermont 
             0.56875              0.58132              0.59832              0.65238 
         Mississippi         South Dakota         North Dakota        West Virginia 
             0.69431              0.70250              0.73739              0.75586 
             Montana               Alaska              Wyoming 
             0.83608              1.00000              1.00000 

Which state has the largest proportion of non-metropolitan interviewees, ignoring states where all interviewees were non-metropolitan?

tapply(is.na(A$MetroAreaCode), A$State, mean) %>% sort
District of Columbia           New Jersey         Rhode Island           California 
             0.00000              0.00000              0.00000              0.02048 
             Florida        Massachusetts             Maryland             New York 
             0.03923              0.06492              0.06938              0.08061 
         Connecticut             Illinois             Colorado              Arizona 
             0.08568              0.11222              0.12991              0.13154 
              Nevada                Texas            Louisiana         Pennsylvania 
             0.13308              0.14370              0.16138              0.17430 
            Michigan           Washington              Georgia             Virginia 
             0.17826              0.18132              0.19843              0.19844 
                Utah               Oregon             Delaware           New Mexico 
             0.21010              0.21822              0.23397              0.24501 
              Hawaii                 Ohio              Alabama              Indiana 
             0.24917              0.25122              0.25872              0.29142 
           Wisconsin       South Carolina            Minnesota             Oklahoma 
             0.29933              0.31303              0.31507              0.32764 
            Missouri            Tennessee               Kansas       North Carolina 
             0.32867              0.35594              0.36227              0.37304 
                Iowa             Arkansas                Idaho             Kentucky 
             0.48695              0.49050              0.49868              0.50679 
       New Hampshire             Nebraska                Maine              Vermont 
             0.56875              0.58132              0.59832              0.65238 
         Mississippi         South Dakota         North Dakota        West Virginia 
             0.69431              0.70250              0.73739              0.75586 
             Montana               Alaska              Wyoming 
             0.83608              1.00000              1.00000 
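
Rather than scanning the full table, both parts of § 2.5 can be answered with `which.min()` and `which.max()`. A sketch on made-up proportions (`p_missing` and the state labels are hypothetical):

```r
# Hypothetical per-state share of non-metropolitan interviewees
p_missing <- c(StateA = 0.10, StateB = 0.295, StateC = 0.32, StateD = 0.84, StateE = 1)

# State whose share is closest to 30%
closest_to_30 <- names(which.min(abs(p_missing - 0.30)))

# Largest share among states that are not entirely non-metropolitan
largest_below_1 <- names(which.max(p_missing[p_missing < 1]))

closest_to_30
largest_below_1
```

With the real data, `p_missing` would be `tapply(is.na(A$MetroAreaCode), A$State, mean)`.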




Section-3 Integrating Metropolitan Area Data

§ 3.1 How many observations (codes for metropolitan areas) are there in MetroAreaMap?

nrow(MetroAreaMap)
[1] 271

How many observations (codes for countries) are there in CountryMap?

nrow(CountryMap)
[1] 149

§ 3.2 What is the name of the variable that was added to the data frame by the merge() operation?

A = merge(A, CountryMap, by.x="CountryOfBirthCode", by.y="Code", all.x=TRUE)
A = merge(A, MetroAreaMap, by.x="MetroAreaCode", by.y="Code", all.x=TRUE)
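
To see what `all.x=TRUE` does in these merges, here is a minimal sketch on made-up tables (`people`, `codes`, and the area names are hypothetical). Every row of the left table is kept, and rows whose code has no match in the lookup table get NA in the new column:

```r
# Toy left table: one person has no metropolitan code
people <- data.frame(id = 1:4,
                     MetroAreaCode = c(10420, NA, 34740, 10420))

# Toy lookup table mapping codes to names
codes <- data.frame(Code = c(10420, 34740),
                    MetroArea = c("Area X", "Area Y"))

# Left join: keep all rows of people, attach MetroArea where the code matches
merged <- merge(people, codes, by.x = "MetroAreaCode", by.y = "Code", all.x = TRUE)
merged

# The column added by the merge, and how many rows did not find a match
new_column <- setdiff(names(merged), names(people))
n_unmatched <- sum(is.na(merged$MetroArea))
```

Note that `merge()` also reorders the rows by the key column, so the row order of the result generally differs from the input.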

How many interviewees have a missing value for the new metropolitan area variable?

sum(is.na(A$MetroArea))
[1] 34238

§ 3.3 Which of the following metropolitan areas has the largest number of interviewees?

table(A$MetroArea) %>% sort %>% tail

              Providence-Fall River-Warwick, MA-RI 
                                              2284 
               Chicago-Naperville-Joliet, IN-IN-WI 
                                              2772 
          Philadelphia-Camden-Wilmington, PA-NJ-DE 
                                              2855 
              Los Angeles-Long Beach-Santa Ana, CA 
                                              4102 
      Washington-Arlington-Alexandria, DC-VA-MD-WV 
                                              4177 
New York-Northern New Jersey-Long Island, NY-NJ-PA 
                                              5409 

§ 3.4 Which metropolitan area has the highest proportion of interviewees of Hispanic ethnicity?

tapply(A$Hispanic,A$MetroArea,mean) %>% sort %>% tail
           San Antonio, TX              El Centro, CA                El Paso, TX 
                    0.6442                     0.6869                     0.7910 
 Brownsville-Harlingen, TX McAllen-Edinburg-Pharr, TX                 Laredo, TX 
                    0.7975                     0.9487                     0.9663 

§ 3.5 Determine the number of metropolitan areas in the United States from which at least 20% of interviewees are Asian.

tapply(A$Race=="Asian",A$MetroArea,mean) %>% sort %>% tail
                 Warner Robins, GA                         Fresno, CA 
                            0.1667                             0.1848 
             Vallejo-Fairfield, CA San Jose-Sunnyvale-Santa Clara, CA 
                            0.2030                             0.2418 
 San Francisco-Oakland-Fremont, CA                       Honolulu, HI 
                            0.2468                             0.5019 
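
Because the output is sorted, every area not shown lies below the smallest printed value, so the count can be taken from the six printed entries alone. A sketch using those printed values (the names `asian_share` and `n_at_least_20` are hypothetical):

```r
# Values copied from the sorted tail printed above
asian_share <- c("Warner Robins, GA"                  = 0.1667,
                 "Fresno, CA"                         = 0.1848,
                 "Vallejo-Fairfield, CA"              = 0.2030,
                 "San Jose-Sunnyvale-Santa Clara, CA" = 0.2418,
                 "San Francisco-Oakland-Fremont, CA"  = 0.2468,
                 "Honolulu, HI"                       = 0.5019)

# Number of areas where at least 20% of interviewees are Asian
n_at_least_20 <- sum(asian_share >= 0.20)
n_at_least_20
```

On the full data the same count would come from `sum(tapply(A$Race=="Asian", A$MetroArea, mean) >= 0.2)`.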

§ 3.6 Passing na.rm=TRUE to the tapply function, determine which metropolitan area has the smallest proportion of interviewees who have received no high school diploma.

tapply(A$Education=="No high school diploma", A$MetroArea, mean, na.rm=TRUE) %>% sort %>% head
           Iowa City, IA        Bowling Green, KY    Kalamazoo-Portage, MI 
                 0.02913                  0.03704                  0.05051 
    Champaign-Urbana, IL Bremerton-Silverdale, WA             Lawrence, KS 
                 0.05155                  0.05405                  0.05952 




Section-4 Integrating Country of Birth Data

§ 4.1 What is the name of the variable added to the CPS data frame by this merge operation?

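# Open question in these notes. The CountryMap merge was already run in § 3.2;
# the column it added is referenced below as A$Country.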

How many interviewees have a missing value for the new country of birth variable?

sum(is.na(A$Country))

§ 4.2 Among all interviewees born outside of North America, which country was the most common place of birth?

table(A$Country) %>% sort %>% tail(10)

         Cuba       Germany       Vietnam   El Salvador   Puerto Rico         China 
          426           438           458           477           518           581 
        India   Philippines        Mexico United States 
          770           839          3921        115063 

§ 4.3 What proportion of the interviewees from the “New York-Northern New Jersey-Long Island, NY-NJ-PA” metropolitan area have a country of birth that is not the United States?

table(A$Country[A$MetroArea=="New York-Northern New Jersey-Long Island, NY-NJ-PA"]=="United States")%>% prop.table

 FALSE   TRUE 
0.3087 0.6913 

§ 4.4 Which metropolitan area has the largest number (note – not proportion) of interviewees with a country of birth in India?

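# Caution: without na.rm=TRUE, sum() returns NA for any area that contains a missing
# Country, and sort() then drops that area, so the ranking below may omit areas.
# The same applies to the Brazil and Somalia queries that follow.
tapply(A$Country=="India",A$MetroArea,sum) %>% sort %>% tail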
                      Kansas City, MO-KS        Milwaukee-Waukesha-West Allis, WI 
                                      11                                       12 
                              Fresno, CA       San Jose-Sunnyvale-Santa Clara, CA 
                                      16                                       19 
Hartford-West Hartford-East Hartford, CT               Detroit-Warren-Livonia, MI 
                                      26                                       30 

In Brazil?

tapply(A$Country=="Brazil", A$MetroArea,sum) %>% sort %>% tail
Sacramento-Arden-Arcade-Roseville, CA                  Canton-Massillon, OH 
                                    2                                     3 
          Phoenix-Mesa-Scottsdale, AZ   Davenport-Moline-Rock Island, IA-IL 
                                    3                                     4 
Miami-Fort Lauderdale-Miami Beach, FL        Boston-Cambridge-Quincy, MA-NH 
                                   16                                    18 

In Somalia?

tapply(A$Country=="Somalia", A$MetroArea,sum) %>% sort %>% tail
              York-Hanover, PA Youngstown-Warren-Boardman, OH 
                             0                              0 
                    Dayton, OH                   Richmond, VA 
                             1                              1 
   Phoenix-Mesa-Scottsdale, AZ                  St. Cloud, MN 
                             7                              7
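

The three § 4.4 queries call `sum` without `na.rm=TRUE`, unlike the § 3.6 query. If any interviewee in an area has a missing Country, that area's sum becomes NA and `sort()` silently removes it from the ranking. A minimal sketch of the effect on made-up data (`toy` and the area names are hypothetical):

```r
# Toy data: Area X has two India-born interviewees and one missing Country
toy <- data.frame(
  Country   = c("India", "India", NA, "Brazil", "India"),
  MetroArea = c("Area X", "Area X", "Area X", "Area Y", "Area Y"),
  stringsAsFactors = FALSE
)

# Without na.rm: Area X sums to NA and is dropped by sort()
without_na_rm <- sort(tapply(toy$Country == "India", toy$MetroArea, sum))
without_na_rm

# With na.rm = TRUE: Area X is counted and ranks first
with_na_rm <- sort(tapply(toy$Country == "India", toy$MetroArea, sum, na.rm = TRUE))
with_na_rm
```

On the real data, rerunning the queries with `na.rm=TRUE` would show whether any areas were left out of the rankings above.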