N.B. This requires the rio and rvest packages in addition to the standard tidyverse.

Resolving overchoice

There are a lot of data sets to choose from, which caused the author some indecision. We can use R to take care of some of it, though. There are 47 pages of posts and we need three examples, though some pages seem not quite suited to the assignment. Of course, we set the random seed so the sampling is reproducible.

sample(1:47,10)
##  [1] 30 16  1  3  4 18 19 22 32 44
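The set.seed() call is not shown in the chunk above; a minimal sketch of the intended pattern follows (123 is only a placeholder, since the seed actually used is not recorded here):

set.seed(123)     # placeholder seed; the seed actually used is not shown above
sample(1:47, 10)  # draw 10 candidate page numbers from the 47 discussion pages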

Evaluation of Page 30

Page 30’s first posted data set is the CDC SMART BRFSS (the metropolitan-area Behavioral Risk Factor Surveillance System data). For data-currency reasons we are using the 2017 data instead of the 2016 data suggested in the post. The extracted XPT file is 356 MB, so the remote pull and unzip/load takes a fair amount of time.

mmsaRaw <- rio::import("https://www.cdc.gov/brfss/annual_data/2017/files/MMSA2017_XPT.zip")  # rio pulls the zip and reads the SAS transport (XPT) file inside
kable(head(mmsaRaw[1:3]))  # preview the first three columns
DISPCODE STATERE1 SAFETIME
1200 NA 1
1200 NA 1
1200 NA 1
1200 NA 1
1200 NA 1
1200 NA 1

Well, that is a lot of data. Unfortunately for our purposes it already looks tidy: there are a lot of variables for each response, but each variable is a column, each observation is a row, and each observational unit is its own table (there is only one here).
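As an aside, because the remote pull is slow, one option (not used here) is to cache the zip locally and import the cached copy on later runs. A minimal sketch, assuming rio handles a local copy of the zip the same way it handles the remote one, with the file name below being just an example:

localZip <- "MMSA2017_XPT.zip"  # example cache path; any writable location works
if (!file.exists(localZip)) {
  download.file("https://www.cdc.gov/brfss/annual_data/2017/files/MMSA2017_XPT.zip",
                destfile = localZip, mode = "wb")  # binary mode so the zip survives intact on Windows
}
mmsaRaw <- rio::import(localZip)  # import the cached copy instead of re-downloading on every knit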

Evaluation of Page 16

Page 16 does not have a data set posted.

Evaluation of Page 1

Page 1’s first data set is the Census PINC-03 table (All Races, 25 Years and Over, Total Work Experience). With the appropriate library, R can read the Excel (xls) file directly.

pincRaw <- rio::import("https://www2.census.gov/programs-surveys/cps/tables/pinc-03/2019/pinc03_1_1_1_1.xls")
## New names:
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## * `` -> ...5
## * `` -> ...6
## * ... and 6 more problems
kable(head(pincRaw))
Table with row headers in column A and column headers in rows 12 through 14 …2 …3 …4 …5 …6 …7 …8 …9 …10 …11 …12
PINC-03. Educational Attainment–People 25 Years Old and Over, by Total Money Earnings in 2018, Work Experience in 2018, Age, Race, Hispanic Origin, and Sex NA NA NA NA NA NA NA NA NA NA NA
Data reflect the implementation of an updated processing system that incorporates content from earlier questionnaire redesigns related to income, health insurance, and demographics. NA NA NA NA NA NA NA NA NA NA NA
For information on confidentiality protection, sampling error, nonsampling error, and definitions, see <www2.census.gov/programs-surveys/cps/techdocs/cpsmar19.pdf>. NA NA NA NA NA NA NA NA NA NA NA
Source: U.S. Census Bureau, Current Population Survey, 2019 Annual Social and Economic Supplement. NA NA NA NA NA NA NA NA NA NA NA
(Numbers in thousands. People 25 Years Old and Over as of March of the following year. A.O.I.C. stands for alone or in combination. Median income is calculated using $2,500 income intervals. NA NA NA NA NA NA NA NA NA NA NA
The Gini index is calculated using micro-sorted data. Medians falling in the upper open-ended interval are plugged with “$250,000”. Standard errors calculated using replicate weights) NA NA NA NA NA NA NA NA NA NA NA

The data did not import cleanly, so we have some work to do. First we slice off some extra rows, then fill and combine the multi-row column headers into a reasonable set of column names. Finally we trim off the remaining header and summary rows and drop the Total column that the Census provides.

pincRaw <- slice(pincRaw, 10:n() - 1)  # note: this parses as (10:n()) - 1, i.e. rows 9 through n() - 1
pincCols <- t(fill(as.data.frame(t(pincRaw[3:5, ])), c(1, 2, 3)))  # transpose the three header rows, fill the merged-cell gaps downward, transpose back
# paste the three header levels into single column names and strip stray NAs;
# the "Degree Graduate (Incl GED)" pattern is treated as a regex here, so the literal
# parenthesized text is left as-is (fixed = TRUE would be needed to actually shorten it)
colnames(pincRaw) <- gsub("Degree Graduate (Incl GED)", "Degree",
                          gsub('(\\sNA)|(NA\\s)', '',
                               paste(pincCols[1, ], pincCols[2, ], pincCols[3, ], sep = " ")))
pinc <- slice(pincRaw, append(7, 9:(n() - 5))) %>% select(-Total)  # keep only the data rows; drop the Total column
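To make the header-rebuilding step above easier to follow, here is a toy illustration of the same transpose/fill/paste idea on a made-up three-row header fragment (the labels below are invented purely for demonstration):

toyHdr <- as.data.frame(rbind(
  c("High School", "College",      NA,          NA),            # level 1: merged cells import as NA
  c(NA,            "Some College", "Associate", "Bachelor's"),  # level 2
  c("Graduate",    "No Degree",    "Degree",    "Degree")       # level 3
))
filled <- t(fill(as.data.frame(t(toyHdr)), 1:3))  # transpose, carry labels down over the gaps, transpose back
gsub('(\\sNA)|(NA\\s)', '', paste(filled[1, ], filled[2, ], filled[3, ]))
# yields: "High School Graduate", "College Some College No Degree",
#         "College Associate Degree", "College Bachelor's Degree"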

kable(pinc)
Characteristic Less Than 9th Grade High School 9th to 12th Nongrad High School Graduate (Incl GED) College Some College No Degree Graduate (Incl GED) College Associate Degree Graduate (Incl GED) College Bachelor’s Degree or more Total College Bachelor’s Degree or more Bachelor’s Degree College Bachelor’s Degree or more Master’s Degree College Bachelor’s Degree or more Professional Degree College Bachelor’s Degree or more Doctorate Degree
Without Earnings 4829 7090 25591 12173 6566 19457 12488 5431 616 923
..$1 to $2,499 or loss 115 190 775 607 287 966 613 289 21 44
..$2,500 to $4,999 97 201 641 399 241 632 459 138 13 22
..$5,000 to $7,499 109 245 900 556 334 933 608 265 16 44
..$7,500 to $9,999 106 220 776 375 224 565 371 158 11 25
..$10,000 to $12,499 185 396 1536 826 528 979 688 236 22 32
..$12,500 to $14,999 127 209 711 362 206 502 364 113 18 8
..$15,000 to $17,499 273 433 1407 809 435 1007 750 203 12 42
..$17,500 to $19,999 196 339 1122 612 366 567 411 118 17 21
..$20,000 to $22,499 447 579 2321 1093 733 1415 963 341 46 65
..$22,500 to $24,999 194 280 1093 688 429 702 535 136 10 22
..$25,000 to $27,499 292 450 2331 1218 799 1502 1130 290 31 49
..$27,500 to $29,999 134 160 1116 466 379 583 410 148 11 14
..$30,000 to $32,499 331 468 2872 1616 1026 2063 1592 370 34 66
..$32,500 to $34,999 58 82 718 385 313 510 386 103 5 16
..$35,000 to $37,499 239 385 2237 1187 763 1979 1487 391 55 46
..$37,500 to $39,999 79 111 889 483 344 732 562 145 6 20
..$40,000 to $42,499 156 335 2235 1190 929 2451 1716 573 60 102
..$42,500 to $44,999 30 48 488 300 224 627 448 166 6 8
..$45,000 to $47,499 76 153 1430 884 692 1835 1269 491 29 46
..$47,500 to $49,999 47 70 589 445 297 927 651 222 13 41
..$50,000 to $52,499 149 247 1985 1307 1018 3035 1918 914 93 110
..$52,500 to $54,999 13 37 306 262 181 717 441 251 15 11
..$55,000 to $57,499 41 88 852 572 541 1783 1132 533 49 69
..$57,500 to $59,999 12 23 229 281 207 660 436 189 14 21
..$60,000 to $62,499 74 127 1342 892 779 2930 1922 777 97 133
..$62,500 to $64,999 8 27 197 169 120 505 284 180 10 30
..$65,000 to $67,499 21 28 674 427 409 1552 1004 461 23 64
..$67,500 to $69,999 11 9 228 146 148 513 340 134 11 29
..$70,000 to $72,499 21 46 678 525 451 1999 1230 609 75 85
..$72,500 to $74,999 4 3 124 82 85 418 232 151 12 24
..$75,000 to $77,499 24 45 527 358 321 1755 1126 432 84 113
..$77,500 to $79,999 9 19 141 135 127 460 307 121 15 17
..$80,000 to $82,499 14 56 550 432 390 1893 1111 587 72 123
..$82,500 to $84,999 0 1 134 91 52 328 146 141 13 27
..$85,000 to $87,499 3 13 249 238 180 1268 745 405 52 65
..$87,500 to $89,999 0 9 52 65 49 274 139 99 4 33
..$90,000 to $92,499 7 13 214 235 196 1498 812 517 34 135
..$92,500 to $94,999 0 12 61 44 33 307 190 96 4 18
..$95,000 to $97,499 6 12 126 129 97 780 451 250 21 58
..$97,500 to $99,999 2 8 94 78 46 332 185 120 12 15
..$100,000 and over 66 102 1717 1551 1198 15875 7885 4924 1375 1692
Median earnings (dollars) 25318 25280 35016 37811 41834 62140 57105 70241 104593 92126

Tidy

We simply gather the education columns into a key column (Education) and a value column (Thousands), leaving the income-range column (Characteristic) out of the gather. We also coerce the value column to numeric.

pincTidy <- pinc %>% gather(Education, Thousands, -Characteristic)
pincTidy$Thousands <- as.numeric(pincTidy$Thousands)
kable(head(pincTidy))
Characteristic Education Thousands
Without Earnings Less Than 9th Grade 4829
..$1 to $2,499 or loss Less Than 9th Grade 115
..$2,500 to $4,999 Less Than 9th Grade 97
..$5,000 to $7,499 Less Than 9th Grade 109
..$7,500 to $9,999 Less Than 9th Grade 106
..$10,000 to $12,499 Less Than 9th Grade 185
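For reference, the same reshaping can be done with the newer tidyr interface (pivot_longer, tidyr 1.0 and later); it would look roughly like this:

pincTidy <- pinc %>%
  pivot_longer(-Characteristic, names_to = "Education", values_to = "Thousands") %>%
  mutate(Thousands = as.numeric(Thousands))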

Analysis

We can look at the number of people without earnings in each education group.

earningless <- pincTidy %>% filter(Characteristic == "Without Earnings") %>% group_by(Education)
kable(earningless)
Characteristic Education Thousands
Without Earnings Less Than 9th Grade 4829
Without Earnings High School 9th to 12th Nongrad 7090
Without Earnings High School Graduate (Incl GED) 25591
Without Earnings College Some College No Degree Graduate (Incl GED) 12173
Without Earnings College Associate Degree Graduate (Incl GED) 6566
Without Earnings College Bachelor’s Degree or more Total 19457
Without Earnings College Bachelor’s Degree or more Bachelor’s Degree 12488
Without Earnings College Bachelor’s Degree or more Master’s Degree 5431
Without Earnings College Bachelor’s Degree or more Professional Degree 616
Without Earnings College Bachelor’s Degree or more Doctorate Degree 923
earningless[c(1, 9), ] %>%  # compare the "Less Than 9th Grade" and "Professional Degree" rows
  ggplot(aes(Education, Thousands)) +
  geom_bar(stat = "identity")
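A variation (not part of the original analysis) that plots every education level, reordered by count and flipped so the long labels stay readable:

earningless %>%
  ggplot(aes(reorder(Education, Thousands), Thousands)) +  # reorder the bars by count
  geom_bar(stat = "identity") +
  coord_flip() +
  xlab("Education")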

Evaluation of Page 3

Page 3 doesn’t point to a specific data set but to the UNICEF data sets. They are very nice, but there are so many of them that we risk overchoice again, so we’ll keep looking.

Evaluation of Page 4

Page 4’s first data post is the Illinois Report Card data, which is reported to be already tidy.

Evaluation of Page 18

Page 18 gives us unemployment data from the World Bank. We slice off the first row and drop a number of columns, mostly regional and income-group aggregates.

unempRaw <- rio::import("https://github.com/ErindaB/Other/raw/master/Unemployment%20Rate%2C%20seas.%20adj..xlsx")
## New names:
## * `` -> ...1
unempRaw <- slice(unempRaw, 2:n())  # drop the first row
unempRaw <- unempRaw %>% select(-c(
  "Advanced Economies", "EMDE East Asia & Pacific", "EMDE Europe & Central Asia",
  "Emerging Market and Developing Economies (EMDEs)", "High Income Countries",
  "Hong Kong SAR, China", "EMDE Latin America & Caribbean", "Low-Income Countries (LIC)",
  "Middle-Income Countries (MIC)", "EMDE Middle East & N. Africa", "EMDE South Asia",
  "EMDE Sub-Saharan Africa", "World (WBG members)"
))

names(unempRaw)[[1]] <- "Year"
kable(head(unempRaw))
Year Argentina Australia Austria Belgium Bulgaria Bahrain Belarus Brazil Canada Switzerland Chile China Colombia Cyprus Czech Republic Germany Denmark Dominican Republic Algeria Ecuador Egypt, Arab Rep. Spain Estonia Finland France United Kingdom Greece Croatia Hungary India Ireland Iceland Israel Italy Jordan Japan Kazakhstan Korea, Rep. Sri Lanka Lithuania Luxembourg Latvia Morocco Moldova, Rep. Mexico North Macedonia Malta Netherlands Norway New Zealand Pakistan Peru Philippines Poland Portugal Romania Russian Federation Saudi Arabia Singapore Slovakia Slovenia Sweden Thailand Tunisia Turkey Taiwan, China Uruguay United States Venezuela, RB Vietnam South Africa
1990 NA 6.943297 5.373002 6.55026 NA NA NA NA 8.150000 0.501328 NA NA NA NA NA NA NA NA 25 NA NA 15.48333 0.65 3.103129 7.625 7.091667 NA NA NA NA 13.41667 NA NA NA NA 2.108117 NA NA 15.9 NA NA NA NA NA NA NA NA NA 5.783333 7.984591 3.13 NA NA 3.441667 NA NA NA NA NA NA NA 2.239701 NA NA NA 1.658333 NA 5.616667 NA NA NA
1991 NA 9.614137 5.823096 6.439812 NA NA NA NA 10.316670 1.090451 NA NA NA NA NA 4.864885 NA NA 25 NA NA 15.51667 1.475 6.666424 7.8 8.825 NA NA NA NA 14.73333 NA NA NA NA 2.099018 NA NA 14.7 NA NA NA NA NA NA NA NA NA 6.041667 10.61144 6.28 NA 10.475 9.008333 NA NA NA NA 1.75 7.05 NA 4.005607 NA NA NA 1.533333 NA 6.850000 NA NA NA
1992 NA 10.750080 5.941711 7.088092 13.235 NA NA NA 11.216670 2.563105 NA NA NA NA NA 5.764563 NA NA 27 NA NA 17.06667 3.725 11.796830 8.65 9.966667 NA NA NA NA 15.40000 NA NA NA NA 2.151389 NA NA 14.6 NA NA NA NA NA NA NA NA NA 6.55 10.64473 5.85 NA 9.85 12.933330 NA 5.45 NA NA 1.8 11.31833 11.56667 7.110956 NA NA NA 1.500000 NA 7.491667 NA NA NA
1993 NA 10.866170 6.811381 8.61913 15.85583 NA NA NA 11.375000 4.516116 NA NA NA NA 4.333333 6.93137 NA NA 23.2 NA NA 20.83333 6.55 16.384210 9.65 10.4 NA NA NA NA 15.63333 NA NA NA 19.7 2.503291 NA NA 13.8 4.191667 NA 4.658333 NA NA NA NA NA NA 6.608333 9.800159 4.73 NA 9.35 15.033330 NA 9.208333 NA NA 1.675 12.855 14.575 11.146890 NA NA NA 1.425000 NA 6.908333 NA NA NA
1994 NA 9.705695 6.545480 9.753554 14.06583 NA NA NA 10.391670 4.718465 NA NA NA NA 4.283333 7.340639 NA NA 24.4 NA NA 22.05 7.55 16.534420 10.25 9.5 NA NA NA NA 14.35000 NA NA NA 15.8 2.890953 NA NA 13.1 3.625 NA 6.358333 NA NA NA NA NA NA 6 8.342465 4.84 NA 9.55 16.508330 NA 10.975 7.00654 NA 1.725 14.62917 14.55 10.766190 NA NA NA 1.566667 NA 6.100000 NA NA NA
1995 NA 8.471058 6.589767 9.674164 11.38583 NA NA NA 9.466667 4.232892 NA NA NA NA 4.033333 7.091997 NA NA 28.1 NA NA 20.79167 9.75 15.426480 9.675 8.658333 NA NA NA NA 12.28333 NA NA NA 15.3 3.153574 NA NA 12.3 6.116667 2.600765 6.35 NA NA NA NA NA NA 5.441667 6.451948 5.37 NA 9.5 15.225000 7.150996 9.975 8.308334 NA 1.725 13.68083 14.04167 10.421390 NA NA NA 1.808333 NA 5.591667 NA NA NA

Tidy

We gather the country columns (everything except Year) into a Country key column and an UnempRate value column. This puts the data in tidy form.

unempTidy <- unempRaw %>% gather(Country, UnempRate, -Year) %>% arrange(Country, Year)
unempTidy$UnempRate <- as.numeric(unempTidy$UnempRate)
## Warning: NAs introduced by coercion
kable(head(unempTidy))
Year Country UnempRate
1990 Algeria 25.0
1991 Algeria 25.0
1992 Algeria 27.0
1993 Algeria 23.2
1994 Algeria 24.4
1995 Algeria 28.1

Analysis

We can compute a rough world rate for each year if we wish (an unweighted mean across the countries with data), and we can also look at the distribution across countries in 2001.

worldRate <- unempTidy %>% group_by(Year) %>% summarize(rate = mean(UnempRate, na.rm = TRUE))  # unweighted mean across countries, per year
worldRate %>% ggplot(aes(Year, rate)) + geom_bar(stat = "identity")

spaceOdyessy <- unempTidy %>% filter(Year == 2001) %>% filter(!is.na(UnempRate)) %>% arrange(desc(UnempRate))
spaceOdyessy$Country <- factor(spaceOdyessy$Country, spaceOdyessy$Country)  # lock the factor levels so the bars keep the descending order
spaceOdyessy %>% ggplot(aes(Country, UnempRate)) + geom_bar(stat = "identity")
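As an aside, the manual factor() re-leveling above can also be done with forcats (part of the tidyverse); a sketch of the same 2001 chart:

unempTidy %>%
  filter(Year == 2001, !is.na(UnempRate)) %>%
  mutate(Country = forcats::fct_reorder(Country, UnempRate, .desc = TRUE)) %>%  # order bars by rate, highest first
  ggplot(aes(Country, UnempRate)) +
  geom_bar(stat = "identity")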

Evaluation of Page 19

Page 19 has no data set posted.

Evaluation of Page 22

Page 22 is a Wikipedia table of Primetime Emmy winners. We use rvest to pull the page down and parse it into a table; thankfully, rvest takes care of much of the cell filling. We drop the years before 1966, since much of the earlier data is misaligned with the “modern” categories.

emmyHTML <- read_html("https://en.wikipedia.org/wiki/List_of_Primetime_Emmy_Award_winners")
emmyTable <- rvest::html_table(emmyHTML, fill = TRUE)[[1]]  # parse all tables on the page and keep the first
emmyTable <- emmyTable[emmyTable$Year >= 1966, ]            # keep 1966 and later
kable(head(emmyTable))
Year Comedy Drama Variety Lead Comedy Actor Lead Drama Actor Lead Comedy Actress Lead Drama Actress
18 1966 The Dick Van Dyke Show (CBS) The Fugitive (ABC) The Andy Williams Show (NBC) Dick Van Dyke The Dick Van Dyke Show (CBS) Bill Cosby I Spy (NBC) Mary Tyler Moore The Dick Van Dyke Show (CBS) Barbara Stanwyck The Big Valley (ABC)
19 1967 The Monkees (NBC) Mission: Impossible (CBS) The Andy Williams Show (NBC) Don Adams Get Smart (NBC) Bill Cosby I Spy (NBC) Lucille Ball The Lucy Show (CBS) Barbara Bain Mission: Impossible (CBS)
20 1968 Get Smart (NBC) Mission: Impossible (CBS) The Andy Williams Show (NBC) Don Adams Get Smart (NBC) Bill Cosby I Spy (NBC) Bill Cosby I Spy (NBC) Rowan & Martin’s Laugh-In (NBC)
21 1969 Get Smart (NBC) Mission: Impossible (CBS) The Andy Williams Show (NBC) NET Playhouse (NET) Bill Cosby I Spy (NBC) Carl Betz Judd for the Defense (ABC) Hope Lange The Ghost & Mrs. Muir (ABC)
22 1970 My World and Welcome to It (NBC) Marcus Welby, M.D. (ABC) The Andy Williams Show (NBC) The David Frost Show (Syndicated) William Windom My World and Welcome to It (NBC) Robert Young Marcus Welby, M.D. (ABC) Susan Hampshire The Forsyte Saga (NET)
23 1971 All in the Family (CBS) The Bold Ones: The Senator (NBC) Singer Presents Burt Bacharach (CBS) Jack Klugman The Odd Couple (ABC) Hal Holbrook The Bold Ones: The Senator (NBC) Jean Stapleton All in the Family (CBS) Susan Hampshire The First Churchills (Masterpiece Theatre) (PBS)

Tidy

First we gather the award columns into a Category key column, excluding Year; then we sort by Year. Next we break up the Winner entry, extracting the network from the parentheses (the greedy first capture group means that, for entries with two parenthesized parts, only the last one becomes the Network).

emmyTable <- gather(emmyTable, Category, Winner, -Year) %>% arrange(Year)
emmyTidy <- emmyTable %>% extract(Winner, c("Winner", "Network"), "(.+) \\((.+)\\)")
kable(head(emmyTidy))
Year Category Winner Network
1966 Comedy The Dick Van Dyke Show CBS
1966 Drama The Fugitive ABC
1966 Variety The Andy Williams Show NBC
1966 Lead Comedy Actor Dick Van Dyke The Dick Van Dyke Show CBS
1966 Lead Drama Actor Bill Cosby I Spy NBC
1966 Lead Comedy Actress Mary Tyler Moore The Dick Van Dyke Show CBS

Analysis

Some basic analysis shows that NBC, CBS, ABC, and HBO are, in that order, the most-awarded networks. We can also see the changing breakdown of winning networks as the years progress: the networks outside the top group have started taking more Emmys.

networkCount <- emmyTidy %>% group_by(Network) %>% summarize(count = n()) %>% filter(count > 8) %>% arrange(desc(count))
kable(networkCount)
Network count
NBC 112
CBS 96
ABC 83
HBO 27
AMC 9
emmyTidy %>% filter(Network %in% networkCount$Network) %>% ggplot(aes(x = Year, color = Network)) + geom_bar()
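Note that color= on geom_bar only colors the bar outlines, which can make the stacked breakdown hard to read; mapping the networks to fill= instead is a small variation on the chart above:

emmyTidy %>%
  filter(Network %in% networkCount$Network) %>%
  ggplot(aes(x = Year, fill = Network)) +
  geom_bar()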