N.B. This requires the rio and rvest packages in addition to the standard tidyverse.

Resolving overchoice

There are a lot of data sets to choose from, which caused the author some indecision. We can use R to take care of some of it, though. There are 47 pages of posts and we need three examples, though some pages seem not quite suited to the assignment. Of course, we set the random seed so the sampling is reproducible.

sample(1:47,10)
##  [1] 30 16  1  3  4 18 19 22 32 44
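The set.seed() call is not shown in the chunk above; a minimal sketch of the intended pattern follows (123 is only a placeholder, since the seed actually used is not recorded here):

set.seed(123)     # placeholder seed; the seed actually used is not shown above
sample(1:47, 10)  # draw 10 candidate page numbers from the 47 discussion pages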

Evaluation of Page 30

Page 30’s first posted data set is the CDC SMART BRFSS (the metropolitan-area Behavioral Risk Factor Surveillance System data). For data-currency reasons we are using the 2017 data instead of the 2016 data suggested in the post. The extracted XPT file is 356 MB, so the remote pull and unzip/load takes a fair amount of time.

mmsaRaw <- rio::import("https://www.cdc.gov/brfss/annual_data/2017/files/MMSA2017_XPT.zip")  # rio pulls the zip and reads the SAS transport (XPT) file inside
kable(head(mmsaRaw[1:3]))  # preview the first three columns
DISPCODE STATERE1 SAFETIME
1200 NA 1
1200 NA 1
1200 NA 1
1200 NA 1
1200 NA 1
1200 NA 1

Well, that is a lot of data. Unfortunately for our purposes it already looks tidy: there are a lot of variables for each response, but each variable is a column, each observation is a row, and each observational unit is its own table (there is only one here).
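As an aside, because the remote pull is slow, one option (not used here) is to cache the zip locally and import the cached copy on later runs. A minimal sketch, assuming rio handles a local copy of the zip the same way it handles the remote one, with the file name below being just an example:

localZip <- "MMSA2017_XPT.zip"  # example cache path; any writable location works
if (!file.exists(localZip)) {
  download.file("https://www.cdc.gov/brfss/annual_data/2017/files/MMSA2017_XPT.zip",
                destfile = localZip, mode = "wb")  # binary mode so the zip survives intact on Windows
}
mmsaRaw <- rio::import(localZip)  # import the cached copy instead of re-downloading on every knit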

Evaluation of Page 16

Page 16 does not have a data set posted.

Evaluation of Page 1

Page 1’s first data set is the Census PINC-03 table (All Races, 25 Years and Over, Total Work Experience). With the appropriate library, R can read the Excel (xls) file directly.

pincRaw <- rio::import("https://www2.census.gov/programs-surveys/cps/tables/pinc-03/2019/pinc03_1_1_1_1.xls")
## New names:
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## * `` -> ...5
## * `` -> ...6
## * ... and 6 more problems
kable(head(pincRaw))
Table with row headers in column A and column headers in rows 12 through 14 …2 …3 …4 …5 …6 …7 …8 …9 …10 …11 …12
PINC-03. Educational Attainment–People 25 Years Old and Over, by Total Money Earnings in 2018, Work Experience in 2018, Age, Race, Hispanic Origin, and Sex NA NA NA NA NA NA NA NA NA NA NA
Data reflect the implementation of an updated processing system that incorporates content from earlier questionnaire redesigns related to income, health insurance, and demographics. NA NA NA NA NA NA NA NA NA NA NA
For information on confidentiality protection, sampling error, nonsampling error, and definitions, see <www2.census.gov/programs-surveys/cps/techdocs/cpsmar19.pdf>. NA NA NA NA NA NA NA NA NA NA NA
Source: U.S. Census Bureau, Current Population Survey, 2019 Annual Social and Economic Supplement. NA NA NA NA NA NA NA NA NA NA NA
(Numbers in thousands. People 25 Years Old and Over as of March of the following year. A.O.I.C. stands for alone or in combination. Median income is calculated using $2,500 income intervals. NA NA NA NA NA NA NA NA NA NA NA
The Gini index is calculated using micro-sorted data. Medians falling in the upper open-ended interval are plugged with “$250,000”. Standard errors calculated using replicate weights) NA NA NA NA NA NA NA NA NA NA NA

The data did not import cleanly, so we have some work to do. First we slice off some extra rows, then fill and combine the multi-row column headers into a reasonable set of column names. Finally we trim off the remaining header and summary rows and drop the Total column that the Census provides.

pincRaw <- slice(pincRaw, 10:n() - 1)  # note: this parses as (10:n()) - 1, i.e. rows 9 through n() - 1
pincCols <- t(fill(as.data.frame(t(pincRaw[3:5, ])), c(1, 2, 3)))  # transpose the three header rows, fill the merged-cell gaps downward, transpose back
# paste the three header levels into single column names and strip stray NAs;
# the "Degree Graduate (Incl GED)" pattern is treated as a regex here, so the literal
# parenthesized text is left as-is (fixed = TRUE would be needed to actually shorten it)
colnames(pincRaw) <- gsub("Degree Graduate (Incl GED)", "Degree",
                          gsub('(\\sNA)|(NA\\s)', '',
                               paste(pincCols[1, ], pincCols[2, ], pincCols[3, ], sep = " ")))
pinc <- slice(pincRaw, append(7, 9:(n() - 5))) %>% select(-Total)  # keep only the data rows; drop the Total column
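To make the header-rebuilding step above easier to follow, here is a toy illustration of the same transpose/fill/paste idea on a made-up three-row header fragment (the labels below are invented purely for demonstration):

toyHdr <- as.data.frame(rbind(
  c("High School", "College",      NA,          NA),            # level 1: merged cells import as NA
  c(NA,            "Some College", "Associate", "Bachelor's"),  # level 2
  c("Graduate",    "No Degree",    "Degree",    "Degree")       # level 3
))
filled <- t(fill(as.data.frame(t(toyHdr)), 1:3))  # transpose, carry labels down over the gaps, transpose back
gsub('(\\sNA)|(NA\\s)', '', paste(filled[1, ], filled[2, ], filled[3, ]))
# yields: "High School Graduate", "College Some College No Degree",
#         "College Associate Degree", "College Bachelor's Degree"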

kable(pinc)
Characteristic Less Than 9th Grade High School 9th to 12th Nongrad High School Graduate (Incl GED) College Some College No Degree Graduate (Incl GED) College Associate Degree Graduate (Incl GED) College Bachelor’s Degree or more Total College Bachelor’s Degree or more Bachelor’s Degree College Bachelor’s Degree or more Master’s Degree College Bachelor’s Degree or more Professional Degree College Bachelor’s Degree or more Doctorate Degree
Without Earnings 4829 7090 25591 12173 6566 19457 12488 5431 616 923
..$1 to $2,499 or loss 115 190 775 607 287 966 613 289 21 44
..$2,500 to $4,999 97 201 641 399 241 632 459 138 13 22
..$5,000 to $7,499 109 245 900 556 334 933 608 265 16 44
..$7,500 to $9,999 106 220 776 375 224 565 371 158 11 25
..$10,000 to $12,499 185 396 1536 826 528 979 688 236 22 32
..$12,500 to $14,999 127 209 711 362 206 502 364 113 18 8
..$15,000 to $17,499 273 433 1407 809 435 1007 750 203 12 42
..$17,500 to $19,999 196 339 1122 612 366 567 411 118 17 21
..$20,000 to $22,499 447 579 2321 1093 733 1415 963 341 46 65
..$22,500 to $24,999 194 280 1093 688 429 702 535 136 10 22
..$25,000 to $27,499 292 450 2331 1218 799 1502 1130 290 31 49
..$27,500 to $29,999 134 160 1116 466 379 583 410 148 11 14
..$30,000 to $32,499 331 468 2872 1616 1026 2063 1592 370 34 66
..$32,500 to $34,999 58 82 718 385 313 510 386 103 5 16
..$35,000 to $37,499 239 385 2237 1187 763 1979 1487 391 55 46
..$37,500 to $39,999 79 111 889 483 344 732 562 145 6 20
..$40,000 to $42,499 156 335 2235 1190 929 2451 1716 573 60 102
..$42,500 to $44,999 30 48 488 300 224 627 448 166 6 8
..$45,000 to $47,499 76 153 1430 884 692 1835 1269 491 29 46
..$47,500 to $49,999 47 70 589 445 297 927 651 222 13 41
..$50,000 to $52,499 149 247 1985 1307 1018 3035 1918 914 93 110
..$52,500 to $54,999 13 37 306 262 181 717 441 251 15 11
..$55,000 to $57,499 41 88 852 572 541 1783 1132 533 49 69
..$57,500 to $59,999 12 23 229 281 207 660 436 189 14 21
..$60,000 to $62,499 74 127 1342 892 779 2930 1922 777 97 133
..$62,500 to $64,999 8 27 197 169 120 505 284 180 10 30
..$65,000 to $67,499 21 28 674 427 409 1552 1004 461 23 64
..$67,500 to $69,999 11 9 228 146 148 513 340 134 11 29
..$70,000 to $72,499 21 46 678 525 451 1999 1230 609 75 85
..$72,500 to $74,999 4 3 124 82 85 418 232 151 12 24
..$75,000 to $77,499 24 45 527 358 321 1755 1126 432 84 113
..$77,500 to $79,999 9 19 141 135 127 460 307 121 15 17
..$80,000 to $82,499 14 56 550 432 390 1893 1111 587 72 123
..$82,500 to $84,999 0 1 134 91 52 328 146 141 13 27
..$85,000 to $87,499 3 13 249 238 180 1268 745 405 52 65
..$87,500 to $89,999 0 9 52 65 49 274 139 99 4 33
..$90,000 to $92,499 7 13 214 235 196 1498 812 517 34 135
..$92,500 to $94,999 0 12 61 44 33 307 190 96 4 18
..$95,000 to $97,499 6 12 126 129 97 780 451 250 21 58
..$97,500 to $99,999 2 8 94 78 46 332 185 120 12 15
..$100,000 and over 66 102 1717 1551 1198 15875 7885 4924 1375 1692
Median earnings (dollars) 25318 25280 35016 37811 41834 62140 57105 70241 104593 92126

Tidy

We simply gather the education columns into a key column (Education) and a value column (Thousands), leaving the income-range column (Characteristic) out of the gather. We also coerce the value column to numeric.

pincTidy <- pinc %>% gather(Education, Thousands, -Characteristic)
pincTidy$Thousands <- as.numeric(pincTidy$Thousands)
kable(head(pincTidy))
Characteristic Education Thousands
Without Earnings Less Than 9th Grade 4829
..$1 to $2,499 or loss Less Than 9th Grade 115
..$2,500 to $4,999 Less Than 9th Grade 97
..$5,000 to $7,499 Less Than 9th Grade 109
..$7,500 to $9,999 Less Than 9th Grade 106
..$10,000 to $12,499 Less Than 9th Grade 185
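For reference, the same reshaping can be done with the newer tidyr interface (pivot_longer, tidyr 1.0 and later); it would look roughly like this:

pincTidy <- pinc %>%
  pivot_longer(-Characteristic, names_to = "Education", values_to = "Thousands") %>%
  mutate(Thousands = as.numeric(Thousands))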

Analysis

We can look at the number of people without earnings in each education group.

earningless <- pincTidy %>% filter(Characteristic == "Without Earnings") %>% group_by(Education)
kable(earningless)
Characteristic Education Thousands
Without Earnings Less Than 9th Grade 4829
Without Earnings High School 9th to 12th Nongrad 7090
Without Earnings High School Graduate (Incl GED) 25591
Without Earnings College Some College No Degree Graduate (Incl GED) 12173
Without Earnings College Associate Degree Graduate (Incl GED) 6566
Without Earnings College Bachelor’s Degree or more Total 19457
Without Earnings College Bachelor’s Degree or more Bachelor’s Degree 12488
Without Earnings College Bachelor’s Degree or more Master’s Degree 5431
Without Earnings College Bachelor’s Degree or more Professional Degree 616
Without Earnings College Bachelor’s Degree or more Doctorate Degree 923
earningless[c(1, 9), ] %>%  # compare the "Less Than 9th Grade" and "Professional Degree" rows
  ggplot(aes(Education, Thousands)) +
  geom_bar(stat = "identity")
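A variation (not part of the original analysis) that plots every education level, reordered by count and flipped so the long labels stay readable:

earningless %>%
  ggplot(aes(reorder(Education, Thousands), Thousands)) +  # reorder the bars by count
  geom_bar(stat = "identity") +
  coord_flip() +
  xlab("Education")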

Evaluation of Page 3

Page 3 doesn’t point to a specific data set but to the UNICEF data sets. They are very nice, but there are so many of them that we risk overchoice again, so we’ll keep looking.

Evaluation of Page 4

Page 4’s first data post is the Illinois Report Card data, which is reported to be already tidy.

Evaluation of Page 18

Page 18 gives us unemployment data from the World Bank. We slice off the first row and drop a number of columns, mostly regional and income-group aggregates.

unempRaw <- rio::import("https://github.com/ErindaB/Other/raw/master/Unemployment%20Rate%2C%20seas.%20adj..xlsx")
## New names:
## * `` -> ...1
unempRaw <- slice(unempRaw, 2:n())  # drop the first row
unempRaw <- unempRaw %>% select(-c(
  "Advanced Economies", "EMDE East Asia & Pacific", "EMDE Europe & Central Asia",
  "Emerging Market and Developing Economies (EMDEs)", "High Income Countries",
  "Hong Kong SAR, China", "EMDE Latin America & Caribbean", "Low-Income Countries (LIC)",
  "Middle-Income Countries (MIC)", "EMDE Middle East & N. Africa", "EMDE South Asia",
  "EMDE Sub-Saharan Africa", "World (WBG members)"
))

names(unempRaw)[[1]] <- "Year"
kable(head(unempRaw))
Year Argentina Australia Austria Belgium Bulgaria Bahrain Belarus Brazil Canada Switzerland Chile China Colombia Cyprus Czech Republic Germany Denmark Dominican Republic Algeria Ecuador Egypt, Arab Rep. Spain Estonia Finland France United Kingdom Greece Croatia Hungary India Ireland Iceland Israel Italy Jordan Japan Kazakhstan Korea, Rep. Sri Lanka Lithuania Luxembourg Latvia Morocco Moldova, Rep. Mexico North Macedonia Malta Netherlands Norway New Zealand Pakistan Peru Philippines Poland Portugal Romania Russian Federation Saudi Arabia Singapore Slovakia Slovenia Sweden Thailand Tunisia Turkey Taiwan, China Uruguay United States Venezuela, RB Vietnam South Africa
1990 NA 6.943297 5.373002 6.55026 NA NA NA NA 8.150000 0.501328 NA NA NA NA NA NA NA NA 25 NA NA 15.48333 0.65 3.103129 7.625 7.091667 NA NA NA NA 13.41667 NA NA NA NA 2.108117 NA NA 15.9 NA NA NA NA NA NA NA NA NA 5.783333 7.984591 3.13 NA NA 3.441667 NA NA NA NA NA NA NA 2.239701 NA NA NA 1.658333 NA 5.616667 NA NA NA
1991 NA 9.614137 5.823096 6.439812 NA NA NA NA 10.316670 1.090451 NA NA NA NA NA 4.864885 NA NA 25 NA NA 15.51667 1.475 6.666424 7.8 8.825 NA NA NA NA 14.73333 NA NA NA NA 2.099018 NA NA 14.7 NA NA NA NA NA NA NA NA NA 6.041667 10.61144 6.28 NA 10.475 9.008333 NA NA NA NA 1.75 7.05 NA 4.005607 NA NA NA 1.533333 NA 6.850000 NA NA NA
1992 NA 10.750080 5.941711 7.088092 13.235 NA NA NA 11.216670 2.563105 NA NA NA NA NA 5.764563 NA NA 27 NA NA 17.06667 3.725 11.796830 8.65 9.966667 NA NA NA NA 15.40000 NA NA NA NA 2.151389 NA NA 14.6 NA NA NA NA NA NA NA NA NA 6.55 10.64473 5.85 NA 9.85 12.933330 NA 5.45 NA NA 1.8 11.31833 11.56667 7.110956 NA NA NA 1.500000 NA 7.491667 NA NA NA
1993 NA 10.866170 6.811381 8.61913 15.85583 NA NA NA 11.375000 4.516116 NA NA NA NA 4.333333 6.93137 NA NA 23.2 NA NA 20.83333 6.55 16.384210 9.65 10.4 NA NA NA NA 15.63333 NA NA NA 19.7 2.503291 NA NA 13.8 4.191667 NA 4.658333 NA NA NA NA NA NA 6.608333 9.800159 4.73 NA 9.35 15.033330 NA 9.208333 NA NA 1.675 12.855 14.575 11.146890 NA NA NA 1.425000 NA 6.908333 NA NA NA
1994 NA 9.705695 6.545480 9.753554 14.06583 NA NA NA 10.391670 4.718465 NA NA NA NA 4.283333 7.340639 NA NA 24.4 NA NA 22.05 7.55 16.534420 10.25 9.5 NA NA NA NA 14.35000 NA NA NA 15.8 2.890953 NA NA 13.1 3.625 NA 6.358333 NA NA NA NA NA NA 6 8.342465 4.84 NA 9.55 16.508330 NA 10.975 7.00654 NA 1.725 14.62917 14.55 10.766190 NA NA NA 1.566667 NA 6.100000 NA NA NA
1995 NA 8.471058 6.589767 9.674164 11.38583 NA NA NA 9.466667 4.232892 NA NA NA NA 4.033333 7.091997 NA NA 28.1 NA NA 20.79167 9.75 15.426480 9.675 8.658333 NA NA NA NA 12.28333 NA NA NA 15.3 3.153574 NA NA 12.3 6.116667 2.600765 6.35 NA NA NA NA NA NA 5.441667 6.451948 5.37 NA 9.5 15.225000 7.150996 9.975 8.308334 NA 1.725 13.68083 14.04167 10.421390 NA NA NA 1.808333 NA 5.591667 NA NA NA

Tidy

We gather the country columns (everything except Year) into a Country key column and an UnempRate value column. This puts the data in tidy form.

unempTidy <- unempRaw %>% gather(Country, UnempRate, -Year) %>% arrange(Country, Year)
unempTidy$UnempRate <- as.numeric(unempTidy$UnempRate)
## Warning: NAs introduced by coercion
kable(head(unempTidy))
Year Country UnempRate
1990 Algeria 25.0
1991 Algeria 25.0
1992 Algeria 27.0
1993 Algeria 23.2
1994 Algeria 24.4
1995 Algeria 28.1

Analysis

We can compute a rough world rate for each year if we wish (an unweighted mean across the countries with data), and we can also look at the distribution across countries in 2001.

worldRate <- unempTidy %>% group_by(Year) %>% summarize(rate = mean(UnempRate, na.rm = TRUE))  # unweighted mean across countries, per year
worldRate %>% ggplot(aes(Year, rate)) + geom_bar(stat = "identity")

spaceOdyessy <- unempTidy %>% filter(Year == 2001) %>% filter(!is.na(UnempRate)) %>% arrange(desc(UnempRate))
spaceOdyessy$Country <- factor(spaceOdyessy$Country, spaceOdyessy$Country)  # lock the factor levels so the bars keep the descending order
spaceOdyessy %>% ggplot(aes(Country, UnempRate)) + geom_bar(stat = "identity")
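As an aside, the manual factor() re-leveling above can also be done with forcats (part of the tidyverse); a sketch of the same 2001 chart:

unempTidy %>%
  filter(Year == 2001, !is.na(UnempRate)) %>%
  mutate(Country = forcats::fct_reorder(Country, UnempRate, .desc = TRUE)) %>%  # order bars by rate, highest first
  ggplot(aes(Country, UnempRate)) +
  geom_bar(stat = "identity")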

Evaluation of Page 19

Page 19 has no data set posted.

Evaluation of Page 22

Page 22 is a Wikipedia table of Primetime Emmy winners. We use rvest to pull the page down and parse it into a table; thankfully, rvest takes care of much of the cell filling. We drop the years before 1966, since much of the earlier data is misaligned with the “modern” categories.

emmyHTML <- read_html("https://en.wikipedia.org/wiki/List_of_Primetime_Emmy_Award_winners")
emmyTable <- rvest::html_table(emmyHTML, fill = TRUE)[[1]]  # parse all tables on the page and keep the first
emmyTable <- emmyTable[emmyTable$Year >= 1966, ]            # keep 1966 and later
kable(head(emmyTable))
Year Comedy Drama Variety Lead Comedy Actor Lead Drama Actor Lead Comedy Actress Lead Drama Actress
18 1966 The Dick Van Dyke Show (CBS) The Fugitive (ABC) The Andy Williams Show (NBC) Dick Van Dyke The Dick Van Dyke Show (CBS) Bill Cosby I Spy (NBC) Mary Tyler Moore The Dick Van Dyke Show (CBS) Barbara Stanwyck The Big Valley (ABC)
19 1967 The Monkees (NBC) Mission: Impossible (CBS) The Andy Williams Show (NBC) Don Adams Get Smart (NBC) Bill Cosby I Spy (NBC) Lucille Ball The Lucy Show (CBS) Barbara Bain Mission: Impossible (CBS)
20 1968 Get Smart (NBC) Mission: Impossible (CBS) The Andy Williams Show (NBC) Don Adams Get Smart (NBC) Bill Cosby I Spy (NBC) Bill Cosby I Spy (NBC) Rowan & Martin’s Laugh-In (NBC)
21 1969 Get Smart (NBC) Mission: Impossible (CBS) The Andy Williams Show (NBC) NET Playhouse (NET) Bill Cosby I Spy (NBC) Carl Betz Judd for the Defense (ABC) Hope Lange The Ghost & Mrs. Muir (ABC)
22 1970 My World and Welcome to It (NBC) Marcus Welby, M.D. (ABC) The Andy Williams Show (NBC) The David Frost Show (Syndicated) William Windom My World and Welcome to It (NBC) Robert Young Marcus Welby, M.D. (ABC) Susan Hampshire The Forsyte Saga (NET)
23 1971 All in the Family (CBS) The Bold Ones: The Senator (NBC) Singer Presents Burt Bacharach (CBS) Jack Klugman The Odd Couple (ABC) Hal Holbrook The Bold Ones: The Senator (NBC) Jean Stapleton All in the Family (CBS) Susan Hampshire The First Churchills (Masterpiece Theatre) (PBS)

Tidy

First we gather the award columns into a Category key column, excluding Year; then we sort by Year. Next we break up the Winner entry, extracting the network from the parentheses (the greedy first capture group means that, for entries with two parenthesized parts, only the last one becomes the Network).

emmyTable <- gather(emmyTable, Category, Winner, -Year) %>% arrange(Year)
emmyTidy <- emmyTable %>% extract(Winner, c("Winner", "Network"), "(.+) \\((.+)\\)")
kable(head(emmyTidy))
Year Category Winner Network
1966 Comedy The Dick Van Dyke Show CBS
1966 Drama The Fugitive ABC
1966 Variety The Andy Williams Show NBC
1966 Lead Comedy Actor Dick Van Dyke The Dick Van Dyke Show CBS
1966 Lead Drama Actor Bill Cosby I Spy NBC
1966 Lead Comedy Actress Mary Tyler Moore The Dick Van Dyke Show CBS

Analysis

Some basic analysis shows that NBC, CBS, ABC, and HBO are, in that order, the most-awarded networks. We can also see the changing breakdown of winning networks as the years progress: the networks outside the top group have started taking more Emmys.

networkCount <- emmyTidy %>% group_by(Network) %>% summarize(count = n()) %>% filter(count > 8) %>% arrange(desc(count))
kable(networkCount)
Network count
NBC 112
CBS 96
ABC 83
HBO 27
AMC 9
emmyTidy %>% filter(Network %in% networkCount$Network) %>% ggplot(aes(x = Year, color = Network)) + geom_bar()
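Note that color= on geom_bar only colors the bar outlines, which can make the stacked breakdown hard to read; mapping the networks to fill= instead is a small variation on the chart above:

emmyTidy %>%
  filter(Network %in% networkCount$Network) %>%
  ggplot(aes(x = Year, fill = Network)) +
  geom_bar()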