The goal of this project was to select three untidy datasets posted in the discussion forum and prepare them for analysis. I chose the following data sets:
For each post, I discuss the issues with the dataset, propose the plan to tidy the data, execute the plan, and reflect on the final dataset.
The data in a table on a wikipedia entry about FIFA World Cup Winners. After scraping the table from the site, we can see how the data is untidy Some columns have two different values. For example, Years_Won has both the year and an abbreviated name of the country. Columns for “Other_Appearances_as_Coach” and “Other_Appearances_as_Player” have multiple entries as well.
url1 <- 'https://en.wikipedia.org/wiki/List_of_FIFA_World_Cup_winners'
coaches<- url1 %>%
read_html() %>%
html_nodes(xpath = '//*[@id="mw-content-text"]/div/table[4]') %>%
html_table(fill = TRUE)
coaches <- coaches[[1]]
coaches <- as.data.frame(coaches)
coaches$T <- NULL
coaches <- coaches[-1,]
names(coaches)[2:5] <- c('Champion_Country','Years_Won','Other_Appearances_as_Coach', 'Other_Appearances_as_Player')
coaches<- tbl_df(coaches)
head(coaches, 10)
## # A tibble: 10 x 5
## Coach Champion_Country Years_Won Other_Appearance~ Other_Appearance~
## <chr> <chr> <chr> <chr> <chr>
## 1 Vittor~ Italy 1934 ITA,~ "" ""
## 2 Enzo B~ Italy 1982 ITA 1978 ITA, 1986 I~ ""
## 3 Franz ~ West Germany 1990 FRG 1986 FRG 1966, 1970, 1974
## 4 Carlos~ Argentina 1986 ARG 1990 ARG ""
## 5 Vicent~ Spain 2010 ESP 2014 ESP ""
## 6 Didier~ France 2018 FRA 2014 FRA 1998
## 7 Vicent~ Brazil 1958 BRA 1966 BRA ""
## 8 Sepp H~ West Germany 1954 FRG 1938 GER, 1958 F~ ""
## 9 Aimé J~ France 1998 FRA "" ""
## 10 Marcel~ Italy 2006 ITA 2010 ITA ""
The plan is to create six variables: 1. Championship Year 2. Last_Name of the coach 3. First_name of the coach 4. Champion_Country (Full name) 5. Number of other appearances by the coach as a coach 6. Number of other appearance as a player
The block below creates the coaches1 data frame which separates the coach name into last name, first name variables, and parses the Championship year and country into their own variables.
coaches1<-coaches %>%
mutate(Years_Won=strsplit(as.character(Years_Won), ", ")) %>%
unnest(Years_Won) %>%
separate(Years_Won, c("Championship_Year", "Country2")) %>%
select(Championship_Year, Coach, Champion_Country) %>%
separate(Coach, c("First_Name", "Last_Name"), extra = "merge", fill = "right")
## Warning: package 'bindrcpp' was built under R version 3.5.1
The block below creates the coaches2 and 2a data frames which separates out Other Appearances into their own variables and counts the number of other appearances as a coach.
coaches2<-coaches %>%
mutate(Other_Appearances_as_Coach=strsplit(as.character(Other_Appearances_as_Coach), ", ")) %>%
unnest(Other_Appearances_as_Coach) %>%
separate(Other_Appearances_as_Coach, c("Other_Appearance_as_Coach_Year", "Representing_Country_Coach")) %>%
separate(Coach, c("First_Name", "Last_Name"), extra = "merge", fill = "right") %>%
select(Other_Appearance_as_Coach_Year, Last_Name, Representing_Country_Coach)
coaches2a<-coaches2 %>%
group_by(Last_Name) %>%
summarise("No_of_Other_Appearances_as_Coach" = n())
The block below creates the coaches3 data frame which separates out Other Appearances into their own variables and counts the number of other appearances as a player.
coaches3 <- coaches %>%
mutate(Other_Appearances_as_Player=strsplit(as.character(Other_Appearances_as_Player), ", ")) %>%
unnest(Other_Appearances_as_Player) %>%
separate(Coach, c("First_Name", "Last_Name"), extra = "merge", fill = "right") %>%
select(Other_Appearances_as_Player, Last_Name)
coaches3a<-coaches3 %>%
group_by(Last_Name) %>%
summarise("No_of_Other_Appearances_as_Player" = n())
Data frames coaches 2a and coaches 3a are merged into the coaches data frame to create the FINAL_COACHES_DF data frame.
coaches4 <-merge(coaches1,coaches2a,by="Last_Name",all.x = TRUE)
coaches4 <-merge(coaches4,coaches3a,by="Last_Name", all.x = TRUE)
coaches4[is.na(coaches4)] <-0
FINAL_COACHES_DF <- coaches4 %>%
select(Championship_Year,Last_Name,First_Name,Champion_Country,No_of_Other_Appearances_as_Coach,No_of_Other_Appearances_as_Player) %>%
arrange(Championship_Year)
FINAL_COACHES_DF
## Championship_Year Last_Name First_Name Champion_Country
## 1 1930 Suppici Alberto Uruguay
## 2 1934 Pozzo Vittorio Italy
## 3 1938 Pozzo Vittorio Italy
## 4 1950 López Juan Uruguay
## 5 1954 Herberger Sepp West Germany
## 6 1958 Feola Vicente Brazil
## 7 1962 Moreira Aymoré Brazil
## 8 1966 Ramsey Alf England
## 9 1970 Zagallo Mário Brazil
## 10 1974 Schön Helmut West Germany
## 11 1978 Luis Menotti César Argentina
## 12 1982 Bearzot Enzo Italy
## 13 1986 Bilardo Carlos Argentina
## 14 1990 Beckenbauer Franz West Germany
## 15 1994 Alberto Parreira Carlos Brazil
## 16 1998 Jacquet Aimé France
## 17 2002 Felipe Scolari Luiz Brazil
## 18 2006 Lippi Marcello Italy
## 19 2010 del Bosque Vicente Spain
## 20 2014 Löw Joachim Germany
## 21 2018 Deschamps Didier France
## No_of_Other_Appearances_as_Coach No_of_Other_Appearances_as_Player
## 1 0 0
## 2 0 0
## 3 0 0
## 4 1 0
## 5 3 0
## 6 1 0
## 7 0 0
## 8 1 1
## 9 2 2
## 10 3 0
## 11 1 0
## 12 2 0
## 13 1 0
## 14 1 3
## 15 5 0
## 16 0 0
## 17 2 0
## 18 1 0
## 19 1 0
## 20 2 0
## 21 1 1
The data is now in a tidy format and is ready for analysis and questions such as:
what’s the total number of other appearances as a coach for championship coaches?
sum(FINAL_COACHES_DF$No_of_Other_Appearances_as_Coach)
## [1] 28
what’s the total number of other appearances as a player for championship coaches?
sum(FINAL_COACHES_DF$No_of_Other_Appearances_as_Player)
## [1] 7
Show the countries and number of World championships?
FINAL_COACHES_DF %>%
group_by(Champion_Country) %>%
summarise(Champion = n()) %>%
arrange(desc(Champion))
## # A tibble: 9 x 2
## Champion_Country Champion
## <chr> <int>
## 1 Brazil 5
## 2 Italy 4
## 3 West Germany 3
## 4 Argentina 2
## 5 France 2
## 6 Uruguay 2
## 7 England 1
## 8 Germany 1
## 9 Spain 1
The data is in wide format. Iformation about the number of Grand Slam (“GS”) tournaments won at the time and total number of GS tournament won is in parenthesis. For example, “Don Budge (1/6)”. Also, for the name of the champion, there are other entries other than names. For example,“started in 1891” or “France held under German occupation***" or “tournament not created”. These kind of entries make it difficult for analysis because they have to be accounted for when running the analysis.
url2 <- 'https://en.wikipedia.org/wiki/List_of_Grand_Slam_men%27s_singles_champions'
grand_slam<- url2 %>%
read_html() %>%
html_nodes(xpath = '//*[@id="mw-content-text"]/div/table[1]') %>%
html_table(fill = TRUE)
grand_slam <- grand_slam[[1]]
grand_slam <- as.data.frame(grand_slam)
head(grand_slam,10)
## Year Australian Open French Open
## 1 1877 started in 1905 started in 1891
## 2 1878 tournament not created tournament not created
## 3 1879 tournament not created tournament not created
## 4 1880 tournament not created tournament not created
## 5 1881 tournament not created tournament not created
## 6 1882 tournament not created tournament not created
## 7 1883 tournament not created tournament not created
## 8 1884 tournament not created tournament not created
## 9 1885 tournament not created tournament not created
## 10 1886 tournament not created tournament not created
## Wimbledon US Open
## 1 GBR: Spencer Gore started in 1881
## 2 Frank Hadow tournament not created
## 3 John Hartley (1/2) tournament not created
## 4 John Hartley (2/2) tournament not created
## 5 William Renshaw (1/7) USA: Richard Sears (1/7)
## 6 William Renshaw (2/7) Richard Sears (2/7)
## 7 William Renshaw (3/7) Richard Sears (3/7)
## 8 William Renshaw (4/7) Richard Sears (4/7)
## 9 William Renshaw (5/7) Richard Sears (5/7)
## 10 William Renshaw (6/7) Richard Sears (6/7)
I added: 4. Parse out the tournament won at the time and total tournaments won into their own variables. Use regular expressions to parse out the data, transform it to numeric data, and replace the NAs with 0 for those years without a champion. 5. For Mens_Champion, if there was no winner for whatever reason, change it None. 6. For country of champion, use full country’s name, not just the three letter abbreviation. 7. Add a column for the reason why there was no champion that particular year.
grand_slam2<- grand_slam %>%
gather(Tournament_Name,Mens_Champion,-Year) %>%
mutate(Tournament_Type="Grand Slam", Country_of_Tournament="") %>%
separate(Mens_Champion, c('Country_of_Champion','Mens_Champion'), sep=":", extra = "merge", fill="left")
grand_slam2<-grand_slam2 %>%
mutate(Country_of_Tournament = case_when(Tournament_Name == 'Australian Open' ~ "Australia", Tournament_Name == 'French Open' ~ "France",
Tournament_Name == 'Wimbledon' ~ "England", Tournament_Name == "US Open" ~ "United States", TRUE ~ Country_of_Tournament)) %>%
mutate(Country_of_Champion = case_when(Country_of_Champion=='AUS'~'Australia',
Country_of_Champion=='NZL'~'New Zealand',
Country_of_Champion=='PER'~'Peru',
Country_of_Champion=='RSA'~'Republic of South Africa',
Country_of_Champion=='GER'~'Germany',
Country_of_Champion=='CZE'~'Czech Republic',
Country_of_Champion=='SRB'~'Serbia',
Country_of_Champion=='FRA'~'France',
Country_of_Champion=='HUN'~'Hungary',
Country_of_Champion=='EGY'~'Egypt',
Country_of_Champion=='SWE'~'Sweden',
Country_of_Champion=='ITA'~'Italy',
Country_of_Champion=='ESP'~'Spain',
Country_of_Champion=='TCH'~'Czechoslovakia',
Country_of_Champion=='ARG'~'Argentina',
Country_of_Champion=='ECU'~'Ecuador',
Country_of_Champion=='AUT'~'Austria',
Country_of_Champion=='RUS'~'Russia',
Country_of_Champion=='BRA'~'Brazil',
Country_of_Champion=='GBR'~'Great Britain',
Country_of_Champion=='FRG'~'West Germany',
Country_of_Champion=='NED'~'Netherlands',
Country_of_Champion=='SUI'~'Switzerland',
Country_of_Champion=='USA'~'United States of America',
Country_of_Champion=='MEX'~'Mexico',
Country_of_Champion=='ROU'~'Romania', TRUE ~ Country_of_Champion)) %>%
mutate("No_Champion_Reason" = case_when(Mens_Champion=='started in 1905'~ Mens_Champion,
Mens_Champion=="tournament not created"~ Mens_Champion,
Mens_Champion == "World War I" ~ Mens_Champion,
Mens_Champion == "World War II"~ Mens_Champion,
Mens_Champion == "Tournament canceled" ~ Mens_Champion,
Mens_Champion == "France held under German occupation***"~ Mens_Champion,
Mens_Champion == "started in 1891"~ Mens_Champion,
Mens_Champion == "Tournament date changed"~ Mens_Champion,
Mens_Champion == "started in 1881"~ Mens_Champion,TRUE ~ 'See Champion')) %>%
mutate(Mens_Champion = case_when(Mens_Champion=='started in 1905'~'None',
Mens_Champion=="tournament not created"~ 'None',
Mens_Champion == "World War I" ~ 'None',
Mens_Champion == "World War II"~ 'None',
Mens_Champion == "Tournament canceled" ~ 'None',
Mens_Champion == "France held under German occupation***"~ 'None',
Mens_Champion == "started in 1891"~ 'None',
Mens_Champion == "started in 1881"~ 'None',
Mens_Champion == 'Tournament date changed'~ 'None',
TRUE ~ Mens_Champion ))
pattern = "\\([[:digit:]]{1,2}\\/[[:digit:]]{1,2}\\)"
pattern2 = "[()]"
pattern3 = "[^[:alpha:][:space:]]"
grand_slam3 <- grand_slam2 %>%
mutate(Championships=gsub(pattern2, "", str_extract(grand_slam2$Mens_Champion, pattern))) %>%
separate(Championships,c("GS_Championships_Won_at_the_Time", "Total_GS_Won"), sep="/")
grand_slam3$GS_Championships_Won_at_the_Time[is.na(grand_slam3$GS_Championships_Won_at_the_Time) & grand_slam3$Mens_Champion != 'None'] <- 1
grand_slam3$Total_GS_Won[is.na(grand_slam3$Total_GS_Won) & grand_slam3$Mens_Champion != 'None'] <- 1
grand_slam3$Mens_Champion<-str_trim(gsub(pattern3, "", grand_slam3$Mens_Champion))
grand_slam3$GS_Championships_Won_at_the_Time <- as.numeric(grand_slam3$GS_Championships_Won_at_the_Time)
grand_slam3$Total_GS_Won <- as.numeric(grand_slam3$Total_GS_Won)
grand_slam3$Total_GS_Won[is.na(grand_slam3$Total_GS_Won)] <- 0
grand_slam3$GS_Championships_Won_at_the_Time[is.na(grand_slam3$GS_Championships_Won_at_the_Time)] <- 0
grand_slam_final <- arrange(grand_slam3, Year)
The Country_of_Champion variable was not consistently recorded. For some entries, the player’s country had an abbreviation before the name like so, “GER: Gottfried von Cramm”, but this was not consistent with other players. One future correction would be to update that variable.
I debated on whether to keep in records without a champion entry or not. Keeping in data for those years where there was no champion may not be good for analysis. In the end, I left those records in since they do offer addtional information and be can easily removed.
The data is now in a tidy format available for analysis:
What mens player has won the most Grand Slam tennis titles?
grand_slam_final %>%
filter( GS_Championships_Won_at_the_Time == max(GS_Championships_Won_at_the_Time)) %>%
select(Mens_Champion, Total_GS_Won)
## Mens_Champion Total_GS_Won
## 1 Roger Federer 20
What mens player has won the most Grand Slam tennis titles per tournament?
grand_slam_final %>%
filter(Mens_Champion != 'None') %>%
group_by(Tournament_Name) %>%
add_count(Mens_Champion) %>%
select(Tournament_Name, Mens_Champion,n) %>%
filter(n==max(n)) %>%
arrange(Tournament_Name) %>%
distinct()
## # A tibble: 8 x 3
## # Groups: Tournament_Name [4]
## Tournament_Name Mens_Champion n
## <chr> <chr> <int>
## 1 Australian Open Roy Emerson 6
## 2 Australian Open Roger Federer 6
## 3 Australian Open Novak Djokovic 6
## 4 French Open Rafael Nadal 11
## 5 US Open Richard Sears 7
## 6 US Open William Larned 7
## 7 US Open Bill Tilden 7
## 8 Wimbledon Roger Federer 8
This dataset was exported from the website Chicago Data Portal (link) The site tracks crime statistics from 2001 to the present. For this project, I exported Homicide statistics for 2018. The data has a lot of duplicative variables which can be removed for analysis purposes.
Chicago_Homicides_2018<- read.csv('Chicago Homicides 2018.csv')
Chicago_Homicides_2018$Date <- mdy_hms(Chicago_Homicides_2018$Date)
head(Chicago_Homicides_2018,10)
## ID Case.Number Date
## 1 23810 JB140548 2018-02-04 01:36:00
## 2 23811 JB141441 2018-02-05 01:10:00
## 3 23812 JB142661 2018-02-06 04:10:00
## 4 23813 JB143932 2018-02-07 09:23:00
## 5 23814 JB143932 2018-02-07 09:23:00
## 6 23815 JB145915 2018-02-08 20:43:00
## 7 23834 JB160133 2018-02-22 09:22:00
## 8 23835 JB161360 2018-02-22 15:35:00
## 9 23836 JB161698 2018-02-22 18:39:00
## 10 23838 JB163448 2018-02-24 08:23:00
## Block IUCR Primary.Type
## 1 117XX S SANGAMON ST 110 HOMICIDE
## 2 118XX S INDIANA AVE 110 HOMICIDE
## 3 062XX S MAY ST 110 HOMICIDE
## 4 014XX S KENNETH AVE 110 HOMICIDE
## 5 014XX S KENNETH AVE 110 HOMICIDE
## 6 006XX N LARAMIE AVE 110 HOMICIDE
## 7 027XX S AVERS AVE 110 HOMICIDE
## 8 112XX S DR MARTIN LUTHER KING JR DR 110 HOMICIDE
## 9 070XX S MORGAN ST 110 HOMICIDE
## 10 066XX S MINERVA AVE 110 HOMICIDE
## Description Location.Description Arrest Domestic Beat District
## 1 FIRST DEGREE MURDER STREET false false 524 5
## 2 FIRST DEGREE MURDER VACANT LOT false false 532 5
## 3 FIRST DEGREE MURDER APARTMENT false false 712 7
## 4 FIRST DEGREE MURDER APARTMENT false false 1011 10
## 5 FIRST DEGREE MURDER APARTMENT false false 1011 10
## 6 FIRST DEGREE MURDER HOUSE false false 1532 15
## 7 FIRST DEGREE MURDER APARTMENT true false 1031 10
## 8 FIRST DEGREE MURDER PARK PROPERTY true false 531 5
## 9 FIRST DEGREE MURDER ALLEY false false 733 7
## 10 FIRST DEGREE MURDER APARTMENT true true 321 3
## Ward Community.Area FBI.Code X.Coordinate Y.Coordinate Year
## 1 34 53 01A 1172100 1827003 2018
## 2 9 53 01A 1179707 1826280 2018
## 3 16 68 01A 1169720 1863456 2018
## 4 24 29 01A 1146951 1892767 2018
## 5 24 29 01A 1146951 1892767 2018
## 6 28 25 01A 1141583 1903680 2018
## 7 22 30 01A 1151152 1885454 2018
## 8 9 49 01A 1180913 1830483 2018
## 9 17 68 01A 1170861 1858101 2018
## 10 5 42 01A 1185025 1861230 2018
## Updated.On Latitude Longitude Location
## 1 02/11/2018 03:57:41 PM 41.68074 -87.64566 (41.68073915, -87.645661144)
## 2 02/12/2018 03:49:14 PM 41.67859 -87.61784 (41.678585145, -87.617837834)
## 3 02/13/2018 03:58:53 PM 41.78082 -87.65332 (41.78082323, -87.653318937)
## 4 02/14/2018 03:52:17 PM 41.86172 -87.73605 (41.861720639, -87.736048391)
## 5 02/14/2018 03:52:17 PM 41.86172 -87.73605 (41.861720639, -87.736048391)
## 6 02/15/2018 03:50:23 PM 41.89177 -87.75548 (41.891768093, -87.755483944)
## 7 03/01/2018 03:54:55 PM 41.84157 -87.72082 (41.841571637, -87.720818596)
## 8 03/01/2018 03:54:55 PM 41.69009 -87.61329 (41.690091209, -87.61329487)
## 9 03/01/2018 03:54:55 PM 41.76610 -87.64929 (41.766103626, -87.64929197)
## 10 03/03/2018 03:54:43 PM 41.77437 -87.59728 (41.774369057, -87.597278237)
Chicago_Homicides_2018_01<-Chicago_Homicides_2018 %>%
select(Case.Number, Date, Block, Description, Location.Description, Arrest, Domestic, Beat, District, Ward, Latitude, Longitude) %>%
arrange(Date)
pattern4 = "^[[:digit:]]{1,3}[[Xx]]{1,3}"
Chicago_Homicides_2018_01$Block<-str_trim(str_replace_all(Chicago_Homicides_2018_01$Block,pattern4,""))
Chicago_Homicides_2018_01$Arrest <- str_replace(Chicago_Homicides_2018_01$Arrest, "false", "No")
Chicago_Homicides_2018_01$Arrest <- str_replace(Chicago_Homicides_2018_01$Arrest, "true", "Yes")
Chicago_Homicides_2018_01$Domestic <- str_replace(Chicago_Homicides_2018_01$Domestic, "false", "No")
Chicago_Homicides_2018_01$Domestic <- str_replace(Chicago_Homicides_2018_01$Domestic, "true", "Yes")
Chicago_Homicides_2018_01<-Chicago_Homicides_2018_01 %>%
add_count(Block)
names(Chicago_Homicides_2018_01)[13] <- c('No_of_Homicides_on_Block')
Chicago_Homicides_2018_01
## # A tibble: 424 x 13
## Case.Number Date Block Description Location.Descri~
## <fct> <dttm> <chr> <fct> <fct>
## 1 JB100159 2018-01-01 02:46:00 W CORNE~ FIRST DEGREE~ AUTO
## 2 JB100522 2018-01-01 11:33:00 N HAMPD~ FIRST DEGREE~ APARTMENT
## 3 JB100839 2018-01-01 18:27:00 S PRINC~ FIRST DEGREE~ AUTO
## 4 JB101043 2018-01-01 22:40:00 S STONY~ FIRST DEGREE~ ALLEY
## 5 JB105765 2018-01-05 18:50:00 W LAKE ~ FIRST DEGREE~ STREET
## 6 JB105277 2018-01-06 12:54:00 E 62ND ~ FIRST DEGREE~ STREET
## 7 JB106545 2018-01-06 13:20:00 S MAY ST FIRST DEGREE~ STREET
## 8 JB107582 2018-01-07 13:35:00 N LARAM~ FIRST DEGREE~ STREET
## 9 JB107731 2018-01-07 16:37:00 N CHRIS~ FIRST DEGREE~ STREET
## 10 JB109411 2018-01-09 04:08:00 W GARFI~ FIRST DEGREE~ PARKING LOT
## # ... with 414 more rows, and 8 more variables: Arrest <chr>,
## # Domestic <chr>, Beat <int>, District <int>, Ward <int>,
## # Latitude <dbl>, Longitude <dbl>, No_of_Homicides_on_Block <int>
Chicago_Homicides_2018_01 %>%
filter(No_of_Homicides_on_Block == max(No_of_Homicides_on_Block))
## # A tibble: 6 x 13
## Case.Number Date Block Description Location.Descrip~
## <fct> <dttm> <chr> <fct> <fct>
## 1 JB247027 2018-05-23 14:19:00 S PAULI~ FIRST DEGREE~ STREET
## 2 JB314967 2018-06-21 15:01:00 S PAULI~ FIRST DEGREE~ STREET
## 3 JB345261 2018-07-11 15:58:00 S PAULI~ FIRST DEGREE~ STREET
## 4 JB345912 2018-07-11 23:30:00 S PAULI~ FIRST DEGREE~ YARD
## 5 JB407497 2018-08-24 23:55:00 S PAULI~ FIRST DEGREE~ AUTO
## 6 JB432708 2018-09-12 17:46:00 S PAULI~ FIRST DEGREE~ STREET
## # ... with 8 more variables: Arrest <chr>, Domestic <chr>, Beat <int>,
## # District <int>, Ward <int>, Latitude <dbl>, Longitude <dbl>,
## # No_of_Homicides_on_Block <int>
months<-month(Chicago_Homicides_2018_01$Date, label = TRUE)
plot(months)