IS 607 - Project 2 - Preparing Different Datasets for Downstream Analysis

I. Introduction

The goal of this project was to select three untidy datasets posted in the discussion forum and prepare them for analysis. I chose the following data sets:

Coaches who have won the World Cup - posted by Joseph Garcia.
List of Grand Slam men’s singles champions - posted by Peter Kowalchuk
Chicago Homicides 2018 - posted by John K. Hancock

For each post, I discuss the issues with the dataset, propose the plan to tidy the data, execute the plan, and reflect on the final dataset.

II. Datasets and Analysis

1. Coaches who have won the World Cup - posted by Joseph Garcia post

1a. Discussion

The data in a table on a wikipedia entry about FIFA World Cup Winners. After scraping the table from the site, we can see how the data is untidy Some columns have two different values. For example, Years_Won has both the year and an abbreviated name of the country. Columns for “Other_Appearances_as_Coach” and “Other_Appearances_as_Player” have multiple entries as well.

url1 <- 'https://en.wikipedia.org/wiki/List_of_FIFA_World_Cup_winners'

coaches<- url1  %>% 
  read_html() %>%
  html_nodes(xpath = '//*[@id="mw-content-text"]/div/table[4]') %>% 
  html_table(fill = TRUE)

coaches <- coaches[[1]]


coaches <- as.data.frame(coaches)

coaches$T <- NULL 
coaches <- coaches[-1,]

names(coaches)[2:5] <- c('Champion_Country','Years_Won','Other_Appearances_as_Coach', 'Other_Appearances_as_Player')


coaches<- tbl_df(coaches)
head(coaches, 10)

## # A tibble: 10 x 5
##    Coach   Champion_Country Years_Won  Other_Appearance~ Other_Appearance~
##    <chr>   <chr>            <chr>      <chr>             <chr>            
##  1 Vittor~ Italy            1934 ITA,~ ""                ""               
##  2 Enzo B~ Italy            1982 ITA   1978 ITA, 1986 I~ ""               
##  3 Franz ~ West Germany     1990 FRG   1986 FRG          1966, 1970, 1974 
##  4 Carlos~ Argentina        1986 ARG   1990 ARG          ""               
##  5 Vicent~ Spain            2010 ESP   2014 ESP          ""               
##  6 Didier~ France           2018 FRA   2014 FRA          1998             
##  7 Vicent~ Brazil           1958 BRA   1966 BRA          ""               
##  8 Sepp H~ West Germany     1954 FRG   1938 GER, 1958 F~ ""               
##  9 Aimé J~ France           1998 FRA   ""                ""               
## 10 Marcel~ Italy            2006 ITA   2010 ITA          ""

1b. The Plan

The plan is to create six variables: 1. Championship Year 2. Last_Name of the coach 3. First_name of the coach 4. Champion_Country (Full name) 5. Number of other appearances by the coach as a coach 6. Number of other appearance as a player

1c. Execution

The block below creates the coaches1 data frame which separates the coach name into last name, first name variables, and parses the Championship year and country into their own variables.

coaches1<-coaches %>%
  mutate(Years_Won=strsplit(as.character(Years_Won), ", ")) %>% 
  unnest(Years_Won) %>% 
  separate(Years_Won, c("Championship_Year", "Country2")) %>% 
  select(Championship_Year, Coach, Champion_Country) %>% 
  separate(Coach, c("First_Name", "Last_Name"), extra = "merge", fill = "right")

## Warning: package 'bindrcpp' was built under R version 3.5.1

The block below creates the coaches2 and 2a data frames which separates out Other Appearances into their own variables and counts the number of other appearances as a coach.

coaches2<-coaches %>%
  mutate(Other_Appearances_as_Coach=strsplit(as.character(Other_Appearances_as_Coach), ", ")) %>% 
  unnest(Other_Appearances_as_Coach) %>% 
  separate(Other_Appearances_as_Coach, c("Other_Appearance_as_Coach_Year", "Representing_Country_Coach")) %>%
  separate(Coach, c("First_Name", "Last_Name"), extra = "merge", fill = "right") %>% 
  select(Other_Appearance_as_Coach_Year, Last_Name,  Representing_Country_Coach)


coaches2a<-coaches2 %>%
    group_by(Last_Name) %>%
    summarise("No_of_Other_Appearances_as_Coach" = n())

The block below creates the coaches3 data frame which separates out Other Appearances into their own variables and counts the number of other appearances as a player.

coaches3 <- coaches %>% 
  mutate(Other_Appearances_as_Player=strsplit(as.character(Other_Appearances_as_Player), ", ")) %>% 
  unnest(Other_Appearances_as_Player) %>% 
  separate(Coach, c("First_Name", "Last_Name"), extra = "merge", fill = "right") %>%
  select(Other_Appearances_as_Player, Last_Name)

coaches3a<-coaches3 %>%
    group_by(Last_Name) %>%
    summarise("No_of_Other_Appearances_as_Player" = n())

Data frames coaches 2a and coaches 3a are merged into the coaches data frame to create the FINAL_COACHES_DF data frame.

coaches4 <-merge(coaches1,coaches2a,by="Last_Name",all.x = TRUE)
coaches4 <-merge(coaches4,coaches3a,by="Last_Name", all.x = TRUE)

coaches4[is.na(coaches4)] <-0

FINAL_COACHES_DF <- coaches4 %>% 
    select(Championship_Year,Last_Name,First_Name,Champion_Country,No_of_Other_Appearances_as_Coach,No_of_Other_Appearances_as_Player) %>% 
    arrange(Championship_Year)


FINAL_COACHES_DF

##    Championship_Year        Last_Name First_Name Champion_Country
## 1               1930          Suppici    Alberto          Uruguay
## 2               1934            Pozzo   Vittorio            Italy
## 3               1938            Pozzo   Vittorio            Italy
## 4               1950            López       Juan          Uruguay
## 5               1954        Herberger       Sepp     West Germany
## 6               1958            Feola    Vicente           Brazil
## 7               1962          Moreira     Aymoré           Brazil
## 8               1966           Ramsey        Alf          England
## 9               1970          Zagallo      Mário           Brazil
## 10              1974            Schön     Helmut     West Germany
## 11              1978     Luis Menotti      César        Argentina
## 12              1982          Bearzot       Enzo            Italy
## 13              1986          Bilardo     Carlos        Argentina
## 14              1990      Beckenbauer      Franz     West Germany
## 15              1994 Alberto Parreira     Carlos           Brazil
## 16              1998          Jacquet       Aimé           France
## 17              2002   Felipe Scolari       Luiz           Brazil
## 18              2006            Lippi   Marcello            Italy
## 19              2010       del Bosque    Vicente            Spain
## 20              2014              Löw    Joachim          Germany
## 21              2018        Deschamps     Didier           France
##    No_of_Other_Appearances_as_Coach No_of_Other_Appearances_as_Player
## 1                                 0                                 0
## 2                                 0                                 0
## 3                                 0                                 0
## 4                                 1                                 0
## 5                                 3                                 0
## 6                                 1                                 0
## 7                                 0                                 0
## 8                                 1                                 1
## 9                                 2                                 2
## 10                                3                                 0
## 11                                1                                 0
## 12                                2                                 0
## 13                                1                                 0
## 14                                1                                 3
## 15                                5                                 0
## 16                                0                                 0
## 17                                2                                 0
## 18                                1                                 0
## 19                                1                                 0
## 20                                2                                 0
## 21                                1                                 1

1d. Reflection

The data is now in a tidy format and is ready for analysis and questions such as:

what’s the total number of other appearances as a coach for championship coaches?

sum(FINAL_COACHES_DF$No_of_Other_Appearances_as_Coach)

## [1] 28

what’s the total number of other appearances as a player for championship coaches?

sum(FINAL_COACHES_DF$No_of_Other_Appearances_as_Player)

## [1] 7

Show the countries and number of World championships?

FINAL_COACHES_DF %>% 
    group_by(Champion_Country) %>% 
    summarise(Champion = n()) %>% 
    arrange(desc(Champion))

## # A tibble: 9 x 2
##   Champion_Country Champion
##   <chr>               <int>
## 1 Brazil                  5
## 2 Italy                   4
## 3 West Germany            3
## 4 Argentina               2
## 5 France                  2
## 6 Uruguay                 2
## 7 England                 1
## 8 Germany                 1
## 9 Spain                   1

2.List of Grand Slam men’s singles champions - posted by Peter Kowalchuk

2a. Discussion

The data is in wide format. Iformation about the number of Grand Slam (“GS”) tournaments won at the time and total number of GS tournament won is in parenthesis. For example, “Don Budge (1/6)”. Also, for the name of the champion, there are other entries other than names. For example,“started in 1891” or “France held under German occupation***" or “tournament not created”. These kind of entries make it difficult for analysis because they have to be accounted for when running the analysis.

url2 <- 'https://en.wikipedia.org/wiki/List_of_Grand_Slam_men%27s_singles_champions'

grand_slam<- url2 %>% 
    read_html() %>% 
    html_nodes(xpath = '//*[@id="mw-content-text"]/div/table[1]') %>% 
    html_table(fill = TRUE)

grand_slam <- grand_slam[[1]]

grand_slam <- as.data.frame(grand_slam)

head(grand_slam,10)

##    Year        Australian Open            French Open
## 1  1877        started in 1905        started in 1891
## 2  1878 tournament not created tournament not created
## 3  1879 tournament not created tournament not created
## 4  1880 tournament not created tournament not created
## 5  1881 tournament not created tournament not created
## 6  1882 tournament not created tournament not created
## 7  1883 tournament not created tournament not created
## 8  1884 tournament not created tournament not created
## 9  1885 tournament not created tournament not created
## 10 1886 tournament not created tournament not created
##                Wimbledon                  US Open
## 1      GBR: Spencer Gore          started in 1881
## 2            Frank Hadow   tournament not created
## 3     John Hartley (1/2)   tournament not created
## 4     John Hartley (2/2)   tournament not created
## 5  William Renshaw (1/7) USA: Richard Sears (1/7)
## 6  William Renshaw (2/7)      Richard Sears (2/7)
## 7  William Renshaw (3/7)      Richard Sears (3/7)
## 8  William Renshaw (4/7)      Richard Sears (4/7)
## 9  William Renshaw (5/7)      Richard Sears (5/7)
## 10 William Renshaw (6/7)      Richard Sears (6/7)

2b. The Plan accoring to Peter Kowalchuk

Set our columns to: Year, Tournament Name, Player Name.
Add another column with the player’s country, and another with the tournaments location.
Another column that might be added is an identifier for the kind of tournament,tournament type for example. This might show GS for the original tournaments in wiki, and maybe ATPM for ATP Masters, EX for exhibition games, and so on.

I added: 4. Parse out the tournament won at the time and total tournaments won into their own variables. Use regular expressions to parse out the data, transform it to numeric data, and replace the NAs with 0 for those years without a champion. 5. For Mens_Champion, if there was no winner for whatever reason, change it None. 6. For country of champion, use full country’s name, not just the three letter abbreviation. 7. Add a column for the reason why there was no champion that particular year.

2c. The following code below executes the plan outlined above.

grand_slam2<- grand_slam %>% 
gather(Tournament_Name,Mens_Champion,-Year) %>% 
    mutate(Tournament_Type="Grand Slam", Country_of_Tournament="") %>% 
    separate(Mens_Champion, c('Country_of_Champion','Mens_Champion'), sep=":", extra = "merge", fill="left")
    

grand_slam2<-grand_slam2 %>%
    mutate(Country_of_Tournament = case_when(Tournament_Name == 'Australian Open'  ~ "Australia",  Tournament_Name == 'French Open' ~ "France",
                            Tournament_Name == 'Wimbledon' ~ "England", Tournament_Name == "US Open" ~ "United States", TRUE ~ Country_of_Tournament)) %>% 
mutate(Country_of_Champion = case_when(Country_of_Champion=='AUS'~'Australia',
Country_of_Champion=='NZL'~'New Zealand',
Country_of_Champion=='PER'~'Peru',
Country_of_Champion=='RSA'~'Republic of South Africa',
Country_of_Champion=='GER'~'Germany',
Country_of_Champion=='CZE'~'Czech Republic',
Country_of_Champion=='SRB'~'Serbia',
Country_of_Champion=='FRA'~'France',
Country_of_Champion=='HUN'~'Hungary',
Country_of_Champion=='EGY'~'Egypt',
Country_of_Champion=='SWE'~'Sweden',
Country_of_Champion=='ITA'~'Italy',
Country_of_Champion=='ESP'~'Spain',
Country_of_Champion=='TCH'~'Czechoslovakia',
Country_of_Champion=='ARG'~'Argentina',
Country_of_Champion=='ECU'~'Ecuador',
Country_of_Champion=='AUT'~'Austria',
Country_of_Champion=='RUS'~'Russia',
Country_of_Champion=='BRA'~'Brazil',
Country_of_Champion=='GBR'~'Great Britain',
Country_of_Champion=='FRG'~'West Germany',
Country_of_Champion=='NED'~'Netherlands',
Country_of_Champion=='SUI'~'Switzerland',
Country_of_Champion=='USA'~'United States of America',
Country_of_Champion=='MEX'~'Mexico',
Country_of_Champion=='ROU'~'Romania', TRUE ~ Country_of_Champion)) %>% 
mutate("No_Champion_Reason" = case_when(Mens_Champion=='started in 1905'~ Mens_Champion, 
                                        Mens_Champion=="tournament not created"~ Mens_Champion,
                                        Mens_Champion == "World War I" ~ Mens_Champion,
                                        Mens_Champion == "World War II"~ Mens_Champion,
                                        Mens_Champion ==  "Tournament canceled" ~ Mens_Champion,
                                        Mens_Champion == "France held under German occupation***"~ Mens_Champion, 
                                        Mens_Champion == "started in 1891"~ Mens_Champion,
                                        Mens_Champion == "Tournament date changed"~ Mens_Champion,
                                        Mens_Champion == "started in 1881"~ Mens_Champion,TRUE ~ 'See Champion')) %>% 
    
mutate(Mens_Champion = case_when(Mens_Champion=='started in 1905'~'None',
                                 Mens_Champion=="tournament not created"~ 'None',
                                 Mens_Champion == "World War I" ~ 'None',
                                 Mens_Champion == "World War II"~ 'None',
                                 Mens_Champion ==  "Tournament canceled" ~ 'None',
                                 Mens_Champion == "France held under German occupation***"~ 'None', 
                                 Mens_Champion == "started in 1891"~ 'None', 
                                 Mens_Champion == "started in 1881"~ 'None',
                                 Mens_Champion == 'Tournament date changed'~ 'None',
                                 TRUE ~ Mens_Champion ))

pattern = "\\([[:digit:]]{1,2}\\/[[:digit:]]{1,2}\\)"
pattern2 = "[()]"
pattern3 = "[^[:alpha:][:space:]]"

grand_slam3 <- grand_slam2 %>% 
    mutate(Championships=gsub(pattern2, "", str_extract(grand_slam2$Mens_Champion, pattern))) %>% 
    separate(Championships,c("GS_Championships_Won_at_the_Time", "Total_GS_Won"), sep="/") 
    

grand_slam3$GS_Championships_Won_at_the_Time[is.na(grand_slam3$GS_Championships_Won_at_the_Time) & grand_slam3$Mens_Champion != 'None'] <- 1

grand_slam3$Total_GS_Won[is.na(grand_slam3$Total_GS_Won) & grand_slam3$Mens_Champion != 'None'] <- 1

grand_slam3$Mens_Champion<-str_trim(gsub(pattern3, "", grand_slam3$Mens_Champion))

grand_slam3$GS_Championships_Won_at_the_Time <- as.numeric(grand_slam3$GS_Championships_Won_at_the_Time)

grand_slam3$Total_GS_Won <- as.numeric(grand_slam3$Total_GS_Won)

grand_slam3$Total_GS_Won[is.na(grand_slam3$Total_GS_Won)] <- 0

grand_slam3$GS_Championships_Won_at_the_Time[is.na(grand_slam3$GS_Championships_Won_at_the_Time)] <- 0

grand_slam_final <- arrange(grand_slam3, Year)

2d. Reflection

The Country_of_Champion variable was not consistently recorded. For some entries, the player’s country had an abbreviation before the name like so, “GER: Gottfried von Cramm”, but this was not consistent with other players. One future correction would be to update that variable.
I debated on whether to keep in records without a champion entry or not. Keeping in data for those years where there was no champion may not be good for analysis. In the end, I left those records in since they do offer addtional information and be can easily removed.
The data is now in a tidy format available for analysis:

What mens player has won the most Grand Slam tennis titles?

grand_slam_final %>% 
    filter( GS_Championships_Won_at_the_Time == max(GS_Championships_Won_at_the_Time)) %>% 
    select(Mens_Champion, Total_GS_Won)

##   Mens_Champion Total_GS_Won
## 1 Roger Federer           20

What mens player has won the most Grand Slam tennis titles per tournament?

grand_slam_final %>%
    filter(Mens_Champion != 'None') %>% 
    group_by(Tournament_Name) %>%
    add_count(Mens_Champion) %>%
    select(Tournament_Name, Mens_Champion,n) %>% 
    filter(n==max(n)) %>% 
    arrange(Tournament_Name) %>% 
    distinct()

## # A tibble: 8 x 3
## # Groups:   Tournament_Name [4]
##   Tournament_Name Mens_Champion      n
##   <chr>           <chr>          <int>
## 1 Australian Open Roy Emerson        6
## 2 Australian Open Roger Federer      6
## 3 Australian Open Novak Djokovic     6
## 4 French Open     Rafael Nadal      11
## 5 US Open         Richard Sears      7
## 6 US Open         William Larned     7
## 7 US Open         Bill Tilden        7
## 8 Wimbledon       Roger Federer      8

3.Chicago Homicides 2018. Posted by John K. Hancock

3a. Discussion

This dataset was exported from the website Chicago Data Portal (link) The site tracks crime statistics from 2001 to the present. For this project, I exported Homicide statistics for 2018. The data has a lot of duplicative variables which can be removed for analysis purposes.

Chicago_Homicides_2018<- read.csv('Chicago Homicides 2018.csv')

Chicago_Homicides_2018$Date <- mdy_hms(Chicago_Homicides_2018$Date)

head(Chicago_Homicides_2018,10)

##       ID Case.Number                Date
## 1  23810    JB140548 2018-02-04 01:36:00
## 2  23811    JB141441 2018-02-05 01:10:00
## 3  23812    JB142661 2018-02-06 04:10:00
## 4  23813    JB143932 2018-02-07 09:23:00
## 5  23814    JB143932 2018-02-07 09:23:00
## 6  23815    JB145915 2018-02-08 20:43:00
## 7  23834    JB160133 2018-02-22 09:22:00
## 8  23835    JB161360 2018-02-22 15:35:00
## 9  23836    JB161698 2018-02-22 18:39:00
## 10 23838    JB163448 2018-02-24 08:23:00
##                                  Block IUCR Primary.Type
## 1                  117XX S SANGAMON ST  110     HOMICIDE
## 2                  118XX S INDIANA AVE  110     HOMICIDE
## 3                       062XX S MAY ST  110     HOMICIDE
## 4                  014XX S KENNETH AVE  110     HOMICIDE
## 5                  014XX S KENNETH AVE  110     HOMICIDE
## 6                  006XX N LARAMIE AVE  110     HOMICIDE
## 7                    027XX S AVERS AVE  110     HOMICIDE
## 8  112XX S DR MARTIN LUTHER KING JR DR  110     HOMICIDE
## 9                    070XX S MORGAN ST  110     HOMICIDE
## 10                 066XX S MINERVA AVE  110     HOMICIDE
##            Description Location.Description Arrest Domestic Beat District
## 1  FIRST DEGREE MURDER               STREET  false    false  524        5
## 2  FIRST DEGREE MURDER           VACANT LOT  false    false  532        5
## 3  FIRST DEGREE MURDER            APARTMENT  false    false  712        7
## 4  FIRST DEGREE MURDER            APARTMENT  false    false 1011       10
## 5  FIRST DEGREE MURDER            APARTMENT  false    false 1011       10
## 6  FIRST DEGREE MURDER                HOUSE  false    false 1532       15
## 7  FIRST DEGREE MURDER            APARTMENT   true    false 1031       10
## 8  FIRST DEGREE MURDER        PARK PROPERTY   true    false  531        5
## 9  FIRST DEGREE MURDER                ALLEY  false    false  733        7
## 10 FIRST DEGREE MURDER            APARTMENT   true     true  321        3
##    Ward Community.Area FBI.Code X.Coordinate Y.Coordinate Year
## 1    34             53      01A      1172100      1827003 2018
## 2     9             53      01A      1179707      1826280 2018
## 3    16             68      01A      1169720      1863456 2018
## 4    24             29      01A      1146951      1892767 2018
## 5    24             29      01A      1146951      1892767 2018
## 6    28             25      01A      1141583      1903680 2018
## 7    22             30      01A      1151152      1885454 2018
## 8     9             49      01A      1180913      1830483 2018
## 9    17             68      01A      1170861      1858101 2018
## 10    5             42      01A      1185025      1861230 2018
##                Updated.On Latitude Longitude                      Location
## 1  02/11/2018 03:57:41 PM 41.68074 -87.64566  (41.68073915, -87.645661144)
## 2  02/12/2018 03:49:14 PM 41.67859 -87.61784 (41.678585145, -87.617837834)
## 3  02/13/2018 03:58:53 PM 41.78082 -87.65332  (41.78082323, -87.653318937)
## 4  02/14/2018 03:52:17 PM 41.86172 -87.73605 (41.861720639, -87.736048391)
## 5  02/14/2018 03:52:17 PM 41.86172 -87.73605 (41.861720639, -87.736048391)
## 6  02/15/2018 03:50:23 PM 41.89177 -87.75548 (41.891768093, -87.755483944)
## 7  03/01/2018 03:54:55 PM 41.84157 -87.72082 (41.841571637, -87.720818596)
## 8  03/01/2018 03:54:55 PM 41.69009 -87.61329  (41.690091209, -87.61329487)
## 9  03/01/2018 03:54:55 PM 41.76610 -87.64929  (41.766103626, -87.64929197)
## 10 03/03/2018 03:54:43 PM 41.77437 -87.59728 (41.774369057, -87.597278237)

3b. Plan

Separate the Date field into Date and Time. By doing this we can get better analysis as to when Homicides are most likely to occur.
Remove Unneccsary and redundant columns ID, Location, X.Coordinate, Y.Coordinate, Primary.Type, IUCR, Year, Updated.On, and FBI.Code
Remove the house information from the Block variable.
Replace the “true/false” values in the Arrest and Domestic variables to avoid confusion.
Add a column that tallies the number of Homicides on each block.

3c. Execution

Chicago_Homicides_2018_01<-Chicago_Homicides_2018 %>% 
    select(Case.Number, Date, Block, Description, Location.Description, Arrest, Domestic, Beat, District, Ward, Latitude, Longitude) %>% 
    arrange(Date)

pattern4 = "^[[:digit:]]{1,3}[[Xx]]{1,3}"

Chicago_Homicides_2018_01$Block<-str_trim(str_replace_all(Chicago_Homicides_2018_01$Block,pattern4,""))

Chicago_Homicides_2018_01$Arrest <- str_replace(Chicago_Homicides_2018_01$Arrest, "false", "No")
Chicago_Homicides_2018_01$Arrest <- str_replace(Chicago_Homicides_2018_01$Arrest, "true", "Yes")
Chicago_Homicides_2018_01$Domestic <- str_replace(Chicago_Homicides_2018_01$Domestic, "false", "No")
Chicago_Homicides_2018_01$Domestic <- str_replace(Chicago_Homicides_2018_01$Domestic, "true", "Yes")

Chicago_Homicides_2018_01<-Chicago_Homicides_2018_01 %>% 
    add_count(Block)

names(Chicago_Homicides_2018_01)[13] <- c('No_of_Homicides_on_Block')
Chicago_Homicides_2018_01

## # A tibble: 424 x 13
##    Case.Number Date                Block    Description   Location.Descri~
##    <fct>       <dttm>              <chr>    <fct>         <fct>           
##  1 JB100159    2018-01-01 02:46:00 W CORNE~ FIRST DEGREE~ AUTO            
##  2 JB100522    2018-01-01 11:33:00 N HAMPD~ FIRST DEGREE~ APARTMENT       
##  3 JB100839    2018-01-01 18:27:00 S PRINC~ FIRST DEGREE~ AUTO            
##  4 JB101043    2018-01-01 22:40:00 S STONY~ FIRST DEGREE~ ALLEY           
##  5 JB105765    2018-01-05 18:50:00 W LAKE ~ FIRST DEGREE~ STREET          
##  6 JB105277    2018-01-06 12:54:00 E 62ND ~ FIRST DEGREE~ STREET          
##  7 JB106545    2018-01-06 13:20:00 S MAY ST FIRST DEGREE~ STREET          
##  8 JB107582    2018-01-07 13:35:00 N LARAM~ FIRST DEGREE~ STREET          
##  9 JB107731    2018-01-07 16:37:00 N CHRIS~ FIRST DEGREE~ STREET          
## 10 JB109411    2018-01-09 04:08:00 W GARFI~ FIRST DEGREE~ PARKING LOT     
## # ... with 414 more rows, and 8 more variables: Arrest <chr>,
## #   Domestic <chr>, Beat <int>, District <int>, Ward <int>,
## #   Latitude <dbl>, Longitude <dbl>, No_of_Homicides_on_Block <int>

3d. Reflections

I had to convert the Date Time variable from character data to a date field.
I had to scrap my original plan to separate out the Date/Time due to the fact that doing so changes the Date field to a character field.
By removing the house number from the block, I was able to tally the number of Homcides on each block. By doing so, I can identify the block with the most number of Homcides so far this year. “S Paulina Street” has had six homicides so far this year.

Chicago_Homicides_2018_01 %>% 
    filter(No_of_Homicides_on_Block == max(No_of_Homicides_on_Block))

## # A tibble: 6 x 13
##   Case.Number Date                Block    Description   Location.Descrip~
##   <fct>       <dttm>              <chr>    <fct>         <fct>            
## 1 JB247027    2018-05-23 14:19:00 S PAULI~ FIRST DEGREE~ STREET           
## 2 JB314967    2018-06-21 15:01:00 S PAULI~ FIRST DEGREE~ STREET           
## 3 JB345261    2018-07-11 15:58:00 S PAULI~ FIRST DEGREE~ STREET           
## 4 JB345912    2018-07-11 23:30:00 S PAULI~ FIRST DEGREE~ YARD             
## 5 JB407497    2018-08-24 23:55:00 S PAULI~ FIRST DEGREE~ AUTO             
## 6 JB432708    2018-09-12 17:46:00 S PAULI~ FIRST DEGREE~ STREET           
## # ... with 8 more variables: Arrest <chr>, Domestic <chr>, Beat <int>,
## #   District <int>, Ward <int>, Latitude <dbl>, Longitude <dbl>,
## #   No_of_Homicides_on_Block <int>

By converting the Date field to a POSIXct variable, I was able to do a quick plot showing the number homicides per month this year so far.

months<-month(Chicago_Homicides_2018_01$Date, label = TRUE)
plot(months)

Project Two

John K. Hancock

September 30, 2018