This assignment simulates a scenario in which you take over someone’s existing work and continue it to draw further insights.
This is a real-world dataset from the New South Wales Bureau of Crime Statistics and Research. The data can be found at https://www.bocsar.nsw.gov.au/Documents/Datasets/SuburbData.zip, specifically the file called “SuburbData2019csv”. This file contains the data for this assignment.
You have just joined a consulting company as a data scientist. To give you some experience and guidance, you are performing a quick summary of the data while answering some questions that the chief business analytics leader has. This is not a formal report, but rather something you are giving to your manager that describes the data, and some interesting insights.
Please make sure you read the hints throughout the assignment to help guide you on the tasks.
The points for each of the elements in the assignment are marked next to the code for each question.
For this part of the assignment you will need to upload into Moodle:
- Your Rmd file
- Your html file
- A pdf copy of your html file.
After you have completed the first part of the assignment, you will be provided with the solutions on Friday 28 August and you will mark your own assignment before we mark it. This part of the task is worth 4% of your total grade. The rubric for marking the assignment is as follows:
Is the project reproducible? (Can the project be knitted?) yes –> (10pt), no –> (0 pt)
How many correct results did you have? (Number of points) 10pts
In the pdf output that you submitted in the first part of the assignment, highlight in green the points that are correct and in yellow those that are incorrect. You will need to submit the marked-up pdf.
What are the things that you can improve? 10pts You will submit a pdf document with your comments.
Remember, you can look up the help file for functions by typing ?function_name. For example, ?mean. Feel free to google questions you have about how to do other kinds of plots, and post on the ED if you have any questions about the assignment.
To complete the assignment you will need to fill in the blanks for function names, arguments, or other names. These sections are marked with ___. At a minimum, your assignment should be able to be “knitted” using the knit button for your Rmarkdown document.
If you want to look at what the assignment looks like in progress, but you do not have valid R code in all the R code chunks, remember that you can set the chunk options to eval = FALSE like so:
```{r this-chunk-will-not-run, eval = FALSE}
ggplot()
```
Once you have completed a question, you can add the option cache = TRUE to its R code chunk. This stores the results of that chunk in a folder, so the next time you knit the project it will run faster. This option is handy in this assignment because the data set is quite large and some of the R code chunks take time to run. Remember to add this option only when you are sure you have completed the R code chunk correctly.
If you use eval = FALSE or cache = TRUE, please make sure every chunk is set back to eval = TRUE when you submit the assignment, so that all your R code runs.
You will be completing this assignment individually.
This assignment is due by close of business (1pm) on Thursday 27th August. You will submit the assignment via Moodle. Please make sure you add your name in the YAML part of this Rmd file.
You work as a data scientist in the well named consulting company, “Consulting for You”.
It’s your second day at the company, and you’re taken to your desk. Your boss says to you:
We have a data set with the crime statistics in New South Wales for the past years!
We’ve got a meeting coming up soon to get insights about the crime in NSW and we want you to tell us about this dataset and what we can do with it.
You’re in with the new hires of data scientists here. We’d like you to take a look at the data and tell me what the spreadsheet tells us. I’ve written some questions on the report for you to answer.
Most importantly, can you get this to me by Thursday 27th August, 1pm?
Please read below and answer all the questions (make sure the file knits so that you can produce an html file to hand back to me):
library(tidyverse)
crime_dat <- read_csv("data/SuburbData2019.csv")
# I am selecting here only a portion of the data
# to reduce computation times.
crime_data <- crime_dat %>%
  select(-c(`Jan 1995`:`Jan 2010`)) %>%
  dplyr::filter(Suburb %in% c("Chifley",
                              "Redfern",
                              "Clare",
                              "Paddington",
                              "Zetland",
                              "Claymore",
                              "Congo",
                              "Yenda",
                              "Young",
                              "Yarra",
                              "Woodcroft",
                              "Woodhill",
                              "Warri",
                              "Waterloo",
                              "Randwick",
                              "Coogee"))
Hint: Check ?head in your R console
head(crime_data, 10) # 1pt
## # A tibble: 10 x 122
## Suburb `Offence catego… Subcategory `Feb 2010` `Mar 2010` `Apr 2010`
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 Chifl… Homicide Murder * 0 0 0
## 2 Chifl… Homicide Attempted … 0 0 0
## 3 Chifl… Homicide Murder acc… 0 0 0
## 4 Chifl… Homicide Manslaught… 0 0 0
## 5 Chifl… Assault Domestic v… 1 0 1
## 6 Chifl… Assault Non-domest… 2 0 0
## 7 Chifl… Assault Assault Po… 0 0 0
## 8 Chifl… Sexual offences Sexual ass… 1 0 0
## 9 Chifl… Sexual offences Indecent a… 0 0 0
## 10 Chifl… Abduction and k… Abduction … 0 0 0
## # … with 116 more variables: `May 2010` <dbl>, `Jun 2010` <dbl>, `Jul
## # 2010` <dbl>, `Aug 2010` <dbl>, `Sep 2010` <dbl>, `Oct 2010` <dbl>, `Nov
## # 2010` <dbl>, `Dec 2010` <dbl>, `Jan 2011` <dbl>, `Feb 2011` <dbl>, `Mar
## # 2011` <dbl>, `Apr 2011` <dbl>, `May 2011` <dbl>, `Jun 2011` <dbl>, `Jul
## # 2011` <dbl>, `Aug 2011` <dbl>, `Sep 2011` <dbl>, `Oct 2011` <dbl>, `Nov
## # 2011` <dbl>, `Dec 2011` <dbl>, `Jan 2012` <dbl>, `Feb 2012` <dbl>, `Mar
## # 2012` <dbl>, `Apr 2012` <dbl>, `May 2012` <dbl>, `Jun 2012` <dbl>, `Jul
## # 2012` <dbl>, `Aug 2012` <dbl>, `Sep 2012` <dbl>, `Oct 2012` <dbl>, `Nov
## # 2012` <dbl>, `Dec 2012` <dbl>, `Jan 2013` <dbl>, `Feb 2013` <dbl>, `Mar
## # 2013` <dbl>, `Apr 2013` <dbl>, `May 2013` <dbl>, `Jun 2013` <dbl>, `Jul
## # 2013` <dbl>, `Aug 2013` <dbl>, `Sep 2013` <dbl>, `Oct 2013` <dbl>, `Nov
## # 2013` <dbl>, `Dec 2013` <dbl>, `Jan 2014` <dbl>, `Feb 2014` <dbl>, `Mar
## # 2014` <dbl>, `Apr 2014` <dbl>, `May 2014` <dbl>, `Jun 2014` <dbl>, `Jul
## # 2014` <dbl>, `Aug 2014` <dbl>, `Sep 2014` <dbl>, `Oct 2014` <dbl>, `Nov
## # 2014` <dbl>, `Dec 2014` <dbl>, `Jan 2015` <dbl>, `Feb 2015` <dbl>, `Mar
## # 2015` <dbl>, `Apr 2015` <dbl>, `May 2015` <dbl>, `Jun 2015` <dbl>, `Jul
## # 2015` <dbl>, `Aug 2015` <dbl>, `Sep 2015` <dbl>, `Oct 2015` <dbl>, `Nov
## # 2015` <dbl>, `Dec 2015` <dbl>, `Jan 2016` <dbl>, `Feb 2016` <dbl>, `Mar
## # 2016` <dbl>, `Apr 2016` <dbl>, `May 2016` <dbl>, `Jun 2016` <dbl>, `Jul
## # 2016` <dbl>, `Aug 2016` <dbl>, `Sep 2016` <dbl>, `Oct 2016` <dbl>, `Nov
## # 2016` <dbl>, `Dec 2016` <dbl>, `Jan 2017` <dbl>, `Feb 2017` <dbl>, `Mar
## # 2017` <dbl>, `Apr 2017` <dbl>, `May 2017` <dbl>, `Jun 2017` <dbl>, `Jul
## # 2017` <dbl>, `Aug 2017` <dbl>, `Sep 2017` <dbl>, `Oct 2017` <dbl>, `Nov
## # 2017` <dbl>, `Dec 2017` <dbl>, `Jan 2018` <dbl>, `Feb 2018` <dbl>, `Mar
## # 2018` <dbl>, `Apr 2018` <dbl>, `May 2018` <dbl>, `Jun 2018` <dbl>, `Jul
## # 2018` <dbl>, `Aug 2018` <dbl>, …
Hint: Look up ?dim in your R console and remember that variables are in columns and observations are in rows. dim tells you the number of rows and the number of columns in the data set (in that order).
dim(crime_data) # 1pt
## [1] 992 122
We have 122 variables and 992 observations.
The number of variables is dim(crime_data)[2] (1pt) and the number of rows is dim(crime_data)[1] (1pt)
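As a small aside, `nrow()` and `ncol()` return the same two numbers by name, which avoids having to remember which position of `dim()` is which. The sketch below uses a made-up three-row data frame (`toy` is hypothetical and not part of the assignment data).

```r
# Hypothetical toy data standing in for crime_data
toy <- data.frame(
  Suburb = c("A", "B", "C"),
  `Feb 2010` = c(0, 1, 2),
  `Mar 2010` = c(3, 0, 1),
  check.names = FALSE  # keep the column names with spaces as they are
)

dim(toy)   # rows first, then columns
nrow(toy)  # number of observations (rows)
ncol(toy)  # number of variables (columns)
```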
names(crime_data)[1:20] #1pt
## [1] "Suburb" "Offence category" "Subcategory" "Feb 2010"
## [5] "Mar 2010" "Apr 2010" "May 2010" "Jun 2010"
## [9] "Jul 2010" "Aug 2010" "Sep 2010" "Oct 2010"
## [13] "Nov 2010" "Dec 2010" "Jan 2011" "Feb 2011"
## [17] "Mar 2011" "Apr 2011" "May 2011" "Jun 2011"
crime <- crime_data %>%
rename(`Offence_category` = `Offence category`) # 1pt
names(crime)[1:4] #1pt
## [1] "Suburb" "Offence_category" "Subcategory" "Feb 2010"
crime_long <- crime %>%
pivot_longer(cols = 4:122, # 2pt
names_to = "year", # 1pt
values_to = "Incidents") # 1pt
head(crime_long) # 1pt
## # A tibble: 6 x 5
## Suburb Offence_category Subcategory year Incidents
## <chr> <chr> <chr> <chr> <dbl>
## 1 Chifley Homicide Murder * Feb 2010 0
## 2 Chifley Homicide Murder * Mar 2010 0
## 3 Chifley Homicide Murder * Apr 2010 0
## 4 Chifley Homicide Murder * May 2010 0
## 5 Chifley Homicide Murder * Jun 2010 0
## 6 Chifley Homicide Murder * Jul 2010 0
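The reshaping step can be checked on a small example. This is a minimal sketch on a made-up two-suburb table (`toy_wide` and its values are hypothetical). It selects the month columns by exclusion rather than by position, which is one way to avoid hard-coding `4:122` if the number of month columns ever changes.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table: one column per month
toy_wide <- tibble(
  Suburb = c("A", "B"),
  Offence_category = c("Theft", "Theft"),
  Subcategory = c("Fraud", "Fraud"),
  `Feb 2010` = c(1, 0),
  `Mar 2010` = c(2, 5)
)

# Every column except the three identifier columns is stacked into two
# new columns: the old column name ("year") and its value ("Incidents")
toy_long <- toy_wide %>%
  pivot_longer(cols = -c(Suburb, Offence_category, Subcategory),
               names_to = "year",
               values_to = "Incidents")

nrow(toy_long)  # 2 rows x 2 month columns = 4 rows
```

The same arithmetic applies to the real data: each of the 992 rows is repeated once for each of the 119 month columns.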
crime_long_new <- crime_long %>%
  separate(col = year, # 1pt
           into = c("Month", "Year"), sep = " ") # 2pt
head(crime_long_new, 3) # 1pt
## # A tibble: 3 x 6
## Suburb Offence_category Subcategory Month Year Incidents
## <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 Chifley Homicide Murder * Feb 2010 0
## 2 Chifley Homicide Murder * Mar 2010 0
## 3 Chifley Homicide Murder * Apr 2010 0
crime_long_new <- crime_long_new %>%
mutate(Year = as.numeric(Year)) # 1pt
head(crime_long_new) # 1pt
## # A tibble: 6 x 6
## Suburb Offence_category Subcategory Month Year Incidents
## <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 Chifley Homicide Murder * Feb 2010 0
## 2 Chifley Homicide Murder * Mar 2010 0
## 3 Chifley Homicide Murder * Apr 2010 0
## 4 Chifley Homicide Murder * May 2010 0
## 5 Chifley Homicide Murder * Jun 2010 0
## 6 Chifley Homicide Murder * Jul 2010 0
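The split and the type conversion above can also be done in one step. A minimal sketch, assuming made-up data (`toy_long` here is hypothetical): `separate()` has a `convert` argument that runs type conversion on the new columns, so `Year` comes out as a number (an integer rather than a double) without a separate `mutate()`.

```r
library(dplyr)
library(tidyr)

# Hypothetical long table with a combined "Month Year" column
toy_long <- tibble(
  year = c("Feb 2010", "Mar 2011"),
  Incidents = c(1, 2)
)

toy_split <- toy_long %>%
  separate(col = year,
           into = c("Month", "Year"),
           sep = " ",
           convert = TRUE)  # convert the new columns to a suitable type

class(toy_split$Month)  # stays character
class(toy_split$Year)   # becomes integer
```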
Remember that you can learn more about what these functions do by typing ?unique or ?length into the console.
unique(crime_long_new$Year) # 1pt
## [1] 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019
# length() tells us the number of elements in a vector
length(unique(crime_long_new$Year)) #1pt
## [1] 10
Remember that you can learn more about what these functions do by typing ?unique, ?n_distinct or ?length into the console.
length(unique(crime_long_new$Suburb)) # 1pt
## [1] 16
n_distinct(crime_long_new$Suburb) # 1pt
## [1] 16
crime_long_new %>%
  dplyr::filter(Year == 2019) %>% # 1pt
  count(Offence_category) # 1pt
## # A tibble: 21 x 2
## Offence_category n
## <chr> <int>
## 1 Abduction and kidnapping 192
## 2 Against justice procedures 1152
## 3 Arson 192
## 4 Assault 576
## 5 Betting and gaming offences 192
## 6 Blackmail and extortion 192
## 7 Disorderly conduct 768
## 8 Drug offences 3072
## 9 Homicide 768
## 10 Intimidation, stalking and harassment 192
## # … with 11 more rows
crime_long_new %>%
  dplyr::filter(Year == 2019) %>% # 1pt
  count(Offence_category, sort = TRUE) # 1pt
## # A tibble: 21 x 2
## Offence_category n
## <chr> <int>
## 1 Drug offences 3072
## 2 Theft 2112
## 3 Against justice procedures 1152
## 4 Disorderly conduct 768
## 5 Homicide 768
## 6 Assault 576
## 7 Robbery 576
## 8 Sexual offences 384
## 9 Abduction and kidnapping 192
## 10 Arson 192
## # … with 11 more rows
It is “Drug offences”: with 3072 rows, it is the offence category that appears most often in the 2019 data.
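Note that `count()` tallies rows, so the table above ranks categories by how many rows they have in 2019 rather than by recorded incidents. If the total number of incidents per category is of interest instead, one option is the `wt` argument of `count()`, sketched here on made-up data (`toy` is hypothetical) where the two rankings differ.

```r
library(dplyr)

# Hypothetical data: "Drug offences" has more rows, "Theft" more incidents
toy <- tibble(
  Offence_category = c("Drug offences", "Drug offences", "Theft"),
  Incidents = c(0, 1, 10)
)

# Number of rows per category
rows_per_category <- toy %>%
  count(Offence_category, sort = TRUE)

# Sum of Incidents per category (n is now a weighted total)
incidents_per_category <- toy %>%
  count(Offence_category, wt = Incidents, sort = TRUE)

rows_per_category
incidents_per_category
```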
crime_long_new %>%
dplyr::filter(Offence_category == "Homicide") %>% # 1pt
group_by(Subcategory) %>% # 1pt
summarise(Number_of_incidents = sum(Incidents)) # 1pt
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 4 x 2
## Subcategory Number_of_incidents
## <chr> <dbl>
## 1 Attempted murder 3
## 2 Manslaughter * 1
## 3 Murder * 14
## 4 Murder accessory, conspiracy 1
Paddington <- crime_long_new %>%
dplyr::filter( Suburb== "Paddington", # 2pt
Offence_category == "Drug offences") %>% #1pt
group_by(Subcategory) %>% # 1pt
summarise(Number_of_incidents = sum(Incidents)) %>% # 1pt
arrange(-Number_of_incidents) # 1pt
## `summarise()` ungrouping output (override with `.groups` argument)
head(Paddington) # 1pt
## # A tibble: 6 x 2
## Subcategory Number_of_incidents
## <chr> <dbl>
## 1 Possession and/or use of cannabis 154
## 2 Possession and/or use of cocaine 111
## 3 Possession and/or use of other drugs 82
## 4 Other drug offences 73
## 5 Dealing, trafficking in cocaine 68
## 6 Possession and/or use of amphetamines 57
To answer that question, we first filter by suburb and subcategory, then group the incidents by year, and finally sum the number of incidents per year.
Paddington_cannabis <- crime_long_new %>%
dplyr::filter( Suburb == 'Paddington', # 1pt
Subcategory == 'Possession and/or use of cannabis') %>% # 1pt
group_by(Year) %>% # 1pt
summarise(Number_of_incidents = sum(Incidents)) %>% #1pt
mutate(Year = as.numeric(Year)) # 1pt
## `summarise()` ungrouping output (override with `.groups` argument)
head(Paddington_cannabis,3) # 1pt
## # A tibble: 3 x 2
## Year Number_of_incidents
## <dbl> <dbl>
## 1 2010 17
## 2 2011 17
## 3 2012 15
On the x axis you should have the year and on the y axis you should display the number of incidents.
ggplot(Paddington_cannabis, aes(x = Year, y = Number_of_incidents)) + # 2pt
geom_line() # 1pt
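Because `Year` is numeric, the default axis breaks may fall on non-integer values. A minimal sketch, assuming made-up yearly totals (`toy` and its values are hypothetical), of one way to place a tick on every year with `scale_x_continuous()`:

```r
library(ggplot2)

# Hypothetical yearly totals, for illustration only
toy <- data.frame(
  Year = 2010:2019,
  Number_of_incidents = c(4, 6, 5, 7, 3, 8, 6, 5, 9, 7)
)

p <- ggplot(toy, aes(x = Year, y = Number_of_incidents)) +
  geom_line() +
  scale_x_continuous(breaks = 2010:2019)  # one tick per year

p
```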
both_cannabis <- crime_long_new %>%
dplyr::filter(Suburb %in% c('Paddington', 'Randwick'), # 1pt
Subcategory == 'Possession and/or use of cannabis') %>% # 1pt
group_by(Year, Suburb) %>% # 1pt
summarise(Number_of_incidents = sum(Incidents)) %>% # 1pt
mutate(Year = as.numeric(Year), # 1pt
Suburb = as.factor(Suburb)) # 1pt
## `summarise()` regrouping output by 'Year' (override with `.groups` argument)
ggplot(both_cannabis, aes( x = Year, # 1pt
y = Number_of_incidents, # 1pt
color = Suburb)) + # 1pt
geom_line() # 1pt
crime_long_new %>%
dplyr::select(Incidents, # 1pt
Year) %>% # 1pt
group_by(Year) %>% # 1pt
summarise(Number_of_incidents = sum(Incidents)) %>% # 1pt
mutate(Year = as.numeric(Year)) %>% # 1pt
ggplot(aes(x = Year, y = Number_of_incidents )) + # 1pt
geom_line() # 1pt
## `summarise()` ungrouping output (override with `.groups` argument)
crime_long_new %>%
dplyr::select(Incidents, # 1pt
Year) %>% # 1pt
group_by(Year) %>% # 1pt
summarise(Number_of_incidents = sum(Incidents)) %>% # 1pt
mutate(Year = as.numeric(Year)) %>% # 1pt
ggplot(aes(x =Year, y = Number_of_incidents)) + # 1pt
geom_line() + # 1pt
theme_bw() # 1pt
## `summarise()` ungrouping output (override with `.groups` argument)
crime_long_new %>%
dplyr::select(Incidents, # 1pt
Year) %>% # 1pt
group_by(Year) %>% # 1pt
summarise(Number_of_incidents = sum(Incidents)) %>% # 1pt
mutate(Year = as.numeric(Year)) %>% # 1pt
ggplot(aes(x = Year, y = Number_of_incidents)) + # 1pt
geom_line(linetype = 'dotted', color = 'green') # 1pt
## `summarise()` ungrouping output (override with `.groups` argument)
comparison_data<- crime_long_new %>%
dplyr::select(Suburb, # 1pt
Year, # 1pt
Incidents) %>% # 1pt
dplyr::filter( Suburb %in% c('Redfern', 'Coogee', 'Zetland')) %>% # 1pt
group_by(Year, Suburb) %>% # 1pt
summarise(Number_of_incidents = sum(Incidents)) # 1pt
## `summarise()` regrouping output by 'Year' (override with `.groups` argument)
ggplot(comparison_data, aes(x = Year, # 1pt
y = Number_of_incidents, # 1pt
fill = Suburb)) + # 1pt
geom_bar(stat = "identity", # 1pt
position = "dodge") + # 1pt
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) # 1pt
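For bars whose heights are already computed, `geom_col()` is a shorthand for `geom_bar(stat = "identity")`. The sketch below, on made-up data (`toy` is hypothetical), builds the same dodged bar chart both ways.

```r
library(ggplot2)

# Hypothetical yearly totals for two suburbs
toy <- data.frame(
  Year = rep(c(2010, 2011), each = 2),
  Suburb = rep(c("A", "B"), times = 2),
  Number_of_incidents = c(5, 3, 6, 2)
)

# Version used in this assignment
p_bar <- ggplot(toy, aes(x = Year, y = Number_of_incidents, fill = Suburb)) +
  geom_bar(stat = "identity", position = "dodge")

# Shorter equivalent: geom_col() uses the y values as bar heights
p_col <- ggplot(toy, aes(x = Year, y = Number_of_incidents, fill = Suburb)) +
  geom_col(position = "dodge")
```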
ggplot(comparison_data, aes(x = Year, # 1pt
y = Number_of_incidents, # 1pt
fill = Suburb)) + # 1pt
geom_bar(stat = "identity", # 1pt
position = "dodge") + # 1pt
theme_bw() + # 1pt
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + # 1pt
xlab("Years") + # 1pt
ylab("Incidents") # 1pt
ggplot(comparison_data, aes(x = Year, # 1pt
y = Number_of_incidents, # 1pt
fill = Suburb)) + # 1pt
geom_bar(stat = "identity", # 1pt
position = "dodge") + # 1pt
theme_bw() + # 1pt
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + # 1pt
xlab("Years") + # 1pt
ylab("Incidents") + # 1pt
ggtitle("Number of criminal incidents") # 1pt
ggplot(comparison_data, aes(x = Year, # 1pt
                            y = Number_of_incidents, # 1pt
                            color = Suburb)) + # 1pt
  geom_line() + # 1pt
  facet_wrap(~Suburb) + # 1pt
  theme_bw() + # 1pt
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) # 1pt
comparison_data %>%
pivot_wider(id_cols = Year, # 1pt
names_from = Suburb, # 1pt
values_from = Number_of_incidents) # 1pt
## # A tibble: 10 x 4
## # Groups: Year [10]
## Year Coogee Redfern Zetland
## <dbl> <dbl> <dbl> <dbl>
## 1 2010 897 3225 197
## 2 2011 1189 3822 318
## 3 2012 877 3959 380
## 4 2013 885 4440 312
## 5 2014 762 4400 359
## 6 2015 912 4674 562
## 7 2016 1016 5623 493
## 8 2017 1011 4411 526
## 9 2018 1013 4102 572
## 10 2019 1119 4052 621
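The `# Groups: Year [10]` header above appears because `summarise()` drops only the last grouping variable by default, so `comparison_data` is still grouped by `Year`. A minimal sketch, on made-up data (`toy` is hypothetical), of how the `.groups` argument named in the `summarise()` message can return an ungrouped result instead:

```r
library(dplyr)

# Hypothetical incidents by year and suburb
toy <- tibble(
  Year = c(2010, 2010, 2011, 2011),
  Suburb = c("A", "B", "A", "B"),
  Incidents = c(1, 2, 3, 4)
)

# Default behaviour made explicit: only the last group (Suburb) is dropped
still_grouped <- toy %>%
  group_by(Year, Suburb) %>%
  summarise(Number_of_incidents = sum(Incidents), .groups = "drop_last")

# All grouping removed
ungrouped <- toy %>%
  group_by(Year, Suburb) %>%
  summarise(Number_of_incidents = sum(Incidents), .groups = "drop")

group_vars(still_grouped)  # "Year"
group_vars(ungrouped)      # character(0)
```

Setting `.groups` explicitly also silences the message, which keeps the knitted output a little cleaner.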