Instructions to Students

This assignment is designed to simulate a scenario where you are taking over someone’s existing work, and continuing with it to draw some further insights.

This is a real-world dataset taken from the New South Wales Bureau of Crime Statistics and Research. The data can be found at https://www.bocsar.nsw.gov.au/Documents/Datasets/SuburbData.zip, specifically the file called “SuburbData2019.csv”. This file contains the data for this assignment.

You have just joined a consulting company as a data scientist. To give you some experience and guidance, you are performing a quick summary of the data while answering some questions that the chief business analytics leader has. This is not a formal report, but rather something you are giving to your manager that describes the data, and some interesting insights.

Please make sure you read the hints throughout the assignment to help guide you on the tasks.

The points for each of the elements in the assignment are marked next to the code for each question.

Marking + Grades

  • This part of the assignment will be worth 6% of your total grade, and is marked out of 116 marks total. Due on: Thursday 27 August.

For this part of the assignment you will need to upload into Moodle:

  - Your Rmd file
  - Your html file
  - And a pdf copy of your html file.
  
  • After you have completed the first part of the assignment, on Friday 28 August you will be provided with the solutions and you will mark your own assignment before we mark it. This part of the task will be worth 4% of your total grade. The rubric to mark the assignment is the following:

    • Is the project reproducible? (Can the project be knitted?) yes –> (10pt), no –> (0 pt)

    • How many correct results you had? (Number of points) 10pts

      In the pdf output that you submitted in the first part of the assignment, you will highlight in green the points that are correct and in yellow those that are incorrect. You will need to submit the marked-up pdf.

    • What are the things that you can improve? 10pts You will submit a pdf document with the comments.

How to find help from R functions?

Remember, you can look up the help file for functions by typing ?function_name. For example, ?mean. Feel free to google questions you have about how to do other kinds of plots, and post on the ED if you have any questions about the assignment.

How to complete this assignment.

To complete the assignment you will need to fill in the blanks for function names, arguments, or other names. These sections are marked with ___. At a minimum, your assignment should be able to be “knitted” using the knit button for your Rmarkdown document.

If you want to look at what the assignment looks like in progress, but you do not have valid R code in all the R code chunks, remember that you can set the chunk options to eval = FALSE like so:

```{r this-chunk-will-not-run, eval = FALSE}
ggplot()
```

Once you have completed each question, you can add the option cache = TRUE to your R code chunk. This stores the computations of that chunk in a folder so that the project runs faster the next time. This option is handy in this assignment because the data set is quite large and some of the R code chunks take time to run. Remember to add this option only when you are sure you have completed the R code chunk correctly.

If you use eval = FALSE or cache = TRUE, please remember to set eval = TRUE when you submit the assignment, to ensure all your R code runs.
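
One way to avoid editing every chunk before submission is to set the defaults once in a setup chunk near the top of the Rmd file. The sketch below assumes knitr's `opts_chunk$set()`; the particular defaults shown are an illustration, not a requirement of the assignment.

```r
# Place inside a setup chunk at the top of the Rmd file.
# Chunk-level options such as eval = FALSE still override these defaults.
knitr::opts_chunk$set(
  eval  = TRUE,   # run every chunk unless a chunk says otherwise
  cache = FALSE   # switch to TRUE only once the chunks are known to be correct
)
```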

You will be completing this assignment individually.

Due Date

This assignment is due by close of business (1pm) on Thursday 27th August. You will submit the assignment via Moodle. Please make sure you add your name in the YAML part of this Rmd file.

Treatment

You work as a data scientist in the well named consulting company, “Consulting for You”.

It’s your second day at the company, and you’re taken to your desk. Your boss says to you:

We have a data set with the crime statistics in New South Wales for the past years!

We’ve got a meeting coming up soon to get insights about the crime in NSW and we want you to tell us about this dataset and what we can do with it.

You’re in with the new hires of data scientists here. We’d like you to take a look at the data and tell me what the spreadsheet tells us. I’ve written some questions on the report for you to answer.

Most importantly, can you get this to me by Thursday 27th August, 1pm?

Please read below and answer all the questions (make sure the file knits so that you can produce an html file to hand back to me):

Load all the libraries that you need here

library(tidyverse)

Reading and preparing data

crime_dat <- read_csv("data/SuburbData2019.csv")
# I am selecting here only a portion of the data
# to reduce computation times.
crime_data <- crime_dat %>%
  select(-c(`Jan 1995`:`Jan 2010`)) %>%
  dplyr::filter(Suburb %in% c("Chifley",
                              "Redfern",
                              "Clare",
                              "Paddington",
                              "Zetland",
                              "Claymore",
                              "Congo",
                              "Yenda", 
                              "Young",
                              "Yarra",
                              "Woodcroft",
                              "Woodhill",
                              "Warri",
                              "Waterloo",
                              "Randwick",
                              "Coogee"))

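The filtering step relies on `%in%`, which tests whether each value of Suburb appears in a vector of names. The following is a minimal sketch on hypothetical toy data (the `toy`, `keep` and `toy_small` names are illustrative and not part of the assignment); it also shows that listing a name twice in the vector has no extra effect.

```r
library(dplyr)

# Hypothetical toy data standing in for the BOCSAR file.
toy <- tibble(
  Suburb    = c("Redfern", "Coogee", "Bondi", "Redfern"),
  Incidents = c(5, 2, 7, 1)
)

# %in% tests membership, so a name listed twice changes nothing.
keep <- c("Redfern", "Coogee", "Redfern")

toy_small <- toy %>%
  filter(Suburb %in% keep)

toy_small
```
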
Question 1: Display the first 10 rows of the data set

Hint: Check ?head in your R console

head(crime_data, 10)   # 1pt
## # A tibble: 10 x 122
##    Suburb `Offence catego… Subcategory `Feb 2010` `Mar 2010` `Apr 2010`
##    <chr>  <chr>            <chr>            <dbl>      <dbl>      <dbl>
##  1 Chifl… Homicide         Murder *             0          0          0
##  2 Chifl… Homicide         Attempted …          0          0          0
##  3 Chifl… Homicide         Murder acc…          0          0          0
##  4 Chifl… Homicide         Manslaught…          0          0          0
##  5 Chifl… Assault          Domestic v…          1          0          1
##  6 Chifl… Assault          Non-domest…          2          0          0
##  7 Chifl… Assault          Assault Po…          0          0          0
##  8 Chifl… Sexual offences  Sexual ass…          1          0          0
##  9 Chifl… Sexual offences  Indecent a…          0          0          0
## 10 Chifl… Abduction and k… Abduction …          0          0          0
## # … with 116 more variables: `May 2010` <dbl>, `Jun 2010` <dbl>, `Jul
## #   2010` <dbl>, `Aug 2010` <dbl>, `Sep 2010` <dbl>, `Oct 2010` <dbl>, `Nov
## #   2010` <dbl>, `Dec 2010` <dbl>, `Jan 2011` <dbl>, `Feb 2011` <dbl>, `Mar
## #   2011` <dbl>, `Apr 2011` <dbl>, `May 2011` <dbl>, `Jun 2011` <dbl>, `Jul
## #   2011` <dbl>, `Aug 2011` <dbl>, `Sep 2011` <dbl>, `Oct 2011` <dbl>, `Nov
## #   2011` <dbl>, `Dec 2011` <dbl>, `Jan 2012` <dbl>, `Feb 2012` <dbl>, `Mar
## #   2012` <dbl>, `Apr 2012` <dbl>, `May 2012` <dbl>, `Jun 2012` <dbl>, `Jul
## #   2012` <dbl>, `Aug 2012` <dbl>, `Sep 2012` <dbl>, `Oct 2012` <dbl>, `Nov
## #   2012` <dbl>, `Dec 2012` <dbl>, `Jan 2013` <dbl>, `Feb 2013` <dbl>, `Mar
## #   2013` <dbl>, `Apr 2013` <dbl>, `May 2013` <dbl>, `Jun 2013` <dbl>, `Jul
## #   2013` <dbl>, `Aug 2013` <dbl>, `Sep 2013` <dbl>, `Oct 2013` <dbl>, `Nov
## #   2013` <dbl>, `Dec 2013` <dbl>, `Jan 2014` <dbl>, `Feb 2014` <dbl>, `Mar
## #   2014` <dbl>, `Apr 2014` <dbl>, `May 2014` <dbl>, `Jun 2014` <dbl>, `Jul
## #   2014` <dbl>, `Aug 2014` <dbl>, `Sep 2014` <dbl>, `Oct 2014` <dbl>, `Nov
## #   2014` <dbl>, `Dec 2014` <dbl>, `Jan 2015` <dbl>, `Feb 2015` <dbl>, `Mar
## #   2015` <dbl>, `Apr 2015` <dbl>, `May 2015` <dbl>, `Jun 2015` <dbl>, `Jul
## #   2015` <dbl>, `Aug 2015` <dbl>, `Sep 2015` <dbl>, `Oct 2015` <dbl>, `Nov
## #   2015` <dbl>, `Dec 2015` <dbl>, `Jan 2016` <dbl>, `Feb 2016` <dbl>, `Mar
## #   2016` <dbl>, `Apr 2016` <dbl>, `May 2016` <dbl>, `Jun 2016` <dbl>, `Jul
## #   2016` <dbl>, `Aug 2016` <dbl>, `Sep 2016` <dbl>, `Oct 2016` <dbl>, `Nov
## #   2016` <dbl>, `Dec 2016` <dbl>, `Jan 2017` <dbl>, `Feb 2017` <dbl>, `Mar
## #   2017` <dbl>, `Apr 2017` <dbl>, `May 2017` <dbl>, `Jun 2017` <dbl>, `Jul
## #   2017` <dbl>, `Aug 2017` <dbl>, `Sep 2017` <dbl>, `Oct 2017` <dbl>, `Nov
## #   2017` <dbl>, `Dec 2017` <dbl>, `Jan 2018` <dbl>, `Feb 2018` <dbl>, `Mar
## #   2018` <dbl>, `Apr 2018` <dbl>, `May 2018` <dbl>, `Jun 2018` <dbl>, `Jul
## #   2018` <dbl>, `Aug 2018` <dbl>, …

Question 2: How many variables and observations do we have?

Hint: Look for help ?dim in your R console and remember that variables are in columns and observations in rows. dim tells you the number of rows and the number of columns in the data set (in that order).

dim(crime_data)   # 1pt
## [1] 992 122

We have 122 variables and 992 observations.

The number of variables is dim(crime_data)[2] (1pt) and the number of rows is dim(crime_data)[1] (1pt).
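
As a small illustration of the order in which dim reports its values, the sketch below uses a hypothetical three-row data frame (`toy` is an invented name) and compares dim with nrow and ncol.

```r
# Hypothetical toy data frame: 3 observations (rows), 2 variables (columns).
toy <- data.frame(
  Suburb    = c("Coogee", "Redfern", "Zetland"),
  Incidents = c(2, 5, 1)
)

dim(toy)    # rows first, then columns
nrow(toy)   # same value as dim(toy)[1]
ncol(toy)   # same value as dim(toy)[2]
```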

Question 3: What are the names of the first 20 variables in this data?

names(crime_data)[1:20]   #1pt
##  [1] "Suburb"           "Offence category" "Subcategory"      "Feb 2010"        
##  [5] "Mar 2010"         "Apr 2010"         "May 2010"         "Jun 2010"        
##  [9] "Jul 2010"         "Aug 2010"         "Sep 2010"         "Oct 2010"        
## [13] "Nov 2010"         "Dec 2010"         "Jan 2011"         "Feb 2011"        
## [17] "Mar 2011"         "Apr 2011"         "May 2011"         "Jun 2011"

Question 4: Rename the variable “Offence category” to “Offence_category” and show the names of the first four variables

crime <- crime_data %>%
  rename(`Offence_category` = `Offence category`)  # 1pt

names(crime)[1:4] #1pt
## [1] "Suburb"           "Offence_category" "Subcategory"      "Feb 2010"

Question 5: Change the “crime” data into long format so that all the years are grouped together into a variable called “year”

crime_long <- crime %>%
  pivot_longer(cols = 4:122,  # 2pt
               names_to = "year",             # 1pt
               values_to = "Incidents")       # 1pt

head(crime_long)     # 1pt
## # A tibble: 6 x 5
##   Suburb  Offence_category Subcategory year     Incidents
##   <chr>   <chr>            <chr>       <chr>        <dbl>
## 1 Chifley Homicide         Murder *    Feb 2010         0
## 2 Chifley Homicide         Murder *    Mar 2010         0
## 3 Chifley Homicide         Murder *    Apr 2010         0
## 4 Chifley Homicide         Murder *    May 2010         0
## 5 Chifley Homicide         Murder *    Jun 2010         0
## 6 Chifley Homicide         Murder *    Jul 2010         0
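
The column positions 4:122 depend on the exact layout of the file. As an alternative sketch, assuming the three identifier columns keep their names, the month columns can be selected as "everything except the identifiers". The toy data below is hypothetical and only mimics the wide layout.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data in the same wide layout as the crime data.
toy_wide <- tibble(
  Suburb           = c("Coogee", "Redfern"),
  Offence_category = c("Theft", "Theft"),
  Subcategory      = c("Fraud", "Fraud"),
  `Feb 2010`       = c(1, 4),
  `Mar 2010`       = c(0, 2)
)

# Pivot every column except the three identifier columns.
toy_long <- toy_wide %>%
  pivot_longer(cols = -c(Suburb, Offence_category, Subcategory),
               names_to = "year",
               values_to = "Incidents")

toy_long
```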

Question 6: Separate the column year into two columns “Month” and “Year”. Display only 3 lines to show the updated data

crime_long_new <- crime_long %>%
  separate(col = year,                        # 1pt
           into = c("Month", "Year"), sep = " ")   # 2pt

head(crime_long_new, 3)                        # 1pt
## # A tibble: 3 x 6
##   Suburb  Offence_category Subcategory Month Year  Incidents
##   <chr>   <chr>            <chr>       <chr> <chr>     <dbl>
## 1 Chifley Homicide         Murder *    Feb   2010          0
## 2 Chifley Homicide         Murder *    Mar   2010          0
## 3 Chifley Homicide         Murder *    Apr   2010          0

Question 7: If you look at the data crime_long_new you will observe that the variable Year is coded as character. In this section we are going to change the variable Year to be a numeric variable.

crime_long_new <- crime_long_new %>%
  mutate(Year = as.numeric(Year))   # 1pt

head(crime_long_new)                # 1pt
## # A tibble: 6 x 6
##   Suburb  Offence_category Subcategory Month  Year Incidents
##   <chr>   <chr>            <chr>       <chr> <dbl>     <dbl>
## 1 Chifley Homicide         Murder *    Feb    2010         0
## 2 Chifley Homicide         Murder *    Mar    2010         0
## 3 Chifley Homicide         Murder *    Apr    2010         0
## 4 Chifley Homicide         Murder *    May    2010         0
## 5 Chifley Homicide         Murder *    Jun    2010         0
## 6 Chifley Homicide         Murder *    Jul    2010         0
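
The splitting and the type conversion can also be done in one step. The sketch below assumes tidyr's `separate()` with its `convert = TRUE` argument, which turns the new Year column into an integer while leaving Month as character. The toy data is hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with a combined "Month Year" column.
toy_long <- tibble(
  year      = c("Feb 2010", "Mar 2011"),
  Incidents = c(1, 0)
)

toy_split <- toy_long %>%
  separate(col = year,
           into = c("Month", "Year"),
           sep = " ",          # split on the space between month and year
           convert = TRUE)     # convert Year from character to integer

toy_split
```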

Question 8: Display the years in the data set. How many years of data are there?

Remember that you can learn more about what these functions do by typing ?unique or ?length into the console.

unique(crime_long_new$Year)    # 1pt
##  [1] 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019
# length tells us the number of elements in a vector
length(unique(crime_long_new$Year))   #1pt
## [1] 10

Question 9: How many different suburbs are there in the data?

Remember that you can learn more about what these functions do by typing ?unique, ?n_distinct or ?length into the console.

length(unique(crime_long_new$Suburb))     # 1pt
## [1] 16
n_distinct(crime_long_new$Suburb) # 1pt 
## [1] 16

Question 10: How many incidents do we have per “Offence_category” for 2019 in total?

crime_long_new %>% 
  dplyr::filter(Year == 2019) %>%      # 1pt
  count(Offence_category)              # 1pt
## # A tibble: 21 x 2
##    Offence_category                          n
##    <chr>                                 <int>
##  1 Abduction and kidnapping                192
##  2 Against justice procedures             1152
##  3 Arson                                   192
##  4 Assault                                 576
##  5 Betting and gaming offences             192
##  6 Blackmail and extortion                 192
##  7 Disorderly conduct                      768
##  8 Drug offences                          3072
##  9 Homicide                                768
## 10 Intimidation, stalking and harassment   192
## # … with 11 more rows

Question 11: Which “Offence_category” has the most incidents in 2019?

crime_long_new %>% 
  dplyr::filter(Year == 2019) %>%      # 1pt
  count(Offence_category, sort = TRUE) # 1pt
## # A tibble: 21 x 2
##    Offence_category               n
##    <chr>                      <int>
##  1 Drug offences               3072
##  2 Theft                       2112
##  3 Against justice procedures  1152
##  4 Disorderly conduct           768
##  5 Homicide                     768
##  6 Assault                      576
##  7 Robbery                      576
##  8 Sexual offences              384
##  9 Abduction and kidnapping     192
## 10 Arson                        192
## # … with 11 more rows

It is “Drug offences”
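
Note that `count(Offence_category)` tallies rows, so n reflects how many suburb, subcategory and month combinations fall in each category rather than the values stored in Incidents. If the total of the Incidents column is wanted instead, `count()` accepts a `wt` argument. The following is a sketch on hypothetical toy data, not a replacement for the marked answer.

```r
library(dplyr)

# Hypothetical toy data: two Theft rows and one Arson row.
toy <- tibble(
  Offence_category = c("Theft", "Theft", "Arson"),
  Incidents        = c(10, 5, 2)
)

# Number of rows per category.
rows_per_category <- toy %>%
  count(Offence_category)

# Sum of Incidents per category, largest first.
incidents_per_category <- toy %>%
  count(Offence_category, wt = Incidents, sort = TRUE)

rows_per_category
incidents_per_category
```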

Question 12: How many offences are there in each Subcategory of the Offence_category Homicide?

crime_long_new %>% 
  dplyr::filter(Offence_category == "Homicide") %>%    # 1pt
  group_by(Subcategory) %>%                            # 1pt
  summarise(Number_of_incidents = sum(Incidents))      # 1pt
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 4 x 2
##   Subcategory                  Number_of_incidents
##   <chr>                                      <dbl>
## 1 Attempted murder                               3
## 2 Manslaughter *                                 1
## 3 Murder *                                      14
## 4 Murder accessory, conspiracy                   1
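
The message printed by summarise() refers to its `.groups` argument. As a minimal sketch on hypothetical toy data (the subcategory labels are illustrative), setting `.groups = "drop"` states the intended grouping explicitly and returns an ungrouped tibble.

```r
library(dplyr)

# Hypothetical toy data with two subcategories.
toy <- tibble(
  Subcategory = c("Murder", "Murder", "Manslaughter"),
  Incidents   = c(1, 2, 1)
)

toy_totals <- toy %>%
  group_by(Subcategory) %>%
  summarise(Number_of_incidents = sum(Incidents),
            .groups = "drop")   # return an ungrouped result, no message

toy_totals
```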

Question 13: Select the Suburb called “Paddington”, keep the Offence_category “Drug offences”, then calculate the total number of incidents for each Subcategory. Finally, show a table arranged by the number of incidents (from highest to lowest).

Paddington <- crime_long_new %>% 
  dplyr::filter(Suburb == "Paddington",                  # 2pt
                Offence_category == "Drug offences") %>% # 1pt
  group_by(Subcategory) %>%                              # 1pt
  summarise(Number_of_incidents = sum(Incidents)) %>%    # 1pt
  arrange(desc(Number_of_incidents))                     # 1pt
## `summarise()` ungrouping output (override with `.groups` argument)
head(Paddington)                             # 1pt
## # A tibble: 6 x 2
##   Subcategory                           Number_of_incidents
##   <chr>                                               <dbl>
## 1 Possession and/or use of cannabis                     154
## 2 Possession and/or use of cocaine                      111
## 3 Possession and/or use of other drugs                   82
## 4 Other drug offences                                    73
## 5 Dealing, trafficking in cocaine                        68
## 6 Possession and/or use of amphetamines                  57

Question 14: Let’s have a look at the changes over time for “Possession and/or use of cannabis” in the suburb of Paddington

To answer that question, we first filter on the suburb and the Subcategory, then group the incidents by year, and finally sum the number of incidents per year.

Paddington_cannabis <- crime_long_new %>% 
  dplyr::filter(Suburb == 'Paddington',                                   # 1pt
                Subcategory == 'Possession and/or use of cannabis') %>%   # 1pt
  group_by(Year) %>%                                                      # 1pt
  summarise(Number_of_incidents = sum(Incidents)) %>%                     # 1pt
  mutate(Year = as.numeric(Year))                                         # 1pt
## `summarise()` ungrouping output (override with `.groups` argument)
head(Paddington_cannabis,3)  # 1pt
## # A tibble: 3 x 2
##    Year Number_of_incidents
##   <dbl>               <dbl>
## 1  2010                  17
## 2  2011                  17
## 3  2012                  15

Question 15: Create a line plot to display the trend of the incidents that you calculated for Paddington.

On the x axis you should have the year and on the y axis you should display the number of incidents.

ggplot(Paddington_cannabis, aes(x = Year, y = Number_of_incidents)) + # 2pt 
  geom_line()                     # 1pt
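
Because Year is numeric, ggplot2 may place axis breaks at fractional values. The sketch below is one hedged way to force whole-year breaks; it uses only the first three yearly totals shown in Question 14, and the object names `yearly` and `p` are illustrative.

```r
library(ggplot2)

# First three yearly totals from the Paddington cannabis summary.
yearly <- data.frame(
  Year                = c(2010, 2011, 2012),
  Number_of_incidents = c(17, 17, 15)
)

p <- ggplot(yearly, aes(x = Year, y = Number_of_incidents)) +
  geom_line() +
  geom_point() +                               # mark each yearly value
  scale_x_continuous(breaks = yearly$Year)     # one axis break per year

p
```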

Question 17: Let’s now look at the total number of crime incidents in NSW and create a plot to visualize the trend.

crime_long_new %>% 
  dplyr::select(Incidents,                               # 1pt
                Year) %>%                                # 1pt
  group_by(Year) %>%                                     # 1pt
  summarise(Number_of_incidents = sum(Incidents)) %>%    # 1pt
  mutate(Year = as.numeric(Year)) %>%                    # 1pt
  ggplot(aes(x = Year, y = Number_of_incidents)) +       # 1pt
  geom_line()                                            # 1pt
## `summarise()` ungrouping output (override with `.groups` argument)

Question 18: Now let’s change the background color of the plot to white using theme_bw()

crime_long_new %>% 
  dplyr::select(Incidents,                               # 1pt
                Year) %>%                                # 1pt
  group_by(Year) %>%                                     # 1pt
  summarise(Number_of_incidents = sum(Incidents)) %>%    # 1pt
  mutate(Year = as.numeric(Year)) %>%                    # 1pt
  ggplot(aes(x = Year, y = Number_of_incidents)) +       # 1pt
  geom_line() +                                          # 1pt
  theme_bw()                                             # 1pt
## `summarise()` ungrouping output (override with `.groups` argument)

Question 19: Let’s change the line color to green and make it a dotted line

crime_long_new %>% 
  dplyr::select(Incidents,                               # 1pt
                Year) %>%                                # 1pt
  group_by(Year) %>%                                     # 1pt
  summarise(Number_of_incidents = sum(Incidents)) %>%    # 1pt
  mutate(Year = as.numeric(Year)) %>%                    # 1pt
  ggplot(aes(x = Year, y = Number_of_incidents)) +       # 1pt
  geom_line(linetype = 'dotted', color = 'green')        # 1pt
## `summarise()` ungrouping output (override with `.groups` argument)
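
Questions 17 to 19 repeat the same summary before each plot. As a sketch of one way to avoid the repetition, the yearly totals and the base plot can be stored once and then extended with different layers. The toy data and the names `yearly_totals` and `base_plot` are hypothetical.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data: two rows per year.
toy <- tibble(
  Year      = c(2010, 2010, 2011, 2011),
  Incidents = c(3, 4, 5, 1)
)

# Compute the yearly totals once.
yearly_totals <- toy %>%
  group_by(Year) %>%
  summarise(Number_of_incidents = sum(Incidents), .groups = "drop")

# Store the shared part of the plot, then add layers as needed.
base_plot <- ggplot(yearly_totals, aes(x = Year, y = Number_of_incidents))

base_plot + geom_line()                                    # plain line
base_plot + geom_line() + theme_bw()                       # white background
base_plot + geom_line(linetype = "dotted", colour = "green")  # dotted green line
```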

Question 20: Now let’s look at the total number of crime incidents for the suburbs Redfern, Coogee and Zetland by creating a bar chart where the incidents per suburb are shown side by side for each year:

comparison_data <- crime_long_new %>%
  dplyr::select(Suburb,                                                # 1pt
                Year,                                                  # 1pt
                Incidents) %>%                                         # 1pt
  dplyr::filter(Suburb %in% c('Redfern', 'Coogee', 'Zetland')) %>%     # 1pt
  group_by(Year, Suburb) %>%                                           # 1pt
  summarise(Number_of_incidents = sum(Incidents))                      # 1pt
## `summarise()` regrouping output by 'Year' (override with `.groups` argument)
ggplot(comparison_data, aes(x = Year,    # 1pt
                            y = Number_of_incidents,  # 1pt
                            fill = Suburb)) +  # 1pt
  geom_bar(stat = "identity",   # 1pt
           position = "dodge") +  # 1pt
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) # 1pt

Question 21: Change the x and y axis labels to “Years” and “Incidents” for the figure in question 20 and use the black and white theme

ggplot(comparison_data, aes(x = Year,     # 1pt
                            y = Number_of_incidents,  # 1pt
                            fill = Suburb)) +   # 1pt
  geom_bar(stat = "identity",    # 1pt
           position = "dodge") +  # 1pt
  theme_bw() +  # 1pt
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +  # 1pt
  xlab("Years") +  # 1pt
  ylab("Incidents")  # 1pt

Question 22: Add the following title to the figure in question 21: “Number of criminal incidents”

ggplot(comparison_data, aes(x = Year,    # 1pt
                            y = Number_of_incidents,  # 1pt
                            fill = Suburb)) +   # 1pt
  geom_bar(stat = "identity",   # 1pt
           position = "dodge") +  # 1pt
  theme_bw() +   # 1pt
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +   # 1pt
  xlab("Years") +   # 1pt
  ylab("Incidents") +   # 1pt
  ggtitle("Number of criminal incidents")  # 1pt
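
The axis labels and the title can also be set in a single call. The sketch below assumes ggplot2's `labs()` and uses `geom_col()`, which draws bars from the values in the data in the same way as `geom_bar(stat = "identity")`. The toy data reuses the 2010 and 2011 totals for Coogee and Redfern; the names `toy` and `p` are illustrative.

```r
library(ggplot2)

# Totals for two suburbs and two years, in long format.
toy <- data.frame(
  Year                = c(2010, 2010, 2011, 2011),
  Suburb              = c("Coogee", "Redfern", "Coogee", "Redfern"),
  Number_of_incidents = c(897, 3225, 1189, 3822)
)

p <- ggplot(toy, aes(x = Year, y = Number_of_incidents, fill = Suburb)) +
  geom_col(position = "dodge") +     # bars side by side within each year
  theme_bw() +
  labs(x = "Years",
       y = "Incidents",
       title = "Number of criminal incidents")

p
```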

Question 24: Transform the data set comparison_data into a wide format where Coogee, Redfern and Zetland are displayed as columns

comparison_data %>%
  pivot_wider(id_cols = Year,   # 1pt
              names_from = Suburb,  # 1pt
              values_from = Number_of_incidents)  # 1pt
## # A tibble: 10 x 4
## # Groups:   Year [10]
##     Year Coogee Redfern Zetland
##    <dbl>  <dbl>   <dbl>   <dbl>
##  1  2010    897    3225     197
##  2  2011   1189    3822     318
##  3  2012    877    3959     380
##  4  2013    885    4440     312
##  5  2014    762    4400     359
##  6  2015    912    4674     562
##  7  2016   1016    5623     493
##  8  2017   1011    4411     526
##  9  2018   1013    4102     572
## 10  2019   1119    4052     621
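
The "Groups: Year" line in the output appears because comparison_data is still grouped by Year. As a hedged sketch on toy data (using the 2010 and 2011 totals for two suburbs), calling ungroup() before pivot_wider() returns a plain tibble. The names `toy` and `toy_wide` are illustrative.

```r
library(dplyr)
library(tidyr)

# Long-format totals, grouped by Year as in comparison_data.
toy <- tibble(
  Year                = c(2010, 2010, 2011, 2011),
  Suburb              = c("Coogee", "Redfern", "Coogee", "Redfern"),
  Number_of_incidents = c(897, 3225, 1189, 3822)
) %>%
  group_by(Year)

toy_wide <- toy %>%
  ungroup() %>%                                   # drop the Year grouping
  pivot_wider(names_from = Suburb,                # one column per suburb
              values_from = Number_of_incidents)

toy_wide
```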