Assignment 1

Name: [Charlotte Noroian] ID: [920958834] Date: [10/3/2024]

You will be writing code into what are called “chunks”. This is where you can actually run code. Chunks are delineated by the ``` marks you see around the large grey boxes. Written answers should be given outside of the chunks. You will be turning in the .Rmd with your name in the title, as well as a knitted HTML file.

For this assignment, and the next several assignments, use the following data set, which examines crime statistics in LA. This dataset, LA_Crime.xlsx, reflects incidents of crime in the City of Los Angeles on the 1st of September, 2023 (i.e., a single day). The source of the data can be found here: https://catalog.data.gov/dataset/crime-data-from-2020-to-present

The original dataset comprises more than 800,000 entries. However, I have narrowed it down to data from just one day, September 1, 2023, focusing solely on individuals identified as female (F) or male (M). The trimmed dataset consists of two columns: “Vict Age,” representing the age of the victim, and “Vict Sex,” indicating the victim’s gender. This reduced Excel file contains a total of 555 rows. Utilize this dataset to accomplish the following tasks.

RUN THE FOLLOWING CHUNK BEFORE STARTING

PART 1: Load the data into R for use in the rest of the assignment

#load the data into R
crime_stats<-read.csv("LA_Crime.csv")
  
#examine the data
str(crime_stats)
## 'data.frame':    555 obs. of  2 variables:
##  $ Vict.Age: int  23 23 56 40 27 63 30 18 32 54 ...
##  $ Vict.Sex: chr  "F" "M" "F" "M" ...

The str() function will tell you a lot about your data, such as the type of data it is, how many levels there are, and the dimensions of your data frame. Using str(), answer the following questions:

  1. What type of data is “Vict Age”? How many levels does it have? #“Vict Age” is integer data; levels do not apply to this type of data.
  2. What type of data is “Vict Sex”? How many levels does it have? #“Vict Sex” is character data. Strictly, only factors have levels, but the column takes two distinct values: M and F.
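
The distinction between character data and factor levels can be checked directly. The sketch below is only an illustration: `toy` is a made-up data frame whose column names mirror `crime_stats`, and `distinct_sexes`, `sex_factor`, `sex_levels`, and `n_levels` are names chosen here, not part of the assignment. In R 4.0 and later, read.csv() keeps text columns as character by default, which would explain why str() reports `chr` rather than a factor.

```r
# Toy data standing in for crime_stats (values are made up for illustration)
toy <- data.frame(
  Vict.Age = c(23L, 23L, 56L, 40L, 27L),
  Vict.Sex = c("F", "M", "F", "M", "F"),
  stringsAsFactors = FALSE
)

# A character column carries no levels attribute
class(toy$Vict.Sex)
levels(toy$Vict.Sex)

# The distinct values can still be listed directly
distinct_sexes <- unique(toy$Vict.Sex)

# Converting to a factor creates levels, sorted alphabetically by default
sex_factor <- factor(toy$Vict.Sex)
sex_levels <- levels(sex_factor)
n_levels <- nlevels(sex_factor)
```

If levels are wanted for the real data, the same factor() call could presumably be applied to crime_stats$Vict.Sex.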

PART 2: Create a frequency table and histogram for victim age

Visualization is often a good way to observe macro-level trends in data sets. A common method of doing this is using a histogram, which visualizes the number of occurrences of values within each range of the data. In this exercise, you will be making a histogram based upon the ages of the victims in the dataset. A frequency table lists the corresponding counts for each group.

#create a frequency table of victim sex by age, with ages divided into ten groups

frequency_table<-table(crime_stats$Vict.Sex, cut(crime_stats$Vict.Age,breaks = 10))

print(frequency_table)
##    
##     (7.92,16.3] (16.3,24.6] (24.6,32.9] (32.9,41.2] (41.2,49.5] (49.5,57.8]
##   F           8          36          80          72          32          29
##   M           7          23          44          60          41          27
##    
##     (57.8,66.1] (66.1,74.4] (74.4,82.7] (82.7,91.1]
##   F          22          16           7           1
##   M          33          14           2           1
#create a histogram with 10 bins

hist(crime_stats$Vict.Age, breaks = 10)

Now that you have created a frequency table and a histogram describing the crime statistics, what do they tell you about the age of victims of crimes? At what age is someone most likely to be a victim of crime, based on the data?
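
One way to read the most common age group off a frequency table is to look for the bin with the largest count. A minimal sketch, using made-up ages rather than the LA_Crime data (`toy_ages`, `age_bins`, `age_counts`, and `modal_bin` are names chosen for illustration):

```r
# Made-up ages for illustration only; not the LA_Crime data
toy_ages <- c(12, 19, 22, 25, 27, 28, 30, 31, 33, 35, 38, 41, 47, 52, 60, 71)

# Bin the ages with explicit break points so the intervals are easy to read
age_bins <- cut(toy_ages, breaks = seq(10, 80, by = 10))

# One-way frequency table: number of ages falling in each bin
age_counts <- table(age_bins)
print(age_counts)

# Label of the bin with the largest count (the modal bin)
modal_bin <- names(age_counts)[which.max(age_counts)]
print(modal_bin)
```

Passing explicit break points, rather than `breaks = 10`, gives round interval labels such as (20,30] instead of (16.3,24.6], which may make the table easier to interpret.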


PART 3: Create Male and Female tables

Now that we have observed trends in the overall dataset, we can dig a little deeper. We know the gender of each person in the dataset, so we may be able to learn more and make new inferences by examining how each group is affected differently.

#create two new dataframes: one for male victims and one for female victims
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
crime_male<- crime_stats %>% 
  filter(Vict.Sex == "M")

crime_female<- crime_stats %>%
  filter(Vict.Sex == "F")
#here I am using a package called tidyverse to select only the males. We'll learn more about this later, but try adapting the code to select only the females
  
# create a histogram for each data set with 10 bins
  
hist(crime_female$Vict.Age, breaks = 10)

hist(crime_male$Vict.Age, breaks = 10)

#adjust the bins to 20

hist(crime_female$Vict.Age, breaks = 20)

hist(crime_male$Vict.Age, breaks = 20)
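
For comparison, the same row selection can be sketched in base R without tidyverse. This is only an illustration on a made-up data frame (`toy`, `toy_male`, `toy_female`, and `by_sex` are names chosen here). One difference worth knowing: filter() drops rows where the condition is NA, whereas logical indexing with `[` returns NA-filled rows for them.

```r
# Made-up rows for illustration only; column names mirror crime_stats
toy <- data.frame(
  Vict.Age = c(23, 23, 56, 40, 27, 63),
  Vict.Sex = c("F", "M", "F", "M", "F", "M")
)

# Base-R counterpart of filter(Vict.Sex == "M"): keep rows where the test is TRUE
toy_male <- toy[toy$Vict.Sex == "M", ]
toy_female <- toy[toy$Vict.Sex == "F", ]

# split() produces both groups in one step, as a named list of data frames
by_sex <- split(toy, toy$Vict.Sex)

# Sanity check: the two subsets together account for every row
nrow(toy_male) + nrow(toy_female) == nrow(toy)
```

A check like the last line is a cheap way to confirm that no rows were lost or duplicated when the data were separated.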

  1. Based on the histograms for the male and female victims, how does the trend in the data change? What new inferences can you make now that you have separated the data? #Separating the data shows that for females, crime peaks between the ages of 20 and 40, whereas for males the peak falls later, toward middle age. In the frequency table, female counts are highest in the (24.6,32.9] bin (80) and male counts in the (32.9,41.2] bin (60), and males outnumber females in the (41.2,49.5] and (57.8,66.1] bins.

  2. What does changing the bins do? How may this affect your interpretation of the data? #Changing the breaks changes the width of each bin, and with it the level of detail shown. More breaks mean narrower bins, so each bar holds fewer observations and finer detail is visible, but the shape becomes noisier. Fewer breaks smooth the shape but can hide detail. With only 10 breaks, the frequency peaks at 30.
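
The effect of bin width can be checked numerically by asking hist() for its counts without drawing a plot. The sketch below uses made-up ages and explicit break points (`toy_ages`, `coarse`, and `fine` are names chosen for illustration). Explicit break points are used because, as the hist() documentation describes it, a single number such as `breaks = 10` is treated as a suggestion and the actual bin edges are set to "pretty" values.

```r
# Made-up ages for illustration only; not the LA_Crime data
toy_ages <- c(18, 21, 23, 24, 26, 27, 28, 29, 31, 32, 34, 36, 39, 42, 45, 51, 58, 66)

# plot = FALSE returns the bin edges and counts without drawing anything
coarse <- hist(toy_ages, breaks = seq(10, 70, by = 10), plot = FALSE)
fine <- hist(toy_ages, breaks = seq(10, 70, by = 5), plot = FALSE)

# Counts per bin for 10-year and 5-year bins
coarse$counts
fine$counts

# Both histograms contain every observation; only the grouping differs
sum(coarse$counts) == sum(fine$counts)

# The fine breaks subdivide the coarse ones, so no narrow bin can hold
# more observations than the widest coarse bin
max(fine$counts) <= max(coarse$counts)
```

In this toy example the tallest bar drops from 7 observations to 4 when the bins are halved, which illustrates why bar heights shrink as the number of breaks grows.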