1. Extracting crimes for 2024 and creating a scatter plot. The scatter plot shows the total number of crimes for each area code in red, and the blue dashed line shows the average number of crimes per area. The area with the highest crime count in 2024 is Area 01 (Central LA); the area with the lowest crime count is Area 16 (Foothill).

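# Load the packages used below: dplyr (pipeline), lubridate (year()), ggplot2 (plot)
library(dplyr)
library(lubridate)
library(ggplot2)
# Extract the occurrence and report years
# (assumes `DATE OCC` and `Date Rptd` are already parsed as dates)
crime <- crime %>%
  mutate(
    year_occ = year(`DATE OCC`),
    year_rptd = year(`Date Rptd`)
  )
# Keep only crimes that occurred in 2024
crime2024 <- crime %>%
  filter(year_occ == 2024)
# Count crimes per area, then create the scatter plot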
crimeRate2024 <- crime2024 %>%
  group_by(AREA) %>%
  summarise(total_crimes=n(), .groups = "drop")
mean_crime <- mean(crimeRate2024$total_crimes, na.rm = TRUE)
ggplot(crimeRate2024, aes(x = AREA, y= total_crimes))+
  geom_point(color="red", size=3)+
  geom_hline(yintercept = mean_crime, linetype="dashed", color="blue")+
  labs(title = "Crime count in 2024 by Area Code",
       x="Area Code",
       y="Number of Crimes")
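
The highest and lowest areas named above can also be read off the summary table rather than from the plot. The following is a minimal sketch, assuming a small made-up table in place of crimeRate2024; the object name area_counts and its numbers are hypothetical and only illustrate the lookup.

```r
# Hypothetical per-area counts standing in for crimeRate2024 (made-up numbers)
area_counts <- data.frame(
  AREA = c("01", "03", "16"),
  total_crimes = c(120, 95, 40),
  stringsAsFactors = FALSE
)

# Area with the most and the fewest recorded crimes
highest <- area_counts$AREA[which.max(area_counts$total_crimes)]
lowest  <- area_counts$AREA[which.min(area_counts$total_crimes)]

# Areas whose count lies above the overall mean (the dashed line in the plot)
above_mean <- area_counts$AREA[area_counts$total_crimes > mean(area_counts$total_crimes)]

print(c(highest = highest, lowest = lowest))
print(above_mean)
```

Running the same three lookups on crimeRate2024 itself would confirm the areas reported in the text.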

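3. Finding out whether there is a correlation between total crimes per year and average crimes per month. The correlation coefficient rounds to 1, which means there is an essentially perfect positive linear relationship between total crimes per year and average crimes per month. This is expected rather than surprising: the monthly average is presumably computed from the same yearly counts (the total divided by the number of months), so the two variables are almost exactly proportional by construction.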

# Merge total crimes and average monthly crimes per year
crime_summary <- total_crime_per_year %>%
  left_join(average_crime_per_month, by ="year_occ")
#Finding correlation
cor_test <- cor.test(crime_summary$Totalcrime, crime_summary$AverageCrimePerMonth)
print(cor_test)
## 
##  Pearson's product-moment correlation
## 
## data:  crime_summary$Totalcrime and crime_summary$AverageCrimePerMonth
## t = 7616.4, df = 4, p-value = 1.783e-15
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9999997 1.0000000
## sample estimates:
## cor 
##   1
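
To see why a coefficient this close to 1 is expected, here is a minimal sketch with made-up yearly totals (the vectors below are hypothetical and not taken from the data set). It assumes the monthly average is the yearly total divided by the number of months observed; how the original average_crime_per_month table was built is not shown, so this is only one plausible reading.

```r
# Hypothetical yearly totals (made-up numbers, not taken from the data set)
total_crime <- c(1200, 1500, 1100, 1800, 1650, 900)

# Average per month when every year covers all 12 months
avg_per_month <- total_crime / 12

# avg_per_month is an exact linear function of total_crime,
# so the Pearson correlation is 1 up to floating-point error
r_full <- cor(total_crime, avg_per_month)

# If one year covers only part of the year (10 months here, an assumed value),
# the relationship is no longer exactly proportional and r falls below 1
months_observed <- c(12, 12, 12, 12, 12, 10)
avg_partial <- total_crime / months_observed
r_partial <- cor(total_crime, avg_partial)

print(r_full)
print(r_partial)
```

A partially observed year is one possible reason a real result could fall marginally short of exactly 1, as the lower confidence bound above does.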

4. The histogram below shows the distribution of victim ages. The most frequent age recorded is 0. This most likely does not reflect real ages: in certain crime categories such as vehicle theft or vandalism there is often no individual victim whose age can be recorded, and the field appears to hold 0 in those cases. Setting those records aside, the most common victim ages fall between the late 20s and early 30s.

hist(crime$`Vict Age`,
     main = "Histogram of Victim Age",
     xlab = "Age",
     ylab = "Frequency",
     col = "lightblue",
     border = "black")
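
Since the spike at 0 hides the shape of the rest of the distribution, one option is to drop non-positive ages before plotting. The following is a minimal sketch on a made-up vector; vict_age and its values are hypothetical, and treating 0 as "no usable age" is an assumption rather than something documented here.

```r
# Hypothetical victim ages (made-up values); 0 stands in for records without a usable age
vict_age <- c(0, 0, 0, 0, 25, 27, 28, 29, 31, 33, 47, 62)

# Keep only strictly positive, non-missing ages before plotting
valid_age <- vict_age[!is.na(vict_age) & vict_age > 0]

# Same kind of histogram as above, restricted to the filtered ages, in 10-year bins
h <- hist(valid_age,
          breaks = seq(0, 100, by = 10),
          main = "Histogram of Victim Age (age > 0)",
          xlab = "Age",
          ylab = "Frequency",
          col = "lightblue",
          border = "black")

# Lower edge of the age bin with the most records
modal_bin <- h$breaks[which.max(h$counts)]
print(modal_bin)
```

Applied to the real column, the same filter would be `crime$`Vict Age`[crime$`Vict Age` > 0]`.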

5. Question: Is there a significant difference in the number of crimes between Area 1 and Area 3?

The results show that Area 1 had a higher average yearly crime count (11,611.67) than Area 3 (9,573.50). However, the p-value (0.55) is well above 0.05, and the 95% confidence interval for the difference in means (-5,425 to 9,502) includes 0, so the difference is not statistically significant. Therefore, the data do not provide evidence that yearly crime levels differ between these two areas.

# Create a subset with only Area 1 and Area 3
crime_subset <- crime %>%
  filter(AREA %in% c("01","03")) %>%
  mutate(AREA= factor(AREA))
#count crimes per year per area
crime_counts <-crime_subset %>%
  group_by(AREA,  year_occ) %>%
  summarise(TotalCrimes= n(), .groups= "drop")
# Welch two-sample t-test comparing Area 1 vs Area 3
t.test_result <- t.test(TotalCrimes ~ AREA, data =crime_counts)
print(t.test_result)
## 
##  Welch Two Sample t-test
## 
## data:  TotalCrimes by AREA
## t = 0.61315, df = 9.4664, p-value = 0.5542
## alternative hypothesis: true difference in means between group 01 and group 03 is not equal to 0
## 95 percent confidence interval:
##  -5425.367  9501.701
## sample estimates:
## mean in group 01 mean in group 03 
##         11611.67          9573.50
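
The interpretation above relies on two pieces of the test output, the p-value and the confidence interval. The following is a minimal sketch of pulling them out of the result object, using made-up yearly counts; the data frame counts and its numbers are hypothetical and do not reproduce the figures above.

```r
# Hypothetical yearly crime counts for two areas (made-up numbers)
counts <- data.frame(
  AREA = factor(rep(c("01", "03"), each = 6)),
  TotalCrimes = c(9000, 15000, 8000, 16000, 10000, 12000,
                  7000, 12000, 6500, 13000, 9000, 10000)
)

# t.test applies Welch's correction by default (var.equal = FALSE)
res <- t.test(TotalCrimes ~ AREA, data = counts)

# Pull out the pieces used in the interpretation
p_value <- res$p.value
ci <- res$conf.int

# At the 5% level, p > 0.05 goes together with a 95% interval that contains 0
ci_contains_zero <- ci[1] < 0 && ci[2] > 0

print(c(p_value = p_value, lower = ci[1], upper = ci[2]))
print(ci_contains_zero)
```

With only a handful of yearly counts per area, the interval is wide, which is why even a difference of a couple of thousand crimes per year may not reach significance.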