Assignment 2

Name: Sydney Goodman ID: 919224391 Date: 10/15/2024

For this assignment, you will use various summary statistics, or measures of central tendency, to describe and summarize data. Measures of central tendency are another way to summarize and extract information from your data. Along with different forms of visualization, computing them is often a good first step when working with data. We will be using the same LA crime data as we did last week.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

#load in the LA crimes data (1 point)

file.exists("~/Downloads/la_crime.csv")

## [1] TRUE

setwd("~/Downloads") 
getwd()

## [1] "/Users/syd/Downloads"

LA_Crime <- read.csv("la_crime.csv")
?read.csv() 

str(LA_Crime)

## 'data.frame':    555 obs. of  2 variables:
##  $ Vict.Age: int  23 23 56 40 27 63 30 18 32 54 ...
##  $ Vict.Sex: chr  "F" "M" "F" "M" ...

#divide the data into male and female (1 point)
crime_male <- LA_Crime %>%
  filter(Vict.Sex == "M")

crime_female <- LA_Crime %>%
  filter(Vict.Sex == "F")

PART 1: Mean, Median and Mode

#first we will calculate the mean arithmetically. Using the sum function, calculate the mean victim age for the whole data set, then the male and female subsets. (1 point)

summary(LA_Crime$Vict.Age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00   29.00   37.00   40.51   51.00   91.00

summary(crime_female$Vict.Age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.00   28.00   35.00   39.29   49.00   91.00

summary(crime_male$Vict.Age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00   30.00   40.00   41.98   52.25   85.00
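The prompt above asks for the mean computed with sum(), while summary() reports it indirectly. A minimal sketch of the arithmetic version follows, assuming there are no missing ages. The vector `ages` holds only the first ten Vict.Age values shown by str() above, so its mean is not the mean of the full data set.

```r
# First ten Vict.Age values printed by str(LA_Crime); illustration only
ages <- c(23, 23, 56, 40, 27, 63, 30, 18, 32, 54)

# Arithmetic mean: total of the values divided by how many there are
manual_mean <- sum(ages) / length(ages)
manual_mean

# The built-in mean() should agree with the manual calculation
all.equal(manual_mean, mean(ages))

# On the real data the same pattern would be, for example:
# sum(LA_Crime$Vict.Age) / length(LA_Crime$Vict.Age)
# sum(crime_male$Vict.Age) / length(crime_male$Vict.Age)
# sum(crime_female$Vict.Age) / length(crime_female$Vict.Age)
```

If a column contained NA values, both sum() and mean() would need na.rm = TRUE, and the divisor would have to count only the non-missing values.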

#now calculate the mean age for the three data sets using the mean() function in R (1 point)

mean.vict.age <- mean(LA_Crime$Vict.Age)
mean.vict.age

## [1] 40.51351

mean.fem.age <- mean(crime_female$Vict.Age)
mean.fem.age

## [1] 39.29043

mean.male.age <- mean(crime_male$Vict.Age)
mean.male.age

## [1] 41.98413

#calculate the median age for the three data sets using the median() function in R (1 point)

med.vict.age <- median(LA_Crime$Vict.Age)
med.vict.age

## [1] 37

med.fem.age <- median(crime_female$Vict.Age)
med.fem.age

## [1] 35

med.male.age <- median(crime_male$Vict.Age)
med.male.age

## [1] 40

#R doesn't have a function built in to calculate the mode, so we are going to have to make one 

Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

#after you have run the above line, use the Mode() function we just made to calculate the mode for age in all three data sets (1 point)

mode.vict.age <- Mode(LA_Crime$Vict.Age)
mode.vict.age

## [1] 29

mode.fem.age <- Mode(crime_female$Vict.Age)
mode.fem.age

## [1] 29

mode.male.age <- Mode(crime_male$Vict.Age)
mode.male.age

## [1] 30
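One detail of the Mode() helper is worth checking before reading much into a single modal age: when two values are tied for most frequent, which.max() picks the first maximum, so the function returns whichever tied value appears first in the data. The sketch below repeats the same helper on a small made-up vector (`x` is hypothetical, not crime data) and shows one possible way to list every tied mode.

```r
# Same helper as above, repeated so this example runs on its own
Mode <- function(x) {
  ux <- unique(x)                          # distinct values, in order of first appearance
  ux[which.max(tabulate(match(x, ux)))]    # value with the highest count (first one if tied)
}

# Hypothetical vector in which 30 and 29 each appear twice
x <- c(30, 29, 30, 29, 41)
Mode(x)

# One way to see all tied modes: count each value, keep those with the top count
counts <- table(x)
all_modes <- as.numeric(names(counts)[counts == max(counts)])
all_modes
```

Running table() on LA_Crime$Vict.Age in the same way would show whether ages 29 and 30 are clear winners or nearly tied with neighbouring ages.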

1. For each dataset, how do the mean, median and mode differ? What does each measure of central tendency tell you about the data? (1 point) In every dataset the mean is higher than the median, and the mode is well below both. The mean is pulled upward by a relatively small number of older victims, the median marks the middle of the ordered ages, and the mode is the single most common age, so mean > median > mode points to a right-skewed age distribution. The male-only mean (41.98) is higher than the female-only mean (39.29) and the all-victims mean (40.51), which indicates that male victims tend to be older. The male median (40) and mode (30) are also higher than those for females (35 and 29) and for the whole dataset (37 and 29).

How do the measures of central tendency differ between the datasets? What inferences can you draw from these differences? (1 point) The differences in means and medians show that male victims tend to be slightly older than female victims. The modes are close (29 for females and for all victims, 30 for males), but the male mode is still slightly higher, so the ordering seen in the means and medians holds for the mode as well. This is consistent with, though it does not prove, a pattern of male victims being older than female victims in this sample. Because the male values sit above the all-victims values, the male subset pulls the overall figures up, while the female subset pulls them down.
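The comparison of means and medians across groups can also be done in one step rather than subset by subset. The following is a rough sketch in base R using tapply(); the data frame `toy` is a small hypothetical example with the same column names as LA_Crime, and its values are not taken from the assignment's results.

```r
# Hypothetical toy data with the same column names as LA_Crime
toy <- data.frame(
  Vict.Age = c(23, 23, 56, 40, 27, 63),
  Vict.Sex = c("F", "M", "F", "M", "F", "M")
)

# Mean and median age within each sex
means   <- tapply(toy$Vict.Age, toy$Vict.Sex, mean)
medians <- tapply(toy$Vict.Age, toy$Vict.Sex, median)

# A positive gap (mean above median) hints at a longer tail of older ages
gap <- means - medians

means
medians
gap
```

Replacing `toy` with LA_Crime would give the per-group means and medians reported above without creating separate subsets first.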

PART 2: Variance, Standard Deviation and Coefficient of Variation

#calculate the variance for all three datasets (1 point)

var.vict.age <- var(LA_Crime$Vict.Age)
var.vict.age

## [1] 239.3044

var.fem.age <- var(crime_female$Vict.Age)
var.fem.age

## [1] 245.3326

var.male.age <- var(crime_male$Vict.Age)
var.male.age

## [1] 229.0276

#calculate the standard deviation for all three datasets (1 point)

sd.vict.age <- sd(LA_Crime$Vict.Age)
sd.vict.age

## [1] 15.46947

sd.fem.age <- sd(crime_female$Vict.Age)
sd.fem.age

## [1] 15.6631

sd.male.age <- sd(crime_male$Vict.Age)
sd.male.age

## [1] 15.13366

#calculate the coefficient of variation for all three datasets (1 point)

cv.vict.age <- sd.vict.age/mean.vict.age * 100
cv.vict.age

## [1] 38.18348

cv.fem.age <- sd.fem.age/mean.fem.age * 100
cv.fem.age

## [1] 39.86492

cv.male.age <- sd.male.age/mean.male.age * 100
cv.male.age

## [1] 36.04614
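The three statistics in this part are closely related: the standard deviation is the square root of the variance, and the coefficient of variation is the standard deviation divided by the mean, times 100. A short sketch of those relationships follows. The vector `ages` again holds only the first ten Vict.Age values shown by str(), and the helper name `cv` is introduced here purely for illustration.

```r
# First ten Vict.Age values printed by str(LA_Crime); illustration only
ages <- c(23, 23, 56, 40, 27, 63, 30, 18, 32, 54)
n <- length(ages)

# Sample variance: squared deviations from the mean, divided by n - 1
manual_var <- sum((ages - mean(ages))^2) / (n - 1)

# Standard deviation is the square root of the variance
manual_sd <- sqrt(manual_var)

# Coefficient of variation as a percentage (hypothetical helper)
cv <- function(x) sd(x) / mean(x) * 100

all.equal(manual_var, var(ages))   # matches var()
all.equal(manual_sd, sd(ages))     # matches sd()
cv(ages)

# Consistency check on the rounded values reported above for the full data set
all.equal(sqrt(239.3044), 15.46947, tolerance = 1e-5)
```

Wrapping the coefficient of variation in a function avoids retyping the formula for each subset, for example cv(crime_male$Vict.Age).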

Each of these statistics describes the variability of the data. Describe what this means and what information you can gather from describing variation. (1 point) Variability shows how spread out the data points are around the center, which here indicates whether victim ages are similar to one another or widely different. The variance and standard deviation measure spread in squared years and years respectively, while the coefficient of variation expresses the standard deviation as a percentage of the mean, so groups with different means can be compared. This information helps flag potential outliers, shows which group's ages cluster most tightly (the group with the lowest variability), and indicates how wide a range of ages crime victims of each gender span in these datasets.

How do these variability statistics change between the three datasets? (1 point) The male subset has the lowest value for all three statistics (variance 229.03, standard deviation 15.13, coefficient of variation 36.05%), suggesting that male victims' ages are somewhat more concentrated. The female subset has the highest values (245.33, 15.66, 39.86%), meaning female victims' ages are spread over a somewhat wider range. The all-victims values (239.30, 15.47, 38.18%) fall between the two. The differences are small, so these are modest tendencies rather than strong contrasts.

Part 3: Quantiles

#calculate the quantiles for all three datasets (1 point)

quantiles.vict.age <- quantile(LA_Crime$Vict.Age)
quantiles.vict.age

##   0%  25%  50%  75% 100% 
##    8   29   37   51   91

quantiles.fem.age <- quantile(crime_female$Vict.Age)
quantiles.fem.age

##   0%  25%  50%  75% 100% 
##   12   28   35   49   91

quantiles.male.age <- quantile(crime_male$Vict.Age)
quantiles.male.age

##    0%   25%   50%   75%  100% 
##  8.00 30.00 40.00 52.25 85.00

For each dataset, what are the ages at the first and third quartiles? What can you infer from this information? (1 point) For all victims, the first quartile is 29 and the third is 51. For females only, the first quartile is 28 and the third is 49. For males only, the first quartile is 30 and the third is 52.25. The male-only quartiles are higher than both the female-only and the all-victims quartiles, so male victims in this sample tend to be older than female victims.
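To pull out only the first and third quartiles, quantile() accepts a probs argument, and IQR() gives the distance between them. The sketch below uses the first ten Vict.Age values shown by str() as a stand-in for the full column. It also illustrates why a quartile can be a non-integer such as 52.25: by default quantile() interpolates between neighbouring ordered values.

```r
# First ten Vict.Age values printed by str(LA_Crime); illustration only
ages <- c(23, 23, 56, 40, 27, 63, 30, 18, 32, 54)

# Request just the 25% and 75% quantiles (first and third quartiles)
q <- quantile(ages, probs = c(0.25, 0.75))
q

# Interquartile range: third quartile minus first quartile
iqr <- IQR(ages)
iqr

# The 75% value falls between the 7th and 8th ordered ages (40 and 54),
# so the default method interpolates rather than returning an observed age
sort(ages)
```

The interquartile range is a spread measure that is less sensitive to extreme ages than the standard deviation, so it could complement the Part 2 statistics.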

2024-10-07