Assignment 2
Name: Sydney Goodman ID: 919224391 Date: 10/15/2024
For this assignment, you will use various summary statistics, or measures of central tendency, to describe and summarize data. Measures of central tendency are another way to summarize and get information from your data. Along with different forms of visualization, central tendency is often a good first step when working with data. We will be using the same LA crime data as we did last week.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
#load in the LA crimes data (1 point)
file.exists("~/Downloads/la_crime.csv")
## [1] TRUE
setwd("~/Downloads")
getwd()
## [1] "/Users/syd/Downloads"
LA_Crime<- read.csv("la_crime.csv")
?read.csv()
str(LA_Crime)
## 'data.frame': 555 obs. of 2 variables:
## $ Vict.Age: int 23 23 56 40 27 63 30 18 32 54 ...
## $ Vict.Sex: chr "F" "M" "F" "M" ...
#divide the data into male and female (1 point)
crime_male <-
  LA_Crime %>%
  filter(Vict.Sex == "M")
crime_female <-
  LA_Crime %>%
  filter(Vict.Sex == "F")
PART 1: Mean, Median and Mode
#first we will calculate the mean arithmetically. Using the sum() function, calculate the mean victim age for the whole data set, then for the male and female subsets. (1 point)
summary(LA_Crime$Vict.Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 29.00 37.00 40.51 51.00 91.00
summary(crime_female$Vict.Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.00 28.00 35.00 39.29 49.00 91.00
summary(crime_male$Vict.Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 30.00 40.00 41.98 52.25 85.00
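The prompt above asks for the mean computed arithmetically with sum(), while summary() reports it as one of several statistics. A minimal sketch of the arithmetic version follows; the vector `ages` holds only the first ten ages printed by str() earlier and stands in for the full column, and the name `manual_mean` is an assumption for illustration.

```r
# First ten ages shown in the str() output above; a stand-in for the full column
ages <- c(23, 23, 56, 40, 27, 63, 30, 18, 32, 54)

# Arithmetic mean: the total of the values divided by the number of values
manual_mean <- sum(ages) / length(ages)
manual_mean

# The built-in mean() should agree with the manual calculation
all.equal(manual_mean, mean(ages))
```

The same expression should carry over to the real data, for example sum(LA_Crime$Vict.Age) / length(LA_Crime$Vict.Age), and likewise for the male and female subsets.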
#now calculate the mean age for the three data sets using the mean() function in R (1 point)
mean.vict.age <- mean(LA_Crime$Vict.Age)
mean.vict.age
## [1] 40.51351
mean.fem.age <- mean(crime_female$Vict.Age)
mean.fem.age
## [1] 39.29043
mean.male.age <- mean(crime_male$Vict.Age)
mean.male.age
## [1] 41.98413
#calculate the median age for the three data sets using the median() function in R (1 point)
med.vict.age <- median(LA_Crime$Vict.Age)
med.vict.age
## [1] 37
med.fem.age <- median(crime_female$Vict.Age)
med.fem.age
## [1] 35
med.male.age <- median(crime_male$Vict.Age)
med.male.age
## [1] 40
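#R doesn't have a built-in function to calculate the statistical mode (the built-in mode() reports an object's storage type instead), so we are going to have to make one
Mode <- function(x) {
  ux <- unique(x)                        #distinct values in x
  ux[which.max(tabulate(match(x, ux)))]  #most frequent value; ties go to the value that appears first in x
}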
#after you have run the above lines, use the Mode() function we just made to calculate the mode for age in all three data sets (1 point)
mode.vict.age <- Mode(LA_Crime$Vict.Age)
mode.vict.age
## [1] 29
mode.fem.age <- Mode(crime_female$Vict.Age)
mode.fem.age
## [1] 29
mode.male.age <- Mode(crime_male$Vict.Age)
mode.male.age
## [1] 30
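The Mode() helper above returns a single value, so when two ages are tied for most frequent it reports only the one that appears first in the data. Below is a minimal sketch of a variant that returns every tied value; the name `all_modes` and the sample vector are assumptions for illustration, not part of the assignment.

```r
# Mode() as defined in the assignment, repeated so this sketch runs on its own
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Hypothetical helper: return every value tied for the highest frequency
all_modes <- function(x) {
  counts <- table(x)                                   # frequency of each distinct value
  top <- names(counts)[as.vector(counts) == max(counts)]
  as.numeric(top)                                      # table() names are character; convert back
}

# Hypothetical ages in which 23 and 30 both occur twice
ages <- c(23, 23, 56, 40, 27, 63, 30, 18, 32, 54, 30)

Mode(ages)       # reports only the first of the tied values
all_modes(ages)  # reports both tied values
```

If all_modes() returns more than one age for a data set, the single value from Mode() is worth reporting with that caveat.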
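1. For each dataset, how do the mean, median and mode differ? What does each description of central tendency tell you about the data? (1 point)
In every dataset the mean is higher than the median, and the mode is well below both (for example, 40.51, 37 and 29 for the whole data set). The mean is the arithmetic average and is pulled upward by the oldest victims, the median is the middle age, and the mode is the single most common age. A mean above the median, with the mode below both, suggests the ages are right-skewed: most victims are relatively young, with a smaller number of much older victims. The mean for the male subset (41.98) is higher than for the female subset (39.29) and the whole data set (40.51), which suggests that male victims tend to be somewhat older. The median and mode for males (40 and 30) are also higher than for females (35 and 29) and for the whole data set (37 and 29).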
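One rough way to put a number on the gap between mean and median is Pearson's second skewness coefficient, 3 * (mean - median) / sd. The sketch below plugs in the values reported for the whole data set (the standard deviation is the one reported in Part 2); the variable names are assumptions for illustration, and this check is an addition, not something the assignment asks for.

```r
# Values copied from the reported output for the whole data set
mean_age   <- 40.51351
median_age <- 37
sd_age     <- 15.46947

# Pearson's second skewness coefficient: positive when the mean sits above the median
pearson_skew <- 3 * (mean_age - median_age) / sd_age
pearson_skew

# TRUE indicates a right-skewed (long upper tail) distribution by this measure
pearson_skew > 0
```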
PART 2: Variance, Standard Deviation and Coefficient of Variation
#calculate the variance for all three datasets (1 point)
var.vict.age <- var(LA_Crime$Vict.Age)
var.vict.age
## [1] 239.3044
var.fem.age <- var(crime_female$Vict.Age)
var.fem.age
## [1] 245.3326
var.male.age <- var(crime_male$Vict.Age)
var.male.age
## [1] 229.0276
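To show what var() computes, here is a minimal sketch of the sample variance worked out by hand: the sum of squared deviations from the mean, divided by n - 1. The vector `ages` holds only the first ten ages from the str() output and is a stand-in for the full column; `manual_var` is a name assumed for illustration.

```r
# First ten ages shown in the str() output; a stand-in for the full column
ages <- c(23, 23, 56, 40, 27, 63, 30, 18, 32, 54)
n <- length(ages)

# Sample variance: squared deviations from the mean, summed, divided by n - 1
manual_var <- sum((ages - mean(ages))^2) / (n - 1)
manual_var

# var() uses the same n - 1 denominator, so the two should agree
all.equal(manual_var, var(ages))
```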
#calculate the standard deviation for all three datasets (1 point)
sd.vict.age <- sd(LA_Crime$Vict.Age)
sd.vict.age
## [1] 15.46947
sd.fem.age <- sd(crime_female$Vict.Age)
sd.fem.age
## [1] 15.6631
sd.male.age <- sd(crime_male$Vict.Age)
sd.male.age
## [1] 15.13366
#calculate the coefficient of variation for all three datasets (1 point)
cv.vict.age <- sd.vict.age/mean.vict.age * 100
cv.vict.age
## [1] 38.18348
cv.fem.age <- sd.fem.age/mean.fem.age * 100
cv.fem.age
## [1] 39.86492
cv.male.age <- sd.male.age/mean.male.age * 100
cv.male.age
## [1] 36.04614
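The three statistics are linked: the standard deviation is the square root of the variance, and the coefficient of variation is the standard deviation as a percentage of the mean. A minimal sketch that checks both relationships against the values reported above for the whole data set; the helper name `cv` and the sample vector are assumptions for illustration.

```r
# Hypothetical helper: coefficient of variation as a percentage of the mean
cv <- function(x) {
  sd(x) / mean(x) * 100
}

# Check against the reported whole-data-set values
reported_var  <- 239.3044
reported_sd   <- 15.46947
reported_mean <- 40.51351

sqrt(reported_var)                 # should be close to the reported standard deviation
reported_sd / reported_mean * 100  # should be close to the reported CV of 38.18348

# The helper applied to the first ten ages from the str() output
ages <- c(23, 23, 56, 40, 27, 63, 30, 18, 32, 54)
cv(ages)
```

With the real data, cv(LA_Crime$Vict.Age) should reproduce cv.vict.age without needing the intermediate variables.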
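Each of these statistics describes the variability of the data. Describe what this means and what information you can gather from describing variation. (1 point)
Variability shows how spread out the data points are around the center; here, it shows whether victim ages are clustered together or widely dispersed. The variance and standard deviation measure that spread in squared years and in years, while the coefficient of variation expresses the standard deviation as a percentage of the mean, so spread can be compared between groups with different means. This information helps flag potential outliers, shows how tightly ages concentrate around the typical age, and indicates how wide the range of victim ages is for each gender in these datasets.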
How do these variability statistics change between the three datasets? (1 point)
The male subset has the lowest value for all three statistics (variance 229.03, standard deviation 15.13, coefficient of variation 36.05%), indicating that male victims' ages are somewhat more concentrated around their mean. The female subset has the highest values (245.33, 15.66 and 39.86%), meaning female victims' ages are spread over a slightly wider range. The whole data set falls between the two (239.30, 15.47 and 38.18%), showing that the male and female subsets balance each other out in the combined data. The differences are small, so the spread of ages is broadly similar across all three datasets.
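Comparing spread across groups can also be done in one step rather than one variable at a time. The sketch below uses base R's split() and sapply() on a small made-up data frame; `toy`, its values, and the helper `summarise_spread` are all assumptions for illustration (only the column names match the real data).

```r
# Hypothetical toy data with the same column names as the LA crime file
toy <- data.frame(
  Vict.Age = c(23, 23, 56, 40, 27, 63),
  Vict.Sex = c("F", "M", "F", "M", "F", "M")
)

# Hypothetical helper: mean, standard deviation and coefficient of variation
summarise_spread <- function(x) {
  c(mean = mean(x), sd = sd(x), cv = sd(x) / mean(x) * 100)
}

# One column per sex, one row per statistic
by_sex <- sapply(split(toy$Vict.Age, toy$Vict.Sex), summarise_spread)
by_sex
```

Base R is used here only to keep the sketch free of package dependencies; a dplyr group_by() and summarise() pipeline would be an equally reasonable choice given that the tidyverse is already loaded.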
PART 3: Quantiles
#calculate the quantiles for all three datasets (1 point)
quantiles.vict.age <- quantile(LA_Crime$Vict.Age)
quantiles.vict.age
## 0% 25% 50% 75% 100%
## 8 29 37 51 91
quantiles.fem.age <- quantile(crime_female$Vict.Age)
quantiles.fem.age
## 0% 25% 50% 75% 100%
## 12 28 35 49 91
quantiles.male.age <- quantile(crime_male$Vict.Age)
quantiles.male.age
## 0% 25% 50% 75% 100%
## 8.00 30.00 40.00 52.25 85.00
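The quartiles also give a spread measure that is less sensitive to extreme ages: the interquartile range, which is the 75% quantile minus the 25% quantile. A minimal sketch, again using the first ten ages from the str() output as a stand-in; the names `q` and `iqr_manual` are assumptions for illustration.

```r
# First ten ages shown in the str() output; a stand-in for the full column
ages <- c(23, 23, 56, 40, 27, 63, 30, 18, 32, 54)

# Request specific quantiles with the probs argument
q <- quantile(ages, probs = c(0.25, 0.5, 0.75))
q

# Interquartile range by hand: third quartile minus first quartile
iqr_manual <- unname(q["75%"] - q["25%"])
iqr_manual

# The built-in IQR() should agree, and the 50% quantile is the median
all.equal(iqr_manual, IQR(ages))
all.equal(unname(q["50%"]), median(ages))
```

Applied to the reported quartiles, the same subtraction gives 51 - 29 = 22 for the whole data set, 49 - 28 = 21 for the female subset, and 52.25 - 30 = 22.25 for the male subset.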