I’ll use the carData package, focusing on the Arrests dataset, which details arrests in Toronto for simple possession of small quantities of marijuana.
# install.packages("carData")
library(carData)
head(Arrests,5)
## released colour year age sex employed citizen checks
## 1 Yes White 2002 21 Male Yes Yes 3
## 2 No Black 1999 17 Male Yes Yes 3
## 3 Yes White 2000 24 Male Yes Yes 3
## 4 No Black 2000 46 Male Yes Yes 1
## 5 Yes Black 1999 27 Female Yes Yes 1
summary(Arrests)
## released colour year age sex
## No : 892 Black:1288 Min. :1997 Min. :12.00 Female: 443
## Yes:4334 White:3938 1st Qu.:1998 1st Qu.:18.00 Male :4783
## Median :2000 Median :21.00
## Mean :2000 Mean :23.85
## 3rd Qu.:2001 3rd Qu.:27.00
## Max. :2002 Max. :66.00
## employed citizen checks
## No :1115 No : 771 Min. :0.000
## Yes:4111 Yes:4455 1st Qu.:0.000
## Median :1.000
## Mean :1.636
## 3rd Qu.:3.000
## Max. :6.000
cat('Mean Age:', mean(Arrests$age), '\n',
    'Median Age:', median(Arrests$age), '\n')
## Mean Age: 23.84654
## Median Age: 21
cat('Mean Checks:', mean(Arrests$checks), '\n',
    'Median Checks:', median(Arrests$checks), '\n')
## Mean Checks: 1.636433
## Median Checks: 1
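# Caution: `sex = 'Male'` (single `=`) is swallowed as an unused named argument,
# so only `year >= 2000` filters rows here and females remain in the result.
# `sex == 'Male' & year >= 2000` would restrict the subset to males.
male_post2000 <- subset(Arrests,
                        sex = 'Male',
                        year >= 2000,
                        select = c(released, colour, year, age, sex, employed, checks)
)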
head(male_post2000,5)
## released colour year age sex employed checks
## 1 Yes White 2002 21 Male Yes 3
## 3 Yes White 2000 24 Male Yes 3
## 4 No Black 2000 46 Male Yes 1
## 9 Yes Black 2000 23 Male Yes 4
## 10 Yes White 2001 30 Male Yes 3
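One detail in the `subset()` call above deserves care: a condition written with a single `=` is treated as a named argument rather than a comparison, so it filters no rows. The toy example below contrasts that with combining both conditions using `==` and `&`. The data frame `toy` and its values are made up for illustration and only stand in for Arrests.

```r
# Made-up data standing in for Arrests
toy <- data.frame(
  sex  = c('Male', 'Female', 'Male', 'Female'),
  year = c(1999, 2001, 2002, 2000)
)

# Single `=`: `sex` is passed as a named argument and silently ignored,
# so only the year condition is applied
year_only <- subset(toy, sex = 'Male', year >= 2000)

# `==` combined with `&`: both conditions filter the rows
male_recent <- subset(toy, sex == 'Male' & year >= 2000)

nrow(year_only)    # 3 rows, females included
nrow(male_recent)  # 1 row
```

Checking `summary()` or `table()` on the filtered column afterwards is a quick way to confirm that a filter did what was intended.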
library(plyr)
male_post2000 <- rename(male_post2000,
                        c('released' = 'Release (Y/N)',
                          'colour' = 'Race',
                          'checks' = 'No. of Checks'))
head(male_post2000,5)
## Release (Y/N) Race year age sex employed No. of Checks
## 1 Yes White 2002 21 Male Yes 3
## 3 Yes White 2000 24 Male Yes 3
## 4 No Black 2000 46 Male Yes 1
## 9 Yes Black 2000 23 Male Yes 4
## 10 Yes White 2001 30 Male Yes 3
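If loading plyr only for `rename()` feels heavy, the same old-to-new mapping can be applied with base R. This is a minimal sketch rather than a drop-in replacement: `toy`, `new_names`, and `idx` are hypothetical names, and the data are invented.

```r
# Made-up data frame with two of the original column names
toy <- data.frame(released = c('Yes', 'No'), checks = c(3, 1))

# Named vector of old = new, the same shape passed to plyr::rename() above
new_names <- c(released = 'Release (Y/N)', checks = 'No. of Checks')

# Locate the old names and overwrite only those positions
idx <- match(names(new_names), names(toy))
names(toy)[idx] <- unname(new_names)

# Names containing spaces or punctuation need backticks with `$`
toy$`No. of Checks`
```

Descriptive names like these read well in printed output, but they have to be quoted with backticks everywhere else in the code.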
summary(male_post2000)
## Release (Y/N) Race year age sex
## No : 438 Black: 675 Min. :2000 Min. :12.00 Female: 216
## Yes:2320 White:2083 1st Qu.:2000 1st Qu.:18.00 Male :2542
## Median :2001 Median :21.00
## Mean :2001 Mean :23.73
## 3rd Qu.:2001 3rd Qu.:27.00
## Max. :2002 Max. :66.00
## employed No. of Checks
## No : 552 Min. :0.000
## Yes:2206 1st Qu.:0.000
## Median :1.000
## Mean :1.596
## 3rd Qu.:3.000
## Max. :6.000
First, we compare the mean and median age between the original dataset and the subset.
cat('Original Mean:', mean(Arrests$age),'\n',
'Subset Mean:', mean(male_post2000$age), '\n')
## Original Mean: 23.84654
## Subset Mean: 23.72698
cat('Original Median:',median(Arrests$age),'\n',
'Subset Median:',median(male_post2000$age),'\n')
## Original Median: 21
## Subset Median: 21
Next, we compare the mean and median number of checks between the two datasets.
cat('Original Mean:', mean(Arrests$checks),'\n',
'Subset Mean:', mean(male_post2000$`No. of Checks`), '\n')
## Original Mean: 1.636433
## Subset Mean: 1.596084
cat('Original Median:',median(Arrests$checks),'\n',
'Subset Median:',median(male_post2000$`No. of Checks`),'\n')
## Original Median: 1
## Subset Median: 1
The subset's means differ only slightly from the original's (23.73 vs. 23.85 for age, 1.60 vs. 1.64 for checks), and the medians are identical. This indicates that the subset largely resembles the original dataset.
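The comparison above repeats the same `cat()` pattern four times. It could be condensed into a small helper that returns both statistics for each dataset side by side. This is only a sketch: `compare_stats` is a hypothetical helper, and the two toy vectors stand in for a column from the original data and from the subset.

```r
# Hypothetical helper: mean and median of one numeric vector from each dataset
compare_stats <- function(original, subset_data) {
  rbind(
    original = c(mean = mean(original), median = median(original)),
    subset   = c(mean = mean(subset_data), median = median(subset_data))
  )
}

# Toy values standing in for, e.g., Arrests$age and male_post2000$age
full_ages <- c(17, 21, 24, 27, 46)
sub_ages  <- c(21, 24, 46)

stats <- compare_stats(full_ages, sub_ages)
print(stats)
```

Returning a matrix rather than printing with `cat()` keeps the numbers available for further checks, such as the difference between the two rows.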
For a factor column, we can rename the levels.
levels(male_post2000$employed)
## [1] "No" "Yes"
# New labels are assigned by position, so they must follow the existing order ("No", "Yes")
levels(male_post2000$employed) <- c('N','Y')
For a column with numerical data, we can perform an arithmetic operation.
male_post2000$age <- male_post2000$age + 5
Because this dataset has no character columns (only factors), I first convert the column to character. Then we can rename the values.
male_post2000$`Release (Y/N)` <- as.character(male_post2000$`Release (Y/N)`)
male_post2000['Release (Y/N)'][male_post2000['Release (Y/N)'] == 'Yes'] <- 'Y'
male_post2000['Release (Y/N)'][male_post2000['Release (Y/N)'] == 'No'] <- 'N'
head(male_post2000,10)
## Release (Y/N) Race year age sex employed No. of Checks
## 1 Y White 2002 26 Male Y 3
## 3 Y White 2000 29 Male Y 3
## 4 N Black 2000 51 Male Y 1
## 9 Y Black 2000 28 Male Y 4
## 10 Y White 2001 35 Male Y 3
## 12 Y White 2000 23 Male Y 3
## 13 Y White 2000 22 Male Y 1
## 16 Y White 2001 30 Male N 3
## 17 N White 2001 50 Male N 4
## 18 N White 2002 25 Male Y 5
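Because `levels<-` assigns new labels by position, it is easy to swap meanings by accident. As a sketch of two name-based alternatives, the toy example below uses the list form of `levels<-` for a factor and a named lookup vector for a character column. The vectors `employed`, `released`, and `lookup` are invented for illustration.

```r
# Toy factor with the same levels as the `employed` column
employed <- factor(c('Yes', 'No', 'Yes'))

# List form: each new level name is mapped from an explicit old level
levels(employed) <- list(N = 'No', Y = 'Yes')

# Toy character vector, recoded through a named lookup vector
released <- c('Yes', 'No', 'Yes')
lookup <- c(Yes = 'Y', No = 'N')
released <- unname(lookup[released])

as.character(employed)
released
```

Both forms state the old value next to the new one, so the mapping stays correct even if the order of the levels changes.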
I downloaded the original Arrests dataset and uploaded it to my GitHub repository; the code below reads the file from there.
Arrests2 <- read.csv('https://github.com/kac624/notebooks/raw/main/Arrests.csv')
head(Arrests2,5)
## released colour year age sex employed citizen checks
## 1 Yes White 2002 21 Male Yes Yes 3
## 2 No Black 1999 17 Male Yes Yes 3
## 3 Yes White 2000 24 Male Yes Yes 3
## 4 No Black 2000 46 Male Yes Yes 1
## 5 Yes Black 1999 27 Female Yes Yes 1
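The first rows match the original, but `head()` does not show column types. Since R 4.0, `read.csv()` reads text columns as character by default, whereas the carData version stores them as factors. The sketch below shows the round trip on a made-up data frame, using a temporary file in place of the hosted CSV; `toy`, `toy2`, and `path` are hypothetical names.

```r
# Made-up data frame with one factor column and one numeric column
toy <- data.frame(released = factor(c('Yes', 'No')), age = c(21, 17))

# Write to a temporary file in place of a hosted CSV
path <- tempfile(fileext = '.csv')
write.csv(toy, path, row.names = FALSE)

# stringsAsFactors = TRUE restores factor columns (the default is FALSE since R 4.0)
toy2 <- read.csv(path, stringsAsFactors = TRUE)
str(toy2)
```

Passing `row.names = FALSE` when writing avoids an extra index column appearing when the file is read back.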