Setup

I’ll use the carData package, focusing on the Arrests dataset, which details arrests in Toronto for simple possession of small quantities of marijuana.

# install.packages("carData")
library(carData)
head(Arrests,5)
##   released colour year age    sex employed citizen checks
## 1      Yes  White 2002  21   Male      Yes     Yes      3
## 2       No  Black 1999  17   Male      Yes     Yes      3
## 3      Yes  White 2000  24   Male      Yes     Yes      3
## 4       No  Black 2000  46   Male      Yes     Yes      1
## 5      Yes  Black 1999  27 Female      Yes     Yes      1

Question #1 - Summary of Data

summary(Arrests)
##  released     colour          year           age            sex      
##  No : 892   Black:1288   Min.   :1997   Min.   :12.00   Female: 443  
##  Yes:4334   White:3938   1st Qu.:1998   1st Qu.:18.00   Male  :4783  
##                          Median :2000   Median :21.00                
##                          Mean   :2000   Mean   :23.85                
##                          3rd Qu.:2001   3rd Qu.:27.00                
##                          Max.   :2002   Max.   :66.00                
##  employed   citizen        checks     
##  No :1115   No : 771   Min.   :0.000  
##  Yes:4111   Yes:4455   1st Qu.:0.000  
##                        Median :1.000  
##                        Mean   :1.636  
##                        3rd Qu.:3.000  
##                        Max.   :6.000
cat('Mean Age:', mean(Arrests$age),'\n',
    'Median Age:', median(Arrests$age), '\n')
## Mean Age: 23.84654 
##  Median Age: 21
cat('Mean Checks:', mean(Arrests$checks),'\n',
    'Median Checks:', median(Arrests$checks), '\n')
## Mean Checks: 1.636433 
##  Median Checks: 1
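
The same two statistics can be collected for every numeric column in one pass. A minimal sketch, assuming a small data frame `toy` built from the five rows printed in the Setup section (so it runs without carData); `toy`, `numeric_cols`, and `stats` are names I made up for illustration:

```r
# Stand-in for Arrests: the five rows shown by head(Arrests, 5) above
toy <- data.frame(
  age    = c(21, 17, 24, 46, 27),
  checks = c(3, 3, 3, 1, 1),
  sex    = factor(c("Male", "Male", "Male", "Male", "Female"))
)

# Keep only the numeric columns, then compute mean and median for each one
numeric_cols <- toy[sapply(toy, is.numeric)]
stats <- sapply(numeric_cols, function(x) c(mean = mean(x), median = median(x)))

# Rows are the statistics, columns are the variables
print(stats)
```

Swapping `toy` for `Arrests` should give the same figures as the `cat()` calls above, although I have only run it on the toy rows.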

Question #2 - Subset of dataset

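# Caution: `sex = 'Male'` uses a single `=`, so subset() treats it as a named
# argument and ignores it; only `year >= 2000` filters rows here (the summary
# under Question #4 still lists Female rows).
# A filter on both conditions would read: sex == 'Male' & year >= 2000
male_post2000 <- subset(Arrests,
                        sex = 'Male',
                        year >= 2000,
                        select = c(released,colour,year,age,sex,employed,checks)
                        )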
head(male_post2000,5)
##    released colour year age  sex employed checks
## 1       Yes  White 2002  21 Male      Yes      3
## 3       Yes  White 2000  24 Male      Yes      3
## 4        No  Black 2000  46 Male      Yes      1
## 9       Yes  Black 2000  23 Male      Yes      4
## 10      Yes  White 2001  30 Male      Yes      3
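
When filtering on two conditions with `subset()`, the comparison operator and the way the conditions are combined both matter. A minimal sketch on a made-up data frame (`toy` and its values are hypothetical, not taken from Arrests) that contrasts a single `=` with `==` joined by `&`:

```r
# Hypothetical rows, invented for illustration only
toy <- data.frame(
  year = c(2002, 1999, 2000, 2001, 1999),
  sex  = factor(c("Male", "Male", "Female", "Male", "Female"))
)

# `sex = "Male"` is a named argument, not a comparison: subset() ignores it,
# so only the year condition filters rows
year_only <- subset(toy, sex = "Male", year >= 2000)

# `==` compares values, and `&` combines both conditions into one logical filter
male_recent <- subset(toy, sex == "Male" & year >= 2000)

print(nrow(year_only))
print(nrow(male_recent))
print(table(male_recent$sex))
```

A quick `table()` on the filtered column, as in the last line, is a cheap way to confirm that a filter did what was intended.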

Question #3 - New column names

library(plyr)
male_post2000 <- rename(male_post2000, c('released' = 'Release (Y/N)', 
                                         'colour' = 'Race', 
                                         'checks' = 'No. of Checks'
                                         )
                        )

head(male_post2000,5)
##    Release (Y/N)  Race year age  sex employed No. of Checks
## 1            Yes White 2002  21 Male      Yes             3
## 3            Yes White 2000  24 Male      Yes             3
## 4             No Black 2000  46 Male      Yes             1
## 9            Yes Black 2000  23 Male      Yes             4
## 10           Yes White 2001  30 Male      Yes             3
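
For comparison, the same renaming can be done without plyr. This is a sketch in base R on a made-up two-row data frame; `toy`, `new_names`, and `hit` are hypothetical names, and I am not suggesting this is better than `rename()`, only that it avoids the extra dependency:

```r
# Hypothetical two-row data frame with three of the Arrests column names
toy <- data.frame(
  released = c("Yes", "No"),
  colour   = c("White", "Black"),
  checks   = c(3, 1)
)

# Lookup from old name to new name
new_names <- c(released = "Release (Y/N)", colour = "Race", checks = "No. of Checks")

# Replace only the names that appear in the lookup; other columns keep theirs
hit <- names(toy) %in% names(new_names)
names(toy)[hit] <- unname(new_names[names(toy)[hit]])

print(names(toy))

# Names containing spaces or punctuation need backticks with `$`
print(toy$`No. of Checks`)
```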

Question #4 - Summary of new dataframe / Comparison to original dataset

summary(male_post2000)
##  Release (Y/N)    Race           year           age            sex      
##  No : 438      Black: 675   Min.   :2000   Min.   :12.00   Female: 216  
##  Yes:2320      White:2083   1st Qu.:2000   1st Qu.:18.00   Male  :2542  
##                             Median :2001   Median :21.00                
##                             Mean   :2001   Mean   :23.73                
##                             3rd Qu.:2001   3rd Qu.:27.00                
##                             Max.   :2002   Max.   :66.00                
##  employed   No. of Checks  
##  No : 552   Min.   :0.000  
##  Yes:2206   1st Qu.:0.000  
##             Median :1.000  
##             Mean   :1.596  
##             3rd Qu.:3.000  
##             Max.   :6.000

First, we compare the mean / median Age between the original and subset datasets.

cat('Original Mean:', mean(Arrests$age),'\n',
    'Subset Mean:', mean(male_post2000$age), '\n')
## Original Mean: 23.84654 
##  Subset Mean: 23.72698
cat('Original Median:',median(Arrests$age),'\n',
    'Subset Median:',median(male_post2000$age),'\n')
## Original Median: 21 
##  Subset Median: 21

Next, we compare the mean / median No. of checks between each dataset.

cat('Original Mean:', mean(Arrests$checks),'\n',
    'Subset Mean:', mean(male_post2000$`No. of Checks`), '\n')
## Original Mean: 1.636433 
##  Subset Mean: 1.596084
cat('Original Median:',median(Arrests$checks),'\n',
    'Subset Median:',median(male_post2000$`No. of Checks`),'\n')
## Original Median: 1 
##  Subset Median: 1

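The subset's means differ only slightly from the original's (age: 23.73 vs. 23.85; checks: 1.60 vs. 1.64), and the medians are identical, so the subset largely resembles the original dataset on these two variables. Since the summary above still lists 216 Female rows, this is effectively a comparison of arrests from 2000 onward against all years.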
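
The four `cat()` calls above can also be folded into one small comparison table. A sketch, assuming a hypothetical helper `compare_stats()` that I wrote for illustration, applied to the five rows printed in the Setup section rather than the full dataset:

```r
# Stand-in data: year and age from the five rows shown by head(Arrests, 5)
full <- data.frame(
  year = c(2002, 1999, 2000, 2000, 1999),
  age  = c(21, 17, 24, 46, 27)
)
recent <- subset(full, year >= 2000)

# Hypothetical helper: mean and median of one column in two data frames
compare_stats <- function(original, sub, column) {
  data.frame(
    dataset = c("original", "subset"),
    mean    = c(mean(original[[column]]), mean(sub[[column]])),
    median  = c(median(original[[column]]), median(sub[[column]]))
  )
}

age_cmp <- compare_stats(full, recent, "age")
print(age_cmp)
```

With only five rows the means differ more than they do in the full data, so the numbers here illustrate the layout, not the conclusion.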

Question #5 - Adjusting / Renaming Values for Three Columns

For a column with factors, we can change the levels.

levels(male_post2000$employed)
## [1] "No"  "Yes"
# levels() assigns by position, so the new labels follow the existing order: "No" -> 'N', "Yes" -> 'Y'
levels(male_post2000$employed) <- c('N','Y')

For a column with numerical data, we can perform an operation.

male_post2000$age <- male_post2000$age + 5

This dataset has no character columns (only factors and numerics), so I first convert a factor column to character and then rename its values.

male_post2000$`Release (Y/N)` <- as.character(male_post2000$`Release (Y/N)`)
male_post2000['Release (Y/N)'][male_post2000['Release (Y/N)'] == 'Yes'] <- 'Y'
male_post2000['Release (Y/N)'][male_post2000['Release (Y/N)'] == 'No'] <- 'N'

Question #6 - View Changes

head(male_post2000,10)
##    Release (Y/N)  Race year age  sex employed No. of Checks
## 1              Y White 2002  26 Male        Y             3
## 3              Y White 2000  29 Male        Y             3
## 4              N Black 2000  51 Male        Y             1
## 9              Y Black 2000  28 Male        Y             4
## 10             Y White 2001  35 Male        Y             3
## 12             Y White 2000  23 Male        Y             3
## 13             Y White 2000  22 Male        Y             1
## 16             Y White 2001  30 Male        N             3
## 17             N White 2001  50 Male        N             4
## 18             N White 2002  25 Male        Y             5
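
Because assigning to `levels()` relabels by position, it helps to check the existing level order before renaming, or to name each old level explicitly. A minimal sketch on a made-up factor (`employed`, `by_position`, and `by_name` are hypothetical names used only here):

```r
# Hypothetical factor, invented for illustration only
employed <- factor(c("Yes", "No", "Yes"))

# Levels are stored alphabetically here ("No", "Yes"); a positional
# assignment has to follow that order to keep the meaning intact
by_position <- employed
levels(by_position) <- c("N", "Y")

# Pairing each old level with its new label makes the mapping explicit
# and independent of the stored order
by_name <- factor(employed, levels = c("Yes", "No"), labels = c("Y", "N"))

print(as.character(by_position))
print(as.character(by_name))
```

Both approaches give the same values; the second is harder to get wrong when the level order is not the one you expect.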

BONUS

I downloaded the original Arrests dataset as a CSV file and uploaded it to my GitHub repository; the code below reads the file from there.

Arrests2 <- read.csv('https://github.com/kac624/notebooks/raw/main/Arrests.csv')
head(Arrests2,5)
##   released colour year age    sex employed citizen checks
## 1      Yes  White 2002  21   Male      Yes     Yes      3
## 2       No  Black 1999  17   Male      Yes     Yes      3
## 3      Yes  White 2000  24   Male      Yes     Yes      3
## 4       No  Black 2000  46   Male      Yes     Yes      1
## 5      Yes  Black 1999  27 Female      Yes     Yes      1
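
One thing to keep in mind when reading the data back from CSV is that the column types may differ from the packaged dataset. A sketch assuming R 4.0.0 or later, where `read.csv()` leaves text columns as character unless `stringsAsFactors = TRUE` is set; the inline CSV text is a made-up stand-in so the example runs without a network connection:

```r
# Hypothetical two-row CSV with a few of the Arrests column names
csv_text <- "released,colour,age
Yes,White,21
No,Black,17"

# Default behaviour (character columns in R >= 4.0.0)
as_read <- read.csv(text = csv_text)

# Ask for factors explicitly to mirror the packaged dataset's types
as_factors <- read.csv(text = csv_text, stringsAsFactors = TRUE)

print(class(as_read$released))
print(class(as_factors$released))
print(levels(as_factors$released))
```

If later steps rely on `levels()`, converting on read like this avoids a separate `as.factor()` pass.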