Selected (CSV) dataset: “Arrests” from the carData package.
# I loaded the tidyverse, which includes the dplyr package that I used for data wrangling.
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.4 v dplyr 1.0.2
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Data <- carData::Arrests
str(Data)
## 'data.frame': 5226 obs. of 8 variables:
## $ released: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 2 2 2 ...
## $ colour : Factor w/ 2 levels "Black","White": 2 1 2 1 1 1 2 2 1 2 ...
## $ year : int 2002 1999 2000 2000 1999 1998 1999 1998 2000 2001 ...
## $ age : int 21 17 24 46 27 16 40 34 23 30 ...
## $ sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 1 2 1 2 2 ...
## $ employed: Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 1 2 2 2 ...
## $ citizen : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ checks : int 3 3 3 1 1 0 0 1 4 3 ...
The data set has 5226 observations and 8 variables (5 factors and 3 integers).
Task 1. Use the summary function to gain an overview of the data set. Then display the mean and median for at least two attributes.
summary(Data)
## released colour year age sex
## No : 892 Black:1288 Min. :1997 Min. :12.00 Female: 443
## Yes:4334 White:3938 1st Qu.:1998 1st Qu.:18.00 Male :4783
## Median :2000 Median :21.00
## Mean :2000 Mean :23.85
## 3rd Qu.:2001 3rd Qu.:27.00
## Max. :2002 Max. :66.00
## employed citizen checks
## No :1115 No : 771 Min. :0.000
## Yes:4111 Yes:4455 1st Qu.:0.000
## Median :1.000
## Mean :1.636
## 3rd Qu.:3.000
## Max. :6.000
mean_age <- mean(Data$age, na.rm = TRUE); mean_age
## [1] 23.84654
mean_year <- mean(Data$year, na.rm = TRUE); mean_year
## [1] 1999.509
median_age <- median(Data$age, na.rm = TRUE); median_age
## [1] 21
median_year <- median(Data$year); median_year
## [1] 2000
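The na.rm argument tells mean() and median() to drop any NA values in the column before calculating; without it, a single NA makes the result NA. The summary above reports no NA's in this data set, so the argument does not change the results here.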
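As a minimal sketch of what na.rm does, the snippet below uses a small made-up vector. The name `ages` and its values are hypothetical and are not taken from the Arrests data.

```r
# Hypothetical ages with one missing value (not from the Arrests data)
ages <- c(21, 17, NA, 46)

mean(ages)                   # NA: the missing value propagates
mean(ages, na.rm = TRUE)     # 28: the NA is dropped before averaging
median(ages, na.rm = TRUE)   # 21
```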
Task 2. Create a new data frame with a subset of the columns and rows. Make sure to rename it.
I subset the data by selecting only the first 4 columns with the select function and keeping rows for the years 2000, 2001 or 2002 with the filter function. The new data frame is named arrest.
arrest <- Data %>% select(1:4) %>% filter(year ==c(2000,2001,2002))
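I used the pipe operator %>% (from magrittr, made available by dplyr). It forwards the value on its left, or the result of an expression, as the first argument of the next function call. Note that year == c(2000,2001,2002) compares the column against the recycled vector position by position, so it keeps only the rows whose year happens to match the value at that position (939 rows here); year %in% c(2000,2001,2002) would keep every row from those three years.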
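A minimal sketch of how `==` and `%in%` behave when the right-hand side is a vector of years. The data frame `toy` is hypothetical, and base R is used so the snippet runs without extra packages; filter() receives the same kind of logical vector.

```r
# Hypothetical six-row data frame (not the Arrests data)
toy <- data.frame(year = c(2000, 2000, 2001, 2001, 2002, 2002))

# `==` recycles c(2000, 2001, 2002) down the column and compares
# row 1 with 2000, row 2 with 2001, row 3 with 2002, row 4 with 2000, ...
keep_equal <- toy$year == c(2000, 2001, 2002)

# `%in%` tests each year for membership in the set
keep_in <- toy$year %in% c(2000, 2001, 2002)

sum(keep_equal)  # 2 rows kept
sum(keep_in)     # 6 rows kept
```

With `==`, rows are silently dropped without any warning whenever the column length is a multiple of the vector length, which is the case for 5226 rows and 3 years.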
Task 3. Create new column names for the new data frame.
colnames(arrest) <- c("Released_or_not","Skin_color","Year","Age")
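Assigning names by position works as long as the column order is known. As an alternative sketch, the names can be mapped from old to new so that the result does not depend on column order. The one-row data frame `toy` and the vector `new_names` are hypothetical names introduced for illustration.

```r
# Hypothetical one-row data frame with the same four column names
toy <- data.frame(released = "Yes", colour = "White", year = 2000L, age = 21L)

# Named vector: old name on the left, new name on the right
new_names <- c(released = "Released_or_not",
               colour   = "Skin_color",
               year     = "Year",
               age      = "Age")

# Look up each current name in the mapping
names(toy) <- unname(new_names[names(toy)])
names(toy)
```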
Task 4. Use the summary function to create an overview of your new data frame. Then print the mean and median for the same two attributes. Please compare.
summary(arrest)
## Released_or_not Skin_color Year Age
## No :150 Black:229 Min. :2000 Min. :13.00
## Yes:789 White:710 1st Qu.:2000 1st Qu.:18.00
## Median :2001 Median :21.00
## Mean :2001 Mean :23.52
## 3rd Qu.:2001 3rd Qu.:26.00
## Max. :2002 Max. :64.00
mean_Age <- mean(arrest$Age, na.rm = TRUE); mean_Age
## [1] 23.52077
mean_Year <- mean(arrest$Year, na.rm = TRUE); mean_Year
## [1] 2000.672
median_Age <- median(arrest$Age, na.rm = TRUE); median_Age
## [1] 21
median_Year <- median(arrest$Year); median_Year
## [1] 2001
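The mean age is nearly unchanged by subsetting (23.85 before, 23.52 after) and the median age is 21 in both cases. The mean and median year are higher in the subset (2000.67 and 2001, versus 1999.51 and 2000), as expected when only later years are kept.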
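The figures printed above can be placed side by side. This is a minimal sketch that re-enters the reported values by hand rather than recomputing them; `comparison` is a name introduced here for illustration.

```r
# Values copied from the printed output for Data (full) and arrest (subset)
comparison <- data.frame(
  statistic = c("mean age", "median age", "mean year", "median year"),
  full      = c(23.84654, 21, 1999.509, 2000),
  subset    = c(23.52077, 21, 2000.672, 2001)
)

# Change introduced by subsetting
comparison$difference <- comparison$subset - comparison$full
print(comparison)
```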
Task 5. For at least 3 values in a column please rename so that every value in that column is renamed.
First I convert the variable from factor to character. Then I recode the values “Yes” and “No” in the Released_or_not variable as “Y” and “N” respectively.
arrest$Released_or_not <- as.character(arrest$Released_or_not)
arrest[arrest$Released_or_not== "Yes",1] <- "Y"
arrest[arrest$Released_or_not== "No",1] <- "N"
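The same recoding can also be written with a named lookup vector, which scales to columns with three or more distinct values. This is a sketch on a made-up character vector; `released`, `lookup` and `recoded` are hypothetical names.

```r
# Hypothetical values (not the Arrests data)
released <- c("Yes", "No", "Yes", "Yes")

# Old value on the left, replacement on the right
lookup <- c(Yes = "Y", No = "N")

# Index the lookup by the old values, then drop the carried-over names
recoded <- unname(lookup[released])
recoded
```

Any value missing from `lookup` would come back as NA, which makes unhandled categories easy to spot.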
Task 6. Display enough rows to see examples of all of steps 1-5 above.
The head function below displays the first 15 rows of the new data frame, arrest.
head(arrest,15)
## Released_or_not Skin_color Year Age
## 4 N Black 2000 46
## 13 Y White 2000 17
## 17 N White 2001 45
## 18 N White 2002 20
## 23 Y White 2001 16
## 25 Y White 2000 49
## 26 Y White 2001 21
## 28 Y White 2000 28
## 31 Y White 2000 19
## 32 Y White 2001 30
## 41 Y White 2001 27
## 44 Y White 2001 17
## 50 N Black 2001 26
## 52 Y Black 2000 21
## 73 Y Black 2000 17