Week 2 R Homework

by Jhalak Das

01/05/2020

Selected (CSV) dataset: “Arrests” from the carData package.

# Load the tidyverse (it includes the dplyr package, which I use for data wrangling).
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.4     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
Data <- carData::Arrests
str(Data)
## 'data.frame':    5226 obs. of  8 variables:
##  $ released: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 2 2 2 ...
##  $ colour  : Factor w/ 2 levels "Black","White": 2 1 2 1 1 1 2 2 1 2 ...
##  $ year    : int  2002 1999 2000 2000 1999 1998 1999 1998 2000 2001 ...
##  $ age     : int  21 17 24 46 27 16 40 34 23 30 ...
##  $ sex     : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 1 2 1 2 2 ...
##  $ employed: Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 1 2 2 2 ...
##  $ citizen : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ checks  : int  3 3 3 1 1 0 0 1 4 3 ...

We can see that the data set has 5226 observations and 8 variables (5 factors and 3 integers).

Task 1. Use the summary function to gain an overview of the data set. Then display the mean and median for at least two attributes.

summary(Data)
##  released     colour          year           age            sex      
##  No : 892   Black:1288   Min.   :1997   Min.   :12.00   Female: 443  
##  Yes:4334   White:3938   1st Qu.:1998   1st Qu.:18.00   Male  :4783  
##                          Median :2000   Median :21.00                
##                          Mean   :2000   Mean   :23.85                
##                          3rd Qu.:2001   3rd Qu.:27.00                
##                          Max.   :2002   Max.   :66.00                
##  employed   citizen        checks     
##  No :1115   No : 771   Min.   :0.000  
##  Yes:4111   Yes:4455   1st Qu.:0.000  
##                        Median :1.000  
##                        Mean   :1.636  
##                        3rd Qu.:3.000  
##                        Max.   :6.000
mean_age <- mean(Data$age, na.rm = TRUE); mean_age
## [1] 23.84654
mean_year <- mean(Data$year, na.rm = TRUE); mean_year
## [1] 1999.509
median_age <- median(Data$age, na.rm = TRUE); median_age
## [1] 21
median_year <- median(Data$year); median_year
## [1] 2000

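The na.rm argument tells mean and median to drop any NA values in the column before calculating; without it, a single NA would make the result NA. The summary above reports no NA's in any column, so here it is only a safeguard.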
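As a small illustration of what na.rm does, here is a sketch on a made-up vector (the values and the name x are assumptions for the example, not rows from Arrests):

```r
# Made-up vector with one missing value (not taken from Arrests).
x <- c(21, NA, 24, 46)

mean_default <- mean(x)                  # NA: the missing value propagates
mean_clean   <- mean(x, na.rm = TRUE)    # mean of 21, 24 and 46 only
median_clean <- median(x, na.rm = TRUE)  # median of 21, 24 and 46 only

# Counting the missing values first shows whether na.rm matters at all.
n_missing <- sum(is.na(x))
```

Checking sum(is.na(...)) on a column before summarising it is a quick way to see whether na.rm changes anything.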

Task 2. Create a new data frame with a subset of the columns and rows. Make sure to rename it.

I will subset the data by selecting only the first 4 columns with the select function and keeping only the rows that correspond to the years 2000, 2001 or 2002 with the filter function. The new data frame is named arrest.

arrest <- Data %>% select(1:4) %>% filter(year ==c(2000,2001,2002))

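I used the pipe operator (%>%), which comes from the magrittr package and is re-exported by dplyr. This operator forwards a value, or the result of an expression, as the first argument of the next function call. One caveat about the filter: year == c(2000,2001,2002) compares element by element, with the three values recycled down the column, so a row is kept only when its year equals the recycled value at that position. That is why arrest has only 939 rows. The test year %in% c(2000,2001,2002) checks membership and would keep every row from those three years.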
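The difference between == and %in% when the right-hand side is a vector can be seen on a small made-up example (the vectors years and target are assumptions for illustration, not rows from Arrests):

```r
# Made-up years, not rows from Arrests.
years  <- c(2000, 1999, 2002, 2001, 2000, 2002)
target <- c(2000, 2001, 2002)

# `==` recycles target to 2000, 2001, 2002, 2000, 2001, 2002
# and compares it with years pair by pair.
by_position <- years == target

# `%in%` asks, for each year, whether it appears anywhere in target.
by_membership <- years %in% target

n_by_position   <- sum(by_position)    # elements kept by the == test
n_by_membership <- sum(by_membership)  # elements kept by the %in% test
```

Here the fourth and fifth elements (2001 and 2000) are in the target set but are dropped by ==, because they do not line up with the recycled value at their position. R gives no warning when the longer length is a multiple of the shorter one, which is the case for 5226 rows and 3 values.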

Task 3. Create new column names for the new data frame.

colnames(arrest) <- c("Released_or_not","Skin_color","Year","Age")

Task 4. Use the summary function to create an overview of your new data frame. Then print the mean and median for the same two attributes. Please compare.

summary(arrest)
##  Released_or_not Skin_color       Year           Age       
##  No :150         Black:229   Min.   :2000   Min.   :13.00  
##  Yes:789         White:710   1st Qu.:2000   1st Qu.:18.00  
##                              Median :2001   Median :21.00  
##                              Mean   :2001   Mean   :23.52  
##                              3rd Qu.:2001   3rd Qu.:26.00  
##                              Max.   :2002   Max.   :64.00
mean_Age <- mean(arrest$Age, na.rm = TRUE); mean_Age
## [1] 23.52077
mean_Year <- mean(arrest$Year, na.rm = TRUE); mean_Year
## [1] 2000.672
median_Age <- median(arrest$Age, na.rm = TRUE); median_Age
## [1] 21
median_Year <- median(arrest$Year); median_Year
## [1] 2001

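We see that the mean and median of age are approximately the same before and after subsetting the data (mean 23.85 vs. 23.52; median 21 in both). The mean and median of year are higher in the subset (2000.67 and 2001 vs. 1999.51 and 2000), as expected after dropping the earlier years.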

Task 5. For at least 3 values in a column please rename so that every value in that column is renamed.

First I convert the variable from factor to character. Then I rename the values “Yes” and “No” in the Released_or_not variable to “Y” and “N” respectively.

arrest$Released_or_not <- as.character(arrest$Released_or_not)

arrest[arrest$Released_or_not== "Yes",1] <- "Y"
arrest[arrest$Released_or_not== "No",1] <- "N"
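An alternative way to recode every value in one step is a named lookup vector. The sketch below uses a made-up factor called released and a hypothetical helper vector called lookup; neither is part of the homework code above:

```r
# Made-up factor standing in for the released column.
released <- factor(c("Yes", "No", "Yes", "Yes"))

# Named vector: the names are the old values, the elements are the new ones.
lookup <- c(Yes = "Y", No = "N")

# Index by the old value as character (not by the factor's integer code),
# then drop the names so only the new labels remain.
recoded <- unname(lookup[as.character(released)])
```

Converting to character before indexing matters: indexing with the factor itself would use its integer codes, which follow the alphabetical level order ("No", "Yes") rather than the order of lookup.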

Task 6. Display enough rows to see examples of all of steps 1-5 above.

The head function below displays the first 15 rows of our new data frame, arrest. The labels on the left are the row numbers from the original data set.

head(arrest,15)
##    Released_or_not Skin_color Year Age
## 4                N      Black 2000  46
## 13               Y      White 2000  17
## 17               N      White 2001  45
## 18               N      White 2002  20
## 23               Y      White 2001  16
## 25               Y      White 2000  49
## 26               Y      White 2001  21
## 28               Y      White 2000  28
## 31               Y      White 2000  19
## 32               Y      White 2001  30
## 41               Y      White 2001  27
## 44               Y      White 2001  17
## 50               N      Black 2001  26
## 52               Y      Black 2000  21
## 73               Y      Black 2000  17

THE END