This project analyzes three datasets posted on the CUNY SPS IS607 Fall 2016
forums. Each dataset is cleaned and tidied with dplyr and tidyr and then
analyzed.
The first dataset is a table showing income by religious tradition.
# load the packages used for tidying
library(dplyr)
library(tidyr)
# load the income_religion dataset
income_religion <- read.csv("/home/jonboy1987/Desktop/CUNYSPS/IS607/Assignments/Project2/religion_income.csv")
names(income_religion)
## [1] "Religious.tradition" "Less.than..30.000" "X.30.000..49.999"
## [4] "X.50.000..99.999" "X.100.000.or.more" "Sample.Size"
# Change the attribute names of the dataset
colnames(income_religion) <- c("Religion", "Low_Class",
                               "Middle_Class_1", "Middle_Class_2",
                               "Upper_Middle_Class", "Sample_Size")
income_religion
## Religion Low_Class Middle_Class_1
## 1 Buddhist 36% 18%
## 2 Catholic 36% 19%
## 3 Evangelical Protestant 35% 22%
## 4 Hindu 17% 13%
## 5 Historically Black Protestant 53% 22%
## 6 Jehovah's Witness 48% 25%
## 7 Jewish 16% 15%
## 8 Mainline Protestant 29% 20%
## 9 Mormon 27% 20%
## 10 Muslim 34% 17%
## 11 Orthodox Christian 18% 17%
## 12 Unaffiliated (religious "nones") 33% 20%
## Middle_Class_2 Upper_Middle_Class Sample_Size
## 1 32% 13% 233
## 2 26% 19% 6137
## 3 28% 14% 7462
## 4 34% 36% 172
## 5 17% 8% 1704
## 6 22% 4% 208
## 7 24% 44% 708
## 8 28% 23% 5208
## 9 33% 20% 594
## 10 29% 20% 205
## 11 36% 29% 155
## 12 26% 21% 6790
Data Dictionary
- Low_Class –> income less than $30,000
- Middle_Class_1 –> income between $30,000 and $49,999 inclusive
- Middle_Class_2 –> income between $50,000 and $99,999 inclusive
- Upper_Middle_Class –> income of $100,000 or more
Let’s use the dplyr and tidyr packages to combine the four income columns
into one column.
# Tidy up the dataset by grouping the income levels into a single column
tidy_income_religion <- income_religion %>%
  gather(Working_Class, percentage_income, 2:5)
# Check out the new dataset
tidy_income_religion
## Religion Sample_Size Working_Class
## 1 Buddhist 233 Low_Class
## 2 Catholic 6137 Low_Class
## 3 Evangelical Protestant 7462 Low_Class
## 4 Hindu 172 Low_Class
## 5 Historically Black Protestant 1704 Low_Class
## 6 Jehovah's Witness 208 Low_Class
## 7 Jewish 708 Low_Class
## 8 Mainline Protestant 5208 Low_Class
## 9 Mormon 594 Low_Class
## 10 Muslim 205 Low_Class
## 11 Orthodox Christian 155 Low_Class
## 12 Unaffiliated (religious "nones") 6790 Low_Class
## 13 Buddhist 233 Middle_Class_1
## 14 Catholic 6137 Middle_Class_1
## 15 Evangelical Protestant 7462 Middle_Class_1
## 16 Hindu 172 Middle_Class_1
## 17 Historically Black Protestant 1704 Middle_Class_1
## 18 Jehovah's Witness 208 Middle_Class_1
## 19 Jewish 708 Middle_Class_1
## 20 Mainline Protestant 5208 Middle_Class_1
## 21 Mormon 594 Middle_Class_1
## 22 Muslim 205 Middle_Class_1
## 23 Orthodox Christian 155 Middle_Class_1
## 24 Unaffiliated (religious "nones") 6790 Middle_Class_1
## 25 Buddhist 233 Middle_Class_2
## 26 Catholic 6137 Middle_Class_2
## 27 Evangelical Protestant 7462 Middle_Class_2
## 28 Hindu 172 Middle_Class_2
## 29 Historically Black Protestant 1704 Middle_Class_2
## 30 Jehovah's Witness 208 Middle_Class_2
## 31 Jewish 708 Middle_Class_2
## 32 Mainline Protestant 5208 Middle_Class_2
## 33 Mormon 594 Middle_Class_2
## 34 Muslim 205 Middle_Class_2
## 35 Orthodox Christian 155 Middle_Class_2
## 36 Unaffiliated (religious "nones") 6790 Middle_Class_2
## 37 Buddhist 233 Upper_Middle_Class
## 38 Catholic 6137 Upper_Middle_Class
## 39 Evangelical Protestant 7462 Upper_Middle_Class
## 40 Hindu 172 Upper_Middle_Class
## 41 Historically Black Protestant 1704 Upper_Middle_Class
## 42 Jehovah's Witness 208 Upper_Middle_Class
## 43 Jewish 708 Upper_Middle_Class
## 44 Mainline Protestant 5208 Upper_Middle_Class
## 45 Mormon 594 Upper_Middle_Class
## 46 Muslim 205 Upper_Middle_Class
## 47 Orthodox Christian 155 Upper_Middle_Class
## 48 Unaffiliated (religious "nones") 6790 Upper_Middle_Class
## percentage_income
## 1 36%
## 2 36%
## 3 35%
## 4 17%
## 5 53%
## 6 48%
## 7 16%
## 8 29%
## 9 27%
## 10 34%
## 11 18%
## 12 33%
## 13 18%
## 14 19%
## 15 22%
## 16 13%
## 17 22%
## 18 25%
## 19 15%
## 20 20%
## 21 20%
## 22 17%
## 23 17%
## 24 20%
## 25 32%
## 26 26%
## 27 28%
## 28 34%
## 29 17%
## 30 22%
## 31 24%
## 32 28%
## 33 33%
## 34 29%
## 35 36%
## 36 26%
## 37 13%
## 38 19%
## 39 14%
## 40 36%
## 41 8%
## 42 4%
## 43 44%
## 44 23%
## 45 20%
## 46 20%
## 47 29%
## 48 21%
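For comparison, the same reshaping can be sketched with tidyr's newer pivot_longer(), which supersedes gather(). This is only an illustration on the first two rows of the table above, not the code that produced the output shown. The names `income_small` and `long_income` are made up for the example, and it assumes tidyr 1.0.0 or later is installed.

```r
library(tidyr)

# First two rows of the table above, typed in by hand for illustration
income_small <- data.frame(
  Religion           = c("Buddhist", "Catholic"),
  Low_Class          = c("36%", "36%"),
  Middle_Class_1     = c("18%", "19%"),
  Middle_Class_2     = c("32%", "26%"),
  Upper_Middle_Class = c("13%", "19%"),
  Sample_Size        = c(233, 6137),
  stringsAsFactors   = FALSE
)

# Stack the four income columns into name/value pairs
long_income <- pivot_longer(income_small,
                            cols      = Low_Class:Upper_Middle_Class,
                            names_to  = "Working_Class",
                            values_to = "percentage_income")
long_income
```

Unlike gather(), pivot_longer() keeps each religion's rows together, so the row order differs from the output above even though the contents are the same.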
summary(tidy_income_religion)
## Religion Sample_Size Working_Class
## Buddhist : 4 Min. : 155.0 Length:48
## Catholic : 4 1st Qu.: 207.2 Class :character
## Evangelical Protestant : 4 Median : 651.0 Mode :character
## Hindu : 4 Mean :2464.7
## Historically Black Protestant: 4 3rd Qu.:5440.2
## Jehovah's Witness : 4 Max. :7462.0
## (Other) :24
## percentage_income
## Length:48
## Class :character
## Mode :character
##
##
##
##
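The summary shows percentage_income stored as character, so it cannot be averaged or compared numerically as it stands. One possible fix, sketched here in base R on the Jehovah's Witness row of the table, is to strip the % sign and convert. The names `pct_chr` and `pct_num` are made up for this example.

```r
# Percentages as they appear in the table (character strings)
pct_chr <- c("48%", "25%", "22%", "4%")

# Drop the trailing "%" and convert to numbers
pct_num <- as.numeric(sub("%", "", pct_chr, fixed = TRUE))
pct_num
```

On the full data, the same expression could be applied to the percentage_income column inside mutate().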
# Does the share of Jehovah's Witnesses decrease as the income bracket rises?
JehovaW <- filter(tidy_income_religion, Religion == "Jehovah's Witness")
JehovaW
## Religion Sample_Size Working_Class percentage_income
## 1 Jehovah's Witness 208 Low_Class 48%
## 2 Jehovah's Witness 208 Middle_Class_1 25%
## 3 Jehovah's Witness 208 Middle_Class_2 22%
## 4 Jehovah's Witness 208 Upper_Middle_Class 4%
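One way to check this numerically, as a rough sketch rather than a formal test, is to convert the four percentages to numbers and look at the change between consecutive brackets. The values are copied from the output above, and `jw_pct` and `bracket_change` are names made up for this example. Note that each percentage is the share of surveyed Jehovah's Witnesses who fall in that income bracket.

```r
# Share of Jehovah's Witness respondents in each bracket, lowest income first
jw_pct <- c(Low_Class = 48, Middle_Class_1 = 25,
            Middle_Class_2 = 22, Upper_Middle_Class = 4)

# Change from each bracket to the next; negative means the share drops
bracket_change <- diff(jw_pct)
bracket_change

# TRUE when the share falls at every step up in income
all(bracket_change < 0)
```

The share falls at every step, though the brackets are unequal in width and the sample is only 208 respondents, so this describes the survey rather than establishing a trend.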
Second dataset to examine (Social Networking Use)
Data Dictionary for the age groups in the tidy dataset:
- Young_Adults –> people who were 18-29 years of age
- Adults_Middleage1 –> people who were 30-49 years of age
- Adults_Middleage2 –> people who were 50-64 years of age
- Seniors –> people who were 65+ years of age
Now for some analysis of the second dataset, which shows social network
usage among different age groups as of July 2015.
# Show the boxplots of each type of age group
# convert Age_Group to a factor and num_users to numeric
tidy_social_net_use <- tidy_social_net_use %>%
  mutate(Age_Group = factor(Age_Group),
         num_users = as.numeric(num_users))
# Plot the boxplot (x-axis suppressed so rotated labels can be drawn)
with(tidy_social_net_use,
     plot(Age_Group, num_users,
          main = "Boxplot of Internet Users for each age bracket",
          ylab = "Number of Internet Users", xaxt = "n"))
# draw one rotated label per age group below the axis
age_labels <- levels(tidy_social_net_use$Age_Group)
text(seq_along(age_labels), par("usr")[3] - .25, labels = age_labels,
     srt = 45, xpd = TRUE, adj = 1, cex = .75)

The boxplots suggest that the distribution of the number of internet users
is roughly right-skewed within each age bracket. The medians for nearly every
age group may also indicate that people began using the internet more over
the years.
The final dataset shows crimes in Chicago, IL since 2001.
Note: this is a ~1.4 GB dataset, so make sure you have enough RAM to load
it into R.
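The str() output further down reports a data.table, which suggests the file was read with data.table's fread(), though the loading step is not shown. Assuming that, one way to reduce memory use is to read only the columns that are needed. The sketch below writes a two-row stand-in file (rows copied from the str() output, with only some of the columns) to a temporary path so that it runs on its own. The names `csv_path` and `crimes_subset` are made up for this example.

```r
library(data.table)

# Stand-in for the real file: two rows and five columns from the str() output
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "ID,Primary Type,Description,Arrest,Year",
  "10001595,OTHER OFFENSE,HARASSMENT BY TELEPHONE,false,2014",
  "10007031,DECEPTIVE PRACTICE,AGGRAVATED FINANCIAL IDENTITY THEFT,false,2015"
), csv_path)

# Read only the listed columns; `select` skips the rest of the file's columns
crimes_subset <- fread(csv_path,
                       select = c("Primary Type", "Description", "Arrest", "Year"))
crimes_subset
```

With the real file, `csv_path` would point at the downloaded CSV and the `select` vector would list whichever columns the analysis uses.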
Examine the dataset in more detail and get a sense of the data
str(chicago_crimes)
## Classes 'data.table' and 'data.frame': 6176743 obs. of 22 variables:
## $ ID : int 10001595 10007031 10009684 10012713 10033820 10158010 10292456 10296227 10296236 10296237 ...
## $ Case Number : chr "HY191041" "HY196398" "HY199045" "HY202475" ...
## $ Date : chr "11/02/2014 05:57:00 AM" "03/08/2015 09:00:00 AM" "03/08/2015 04:00:00 AM" "03/08/2015 08:00:00 AM" ...
## $ Block : chr "003XX W 110TH ST" "082XX S EVANS AVE" "018XX N SHEFFIELD AVE" "084XX S EXCHANGE AVE" ...
## $ IUCR : chr "2825" "1155" "1150" "5002" ...
## $ Primary Type : chr "OTHER OFFENSE" "DECEPTIVE PRACTICE" "DECEPTIVE PRACTICE" "OTHER OFFENSE" ...
## $ Description : chr "HARASSMENT BY TELEPHONE" "AGGRAVATED FINANCIAL IDENTITY THEFT" "CREDIT CARD FRAUD" "OTHER VEHICLE OFFENSE" ...
## $ Location Description: chr "RESIDENCE" "APARTMENT" "RESIDENCE" "STREET" ...
## $ Arrest : chr "false" "false" "false" "false" ...
## $ Domestic : chr "false" "false" "false" "true" ...
## $ Beat : int 513 631 1813 423 1412 924 1524 922 1424 224 ...
## $ District : int 5 6 18 4 14 9 15 9 14 2 ...
## $ Ward : int 34 6 43 10 35 3 37 14 1 3 ...
## $ Community Area : int 49 44 7 46 21 61 25 58 24 38 ...
## $ FBI Code : chr "26" "11" "11" "26" ...
## $ X Coordinate : int 1176052 1182660 1169343 1197281 1152188 1165992 NA 1158073 1163043 1177976 ...
## $ Y Coordinate : int 1831981 1850549 1912524 1849663 1920160 1874208 NA 1873363 1910387 1872910 ...
## $ Year : int 2014 2015 2015 2015 2014 2015 2014 2015 2015 2015 ...
## $ Updated On : chr "02/04/2016 06:33:39 AM" "08/17/2015 03:03:40 PM" "08/17/2015 03:03:40 PM" "08/17/2015 03:03:40 PM" ...
## $ Latitude : num 41.7 41.7 41.9 41.7 41.9 ...
## $ Longitude : num -87.6 -87.6 -87.7 -87.6 -87.7 ...
## $ Location : chr "(41.694312, -87.631046)" "(41.745115, -87.606279)" "(41.915478, -87.653277)" "(41.742332, -87.552736)" ...
## - attr(*, ".internal.selfref")=<externalptr>
summary(chicago_crimes)
## ID Case Number Date
## Min. : 634 Length:6176743 Length:6176743
## 1st Qu.: 3249580 Class :character Class :character
## Median : 5735233 Mode :character Mode :character
## Mean : 5775611
## 3rd Qu.: 8176338
## Max. :10709001
##
## Block IUCR Primary Type
## Length:6176743 Length:6176743 Length:6176743
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Description Location Description Arrest
## Length:6176743 Length:6176743 Length:6176743
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Domestic Beat District Ward
## Length:6176743 Min. : 111 Min. : 1.00 Min. : 1.0
## Class :character 1st Qu.: 623 1st Qu.: 6.00 1st Qu.:10.0
## Mode :character Median :1111 Median :10.00 Median :22.0
## Mean :1196 Mean :11.31 Mean :22.6
## 3rd Qu.:1732 3rd Qu.:17.00 3rd Qu.:34.0
## Max. :2535 Max. :31.00 Max. :50.0
## NA's :49 NA's :614865
## Community Area FBI Code X Coordinate Y Coordinate
## Min. : 0.0 Length:6176743 Min. : 0 Min. : 0
## 1st Qu.:23.0 Class :character 1st Qu.:1152908 1st Qu.:1859144
## Median :32.0 Mode :character Median :1165916 Median :1890181
## Mean :37.7 Mean :1164472 Mean :1885624
## 3rd Qu.:58.0 3rd Qu.:1176339 3rd Qu.:1909355
## Max. :77.0 Max. :1205119 Max. :1951622
## NA's :616052 NA's :68415 NA's :68415
## Year Updated On Latitude Longitude
## Min. :2001 Length:6176743 Min. :36.62 Min. :-91.69
## 1st Qu.:2004 Class :character 1st Qu.:41.77 1st Qu.:-87.71
## Median :2007 Mode :character Median :41.85 Median :-87.67
## Mean :2007 Mean :41.84 Mean :-87.67
## 3rd Qu.:2011 3rd Qu.:41.91 3rd Qu.:-87.63
## Max. :2016 Max. :42.02 Max. :-87.52
## NA's :68415 NA's :68415
## Location
## Length:6176743
## Class :character
## Mode :character
##
##
##
##
Examine the crimes that led to an arrest and whose primary type and/or
description mentions “assault” or “theft”.
But let’s clean part of this dataset first.
# replace spaces in the attribute names with a '_' symbol so the attribute
# names can be referenced with '$'
colnames(chicago_crimes) <- gsub(" ", "_", colnames(chicago_crimes))
# make Primary_Type, Description and Arrest factor variables instead of
# character vectors (better to work with)
tidy_chicago_crimes <- chicago_crimes %>%
  mutate(Primary_Type = factor(Primary_Type),
         Description = factor(Description),
         Arrest = factor(Arrest))
# select the data where Arrest == true and the primary type or description have
# the words assault or theft in them
tidy_chicago_crimes_arrest <- tidy_chicago_crimes %>% filter(Arrest == "true")
# Any arrests dealing with assault?
tidy_chicago_crimes_arrest_assault <-
  tidy_chicago_crimes_arrest[grepl("assault",
                                   tidy_chicago_crimes_arrest$Description,
                                   ignore.case = TRUE) |
                             grepl("assault",
                                   tidy_chicago_crimes_arrest$Primary_Type,
                                   ignore.case = TRUE), ]
# Any arrests dealing with theft?
tidy_chicago_crimes_arrest_theft <-
  tidy_chicago_crimes_arrest[grepl("theft",
                                   tidy_chicago_crimes_arrest$Description,
                                   ignore.case = TRUE) |
                             grepl("theft",
                                   tidy_chicago_crimes_arrest$Primary_Type,
                                   ignore.case = TRUE), ]
dim(tidy_chicago_crimes_arrest_theft)
## [1] 207813 22
dim(tidy_chicago_crimes_arrest_assault)
## [1] 93427 22
As we can see, assuming that matching those two strings in the description
and primary type captures each kind of crime, there were about 208k
theft-related and about 93k assault-related crimes with an arrest in Chicago
since 2001.
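The caveat matters because the match is on substrings. As a small illustration in base R, using the first four rows shown in the str() output above, a search for "theft" also picks up an identity-theft case filed under DECEPTIVE PRACTICE. The helper `matches_keyword()` and the `sample_crimes` data frame are made up for this sketch.

```r
# First four rows of Primary Type and Description from the str() output
sample_crimes <- data.frame(
  Primary_Type = c("OTHER OFFENSE", "DECEPTIVE PRACTICE",
                   "DECEPTIVE PRACTICE", "OTHER OFFENSE"),
  Description  = c("HARASSMENT BY TELEPHONE",
                   "AGGRAVATED FINANCIAL IDENTITY THEFT",
                   "CREDIT CARD FRAUD", "OTHER VEHICLE OFFENSE"),
  stringsAsFactors = FALSE
)

# TRUE for rows where the keyword appears in either column (case-insensitive)
matches_keyword <- function(crimes, keyword) {
  grepl(keyword, crimes$Description, ignore.case = TRUE) |
    grepl(keyword, crimes$Primary_Type, ignore.case = TRUE)
}

theft_rows <- matches_keyword(sample_crimes, "theft")
sample_crimes[theft_rows, ]
sum(theft_rows)
```

Tabulating Primary_Type for the matched rows on the full data would show which categories the matches actually come from.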