Cleaning Various Datasets

This project takes three datasets posted on the CUNY SPS IS607 Fall 2016 forums, cleans and tidies them using dplyr and tidyr, and then analyzes the data.

The first dataset is a table showing the income distribution within various religious traditions.

# load the packages used throughout (%>%, filter, gather)
library(dplyr)
library(tidyr)

# load the income_religion dataset
income_religion <- read.csv("/home/jonboy1987/Desktop/CUNYSPS/IS607/Assignments/Project2/religion_income.csv")
names(income_religion)

## [1] "Religious.tradition" "Less.than..30.000"   "X.30.000..49.999"   
## [4] "X.50.000..99.999"    "X.100.000.or.more"   "Sample.Size"

# Change the attribute names of the dataset
colnames(income_religion) <- c("Religion", "Low_Class", 
                               "Middle_Class_1", "Middle_Class_2",
                               "Upper_Middle_Class", "Sample_Size")

income_religion

##                            Religion Low_Class Middle_Class_1
## 1                          Buddhist       36%            18%
## 2                          Catholic       36%            19%
## 3            Evangelical Protestant       35%            22%
## 4                             Hindu       17%            13%
## 5     Historically Black Protestant       53%            22%
## 6                 Jehovah's Witness       48%            25%
## 7                            Jewish       16%            15%
## 8               Mainline Protestant       29%            20%
## 9                            Mormon       27%            20%
## 10                           Muslim       34%            17%
## 11               Orthodox Christian       18%            17%
## 12 Unaffiliated (religious "nones")       33%            20%
##    Middle_Class_2 Upper_Middle_Class Sample_Size
## 1             32%                13%         233
## 2             26%                19%        6137
## 3             28%                14%        7462
## 4             34%                36%         172
## 5             17%                 8%        1704
## 6             22%                 4%         208
## 7             24%                44%         708
## 8             28%                23%        5208
## 9             33%                20%         594
## 10            29%                20%         205
## 11            36%                29%         155
## 12            26%                21%        6790

Data Dictionary

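Low_Class --> income < $30,000
Middle_Class_1 --> income between $30,000 and $49,999 inclusive
Middle_Class_2 --> income between $50,000 and $99,999 inclusive
Upper_Middle_Class --> income of $100,000 or more

Each value is the percentage of a tradition's respondents that falls in that bracket, so each row sums to roughly 100%.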

Let's use the dplyr and tidyr packages to combine the income brackets into a single column.

# Tidy up the dataset by grouping the income levels into a single column 
tidy_income_religion <- income_religion %>% gather(Working_Class,
                                                   percentage_income, 2:5)

# Check out the new dataset
tidy_income_religion

##                            Religion Sample_Size      Working_Class
## 1                          Buddhist         233          Low_Class
## 2                          Catholic        6137          Low_Class
## 3            Evangelical Protestant        7462          Low_Class
## 4                             Hindu         172          Low_Class
## 5     Historically Black Protestant        1704          Low_Class
## 6                 Jehovah's Witness         208          Low_Class
## 7                            Jewish         708          Low_Class
## 8               Mainline Protestant        5208          Low_Class
## 9                            Mormon         594          Low_Class
## 10                           Muslim         205          Low_Class
## 11               Orthodox Christian         155          Low_Class
## 12 Unaffiliated (religious "nones")        6790          Low_Class
## 13                         Buddhist         233     Middle_Class_1
## 14                         Catholic        6137     Middle_Class_1
## 15           Evangelical Protestant        7462     Middle_Class_1
## 16                            Hindu         172     Middle_Class_1
## 17    Historically Black Protestant        1704     Middle_Class_1
## 18                Jehovah's Witness         208     Middle_Class_1
## 19                           Jewish         708     Middle_Class_1
## 20              Mainline Protestant        5208     Middle_Class_1
## 21                           Mormon         594     Middle_Class_1
## 22                           Muslim         205     Middle_Class_1
## 23               Orthodox Christian         155     Middle_Class_1
## 24 Unaffiliated (religious "nones")        6790     Middle_Class_1
## 25                         Buddhist         233     Middle_Class_2
## 26                         Catholic        6137     Middle_Class_2
## 27           Evangelical Protestant        7462     Middle_Class_2
## 28                            Hindu         172     Middle_Class_2
## 29    Historically Black Protestant        1704     Middle_Class_2
## 30                Jehovah's Witness         208     Middle_Class_2
## 31                           Jewish         708     Middle_Class_2
## 32              Mainline Protestant        5208     Middle_Class_2
## 33                           Mormon         594     Middle_Class_2
## 34                           Muslim         205     Middle_Class_2
## 35               Orthodox Christian         155     Middle_Class_2
## 36 Unaffiliated (religious "nones")        6790     Middle_Class_2
## 37                         Buddhist         233 Upper_Middle_Class
## 38                         Catholic        6137 Upper_Middle_Class
## 39           Evangelical Protestant        7462 Upper_Middle_Class
## 40                            Hindu         172 Upper_Middle_Class
## 41    Historically Black Protestant        1704 Upper_Middle_Class
## 42                Jehovah's Witness         208 Upper_Middle_Class
## 43                           Jewish         708 Upper_Middle_Class
## 44              Mainline Protestant        5208 Upper_Middle_Class
## 45                           Mormon         594 Upper_Middle_Class
## 46                           Muslim         205 Upper_Middle_Class
## 47               Orthodox Christian         155 Upper_Middle_Class
## 48 Unaffiliated (religious "nones")        6790 Upper_Middle_Class
##    percentage_income
## 1                36%
## 2                36%
## 3                35%
## 4                17%
## 5                53%
## 6                48%
## 7                16%
## 8                29%
## 9                27%
## 10               34%
## 11               18%
## 12               33%
## 13               18%
## 14               19%
## 15               22%
## 16               13%
## 17               22%
## 18               25%
## 19               15%
## 20               20%
## 21               20%
## 22               17%
## 23               17%
## 24               20%
## 25               32%
## 26               26%
## 27               28%
## 28               34%
## 29               17%
## 30               22%
## 31               24%
## 32               28%
## 33               33%
## 34               29%
## 35               36%
## 36               26%
## 37               13%
## 38               19%
## 39               14%
## 40               36%
## 41                8%
## 42                4%
## 43               44%
## 44               23%
## 45               20%
## 46               20%
## 47               29%
## 48               21%

summary(tidy_income_religion)

##                           Religion   Sample_Size     Working_Class     
##  Buddhist                     : 4   Min.   : 155.0   Length:48         
##  Catholic                     : 4   1st Qu.: 207.2   Class :character  
##  Evangelical Protestant       : 4   Median : 651.0   Mode  :character  
##  Hindu                        : 4   Mean   :2464.7                     
##  Historically Black Protestant: 4   3rd Qu.:5440.2                     
##  Jehovah's Witness            : 4   Max.   :7462.0                     
##  (Other)                      :24                                      
##  percentage_income 
##  Length:48         
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
##

# As income rises, does the percentage of Jehovah's Witnesses decrease?
JehovaW <- filter(tidy_income_religion, Religion == "Jehovah's Witness")
JehovaW

##            Religion Sample_Size      Working_Class percentage_income
## 1 Jehovah's Witness         208          Low_Class               48%
## 2 Jehovah's Witness         208     Middle_Class_1               25%
## 3 Jehovah's Witness         208     Middle_Class_2               22%
## 4 Jehovah's Witness         208 Upper_Middle_Class                4%
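
Because percentage_income is stored as text (see the summary above), the question is easier to answer after converting it to a number. A minimal sketch, re-entering the four Jehovah's Witness rows from the output by hand; the names `jw`, `pct` and `strictly_decreasing` are illustrative and not part of the original analysis:

```r
# The four rows printed above, re-entered by hand so the sketch is self-contained
jw <- data.frame(
  Working_Class = c("Low_Class", "Middle_Class_1",
                    "Middle_Class_2", "Upper_Middle_Class"),
  percentage_income = c("48%", "25%", "22%", "4%"),
  stringsAsFactors = FALSE
)

# Strip the trailing "%" and convert the remainder to numeric
jw$pct <- as.numeric(sub("%", "", jw$percentage_income, fixed = TRUE))

# TRUE if every step up in income bracket lowers the percentage
strictly_decreasing <- all(diff(jw$pct) < 0)
strictly_decreasing
```

For this sample of 208 respondents the share falls at every step. These are shares within the tradition and the brackets differ in width, so this is only a rough indication.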

The second dataset to examine covers social networking use:

http://www.pewresearch.org/data-trend/media-and-technology/social-networking-use/

social_net_use <- read.csv('/home/jonboy1987/Desktop/CUNYSPS/IS607/Assignments/Project2/socialNetworkingUse.csv')
social_net_use

##       Date All.internet.users X18.29 X30.49 X50.64 X65.
## 1   2/2005                  8      9      7      6    —
## 2   8/2006                 16     49      8      4    1
## 3   5/2008                 29     67     25     11    7
## 4   4/2009                 46     76     48     24   13
## 5   5/2010                 61     86     61     47   26
## 6   8/2011                 64     87     68     49   29
## 7   2/2012                 66     86     72     50   34
## 8   8/2012                 69     92     73     57   38
## 9  12/2012                 67     83     77     52   32
## 10  5/2013                 72     89     78     60   43
## 11  1/2014                 74     89     89     65   49
## 12  7/2015                 76     92     81     67   56

# Change the attribute names around
colnames(social_net_use) <- c("Date", "All_Internet_Users", "Young_Adults",
                              "Adults_Middleage1", "Adults_Middleage2",
                              "Seniors")
names(social_net_use)

## [1] "Date"               "All_Internet_Users" "Young_Adults"      
## [4] "Adults_Middleage1"  "Adults_Middleage2"  "Seniors"

# combine the age groups into one variable, Age_Group, for a tidier dataset
# (columns 2:6 also include the overall All_Internet_Users figure)

tidy_social_net_use <- social_net_use %>% gather(Age_Group, num_users, 2:6)

## Warning: attributes are not identical across measure variables; they will
## be dropped

tidy_social_net_use

##       Date          Age_Group num_users
## 1   2/2005 All_Internet_Users         8
## 2   8/2006 All_Internet_Users        16
## 3   5/2008 All_Internet_Users        29
## 4   4/2009 All_Internet_Users        46
## 5   5/2010 All_Internet_Users        61
## 6   8/2011 All_Internet_Users        64
## 7   2/2012 All_Internet_Users        66
## 8   8/2012 All_Internet_Users        69
## 9  12/2012 All_Internet_Users        67
## 10  5/2013 All_Internet_Users        72
## 11  1/2014 All_Internet_Users        74
## 12  7/2015 All_Internet_Users        76
## 13  2/2005       Young_Adults         9
## 14  8/2006       Young_Adults        49
## 15  5/2008       Young_Adults        67
## 16  4/2009       Young_Adults        76
## 17  5/2010       Young_Adults        86
## 18  8/2011       Young_Adults        87
## 19  2/2012       Young_Adults        86
## 20  8/2012       Young_Adults        92
## 21 12/2012       Young_Adults        83
## 22  5/2013       Young_Adults        89
## 23  1/2014       Young_Adults        89
## 24  7/2015       Young_Adults        92
## 25  2/2005  Adults_Middleage1         7
## 26  8/2006  Adults_Middleage1         8
## 27  5/2008  Adults_Middleage1        25
## 28  4/2009  Adults_Middleage1        48
## 29  5/2010  Adults_Middleage1        61
## 30  8/2011  Adults_Middleage1        68
## 31  2/2012  Adults_Middleage1        72
## 32  8/2012  Adults_Middleage1        73
## 33 12/2012  Adults_Middleage1        77
## 34  5/2013  Adults_Middleage1        78
## 35  1/2014  Adults_Middleage1        89
## 36  7/2015  Adults_Middleage1        81
## 37  2/2005  Adults_Middleage2         6
## 38  8/2006  Adults_Middleage2         4
## 39  5/2008  Adults_Middleage2        11
## 40  4/2009  Adults_Middleage2        24
## 41  5/2010  Adults_Middleage2        47
## 42  8/2011  Adults_Middleage2        49
## 43  2/2012  Adults_Middleage2        50
## 44  8/2012  Adults_Middleage2        57
## 45 12/2012  Adults_Middleage2        52
## 46  5/2013  Adults_Middleage2        60
## 47  1/2014  Adults_Middleage2        65
## 48  7/2015  Adults_Middleage2        67
## 49  2/2005            Seniors         —
## 50  8/2006            Seniors         1
## 51  5/2008            Seniors         7
## 52  4/2009            Seniors        13
## 53  5/2010            Seniors        26
## 54  8/2011            Seniors        29
## 55  2/2012            Seniors        34
## 56  8/2012            Seniors        38
## 57 12/2012            Seniors        32
## 58  5/2013            Seniors        43
## 59  1/2014            Seniors        49
## 60  7/2015            Seniors        56

Data Dictionary for the age groups in the tidy dataset:

All_Internet_Users --> all internet users, regardless of age
Young_Adults --> people who were 18-29 years of age
Adults_Middleage1 --> people who were 30-49 years of age
Adults_Middleage2 --> people who were 50-64 years of age
Seniors --> people who were 65+ years of age

Do some analysis on the second dataset, which shows social network usage among different age groups as of July 2015.

# Show a boxplot for each age group

# convert Age_Group to a factor and num_users to numeric; the non-numeric
# placeholder in the Seniors column for 2/2005 becomes NA
tidy_social_net_use <- tidy_social_net_use %>%
  mutate(Age_Group = factor(Age_Group),
         num_users = suppressWarnings(as.numeric(as.character(num_users))))

# Plot the boxplot without the default x-axis labels
with(tidy_social_net_use,
     plot(Age_Group, num_users,
          main = "Boxplot of social networking use for each age bracket",
          ylab = "Social networking use (num_users)", xaxt = "n"))

# draw one rotated label per box, just below the x axis
age_levels <- levels(tidy_social_net_use$Age_Group)
text(x = seq_along(age_levels), y = par("usr")[3] - .25,
     labels = age_levels, srt = 45, xpd = TRUE, adj = 1, cex = .75)

We can see that the distribution for each age bracket is left skewed rather than symmetric: the few low values come from the earliest surveys, while most observations sit near the top of the range. Read together with the table above, this suggests that social networking use rose in every age group over the years.
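
A quick numerical check on the direction of skew is to compare each group's mean with its median: a mean below the median points to a longer tail on the low side. The sketch below re-enters the values from the printed table by hand (the missing Seniors entry for 2/2005 is coded as NA); `social`, `by_group` and `skew_check` are illustrative names, not part of the original analysis:

```r
# Values copied from the printed table, one block of 12 survey dates per group
social <- data.frame(
  Age_Group = rep(c("Young_Adults", "Adults_Middleage1",
                    "Adults_Middleage2", "Seniors"), each = 12),
  num_users = c(
    9, 49, 67, 76, 86, 87, 86, 92, 83, 89, 89, 92,
    7,  8, 25, 48, 61, 68, 72, 73, 77, 78, 89, 81,
    6,  4, 11, 24, 47, 49, 50, 57, 52, 60, 65, 67,
    NA, 1,  7, 13, 26, 29, 34, 38, 32, 43, 49, 56),
  stringsAsFactors = FALSE
)

# Mean and median per age group, ignoring the missing value
by_group     <- split(social$num_users, social$Age_Group)
group_mean   <- sapply(by_group, mean, na.rm = TRUE)
group_median <- sapply(by_group, median, na.rm = TRUE)

skew_check <- data.frame(
  mean = round(group_mean, 1),
  median = group_median,
  mean_below_median = group_mean < group_median
)
skew_check
```

With these values the mean falls below the median in all four age groups, which points to a tail toward the low, early-survey values.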

The final dataset records crimes reported in Chicago, IL since 2001.

Note: this is a ~1.4 GB dataset, so make sure you have enough RAM to load it into R.

The data can be found at https://data.cityofchicago.org/view/5cd6-ry5g
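
The call that loads the file is not shown in this report. The str() output further down reports a data.table, which is what data.table::fread returns, so a load step might look like the sketch below. This is an assumption, not the original code: the file written here is a tiny stand-in with made-up rows, `sample_csv` and `crimes_small` are illustrative names, and reading only a few columns via `select` is just one way to ease the RAM pressure.

```r
library(data.table)

# Tiny stand-in for the real CSV (rows are made up for illustration only)
sample_csv <- tempfile(fileext = ".csv")
writeLines(c(
  "ID,Case Number,Primary Type,Description,Arrest",
  "1,AA000001,THEFT,RETAIL THEFT,true",
  "2,AA000002,ASSAULT,SIMPLE,false"
), sample_csv)

# Read only the columns needed later; keep Arrest as text so that
# comparisons such as Arrest == "true" behave as in the analysis
crimes_small <- fread(sample_csv,
                      select = c("ID", "Primary Type", "Description", "Arrest"),
                      colClasses = c(Arrest = "character"))

str(crimes_small)
```

For the full analysis the `select` argument would be dropped, since the str() output lists all 22 columns.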

Examine the dataset in more detail and get a sense of the data

str(chicago_crimes)

## Classes 'data.table' and 'data.frame':   6176743 obs. of  22 variables:
##  $ ID                  : int  10001595 10007031 10009684 10012713 10033820 10158010 10292456 10296227 10296236 10296237 ...
##  $ Case Number         : chr  "HY191041" "HY196398" "HY199045" "HY202475" ...
##  $ Date                : chr  "11/02/2014 05:57:00 AM" "03/08/2015 09:00:00 AM" "03/08/2015 04:00:00 AM" "03/08/2015 08:00:00 AM" ...
##  $ Block               : chr  "003XX W 110TH ST" "082XX S EVANS AVE" "018XX N SHEFFIELD AVE" "084XX S EXCHANGE AVE" ...
##  $ IUCR                : chr  "2825" "1155" "1150" "5002" ...
##  $ Primary Type        : chr  "OTHER OFFENSE" "DECEPTIVE PRACTICE" "DECEPTIVE PRACTICE" "OTHER OFFENSE" ...
##  $ Description         : chr  "HARASSMENT BY TELEPHONE" "AGGRAVATED FINANCIAL IDENTITY THEFT" "CREDIT CARD FRAUD" "OTHER VEHICLE OFFENSE" ...
##  $ Location Description: chr  "RESIDENCE" "APARTMENT" "RESIDENCE" "STREET" ...
##  $ Arrest              : chr  "false" "false" "false" "false" ...
##  $ Domestic            : chr  "false" "false" "false" "true" ...
##  $ Beat                : int  513 631 1813 423 1412 924 1524 922 1424 224 ...
##  $ District            : int  5 6 18 4 14 9 15 9 14 2 ...
##  $ Ward                : int  34 6 43 10 35 3 37 14 1 3 ...
##  $ Community Area      : int  49 44 7 46 21 61 25 58 24 38 ...
##  $ FBI Code            : chr  "26" "11" "11" "26" ...
##  $ X Coordinate        : int  1176052 1182660 1169343 1197281 1152188 1165992 NA 1158073 1163043 1177976 ...
##  $ Y Coordinate        : int  1831981 1850549 1912524 1849663 1920160 1874208 NA 1873363 1910387 1872910 ...
##  $ Year                : int  2014 2015 2015 2015 2014 2015 2014 2015 2015 2015 ...
##  $ Updated On          : chr  "02/04/2016 06:33:39 AM" "08/17/2015 03:03:40 PM" "08/17/2015 03:03:40 PM" "08/17/2015 03:03:40 PM" ...
##  $ Latitude            : num  41.7 41.7 41.9 41.7 41.9 ...
##  $ Longitude           : num  -87.6 -87.6 -87.7 -87.6 -87.7 ...
##  $ Location            : chr  "(41.694312, -87.631046)" "(41.745115, -87.606279)" "(41.915478, -87.653277)" "(41.742332, -87.552736)" ...
##  - attr(*, ".internal.selfref")=<externalptr>

summary(chicago_crimes)

##        ID           Case Number            Date          
##  Min.   :     634   Length:6176743     Length:6176743    
##  1st Qu.: 3249580   Class :character   Class :character  
##  Median : 5735233   Mode  :character   Mode  :character  
##  Mean   : 5775611                                        
##  3rd Qu.: 8176338                                        
##  Max.   :10709001                                        
##                                                          
##     Block               IUCR           Primary Type      
##  Length:6176743     Length:6176743     Length:6176743    
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##  Description        Location Description    Arrest         
##  Length:6176743     Length:6176743       Length:6176743    
##  Class :character   Class :character     Class :character  
##  Mode  :character   Mode  :character     Mode  :character  
##                                                            
##                                                            
##                                                            
##                                                            
##    Domestic              Beat         District          Ward       
##  Length:6176743     Min.   : 111   Min.   : 1.00   Min.   : 1.0    
##  Class :character   1st Qu.: 623   1st Qu.: 6.00   1st Qu.:10.0    
##  Mode  :character   Median :1111   Median :10.00   Median :22.0    
##                     Mean   :1196   Mean   :11.31   Mean   :22.6    
##                     3rd Qu.:1732   3rd Qu.:17.00   3rd Qu.:34.0    
##                     Max.   :2535   Max.   :31.00   Max.   :50.0    
##                                    NA's   :49      NA's   :614865  
##  Community Area     FBI Code          X Coordinate      Y Coordinate    
##  Min.   : 0.0     Length:6176743     Min.   :      0   Min.   :      0  
##  1st Qu.:23.0     Class :character   1st Qu.:1152908   1st Qu.:1859144  
##  Median :32.0     Mode  :character   Median :1165916   Median :1890181  
##  Mean   :37.7                        Mean   :1164472   Mean   :1885624  
##  3rd Qu.:58.0                        3rd Qu.:1176339   3rd Qu.:1909355  
##  Max.   :77.0                        Max.   :1205119   Max.   :1951622  
##  NA's   :616052                      NA's   :68415     NA's   :68415    
##       Year       Updated On           Latitude       Longitude     
##  Min.   :2001   Length:6176743     Min.   :36.62   Min.   :-91.69  
##  1st Qu.:2004   Class :character   1st Qu.:41.77   1st Qu.:-87.71  
##  Median :2007   Mode  :character   Median :41.85   Median :-87.67  
##  Mean   :2007                      Mean   :41.84   Mean   :-87.67  
##  3rd Qu.:2011                      3rd Qu.:41.91   3rd Qu.:-87.63  
##  Max.   :2016                      Max.   :42.02   Max.   :-87.52  
##                                    NA's   :68415   NA's   :68415   
##    Location        
##  Length:6176743    
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
##

Next, examine the crimes that led to an arrest and whose primary type and/or description mentions “assault” or “theft”.

But let's clean part of this dataset first.

# replace the spaces in the attribute names with a '_' symbol so the attribute
# names can be referenced with '$'
colnames(chicago_crimes) <- gsub(" ", "_", colnames(chicago_crimes))

# make Primary_Type, Description and Arrest factor variables instead of
# character vectors (better to work with); across() needs dplyr >= 1.0.0
tidy_chicago_crimes <- chicago_crimes %>%
  mutate(across(c(Primary_Type, Description, Arrest), factor))

# select the data where Arrest == true and the primary type or description have
# the words assault or theft in them

tidy_chicago_crimes_arrest <- tidy_chicago_crimes %>% filter(Arrest == "true")

# Any arrests dealing with assault?
tidy_chicago_crimes_arrest_assault <-
  tidy_chicago_crimes_arrest[grepl("assault",
                                   tidy_chicago_crimes_arrest$Description,
                                   ignore.case = TRUE) |
                               grepl("assault",
                                     tidy_chicago_crimes_arrest$Primary_Type,
                                     ignore.case = TRUE), ]
# Any arrests dealing with theft?
tidy_chicago_crimes_arrest_theft <-
  tidy_chicago_crimes_arrest[grepl("theft",
                                   tidy_chicago_crimes_arrest$Description,
                                   ignore.case = TRUE) |
                               grepl("theft",
                                     tidy_chicago_crimes_arrest$Primary_Type,
                                     ignore.case = TRUE), ]

dim(tidy_chicago_crimes_arrest_theft)

## [1] 207813     22

dim(tidy_chicago_crimes_arrest_assault)

## [1] 93427    22

Jonathan Hernandez

October 4, 2016

As we can see, assuming that matching those two strings in the description and primary type is enough to identify each crime, there have been about 207k thefts and about 93k assaults that led to an arrest in Chicago since 2001.
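
The assumption matters because the match is a substring search over two columns. A minimal sketch on a few hand-made rows shows what gets counted. The first row's primary type and description appear in the str() output above; the other rows, the `toy` data frame and the `mentions` helper are invented for illustration.

```r
# Hand-made rows; only the first pairing is taken from the str() output
toy <- data.frame(
  Primary_Type = c("DECEPTIVE PRACTICE", "THEFT", "ASSAULT", "BATTERY"),
  Description  = c("AGGRAVATED FINANCIAL IDENTITY THEFT", "RETAIL THEFT",
                   "SIMPLE", "SIMPLE"),
  stringsAsFactors = FALSE
)

# TRUE when either column contains the pattern, ignoring case
mentions <- function(pattern, df) {
  grepl(pattern, df$Description, ignore.case = TRUE) |
    grepl(pattern, df$Primary_Type, ignore.case = TRUE)
}

is_theft   <- mentions("theft", toy)
is_assault <- mentions("assault", toy)

sum(is_theft)                # rows counted as theft
sum(is_assault)              # rows counted as assault
sum(is_theft & is_assault)   # rows that would land in both subsets
```

Here the identity-theft row, whose primary type is DECEPTIVE PRACTICE, is counted as a theft, so the totals reflect any mention of the word rather than a single crime category. Running the last line on the real data would show whether any record is counted under both.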
