Introduction
The purpose of this assignment is to prepare data from several sources for analysis.
For each of the three chosen datasets: 1) Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below.
2) Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!] Perform the analysis requested in the discussion item. Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.
Dataset 1 Description
This data set is New York City Citibike usage during June 2016.
Analysis: 1. Determine the profile of a Citibike rider 2. Time of day the longest rides occur
Read data into R
citibike <- read.csv('201606_citibike_tripdata.csv',header=TRUE,stringsAsFactors = FALSE)
head(citibike)
## tripduration starttime stoptime start.station.id
## 1 1470 6/1/2016 00:00:18 6/1/2016 00:24:48 380
## 2 229 6/1/2016 00:00:20 6/1/2016 00:04:09 3092
## 3 344 6/1/2016 00:00:21 6/1/2016 00:06:06 449
## 4 1120 6/1/2016 00:00:28 6/1/2016 00:19:09 522
## 5 229 6/1/2016 00:00:53 6/1/2016 00:04:42 335
## 6 946 6/1/2016 00:01:01 6/1/2016 00:16:48 503
## start.station.name start.station.latitude start.station.longitude
## 1 W 4 St & 7 Ave S 40.73401 -74.00294
## 2 Berry St & N 8 St 40.71901 -73.95853
## 3 W 52 St & 9 Ave 40.76462 -73.98789
## 4 E 51 St & Lexington Ave 40.75715 -73.97208
## 5 Washington Pl & Broadway 40.72904 -73.99405
## 6 E 20 St & Park Ave 40.73827 -73.98752
## end.station.id end.station.name end.station.latitude
## 1 3236 W 42 St & Dyer Ave 40.75898
## 2 3103 N 11 St & Wythe Ave 40.72153
## 3 469 Broadway & W 53 St 40.76344
## 4 401 Allen St & Rivington St 40.72020
## 5 285 Broadway & E 14 St 40.73455
## 6 495 W 47 St & 10 Ave 40.76270
## end.station.longitude bikeid usertype birth.year gender
## 1 -73.99380 19859 Subscriber 1972 1
## 2 -73.95782 16233 Subscriber 1967 1
## 3 -73.98268 22397 Subscriber 1989 1
## 4 -73.98998 16231 Subscriber 1991 1
## 5 -73.99074 15400 Subscriber 1989 1
## 6 -73.99301 25193 Subscriber 1974 1
Explore the data set
# Verify whether is a data.frame
class(citibike)
## [1] "data.frame"
# Check the dimensions
dim(citibike)
## [1] 65499 15
# View the column names
names(citibike)
## [1] "tripduration" "starttime"
## [3] "stoptime" "start.station.id"
## [5] "start.station.name" "start.station.latitude"
## [7] "start.station.longitude" "end.station.id"
## [9] "end.station.name" "end.station.latitude"
## [11] "end.station.longitude" "bikeid"
## [13] "usertype" "birth.year"
## [15] "gender"
# View the beginning of the data
head(citibike)
## tripduration starttime stoptime start.station.id
## 1 1470 6/1/2016 00:00:18 6/1/2016 00:24:48 380
## 2 229 6/1/2016 00:00:20 6/1/2016 00:04:09 3092
## 3 344 6/1/2016 00:00:21 6/1/2016 00:06:06 449
## 4 1120 6/1/2016 00:00:28 6/1/2016 00:19:09 522
## 5 229 6/1/2016 00:00:53 6/1/2016 00:04:42 335
## 6 946 6/1/2016 00:01:01 6/1/2016 00:16:48 503
## start.station.name start.station.latitude start.station.longitude
## 1 W 4 St & 7 Ave S 40.73401 -74.00294
## 2 Berry St & N 8 St 40.71901 -73.95853
## 3 W 52 St & 9 Ave 40.76462 -73.98789
## 4 E 51 St & Lexington Ave 40.75715 -73.97208
## 5 Washington Pl & Broadway 40.72904 -73.99405
## 6 E 20 St & Park Ave 40.73827 -73.98752
## end.station.id end.station.name end.station.latitude
## 1 3236 W 42 St & Dyer Ave 40.75898
## 2 3103 N 11 St & Wythe Ave 40.72153
## 3 469 Broadway & W 53 St 40.76344
## 4 401 Allen St & Rivington St 40.72020
## 5 285 Broadway & E 14 St 40.73455
## 6 495 W 47 St & 10 Ave 40.76270
## end.station.longitude bikeid usertype birth.year gender
## 1 -73.99380 19859 Subscriber 1972 1
## 2 -73.95782 16233 Subscriber 1967 1
## 3 -73.98268 22397 Subscriber 1989 1
## 4 -73.98998 16231 Subscriber 1991 1
## 5 -73.99074 15400 Subscriber 1989 1
## 6 -73.99301 25193 Subscriber 1974 1
# View end of the data
tail(citibike)
## tripduration starttime stoptime start.station.id
## 65494 930 6/2/2016 08:58:40 6/2/2016 09:14:10 459
## 65495 482 6/2/2016 08:58:38 6/2/2016 09:06:41 237
## 65496 455 6/2/2016 08:58:39 6/2/2016 09:06:15 362
## 65497 2369 6/2/2016 08:58:41 6/2/2016 09:38:10 486
## 65498 575 6/2/2016 08:58:41 6/2/2016 09:08:17 250
## 65499 1892 6/2/2016 08:58:47 6/2/2016 09:30:19 426
## start.station.name start.station.latitude
## 65494 W 20 St & 11 Ave 40.74674
## 65495 E 11 St & 2 Ave 40.73047
## 65496 Broadway & W 37 St 40.75173
## 65497 Broadway & W 29 St 40.74620
## 65498 Lafayette St & Jersey St 40.72456
## 65499 West St & Chambers St 40.71755
## start.station.longitude end.station.id end.station.name
## 65494 -74.00776 151 Cleveland Pl & Spring St
## 65495 -73.98672 168 W 18 St & 6 Ave
## 65496 -73.98754 402 Broadway & E 22 St
## 65497 -73.98856 3136 5 Ave & E 63 St
## 65498 -73.99565 309 Murray St & West St
## 65499 -74.01322 322 Clinton St & Tillary St
## end.station.latitude end.station.longitude bikeid usertype
## 65494 40.72210 -73.99725 17801 Customer
## 65495 40.73971 -73.99456 15696 Subscriber
## 65496 40.74034 -73.98955 20228 Subscriber
## 65497 40.76650 -73.97148 16743 Subscriber
## 65498 40.71498 -74.01301 23672 Subscriber
## 65499 40.69619 -73.99122 18399 Customer
## birth.year gender
## 65494 NA 0
## 65495 1996 1
## 65496 1961 1
## 65497 1988 2
## 65498 1986 1
## 65499 NA 0
Summarize the data
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# View the structure of the data
glimpse(citibike)
## Observations: 65,499
## Variables: 15
## $ tripduration <int> 1470, 229, 344, 1120, 229, 946, 2351, ...
## $ starttime <chr> "6/1/2016 00:00:18", "6/1/2016 00:00:2...
## $ stoptime <chr> "6/1/2016 00:24:48", "6/1/2016 00:04:0...
## $ start.station.id <int> 380, 3092, 449, 522, 335, 503, 533, 49...
## $ start.station.name <chr> "W 4 St & 7 Ave S", "Berry St & N 8 St...
## $ start.station.latitude <dbl> 40.73401, 40.71901, 40.76462, 40.75715...
## $ start.station.longitude <dbl> -74.00294, -73.95853, -73.98789, -73.9...
## $ end.station.id <int> 3236, 3103, 469, 401, 285, 495, 386, 4...
## $ end.station.name <chr> "W 42 St & Dyer Ave", "N 11 St & Wythe...
## $ end.station.latitude <dbl> 40.75898, 40.72153, 40.76344, 40.72020...
## $ end.station.longitude <dbl> -73.99380, -73.95782, -73.98268, -73.9...
## $ bikeid <int> 19859, 16233, 22397, 16231, 15400, 251...
## $ usertype <chr> "Subscriber", "Subscriber", "Subscribe...
## $ birth.year <int> 1972, 1967, 1989, 1991, 1989, 1974, 19...
## $ gender <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1,...
# View a summary of the data
summary(citibike)
## tripduration starttime stoptime start.station.id
## Min. : 61 Length:65499 Length:65499 Min. : 72.0
## 1st Qu.: 404 Class :character Class :character 1st Qu.: 329.0
## Median : 662 Mode :character Mode :character Median : 456.0
## Mean : 1052 Mean : 954.5
## 3rd Qu.: 1094 3rd Qu.: 530.0
## Max. :3129291 Max. :3255.0
##
## start.station.name start.station.latitude start.station.longitude
## Length:65499 Min. :40.68 Min. :-74.02
## Class :character 1st Qu.:40.72 1st Qu.:-74.00
## Mode :character Median :40.74 Median :-73.99
## Mean :40.74 Mean :-73.99
## 3rd Qu.:40.76 3rd Qu.:-73.98
## Max. :40.79 Max. :-73.93
##
## end.station.id end.station.name end.station.latitude
## Min. : 72.0 Length:65499 Min. :40.65
## 1st Qu.: 327.0 Class :character 1st Qu.:40.72
## Median : 450.0 Mode :character Median :40.74
## Mean : 898.6 Mean :40.74
## 3rd Qu.: 524.0 3rd Qu.:40.75
## Max. :3255.0 Max. :40.79
##
## end.station.longitude bikeid usertype birth.year
## Min. :-74.03 Min. :14529 Length:65499 Min. :1885
## 1st Qu.:-74.00 1st Qu.:17125 Class :character 1st Qu.:1969
## Median :-73.99 Median :19788 Mode :character Median :1979
## Mean :-73.99 Mean :20006 Mean :1977
## 3rd Qu.:-73.98 3rd Qu.:22894 3rd Qu.:1986
## Max. :-73.93 Max. :25409 Max. :2000
## NA's :5626
## gender
## Min. :0.000
## 1st Qu.:1.000
## Median :1.000
## Mean :1.135
## 3rd Qu.:1.000
## Max. :2.000
##
Take a random sample of 300 since dataset is too large
citisample <- citibike[sample(nrow(citibike), 1000), ]
dim(citisample)
## [1] 1000 15
Filter usertype = customer from the dataset since there is no gender or year of birth information is avaialble they are coded as 0. Customers are non-scribers
citisampfilt <- filter(citisample, usertype == 'Subscriber')
# Remove NA
citina <- na.omit(citisampfilt)
Select colums to be used for analysis
library(dplyr)
citisubset <- select(citina, tripduration, starttime, stoptime, start.station.id, start.station.name, end.station.id, end.station.name, bikeid, usertype, birth.year, gender)
names(citisubset)
## [1] "tripduration" "starttime" "stoptime"
## [4] "start.station.id" "start.station.name" "end.station.id"
## [7] "end.station.name" "bikeid" "usertype"
## [10] "birth.year" "gender"
head(citisubset)
## tripduration starttime stoptime start.station.id
## 1 2383 6/1/2016 07:15:07 6/1/2016 07:54:50 467
## 2 431 6/1/2016 19:06:34 6/1/2016 19:13:46 304
## 3 237 6/2/2016 07:55:36 6/2/2016 07:59:33 458
## 4 365 6/1/2016 07:47:53 6/1/2016 07:53:59 349
## 5 92 6/1/2016 18:16:38 6/1/2016 18:18:10 527
## 6 2234 6/1/2016 10:40:52 6/1/2016 11:18:07 462
## start.station.name end.station.id end.station.name bikeid
## 1 Dean St & 4 Ave 466 W 25 St & 6 Ave 24945
## 2 Broadway & Battery Pl 3002 South End Ave & Liberty St 19688
## 3 11 Ave & W 27 St 453 W 22 St & 8 Ave 22553
## 4 Rivington St & Ridge St 428 E 3 St & 1 Ave 17980
## 5 E 33 St & 2 Ave 528 2 Ave & E 31 St 16026
## 6 W 22 St & 10 Ave 3153 E 71 St & 2 Ave 17974
## usertype birth.year gender
## 1 Subscriber 1980 1
## 2 Subscriber 1988 1
## 3 Subscriber 1977 1
## 4 Subscriber 1989 2
## 5 Subscriber 1963 2
## 6 Subscriber 1963 1
Separate starttime and stoptime columnsdate and time into 2 columns to perform analysis
# Split stattime colum into 2 STime and SDate
startDate <- format(as.POSIXct(strptime(citisubset$starttime,"%m/%d/%Y %H:%M:%S",tz="")), format = "%m/%d/%Y")
startTime <- format(as.POSIXct(strptime(citisubset$starttime,"%m/%d/%Y %H:%M:%S",tz="")), format = "%H:%M:%S")
citisubset$startDate <- startDate
citisubset$startTime <- startTime
# Split stoptime colum into 2 STime and SDate
stopDate <- format(as.POSIXct(strptime(citisubset$stoptime,"%m/%d/%Y %H:%M:%S",tz="")), format = "%m/%d/%Y")
stopTime <- format(as.POSIXct(strptime(citisubset$stoptime,"%m/%d/%Y %H:%M:%S",tz="")), format = "%H:%M:%S")
citisubset$stopDate <- stopDate
citisubset$stopTime <- stopTime
citisplit <- citisubset
head(citisplit)
## tripduration starttime stoptime start.station.id
## 1 2383 6/1/2016 07:15:07 6/1/2016 07:54:50 467
## 2 431 6/1/2016 19:06:34 6/1/2016 19:13:46 304
## 3 237 6/2/2016 07:55:36 6/2/2016 07:59:33 458
## 4 365 6/1/2016 07:47:53 6/1/2016 07:53:59 349
## 5 92 6/1/2016 18:16:38 6/1/2016 18:18:10 527
## 6 2234 6/1/2016 10:40:52 6/1/2016 11:18:07 462
## start.station.name end.station.id end.station.name bikeid
## 1 Dean St & 4 Ave 466 W 25 St & 6 Ave 24945
## 2 Broadway & Battery Pl 3002 South End Ave & Liberty St 19688
## 3 11 Ave & W 27 St 453 W 22 St & 8 Ave 22553
## 4 Rivington St & Ridge St 428 E 3 St & 1 Ave 17980
## 5 E 33 St & 2 Ave 528 2 Ave & E 31 St 16026
## 6 W 22 St & 10 Ave 3153 E 71 St & 2 Ave 17974
## usertype birth.year gender startDate startTime stopDate stopTime
## 1 Subscriber 1980 1 06/01/2016 07:15:07 06/01/2016 07:54:50
## 2 Subscriber 1988 1 06/01/2016 19:06:34 06/01/2016 19:13:46
## 3 Subscriber 1977 1 06/02/2016 07:55:36 06/02/2016 07:59:33
## 4 Subscriber 1989 2 06/01/2016 07:47:53 06/01/2016 07:53:59
## 5 Subscriber 1963 2 06/01/2016 18:16:38 06/01/2016 18:18:10
## 6 Subscriber 1963 1 06/01/2016 10:40:52 06/01/2016 11:18:07
Drop starttime and endtime columns
citisplit <- subset(citisubset, select = c(1,4,5,6,7,8,9,10,11,12,13,14,15))
head(citisplit)
## tripduration start.station.id start.station.name end.station.id
## 1 2383 467 Dean St & 4 Ave 466
## 2 431 304 Broadway & Battery Pl 3002
## 3 237 458 11 Ave & W 27 St 453
## 4 365 349 Rivington St & Ridge St 428
## 5 92 527 E 33 St & 2 Ave 528
## 6 2234 462 W 22 St & 10 Ave 3153
## end.station.name bikeid usertype birth.year gender
## 1 W 25 St & 6 Ave 24945 Subscriber 1980 1
## 2 South End Ave & Liberty St 19688 Subscriber 1988 1
## 3 W 22 St & 8 Ave 22553 Subscriber 1977 1
## 4 E 3 St & 1 Ave 17980 Subscriber 1989 2
## 5 2 Ave & E 31 St 16026 Subscriber 1963 2
## 6 E 71 St & 2 Ave 17974 Subscriber 1963 1
## startDate startTime stopDate stopTime
## 1 06/01/2016 07:15:07 06/01/2016 07:54:50
## 2 06/01/2016 19:06:34 06/01/2016 19:13:46
## 3 06/02/2016 07:55:36 06/02/2016 07:59:33
## 4 06/01/2016 07:47:53 06/01/2016 07:53:59
## 5 06/01/2016 18:16:38 06/01/2016 18:18:10
## 6 06/01/2016 10:40:52 06/01/2016 11:18:07
Calculate age and add column to dataset
citisplit$stopDated = as.Date(citisplit$stopDate, format="%m/%d/%Y")
citisplit$birth.yearn = as.numeric(substr(citisplit$birth.year,1,999))
citisplit$age = as.numeric(format(citisplit$stopDated,"%Y")) - citisplit$birth.yearn
head(citisplit)
## tripduration start.station.id start.station.name end.station.id
## 1 2383 467 Dean St & 4 Ave 466
## 2 431 304 Broadway & Battery Pl 3002
## 3 237 458 11 Ave & W 27 St 453
## 4 365 349 Rivington St & Ridge St 428
## 5 92 527 E 33 St & 2 Ave 528
## 6 2234 462 W 22 St & 10 Ave 3153
## end.station.name bikeid usertype birth.year gender
## 1 W 25 St & 6 Ave 24945 Subscriber 1980 1
## 2 South End Ave & Liberty St 19688 Subscriber 1988 1
## 3 W 22 St & 8 Ave 22553 Subscriber 1977 1
## 4 E 3 St & 1 Ave 17980 Subscriber 1989 2
## 5 2 Ave & E 31 St 16026 Subscriber 1963 2
## 6 E 71 St & 2 Ave 17974 Subscriber 1963 1
## startDate startTime stopDate stopTime stopDated birth.yearn age
## 1 06/01/2016 07:15:07 06/01/2016 07:54:50 2016-06-01 1980 36
## 2 06/01/2016 19:06:34 06/01/2016 19:13:46 2016-06-01 1988 28
## 3 06/02/2016 07:55:36 06/02/2016 07:59:33 2016-06-02 1977 39
## 4 06/01/2016 07:47:53 06/01/2016 07:53:59 2016-06-01 1989 27
## 5 06/01/2016 18:16:38 06/01/2016 18:18:10 2016-06-01 1963 53
## 6 06/01/2016 10:40:52 06/01/2016 11:18:07 2016-06-01 1963 53
Convert age variable into intervals and add column of intervals to dataset
citisplit$agerange <- cut(citisplit$age, breaks= c(18, 24, 30, 35, 40, 50, 60, 80),
labels = c("18-24 years", "25-30 years", "31-35 years", "36-40 years",
"41-50 years", "51-60 years", "65 plus"),
right = FALSE)
Examine riders
This histogram is skewed to the right, the majority of citibike rides tend to be younger and male
hist(citisplit$age)
hist(citisplit$gender)
If a user is a customer and not a subscriber there is no age or gender recorded
# Mean age of riders
mean(citisplit$age)
## [1] 38.74183
# Recode gender variable 1=Male, 2=Female
citisplit$gender2 <- ifelse(citisplit$gender == 1, "Male", ifelse(citisplit$gender == 2, "Female", "Other"))
head(citisplit)
## tripduration start.station.id start.station.name end.station.id
## 1 2383 467 Dean St & 4 Ave 466
## 2 431 304 Broadway & Battery Pl 3002
## 3 237 458 11 Ave & W 27 St 453
## 4 365 349 Rivington St & Ridge St 428
## 5 92 527 E 33 St & 2 Ave 528
## 6 2234 462 W 22 St & 10 Ave 3153
## end.station.name bikeid usertype birth.year gender
## 1 W 25 St & 6 Ave 24945 Subscriber 1980 1
## 2 South End Ave & Liberty St 19688 Subscriber 1988 1
## 3 W 22 St & 8 Ave 22553 Subscriber 1977 1
## 4 E 3 St & 1 Ave 17980 Subscriber 1989 2
## 5 2 Ave & E 31 St 16026 Subscriber 1963 2
## 6 E 71 St & 2 Ave 17974 Subscriber 1963 1
## startDate startTime stopDate stopTime stopDated birth.yearn age
## 1 06/01/2016 07:15:07 06/01/2016 07:54:50 2016-06-01 1980 36
## 2 06/01/2016 19:06:34 06/01/2016 19:13:46 2016-06-01 1988 28
## 3 06/02/2016 07:55:36 06/02/2016 07:59:33 2016-06-02 1977 39
## 4 06/01/2016 07:47:53 06/01/2016 07:53:59 2016-06-01 1989 27
## 5 06/01/2016 18:16:38 06/01/2016 18:18:10 2016-06-01 1963 53
## 6 06/01/2016 10:40:52 06/01/2016 11:18:07 2016-06-01 1963 53
## agerange gender2
## 1 36-40 years Male
## 2 25-30 years Male
## 3 36-40 years Male
## 4 25-30 years Female
## 5 51-60 years Female
## 6 51-60 years Male
# Crosstab agerange by subscriber
require(eeptools)
## Loading required package: eeptools
## Loading required package: ggplot2
varnames<-c('Male','Female')
crosstabs(citisplit, rowvar = "agerange",colvar = "gender2",
varnames=varnames, digits = 2)
## $TABS
## Female
## Male Female Male Other
## 18-24 years 6 29 0
## 25-30 years 60 147 0
## 31-35 years 41 111 0
## 36-40 years 28 112 1
## 41-50 years 47 154 1
## 51-60 years 27 110 0
## 65 plus 13 30 0
##
## $PROPORTIONS
## Female
## Male Female Male Other
## 18-24 years 1 3 0
## 25-30 years 7 16 0
## 31-35 years 4 12 0
## 36-40 years 3 12 0
## 41-50 years 5 17 0
## 51-60 years 3 12 0
## 65 plus 1 3 0
##
## $TABSPROPORTIONS
## Female
## Male Female Male Other
## 18-24 years 1%\n(6) 3%\n(29) 0%\n(0)
## 25-30 years 7%\n(60) 16%\n(147) 0%\n(0)
## 31-35 years 4%\n(41) 12%\n(111) 0%\n(0)
## 36-40 years 3%\n(28) 12%\n(112) 0%\n(1)
## 41-50 years 5%\n(47) 17%\n(154) 0%\n(1)
## 51-60 years 3%\n(27) 12%\n(110) 0%\n(0)
## 65 plus 1%\n(13) 3%\n(30) 0%\n(0)
Conclusion
The majority of citibike users are subscribers. The average age of citibike subscribers is 39 years old. The majority are male.
Dataset 2 Description New York City demographic data
One can use this dataset for targeted Marketing Campaign based on Demography. For example, a manufacturer of Sporting Goods or a Sport Store can choose a county for their campaign based on residents age. Another example would be to do targeted campaign on the basis of second/native language.
Read data
demographic <- read.csv('Demographic_Statistics_By_Zip_Code.csv',
header=TRUE,stringsAsFactors = FALSE)
head(demographic)
## JURISDICTION.NAME COUNT.PARTICIPANTS COUNT.FEMALE PERCENT.FEMALE
## 1 10001 44 22 0.50
## 2 10002 35 19 0.54
## 3 10003 1 1 1.00
## 4 10004 0 0 0.00
## 5 10005 2 2 1.00
## 6 10006 6 2 0.33
## COUNT.MALE PERCENT.MALE COUNT.GENDER.UNKNOWN PERCENT.GENDER.UNKNOWN
## 1 22 0.50 0 0
## 2 16 0.46 0 0
## 3 0 0.00 0 0
## 4 0 0.00 0 0
## 5 0 0.00 0 0
## 6 4 0.67 0 0
## COUNT.GENDER.TOTAL PERCENT.GENDER.TOTAL COUNT.PACIFIC.ISLANDER
## 1 44 100 0
## 2 35 100 0
## 3 1 100 0
## 4 0 0 0
## 5 2 100 0
## 6 6 100 0
## PERCENT.PACIFIC.ISLANDER COUNT.HISPANIC.LATINO PERCENT.HISPANIC.LATINO
## 1 0 16 0.36
## 2 0 1 0.03
## 3 0 0 0.00
## 4 0 0 0.00
## 5 0 0 0.00
## 6 0 2 0.33
## COUNT.AMERICAN.INDIAN PERCENT.AMERICAN.INDIAN COUNT.ASIAN.NON.HISPANIC
## 1 0 0 3
## 2 0 0 28
## 3 0 0 1
## 4 0 0 0
## 5 0 0 1
## 6 0 0 0
## PERCENT.ASIAN.NON.HISPANIC COUNT.WHITE.NON.HISPANIC
## 1 0.07 1
## 2 0.80 6
## 3 1.00 0
## 4 0.00 0
## 5 0.50 0
## 6 0.00 1
## PERCENT.WHITE.NON.HISPANIC COUNT.BLACK.NON.HISPANIC
## 1 0.02 21
## 2 0.17 0
## 3 0.00 0
## 4 0.00 0
## 5 0.00 1
## 6 0.17 3
## PERCENT.BLACK.NON.HISPANIC COUNT.OTHER.ETHNICITY PERCENT.OTHER.ETHNICITY
## 1 0.48 3 0.07
## 2 0.00 0 0.00
## 3 0.00 0 0.00
## 4 0.00 0 0.00
## 5 0.50 0 0.00
## 6 0.50 0 0.00
## COUNT.ETHNICITY.UNKNOWN PERCENT.ETHNICITY.UNKNOWN COUNT.ETHNICITY.TOTAL
## 1 0 0 44
## 2 0 0 35
## 3 0 0 1
## 4 0 0 0
## 5 0 0 2
## 6 0 0 6
## PERCENT.ETHNICITY.TOTAL COUNT.PERMANENT.RESIDENT.ALIEN
## 1 100 2
## 2 100 2
## 3 100 0
## 4 0 0
## 5 100 1
## 6 100 0
## PERCENT.PERMANENT.RESIDENT.ALIEN COUNT.US.CITIZEN PERCENT.US.CITIZEN
## 1 0.05 42 0.95
## 2 0.06 33 0.94
## 3 0.00 1 1.00
## 4 0.00 0 0.00
## 5 0.50 1 0.50
## 6 0.00 6 1.00
## COUNT.OTHER.CITIZEN.STATUS PERCENT.OTHER.CITIZEN.STATUS
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## COUNT.CITIZEN.STATUS.UNKNOWN PERCENT.CITIZEN.STATUS.UNKNOWN
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## COUNT.CITIZEN.STATUS.TOTAL PERCENT.CITIZEN.STATUS.TOTAL
## 1 44 100
## 2 35 100
## 3 1 100
## 4 0 0
## 5 2 100
## 6 6 100
## COUNT.RECEIVES.PUBLIC.ASSISTANCE PERCENT.RECEIVES.PUBLIC.ASSISTANCE
## 1 20 0.45
## 2 2 0.06
## 3 0 0.00
## 4 0 0.00
## 5 0 0.00
## 6 0 0.00
## COUNT.NRECEIVES.PUBLIC.ASSISTANCE PERCENT.NRECEIVES.PUBLIC.ASSISTANCE
## 1 24 0.55
## 2 33 0.94
## 3 1 1.00
## 4 0 0.00
## 5 2 1.00
## 6 6 1.00
## COUNT.PUBLIC.ASSISTANCE.UNKNOWN PERCENT.PUBLIC.ASSISTANCE.UNKNOWN
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## COUNT.PUBLIC.ASSISTANCE.TOTAL PERCENT.PUBLIC.ASSISTANCE.TOTAL
## 1 44 100
## 2 35 100
## 3 1 100
## 4 0 0
## 5 2 100
## 6 6 100
Convert data frame to table to make it easier
demographic <- tbl_df(demographic)
Explore the data
# Verify whether is a dataframe
class(demographic)
## [1] "tbl_df" "tbl" "data.frame"
# Check the dimensions
dim(demographic)
## [1] 236 46
# View the column names
names(demographic)
## [1] "JURISDICTION.NAME"
## [2] "COUNT.PARTICIPANTS"
## [3] "COUNT.FEMALE"
## [4] "PERCENT.FEMALE"
## [5] "COUNT.MALE"
## [6] "PERCENT.MALE"
## [7] "COUNT.GENDER.UNKNOWN"
## [8] "PERCENT.GENDER.UNKNOWN"
## [9] "COUNT.GENDER.TOTAL"
## [10] "PERCENT.GENDER.TOTAL"
## [11] "COUNT.PACIFIC.ISLANDER"
## [12] "PERCENT.PACIFIC.ISLANDER"
## [13] "COUNT.HISPANIC.LATINO"
## [14] "PERCENT.HISPANIC.LATINO"
## [15] "COUNT.AMERICAN.INDIAN"
## [16] "PERCENT.AMERICAN.INDIAN"
## [17] "COUNT.ASIAN.NON.HISPANIC"
## [18] "PERCENT.ASIAN.NON.HISPANIC"
## [19] "COUNT.WHITE.NON.HISPANIC"
## [20] "PERCENT.WHITE.NON.HISPANIC"
## [21] "COUNT.BLACK.NON.HISPANIC"
## [22] "PERCENT.BLACK.NON.HISPANIC"
## [23] "COUNT.OTHER.ETHNICITY"
## [24] "PERCENT.OTHER.ETHNICITY"
## [25] "COUNT.ETHNICITY.UNKNOWN"
## [26] "PERCENT.ETHNICITY.UNKNOWN"
## [27] "COUNT.ETHNICITY.TOTAL"
## [28] "PERCENT.ETHNICITY.TOTAL"
## [29] "COUNT.PERMANENT.RESIDENT.ALIEN"
## [30] "PERCENT.PERMANENT.RESIDENT.ALIEN"
## [31] "COUNT.US.CITIZEN"
## [32] "PERCENT.US.CITIZEN"
## [33] "COUNT.OTHER.CITIZEN.STATUS"
## [34] "PERCENT.OTHER.CITIZEN.STATUS"
## [35] "COUNT.CITIZEN.STATUS.UNKNOWN"
## [36] "PERCENT.CITIZEN.STATUS.UNKNOWN"
## [37] "COUNT.CITIZEN.STATUS.TOTAL"
## [38] "PERCENT.CITIZEN.STATUS.TOTAL"
## [39] "COUNT.RECEIVES.PUBLIC.ASSISTANCE"
## [40] "PERCENT.RECEIVES.PUBLIC.ASSISTANCE"
## [41] "COUNT.NRECEIVES.PUBLIC.ASSISTANCE"
## [42] "PERCENT.NRECEIVES.PUBLIC.ASSISTANCE"
## [43] "COUNT.PUBLIC.ASSISTANCE.UNKNOWN"
## [44] "PERCENT.PUBLIC.ASSISTANCE.UNKNOWN"
## [45] "COUNT.PUBLIC.ASSISTANCE.TOTAL"
## [46] "PERCENT.PUBLIC.ASSISTANCE.TOTAL"
Summarize the data
library(dplyr)
glimpse(demographic)
## Observations: 236
## Variables: 46
## $ JURISDICTION.NAME <int> 10001, 10002, 10003, 10004...
## $ COUNT.PARTICIPANTS <int> 44, 35, 1, 0, 2, 6, 1, 2, ...
## $ COUNT.FEMALE <int> 22, 19, 1, 0, 2, 2, 0, 0, ...
## $ PERCENT.FEMALE <dbl> 0.50, 0.54, 1.00, 0.00, 1....
## $ COUNT.MALE <int> 22, 16, 0, 0, 0, 4, 1, 2, ...
## $ PERCENT.MALE <dbl> 0.50, 0.46, 0.00, 0.00, 0....
## $ COUNT.GENDER.UNKNOWN <int> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ PERCENT.GENDER.UNKNOWN <int> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ COUNT.GENDER.TOTAL <int> 44, 35, 1, 0, 2, 6, 1, 2, ...
## $ PERCENT.GENDER.TOTAL <int> 100, 100, 100, 0, 100, 100...
## $ COUNT.PACIFIC.ISLANDER <int> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ PERCENT.PACIFIC.ISLANDER <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ COUNT.HISPANIC.LATINO <int> 16, 1, 0, 0, 0, 2, 0, 0, 0...
## $ PERCENT.HISPANIC.LATINO <dbl> 0.36, 0.03, 0.00, 0.00, 0....
## $ COUNT.AMERICAN.INDIAN <int> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ PERCENT.AMERICAN.INDIAN <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ COUNT.ASIAN.NON.HISPANIC <int> 3, 28, 1, 0, 1, 0, 1, 2, 0...
## $ PERCENT.ASIAN.NON.HISPANIC <dbl> 0.07, 0.80, 1.00, 0.00, 0....
## $ COUNT.WHITE.NON.HISPANIC <int> 1, 6, 0, 0, 0, 1, 0, 0, 0,...
## $ PERCENT.WHITE.NON.HISPANIC <dbl> 0.02, 0.17, 0.00, 0.00, 0....
## $ COUNT.BLACK.NON.HISPANIC <int> 21, 0, 0, 0, 1, 3, 0, 0, 0...
## $ PERCENT.BLACK.NON.HISPANIC <dbl> 0.48, 0.00, 0.00, 0.00, 0....
## $ COUNT.OTHER.ETHNICITY <int> 3, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ PERCENT.OTHER.ETHNICITY <dbl> 0.07, 0.00, 0.00, 0.00, 0....
## $ COUNT.ETHNICITY.UNKNOWN <int> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ PERCENT.ETHNICITY.UNKNOWN <dbl> 0.00, 0.00, 0.00, 0.00, 0....
## $ COUNT.ETHNICITY.TOTAL <int> 44, 35, 1, 0, 2, 6, 1, 2, ...
## $ PERCENT.ETHNICITY.TOTAL <int> 100, 100, 100, 0, 100, 100...
## $ COUNT.PERMANENT.RESIDENT.ALIEN <int> 2, 2, 0, 0, 1, 0, 0, 0, 0,...
## $ PERCENT.PERMANENT.RESIDENT.ALIEN <dbl> 0.05, 0.06, 0.00, 0.00, 0....
## $ COUNT.US.CITIZEN <int> 42, 33, 1, 0, 1, 6, 1, 2, ...
## $ PERCENT.US.CITIZEN <dbl> 0.95, 0.94, 1.00, 0.00, 0....
## $ COUNT.OTHER.CITIZEN.STATUS <int> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ PERCENT.OTHER.CITIZEN.STATUS <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ COUNT.CITIZEN.STATUS.UNKNOWN <int> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ PERCENT.CITIZEN.STATUS.UNKNOWN <int> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ COUNT.CITIZEN.STATUS.TOTAL <int> 44, 35, 1, 0, 2, 6, 1, 2, ...
## $ PERCENT.CITIZEN.STATUS.TOTAL <int> 100, 100, 100, 0, 100, 100...
## $ COUNT.RECEIVES.PUBLIC.ASSISTANCE <int> 20, 2, 0, 0, 0, 0, 1, 0, 0...
## $ PERCENT.RECEIVES.PUBLIC.ASSISTANCE <dbl> 0.45, 0.06, 0.00, 0.00, 0....
## $ COUNT.NRECEIVES.PUBLIC.ASSISTANCE <int> 24, 33, 1, 0, 2, 6, 0, 2, ...
## $ PERCENT.NRECEIVES.PUBLIC.ASSISTANCE <dbl> 0.55, 0.94, 1.00, 0.00, 1....
## $ COUNT.PUBLIC.ASSISTANCE.UNKNOWN <int> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ PERCENT.PUBLIC.ASSISTANCE.UNKNOWN <int> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ COUNT.PUBLIC.ASSISTANCE.TOTAL <int> 44, 35, 1, 0, 2, 6, 1, 2, ...
## $ PERCENT.PUBLIC.ASSISTANCE.TOTAL <int> 100, 100, 100, 0, 100, 100...
# Summurize data
summary(demographic)
## JURISDICTION.NAME COUNT.PARTICIPANTS COUNT.FEMALE PERCENT.FEMALE
## Min. :10001 Min. : 0.00 Min. : 0.0 Min. :0.000
## 1st Qu.:10452 1st Qu.: 0.00 1st Qu.: 0.0 1st Qu.:0.000
## Median :11216 Median : 0.00 Median : 0.0 Median :0.000
## Mean :11127 Mean : 17.66 Mean : 10.3 Mean :0.244
## 3rd Qu.:11422 3rd Qu.: 13.00 3rd Qu.: 6.0 3rd Qu.:0.515
## Max. :20459 Max. :272.00 Max. :194.0 Max. :1.000
## COUNT.MALE PERCENT.MALE COUNT.GENDER.UNKNOWN
## Min. : 0.000 Min. :0.000 Min. :0
## 1st Qu.: 0.000 1st Qu.:0.000 1st Qu.:0
## Median : 0.000 Median :0.000 Median :0
## Mean : 7.364 Mean :0.201 Mean :0
## 3rd Qu.: 4.000 3rd Qu.:0.400 3rd Qu.:0
## Max. :157.000 Max. :1.000 Max. :0
## PERCENT.GENDER.UNKNOWN COUNT.GENDER.TOTAL PERCENT.GENDER.TOTAL
## Min. :0 Min. : 0.00 Min. : 0.00
## 1st Qu.:0 1st Qu.: 0.00 1st Qu.: 0.00
## Median :0 Median : 0.00 Median : 0.00
## Mean :0 Mean : 17.66 Mean : 44.49
## 3rd Qu.:0 3rd Qu.: 13.00 3rd Qu.:100.00
## Max. :0 Max. :272.00 Max. :100.00
## COUNT.PACIFIC.ISLANDER PERCENT.PACIFIC.ISLANDER COUNT.HISPANIC.LATINO
## Min. :0.00000 Min. :0.0000000 Min. : 0.000
## 1st Qu.:0.00000 1st Qu.:0.0000000 1st Qu.: 0.000
## Median :0.00000 Median :0.0000000 Median : 0.000
## Mean :0.02542 Mean :0.0002966 Mean : 1.856
## 3rd Qu.:0.00000 3rd Qu.:0.0000000 3rd Qu.: 0.000
## Max. :2.00000 Max. :0.0200000 Max. :51.000
## PERCENT.HISPANIC.LATINO COUNT.AMERICAN.INDIAN PERCENT.AMERICAN.INDIAN
## Min. :0.00000 Min. :0.00000 Min. :0.000000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.000000
## Median :0.00000 Median :0.00000 Median :0.000000
## Mean :0.07983 Mean :0.02119 Mean :0.001017
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.000000
## Max. :1.00000 Max. :2.00000 Max. :0.200000
## COUNT.ASIAN.NON.HISPANIC PERCENT.ASIAN.NON.HISPANIC
## Min. : 0.0000 Min. :0.00000
## 1st Qu.: 0.0000 1st Qu.:0.00000
## Median : 0.0000 Median :0.00000
## Mean : 0.5254 Mean :0.05657
## 3rd Qu.: 0.0000 3rd Qu.:0.00000
## Max. :28.0000 Max. :1.00000
## COUNT.WHITE.NON.HISPANIC PERCENT.WHITE.NON.HISPANIC
## Min. : 0.00 Min. :0.0000
## 1st Qu.: 0.00 1st Qu.:0.0000
## Median : 0.00 Median :0.0000
## Mean : 12.19 Mean :0.1778
## 3rd Qu.: 1.00 3rd Qu.:0.0225
## Max. :262.00 Max. :1.0000
## COUNT.BLACK.NON.HISPANIC PERCENT.BLACK.NON.HISPANIC COUNT.OTHER.ETHNICITY
## Min. : 0.000 Min. :0.0000 Min. : 0.0000
## 1st Qu.: 0.000 1st Qu.:0.0000 1st Qu.: 0.0000
## Median : 0.000 Median :0.0000 Median : 0.0000
## Mean : 2.233 Mean :0.1042 Mean : 0.6864
## 3rd Qu.: 0.000 3rd Qu.:0.0000 3rd Qu.: 0.0000
## Max. :60.000 Max. :1.0000 Max. :17.0000
## PERCENT.OTHER.ETHNICITY COUNT.ETHNICITY.UNKNOWN PERCENT.ETHNICITY.UNKNOWN
## Min. :0.00000 Min. :0.0000 Min. :0.000000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.000000
## Median :0.00000 Median :0.0000 Median :0.000000
## Mean :0.02131 Mean :0.1229 Mean :0.003941
## 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.000000
## Max. :0.50000 Max. :5.0000 Max. :0.250000
## COUNT.ETHNICITY.TOTAL PERCENT.ETHNICITY.TOTAL
## Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 0.00 Median : 0.00
## Mean : 17.66 Mean : 44.45
## 3rd Qu.: 13.00 3rd Qu.:100.00
## Max. :272.00 Max. :100.00
## COUNT.PERMANENT.RESIDENT.ALIEN PERCENT.PERMANENT.RESIDENT.ALIEN
## Min. : 0.0000 Min. :0.00000
## 1st Qu.: 0.0000 1st Qu.:0.00000
## Median : 0.0000 Median :0.00000
## Mean : 0.4449 Mean :0.02419
## 3rd Qu.: 0.0000 3rd Qu.:0.00000
## Max. :10.0000 Max. :1.00000
## COUNT.US.CITIZEN PERCENT.US.CITIZEN COUNT.OTHER.CITIZEN.STATUS
## Min. : 0.00 Min. :0.0000 Min. :0.00000
## 1st Qu.: 0.00 1st Qu.:0.0000 1st Qu.:0.00000
## Median : 0.00 Median :0.0000 Median :0.00000
## Mean : 17.17 Mean :0.4187 Mean :0.04661
## 3rd Qu.: 12.00 3rd Qu.:0.9900 3rd Qu.:0.00000
## Max. :271.00 Max. :1.0000 Max. :2.00000
## PERCENT.OTHER.CITIZEN.STATUS COUNT.CITIZEN.STATUS.UNKNOWN
## Min. :0.000000 Min. :0
## 1st Qu.:0.000000 1st Qu.:0
## Median :0.000000 Median :0
## Mean :0.002076 Mean :0
## 3rd Qu.:0.000000 3rd Qu.:0
## Max. :0.330000 Max. :0
## PERCENT.CITIZEN.STATUS.UNKNOWN COUNT.CITIZEN.STATUS.TOTAL
## Min. :0 Min. : 0.00
## 1st Qu.:0 1st Qu.: 0.00
## Median :0 Median : 0.00
## Mean :0 Mean : 17.66
## 3rd Qu.:0 3rd Qu.: 13.00
## Max. :0 Max. :272.00
## PERCENT.CITIZEN.STATUS.TOTAL COUNT.RECEIVES.PUBLIC.ASSISTANCE
## Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 0.000
## Median : 0.00 Median : 0.000
## Mean : 44.49 Mean : 5.975
## 3rd Qu.:100.00 3rd Qu.: 3.250
## Max. :100.00 Max. :155.000
## PERCENT.RECEIVES.PUBLIC.ASSISTANCE COUNT.NRECEIVES.PUBLIC.ASSISTANCE
## Min. :0.0000 Min. : 0.00
## 1st Qu.:0.0000 1st Qu.: 0.00
## Median :0.0000 Median : 0.00
## Mean :0.1392 Mean : 11.69
## 3rd Qu.:0.2525 3rd Qu.: 8.00
## Max. :1.0000 Max. :206.00
## PERCENT.NRECEIVES.PUBLIC.ASSISTANCE COUNT.PUBLIC.ASSISTANCE.UNKNOWN
## Min. :0.0000 Min. :0
## 1st Qu.:0.0000 1st Qu.:0
## Median :0.0000 Median :0
## Mean :0.3058 Mean :0
## 3rd Qu.:0.6625 3rd Qu.:0
## Max. :1.0000 Max. :0
## PERCENT.PUBLIC.ASSISTANCE.UNKNOWN COUNT.PUBLIC.ASSISTANCE.TOTAL
## Min. :0 Min. : 0.00
## 1st Qu.:0 1st Qu.: 0.00
## Median :0 Median : 0.00
## Mean :0 Mean : 17.66
## 3rd Qu.:0 3rd Qu.: 13.00
## Max. :0 Max. :272.00
## PERCENT.PUBLIC.ASSISTANCE.TOTAL
## Min. : 0.00
## 1st Qu.: 0.00
## Median : 0.00
## Mean : 44.49
## 3rd Qu.:100.00
## Max. :100.00
# View the top of the data
head(demographic)
## # A tibble: 6 × 46
## JURISDICTION.NAME COUNT.PARTICIPANTS COUNT.FEMALE PERCENT.FEMALE
## <int> <int> <int> <dbl>
## 1 10001 44 22 0.50
## 2 10002 35 19 0.54
## 3 10003 1 1 1.00
## 4 10004 0 0 0.00
## 5 10005 2 2 1.00
## 6 10006 6 2 0.33
## # ... with 42 more variables: COUNT.MALE <int>, PERCENT.MALE <dbl>,
## # COUNT.GENDER.UNKNOWN <int>, PERCENT.GENDER.UNKNOWN <int>,
## # COUNT.GENDER.TOTAL <int>, PERCENT.GENDER.TOTAL <int>,
## # COUNT.PACIFIC.ISLANDER <int>, PERCENT.PACIFIC.ISLANDER <dbl>,
## # COUNT.HISPANIC.LATINO <int>, PERCENT.HISPANIC.LATINO <dbl>,
## # COUNT.AMERICAN.INDIAN <int>, PERCENT.AMERICAN.INDIAN <dbl>,
## # COUNT.ASIAN.NON.HISPANIC <int>, PERCENT.ASIAN.NON.HISPANIC <dbl>,
## # COUNT.WHITE.NON.HISPANIC <int>, PERCENT.WHITE.NON.HISPANIC <dbl>,
## # COUNT.BLACK.NON.HISPANIC <int>, PERCENT.BLACK.NON.HISPANIC <dbl>,
## # COUNT.OTHER.ETHNICITY <int>, PERCENT.OTHER.ETHNICITY <dbl>,
## # COUNT.ETHNICITY.UNKNOWN <int>, PERCENT.ETHNICITY.UNKNOWN <dbl>,
## # COUNT.ETHNICITY.TOTAL <int>, PERCENT.ETHNICITY.TOTAL <int>,
## # COUNT.PERMANENT.RESIDENT.ALIEN <int>,
## # PERCENT.PERMANENT.RESIDENT.ALIEN <dbl>, COUNT.US.CITIZEN <int>,
## # PERCENT.US.CITIZEN <dbl>, COUNT.OTHER.CITIZEN.STATUS <int>,
## # PERCENT.OTHER.CITIZEN.STATUS <dbl>,
## # COUNT.CITIZEN.STATUS.UNKNOWN <int>,
## # PERCENT.CITIZEN.STATUS.UNKNOWN <int>,
## # COUNT.CITIZEN.STATUS.TOTAL <int>, PERCENT.CITIZEN.STATUS.TOTAL <int>,
## # COUNT.RECEIVES.PUBLIC.ASSISTANCE <int>,
## # PERCENT.RECEIVES.PUBLIC.ASSISTANCE <dbl>,
## # COUNT.NRECEIVES.PUBLIC.ASSISTANCE <int>,
## # PERCENT.NRECEIVES.PUBLIC.ASSISTANCE <dbl>,
## # COUNT.PUBLIC.ASSISTANCE.UNKNOWN <int>,
## # PERCENT.PUBLIC.ASSISTANCE.UNKNOWN <int>,
## # COUNT.PUBLIC.ASSISTANCE.TOTAL <int>,
## # PERCENT.PUBLIC.ASSISTANCE.TOTAL <int>
# View end of the data
tail(demographic)
## # A tibble: 6 × 46
## JURISDICTION.NAME COUNT.PARTICIPANTS COUNT.FEMALE PERCENT.FEMALE
## <int> <int> <int> <dbl>
## 1 12786 4 3 0.75
## 2 12788 83 39 0.47
## 3 12789 272 115 0.42
## 4 13731 17 2 0.12
## 5 16091 0 0 0.00
## 6 20459 0 0 0.00
## # ... with 42 more variables: COUNT.MALE <int>, PERCENT.MALE <dbl>,
## # COUNT.GENDER.UNKNOWN <int>, PERCENT.GENDER.UNKNOWN <int>,
## # COUNT.GENDER.TOTAL <int>, PERCENT.GENDER.TOTAL <int>,
## # COUNT.PACIFIC.ISLANDER <int>, PERCENT.PACIFIC.ISLANDER <dbl>,
## # COUNT.HISPANIC.LATINO <int>, PERCENT.HISPANIC.LATINO <dbl>,
## # COUNT.AMERICAN.INDIAN <int>, PERCENT.AMERICAN.INDIAN <dbl>,
## # COUNT.ASIAN.NON.HISPANIC <int>, PERCENT.ASIAN.NON.HISPANIC <dbl>,
## # COUNT.WHITE.NON.HISPANIC <int>, PERCENT.WHITE.NON.HISPANIC <dbl>,
## # COUNT.BLACK.NON.HISPANIC <int>, PERCENT.BLACK.NON.HISPANIC <dbl>,
## # COUNT.OTHER.ETHNICITY <int>, PERCENT.OTHER.ETHNICITY <dbl>,
## # COUNT.ETHNICITY.UNKNOWN <int>, PERCENT.ETHNICITY.UNKNOWN <dbl>,
## # COUNT.ETHNICITY.TOTAL <int>, PERCENT.ETHNICITY.TOTAL <int>,
## # COUNT.PERMANENT.RESIDENT.ALIEN <int>,
## # PERCENT.PERMANENT.RESIDENT.ALIEN <dbl>, COUNT.US.CITIZEN <int>,
## # PERCENT.US.CITIZEN <dbl>, COUNT.OTHER.CITIZEN.STATUS <int>,
## # PERCENT.OTHER.CITIZEN.STATUS <dbl>,
## # COUNT.CITIZEN.STATUS.UNKNOWN <int>,
## # PERCENT.CITIZEN.STATUS.UNKNOWN <int>,
## # COUNT.CITIZEN.STATUS.TOTAL <int>, PERCENT.CITIZEN.STATUS.TOTAL <int>,
## # COUNT.RECEIVES.PUBLIC.ASSISTANCE <int>,
## # PERCENT.RECEIVES.PUBLIC.ASSISTANCE <dbl>,
## # COUNT.NRECEIVES.PUBLIC.ASSISTANCE <int>,
## # PERCENT.NRECEIVES.PUBLIC.ASSISTANCE <dbl>,
## # COUNT.PUBLIC.ASSISTANCE.UNKNOWN <int>,
## # PERCENT.PUBLIC.ASSISTANCE.UNKNOWN <int>,
## # COUNT.PUBLIC.ASSISTANCE.TOTAL <int>,
## # PERCENT.PUBLIC.ASSISTANCE.TOTAL <int>
Visualize the data
# Histrogram of data
hist(demographic$PERCENT.FEMALE)
# Scatter plot comparing citizen status total and public assistance total
plot(demographic$COUNT.CITIZEN.STATUS.TOTAL, demographic$COUNT.PUBLIC.ASSISTANCE.TOTAL)
Select columns fromd data set required to perform the analysis
demoSplit <- select(demographic, JURISDICTION.NAME, COUNT.PARTICIPANTS, COUNT.FEMALE,
PERCENT.FEMALE, COUNT.MALE, COUNT.PACIFIC.ISLANDER,
COUNT.HISPANIC.LATINO, COUNT.AMERICAN.INDIAN,
COUNT.ASIAN.NON.HISPANIC,COUNT.WHITE.NON.HISPANIC,
COUNT.BLACK.NON.HISPANIC, COUNT.OTHER.ETHNICITY,
COUNT.PERMANENT.RESIDENT.ALIEN, COUNT.US.CITIZEN,
COUNT.RECEIVES.PUBLIC.ASSISTANCE, COUNT.NRECEIVES.PUBLIC.ASSISTANCE,
COUNT.PUBLIC.ASSISTANCE.TOTAL)
head(demoSplit)
## # A tibble: 6 × 17
## JURISDICTION.NAME COUNT.PARTICIPANTS COUNT.FEMALE PERCENT.FEMALE
## <int> <int> <int> <dbl>
## 1 10001 44 22 0.50
## 2 10002 35 19 0.54
## 3 10003 1 1 1.00
## 4 10004 0 0 0.00
## 5 10005 2 2 1.00
## 6 10006 6 2 0.33
## # ... with 13 more variables: COUNT.MALE <int>,
## # COUNT.PACIFIC.ISLANDER <int>, COUNT.HISPANIC.LATINO <int>,
## # COUNT.AMERICAN.INDIAN <int>, COUNT.ASIAN.NON.HISPANIC <int>,
## # COUNT.WHITE.NON.HISPANIC <int>, COUNT.BLACK.NON.HISPANIC <int>,
## # COUNT.OTHER.ETHNICITY <int>, COUNT.PERMANENT.RESIDENT.ALIEN <int>,
## # COUNT.US.CITIZEN <int>, COUNT.RECEIVES.PUBLIC.ASSISTANCE <int>,
## # COUNT.NRECEIVES.PUBLIC.ASSISTANCE <int>,
## # COUNT.PUBLIC.ASSISTANCE.TOTAL <int>
Identiy zipcodes where female participants live – the owner of a women’s jewelry store would like to determine the ideal zipcode to open her store.
# Number of males and female in dataset
# Female by zipcodes
sum(demoSplit$COUNT.FEMALE, demoSplit$COUNT.MALE)
## [1] 4168
demoFemale <- select(demoSplit, JURISDICTION.NAME, COUNT.FEMALE, PERCENT.FEMALE,
COUNT.NRECEIVES.PUBLIC.ASSISTANCE)
demoFemale <- filter(demoFemale, JURISDICTION.NAME,
COUNT.NRECEIVES.PUBLIC.ASSISTANCE,
PERCENT.FEMALE > 0)
arrange(demoFemale, PERCENT.FEMALE)
## # A tibble: 94 × 4
## JURISDICTION.NAME COUNT.FEMALE PERCENT.FEMALE
## <int> <int> <dbl>
## 1 12768 1 0.02
## 2 12528 2 0.04
## 3 12423 1 0.05
## 4 12435 5 0.09
## 5 13731 2 0.12
## 6 10013 1 0.13
## 7 11220 1 0.13
## 8 11510 6 0.21
## 9 11206 3 0.23
## 10 11434 1 0.25
## # ... with 84 more rows, and 1 more variables:
## # COUNT.NRECEIVES.PUBLIC.ASSISTANCE <int>
Conclusion
A profile of women by zipcode that do not receive public assistance. This would be important for a business owner interested in opening a business that caters to women.
Dataset 3 Description
People visit hospitals for treatment of medical conditions. Providers charge differently in different parts of the country for the same service. These data include information comparing the charges for the 100 most common inpatient services and 30 common outpatient services.
In this dataset you will notice that providers charge for items/services provided to patients. It will be interestingly to perform exploratory data analysis on this dataset to determine how charges differ based on the location of the provider.
Read data
provider <- read.csv('IPPS__Provider_Summary_FY2011.csv', header=TRUE,stringsAsFactors = FALSE)
head(provider)
## DRG.Definition Provider.Id
## 1 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10001
## 2 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10005
## 3 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10006
## 4 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10011
## 5 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10016
## 6 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10023
## Provider.Name Provider.Street.Address
## 1 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE
## 2 MARSHALL MEDICAL CENTER SOUTH 2505 U S HIGHWAY 431 NORTH
## 3 ELIZA COFFEE MEMORIAL HOSPITAL 205 MARENGO STREET
## 4 ST VINCENT'S EAST 50 MEDICAL PARK EAST DRIVE
## 5 SHELBY BAPTIST MEDICAL CENTER 1000 FIRST STREET NORTH
## 6 BAPTIST MEDICAL CENTER SOUTH 2105 EAST SOUTH BOULEVARD
## Provider.City Provider.State Provider.Zip.Code
## 1 DOTHAN AL 36301
## 2 BOAZ AL 35957
## 3 FLORENCE AL 35631
## 4 BIRMINGHAM AL 35235
## 5 ALABASTER AL 35007
## 6 MONTGOMERY AL 36116
## Hospital.Referral.Region.Description Total.Discharges
## 1 AL - Dothan 91
## 2 AL - Birmingham 14
## 3 AL - Birmingham 24
## 4 AL - Birmingham 25
## 5 AL - Birmingham 18
## 6 AL - Montgomery 67
## Average.Covered.Charges Average.Total.Payments Average.Medicare.Payments
## 1 $32,963.07 $5,777.24 $4,763.73
## 2 $15,131.85 $5,787.57 $4,976.71
## 3 $37,560.37 $5,434.95 $4,453.79
## 4 $13,998.28 $5,417.56 $4,129.16
## 5 $31,633.27 $5,658.33 $4,851.44
## 6 $16,920.79 $6,653.80 $5,374.14
Convert data frame to a table
provider <- tbl_df(provider)
head(provider)
## # A tibble: 6 × 12
## DRG.Definition Provider.Id
## <chr> <int>
## 1 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10001
## 2 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10005
## 3 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10006
## 4 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10011
## 5 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10016
## 6 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10023
## # ... with 10 more variables: Provider.Name <chr>,
## # Provider.Street.Address <chr>, Provider.City <chr>,
## # Provider.State <chr>, Provider.Zip.Code <int>,
## # Hospital.Referral.Region.Description <chr>, Total.Discharges <int>,
## # Average.Covered.Charges <chr>, Average.Total.Payments <chr>,
## # Average.Medicare.Payments <chr>
Examine the data
# Check data class
class(provider)
## [1] "tbl_df" "tbl" "data.frame"
# Check dimensions of the data
dim(provider)
## [1] 65499 12
# View the column names
names(provider)
## [1] "DRG.Definition"
## [2] "Provider.Id"
## [3] "Provider.Name"
## [4] "Provider.Street.Address"
## [5] "Provider.City"
## [6] "Provider.State"
## [7] "Provider.Zip.Code"
## [8] "Hospital.Referral.Region.Description"
## [9] "Total.Discharges"
## [10] "Average.Covered.Charges"
## [11] "Average.Total.Payments"
## [12] "Average.Medicare.Payments"
# View the end of the data
tail(provider)
## # A tibble: 6 × 12
## DRG.Definition Provider.Id
## <chr> <int>
## 1 292 - HEART FAILURE & SHOCK W CC 440194
## 2 292 - HEART FAILURE & SHOCK W CC 440197
## 3 292 - HEART FAILURE & SHOCK W CC 440200
## 4 292 - HEART FAILURE & SHOCK W CC 440227
## 5 292 - HEART FAILURE & SHOCK W CC 440228
## 6 292 - HEART FAILURE & SHOCK W CC 450002
## # ... with 10 more variables: Provider.Name <chr>,
## # Provider.Street.Address <chr>, Provider.City <chr>,
## # Provider.State <chr>, Provider.Zip.Code <int>,
## # Hospital.Referral.Region.Description <chr>, Total.Discharges <int>,
## # Average.Covered.Charges <chr>, Average.Total.Payments <chr>,
## # Average.Medicare.Payments <chr>
View the structure of the data
library(dplyr)
glimpse(provider)
## Observations: 65,499
## Variables: 12
## $ DRG.Definition <chr> "039 - EXTRACRANIAL PROCE...
## $ Provider.Id <int> 10001, 10005, 10006, 1001...
## $ Provider.Name <chr> "SOUTHEAST ALABAMA MEDICA...
## $ Provider.Street.Address <chr> "1108 ROSS CLARK CIRCLE",...
## $ Provider.City <chr> "DOTHAN", "BOAZ", "FLOREN...
## $ Provider.State <chr> "AL", "AL", "AL", "AL", "...
## $ Provider.Zip.Code <int> 36301, 35957, 35631, 3523...
## $ Hospital.Referral.Region.Description <chr> "AL - Dothan", "AL - Birm...
## $ Total.Discharges <int> 91, 14, 24, 25, 18, 67, 5...
## $ Average.Covered.Charges <chr> "$32,963.07", "$15,131.85...
## $ Average.Total.Payments <chr> "$5,777.24", "$5,787.57",...
## $ Average.Medicare.Payments <chr> "$4,763.73", "$4,976.71",...
summary(provider)
## DRG.Definition Provider.Id Provider.Name
## Length:65499 Min. : 10001 Length:65499
## Class :character 1st Qu.:110083 Class :character
## Mode :character Median :240166 Mode :character
## Mean :253788
## 3rd Qu.:380017
## Max. :670075
## Provider.Street.Address Provider.City Provider.State
## Length:65499 Length:65499 Length:65499
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## Provider.Zip.Code Hospital.Referral.Region.Description Total.Discharges
## Min. : 1040 Length:65499 Min. : 11.00
## 1st Qu.:27511 Class :character 1st Qu.: 17.00
## Median :44203 Mode :character Median : 28.00
## Mean :47811 Mean : 41.23
## 3rd Qu.:72308 3rd Qu.: 50.00
## Max. :99801 Max. :981.00
## Average.Covered.Charges Average.Total.Payments Average.Medicare.Payments
## Length:65499 Length:65499 Length:65499
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
names(provider)
## [1] "DRG.Definition"
## [2] "Provider.Id"
## [3] "Provider.Name"
## [4] "Provider.Street.Address"
## [5] "Provider.City"
## [6] "Provider.State"
## [7] "Provider.Zip.Code"
## [8] "Hospital.Referral.Region.Description"
## [9] "Total.Discharges"
## [10] "Average.Covered.Charges"
## [11] "Average.Total.Payments"
## [12] "Average.Medicare.Payments"
Look at missing values
sum(is.na(provider))
## [1] 0
Create dataset with variables to be used for analysis
providerSel <- select(provider, DRG.Definition, Provider.Id, Provider.Name,
Provider.State, Average.Covered.Charges, Average.Total.Payments,
Average.Medicare.Payments)
names(providerSel)
## [1] "DRG.Definition" "Provider.Id"
## [3] "Provider.Name" "Provider.State"
## [5] "Average.Covered.Charges" "Average.Total.Payments"
## [7] "Average.Medicare.Payments"
Examine maximum and minimum average covered charges by state
# Select columns for analysis
providerSelPay <- select(providerSel, Provider.State, Average.Covered.Charges,
Average.Total.Payments, Average.Medicare.Payments)
# Remove commas and $ from data and convert to numeric
providerSelPay$Average.Covered.Charges2 = as.numeric(gsub("[$,]", "",
providerSelPay$Average.Covered.Charges))
providerSelPay$Average.Total.Payments2 = as.numeric(gsub("[$,]", "",
providerSelPay$Average.Total.Payments))
providerSelPay$Average.Medicare.Payments2 = as.numeric(gsub("[$,]", "",
providerSelPay$Average.Medicare.Payments))
Calculate means
# Mean of Average.Covered.Charges
mean(providerSelPay$Average.Covered.Charges2)
## [1] 39000.78
# Mean of Average Total Payments
mean(providerSelPay$Average.Total.Payments2)
## [1] 10169.43
# Mean Average Medicare Payments
mean(providerSelPay$Average.Medicare.Payments2)
## [1] 8932.729
Aggregate data for max value and mean by state
# Max by state
temp <- aggregate(Average.Total.Payments2 ~ Provider.State, providerSelPay, function(x) max(x))
arrange(temp, Average.Total.Payments2)
## Provider.State Average.Total.Payments2
## 1 MT 31074.73
## 2 ND 36440.81
## 3 NM 37069.00
## 4 WY 37486.00
## 5 NE 41186.40
## 6 AR 41460.77
## 7 DE 41724.28
## 8 ME 41980.32
## 9 WV 42507.68
## 10 AL 42550.21
## 11 MS 42693.15
## 12 TN 45062.00
## 13 KY 46301.00
## 14 VT 47048.91
## 15 UT 47584.64
## 16 RI 47912.69
## 17 OR 48050.83
## 18 OK 48065.66
## 19 IA 48488.18
## 20 KS 49510.75
## 21 MO 50506.85
## 22 NH 51065.75
## 23 ID 51141.92
## 24 GA 51756.66
## 25 OH 52050.45
## 26 LA 53360.92
## 27 TX 54004.36
## 28 IN 54253.25
## 29 HI 54559.00
## 30 NC 57591.39
## 31 VA 57963.38
## 32 CO 58690.73
## 33 WI 61774.00
## 34 AZ 62268.76
## 35 NJ 62362.50
## 36 AK 62665.37
## 37 MN 62770.42
## 38 FL 62807.04
## 39 MA 64026.55
## 40 NV 64959.26
## 41 CT 66397.45
## 42 DC 67720.37
## 43 MD 76657.81
## 44 SC 76686.46
## 45 PA 80549.67
## 46 MI 83332.21
## 47 SD 83484.14
## 48 WA 84499.26
## 49 IL 85987.00
## 50 NY 99929.46
## 51 CA 156158.18
# Mean by state
temp2 <-aggregate(Average.Total.Payments2 ~ Provider.State, providerSelPay, function(x) mean(x))
arrange(temp2, Average.Total.Payments2)
## Provider.State Average.Total.Payments2
## 1 AL 7964.114
## 2 TN 8530.343
## 3 AR 8579.398
## 4 IA 8617.359
## 5 WV 8650.539
## 6 OK 8677.653
## 7 MS 8687.839
## 8 KS 8786.129
## 9 KY 8909.264
## 10 ME 8921.078
## 11 GA 9114.459
## 12 MO 9151.982
## 13 IN 9161.962
## 14 MT 9244.758
## 15 LA 9310.777
## 16 VA 9356.982
## 17 FL 9362.872
## 18 OH 9370.281
## 19 NC 9377.806
## 20 CO 9517.121
## 21 NH 9538.795
## 22 NE 9545.152
## 23 SC 9605.290
## 24 PA 9610.712
## 25 TX 9710.581
## 26 ID 9723.252
## 27 WI 9831.411
## 28 NM 9947.577
## 29 UT 10175.722
## 30 MI 10184.972
## 31 IL 10343.051
## 32 SD 10364.195
## 33 MN 10382.305
## 34 OR 10432.892
## 35 ND 10436.264
## 36 AZ 10592.956
## 37 RI 10856.980
## 38 MA 10858.709
## 39 DE 10953.379
## 40 NV 10957.796
## 41 WA 11073.249
## 42 NJ 11273.043
## 43 CT 11880.119
## 44 WY 12102.500
## 45 VT 12109.385
## 46 NY 12364.821
## 47 CA 13194.969
## 48 HI 13272.549
## 49 MD 13319.723
## 50 DC 13741.709
## 51 AK 14721.429
Conclusion Charges by provider vary by states. Average total payments are the highest in California and lowest in Montana.