Introduction

The purpose of this assignment is to prepare data from several sources for analysis.

For each of the three chosen datasets: 1) Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below.

2) Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!]  Perform the analysis requested in the discussion item.  Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.

Dataset 1 Description

This data set is New York City Citibike usage during June 2016.

Analysis: 1. Determine the profile of a Citibike rider 2. Time of day the longest rides occur

Read data into R

citibike <- read.csv('201606_citibike_tripdata.csv',header=TRUE,stringsAsFactors = FALSE)

head(citibike)
##   tripduration         starttime          stoptime start.station.id
## 1         1470 6/1/2016 00:00:18 6/1/2016 00:24:48              380
## 2          229 6/1/2016 00:00:20 6/1/2016 00:04:09             3092
## 3          344 6/1/2016 00:00:21 6/1/2016 00:06:06              449
## 4         1120 6/1/2016 00:00:28 6/1/2016 00:19:09              522
## 5          229 6/1/2016 00:00:53 6/1/2016 00:04:42              335
## 6          946 6/1/2016 00:01:01 6/1/2016 00:16:48              503
##         start.station.name start.station.latitude start.station.longitude
## 1         W 4 St & 7 Ave S               40.73401               -74.00294
## 2        Berry St & N 8 St               40.71901               -73.95853
## 3          W 52 St & 9 Ave               40.76462               -73.98789
## 4  E 51 St & Lexington Ave               40.75715               -73.97208
## 5 Washington Pl & Broadway               40.72904               -73.99405
## 6       E 20 St & Park Ave               40.73827               -73.98752
##   end.station.id        end.station.name end.station.latitude
## 1           3236      W 42 St & Dyer Ave             40.75898
## 2           3103     N 11 St & Wythe Ave             40.72153
## 3            469      Broadway & W 53 St             40.76344
## 4            401 Allen St & Rivington St             40.72020
## 5            285      Broadway & E 14 St             40.73455
## 6            495        W 47 St & 10 Ave             40.76270
##   end.station.longitude bikeid   usertype birth.year gender
## 1             -73.99380  19859 Subscriber       1972      1
## 2             -73.95782  16233 Subscriber       1967      1
## 3             -73.98268  22397 Subscriber       1989      1
## 4             -73.98998  16231 Subscriber       1991      1
## 5             -73.99074  15400 Subscriber       1989      1
## 6             -73.99301  25193 Subscriber       1974      1

Explore the data set

# Verify whether is a data.frame
class(citibike)
## [1] "data.frame"
# Check the dimensions
dim(citibike)
## [1] 65499    15
# View the column names
names(citibike)
##  [1] "tripduration"            "starttime"              
##  [3] "stoptime"                "start.station.id"       
##  [5] "start.station.name"      "start.station.latitude" 
##  [7] "start.station.longitude" "end.station.id"         
##  [9] "end.station.name"        "end.station.latitude"   
## [11] "end.station.longitude"   "bikeid"                 
## [13] "usertype"                "birth.year"             
## [15] "gender"
# View the beginning of the data
head(citibike)
##   tripduration         starttime          stoptime start.station.id
## 1         1470 6/1/2016 00:00:18 6/1/2016 00:24:48              380
## 2          229 6/1/2016 00:00:20 6/1/2016 00:04:09             3092
## 3          344 6/1/2016 00:00:21 6/1/2016 00:06:06              449
## 4         1120 6/1/2016 00:00:28 6/1/2016 00:19:09              522
## 5          229 6/1/2016 00:00:53 6/1/2016 00:04:42              335
## 6          946 6/1/2016 00:01:01 6/1/2016 00:16:48              503
##         start.station.name start.station.latitude start.station.longitude
## 1         W 4 St & 7 Ave S               40.73401               -74.00294
## 2        Berry St & N 8 St               40.71901               -73.95853
## 3          W 52 St & 9 Ave               40.76462               -73.98789
## 4  E 51 St & Lexington Ave               40.75715               -73.97208
## 5 Washington Pl & Broadway               40.72904               -73.99405
## 6       E 20 St & Park Ave               40.73827               -73.98752
##   end.station.id        end.station.name end.station.latitude
## 1           3236      W 42 St & Dyer Ave             40.75898
## 2           3103     N 11 St & Wythe Ave             40.72153
## 3            469      Broadway & W 53 St             40.76344
## 4            401 Allen St & Rivington St             40.72020
## 5            285      Broadway & E 14 St             40.73455
## 6            495        W 47 St & 10 Ave             40.76270
##   end.station.longitude bikeid   usertype birth.year gender
## 1             -73.99380  19859 Subscriber       1972      1
## 2             -73.95782  16233 Subscriber       1967      1
## 3             -73.98268  22397 Subscriber       1989      1
## 4             -73.98998  16231 Subscriber       1991      1
## 5             -73.99074  15400 Subscriber       1989      1
## 6             -73.99301  25193 Subscriber       1974      1
# View end of the data
tail(citibike)
##       tripduration         starttime          stoptime start.station.id
## 65494          930 6/2/2016 08:58:40 6/2/2016 09:14:10              459
## 65495          482 6/2/2016 08:58:38 6/2/2016 09:06:41              237
## 65496          455 6/2/2016 08:58:39 6/2/2016 09:06:15              362
## 65497         2369 6/2/2016 08:58:41 6/2/2016 09:38:10              486
## 65498          575 6/2/2016 08:58:41 6/2/2016 09:08:17              250
## 65499         1892 6/2/2016 08:58:47 6/2/2016 09:30:19              426
##             start.station.name start.station.latitude
## 65494         W 20 St & 11 Ave               40.74674
## 65495          E 11 St & 2 Ave               40.73047
## 65496       Broadway & W 37 St               40.75173
## 65497       Broadway & W 29 St               40.74620
## 65498 Lafayette St & Jersey St               40.72456
## 65499    West St & Chambers St               40.71755
##       start.station.longitude end.station.id         end.station.name
## 65494               -74.00776            151 Cleveland Pl & Spring St
## 65495               -73.98672            168          W 18 St & 6 Ave
## 65496               -73.98754            402       Broadway & E 22 St
## 65497               -73.98856           3136          5 Ave & E 63 St
## 65498               -73.99565            309      Murray St & West St
## 65499               -74.01322            322  Clinton St & Tillary St
##       end.station.latitude end.station.longitude bikeid   usertype
## 65494             40.72210             -73.99725  17801   Customer
## 65495             40.73971             -73.99456  15696 Subscriber
## 65496             40.74034             -73.98955  20228 Subscriber
## 65497             40.76650             -73.97148  16743 Subscriber
## 65498             40.71498             -74.01301  23672 Subscriber
## 65499             40.69619             -73.99122  18399   Customer
##       birth.year gender
## 65494         NA      0
## 65495       1996      1
## 65496       1961      1
## 65497       1988      2
## 65498       1986      1
## 65499         NA      0

Summarize the data

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# View the structure of the data
glimpse(citibike)
## Observations: 65,499
## Variables: 15
## $ tripduration            <int> 1470, 229, 344, 1120, 229, 946, 2351, ...
## $ starttime               <chr> "6/1/2016 00:00:18", "6/1/2016 00:00:2...
## $ stoptime                <chr> "6/1/2016 00:24:48", "6/1/2016 00:04:0...
## $ start.station.id        <int> 380, 3092, 449, 522, 335, 503, 533, 49...
## $ start.station.name      <chr> "W 4 St & 7 Ave S", "Berry St & N 8 St...
## $ start.station.latitude  <dbl> 40.73401, 40.71901, 40.76462, 40.75715...
## $ start.station.longitude <dbl> -74.00294, -73.95853, -73.98789, -73.9...
## $ end.station.id          <int> 3236, 3103, 469, 401, 285, 495, 386, 4...
## $ end.station.name        <chr> "W 42 St & Dyer Ave", "N 11 St & Wythe...
## $ end.station.latitude    <dbl> 40.75898, 40.72153, 40.76344, 40.72020...
## $ end.station.longitude   <dbl> -73.99380, -73.95782, -73.98268, -73.9...
## $ bikeid                  <int> 19859, 16233, 22397, 16231, 15400, 251...
## $ usertype                <chr> "Subscriber", "Subscriber", "Subscribe...
## $ birth.year              <int> 1972, 1967, 1989, 1991, 1989, 1974, 19...
## $ gender                  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1,...
# View a summary of the data
summary(citibike)
##   tripduration      starttime           stoptime         start.station.id
##  Min.   :     61   Length:65499       Length:65499       Min.   :  72.0  
##  1st Qu.:    404   Class :character   Class :character   1st Qu.: 329.0  
##  Median :    662   Mode  :character   Mode  :character   Median : 456.0  
##  Mean   :   1052                                         Mean   : 954.5  
##  3rd Qu.:   1094                                         3rd Qu.: 530.0  
##  Max.   :3129291                                         Max.   :3255.0  
##                                                                          
##  start.station.name start.station.latitude start.station.longitude
##  Length:65499       Min.   :40.68          Min.   :-74.02         
##  Class :character   1st Qu.:40.72          1st Qu.:-74.00         
##  Mode  :character   Median :40.74          Median :-73.99         
##                     Mean   :40.74          Mean   :-73.99         
##                     3rd Qu.:40.76          3rd Qu.:-73.98         
##                     Max.   :40.79          Max.   :-73.93         
##                                                                   
##  end.station.id   end.station.name   end.station.latitude
##  Min.   :  72.0   Length:65499       Min.   :40.65       
##  1st Qu.: 327.0   Class :character   1st Qu.:40.72       
##  Median : 450.0   Mode  :character   Median :40.74       
##  Mean   : 898.6                      Mean   :40.74       
##  3rd Qu.: 524.0                      3rd Qu.:40.75       
##  Max.   :3255.0                      Max.   :40.79       
##                                                          
##  end.station.longitude     bikeid        usertype           birth.year  
##  Min.   :-74.03        Min.   :14529   Length:65499       Min.   :1885  
##  1st Qu.:-74.00        1st Qu.:17125   Class :character   1st Qu.:1969  
##  Median :-73.99        Median :19788   Mode  :character   Median :1979  
##  Mean   :-73.99        Mean   :20006                      Mean   :1977  
##  3rd Qu.:-73.98        3rd Qu.:22894                      3rd Qu.:1986  
##  Max.   :-73.93        Max.   :25409                      Max.   :2000  
##                                                           NA's   :5626  
##      gender     
##  Min.   :0.000  
##  1st Qu.:1.000  
##  Median :1.000  
##  Mean   :1.135  
##  3rd Qu.:1.000  
##  Max.   :2.000  
## 

Take a random sample of 300 since dataset is too large

citisample <- citibike[sample(nrow(citibike), 1000), ]
dim(citisample)
## [1] 1000   15

Filter usertype = customer from the dataset since there is no gender or year of birth information is avaialble they are coded as 0. Customers are non-scribers

citisampfilt <- filter(citisample, usertype == 'Subscriber')

# Remove NA
citina <- na.omit(citisampfilt)

Select colums to be used for analysis

library(dplyr)
citisubset <- select(citina, tripduration, starttime, stoptime, start.station.id, start.station.name, end.station.id, end.station.name, bikeid, usertype, birth.year, gender)
names(citisubset)
##  [1] "tripduration"       "starttime"          "stoptime"          
##  [4] "start.station.id"   "start.station.name" "end.station.id"    
##  [7] "end.station.name"   "bikeid"             "usertype"          
## [10] "birth.year"         "gender"
head(citisubset)
##   tripduration         starttime          stoptime start.station.id
## 1         2383 6/1/2016 07:15:07 6/1/2016 07:54:50              467
## 2          431 6/1/2016 19:06:34 6/1/2016 19:13:46              304
## 3          237 6/2/2016 07:55:36 6/2/2016 07:59:33              458
## 4          365 6/1/2016 07:47:53 6/1/2016 07:53:59              349
## 5           92 6/1/2016 18:16:38 6/1/2016 18:18:10              527
## 6         2234 6/1/2016 10:40:52 6/1/2016 11:18:07              462
##        start.station.name end.station.id           end.station.name bikeid
## 1         Dean St & 4 Ave            466            W 25 St & 6 Ave  24945
## 2   Broadway & Battery Pl           3002 South End Ave & Liberty St  19688
## 3        11 Ave & W 27 St            453            W 22 St & 8 Ave  22553
## 4 Rivington St & Ridge St            428             E 3 St & 1 Ave  17980
## 5         E 33 St & 2 Ave            528            2 Ave & E 31 St  16026
## 6        W 22 St & 10 Ave           3153            E 71 St & 2 Ave  17974
##     usertype birth.year gender
## 1 Subscriber       1980      1
## 2 Subscriber       1988      1
## 3 Subscriber       1977      1
## 4 Subscriber       1989      2
## 5 Subscriber       1963      2
## 6 Subscriber       1963      1

Separate starttime and stoptime columnsdate and time into 2 columns to perform analysis

# Split stattime colum into 2 STime and SDate
startDate <- format(as.POSIXct(strptime(citisubset$starttime,"%m/%d/%Y %H:%M:%S",tz="")),  format = "%m/%d/%Y")
startTime <- format(as.POSIXct(strptime(citisubset$starttime,"%m/%d/%Y %H:%M:%S",tz="")), format = "%H:%M:%S")
citisubset$startDate <- startDate
citisubset$startTime <- startTime

# Split stoptime colum into 2 STime and SDate
stopDate <- format(as.POSIXct(strptime(citisubset$stoptime,"%m/%d/%Y %H:%M:%S",tz="")),  format = "%m/%d/%Y")
stopTime <- format(as.POSIXct(strptime(citisubset$stoptime,"%m/%d/%Y %H:%M:%S",tz="")), format = "%H:%M:%S")
citisubset$stopDate <- stopDate
citisubset$stopTime <- stopTime

citisplit <- citisubset
head(citisplit)
##   tripduration         starttime          stoptime start.station.id
## 1         2383 6/1/2016 07:15:07 6/1/2016 07:54:50              467
## 2          431 6/1/2016 19:06:34 6/1/2016 19:13:46              304
## 3          237 6/2/2016 07:55:36 6/2/2016 07:59:33              458
## 4          365 6/1/2016 07:47:53 6/1/2016 07:53:59              349
## 5           92 6/1/2016 18:16:38 6/1/2016 18:18:10              527
## 6         2234 6/1/2016 10:40:52 6/1/2016 11:18:07              462
##        start.station.name end.station.id           end.station.name bikeid
## 1         Dean St & 4 Ave            466            W 25 St & 6 Ave  24945
## 2   Broadway & Battery Pl           3002 South End Ave & Liberty St  19688
## 3        11 Ave & W 27 St            453            W 22 St & 8 Ave  22553
## 4 Rivington St & Ridge St            428             E 3 St & 1 Ave  17980
## 5         E 33 St & 2 Ave            528            2 Ave & E 31 St  16026
## 6        W 22 St & 10 Ave           3153            E 71 St & 2 Ave  17974
##     usertype birth.year gender  startDate startTime   stopDate stopTime
## 1 Subscriber       1980      1 06/01/2016  07:15:07 06/01/2016 07:54:50
## 2 Subscriber       1988      1 06/01/2016  19:06:34 06/01/2016 19:13:46
## 3 Subscriber       1977      1 06/02/2016  07:55:36 06/02/2016 07:59:33
## 4 Subscriber       1989      2 06/01/2016  07:47:53 06/01/2016 07:53:59
## 5 Subscriber       1963      2 06/01/2016  18:16:38 06/01/2016 18:18:10
## 6 Subscriber       1963      1 06/01/2016  10:40:52 06/01/2016 11:18:07

Drop starttime and endtime columns

citisplit <- subset(citisubset, select = c(1,4,5,6,7,8,9,10,11,12,13,14,15))

head(citisplit)
##   tripduration start.station.id      start.station.name end.station.id
## 1         2383              467         Dean St & 4 Ave            466
## 2          431              304   Broadway & Battery Pl           3002
## 3          237              458        11 Ave & W 27 St            453
## 4          365              349 Rivington St & Ridge St            428
## 5           92              527         E 33 St & 2 Ave            528
## 6         2234              462        W 22 St & 10 Ave           3153
##             end.station.name bikeid   usertype birth.year gender
## 1            W 25 St & 6 Ave  24945 Subscriber       1980      1
## 2 South End Ave & Liberty St  19688 Subscriber       1988      1
## 3            W 22 St & 8 Ave  22553 Subscriber       1977      1
## 4             E 3 St & 1 Ave  17980 Subscriber       1989      2
## 5            2 Ave & E 31 St  16026 Subscriber       1963      2
## 6            E 71 St & 2 Ave  17974 Subscriber       1963      1
##    startDate startTime   stopDate stopTime
## 1 06/01/2016  07:15:07 06/01/2016 07:54:50
## 2 06/01/2016  19:06:34 06/01/2016 19:13:46
## 3 06/02/2016  07:55:36 06/02/2016 07:59:33
## 4 06/01/2016  07:47:53 06/01/2016 07:53:59
## 5 06/01/2016  18:16:38 06/01/2016 18:18:10
## 6 06/01/2016  10:40:52 06/01/2016 11:18:07

Calculate age and add column to dataset

citisplit$stopDated = as.Date(citisplit$stopDate, format="%m/%d/%Y")
citisplit$birth.yearn = as.numeric(substr(citisplit$birth.year,1,999))
citisplit$age = as.numeric(format(citisplit$stopDated,"%Y")) - citisplit$birth.yearn
    
head(citisplit)
##   tripduration start.station.id      start.station.name end.station.id
## 1         2383              467         Dean St & 4 Ave            466
## 2          431              304   Broadway & Battery Pl           3002
## 3          237              458        11 Ave & W 27 St            453
## 4          365              349 Rivington St & Ridge St            428
## 5           92              527         E 33 St & 2 Ave            528
## 6         2234              462        W 22 St & 10 Ave           3153
##             end.station.name bikeid   usertype birth.year gender
## 1            W 25 St & 6 Ave  24945 Subscriber       1980      1
## 2 South End Ave & Liberty St  19688 Subscriber       1988      1
## 3            W 22 St & 8 Ave  22553 Subscriber       1977      1
## 4             E 3 St & 1 Ave  17980 Subscriber       1989      2
## 5            2 Ave & E 31 St  16026 Subscriber       1963      2
## 6            E 71 St & 2 Ave  17974 Subscriber       1963      1
##    startDate startTime   stopDate stopTime  stopDated birth.yearn age
## 1 06/01/2016  07:15:07 06/01/2016 07:54:50 2016-06-01        1980  36
## 2 06/01/2016  19:06:34 06/01/2016 19:13:46 2016-06-01        1988  28
## 3 06/02/2016  07:55:36 06/02/2016 07:59:33 2016-06-02        1977  39
## 4 06/01/2016  07:47:53 06/01/2016 07:53:59 2016-06-01        1989  27
## 5 06/01/2016  18:16:38 06/01/2016 18:18:10 2016-06-01        1963  53
## 6 06/01/2016  10:40:52 06/01/2016 11:18:07 2016-06-01        1963  53

Convert age variable into intervals and add column of intervals to dataset

citisplit$agerange <- cut(citisplit$age, breaks= c(18, 24, 30, 35, 40, 50, 60, 80), 
                 labels = c("18-24 years", "25-30 years", "31-35 years", "36-40 years",
                            "41-50 years", "51-60 years", "65 plus"),
                right = FALSE)

Examine riders

This histogram is skewed to the right, the majority of citibike rides tend to be younger and male

hist(citisplit$age)

hist(citisplit$gender)

If a user is a customer and not a subscriber there is no age or gender recorded

# Mean age of riders
mean(citisplit$age)
## [1] 38.74183
# Recode gender variable 1=Male, 2=Female
citisplit$gender2 <- ifelse(citisplit$gender == 1, "Male", ifelse(citisplit$gender == 2, "Female", "Other"))
head(citisplit)
##   tripduration start.station.id      start.station.name end.station.id
## 1         2383              467         Dean St & 4 Ave            466
## 2          431              304   Broadway & Battery Pl           3002
## 3          237              458        11 Ave & W 27 St            453
## 4          365              349 Rivington St & Ridge St            428
## 5           92              527         E 33 St & 2 Ave            528
## 6         2234              462        W 22 St & 10 Ave           3153
##             end.station.name bikeid   usertype birth.year gender
## 1            W 25 St & 6 Ave  24945 Subscriber       1980      1
## 2 South End Ave & Liberty St  19688 Subscriber       1988      1
## 3            W 22 St & 8 Ave  22553 Subscriber       1977      1
## 4             E 3 St & 1 Ave  17980 Subscriber       1989      2
## 5            2 Ave & E 31 St  16026 Subscriber       1963      2
## 6            E 71 St & 2 Ave  17974 Subscriber       1963      1
##    startDate startTime   stopDate stopTime  stopDated birth.yearn age
## 1 06/01/2016  07:15:07 06/01/2016 07:54:50 2016-06-01        1980  36
## 2 06/01/2016  19:06:34 06/01/2016 19:13:46 2016-06-01        1988  28
## 3 06/02/2016  07:55:36 06/02/2016 07:59:33 2016-06-02        1977  39
## 4 06/01/2016  07:47:53 06/01/2016 07:53:59 2016-06-01        1989  27
## 5 06/01/2016  18:16:38 06/01/2016 18:18:10 2016-06-01        1963  53
## 6 06/01/2016  10:40:52 06/01/2016 11:18:07 2016-06-01        1963  53
##      agerange gender2
## 1 36-40 years    Male
## 2 25-30 years    Male
## 3 36-40 years    Male
## 4 25-30 years  Female
## 5 51-60 years  Female
## 6 51-60 years    Male
# Crosstab agerange by subscriber
require(eeptools)
## Loading required package: eeptools
## Loading required package: ggplot2
varnames<-c('Male','Female')
crosstabs(citisplit, rowvar = "agerange",colvar = "gender2",
          varnames=varnames, digits = 2)
## $TABS
##              Female
## Male          Female Male Other
##   18-24 years      6   29     0
##   25-30 years     60  147     0
##   31-35 years     41  111     0
##   36-40 years     28  112     1
##   41-50 years     47  154     1
##   51-60 years     27  110     0
##   65 plus         13   30     0
## 
## $PROPORTIONS
##              Female
## Male          Female Male Other
##   18-24 years      1    3     0
##   25-30 years      7   16     0
##   31-35 years      4   12     0
##   36-40 years      3   12     0
##   41-50 years      5   17     0
##   51-60 years      3   12     0
##   65 plus          1    3     0
## 
## $TABSPROPORTIONS
##              Female
## Male          Female   Male       Other  
##   18-24 years 1%\n(6)  3%\n(29)   0%\n(0)
##   25-30 years 7%\n(60) 16%\n(147) 0%\n(0)
##   31-35 years 4%\n(41) 12%\n(111) 0%\n(0)
##   36-40 years 3%\n(28) 12%\n(112) 0%\n(1)
##   41-50 years 5%\n(47) 17%\n(154) 0%\n(1)
##   51-60 years 3%\n(27) 12%\n(110) 0%\n(0)
##   65 plus     1%\n(13) 3%\n(30)   0%\n(0)

Conclusion

The majority of citibike users are subscribers. The average age of citibike subscribers is 39 years old. The majority are male.

Dataset 2 Description New York City demographic data

One can use this dataset for targeted Marketing Campaign based on Demography. For example, a manufacturer of Sporting Goods or a Sport Store can choose a county for their campaign based on residents age. Another example would be to do targeted campaign on the basis of second/native language.

Read data

demographic <- read.csv('Demographic_Statistics_By_Zip_Code.csv',
                        header=TRUE,stringsAsFactors = FALSE)

head(demographic)
##   JURISDICTION.NAME COUNT.PARTICIPANTS COUNT.FEMALE PERCENT.FEMALE
## 1             10001                 44           22           0.50
## 2             10002                 35           19           0.54
## 3             10003                  1            1           1.00
## 4             10004                  0            0           0.00
## 5             10005                  2            2           1.00
## 6             10006                  6            2           0.33
##   COUNT.MALE PERCENT.MALE COUNT.GENDER.UNKNOWN PERCENT.GENDER.UNKNOWN
## 1         22         0.50                    0                      0
## 2         16         0.46                    0                      0
## 3          0         0.00                    0                      0
## 4          0         0.00                    0                      0
## 5          0         0.00                    0                      0
## 6          4         0.67                    0                      0
##   COUNT.GENDER.TOTAL PERCENT.GENDER.TOTAL COUNT.PACIFIC.ISLANDER
## 1                 44                  100                      0
## 2                 35                  100                      0
## 3                  1                  100                      0
## 4                  0                    0                      0
## 5                  2                  100                      0
## 6                  6                  100                      0
##   PERCENT.PACIFIC.ISLANDER COUNT.HISPANIC.LATINO PERCENT.HISPANIC.LATINO
## 1                        0                    16                    0.36
## 2                        0                     1                    0.03
## 3                        0                     0                    0.00
## 4                        0                     0                    0.00
## 5                        0                     0                    0.00
## 6                        0                     2                    0.33
##   COUNT.AMERICAN.INDIAN PERCENT.AMERICAN.INDIAN COUNT.ASIAN.NON.HISPANIC
## 1                     0                       0                        3
## 2                     0                       0                       28
## 3                     0                       0                        1
## 4                     0                       0                        0
## 5                     0                       0                        1
## 6                     0                       0                        0
##   PERCENT.ASIAN.NON.HISPANIC COUNT.WHITE.NON.HISPANIC
## 1                       0.07                        1
## 2                       0.80                        6
## 3                       1.00                        0
## 4                       0.00                        0
## 5                       0.50                        0
## 6                       0.00                        1
##   PERCENT.WHITE.NON.HISPANIC COUNT.BLACK.NON.HISPANIC
## 1                       0.02                       21
## 2                       0.17                        0
## 3                       0.00                        0
## 4                       0.00                        0
## 5                       0.00                        1
## 6                       0.17                        3
##   PERCENT.BLACK.NON.HISPANIC COUNT.OTHER.ETHNICITY PERCENT.OTHER.ETHNICITY
## 1                       0.48                     3                    0.07
## 2                       0.00                     0                    0.00
## 3                       0.00                     0                    0.00
## 4                       0.00                     0                    0.00
## 5                       0.50                     0                    0.00
## 6                       0.50                     0                    0.00
##   COUNT.ETHNICITY.UNKNOWN PERCENT.ETHNICITY.UNKNOWN COUNT.ETHNICITY.TOTAL
## 1                       0                         0                    44
## 2                       0                         0                    35
## 3                       0                         0                     1
## 4                       0                         0                     0
## 5                       0                         0                     2
## 6                       0                         0                     6
##   PERCENT.ETHNICITY.TOTAL COUNT.PERMANENT.RESIDENT.ALIEN
## 1                     100                              2
## 2                     100                              2
## 3                     100                              0
## 4                       0                              0
## 5                     100                              1
## 6                     100                              0
##   PERCENT.PERMANENT.RESIDENT.ALIEN COUNT.US.CITIZEN PERCENT.US.CITIZEN
## 1                             0.05               42               0.95
## 2                             0.06               33               0.94
## 3                             0.00                1               1.00
## 4                             0.00                0               0.00
## 5                             0.50                1               0.50
## 6                             0.00                6               1.00
##   COUNT.OTHER.CITIZEN.STATUS PERCENT.OTHER.CITIZEN.STATUS
## 1                          0                            0
## 2                          0                            0
## 3                          0                            0
## 4                          0                            0
## 5                          0                            0
## 6                          0                            0
##   COUNT.CITIZEN.STATUS.UNKNOWN PERCENT.CITIZEN.STATUS.UNKNOWN
## 1                            0                              0
## 2                            0                              0
## 3                            0                              0
## 4                            0                              0
## 5                            0                              0
## 6                            0                              0
##   COUNT.CITIZEN.STATUS.TOTAL PERCENT.CITIZEN.STATUS.TOTAL
## 1                         44                          100
## 2                         35                          100
## 3                          1                          100
## 4                          0                            0
## 5                          2                          100
## 6                          6                          100
##   COUNT.RECEIVES.PUBLIC.ASSISTANCE PERCENT.RECEIVES.PUBLIC.ASSISTANCE
## 1                               20                               0.45
## 2                                2                               0.06
## 3                                0                               0.00
## 4                                0                               0.00
## 5                                0                               0.00
## 6                                0                               0.00
##   COUNT.NRECEIVES.PUBLIC.ASSISTANCE PERCENT.NRECEIVES.PUBLIC.ASSISTANCE
## 1                                24                                0.55
## 2                                33                                0.94
## 3                                 1                                1.00
## 4                                 0                                0.00
## 5                                 2                                1.00
## 6                                 6                                1.00
##   COUNT.PUBLIC.ASSISTANCE.UNKNOWN PERCENT.PUBLIC.ASSISTANCE.UNKNOWN
## 1                               0                                 0
## 2                               0                                 0
## 3                               0                                 0
## 4                               0                                 0
## 5                               0                                 0
## 6                               0                                 0
##   COUNT.PUBLIC.ASSISTANCE.TOTAL PERCENT.PUBLIC.ASSISTANCE.TOTAL
## 1                            44                             100
## 2                            35                             100
## 3                             1                             100
## 4                             0                               0
## 5                             2                             100
## 6                             6                             100

Convert data frame to table to make it easier

demographic <- tbl_df(demographic)

Explore the data

# Verify whether is a dataframe
class(demographic)
## [1] "tbl_df"     "tbl"        "data.frame"
# Check the dimensions
dim(demographic)
## [1] 236  46
# View the column names
names(demographic)
##  [1] "JURISDICTION.NAME"                  
##  [2] "COUNT.PARTICIPANTS"                 
##  [3] "COUNT.FEMALE"                       
##  [4] "PERCENT.FEMALE"                     
##  [5] "COUNT.MALE"                         
##  [6] "PERCENT.MALE"                       
##  [7] "COUNT.GENDER.UNKNOWN"               
##  [8] "PERCENT.GENDER.UNKNOWN"             
##  [9] "COUNT.GENDER.TOTAL"                 
## [10] "PERCENT.GENDER.TOTAL"               
## [11] "COUNT.PACIFIC.ISLANDER"             
## [12] "PERCENT.PACIFIC.ISLANDER"           
## [13] "COUNT.HISPANIC.LATINO"              
## [14] "PERCENT.HISPANIC.LATINO"            
## [15] "COUNT.AMERICAN.INDIAN"              
## [16] "PERCENT.AMERICAN.INDIAN"            
## [17] "COUNT.ASIAN.NON.HISPANIC"           
## [18] "PERCENT.ASIAN.NON.HISPANIC"         
## [19] "COUNT.WHITE.NON.HISPANIC"           
## [20] "PERCENT.WHITE.NON.HISPANIC"         
## [21] "COUNT.BLACK.NON.HISPANIC"           
## [22] "PERCENT.BLACK.NON.HISPANIC"         
## [23] "COUNT.OTHER.ETHNICITY"              
## [24] "PERCENT.OTHER.ETHNICITY"            
## [25] "COUNT.ETHNICITY.UNKNOWN"            
## [26] "PERCENT.ETHNICITY.UNKNOWN"          
## [27] "COUNT.ETHNICITY.TOTAL"              
## [28] "PERCENT.ETHNICITY.TOTAL"            
## [29] "COUNT.PERMANENT.RESIDENT.ALIEN"     
## [30] "PERCENT.PERMANENT.RESIDENT.ALIEN"   
## [31] "COUNT.US.CITIZEN"                   
## [32] "PERCENT.US.CITIZEN"                 
## [33] "COUNT.OTHER.CITIZEN.STATUS"         
## [34] "PERCENT.OTHER.CITIZEN.STATUS"       
## [35] "COUNT.CITIZEN.STATUS.UNKNOWN"       
## [36] "PERCENT.CITIZEN.STATUS.UNKNOWN"     
## [37] "COUNT.CITIZEN.STATUS.TOTAL"         
## [38] "PERCENT.CITIZEN.STATUS.TOTAL"       
## [39] "COUNT.RECEIVES.PUBLIC.ASSISTANCE"   
## [40] "PERCENT.RECEIVES.PUBLIC.ASSISTANCE" 
## [41] "COUNT.NRECEIVES.PUBLIC.ASSISTANCE"  
## [42] "PERCENT.NRECEIVES.PUBLIC.ASSISTANCE"
## [43] "COUNT.PUBLIC.ASSISTANCE.UNKNOWN"    
## [44] "PERCENT.PUBLIC.ASSISTANCE.UNKNOWN"  
## [45] "COUNT.PUBLIC.ASSISTANCE.TOTAL"      
## [46] "PERCENT.PUBLIC.ASSISTANCE.TOTAL"

Summarize the data

library(dplyr)
glimpse(demographic)
## Observations: 236
## Variables: 46
## $ JURISDICTION.NAME                   <int> 10001, 10002, 10003, 10004...
## $ COUNT.PARTICIPANTS                  <int> 44, 35, 1, 0, 2, 6, 1, 2, ...
## $ COUNT.FEMALE                        <int> 22, 19, 1, 0, 2, 2, 0, 0, ...
## $ PERCENT.FEMALE                      <dbl> 0.50, 0.54, 1.00, 0.00, 1....
## $ COUNT.MALE                          <int> 22, 16, 0, 0, 0, 4, 1, 2, ...
## $ PERCENT.MALE                        <dbl> 0.50, 0.46, 0.00, 0.00, 0....
## $ COUNT.GENDER.UNKNOWN                <int> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ PERCENT.GENDER.UNKNOWN              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ COUNT.GENDER.TOTAL                  <int> 44, 35, 1, 0, 2, 6, 1, 2, ...
## $ PERCENT.GENDER.TOTAL                <int> 100, 100, 100, 0, 100, 100...
## $ COUNT.PACIFIC.ISLANDER              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ PERCENT.PACIFIC.ISLANDER            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ COUNT.HISPANIC.LATINO               <int> 16, 1, 0, 0, 0, 2, 0, 0, 0...
## $ PERCENT.HISPANIC.LATINO             <dbl> 0.36, 0.03, 0.00, 0.00, 0....
## $ COUNT.AMERICAN.INDIAN               <int> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ PERCENT.AMERICAN.INDIAN             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ COUNT.ASIAN.NON.HISPANIC            <int> 3, 28, 1, 0, 1, 0, 1, 2, 0...
## $ PERCENT.ASIAN.NON.HISPANIC          <dbl> 0.07, 0.80, 1.00, 0.00, 0....
## $ COUNT.WHITE.NON.HISPANIC            <int> 1, 6, 0, 0, 0, 1, 0, 0, 0,...
## $ PERCENT.WHITE.NON.HISPANIC          <dbl> 0.02, 0.17, 0.00, 0.00, 0....
## $ COUNT.BLACK.NON.HISPANIC            <int> 21, 0, 0, 0, 1, 3, 0, 0, 0...
## $ PERCENT.BLACK.NON.HISPANIC          <dbl> 0.48, 0.00, 0.00, 0.00, 0....
## $ COUNT.OTHER.ETHNICITY               <int> 3, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ PERCENT.OTHER.ETHNICITY             <dbl> 0.07, 0.00, 0.00, 0.00, 0....
## $ COUNT.ETHNICITY.UNKNOWN             <int> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ PERCENT.ETHNICITY.UNKNOWN           <dbl> 0.00, 0.00, 0.00, 0.00, 0....
## $ COUNT.ETHNICITY.TOTAL               <int> 44, 35, 1, 0, 2, 6, 1, 2, ...
## $ PERCENT.ETHNICITY.TOTAL             <int> 100, 100, 100, 0, 100, 100...
## $ COUNT.PERMANENT.RESIDENT.ALIEN      <int> 2, 2, 0, 0, 1, 0, 0, 0, 0,...
## $ PERCENT.PERMANENT.RESIDENT.ALIEN    <dbl> 0.05, 0.06, 0.00, 0.00, 0....
## $ COUNT.US.CITIZEN                    <int> 42, 33, 1, 0, 1, 6, 1, 2, ...
## $ PERCENT.US.CITIZEN                  <dbl> 0.95, 0.94, 1.00, 0.00, 0....
## $ COUNT.OTHER.CITIZEN.STATUS          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ PERCENT.OTHER.CITIZEN.STATUS        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ COUNT.CITIZEN.STATUS.UNKNOWN        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ PERCENT.CITIZEN.STATUS.UNKNOWN      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ COUNT.CITIZEN.STATUS.TOTAL          <int> 44, 35, 1, 0, 2, 6, 1, 2, ...
## $ PERCENT.CITIZEN.STATUS.TOTAL        <int> 100, 100, 100, 0, 100, 100...
## $ COUNT.RECEIVES.PUBLIC.ASSISTANCE    <int> 20, 2, 0, 0, 0, 0, 1, 0, 0...
## $ PERCENT.RECEIVES.PUBLIC.ASSISTANCE  <dbl> 0.45, 0.06, 0.00, 0.00, 0....
## $ COUNT.NRECEIVES.PUBLIC.ASSISTANCE   <int> 24, 33, 1, 0, 2, 6, 0, 2, ...
## $ PERCENT.NRECEIVES.PUBLIC.ASSISTANCE <dbl> 0.55, 0.94, 1.00, 0.00, 1....
## $ COUNT.PUBLIC.ASSISTANCE.UNKNOWN     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ PERCENT.PUBLIC.ASSISTANCE.UNKNOWN   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ COUNT.PUBLIC.ASSISTANCE.TOTAL       <int> 44, 35, 1, 0, 2, 6, 1, 2, ...
## $ PERCENT.PUBLIC.ASSISTANCE.TOTAL     <int> 100, 100, 100, 0, 100, 100...
# Summurize data
summary(demographic)
##  JURISDICTION.NAME COUNT.PARTICIPANTS  COUNT.FEMALE   PERCENT.FEMALE 
##  Min.   :10001     Min.   :  0.00     Min.   :  0.0   Min.   :0.000  
##  1st Qu.:10452     1st Qu.:  0.00     1st Qu.:  0.0   1st Qu.:0.000  
##  Median :11216     Median :  0.00     Median :  0.0   Median :0.000  
##  Mean   :11127     Mean   : 17.66     Mean   : 10.3   Mean   :0.244  
##  3rd Qu.:11422     3rd Qu.: 13.00     3rd Qu.:  6.0   3rd Qu.:0.515  
##  Max.   :20459     Max.   :272.00     Max.   :194.0   Max.   :1.000  
##    COUNT.MALE       PERCENT.MALE   COUNT.GENDER.UNKNOWN
##  Min.   :  0.000   Min.   :0.000   Min.   :0           
##  1st Qu.:  0.000   1st Qu.:0.000   1st Qu.:0           
##  Median :  0.000   Median :0.000   Median :0           
##  Mean   :  7.364   Mean   :0.201   Mean   :0           
##  3rd Qu.:  4.000   3rd Qu.:0.400   3rd Qu.:0           
##  Max.   :157.000   Max.   :1.000   Max.   :0           
##  PERCENT.GENDER.UNKNOWN COUNT.GENDER.TOTAL PERCENT.GENDER.TOTAL
##  Min.   :0              Min.   :  0.00     Min.   :  0.00      
##  1st Qu.:0              1st Qu.:  0.00     1st Qu.:  0.00      
##  Median :0              Median :  0.00     Median :  0.00      
##  Mean   :0              Mean   : 17.66     Mean   : 44.49      
##  3rd Qu.:0              3rd Qu.: 13.00     3rd Qu.:100.00      
##  Max.   :0              Max.   :272.00     Max.   :100.00      
##  COUNT.PACIFIC.ISLANDER PERCENT.PACIFIC.ISLANDER COUNT.HISPANIC.LATINO
##  Min.   :0.00000        Min.   :0.0000000        Min.   : 0.000       
##  1st Qu.:0.00000        1st Qu.:0.0000000        1st Qu.: 0.000       
##  Median :0.00000        Median :0.0000000        Median : 0.000       
##  Mean   :0.02542        Mean   :0.0002966        Mean   : 1.856       
##  3rd Qu.:0.00000        3rd Qu.:0.0000000        3rd Qu.: 0.000       
##  Max.   :2.00000        Max.   :0.0200000        Max.   :51.000       
##  PERCENT.HISPANIC.LATINO COUNT.AMERICAN.INDIAN PERCENT.AMERICAN.INDIAN
##  Min.   :0.00000         Min.   :0.00000       Min.   :0.000000       
##  1st Qu.:0.00000         1st Qu.:0.00000       1st Qu.:0.000000       
##  Median :0.00000         Median :0.00000       Median :0.000000       
##  Mean   :0.07983         Mean   :0.02119       Mean   :0.001017       
##  3rd Qu.:0.00000         3rd Qu.:0.00000       3rd Qu.:0.000000       
##  Max.   :1.00000         Max.   :2.00000       Max.   :0.200000       
##  COUNT.ASIAN.NON.HISPANIC PERCENT.ASIAN.NON.HISPANIC
##  Min.   : 0.0000          Min.   :0.00000           
##  1st Qu.: 0.0000          1st Qu.:0.00000           
##  Median : 0.0000          Median :0.00000           
##  Mean   : 0.5254          Mean   :0.05657           
##  3rd Qu.: 0.0000          3rd Qu.:0.00000           
##  Max.   :28.0000          Max.   :1.00000           
##  COUNT.WHITE.NON.HISPANIC PERCENT.WHITE.NON.HISPANIC
##  Min.   :  0.00           Min.   :0.0000            
##  1st Qu.:  0.00           1st Qu.:0.0000            
##  Median :  0.00           Median :0.0000            
##  Mean   : 12.19           Mean   :0.1778            
##  3rd Qu.:  1.00           3rd Qu.:0.0225            
##  Max.   :262.00           Max.   :1.0000            
##  COUNT.BLACK.NON.HISPANIC PERCENT.BLACK.NON.HISPANIC COUNT.OTHER.ETHNICITY
##  Min.   : 0.000           Min.   :0.0000             Min.   : 0.0000      
##  1st Qu.: 0.000           1st Qu.:0.0000             1st Qu.: 0.0000      
##  Median : 0.000           Median :0.0000             Median : 0.0000      
##  Mean   : 2.233           Mean   :0.1042             Mean   : 0.6864      
##  3rd Qu.: 0.000           3rd Qu.:0.0000             3rd Qu.: 0.0000      
##  Max.   :60.000           Max.   :1.0000             Max.   :17.0000      
##  PERCENT.OTHER.ETHNICITY COUNT.ETHNICITY.UNKNOWN PERCENT.ETHNICITY.UNKNOWN
##  Min.   :0.00000         Min.   :0.0000          Min.   :0.000000         
##  1st Qu.:0.00000         1st Qu.:0.0000          1st Qu.:0.000000         
##  Median :0.00000         Median :0.0000          Median :0.000000         
##  Mean   :0.02131         Mean   :0.1229          Mean   :0.003941         
##  3rd Qu.:0.00000         3rd Qu.:0.0000          3rd Qu.:0.000000         
##  Max.   :0.50000         Max.   :5.0000          Max.   :0.250000         
##  COUNT.ETHNICITY.TOTAL PERCENT.ETHNICITY.TOTAL
##  Min.   :  0.00        Min.   :  0.00         
##  1st Qu.:  0.00        1st Qu.:  0.00         
##  Median :  0.00        Median :  0.00         
##  Mean   : 17.66        Mean   : 44.45         
##  3rd Qu.: 13.00        3rd Qu.:100.00         
##  Max.   :272.00        Max.   :100.00         
##  COUNT.PERMANENT.RESIDENT.ALIEN PERCENT.PERMANENT.RESIDENT.ALIEN
##  Min.   : 0.0000                Min.   :0.00000                 
##  1st Qu.: 0.0000                1st Qu.:0.00000                 
##  Median : 0.0000                Median :0.00000                 
##  Mean   : 0.4449                Mean   :0.02419                 
##  3rd Qu.: 0.0000                3rd Qu.:0.00000                 
##  Max.   :10.0000                Max.   :1.00000                 
##  COUNT.US.CITIZEN PERCENT.US.CITIZEN COUNT.OTHER.CITIZEN.STATUS
##  Min.   :  0.00   Min.   :0.0000     Min.   :0.00000           
##  1st Qu.:  0.00   1st Qu.:0.0000     1st Qu.:0.00000           
##  Median :  0.00   Median :0.0000     Median :0.00000           
##  Mean   : 17.17   Mean   :0.4187     Mean   :0.04661           
##  3rd Qu.: 12.00   3rd Qu.:0.9900     3rd Qu.:0.00000           
##  Max.   :271.00   Max.   :1.0000     Max.   :2.00000           
##  PERCENT.OTHER.CITIZEN.STATUS COUNT.CITIZEN.STATUS.UNKNOWN
##  Min.   :0.000000             Min.   :0                   
##  1st Qu.:0.000000             1st Qu.:0                   
##  Median :0.000000             Median :0                   
##  Mean   :0.002076             Mean   :0                   
##  3rd Qu.:0.000000             3rd Qu.:0                   
##  Max.   :0.330000             Max.   :0                   
##  PERCENT.CITIZEN.STATUS.UNKNOWN COUNT.CITIZEN.STATUS.TOTAL
##  Min.   :0                      Min.   :  0.00            
##  1st Qu.:0                      1st Qu.:  0.00            
##  Median :0                      Median :  0.00            
##  Mean   :0                      Mean   : 17.66            
##  3rd Qu.:0                      3rd Qu.: 13.00            
##  Max.   :0                      Max.   :272.00            
##  PERCENT.CITIZEN.STATUS.TOTAL COUNT.RECEIVES.PUBLIC.ASSISTANCE
##  Min.   :  0.00               Min.   :  0.000                 
##  1st Qu.:  0.00               1st Qu.:  0.000                 
##  Median :  0.00               Median :  0.000                 
##  Mean   : 44.49               Mean   :  5.975                 
##  3rd Qu.:100.00               3rd Qu.:  3.250                 
##  Max.   :100.00               Max.   :155.000                 
##  PERCENT.RECEIVES.PUBLIC.ASSISTANCE COUNT.NRECEIVES.PUBLIC.ASSISTANCE
##  Min.   :0.0000                     Min.   :  0.00                   
##  1st Qu.:0.0000                     1st Qu.:  0.00                   
##  Median :0.0000                     Median :  0.00                   
##  Mean   :0.1392                     Mean   : 11.69                   
##  3rd Qu.:0.2525                     3rd Qu.:  8.00                   
##  Max.   :1.0000                     Max.   :206.00                   
##  PERCENT.NRECEIVES.PUBLIC.ASSISTANCE COUNT.PUBLIC.ASSISTANCE.UNKNOWN
##  Min.   :0.0000                      Min.   :0                      
##  1st Qu.:0.0000                      1st Qu.:0                      
##  Median :0.0000                      Median :0                      
##  Mean   :0.3058                      Mean   :0                      
##  3rd Qu.:0.6625                      3rd Qu.:0                      
##  Max.   :1.0000                      Max.   :0                      
##  PERCENT.PUBLIC.ASSISTANCE.UNKNOWN COUNT.PUBLIC.ASSISTANCE.TOTAL
##  Min.   :0                         Min.   :  0.00               
##  1st Qu.:0                         1st Qu.:  0.00               
##  Median :0                         Median :  0.00               
##  Mean   :0                         Mean   : 17.66               
##  3rd Qu.:0                         3rd Qu.: 13.00               
##  Max.   :0                         Max.   :272.00               
##  PERCENT.PUBLIC.ASSISTANCE.TOTAL
##  Min.   :  0.00                 
##  1st Qu.:  0.00                 
##  Median :  0.00                 
##  Mean   : 44.49                 
##  3rd Qu.:100.00                 
##  Max.   :100.00
# View the top of the data
head(demographic)
## # A tibble: 6 × 46
##   JURISDICTION.NAME COUNT.PARTICIPANTS COUNT.FEMALE PERCENT.FEMALE
##               <int>              <int>        <int>          <dbl>
## 1             10001                 44           22           0.50
## 2             10002                 35           19           0.54
## 3             10003                  1            1           1.00
## 4             10004                  0            0           0.00
## 5             10005                  2            2           1.00
## 6             10006                  6            2           0.33
## # ... with 42 more variables: COUNT.MALE <int>, PERCENT.MALE <dbl>,
## #   COUNT.GENDER.UNKNOWN <int>, PERCENT.GENDER.UNKNOWN <int>,
## #   COUNT.GENDER.TOTAL <int>, PERCENT.GENDER.TOTAL <int>,
## #   COUNT.PACIFIC.ISLANDER <int>, PERCENT.PACIFIC.ISLANDER <dbl>,
## #   COUNT.HISPANIC.LATINO <int>, PERCENT.HISPANIC.LATINO <dbl>,
## #   COUNT.AMERICAN.INDIAN <int>, PERCENT.AMERICAN.INDIAN <dbl>,
## #   COUNT.ASIAN.NON.HISPANIC <int>, PERCENT.ASIAN.NON.HISPANIC <dbl>,
## #   COUNT.WHITE.NON.HISPANIC <int>, PERCENT.WHITE.NON.HISPANIC <dbl>,
## #   COUNT.BLACK.NON.HISPANIC <int>, PERCENT.BLACK.NON.HISPANIC <dbl>,
## #   COUNT.OTHER.ETHNICITY <int>, PERCENT.OTHER.ETHNICITY <dbl>,
## #   COUNT.ETHNICITY.UNKNOWN <int>, PERCENT.ETHNICITY.UNKNOWN <dbl>,
## #   COUNT.ETHNICITY.TOTAL <int>, PERCENT.ETHNICITY.TOTAL <int>,
## #   COUNT.PERMANENT.RESIDENT.ALIEN <int>,
## #   PERCENT.PERMANENT.RESIDENT.ALIEN <dbl>, COUNT.US.CITIZEN <int>,
## #   PERCENT.US.CITIZEN <dbl>, COUNT.OTHER.CITIZEN.STATUS <int>,
## #   PERCENT.OTHER.CITIZEN.STATUS <dbl>,
## #   COUNT.CITIZEN.STATUS.UNKNOWN <int>,
## #   PERCENT.CITIZEN.STATUS.UNKNOWN <int>,
## #   COUNT.CITIZEN.STATUS.TOTAL <int>, PERCENT.CITIZEN.STATUS.TOTAL <int>,
## #   COUNT.RECEIVES.PUBLIC.ASSISTANCE <int>,
## #   PERCENT.RECEIVES.PUBLIC.ASSISTANCE <dbl>,
## #   COUNT.NRECEIVES.PUBLIC.ASSISTANCE <int>,
## #   PERCENT.NRECEIVES.PUBLIC.ASSISTANCE <dbl>,
## #   COUNT.PUBLIC.ASSISTANCE.UNKNOWN <int>,
## #   PERCENT.PUBLIC.ASSISTANCE.UNKNOWN <int>,
## #   COUNT.PUBLIC.ASSISTANCE.TOTAL <int>,
## #   PERCENT.PUBLIC.ASSISTANCE.TOTAL <int>
# View end of the data
tail(demographic)
## # A tibble: 6 × 46
##   JURISDICTION.NAME COUNT.PARTICIPANTS COUNT.FEMALE PERCENT.FEMALE
##               <int>              <int>        <int>          <dbl>
## 1             12786                  4            3           0.75
## 2             12788                 83           39           0.47
## 3             12789                272          115           0.42
## 4             13731                 17            2           0.12
## 5             16091                  0            0           0.00
## 6             20459                  0            0           0.00
## # ... with 42 more variables: COUNT.MALE <int>, PERCENT.MALE <dbl>,
## #   COUNT.GENDER.UNKNOWN <int>, PERCENT.GENDER.UNKNOWN <int>,
## #   COUNT.GENDER.TOTAL <int>, PERCENT.GENDER.TOTAL <int>,
## #   COUNT.PACIFIC.ISLANDER <int>, PERCENT.PACIFIC.ISLANDER <dbl>,
## #   COUNT.HISPANIC.LATINO <int>, PERCENT.HISPANIC.LATINO <dbl>,
## #   COUNT.AMERICAN.INDIAN <int>, PERCENT.AMERICAN.INDIAN <dbl>,
## #   COUNT.ASIAN.NON.HISPANIC <int>, PERCENT.ASIAN.NON.HISPANIC <dbl>,
## #   COUNT.WHITE.NON.HISPANIC <int>, PERCENT.WHITE.NON.HISPANIC <dbl>,
## #   COUNT.BLACK.NON.HISPANIC <int>, PERCENT.BLACK.NON.HISPANIC <dbl>,
## #   COUNT.OTHER.ETHNICITY <int>, PERCENT.OTHER.ETHNICITY <dbl>,
## #   COUNT.ETHNICITY.UNKNOWN <int>, PERCENT.ETHNICITY.UNKNOWN <dbl>,
## #   COUNT.ETHNICITY.TOTAL <int>, PERCENT.ETHNICITY.TOTAL <int>,
## #   COUNT.PERMANENT.RESIDENT.ALIEN <int>,
## #   PERCENT.PERMANENT.RESIDENT.ALIEN <dbl>, COUNT.US.CITIZEN <int>,
## #   PERCENT.US.CITIZEN <dbl>, COUNT.OTHER.CITIZEN.STATUS <int>,
## #   PERCENT.OTHER.CITIZEN.STATUS <dbl>,
## #   COUNT.CITIZEN.STATUS.UNKNOWN <int>,
## #   PERCENT.CITIZEN.STATUS.UNKNOWN <int>,
## #   COUNT.CITIZEN.STATUS.TOTAL <int>, PERCENT.CITIZEN.STATUS.TOTAL <int>,
## #   COUNT.RECEIVES.PUBLIC.ASSISTANCE <int>,
## #   PERCENT.RECEIVES.PUBLIC.ASSISTANCE <dbl>,
## #   COUNT.NRECEIVES.PUBLIC.ASSISTANCE <int>,
## #   PERCENT.NRECEIVES.PUBLIC.ASSISTANCE <dbl>,
## #   COUNT.PUBLIC.ASSISTANCE.UNKNOWN <int>,
## #   PERCENT.PUBLIC.ASSISTANCE.UNKNOWN <int>,
## #   COUNT.PUBLIC.ASSISTANCE.TOTAL <int>,
## #   PERCENT.PUBLIC.ASSISTANCE.TOTAL <int>

Visualize the data

# Histrogram of data
hist(demographic$PERCENT.FEMALE)

# Scatter plot comparing citizen status total and public assistance total
plot(demographic$COUNT.CITIZEN.STATUS.TOTAL, demographic$COUNT.PUBLIC.ASSISTANCE.TOTAL)

Select columns fromd data set required to perform the analysis

demoSplit <- select(demographic, JURISDICTION.NAME, COUNT.PARTICIPANTS, COUNT.FEMALE,
                    PERCENT.FEMALE, COUNT.MALE, COUNT.PACIFIC.ISLANDER, 
                    COUNT.HISPANIC.LATINO, COUNT.AMERICAN.INDIAN, 
                    COUNT.ASIAN.NON.HISPANIC,COUNT.WHITE.NON.HISPANIC, 
                    COUNT.BLACK.NON.HISPANIC, COUNT.OTHER.ETHNICITY,
                    COUNT.PERMANENT.RESIDENT.ALIEN, COUNT.US.CITIZEN,
                    COUNT.RECEIVES.PUBLIC.ASSISTANCE, COUNT.NRECEIVES.PUBLIC.ASSISTANCE,
                    COUNT.PUBLIC.ASSISTANCE.TOTAL)
head(demoSplit)
## # A tibble: 6 × 17
##   JURISDICTION.NAME COUNT.PARTICIPANTS COUNT.FEMALE PERCENT.FEMALE
##               <int>              <int>        <int>          <dbl>
## 1             10001                 44           22           0.50
## 2             10002                 35           19           0.54
## 3             10003                  1            1           1.00
## 4             10004                  0            0           0.00
## 5             10005                  2            2           1.00
## 6             10006                  6            2           0.33
## # ... with 13 more variables: COUNT.MALE <int>,
## #   COUNT.PACIFIC.ISLANDER <int>, COUNT.HISPANIC.LATINO <int>,
## #   COUNT.AMERICAN.INDIAN <int>, COUNT.ASIAN.NON.HISPANIC <int>,
## #   COUNT.WHITE.NON.HISPANIC <int>, COUNT.BLACK.NON.HISPANIC <int>,
## #   COUNT.OTHER.ETHNICITY <int>, COUNT.PERMANENT.RESIDENT.ALIEN <int>,
## #   COUNT.US.CITIZEN <int>, COUNT.RECEIVES.PUBLIC.ASSISTANCE <int>,
## #   COUNT.NRECEIVES.PUBLIC.ASSISTANCE <int>,
## #   COUNT.PUBLIC.ASSISTANCE.TOTAL <int>

Identiy zipcodes where female participants live – the owner of a women’s jewelry store would like to determine the ideal zipcode to open her store.

# Number of males and female in dataset

# Female by zipcodes
sum(demoSplit$COUNT.FEMALE, demoSplit$COUNT.MALE)
## [1] 4168
demoFemale <- select(demoSplit, JURISDICTION.NAME, COUNT.FEMALE, PERCENT.FEMALE,
                     COUNT.NRECEIVES.PUBLIC.ASSISTANCE)
               demoFemale <- filter(demoFemale, JURISDICTION.NAME, 
                                    COUNT.NRECEIVES.PUBLIC.ASSISTANCE, 
                                    PERCENT.FEMALE > 0)

arrange(demoFemale, PERCENT.FEMALE)
## # A tibble: 94 × 4
##    JURISDICTION.NAME COUNT.FEMALE PERCENT.FEMALE
##                <int>        <int>          <dbl>
## 1              12768            1           0.02
## 2              12528            2           0.04
## 3              12423            1           0.05
## 4              12435            5           0.09
## 5              13731            2           0.12
## 6              10013            1           0.13
## 7              11220            1           0.13
## 8              11510            6           0.21
## 9              11206            3           0.23
## 10             11434            1           0.25
## # ... with 84 more rows, and 1 more variables:
## #   COUNT.NRECEIVES.PUBLIC.ASSISTANCE <int>

Conclusion

A profile of women by zipcode that do not receive public assistance. This would be important for a business owner interested in opening a business that caters to women.

Dataset 3 Description

People visit hospitals for treatment of medical conditions. Providers charge differently in different parts of the country for the same service. These data include information comparing the charges for the 100 most common inpatient services and 30 common outpatient services.

In this dataset you will notice that providers charge for items/services provided to patients. It will be interestingly to perform exploratory data analysis on this dataset to determine how charges differ based on the location of the provider.

Read data

provider <- read.csv('IPPS__Provider_Summary_FY2011.csv', header=TRUE,stringsAsFactors = FALSE)

head(provider)
##                             DRG.Definition Provider.Id
## 1 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC       10001
## 2 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC       10005
## 3 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC       10006
## 4 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC       10011
## 5 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC       10016
## 6 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC       10023
##                      Provider.Name    Provider.Street.Address
## 1 SOUTHEAST ALABAMA MEDICAL CENTER     1108 ROSS CLARK CIRCLE
## 2    MARSHALL MEDICAL CENTER SOUTH 2505 U S HIGHWAY 431 NORTH
## 3   ELIZA COFFEE MEMORIAL HOSPITAL         205 MARENGO STREET
## 4                ST VINCENT'S EAST 50 MEDICAL PARK EAST DRIVE
## 5    SHELBY BAPTIST MEDICAL CENTER    1000 FIRST STREET NORTH
## 6     BAPTIST MEDICAL CENTER SOUTH  2105 EAST SOUTH BOULEVARD
##   Provider.City Provider.State Provider.Zip.Code
## 1        DOTHAN             AL             36301
## 2          BOAZ             AL             35957
## 3      FLORENCE             AL             35631
## 4    BIRMINGHAM             AL             35235
## 5     ALABASTER             AL             35007
## 6    MONTGOMERY             AL             36116
##   Hospital.Referral.Region.Description Total.Discharges
## 1                          AL - Dothan               91
## 2                      AL - Birmingham               14
## 3                      AL - Birmingham               24
## 4                      AL - Birmingham               25
## 5                      AL - Birmingham               18
## 6                      AL - Montgomery               67
##   Average.Covered.Charges Average.Total.Payments Average.Medicare.Payments
## 1              $32,963.07              $5,777.24                 $4,763.73
## 2              $15,131.85              $5,787.57                 $4,976.71
## 3              $37,560.37              $5,434.95                 $4,453.79
## 4              $13,998.28              $5,417.56                 $4,129.16
## 5              $31,633.27              $5,658.33                 $4,851.44
## 6              $16,920.79              $6,653.80                 $5,374.14

Convert data frame to a table

provider <- tbl_df(provider)
head(provider)
## # A tibble: 6 × 12
##                             DRG.Definition Provider.Id
##                                      <chr>       <int>
## 1 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC       10001
## 2 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC       10005
## 3 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC       10006
## 4 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC       10011
## 5 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC       10016
## 6 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC       10023
## # ... with 10 more variables: Provider.Name <chr>,
## #   Provider.Street.Address <chr>, Provider.City <chr>,
## #   Provider.State <chr>, Provider.Zip.Code <int>,
## #   Hospital.Referral.Region.Description <chr>, Total.Discharges <int>,
## #   Average.Covered.Charges <chr>, Average.Total.Payments <chr>,
## #   Average.Medicare.Payments <chr>

Examine the data

# Check data class
class(provider)
## [1] "tbl_df"     "tbl"        "data.frame"
# Check dimensions of the data
dim(provider)
## [1] 65499    12
# View the column names
names(provider)
##  [1] "DRG.Definition"                      
##  [2] "Provider.Id"                         
##  [3] "Provider.Name"                       
##  [4] "Provider.Street.Address"             
##  [5] "Provider.City"                       
##  [6] "Provider.State"                      
##  [7] "Provider.Zip.Code"                   
##  [8] "Hospital.Referral.Region.Description"
##  [9] "Total.Discharges"                    
## [10] "Average.Covered.Charges"             
## [11] "Average.Total.Payments"              
## [12] "Average.Medicare.Payments"
# View the end of the data
tail(provider)
## # A tibble: 6 × 12
##                     DRG.Definition Provider.Id
##                              <chr>       <int>
## 1 292 - HEART FAILURE & SHOCK W CC      440194
## 2 292 - HEART FAILURE & SHOCK W CC      440197
## 3 292 - HEART FAILURE & SHOCK W CC      440200
## 4 292 - HEART FAILURE & SHOCK W CC      440227
## 5 292 - HEART FAILURE & SHOCK W CC      440228
## 6 292 - HEART FAILURE & SHOCK W CC      450002
## # ... with 10 more variables: Provider.Name <chr>,
## #   Provider.Street.Address <chr>, Provider.City <chr>,
## #   Provider.State <chr>, Provider.Zip.Code <int>,
## #   Hospital.Referral.Region.Description <chr>, Total.Discharges <int>,
## #   Average.Covered.Charges <chr>, Average.Total.Payments <chr>,
## #   Average.Medicare.Payments <chr>

View the structure of the data

library(dplyr)

glimpse(provider)
## Observations: 65,499
## Variables: 12
## $ DRG.Definition                       <chr> "039 - EXTRACRANIAL PROCE...
## $ Provider.Id                          <int> 10001, 10005, 10006, 1001...
## $ Provider.Name                        <chr> "SOUTHEAST ALABAMA MEDICA...
## $ Provider.Street.Address              <chr> "1108 ROSS CLARK CIRCLE",...
## $ Provider.City                        <chr> "DOTHAN", "BOAZ", "FLOREN...
## $ Provider.State                       <chr> "AL", "AL", "AL", "AL", "...
## $ Provider.Zip.Code                    <int> 36301, 35957, 35631, 3523...
## $ Hospital.Referral.Region.Description <chr> "AL - Dothan", "AL - Birm...
## $ Total.Discharges                     <int> 91, 14, 24, 25, 18, 67, 5...
## $ Average.Covered.Charges              <chr> "$32,963.07", "$15,131.85...
## $ Average.Total.Payments               <chr> "$5,777.24", "$5,787.57",...
## $ Average.Medicare.Payments            <chr> "$4,763.73", "$4,976.71",...
summary(provider)
##  DRG.Definition      Provider.Id     Provider.Name     
##  Length:65499       Min.   : 10001   Length:65499      
##  Class :character   1st Qu.:110083   Class :character  
##  Mode  :character   Median :240166   Mode  :character  
##                     Mean   :253788                     
##                     3rd Qu.:380017                     
##                     Max.   :670075                     
##  Provider.Street.Address Provider.City      Provider.State    
##  Length:65499            Length:65499       Length:65499      
##  Class :character        Class :character   Class :character  
##  Mode  :character        Mode  :character   Mode  :character  
##                                                               
##                                                               
##                                                               
##  Provider.Zip.Code Hospital.Referral.Region.Description Total.Discharges
##  Min.   : 1040     Length:65499                         Min.   : 11.00  
##  1st Qu.:27511     Class :character                     1st Qu.: 17.00  
##  Median :44203     Mode  :character                     Median : 28.00  
##  Mean   :47811                                          Mean   : 41.23  
##  3rd Qu.:72308                                          3rd Qu.: 50.00  
##  Max.   :99801                                          Max.   :981.00  
##  Average.Covered.Charges Average.Total.Payments Average.Medicare.Payments
##  Length:65499            Length:65499           Length:65499             
##  Class :character        Class :character       Class :character         
##  Mode  :character        Mode  :character       Mode  :character         
##                                                                          
##                                                                          
## 
names(provider)
##  [1] "DRG.Definition"                      
##  [2] "Provider.Id"                         
##  [3] "Provider.Name"                       
##  [4] "Provider.Street.Address"             
##  [5] "Provider.City"                       
##  [6] "Provider.State"                      
##  [7] "Provider.Zip.Code"                   
##  [8] "Hospital.Referral.Region.Description"
##  [9] "Total.Discharges"                    
## [10] "Average.Covered.Charges"             
## [11] "Average.Total.Payments"              
## [12] "Average.Medicare.Payments"

Look at missing values

sum(is.na(provider))
## [1] 0

Create dataset with variables to be used for analysis

providerSel <- select(provider, DRG.Definition, Provider.Id, Provider.Name,
                      Provider.State, Average.Covered.Charges, Average.Total.Payments,
                      Average.Medicare.Payments)
names(providerSel)
## [1] "DRG.Definition"            "Provider.Id"              
## [3] "Provider.Name"             "Provider.State"           
## [5] "Average.Covered.Charges"   "Average.Total.Payments"   
## [7] "Average.Medicare.Payments"

Examine maximum and minimum average covered charges by state

# Select columns for analysis

providerSelPay <- select(providerSel, Provider.State, Average.Covered.Charges,
                      Average.Total.Payments, Average.Medicare.Payments)

# Remove commas and $ from data and convert to numeric
providerSelPay$Average.Covered.Charges2 = as.numeric(gsub("[$,]", "", 
providerSelPay$Average.Covered.Charges))
providerSelPay$Average.Total.Payments2 = as.numeric(gsub("[$,]", "", 
providerSelPay$Average.Total.Payments))
providerSelPay$Average.Medicare.Payments2 = as.numeric(gsub("[$,]", "", 
providerSelPay$Average.Medicare.Payments))

Calculate means

# Mean of Average.Covered.Charges
mean(providerSelPay$Average.Covered.Charges2)
## [1] 39000.78
# Mean of Average Total Payments
mean(providerSelPay$Average.Total.Payments2)
## [1] 10169.43
# Mean Average Medicare Payments
mean(providerSelPay$Average.Medicare.Payments2)
## [1] 8932.729

Aggregate data for max value and mean by state

# Max by state
temp <- aggregate(Average.Total.Payments2 ~ Provider.State, providerSelPay, function(x) max(x))

arrange(temp, Average.Total.Payments2)
##    Provider.State Average.Total.Payments2
## 1              MT                31074.73
## 2              ND                36440.81
## 3              NM                37069.00
## 4              WY                37486.00
## 5              NE                41186.40
## 6              AR                41460.77
## 7              DE                41724.28
## 8              ME                41980.32
## 9              WV                42507.68
## 10             AL                42550.21
## 11             MS                42693.15
## 12             TN                45062.00
## 13             KY                46301.00
## 14             VT                47048.91
## 15             UT                47584.64
## 16             RI                47912.69
## 17             OR                48050.83
## 18             OK                48065.66
## 19             IA                48488.18
## 20             KS                49510.75
## 21             MO                50506.85
## 22             NH                51065.75
## 23             ID                51141.92
## 24             GA                51756.66
## 25             OH                52050.45
## 26             LA                53360.92
## 27             TX                54004.36
## 28             IN                54253.25
## 29             HI                54559.00
## 30             NC                57591.39
## 31             VA                57963.38
## 32             CO                58690.73
## 33             WI                61774.00
## 34             AZ                62268.76
## 35             NJ                62362.50
## 36             AK                62665.37
## 37             MN                62770.42
## 38             FL                62807.04
## 39             MA                64026.55
## 40             NV                64959.26
## 41             CT                66397.45
## 42             DC                67720.37
## 43             MD                76657.81
## 44             SC                76686.46
## 45             PA                80549.67
## 46             MI                83332.21
## 47             SD                83484.14
## 48             WA                84499.26
## 49             IL                85987.00
## 50             NY                99929.46
## 51             CA               156158.18
 # Mean by state
temp2 <-aggregate(Average.Total.Payments2 ~ Provider.State, providerSelPay, function(x) mean(x))
arrange(temp2, Average.Total.Payments2)
##    Provider.State Average.Total.Payments2
## 1              AL                7964.114
## 2              TN                8530.343
## 3              AR                8579.398
## 4              IA                8617.359
## 5              WV                8650.539
## 6              OK                8677.653
## 7              MS                8687.839
## 8              KS                8786.129
## 9              KY                8909.264
## 10             ME                8921.078
## 11             GA                9114.459
## 12             MO                9151.982
## 13             IN                9161.962
## 14             MT                9244.758
## 15             LA                9310.777
## 16             VA                9356.982
## 17             FL                9362.872
## 18             OH                9370.281
## 19             NC                9377.806
## 20             CO                9517.121
## 21             NH                9538.795
## 22             NE                9545.152
## 23             SC                9605.290
## 24             PA                9610.712
## 25             TX                9710.581
## 26             ID                9723.252
## 27             WI                9831.411
## 28             NM                9947.577
## 29             UT               10175.722
## 30             MI               10184.972
## 31             IL               10343.051
## 32             SD               10364.195
## 33             MN               10382.305
## 34             OR               10432.892
## 35             ND               10436.264
## 36             AZ               10592.956
## 37             RI               10856.980
## 38             MA               10858.709
## 39             DE               10953.379
## 40             NV               10957.796
## 41             WA               11073.249
## 42             NJ               11273.043
## 43             CT               11880.119
## 44             WY               12102.500
## 45             VT               12109.385
## 46             NY               12364.821
## 47             CA               13194.969
## 48             HI               13272.549
## 49             MD               13319.723
## 50             DC               13741.709
## 51             AK               14721.429

Conclusion Charges by provider vary by states. Average total payments are the highest in California and lowest in Montana.