Workshop 2: Import, Tidy and Visualization in R

Michael

2018-07-15

Import the data

This section will show how to import different formts of datasets into R, and tidy them and then visualize them.

CSV

In computing, a comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. A CSV file stores tabular data (numbers and text) in plain text. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. The use of the comma as a field separator is the source of the name for this file format.

ny_job <- read_csv(file = '/Users/Michael/Library/Mobile Documents/com~apple~CloudDocs/CMSE/Data Analysis/Lectures/L2/ny_job.csv')
ny_job
## # A tibble: 1,260 x 5
##     Year Region         `NAICS Code` Industry                         Jobs
##    <int> <chr>          <chr>        <chr>                           <int>
##  1  2012 Capital Region 11           Agriculture, Forestry, Fishing…  2183
##  2  2012 Capital Region 21           Mining, Quarrying, and Oil and…   733
##  3  2012 Capital Region 22           Utilities                        1838
##  4  2012 Capital Region 23           Construction                    19514
##  5  2012 Capital Region 31-33        Manufacturing                   30812
##  6  2012 Capital Region 42           Wholesale Trade                 14606
##  7  2012 Capital Region 44-45        Retail Trade                    58341
##  8  2012 Capital Region 48-49        Transportation and Warehousing  11041
##  9  2012 Capital Region 51           Information                      9577
## 10  2012 Capital Region 52           Finance and Insurance           21556
## # ... with 1,250 more rows
unique(ny_job$Year)
## [1] 2012 2013 2014 2015 2017 2016
unique(ny_job$Industry)
##  [1] "Agriculture, Forestry, Fishing and Hunting"                              
##  [2] "Mining, Quarrying, and Oil and Gas Extraction"                           
##  [3] "Utilities"                                                               
##  [4] "Construction"                                                            
##  [5] "Manufacturing"                                                           
##  [6] "Wholesale Trade"                                                         
##  [7] "Retail Trade"                                                            
##  [8] "Transportation and Warehousing"                                          
##  [9] "Information"                                                             
## [10] "Finance and Insurance"                                                   
## [11] "Real Estate and Rental and Leasing"                                      
## [12] "Professional, Scientific, and Technical Services"                        
## [13] "Management of Companies and Enterprises"                                 
## [14] "Administrative and Support and Waste Management and Remediation Services"
## [15] "Educational Services (Private)"                                          
## [16] "Health Care and Social Assistance"                                       
## [17] "Arts, Entertainment, and Recreation"                                     
## [18] "Accommodation and Food Services"                                         
## [19] "Other Services (except Public Administration)"                           
## [20] "Government"                                                              
## [21] "Unclassified Industry"                                                   
## [22] "Educational Services"

Now, we will use the main functions from package tidyverse, to analyze the data.

ny_job %>% # pipe function to help you progress the analysis 
        filter(Year == 2016) %>% 
        arrange(desc(Jobs))
## # A tibble: 210 x 5
##     Year Region        `NAICS Code` Industry                          Jobs
##    <int> <chr>         <chr>        <chr>                            <int>
##  1  2016 New York City 62           Health Care and Social Assista… 720834
##  2  2016 New York City 90           Government                      572528
##  3  2016 New York City 54           Professional, Scientific, and … 438503
##  4  2016 New York City 44           Retail Trade                    360549
##  5  2016 New York City 72           Accommodation and Food Services 356058
##  6  2016 New York City 52           Finance and Insurance           334808
##  7  2016 New York City 56           Administrative and Support and… 239765
##  8  2016 New York City 81           Other Services (except Public … 238033
##  9  2016 New York City 61           Educational Services            234167
## 10  2016 Long Island   62           Health Care and Social Assista… 226854
## # ... with 200 more rows

We can collect the different years and find average jobs for each year

ny_job %>%
        group_by(Year) %>%
        summarise(avg = mean(Jobs)) %>%
        arrange(desc(avg))
## # A tibble: 6 x 2
##    Year    avg
##   <int>  <dbl>
## 1  2015 46429.
## 2  2016 45248.
## 3  2017 44596.
## 4  2014 41234.
## 5  2013 40654.
## 6  2012 40109.

Now, let’s collect the different industries and find the average jobs for each industry

library(DT)
ny_job %>%
        group_by(Industry) %>%
        summarise(avg = round(mean(Jobs))) %>%
        arrange(avg) %>%
        datatable()

Excel

Now, we will read excel filt into the R

library(readxl)
lab <- read_excel(path = '/Users/Michael/Library/Mobile Documents/com~apple~CloudDocs/CMSE/Data Analysis/Lectures/L2/Data lab 1.xls')
lab
## # A tibble: 4,290 x 12
##    er30001 er30002 er32000 er32022 er32049 er30733 er30734 er30735 er30736
##      <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1       4       3       1       2       1    2482       1      10      40
##  2       4     170       1       2       1    6974       1      10      37
##  3       4     172       1       2       4    6662       1      10      43
##  4       4     175       1       2       4    6201       1      10      32
##  5       5       1       1       5       1    4600       1      10      48
##  6       5       3       1       2       1    4425       1      10      30
##  7       5     171       1      99       1    4335       1      10      45
##  8       6       6       1       3       1    1688       1      10      37
##  9       6     170       1       3       1    1077       1      10      47
## 10       7       3       1       3       1    1584       1      10      38
## # ... with 4,280 more rows, and 3 more variables: er30748 <dbl>,
## #   er30750 <dbl>, er30754 <dbl>

Stata

Now, we will read stata data into the R

library(readstata13)
asg <- read.dta13(file = '/Users/Michael/Library/Mobile Documents/com~apple~CloudDocs/CMSE/Data Analysis/Lectures/L2/bhps_assignment.dta')
asg <- tbl_df(asg)

Tidy the data

Visualize the data