Workshop 2: Import, Tidy and Visualization in R

Import the data
- CSV
- Excel
- Stata
Tidy the data
Visualize the data

Import the data

This section will show how to import different formts of datasets into R, and tidy them and then visualize them.

CSV

In computing, a comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. A CSV file stores tabular data (numbers and text) in plain text. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. The use of the comma as a field separator is the source of the name for this file format.

ny_job <- read_csv(file = '/Users/Michael/Library/Mobile Documents/com~apple~CloudDocs/CMSE/Data Analysis/Lectures/L2/ny_job.csv')
ny_job

## # A tibble: 1,260 x 5
##     Year Region         `NAICS Code` Industry                         Jobs
##    <int> <chr>          <chr>        <chr>                           <int>
##  1  2012 Capital Region 11           Agriculture, Forestry, Fishing…  2183
##  2  2012 Capital Region 21           Mining, Quarrying, and Oil and…   733
##  3  2012 Capital Region 22           Utilities                        1838
##  4  2012 Capital Region 23           Construction                    19514
##  5  2012 Capital Region 31-33        Manufacturing                   30812
##  6  2012 Capital Region 42           Wholesale Trade                 14606
##  7  2012 Capital Region 44-45        Retail Trade                    58341
##  8  2012 Capital Region 48-49        Transportation and Warehousing  11041
##  9  2012 Capital Region 51           Information                      9577
## 10  2012 Capital Region 52           Finance and Insurance           21556
## # ... with 1,250 more rows

unique(ny_job$Year)

## [1] 2012 2013 2014 2015 2017 2016

unique(ny_job$Industry)

##  [1] "Agriculture, Forestry, Fishing and Hunting"                              
##  [2] "Mining, Quarrying, and Oil and Gas Extraction"                           
##  [3] "Utilities"                                                               
##  [4] "Construction"                                                            
##  [5] "Manufacturing"                                                           
##  [6] "Wholesale Trade"                                                         
##  [7] "Retail Trade"                                                            
##  [8] "Transportation and Warehousing"                                          
##  [9] "Information"                                                             
## [10] "Finance and Insurance"                                                   
## [11] "Real Estate and Rental and Leasing"                                      
## [12] "Professional, Scientific, and Technical Services"                        
## [13] "Management of Companies and Enterprises"                                 
## [14] "Administrative and Support and Waste Management and Remediation Services"
## [15] "Educational Services (Private)"                                          
## [16] "Health Care and Social Assistance"                                       
## [17] "Arts, Entertainment, and Recreation"                                     
## [18] "Accommodation and Food Services"                                         
## [19] "Other Services (except Public Administration)"                           
## [20] "Government"                                                              
## [21] "Unclassified Industry"                                                   
## [22] "Educational Services"

Now, we will use the main functions from package tidyverse, to analyze the data.

ny_job %>% # pipe function to help you progress the analysis 
        filter(Year == 2016) %>% 
        arrange(desc(Jobs))

## # A tibble: 210 x 5
##     Year Region        `NAICS Code` Industry                          Jobs
##    <int> <chr>         <chr>        <chr>                            <int>
##  1  2016 New York City 62           Health Care and Social Assista… 720834
##  2  2016 New York City 90           Government                      572528
##  3  2016 New York City 54           Professional, Scientific, and … 438503
##  4  2016 New York City 44           Retail Trade                    360549
##  5  2016 New York City 72           Accommodation and Food Services 356058
##  6  2016 New York City 52           Finance and Insurance           334808
##  7  2016 New York City 56           Administrative and Support and… 239765
##  8  2016 New York City 81           Other Services (except Public … 238033
##  9  2016 New York City 61           Educational Services            234167
## 10  2016 Long Island   62           Health Care and Social Assista… 226854
## # ... with 200 more rows

We can collect the different years and find average jobs for each year

ny_job %>%
        group_by(Year) %>%
        summarise(avg = mean(Jobs)) %>%
        arrange(desc(avg))

## # A tibble: 6 x 2
##    Year    avg
##   <int>  <dbl>
## 1  2015 46429.
## 2  2016 45248.
## 3  2017 44596.
## 4  2014 41234.
## 5  2013 40654.
## 6  2012 40109.

Now, let’s collect the different industries and find the average jobs for each industry

library(DT)
ny_job %>%
        group_by(Industry) %>%
        summarise(avg = round(mean(Jobs))) %>%
        arrange(avg) %>%
        datatable()

Excel

Now, we will read excel filt into the R

library(readxl)
lab <- read_excel(path = '/Users/Michael/Library/Mobile Documents/com~apple~CloudDocs/CMSE/Data Analysis/Lectures/L2/Data lab 1.xls')
lab

## # A tibble: 4,290 x 12
##    er30001 er30002 er32000 er32022 er32049 er30733 er30734 er30735 er30736
##      <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1       4       3       1       2       1    2482       1      10      40
##  2       4     170       1       2       1    6974       1      10      37
##  3       4     172       1       2       4    6662       1      10      43
##  4       4     175       1       2       4    6201       1      10      32
##  5       5       1       1       5       1    4600       1      10      48
##  6       5       3       1       2       1    4425       1      10      30
##  7       5     171       1      99       1    4335       1      10      45
##  8       6       6       1       3       1    1688       1      10      37
##  9       6     170       1       3       1    1077       1      10      47
## 10       7       3       1       3       1    1584       1      10      38
## # ... with 4,280 more rows, and 3 more variables: er30748 <dbl>,
## #   er30750 <dbl>, er30754 <dbl>

Stata

Now, we will read stata data into the R

library(readstata13)
asg <- read.dta13(file = '/Users/Michael/Library/Mobile Documents/com~apple~CloudDocs/CMSE/Data Analysis/Lectures/L2/bhps_assignment.dta')
asg <- tbl_df(asg)

Import the data

CSV

Excel

Stata

Tidy the data

Visualize the data