Import the data
This section will show how to import different formts of datasets into R, and tidy them and then visualize them.
CSV
In computing, a comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. A CSV file stores tabular data (numbers and text) in plain text. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. The use of the comma as a field separator is the source of the name for this file format.
ny_job <- read_csv(file = '/Users/Michael/Library/Mobile Documents/com~apple~CloudDocs/CMSE/Data Analysis/Lectures/L2/ny_job.csv')
ny_job
## # A tibble: 1,260 x 5
## Year Region `NAICS Code` Industry Jobs
## <int> <chr> <chr> <chr> <int>
## 1 2012 Capital Region 11 Agriculture, Forestry, Fishing… 2183
## 2 2012 Capital Region 21 Mining, Quarrying, and Oil and… 733
## 3 2012 Capital Region 22 Utilities 1838
## 4 2012 Capital Region 23 Construction 19514
## 5 2012 Capital Region 31-33 Manufacturing 30812
## 6 2012 Capital Region 42 Wholesale Trade 14606
## 7 2012 Capital Region 44-45 Retail Trade 58341
## 8 2012 Capital Region 48-49 Transportation and Warehousing 11041
## 9 2012 Capital Region 51 Information 9577
## 10 2012 Capital Region 52 Finance and Insurance 21556
## # ... with 1,250 more rows
unique(ny_job$Year)
## [1] 2012 2013 2014 2015 2017 2016
unique(ny_job$Industry)
## [1] "Agriculture, Forestry, Fishing and Hunting"
## [2] "Mining, Quarrying, and Oil and Gas Extraction"
## [3] "Utilities"
## [4] "Construction"
## [5] "Manufacturing"
## [6] "Wholesale Trade"
## [7] "Retail Trade"
## [8] "Transportation and Warehousing"
## [9] "Information"
## [10] "Finance and Insurance"
## [11] "Real Estate and Rental and Leasing"
## [12] "Professional, Scientific, and Technical Services"
## [13] "Management of Companies and Enterprises"
## [14] "Administrative and Support and Waste Management and Remediation Services"
## [15] "Educational Services (Private)"
## [16] "Health Care and Social Assistance"
## [17] "Arts, Entertainment, and Recreation"
## [18] "Accommodation and Food Services"
## [19] "Other Services (except Public Administration)"
## [20] "Government"
## [21] "Unclassified Industry"
## [22] "Educational Services"
Now, we will use the main functions from package tidyverse
, to analyze the data.
ny_job %>% # pipe function to help you progress the analysis
filter(Year == 2016) %>%
arrange(desc(Jobs))
## # A tibble: 210 x 5
## Year Region `NAICS Code` Industry Jobs
## <int> <chr> <chr> <chr> <int>
## 1 2016 New York City 62 Health Care and Social Assista… 720834
## 2 2016 New York City 90 Government 572528
## 3 2016 New York City 54 Professional, Scientific, and … 438503
## 4 2016 New York City 44 Retail Trade 360549
## 5 2016 New York City 72 Accommodation and Food Services 356058
## 6 2016 New York City 52 Finance and Insurance 334808
## 7 2016 New York City 56 Administrative and Support and… 239765
## 8 2016 New York City 81 Other Services (except Public … 238033
## 9 2016 New York City 61 Educational Services 234167
## 10 2016 Long Island 62 Health Care and Social Assista… 226854
## # ... with 200 more rows
We can collect the different years and find average jobs for each year
ny_job %>%
group_by(Year) %>%
summarise(avg = mean(Jobs)) %>%
arrange(desc(avg))
## # A tibble: 6 x 2
## Year avg
## <int> <dbl>
## 1 2015 46429.
## 2 2016 45248.
## 3 2017 44596.
## 4 2014 41234.
## 5 2013 40654.
## 6 2012 40109.
Now, let’s collect the different industries and find the average jobs for each industry
library(DT)
ny_job %>%
group_by(Industry) %>%
summarise(avg = round(mean(Jobs))) %>%
arrange(avg) %>%
datatable()
Excel
Now, we will read excel filt into the R
library(readxl)
lab <- read_excel(path = '/Users/Michael/Library/Mobile Documents/com~apple~CloudDocs/CMSE/Data Analysis/Lectures/L2/Data lab 1.xls')
lab
## # A tibble: 4,290 x 12
## er30001 er30002 er32000 er32022 er32049 er30733 er30734 er30735 er30736
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 4 3 1 2 1 2482 1 10 40
## 2 4 170 1 2 1 6974 1 10 37
## 3 4 172 1 2 4 6662 1 10 43
## 4 4 175 1 2 4 6201 1 10 32
## 5 5 1 1 5 1 4600 1 10 48
## 6 5 3 1 2 1 4425 1 10 30
## 7 5 171 1 99 1 4335 1 10 45
## 8 6 6 1 3 1 1688 1 10 37
## 9 6 170 1 3 1 1077 1 10 47
## 10 7 3 1 3 1 1584 1 10 38
## # ... with 4,280 more rows, and 3 more variables: er30748 <dbl>,
## # er30750 <dbl>, er30754 <dbl>
Stata
Now, we will read stata data into the R
library(readstata13)
asg <- read.dta13(file = '/Users/Michael/Library/Mobile Documents/com~apple~CloudDocs/CMSE/Data Analysis/Lectures/L2/bhps_assignment.dta')
asg <- tbl_df(asg)