1.Create a .CSV file (or optionally, a MySQL database!) that includes one of the “wide” datasets identified in the Week 6 Discussion items. You’re encouraged to use a “wide” structure similar to how the information appears above, sothat you can practice tidying and transformations as described below.

I chose dataset that was posted by Josh Iden and that includes the selected characteristics of artists and the total labor force between 2015 and 2019.

force <- rbind(c("", "Entire U.S. labor force", "All artists",  "Architects",   "Fine artists, art directors, animators",   "Designers",    "Actors",   "Producers and directors",  "Dancers and choreographers",   "Musicians",    "Entertainers", "Announcers",   "Writers and authors",  "Photographers"),
             c("Number in the labor force", "164 465 375",  "2 403 842",    "208 008",  "232 092",  "933 156",  "60 986",   "171 273",  "23 385",   "216 065",  "49 025",   "72 128",   "247 570",  "190 154"),
             c("","","","","","","","","","","","","",""),
             c("Percent non-white or Hispanic", "37.9%",    "26.5%",    "25.1%",    "24.6%",    "27.6%",    "33.2%",    "25.2%",    "44.1%",    "28.9%", "31.7%",   "36.1%",    "17.3%",    "26.1%"),
             c("","","","","","","","","","","","","",""),
             c("Percent female",    "47.2%",    "47.6%",    "27.6%",    "46.1%",    "55.3%",    "45.4%",    "37.6%",    "77.4%",    "33.6%",    "45.7%",    "28.8%",    "59.8%",    "47.6%"),
             c("","","","","","","","","","","","","",""),
             c("Median age",    42, 40, 45, 42, 40, 36, 39, 27, 45, 33, 40, 44, 37),
             c("","","","","","","","","","","","","",""),
             c("Percent with a disability", "6.2%", "5.3%", "3.2%", "7.3%", "4.6%", "5.5%", "3.3%", "6.6%", "7.4%", "7.8%", "7.6%", "6.1%", "5.2%"),
             c("","","","","","","","","","","","","",""),
             c("Percent self-employed", "9.3%", "33.6%",    "23.6%",    "53.8%",    "23.3%",    "37.8%",    "22.7%",    "23.1%",    "45.0%",    "45.9%",    "25.6%",    "41.8%",    "56.8%"),
             c("","","","","","","","","","","","","",""),
             c("Percent working full-year/full-time",   "66.0%",    "59.8%",    "82.3%",    "56.1%",    "68.3%",    "23.5%",    "72.3%",    "36.8%",    "32.6%",    "37.5%",    "52.6%",    "55.1%",    "46.5%"),
             c("","","","","","","","","","","","","",""),
             c("Percent with a bachelor's degree or higher level of training",  "37.3%",    "64.4%",    "90.7%",    "58.3%",    "59.9%",    "61.4%",    "74.9%",    "31.4%",    "57.0%",    "44.5%",    "46.0%","83.2%",    "51.2%"),
             c("","","","","","","","","","","","","",""),
             c("Median earnings of workers employed full-year/full time",   "$48 305",  "$58 005",  "$83 195",  "$50 505",  "$55 560",  "$41 060",  "$69 605",  "$36 365",  "$45 875",  "$41 260",  "$51 675",  "$60 115",  "$42 940"),
             c("Men",   "$53 670",  "$64 405",  "$87 675",  "$54 025",  "$63 275",  "$43 220",  "$70 710",  "*",    "$46 620",  "$48 620",  "$52 730",  "$64 830",  "$47 455"),
             c("Women", "$42 940",  "$51 575",  "$70 710",  "$42 425",  "$50 505",  "$37 570",  "$64 405",  "*",    "$42 940",  "$31 640",  "$48 485",  "$56 950",  "$36 465"),
             c("Women's-to-men's earnings ratio",   "$0.80",    "$0.80",    "$0.81",    "$0.79",    "$0.80",    "$0.87",    "$0.91",    "*",    "$0.92",    "$0.65",    "$0.92",    "$0.88",    "$0.77"),
             c("","","","","","","","","","","","","",""),
             c("White", "$53 670",  "$59 040",  "$84 370",  "$50 505",  "$55 820",  "$45 455", "$71 715",   "*",    "$47 475",  "$42 940",  "$53 635",  "$60 610",  "$43 220"),
             c("Non-white", "$38 895",  "$55 560",  "$79 420",  "$50 130",  "$55 105",  "$35 355",  "$60 115",  "*",    "$40 075",  "$37 275",  "$47 445",  "$59 425",  "$40 225"),
             c("Non-white-to-white earnings ratio", "$0.72",    "$0.94",    "$0.94",    "$0.99",    "$0.99",    "$0.78",    "$0.84",    "*",    "$0.84",    "$0.87",    "$0.88",    "$0.98",    "$0.93"),
             c("","","","","","","","","","","","","",""),
             c("In sample", "7 818 941",    "119 389",  "10 868",   "11 962",   "45 788",   "2 825",    "8 500",    "939",  "10 972",   "2 243",    "3 257",    "12 927",   "9 108")
             )

write.table(force, file = "force.csv", sep = ",", col.names=F, row.names=F)

2. Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.

2.1 Load a csv file from Github

The file will be downloaded from the Github repository to csv file using read.csv function.

force <- read.csv('https://raw.githubusercontent.com/ex-pr/DATA607/Project-2/force.csv', header=TRUE, sep=",", check.names=FALSE)

By checking the data downloaded, we have 14 column with 26 rows.

head(force,n=3)
##                                 Entire U.S. labor force All artists Architects
## 1     Number in the labor force             164 465 375   2 403 842    208 008
## 2                                                                             
## 3 Percent non-white or Hispanic                   37.9%       26.5%      25.1%
##   Fine artists, art directors, animators Designers Actors
## 1                                232 092   933 156 60 986
## 2                                                        
## 3                                  24.6%     27.6%  33.2%
##   Producers and directors Dancers and choreographers Musicians Entertainers
## 1                 171 273                     23 385   216 065       49 025
## 2                                                                          
## 3                   25.2%                      44.1%     28.9%        31.7%
##   Announcers Writers and authors Photographers
## 1     72 128             247 570       190 154
## 2                                             
## 3      36.1%               17.3%         26.1%
summary(force)
##                     Entire U.S. labor force All artists       
##  Length:26          Length:26               Length:26         
##  Class :character   Class :character        Class :character  
##  Mode  :character   Mode  :character        Mode  :character  
##   Architects        Fine artists, art directors, animators  Designers        
##  Length:26          Length:26                              Length:26         
##  Class :character   Class :character                       Class :character  
##  Mode  :character   Mode  :character                       Mode  :character  
##     Actors          Producers and directors Dancers and choreographers
##  Length:26          Length:26               Length:26                 
##  Class :character   Class :character        Class :character          
##  Mode  :character   Mode  :character        Mode  :character          
##   Musicians         Entertainers        Announcers        Writers and authors
##  Length:26          Length:26          Length:26          Length:26          
##  Class :character   Class :character   Class :character   Class :character   
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character   
##  Photographers     
##  Length:26         
##  Class :character  
##  Mode  :character
dim(force)
## [1] 26 14

2.2 Transform data

Tidy data should follow the rules: each variable is its own column, each observation is its own row, and each value is its own cell. At the current moment, we have “wide data” an can transform it to “long data” as it is better to work with.
We are going to work with data from csv file and use tidyverse package to transform it.
First of all, we will remove empty rows.

force <- force[!apply(force == "", 1, all), ] 
head(force, n=3)
##                                 Entire U.S. labor force All artists Architects
## 1     Number in the labor force             164 465 375   2 403 842    208 008
## 3 Percent non-white or Hispanic                   37.9%       26.5%      25.1%
## 5                Percent female                   47.2%       47.6%      27.6%
##   Fine artists, art directors, animators Designers Actors
## 1                                232 092   933 156 60 986
## 3                                  24.6%     27.6%  33.2%
## 5                                  46.1%     55.3%  45.4%
##   Producers and directors Dancers and choreographers Musicians Entertainers
## 1                 171 273                     23 385   216 065       49 025
## 3                   25.2%                      44.1%     28.9%        31.7%
## 5                   37.6%                      77.4%     33.6%        45.7%
##   Announcers Writers and authors Photographers
## 1     72 128             247 570       190 154
## 3      36.1%               17.3%         26.1%
## 5      28.8%               59.8%         47.6%

On the second step, we will work remove % and $ sign. No % sign is needed as the word “percent” is already in the names of the rows.

force[]<-lapply(force,gsub,pattern="$",fixed=TRUE,replacement="")
force[]<-lapply(force,gsub,pattern="%",fixed=TRUE,replacement="")

As there are only numeric value but presented as characters, we will transform them to numeric. The first column contains spaces in numbers, the code below will remove them.

force[,2:14]<-lapply(force[,2:14],gsub,pattern=" ",fixed=TRUE,replacement="")
head(force, n=3)
##                                 Entire U.S. labor force All artists Architects
## 1     Number in the labor force               164465375     2403842     208008
## 3 Percent non-white or Hispanic                    37.9        26.5       25.1
## 5                Percent female                    47.2        47.6       27.6
##   Fine artists, art directors, animators Designers Actors
## 1                                 232092    933156  60986
## 3                                   24.6      27.6   33.2
## 5                                   46.1      55.3   45.4
##   Producers and directors Dancers and choreographers Musicians Entertainers
## 1                  171273                      23385    216065        49025
## 3                    25.2                       44.1      28.9         31.7
## 5                    37.6                       77.4      33.6         45.7
##   Announcers Writers and authors Photographers
## 1      72128              247570        190154
## 3       36.1                17.3          26.1
## 5       28.8                59.8          47.6
force[, 2:14] <- apply(force[, 2:14], 2,       
                    function(x) as.numeric(as.character(x)))
str(force)
## 'data.frame':    16 obs. of  14 variables:
##  $                                       : chr  "Number in the labor force" "Percent non-white or Hispanic" "Percent female" "Median age" ...
##  $ Entire U.S. labor force               : num  1.64e+08 3.79e+01 4.72e+01 4.20e+01 6.20 ...
##  $ All artists                           : num  2.40e+06 2.65e+01 4.76e+01 4.00e+01 5.30 ...
##  $ Architects                            : num  208008 25.1 27.6 45 3.2 ...
##  $ Fine artists, art directors, animators: num  232092 24.6 46.1 42 7.3 ...
##  $ Designers                             : num  933156 27.6 55.3 40 4.6 ...
##  $ Actors                                : num  60986 33.2 45.4 36 5.5 ...
##  $ Producers and directors               : num  171273 25.2 37.6 39 3.3 ...
##  $ Dancers and choreographers            : num  23385 44.1 77.4 27 6.6 ...
##  $ Musicians                             : num  216065 28.9 33.6 45 7.4 ...
##  $ Entertainers                          : num  49025 31.7 45.7 33 7.8 ...
##  $ Announcers                            : num  72128 36.1 28.8 40 7.6 ...
##  $ Writers and authors                   : num  247570 17.3 59.8 44 6.1 ...
##  $ Photographers                         : num  190154 26.1 47.6 37 5.2 ...

Column 1 will become row names.

rownames(force) <- force[,1]

We will rename first column, make it as the names of rows and transpose dataframe to make columns as rows and rows as columns.

names(force)[1] <- 'Info'
force <- force %>% select(-Info)
force_new <- transpose(force)


rownames(force_new) <- colnames(force)
colnames(force_new) <- rownames(force)

head(force_new,n=3)
##                         Number in the labor force Percent non-white or Hispanic
## Entire U.S. labor force                 164465375                          37.9
## All artists                               2403842                          26.5
## Architects                                 208008                          25.1
##                         Percent female Median age Percent with a disability
## Entire U.S. labor force           47.2         42                       6.2
## All artists                       47.6         40                       5.3
## Architects                        27.6         45                       3.2
##                         Percent self-employed
## Entire U.S. labor force                   9.3
## All artists                              33.6
## Architects                               23.6
##                         Percent working full-year/full-time
## Entire U.S. labor force                                66.0
## All artists                                            59.8
## Architects                                             82.3
##                         Percent with a bachelor's degree or higher level of training
## Entire U.S. labor force                                                         37.3
## All artists                                                                     64.4
## Architects                                                                      90.7
##                         Median earnings of workers employed full-year/full time
## Entire U.S. labor force                                                   48305
## All artists                                                               58005
## Architects                                                                83195
##                           Men Women Women's-to-men's earnings ratio White
## Entire U.S. labor force 53670 42940                            0.80 53670
## All artists             64405 51575                            0.80 59040
## Architects              87675 70710                            0.81 84370
##                         Non-white Non-white-to-white earnings ratio In sample
## Entire U.S. labor force     38895                              0.72   7818941
## All artists                 55560                              0.94    119389
## Architects                  79420                              0.94     10868

As the next step, “Men” and “Women” columns will become one column.

force_new$force_type <- rownames(force_new)
force_new <- force_new %>%           
  dplyr::select("force_type", everything())

rownames(force_new) <- 1:nrow(force_new)
force <- force_new %>% 
  pivot_longer(c(`Men`, `Women`), names_to="Sex", values_to="Salary_sex")
head(force, n=3)
## # A tibble: 3 x 17
##   force_type     Numbe~1 Perce~2 Perce~3 Media~4 Perce~5 Perce~6 Perce~7 Perce~8
##   <chr>            <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 Entire U.S. l~  1.64e8    37.9    47.2      42     6.2     9.3    66      37.3
## 2 Entire U.S. l~  1.64e8    37.9    47.2      42     6.2     9.3    66      37.3
## 3 All artists     2.40e6    26.5    47.6      40     5.3    33.6    59.8    64.4
## # ... with 8 more variables:
## #   `Median earnings of workers employed full-year/full time` <dbl>,
## #   `Women's-to-men's earnings ratio` <dbl>, White <dbl>, `Non-white` <dbl>,
## #   `Non-white-to-white earnings ratio` <dbl>, `In sample` <dbl>, Sex <chr>,
## #   Salary_sex <dbl>, and abbreviated variable names
## #   1: `Number in the labor force`, 2: `Percent non-white or Hispanic`,
## #   3: `Percent female`, 4: `Median age`, 5: `Percent with a disability`, ...

3. Perform the analysis requested in the discussion item.

Let’s check artists’ income by type and overall U.S. labor force.
From the plot below, architects earn more than other labor forces, dancers and choreographers less than everyone else.

ggplot(force[5:26,], aes(force_type, `Median earnings of workers employed full-year/full time`)) + geom_col(fill='blue') +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1), plot.title = element_text(hjust = 0.5)) +
  labs(title="Median salary by type of force", x= "Type of Force", y = "Median salary")


Next, we will check the salary for each type of force and sex. Dancers and choreographers are not included as there is no information about their salaries based on sex.
There is observation that in general, men earn more than women in each type of labor force.

ggplot(force[-(15:16),],aes(x = force_type,y = Salary_sex)) + 
    geom_bar(aes(fill = Sex),stat = "identity",position = "dodge") + 
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1), plot.title = element_text(hjust = 0.5)) +
    labs(title="Median salary by sex", x= "Type of Force", y = "Median salary")


Entertainers has the most inequality in term of salary for women, men are paid much more. Women earn only 65% of men salary.

force[force[,11] == min(force[,11],na.rm = TRUE),]
## # A tibble: 4 x 17
##   force_type   Number ~1 Perce~2 Perce~3 Media~4 Perce~5 Perce~6 Perce~7 Perce~8
##   <chr>            <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 <NA>                NA    NA      NA        NA    NA      NA      NA      NA  
## 2 <NA>                NA    NA      NA        NA    NA      NA      NA      NA  
## 3 Entertainers     49025    31.7    45.7      33     7.8    45.9    37.5    44.5
## 4 Entertainers     49025    31.7    45.7      33     7.8    45.9    37.5    44.5
## # ... with 8 more variables:
## #   `Median earnings of workers employed full-year/full time` <dbl>,
## #   `Women's-to-men's earnings ratio` <dbl>, White <dbl>, `Non-white` <dbl>,
## #   `Non-white-to-white earnings ratio` <dbl>, `In sample` <dbl>, Sex <chr>,
## #   Salary_sex <dbl>, and abbreviated variable names
## #   1: `Number in the labor force`, 2: `Percent non-white or Hispanic`,
## #   3: `Percent female`, 4: `Median age`, 5: `Percent with a disability`, ...


Just to check again, we will plot boxplot to confirm the previous observation, men earn more that women in general.

ggplot(force, aes(x=Sex, y=Salary_sex, color=Sex)) + 
  geom_boxplot() +
  theme_light() +
  labs(title="Salary based on sex", x= "Sex", y = "Salary") +
  theme(plot.title = element_text(hjust = 0.5))


It would be interesting to know which field has the most freelancers. Based on the results of scatter plot, photographers are self-employed in general.

ggplot(force[5:26,],aes(x =force_type,y = `Percent self-employed`)) + 
    geom_point(size=3, color='red') + 
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1), plot.title = element_text(hjust = 0.5)) +
    labs(title="Percent self-employed based on force type", x= "Force type", y = "% self-employed") 


4. Conclusion.

During the work, we have reviewed how to organize data in a way called “tidy data” using package tidyverse, how to transform wide format to long format.
By analyzing data, we found that if you would like to join labor forces, it is lucrative to join team “Architects” as they earn more than $80k and it will difficult to earn in team “Dancers and choreographers” as they have salary below 40k.
If you would like to work for yourself, it is the best to join Photographers, the majority of them are freelancers and have a good salary. Unfortunately, there is inequality in salaries based on gender, men earn more than women in every field of labor forces. The most unequal force is entertainers, women earn there only 65% of men salary.