I chose dataset that was posted by Josh Iden and that includes the selected characteristics of artists and the total labor force between 2015 and 2019.
force <- rbind(c("", "Entire U.S. labor force", "All artists", "Architects", "Fine artists, art directors, animators", "Designers", "Actors", "Producers and directors", "Dancers and choreographers", "Musicians", "Entertainers", "Announcers", "Writers and authors", "Photographers"),
c("Number in the labor force", "164 465 375", "2 403 842", "208 008", "232 092", "933 156", "60 986", "171 273", "23 385", "216 065", "49 025", "72 128", "247 570", "190 154"),
c("","","","","","","","","","","","","",""),
c("Percent non-white or Hispanic", "37.9%", "26.5%", "25.1%", "24.6%", "27.6%", "33.2%", "25.2%", "44.1%", "28.9%", "31.7%", "36.1%", "17.3%", "26.1%"),
c("","","","","","","","","","","","","",""),
c("Percent female", "47.2%", "47.6%", "27.6%", "46.1%", "55.3%", "45.4%", "37.6%", "77.4%", "33.6%", "45.7%", "28.8%", "59.8%", "47.6%"),
c("","","","","","","","","","","","","",""),
c("Median age", 42, 40, 45, 42, 40, 36, 39, 27, 45, 33, 40, 44, 37),
c("","","","","","","","","","","","","",""),
c("Percent with a disability", "6.2%", "5.3%", "3.2%", "7.3%", "4.6%", "5.5%", "3.3%", "6.6%", "7.4%", "7.8%", "7.6%", "6.1%", "5.2%"),
c("","","","","","","","","","","","","",""),
c("Percent self-employed", "9.3%", "33.6%", "23.6%", "53.8%", "23.3%", "37.8%", "22.7%", "23.1%", "45.0%", "45.9%", "25.6%", "41.8%", "56.8%"),
c("","","","","","","","","","","","","",""),
c("Percent working full-year/full-time", "66.0%", "59.8%", "82.3%", "56.1%", "68.3%", "23.5%", "72.3%", "36.8%", "32.6%", "37.5%", "52.6%", "55.1%", "46.5%"),
c("","","","","","","","","","","","","",""),
c("Percent with a bachelor's degree or higher level of training", "37.3%", "64.4%", "90.7%", "58.3%", "59.9%", "61.4%", "74.9%", "31.4%", "57.0%", "44.5%", "46.0%","83.2%", "51.2%"),
c("","","","","","","","","","","","","",""),
c("Median earnings of workers employed full-year/full time", "$48 305", "$58 005", "$83 195", "$50 505", "$55 560", "$41 060", "$69 605", "$36 365", "$45 875", "$41 260", "$51 675", "$60 115", "$42 940"),
c("Men", "$53 670", "$64 405", "$87 675", "$54 025", "$63 275", "$43 220", "$70 710", "*", "$46 620", "$48 620", "$52 730", "$64 830", "$47 455"),
c("Women", "$42 940", "$51 575", "$70 710", "$42 425", "$50 505", "$37 570", "$64 405", "*", "$42 940", "$31 640", "$48 485", "$56 950", "$36 465"),
c("Women's-to-men's earnings ratio", "$0.80", "$0.80", "$0.81", "$0.79", "$0.80", "$0.87", "$0.91", "*", "$0.92", "$0.65", "$0.92", "$0.88", "$0.77"),
c("","","","","","","","","","","","","",""),
c("White", "$53 670", "$59 040", "$84 370", "$50 505", "$55 820", "$45 455", "$71 715", "*", "$47 475", "$42 940", "$53 635", "$60 610", "$43 220"),
c("Non-white", "$38 895", "$55 560", "$79 420", "$50 130", "$55 105", "$35 355", "$60 115", "*", "$40 075", "$37 275", "$47 445", "$59 425", "$40 225"),
c("Non-white-to-white earnings ratio", "$0.72", "$0.94", "$0.94", "$0.99", "$0.99", "$0.78", "$0.84", "*", "$0.84", "$0.87", "$0.88", "$0.98", "$0.93"),
c("","","","","","","","","","","","","",""),
c("In sample", "7 818 941", "119 389", "10 868", "11 962", "45 788", "2 825", "8 500", "939", "10 972", "2 243", "3 257", "12 927", "9 108")
)
write.table(force, file = "force.csv", sep = ",", col.names=F, row.names=F)
The file will be downloaded from the Github repository to csv file using read.csv function.
force <- read.csv('https://raw.githubusercontent.com/ex-pr/DATA607/Project-2/force.csv', header=TRUE, sep=",", check.names=FALSE)
By checking the data downloaded, we have 14 column with 26 rows.
head(force,n=3)
## Entire U.S. labor force All artists Architects
## 1 Number in the labor force 164 465 375 2 403 842 208 008
## 2
## 3 Percent non-white or Hispanic 37.9% 26.5% 25.1%
## Fine artists, art directors, animators Designers Actors
## 1 232 092 933 156 60 986
## 2
## 3 24.6% 27.6% 33.2%
## Producers and directors Dancers and choreographers Musicians Entertainers
## 1 171 273 23 385 216 065 49 025
## 2
## 3 25.2% 44.1% 28.9% 31.7%
## Announcers Writers and authors Photographers
## 1 72 128 247 570 190 154
## 2
## 3 36.1% 17.3% 26.1%
summary(force)
## Entire U.S. labor force All artists
## Length:26 Length:26 Length:26
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
## Architects Fine artists, art directors, animators Designers
## Length:26 Length:26 Length:26
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
## Actors Producers and directors Dancers and choreographers
## Length:26 Length:26 Length:26
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
## Musicians Entertainers Announcers Writers and authors
## Length:26 Length:26 Length:26 Length:26
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## Photographers
## Length:26
## Class :character
## Mode :character
dim(force)
## [1] 26 14
Tidy data should follow the rules: each variable is its own column, each observation is its own row, and each value is its own cell. At the current moment, we have “wide data” an can transform it to “long data” as it is better to work with.
We are going to work with data from csv file and use tidyverse package to transform it.
First of all, we will remove empty rows.
force <- force[!apply(force == "", 1, all), ]
head(force, n=3)
## Entire U.S. labor force All artists Architects
## 1 Number in the labor force 164 465 375 2 403 842 208 008
## 3 Percent non-white or Hispanic 37.9% 26.5% 25.1%
## 5 Percent female 47.2% 47.6% 27.6%
## Fine artists, art directors, animators Designers Actors
## 1 232 092 933 156 60 986
## 3 24.6% 27.6% 33.2%
## 5 46.1% 55.3% 45.4%
## Producers and directors Dancers and choreographers Musicians Entertainers
## 1 171 273 23 385 216 065 49 025
## 3 25.2% 44.1% 28.9% 31.7%
## 5 37.6% 77.4% 33.6% 45.7%
## Announcers Writers and authors Photographers
## 1 72 128 247 570 190 154
## 3 36.1% 17.3% 26.1%
## 5 28.8% 59.8% 47.6%
On the second step, we will work remove % and $ sign. No % sign is needed as the word “percent” is already in the names of the rows.
force[]<-lapply(force,gsub,pattern="$",fixed=TRUE,replacement="")
force[]<-lapply(force,gsub,pattern="%",fixed=TRUE,replacement="")
As there are only numeric value but presented as characters, we will transform them to numeric. The first column contains spaces in numbers, the code below will remove them.
force[,2:14]<-lapply(force[,2:14],gsub,pattern=" ",fixed=TRUE,replacement="")
head(force, n=3)
## Entire U.S. labor force All artists Architects
## 1 Number in the labor force 164465375 2403842 208008
## 3 Percent non-white or Hispanic 37.9 26.5 25.1
## 5 Percent female 47.2 47.6 27.6
## Fine artists, art directors, animators Designers Actors
## 1 232092 933156 60986
## 3 24.6 27.6 33.2
## 5 46.1 55.3 45.4
## Producers and directors Dancers and choreographers Musicians Entertainers
## 1 171273 23385 216065 49025
## 3 25.2 44.1 28.9 31.7
## 5 37.6 77.4 33.6 45.7
## Announcers Writers and authors Photographers
## 1 72128 247570 190154
## 3 36.1 17.3 26.1
## 5 28.8 59.8 47.6
force[, 2:14] <- apply(force[, 2:14], 2,
function(x) as.numeric(as.character(x)))
str(force)
## 'data.frame': 16 obs. of 14 variables:
## $ : chr "Number in the labor force" "Percent non-white or Hispanic" "Percent female" "Median age" ...
## $ Entire U.S. labor force : num 1.64e+08 3.79e+01 4.72e+01 4.20e+01 6.20 ...
## $ All artists : num 2.40e+06 2.65e+01 4.76e+01 4.00e+01 5.30 ...
## $ Architects : num 208008 25.1 27.6 45 3.2 ...
## $ Fine artists, art directors, animators: num 232092 24.6 46.1 42 7.3 ...
## $ Designers : num 933156 27.6 55.3 40 4.6 ...
## $ Actors : num 60986 33.2 45.4 36 5.5 ...
## $ Producers and directors : num 171273 25.2 37.6 39 3.3 ...
## $ Dancers and choreographers : num 23385 44.1 77.4 27 6.6 ...
## $ Musicians : num 216065 28.9 33.6 45 7.4 ...
## $ Entertainers : num 49025 31.7 45.7 33 7.8 ...
## $ Announcers : num 72128 36.1 28.8 40 7.6 ...
## $ Writers and authors : num 247570 17.3 59.8 44 6.1 ...
## $ Photographers : num 190154 26.1 47.6 37 5.2 ...
Column 1 will become row names.
rownames(force) <- force[,1]
We will rename first column, make it as the names of rows and transpose dataframe to make columns as rows and rows as columns.
names(force)[1] <- 'Info'
force <- force %>% select(-Info)
force_new <- transpose(force)
rownames(force_new) <- colnames(force)
colnames(force_new) <- rownames(force)
head(force_new,n=3)
## Number in the labor force Percent non-white or Hispanic
## Entire U.S. labor force 164465375 37.9
## All artists 2403842 26.5
## Architects 208008 25.1
## Percent female Median age Percent with a disability
## Entire U.S. labor force 47.2 42 6.2
## All artists 47.6 40 5.3
## Architects 27.6 45 3.2
## Percent self-employed
## Entire U.S. labor force 9.3
## All artists 33.6
## Architects 23.6
## Percent working full-year/full-time
## Entire U.S. labor force 66.0
## All artists 59.8
## Architects 82.3
## Percent with a bachelor's degree or higher level of training
## Entire U.S. labor force 37.3
## All artists 64.4
## Architects 90.7
## Median earnings of workers employed full-year/full time
## Entire U.S. labor force 48305
## All artists 58005
## Architects 83195
## Men Women Women's-to-men's earnings ratio White
## Entire U.S. labor force 53670 42940 0.80 53670
## All artists 64405 51575 0.80 59040
## Architects 87675 70710 0.81 84370
## Non-white Non-white-to-white earnings ratio In sample
## Entire U.S. labor force 38895 0.72 7818941
## All artists 55560 0.94 119389
## Architects 79420 0.94 10868
As the next step, “Men” and “Women” columns will become one column.
force_new$force_type <- rownames(force_new)
force_new <- force_new %>%
dplyr::select("force_type", everything())
rownames(force_new) <- 1:nrow(force_new)
force <- force_new %>%
pivot_longer(c(`Men`, `Women`), names_to="Sex", values_to="Salary_sex")
head(force, n=3)
## # A tibble: 3 x 17
## force_type Numbe~1 Perce~2 Perce~3 Media~4 Perce~5 Perce~6 Perce~7 Perce~8
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Entire U.S. l~ 1.64e8 37.9 47.2 42 6.2 9.3 66 37.3
## 2 Entire U.S. l~ 1.64e8 37.9 47.2 42 6.2 9.3 66 37.3
## 3 All artists 2.40e6 26.5 47.6 40 5.3 33.6 59.8 64.4
## # ... with 8 more variables:
## # `Median earnings of workers employed full-year/full time` <dbl>,
## # `Women's-to-men's earnings ratio` <dbl>, White <dbl>, `Non-white` <dbl>,
## # `Non-white-to-white earnings ratio` <dbl>, `In sample` <dbl>, Sex <chr>,
## # Salary_sex <dbl>, and abbreviated variable names
## # 1: `Number in the labor force`, 2: `Percent non-white or Hispanic`,
## # 3: `Percent female`, 4: `Median age`, 5: `Percent with a disability`, ...
Let’s check artists’ income by type and overall U.S. labor force.
From the plot below, architects earn more than other labor forces, dancers and choreographers less than everyone else.
ggplot(force[5:26,], aes(force_type, `Median earnings of workers employed full-year/full time`)) + geom_col(fill='blue') +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1), plot.title = element_text(hjust = 0.5)) +
labs(title="Median salary by type of force", x= "Type of Force", y = "Median salary")
Next, we will check the salary for each type of force and sex. Dancers and choreographers are not included as there is no information about their salaries based on sex.
There is observation that in general, men earn more than women in each type of labor force.
ggplot(force[-(15:16),],aes(x = force_type,y = Salary_sex)) +
geom_bar(aes(fill = Sex),stat = "identity",position = "dodge") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1), plot.title = element_text(hjust = 0.5)) +
labs(title="Median salary by sex", x= "Type of Force", y = "Median salary")
Entertainers has the most inequality in term of salary for women, men are paid much more. Women earn only 65% of men salary.
force[force[,11] == min(force[,11],na.rm = TRUE),]
## # A tibble: 4 x 17
## force_type Number ~1 Perce~2 Perce~3 Media~4 Perce~5 Perce~6 Perce~7 Perce~8
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 <NA> NA NA NA NA NA NA NA NA
## 2 <NA> NA NA NA NA NA NA NA NA
## 3 Entertainers 49025 31.7 45.7 33 7.8 45.9 37.5 44.5
## 4 Entertainers 49025 31.7 45.7 33 7.8 45.9 37.5 44.5
## # ... with 8 more variables:
## # `Median earnings of workers employed full-year/full time` <dbl>,
## # `Women's-to-men's earnings ratio` <dbl>, White <dbl>, `Non-white` <dbl>,
## # `Non-white-to-white earnings ratio` <dbl>, `In sample` <dbl>, Sex <chr>,
## # Salary_sex <dbl>, and abbreviated variable names
## # 1: `Number in the labor force`, 2: `Percent non-white or Hispanic`,
## # 3: `Percent female`, 4: `Median age`, 5: `Percent with a disability`, ...
Just to check again, we will plot boxplot to confirm the previous observation, men earn more that women in general.
ggplot(force, aes(x=Sex, y=Salary_sex, color=Sex)) +
geom_boxplot() +
theme_light() +
labs(title="Salary based on sex", x= "Sex", y = "Salary") +
theme(plot.title = element_text(hjust = 0.5))
It would be interesting to know which field has the most freelancers. Based on the results of scatter plot, photographers are self-employed in general.
ggplot(force[5:26,],aes(x =force_type,y = `Percent self-employed`)) +
geom_point(size=3, color='red') +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1), plot.title = element_text(hjust = 0.5)) +
labs(title="Percent self-employed based on force type", x= "Force type", y = "% self-employed")
During the work, we have reviewed how to organize data in a way called “tidy data” using package tidyverse, how to transform wide format to long format.
By analyzing data, we found that if you would like to join labor forces, it is lucrative to join team “Architects” as they earn more than $80k and it will difficult to earn in team “Dancers and choreographers” as they have salary below 40k.
If you would like to work for yourself, it is the best to join Photographers, the majority of them are freelancers and have a good salary. Unfortunately, there is inequality in salaries based on gender, men earn more than women in every field of labor forces. The most unequal force is entertainers, women earn there only 65% of men salary.