The goal of this assignment is to practice preparing untidy datasets for downstream analysis.
Load packages:
library(tidyverse)
library(stringr)
library(readxl)
For the first dataset, I chose the following data selected by Jhalak Das source::
Read the file into R:
file <- "https://raw.githubusercontent.com/josh1den/DATA-607/Projects/Project%202/jhalak_das_untidy.csv"
df1 <- read.csv(file)
head(df1)
## id name phone sex.and.age test.number term.1 term.2 term.3
## 1 1 Mike 134 m_12 test 1 76 84 87
## 2 2 Linda 270 f_13 test 1 88 90 73
## 3 3 Sam 210 m_11 test 1 78 74 80
## 4 4 Esther 617 f_12 test 1 68 75 74
## 5 5 Mary 114 f_14 test 1 65 67 64
## 6 1 Mike 134 m_12 test 2 85 80 90
# split up sex and age
# test column to single int
# pivot term columns to single column, term, column: grade
df1_clean <- df1 |>
separate(col=sex.and.age, into=c("sex","age"), sep="_", convert=TRUE) |>
rename(test = test.number) |>
mutate(test = as.integer(str_extract_all(test, "([0-3])"))) |>
pivot_longer(cols=c("term.1","term.2", "term.3"), names_to="term",
names_pattern="([0-3])", values_to = "grade") |>
arrange(name)
df1_clean
## # A tibble: 30 × 8
## id name phone sex age test term grade
## <int> <chr> <int> <chr> <int> <int> <chr> <int>
## 1 4 Esther 617 f 12 1 1 68
## 2 4 Esther 617 f 12 1 2 75
## 3 4 Esther 617 f 12 1 3 74
## 4 4 Esther 617 f 12 2 1 70
## 5 4 Esther 617 f 12 2 2 75
## 6 4 Esther 617 f 12 2 3 78
## 7 2 Linda 270 f 13 1 1 88
## 8 2 Linda 270 f 13 1 2 90
## 9 2 Linda 270 f 13 1 3 73
## 10 2 Linda 270 f 13 2 1 87
## # … with 20 more rows
What is the average test score by area code (which also happens to be by student)?
# average test score by area code
area_code <- df1_clean |>
group_by(name, phone) |>
summarize(avg_test_score = mean(grade)) |>
arrange(desc(avg_test_score))
area_code
## # A tibble: 5 × 3
## # Groups: name [5]
## name phone avg_test_score
## <chr> <int> <dbl>
## 1 Linda 270 85.7
## 2 Mike 134 83.7
## 3 Sam 210 79.8
## 4 Esther 617 73.3
## 5 Mary 114 66.2
Finally, saving the clean dataframe to .csv for later use.
path1 <- "/Users/joshiden/Documents/Classes/CUNY SPS/Fall 2022/DATA 607/DATA-607/Project 2/df1_clean.csv"
write.csv(df1_clean, path1)
For the second datasest, I chose data that I found and provided from the National Endowment for the Arts comparing median income for arts with the overall U.S. Labor Force:
This data was challenging, as it existed only in .xlsx format. As such, I decided to the data into R using readxl.
# read file in from local machine
filepath <- "/Users/joshiden/Documents/Classes/CUNY SPS/Fall 2022/DATA 607/DATA-607/Project 2/Table1aArtistProfile.xlsx"
# read sheet names
sheets <- excel_sheets(filepath)
df2 <- read_excel(filepath, sheet=sheets[1], skip=1)
head(df2)
## # A tibble: 6 × 14
## ...1 `Entire U.S. l…` `All artists` Architects1 `Fine artists,…` Designers
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Number … 164465375 2403842 208008 232092 9.33e+5
## 2 <NA> NA NA NA NA NA
## 3 Percent… 0.379 0.265 0.251 0.246 2.76e-1
## 4 <NA> NA NA NA NA NA
## 5 Percent… 0.472 0.476 0.276 0.461 5.53e-1
## 6 <NA> NA NA NA NA NA
## # … with 8 more variables: Actors <dbl>, `Producers and directors` <dbl>,
## # `Dancers and choreographers` <chr>, Musicians <dbl>, Entertainers <dbl>,
## # Announcers <dbl>, `Writers and authors` <dbl>, Photographers <dbl>
This analysis focuses on comparing median income between different types of artists with the overall U.S. Labor Force, so I determined to select the observations needed, which also allowed me to eliminate NA rows and columns. I also determined that the desired data would need to go from wide to long format, as the columns all represented observations I am interested in. However, in order to pivot, I needed to define the first column.
# get the desired column names
colnames <- colnames(df2)
# create the list of desired column names
cols <- toupper(colnames(df2)) |>
replace(1, "DEMOGRAPHIC")
# re-read the sheet
df2 <- read_excel(filepath, sheet=sheets[1], skip=3, col_names=cols)
# extract desired rows
df2_transformed <- df2[17:20,]
head(df2_transformed)
## # A tibble: 4 × 14
## DEMOGRAPHIC `ENTIRE U.S. L…` `ALL ARTISTS` ARCHITECTS1 `FINE ARTISTS,…`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Median earnings o… 48305 58005 83195 50505
## 2 Men 53670 64405 87675 54025
## 3 Women 42940 51575 70710 42425
## 4 Women's-to-men's … 0.8 0.801 0.807 0.785
## # … with 9 more variables: DESIGNERS <dbl>, ACTORS <dbl>,
## # `PRODUCERS AND DIRECTORS` <dbl>, `DANCERS AND CHOREOGRAPHERS` <chr>,
## # MUSICIANS <dbl>, ENTERTAINERS <dbl>, ANNOUNCERS <dbl>,
## # `WRITERS AND AUTHORS` <dbl>, PHOTOGRAPHERS <dbl>
Now that the data exists in a structure I can work with, I write the file to .csv and upload to GitHub in order to work with the file as a .csv per the project instructions, and read the new .csv file into R.
filepath <- "/Users/joshiden/Documents/Classes/CUNY SPS/Fall 2022/DATA 607/DATA-607/Project 2/artistearnings.csv"
write.csv(df2_transformed, filepath, row.names=FALSE)
# read csv from GitHub
gh_csv <- "https://raw.githubusercontent.com/josh1den/DATA-607/Projects/Project%202/artistearnings.csv"
df2t <- read.csv(gh_csv)
head(df2t)
## DEMOGRAPHIC
## 1 Median earnings of workers employed full-year/full time4
## 2 Men
## 3 Women
## 4 Women's-to-men's earnings ratio
## ENTIRE.U.S..LABOR.FORCE ALL.ARTISTS ARCHITECTS1
## 1 48305.0 5.800500e+04 8.319500e+04
## 2 53670.0 6.440500e+04 8.767500e+04
## 3 42940.0 5.157500e+04 7.071000e+04
## 4 0.8 8.007305e-01 8.065175e-01
## FINE.ARTISTS..ART.DIRECTORS..ANIMATORS DESIGNERS ACTORS
## 1 5.050500e+04 5.556000e+04 4.10600e+04
## 2 5.402500e+04 6.327500e+04 4.32200e+04
## 3 4.242500e+04 5.050500e+04 3.75700e+04
## 4 7.853266e-01 7.982009e-01 8.69314e-01
## PRODUCERS.AND.DIRECTORS DANCERS.AND.CHOREOGRAPHERS MUSICIANS ENTERTAINERS
## 1 6.960500e+04 36365 4.58750e+04 4.126000e+04
## 2 7.071000e+04 * 4.66200e+04 4.862000e+04
## 3 6.440500e+04 * 4.29400e+04 3.164000e+04
## 4 9.108585e-01 * 9.20988e-01 6.507082e-01
## ANNOUNCERS WRITERS.AND.AUTHORS PHOTOGRAPHERS
## 1 5.167500e+04 6.01150e+04 4.294000e+04
## 2 5.273000e+04 6.48300e+04 4.745500e+04
## 3 4.848500e+04 5.69500e+04 3.646500e+04
## 4 9.195275e-01 8.78456e-01 7.684013e-01
Data tidying:
# replace value names in demographic column
df2t$DEMOGRAPHIC <- df2t$DEMOGRAPHIC |>
replace(1, "MEDIAN") |>
replace(4, "RATIO") |>
toupper()
# dancers column to integer
df2t$DANCERS.AND.CHOREOGRAPHERS <- gsub("\\*", NaN, df2t$DANCERS.AND.CHOREOGRAPHERS) |>
as.integer()
# pivot long
df2t_pivoted <- df2t |>
pivot_longer(cols=2:14, names_to="TYPE", values_to="MEDIAN_INCOME") |>
filter(DEMOGRAPHIC == "MEDIAN") |>
select(2:3) |>
arrange(desc(MEDIAN_INCOME))
head(df2t_pivoted)
## # A tibble: 6 × 2
## TYPE MEDIAN_INCOME
## <chr> <dbl>
## 1 ARCHITECTS1 83195
## 2 PRODUCERS.AND.DIRECTORS 69605
## 3 WRITERS.AND.AUTHORS 60115
## 4 ALL.ARTISTS 58005
## 5 DESIGNERS 55560
## 6 ANNOUNCERS 51675
Saving the pivoted data for further analysis,
path2 <- "/Users/joshiden/Documents/Classes/CUNY SPS/Fall 2022/DATA 607/DATA-607/Project 2/df2_clean.csv"
write.csv(df2t_pivoted, path2)
Now I plot the data comparing median artist income by type with the overall U.S. Labor Force:
ggplot(df2t_pivoted, aes(x=reorder(TYPE, MEDIAN_INCOME), y=MEDIAN_INCOME)) +
geom_col() +
coord_flip() +
theme_minimal() +
labs(x = "TYPE")
For the third dataset, I chose data selected by Seung Min Song:
This data was pretty straightforward, with the goal of comparing admissions by gender:
url <- "https://raw.githubusercontent.com/josh1den/DATA-607/Projects/Project%202/seungminsong_das_unitdy.csv"
df3 <- read.csv(url)
head(df3)
## Gender Dept Admitted Rejected
## 1 Male A 512 313
## 2 Female A 89 19
## 3 Male B 353 207
## 4 Female B 17 8
## 5 Male C 120 205
## 6 Female C 202 391
This data required simple pivoting to achieve the desired output:
df3_pivoted <- df3 |>
pivot_longer(cols=c("Admitted", "Rejected"), names_to="Admit",
values_to="Count") |>
group_by(Gender, Admit) |>
summarize(Total = sum(Count))
df3_pivoted
## # A tibble: 4 × 3
## # Groups: Gender [2]
## Gender Admit Total
## <chr> <chr> <int>
## 1 Female Admitted 557
## 2 Female Rejected 1278
## 3 Male Admitted 1198
## 4 Male Rejected 1493
Writing the pivoted data to file:
path3 <- "/Users/joshiden/Documents/Classes/CUNY SPS/Fall 2022/DATA 607/DATA-607/Project 2/df3_clean.csv"
write.csv(df3_pivoted, path3)
Visualizing the admitted and rejected outcomes by gender:
ggplot(df3_pivoted, aes(x=Admit, y=Total, fill=Gender)) +
geom_col(position="dodge")