Project 2

The goal of this assignment is to practice preparing untidy datasets for downstream analysis.

Load packages:

library(tidyverse)
library(stringr) 
library(readxl)

Dataset 1

For the first dataset, I chose the following data selected by Jhalak Das source::

Read the file into R:

file <- "https://raw.githubusercontent.com/josh1den/DATA-607/Projects/Project%202/jhalak_das_untidy.csv"
df1 <- read.csv(file)
head(df1)
##   id   name phone sex.and.age test.number term.1 term.2 term.3
## 1  1   Mike   134        m_12      test 1     76     84     87
## 2  2  Linda   270        f_13      test 1     88     90     73
## 3  3    Sam   210        m_11      test 1     78     74     80
## 4  4 Esther   617        f_12      test 1     68     75     74
## 5  5   Mary   114        f_14      test 1     65     67     64
## 6  1   Mike   134        m_12      test 2     85     80     90
# split up sex and age
# test column to single int
# pivot term columns to single column, term, column: grade

df1_clean <- df1 |>
  separate(col=sex.and.age, into=c("sex","age"), sep="_", convert=TRUE) |>
  rename(test = test.number) |>
  mutate(test = as.integer(str_extract_all(test, "([0-3])"))) |>
  pivot_longer(cols=c("term.1","term.2", "term.3"), names_to="term",
               names_pattern="([0-3])", values_to = "grade") |>
  arrange(name)

df1_clean
## # A tibble: 30 × 8
##       id name   phone sex     age  test term  grade
##    <int> <chr>  <int> <chr> <int> <int> <chr> <int>
##  1     4 Esther   617 f        12     1 1        68
##  2     4 Esther   617 f        12     1 2        75
##  3     4 Esther   617 f        12     1 3        74
##  4     4 Esther   617 f        12     2 1        70
##  5     4 Esther   617 f        12     2 2        75
##  6     4 Esther   617 f        12     2 3        78
##  7     2 Linda    270 f        13     1 1        88
##  8     2 Linda    270 f        13     1 2        90
##  9     2 Linda    270 f        13     1 3        73
## 10     2 Linda    270 f        13     2 1        87
## # … with 20 more rows

What is the average test score by area code (which also happens to be by student)?

# average test score by area code

area_code <- df1_clean |>
  group_by(name, phone) |>
  summarize(avg_test_score = mean(grade)) |> 
  arrange(desc(avg_test_score))

area_code
## # A tibble: 5 × 3
## # Groups:   name [5]
##   name   phone avg_test_score
##   <chr>  <int>          <dbl>
## 1 Linda    270           85.7
## 2 Mike     134           83.7
## 3 Sam      210           79.8
## 4 Esther   617           73.3
## 5 Mary     114           66.2

Finally, saving the clean dataframe to .csv for later use.

path1 <- "/Users/joshiden/Documents/Classes/CUNY SPS/Fall 2022/DATA 607/DATA-607/Project 2/df1_clean.csv"
write.csv(df1_clean, path1)

Dataset 2

For the second datasest, I chose data that I found and provided from the National Endowment for the Arts comparing median income for arts with the overall U.S. Labor Force:

This data was challenging, as it existed only in .xlsx format. As such, I decided to the data into R using readxl.

# read file in from local machine
filepath <- "/Users/joshiden/Documents/Classes/CUNY SPS/Fall 2022/DATA 607/DATA-607/Project 2/Table1aArtistProfile.xlsx"
# read sheet names
sheets <- excel_sheets(filepath)
df2 <- read_excel(filepath, sheet=sheets[1], skip=1)
head(df2)
## # A tibble: 6 × 14
##   ...1     `Entire U.S. l…` `All artists` Architects1 `Fine artists,…` Designers
##   <chr>               <dbl>         <dbl>       <dbl>            <dbl>     <dbl>
## 1 Number …    164465375       2403842      208008           232092       9.33e+5
## 2 <NA>               NA            NA          NA               NA      NA      
## 3 Percent…            0.379         0.265       0.251            0.246   2.76e-1
## 4 <NA>               NA            NA          NA               NA      NA      
## 5 Percent…            0.472         0.476       0.276            0.461   5.53e-1
## 6 <NA>               NA            NA          NA               NA      NA      
## # … with 8 more variables: Actors <dbl>, `Producers and directors` <dbl>,
## #   `Dancers and choreographers` <chr>, Musicians <dbl>, Entertainers <dbl>,
## #   Announcers <dbl>, `Writers and authors` <dbl>, Photographers <dbl>

This analysis focuses on comparing median income between different types of artists with the overall U.S. Labor Force, so I determined to select the observations needed, which also allowed me to eliminate NA rows and columns. I also determined that the desired data would need to go from wide to long format, as the columns all represented observations I am interested in. However, in order to pivot, I needed to define the first column.

# get the desired column names
colnames <- colnames(df2)

# create the list of desired column names 
cols <- toupper(colnames(df2)) |>
  replace(1, "DEMOGRAPHIC")

# re-read the sheet
df2 <- read_excel(filepath, sheet=sheets[1], skip=3, col_names=cols)

# extract desired rows
df2_transformed <- df2[17:20,]
head(df2_transformed)
## # A tibble: 4 × 14
##   DEMOGRAPHIC        `ENTIRE U.S. L…` `ALL ARTISTS` ARCHITECTS1 `FINE ARTISTS,…`
##   <chr>                         <dbl>         <dbl>       <dbl>            <dbl>
## 1 Median earnings o…          48305       58005       83195            50505    
## 2 Men                         53670       64405       87675            54025    
## 3 Women                       42940       51575       70710            42425    
## 4 Women's-to-men's …              0.8         0.801       0.807            0.785
## # … with 9 more variables: DESIGNERS <dbl>, ACTORS <dbl>,
## #   `PRODUCERS AND DIRECTORS` <dbl>, `DANCERS AND CHOREOGRAPHERS` <chr>,
## #   MUSICIANS <dbl>, ENTERTAINERS <dbl>, ANNOUNCERS <dbl>,
## #   `WRITERS AND AUTHORS` <dbl>, PHOTOGRAPHERS <dbl>

Now that the data exists in a structure I can work with, I write the file to .csv and upload to GitHub in order to work with the file as a .csv per the project instructions, and read the new .csv file into R.

filepath <- "/Users/joshiden/Documents/Classes/CUNY SPS/Fall 2022/DATA 607/DATA-607/Project 2/artistearnings.csv"
write.csv(df2_transformed, filepath, row.names=FALSE)

# read csv from GitHub
gh_csv <- "https://raw.githubusercontent.com/josh1den/DATA-607/Projects/Project%202/artistearnings.csv"
df2t <- read.csv(gh_csv)
head(df2t)
##                                                DEMOGRAPHIC
## 1 Median earnings of workers employed full-year/full time4
## 2                                                      Men
## 3                                                    Women
## 4                          Women's-to-men's earnings ratio
##   ENTIRE.U.S..LABOR.FORCE  ALL.ARTISTS  ARCHITECTS1
## 1                 48305.0 5.800500e+04 8.319500e+04
## 2                 53670.0 6.440500e+04 8.767500e+04
## 3                 42940.0 5.157500e+04 7.071000e+04
## 4                     0.8 8.007305e-01 8.065175e-01
##   FINE.ARTISTS..ART.DIRECTORS..ANIMATORS    DESIGNERS      ACTORS
## 1                           5.050500e+04 5.556000e+04 4.10600e+04
## 2                           5.402500e+04 6.327500e+04 4.32200e+04
## 3                           4.242500e+04 5.050500e+04 3.75700e+04
## 4                           7.853266e-01 7.982009e-01 8.69314e-01
##   PRODUCERS.AND.DIRECTORS DANCERS.AND.CHOREOGRAPHERS   MUSICIANS ENTERTAINERS
## 1            6.960500e+04                      36365 4.58750e+04 4.126000e+04
## 2            7.071000e+04                          * 4.66200e+04 4.862000e+04
## 3            6.440500e+04                          * 4.29400e+04 3.164000e+04
## 4            9.108585e-01                          * 9.20988e-01 6.507082e-01
##     ANNOUNCERS WRITERS.AND.AUTHORS PHOTOGRAPHERS
## 1 5.167500e+04         6.01150e+04  4.294000e+04
## 2 5.273000e+04         6.48300e+04  4.745500e+04
## 3 4.848500e+04         5.69500e+04  3.646500e+04
## 4 9.195275e-01         8.78456e-01  7.684013e-01

Data tidying:

# replace value names in demographic column
df2t$DEMOGRAPHIC <- df2t$DEMOGRAPHIC |>
  replace(1, "MEDIAN") |>
  replace(4, "RATIO") |>
  toupper()

# dancers column to integer
df2t$DANCERS.AND.CHOREOGRAPHERS <- gsub("\\*", NaN, df2t$DANCERS.AND.CHOREOGRAPHERS) |>
  as.integer()

# pivot long
df2t_pivoted <- df2t |> 
  pivot_longer(cols=2:14, names_to="TYPE", values_to="MEDIAN_INCOME") |>
  filter(DEMOGRAPHIC == "MEDIAN") |>
  select(2:3) |>
  arrange(desc(MEDIAN_INCOME))
  
head(df2t_pivoted)
## # A tibble: 6 × 2
##   TYPE                    MEDIAN_INCOME
##   <chr>                           <dbl>
## 1 ARCHITECTS1                     83195
## 2 PRODUCERS.AND.DIRECTORS         69605
## 3 WRITERS.AND.AUTHORS             60115
## 4 ALL.ARTISTS                     58005
## 5 DESIGNERS                       55560
## 6 ANNOUNCERS                      51675

Saving the pivoted data for further analysis,

path2 <- "/Users/joshiden/Documents/Classes/CUNY SPS/Fall 2022/DATA 607/DATA-607/Project 2/df2_clean.csv"
write.csv(df2t_pivoted, path2)

Now I plot the data comparing median artist income by type with the overall U.S. Labor Force:

ggplot(df2t_pivoted, aes(x=reorder(TYPE, MEDIAN_INCOME), y=MEDIAN_INCOME)) +
  geom_col() +
  coord_flip() +
  theme_minimal() +
  labs(x = "TYPE")

Dataset 3

For the third dataset, I chose data selected by Seung Min Song:

This data was pretty straightforward, with the goal of comparing admissions by gender:

url <- "https://raw.githubusercontent.com/josh1den/DATA-607/Projects/Project%202/seungminsong_das_unitdy.csv"
df3 <- read.csv(url)

head(df3)
##   Gender Dept Admitted Rejected
## 1   Male    A      512      313
## 2 Female    A       89       19
## 3   Male    B      353      207
## 4 Female    B       17        8
## 5   Male    C      120      205
## 6 Female    C      202      391

This data required simple pivoting to achieve the desired output:

df3_pivoted <- df3 |>
  pivot_longer(cols=c("Admitted", "Rejected"), names_to="Admit",
               values_to="Count") |>
  group_by(Gender, Admit) |>
  summarize(Total = sum(Count))

df3_pivoted
## # A tibble: 4 × 3
## # Groups:   Gender [2]
##   Gender Admit    Total
##   <chr>  <chr>    <int>
## 1 Female Admitted   557
## 2 Female Rejected  1278
## 3 Male   Admitted  1198
## 4 Male   Rejected  1493

Writing the pivoted data to file:

path3 <- "/Users/joshiden/Documents/Classes/CUNY SPS/Fall 2022/DATA 607/DATA-607/Project 2/df3_clean.csv"
write.csv(df3_pivoted, path3)

Visualizing the admitted and rejected outcomes by gender:

ggplot(df3_pivoted, aes(x=Admit, y=Total, fill=Gender)) +
  geom_col(position="dodge")