DATA 607 Data Acquisition & Management

Project 2 - Data Transformation

Silma Khan SPRING 2025

Your task is to:

Choose any three of the “wide” datasets identified in the Week 4 Discussion items. (You may use your own dataset; please don’t use my Sample Post dataset, since that was used in your Week 4 assignment!) For each of the three chosen datasets:

Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below.

Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!]

Perform the analysis requested in the discussion item

Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.

In this project, we will work with three datasets that are not in a “tidy” format. Tidy data follows the principle that each variable has its own column and each observation has its own row. When data is untidy, analysis becomes harder because information is spread across multiple columns or compressed into a single cell.
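To illustrate with a small made-up example (a sketch assuming the tidyverse is loaded, as it is in the setup below; the table and values are hypothetical), pivot_longer() from tidyr is the usual tool for turning a wide layout into a tidy one:

#a toy "wide" table: one column per year violates one-variable-per-column
wide <- tibble(name = c("A", "B"), y2019 = c(10, 20), y2020 = c(11, 21))

#tidy form: a single "year" column and a single "value" column
wide %>%
  pivot_longer(cols = starts_with("y"), names_to = "year",
               names_prefix = "y", values_to = "value")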

For this project, I decided to use these three untidy datasets:

  • rolling_stone.csv - Rolling Stone Dataset
  • cola.xlsx - Coca-Cola Sales Dataset
  • Uncleaned_DS_jobs.csv - Data Science Jobs Dataset (from Glassdoor)

Each of these datasets is untidy for a different reason, and we will cover those reasons as we walk through the cleaning and tidying of each one.

rolling_stone.csv: This dataset came from the “tidytuesday” GitHub repo, which collects untidy datasets. It has information about artists, their albums, and how the albums ranked in the years 2003, 2012, and 2020. The dataset also contains information about the artists and the type of music produced.

Link: https://github.com/rfordatascience/tidytuesday/blob/main/data/2024/2024-05-07/rolling_stone.csv

cola.xlsx: This dataset comes from Kaggle and contains information regarding Coca-Cola sales and its profit and loss.

Link: https://www.kaggle.com/datasets/shivavashishtha/dirty-excel-data?resource=download

Uncleaned_DS_jobs.csv: This dataset contains information on data science jobs scraped from Glassdoor’s website. It is the uncleaned version, i.e. the data exactly as it was pulled from the website, with no cleaning work done to it.

Link: https://www.kaggle.com/datasets/rashikrahmanpritom/data-science-job-posting-on-glassdoor?select=Uncleaned_DS_jobs.csv

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readxl)

Rolling Stone Dataset

The rolling_stone.csv file is accessed directly from my GitHub repo for this project and can also be found at the GitHub link provided above.

This dataset is considered untidy because it stores the artist’s name in two duplicate columns. Additionally, it holds ranking data in three separate columns. Rather than a separate column for each year’s rank, tidy data should have a single “rank” column with a corresponding “year” column. The wide structure makes it challenging to compare rankings across years without reshaping the data.

rolling <- read_csv("https://raw.githubusercontent.com/silmaxk/DATA607_CUNYSPS/refs/heads/main/Project%232/rolling_stone.csv")
## Rows: 691 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (8): sort_name, clean_name, album, genre, type, spotify_url, artist_gen...
## dbl (13): rank_2003, rank_2012, rank_2020, differential, release_year, weeks...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(rolling)
## # A tibble: 6 × 21
##   sort_name      clean_name     album rank_2003 rank_2012 rank_2020 differential
##   <chr>          <chr>          <chr>     <dbl>     <dbl>     <dbl>        <dbl>
## 1 Sinatra, Frank Frank Sinatra  In t…       100       101       282         -182
## 2 Diddley, Bo    Bo Diddley     Bo D…       214       216       455         -241
## 3 Presley, Elvis Elvis Presley  Elvi…        55        56       332         -277
## 4 Sinatra, Frank Frank Sinatra  Song…       306       308        NA         -195
## 5 Little Richard Little Richard Here…        50        50       227         -177
## 6 Beyonce        Beyonce        Lemo…        NA        NA        32          469
## # ℹ 14 more variables: release_year <dbl>, genre <chr>, type <chr>,
## #   weeks_on_billboard <dbl>, peak_billboard_position <dbl>,
## #   spotify_popularity <dbl>, spotify_url <chr>, artist_member_count <dbl>,
## #   artist_gender <chr>, artist_birth_year_sum <dbl>,
## #   debut_album_release_year <dbl>, ave_age_at_top_500 <dbl>,
## #   years_between <dbl>, album_id <chr>
  • Tidying Data:
  1. Remove the duplicate artist-name column: Since “sort_name” and “clean_name” represent the same information, we can keep one and remove the other. In this case, we will keep “clean_name”, as it already presents the name in a readable format without unnecessary punctuation. (A quick redundancy check appears below.)

  2. Pivot the rank columns: We reshape the three rank columns into a long format, producing a “year” column (holding the values 2003, 2012, and 2020) and a “rank” column.
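As a quick check (a sketch, not part of the pipeline itself) that the two name columns really are redundant before dropping one: if every clean_name pairs with exactly one sort_name, the two counts below will match.

#distinct (sort_name, clean_name) pairs vs distinct clean_name values
rolling %>% distinct(sort_name, clean_name) %>% nrow()
rolling %>% distinct(clean_name) %>% nrow()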

rolling_tidy <- rolling %>%
  select(-sort_name) %>%   #removing the duplicate column and just keeping "clean_name"
  pivot_longer(
    cols = starts_with("rank_"),   #select columns that begin with "rank_" to make it faster to deal with
    names_to = "year",             #creating a new column that will hold the year information
    names_prefix = "rank_",        #removing the "rank_" prefix from the year values
    values_to = "rank"             #creating a new column for the rank values
  ) %>%
  filter(!is.na(rank))             #removing any rows with missing rank values

head(rolling_tidy)
## # A tibble: 6 × 19
##   clean_name    album   differential release_year genre type  weeks_on_billboard
##   <chr>         <chr>          <dbl>        <dbl> <chr> <chr>              <dbl>
## 1 Frank Sinatra In the…         -182         1955 Big … Stud…                 14
## 2 Frank Sinatra In the…         -182         1955 Big … Stud…                 14
## 3 Frank Sinatra In the…         -182         1955 Big … Stud…                 14
## 4 Bo Diddley    Bo Did…         -241         1955 Rock… Stud…                 NA
## 5 Bo Diddley    Bo Did…         -241         1955 Rock… Stud…                 NA
## 6 Bo Diddley    Bo Did…         -241         1955 Rock… Stud…                 NA
## # ℹ 12 more variables: peak_billboard_position <dbl>, spotify_popularity <dbl>,
## #   spotify_url <chr>, artist_member_count <dbl>, artist_gender <chr>,
## #   artist_birth_year_sum <dbl>, debut_album_release_year <dbl>,
## #   ave_age_at_top_500 <dbl>, years_between <dbl>, album_id <chr>, year <chr>,
## #   rank <dbl>

Now we can see that this updated tidy dataset has only one column with the artist’s name, and the three columns that held rank values have been transformed into two long-format columns, one titled “year” and another titled “rank”.
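As a quick sanity check (a sketch), no album should now appear more than once within a given ranking year:

#after the pivot, expect zero rows from this query
rolling_tidy %>% count(album_id, year) %>% filter(n > 1)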

  • Analysis:

Using this new tidy version of the data, we can now begin to perform some analysis. For this project, I will be answering these questions:

  • What is the average rank per year?
  • Which genres are consistently ranked the highest?
  • Which artists have the most albums on the list?

#What is the average rank per year?

avg_rank <- rolling_tidy %>%
  group_by(year) %>%
  summarize(avg_rank = mean(rank, na.rm = TRUE))

avg_rank_year <- ggplot(avg_rank, aes(x = as.numeric(year), y = avg_rank)) +
  geom_line() +
  geom_point() +
  labs(title = "Average Rank Over Years",
       x = "year",
       y = "Average Rank")

avg_rank_year   #print the ggplot object to render it (autoplot() is for model objects, not ggplots)

#Which genres are consistently ranked the highest?

genre_rank <- rolling_tidy %>%
  group_by(genre) %>%
  summarize(avg_rank = mean(rank, na.rm = TRUE)) %>%
  arrange(avg_rank)

genre_rank
## # A tibble: 17 × 2
##    genre                               avg_rank
##    <chr>                                  <dbl>
##  1 Blues/Blues ROck                        169.
##  2 Reggae                                  182.
##  3 Rock n' Roll/Rhythm & Blues             189.
##  4 Big Band/Jazz                           216.
##  5 Country/Folk/Country Rock/Folk Rock     216.
##  6 Funk/Disco                              222.
##  7 Hard Rock/Metal                         230.
##  8 Soul/Gospel/R&B                         242.
##  9 Blues/Blues Rock                        242.
## 10 <NA>                                    248.
## 11 Hip-Hop/Rap                             249.
## 12 Singer-Songwriter/Heartland Rock        261.
## 13 Punk/Post-Punk/New Wave/Power Pop       280.
## 14 Indie/Alternative Rock                  285.
## 15 Electronic                              337.
## 16 Latin                                   427.
## 17 Afrobeat                                434.
p_genre <- ggplot(genre_rank, aes(x = reorder(genre, avg_rank), y = avg_rank)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(title = "Average Rank by Genre",
       x = "Genre",
       y = "Average Rank")
p_genre

#Which artists have the most albums on the list?

artist_album_count <- rolling %>%
  group_by(clean_name) %>%
  summarize(album_count = n_distinct(album)) %>%
  arrange(desc(album_count))

artist_album_count
## # A tibble: 386 × 2
##    clean_name        album_count
##    <chr>                   <int>
##  1 Bob Dylan                  11
##  2 The Beatles                11
##  3 Rolling Stones             10
##  4 Bruce Springsteen           9
##  5 David Bowie                 7
##  6 The Who                     7
##  7 Elton John                  6
##  8 Kanye West                  6
##  9 Neil Young                  6
## 10 Bob Marley                  5
## # ℹ 376 more rows
top_artists <- artist_album_count %>% top_n(10, album_count)
p_artist <- ggplot(top_artists, aes(x = reorder(clean_name, album_count), y = album_count)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(title = "Top 10 Artists with Most Albums on the List",
       x = "Artist",
       y = "Number of Albums")
p_artist

By removing the duplicate name column and reshaping the three rank columns into two, we were able to tidy this dataset and draw some real insights from it.

Cola Dataset

This Coca-Cola dataset stores yearly financial data across multiple columns. I tried to go outside of my comfort zone for this project and work with an Excel file, one whose data is spread across multiple columns and laid out in a very computer-unfriendly way.

The dataset also contains sales data for different cola brands, spread across separate columns.

This dataset has several issues that make it untidy:

  1. Separate year columns (e.g., FY ’09) that should be converted into a single “Year” variable
  2. Summary rows (e.g., totals and subtotals) that may appear in the data and should either be removed or retained as a separate category
  3. Categories such as “Cost of goods sold” that appear as row labels but should become values of a “Category” variable
  • Reading the Data:

Messy Excel files often have headers, footers, or multiple tables in one sheet. To avoid this, we can use the read_excel() function with the range argument to isolate only the table we need.

In this case, I only want the “Profit and Loss” portion of the data, so once we know which cells that portion occupies in the sheet, we can isolate just that range.
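Since the exact cell range isn’t obvious up front, one way to find it (a sketch, not run here) is to read the whole sheet with no column names and scan for where the block sits:

#peek at the raw sheet to locate the Profit and Loss block
peek <- read_excel("Cola.xlsx", col_names = FALSE)
print(peek, n = 20)   #eyeball the first rows for the table boundaries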

cola_profloss <- read_excel("Cola.xlsx", range = "A3:K10", col_names = TRUE)
## New names:
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
## • `` -> `...7`
## • `` -> `...8`
## • `` -> `...9`
## • `` -> `...10`
## • `` -> `...11`
cola_profloss
## # A tibble: 7 × 11
##   Profit & Loss statemen…¹ ...2  ...3  ...4  ...5  ...6  ...7  ...8  ...9  ...10
##   <lgl>                    <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 NA                       in m… FY '… FY '… FY '… FY '… FY '… FY '… FY '… FY '…
## 2 NA                       NET … 30990 35119 46542 48017 46854 45998 44294 41863
## 3 NA                       Cost… 11088 12693 18215 19053 18421 17889 17482 16465
## 4 NA                       Gros… 19902 22426 28327 28964 28433 28109 26812 25398
## 5 NA                       Sell… 11358 13194 17422 17738 17310 17218 16427 15262
## 6 NA                       Othe… 313   819   732   447   895   1183  1657  1510 
## 7 NA                       Oper… 8231  8413  10173 10779 10228 9708  8728  8626 
## # ℹ abbreviated name: ¹​`Profit & Loss statement`
## # ℹ 1 more variable: ...11 <chr>

Taking a look at this output, the first column is empty and the real column names are sitting in the first data row. We need to:

  1. Remove the empty first column
  2. Use the new first row as the column names (and then drop that row)
  3. Rename the label column to “Category”
  4. Pivot from a wide format to a long format (collapsing the year columns into a single “Year” column)
  5. Parse the year strings into numeric years (and convert the values to numeric)
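(As an aside, step 2 can be done in a single call with the janitor package; a sketch, assuming janitor is installed, left not-run here:)

#janitor::row_to_names() promotes a data row to the column names and drops it
#cola_profloss <- janitor::row_to_names(cola_profloss, row_number = 1)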

#convert data to a tibble for easier manipulation and drop the empty first column
cola_profloss <- as_tibble(cola_profloss)

cola_profloss <- cola_profloss[ , -1]   #the first column holds no values, only NAs

cola_profloss
## # A tibble: 7 × 10
##   ...2                     ...3  ...4  ...5  ...6  ...7  ...8  ...9  ...10 ...11
##   <chr>                    <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 in million USD           FY '… FY '… FY '… FY '… FY '… FY '… FY '… FY '… FY '…
## 2 NET OPERATING REVENUES   30990 35119 46542 48017 46854 45998 44294 41863 35410
## 3 Cost of goods sold       11088 12693 18215 19053 18421 17889 17482 16465 13255
## 4 Gross Profit             19902 22426 28327 28964 28433 28109 26812 25398 22155
## 5 Selling, general and ad… 11358 13194 17422 17738 17310 17218 16427 15262 12654
## 6 Other operating charges  313   819   732   447   895   1183  1657  1510  1902 
## 7 Operating Income         8231  8413  10173 10779 10228 9708  8728  8626  7599
#use the new first row as column names

colnames(cola_profloss) <- as.character(cola_profloss[1, ])

cola_profloss
## # A tibble: 7 × 10
##   `in million USD`         `FY '09` `FY '10` `FY '11` `FY '12` `FY '13` `FY '14`
##   <chr>                    <chr>    <chr>    <chr>    <chr>    <chr>    <chr>   
## 1 in million USD           FY '09   FY '10   FY '11   FY '12   FY '13   FY '14  
## 2 NET OPERATING REVENUES   30990    35119    46542    48017    46854    45998   
## 3 Cost of goods sold       11088    12693    18215    19053    18421    17889   
## 4 Gross Profit             19902    22426    28327    28964    28433    28109   
## 5 Selling, general and ad… 11358    13194    17422    17738    17310    17218   
## 6 Other operating charges  313      819      732      447      895      1183    
## 7 Operating Income         8231     8413     10173    10779    10228    9708    
## # ℹ 3 more variables: `FY '15` <chr>, `FY '16` <chr>, `FY '17` <chr>
#since we used the first row as column names, we can now remove that row from the data
cola_profloss <- cola_profloss[-1, ]

cola_profloss
## # A tibble: 6 × 10
##   `in million USD`         `FY '09` `FY '10` `FY '11` `FY '12` `FY '13` `FY '14`
##   <chr>                    <chr>    <chr>    <chr>    <chr>    <chr>    <chr>   
## 1 NET OPERATING REVENUES   30990    35119    46542    48017    46854    45998   
## 2 Cost of goods sold       11088    12693    18215    19053    18421    17889   
## 3 Gross Profit             19902    22426    28327    28964    28433    28109   
## 4 Selling, general and ad… 11358    13194    17422    17738    17310    17218   
## 5 Other operating charges  313      819      732      447      895      1183    
## 6 Operating Income         8231     8413     10173    10779    10228    9708    
## # ℹ 3 more variables: `FY '15` <chr>, `FY '16` <chr>, `FY '17` <chr>
cola_profloss <- cola_profloss %>%
  rename(Category = `in million USD`)

cola_profloss
## # A tibble: 6 × 10
##   Category        `FY '09` `FY '10` `FY '11` `FY '12` `FY '13` `FY '14` `FY '15`
##   <chr>           <chr>    <chr>    <chr>    <chr>    <chr>    <chr>    <chr>   
## 1 NET OPERATING … 30990    35119    46542    48017    46854    45998    44294   
## 2 Cost of goods … 11088    12693    18215    19053    18421    17889    17482   
## 3 Gross Profit    19902    22426    28327    28964    28433    28109    26812   
## 4 Selling, gener… 11358    13194    17422    17738    17310    17218    16427   
## 5 Other operatin… 313      819      732      447      895      1183     1657    
## 6 Operating Inco… 8231     8413     10173    10779    10228    9708     8728    
## # ℹ 2 more variables: `FY '16` <chr>, `FY '17` <chr>
cola_profloss_tidy <- cola_profloss %>%
  pivot_longer(
    cols = starts_with("FY"),   #every fiscal-year column
    names_to = "Year",          #old column names become the Year values
    values_to = "Value"         #cell contents become the Value column
  )

cola_profloss_tidy
## # A tibble: 54 × 3
##    Category               Year   Value
##    <chr>                  <chr>  <chr>
##  1 NET OPERATING REVENUES FY '09 30990
##  2 NET OPERATING REVENUES FY '10 35119
##  3 NET OPERATING REVENUES FY '11 46542
##  4 NET OPERATING REVENUES FY '12 48017
##  5 NET OPERATING REVENUES FY '13 46854
##  6 NET OPERATING REVENUES FY '14 45998
##  7 NET OPERATING REVENUES FY '15 44294
##  8 NET OPERATING REVENUES FY '16 41863
##  9 NET OPERATING REVENUES FY '17 35410
## 10 Cost of goods sold     FY '09 11088
## # ℹ 44 more rows
cola_profloss_tidy <- cola_profloss_tidy %>%
  mutate(
    Year = str_remove(Year, "FY "),    #remove the "FY " prefix
    Year = str_remove_all(Year, "'"),  #remove the apostrophe
    Year = as.numeric(Year),           #convert to numeric
    Year = if_else(Year < 30, 2000 + Year, 1900 + Year),  #expand two-digit years (e.g. 09 -> 2009)
    Value = as.numeric(Value)          #values were read in as text; make them numeric for plotting
  )

cola_profloss_tidy
## # A tibble: 54 × 3
##    Category                Year Value
##    <chr>                  <dbl> <dbl>
##  1 NET OPERATING REVENUES  2009 30990
##  2 NET OPERATING REVENUES  2010 35119
##  3 NET OPERATING REVENUES  2011 46542
##  4 NET OPERATING REVENUES  2012 48017
##  5 NET OPERATING REVENUES  2013 46854
##  6 NET OPERATING REVENUES  2014 45998
##  7 NET OPERATING REVENUES  2015 44294
##  8 NET OPERATING REVENUES  2016 41863
##  9 NET OPERATING REVENUES  2017 35410
## 10 Cost of goods sold      2009 11088
## # ℹ 44 more rows
  • Analysis:

For my analysis, I will create a bar plot to compare the values of each category in the most recent year

latest_year_data <- cola_profloss_tidy %>%
  filter(Year == max(Year, na.rm = TRUE))

p_bar <- ggplot(latest_year_data, aes(x = reorder(Category, Value), y = Value)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(
    title = paste("Category Values in", max(latest_year_data$Year)),
    x = "Category",
    y = "Value (Millions USD)"
  )
p_bar

By tidying the dataset, we are able to understand the data better and perform stronger analysis on it, depending on what information we are trying to see.

Data Science Jobs Dataset

The Uncleaned_DS_jobs.csv file is accessed from my GitHub repo and contains job posting data for Data Science jobs posted on Glassdoor, web-scraped into a CSV file.

This dataset is untidy because:

  • The Salary Estimate column includes a range with symbols (e.g., “$137K-$171K (Glassdoor est.)”) rather than separate min/max numeric values
  • The Location column contains both city and state in one cell rather than a separate column for each
  • The Job Description column holds extremely long text that may contain multiple lines, symbols, and repeated content
  • There are other columns that may not be necessary for some analyses

To preview the string work involved, see the sketch after this list.
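A minimal sketch of the salary parsing on one sample value (the full-column version appears in the tidying step below):

s <- "$137K-$171K (Glassdoor est.)"             #a sample value in the column's format
s_clean <- str_remove_all(s, "\\$|K|\\(.*\\)")  #strip the $, the K, and the parenthetical
str_extract(s_clean, "^[0-9]+")                 #minimum: the leading digits, "137"
str_extract(s_clean, "(?<=-)[0-9]+")            #maximum: the digits after the dash, "171"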

  • Load Data:
library(stringr)   #already attached via tidyverse; loaded explicitly for emphasis
ds_jobs <- read_csv("https://raw.githubusercontent.com/silmaxk/DATA607_CUNYSPS/refs/heads/main/Project%232/Uncleaned_DS_jobs.csv")
## Rows: 672 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (12): Job Title, Salary Estimate, Job Description, Company Name, Locatio...
## dbl  (3): index, Rating, Founded
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(ds_jobs)
## # A tibble: 6 × 15
##   index `Job Title`    `Salary Estimate` `Job Description` Rating `Company Name`
##   <dbl> <chr>          <chr>             <chr>              <dbl> <chr>         
## 1     0 Sr Data Scien… $137K-$171K (Gla… "Description\n\n…    3.1 "Healthfirst\…
## 2     1 Data Scientist $137K-$171K (Gla… "Secure our Nati…    4.2 "ManTech\n4.2"
## 3     2 Data Scientist $137K-$171K (Gla… "Overview\n\n\nA…    3.8 "Analysis Gro…
## 4     3 Data Scientist $137K-$171K (Gla… "JOB DESCRIPTION…    3.5 "INFICON\n3.5"
## 5     4 Data Scientist $137K-$171K (Gla… "Data Scientist\…    2.9 "Affinity Sol…
## 6     5 Data Scientist $137K-$171K (Gla… "About Us:\n\nHe…    4.2 "HG Insights\…
## # ℹ 9 more variables: Location <chr>, Headquarters <chr>, Size <chr>,
## #   Founded <dbl>, `Type of ownership` <chr>, Industry <chr>, Sector <chr>,
## #   Revenue <chr>, Competitors <chr>
glimpse(ds_jobs)
## Rows: 672
## Columns: 15
## $ index               <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, …
## $ `Job Title`         <chr> "Sr Data Scientist", "Data Scientist", "Data Scien…
## $ `Salary Estimate`   <chr> "$137K-$171K (Glassdoor est.)", "$137K-$171K (Glas…
## $ `Job Description`   <chr> "Description\n\nThe Senior Data Scientist is respo…
## $ Rating              <dbl> 3.1, 4.2, 3.8, 3.5, 2.9, 4.2, 3.9, 3.5, 4.4, 3.6, …
## $ `Company Name`      <chr> "Healthfirst\n3.1", "ManTech\n4.2", "Analysis Grou…
## $ Location            <chr> "New York, NY", "Chantilly, VA", "Boston, MA", "Ne…
## $ Headquarters        <chr> "New York, NY", "Herndon, VA", "Boston, MA", "Bad …
## $ Size                <chr> "1001 to 5000 employees", "5001 to 10000 employees…
## $ Founded             <dbl> 1993, 1968, 1981, 2000, 1998, 2010, 1996, 1990, 19…
## $ `Type of ownership` <chr> "Nonprofit Organization", "Company - Public", "Pri…
## $ Industry            <chr> "Insurance Carriers", "Research & Development", "C…
## $ Sector              <chr> "Insurance", "Business Services", "Business Servic…
## $ Revenue             <chr> "Unknown / Non-Applicable", "$1 to $2 billion (USD…
## $ Competitors         <chr> "EmblemHealth, UnitedHealth Group, Aetna", "-1", "…
  • Tidying Data:
jobs_clean <- ds_jobs %>%
  mutate(
    Salary_Clean = str_remove_all(`Salary Estimate`, "\\$|K|\\(.*\\)"),  #strip $, K, and "(Glassdoor est.)"
    Salary_Min = as.numeric(str_extract(Salary_Clean, "^[0-9]+")),       #leading number = range minimum
    Salary_Max = as.numeric(str_extract(Salary_Clean, "(?<=-)[0-9]+"))   #number after the dash = range maximum
  )

jobs_clean %>% select(`Job Title`, `Salary Estimate`, Salary_Min, Salary_Max) %>% head()
## # A tibble: 6 × 4
##   `Job Title`       `Salary Estimate`            Salary_Min Salary_Max
##   <chr>             <chr>                             <dbl>      <dbl>
## 1 Sr Data Scientist $137K-$171K (Glassdoor est.)        137        171
## 2 Data Scientist    $137K-$171K (Glassdoor est.)        137        171
## 3 Data Scientist    $137K-$171K (Glassdoor est.)        137        171
## 4 Data Scientist    $137K-$171K (Glassdoor est.)        137        171
## 5 Data Scientist    $137K-$171K (Glassdoor est.)        137        171
## 6 Data Scientist    $137K-$171K (Glassdoor est.)        137        171
jobs_clean <- jobs_clean %>%
  separate(Location, into = c("City", "State"), sep = ", ",
           fill = "right", extra = "merge")   #split "City, ST"; fill/extra guard against odd values

jobs_clean %>% select(`Job Title`, City, State) %>% head()
## # A tibble: 6 × 3
##   `Job Title`       City          State
##   <chr>             <chr>         <chr>
## 1 Sr Data Scientist New York      NY   
## 2 Data Scientist    Chantilly     VA   
## 3 Data Scientist    Boston        MA   
## 4 Data Scientist    Newton        MA   
## 5 Data Scientist    New York      NY   
## 6 Data Scientist    Santa Barbara CA
jobs_clean <- jobs_clean %>%
  select(-index, -Competitors, -Headquarters)  # remove columns you don't need

colnames(jobs_clean)
##  [1] "Job Title"         "Salary Estimate"   "Job Description"  
##  [4] "Rating"            "Company Name"      "City"             
##  [7] "State"             "Size"              "Founded"          
## [10] "Type of ownership" "Industry"          "Sector"           
## [13] "Revenue"           "Salary_Clean"      "Salary_Min"       
## [16] "Salary_Max"
jobs_clean <- jobs_clean %>%
  mutate(`Job Description` = str_squish(`Job Description`))   #collapse newlines and repeated whitespace

jobs_clean %>% select(`Job Description`) %>% head(1)
## # A tibble: 1 × 1
##   `Job Description`                                                             
##   <chr>                                                                         
## 1 Description The Senior Data Scientist is responsible for defining, building, …
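With the description text squished into single-line strings, simple keyword searches become easy; a sketch using a hypothetical skill keyword (not part of the assignment’s analysis):

#flag postings whose cleaned description mentions "python" (illustrative keyword)
jobs_clean %>%
  mutate(mentions_python = str_detect(`Job Description`,
                                      regex("python", ignore_case = TRUE))) %>%
  count(mentions_python)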
  • Analysis:

Now that we have cleaned the data, we can do some analysis on it. For example, we can see which states have the most job postings.

state_counts <- jobs_clean %>%
  filter(!is.na(State)) %>%
  count(State, sort = TRUE)

head(state_counts, 10)
## # A tibble: 10 × 2
##    State     n
##    <chr> <int>
##  1 CA      165
##  2 VA       89
##  3 MA       62
##  4 NY       52
##  5 MD       39
##  6 IL       30
##  7 DC       26
##  8 TX       17
##  9 WA       16
## 10 OH       14
top_states <- head(state_counts, 10)

p_states <- ggplot(top_states, aes(x = reorder(State, n), y = n)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(title = "Top 10 States with DS Job Postings",
       x = "State",
       y = "Number of Postings")

p_states

Taking a look at this analysis, we can see that California has significantly more Data Science job listings than any other state.

Conclusion

Throughout this project, we examined three untidy datasets, performed comprehensive tidying and transformation on each one, and then ran some analysis on the resulting tidy data:

  1. Rolling Stone Dataset:
  • The data had multiple rank columns across different years and duplicate artist-name columns
  • We removed the repeated column and pivoted the rank columns into a single “rank” column with an associated “year” variable; this allowed us to easily compare how rankings change over time and identify which genres and artists were most popular
  2. Cola Dataset:
  • The financial data was spread across multiple columns for different years, with line-item categories stored as row labels
  • We identified the relevant range of cells in the Excel file, removed the extra headers, and pivoted the wide format into a tidy structure with a single year column and a value column, keeping the line items under a Category column; once the data was tidy, we could summarize and visualize the categories in the most recent year
  3. Data Science Jobs Dataset:
  • This dataset had complex text fields, numerical observations stored as strings, and combined location data in a single column
  • We used string manipulation to parse the salary ranges into minimum and maximum numeric columns, split the location data into separate city and state columns, and removed the extra whitespace from the job description text; these transformations let us run analyses such as finding the top 10 states with the most data science job postings

By following the rules for tidy data, each dataset was transformed from an untidy, wide structure into a clear and consistent format that allows for deeper statistical analysis and stronger visualizations.