Your task is to:
Choose any three of the “wide” datasets identified in the Week 4 Discussion items. (You may use your own dataset; please don’t use my Sample Post dataset, since that was used in your Week 4 assignment!) For each of the three chosen datasets:
Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below.
Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!]
Perform the analysis requested in the discussion item
Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.
In this project, we will work with three datasets that are not considered to be “tidy” data or in a “tidy” format. Tidy data follows the principle that each variable should have its own column and each observation has its own row. When data is untidy, it can lead to challenges when trying to perform some analysis on the data because the information is spread across multiple columns or compressed into a single cell.
For this project, I decided to use these three untidy datasets: - rolling_stone.csv Rolling Stone Dataset - cola.xlxs - Coca Cola Sales Dataset - Uncleaned_DS_jobs.csv - Data Science Jobs Dataset (from Glassdor)
Each of these datasets is untidy for different reason and we will cover the reasons why as we walk through our cleaning or tidying of the data.
rolling_stone.csv: - This dataset came from the “tidytuesday” githib repo that collects untidy datasets that have information about artists and their albums and how the albums have ranked throughout the years 2002, 2012, and 2020. The dataset also contains information about thr artists and type of music produced
Link: https://github.com/rfordatascience/tidytuesday/blob/main/data/2024/2024-05-07/rolling_stone.csv
cola.xlxs: - This dataset is collected from kaggle and contains information regarding Coca Cola sales and their profit and loss
Link: https://www.kaggle.com/datasets/shivavashishtha/dirty-excel-data?resource=download Uncleaned_DS_jobs.csv: - This dataset contains information regarding data science jobs that were scrapped from Glassdoor’s website. This is the uncleaned version, so the data that was direcly pulled from the website, without any cleaning work done to it
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readxl)
library(ggfortify)
## Warning: package 'ggfortify' was built under R version 4.4.3
The rolling_stone.csv file is accessed directly from my github repo for this project and can also be found using the github link provided above.
This dataset is considered to be untidy because it includes duplicate columns for the artist’s name in two different columns. Additionally, it holds ranking data in three separate columns. Rather than having separate columns for each years rank, tidy data should have a single “Rank” column with a corresponding “Year” column. This structure makes it challenging to compare rankings across years without reshaping the data.
rolling <- read_csv("https://raw.githubusercontent.com/silmaxk/DATA607_CUNYSPS/refs/heads/main/Project%232/rolling_stone.csv")
## Rows: 691 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): sort_name, clean_name, album, genre, type, spotify_url, artist_gen...
## dbl (13): rank_2003, rank_2012, rank_2020, differential, release_year, weeks...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(rolling)
## # A tibble: 6 × 21
## sort_name clean_name album rank_2003 rank_2012 rank_2020 differential
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Sinatra, Frank Frank Sinatra In t… 100 101 282 -182
## 2 Diddley, Bo Bo Diddley Bo D… 214 216 455 -241
## 3 Presley, Elvis Elvis Presley Elvi… 55 56 332 -277
## 4 Sinatra, Frank Frank Sinatra Song… 306 308 NA -195
## 5 Little Richard Little Richard Here… 50 50 227 -177
## 6 Beyonce Beyonce Lemo… NA NA 32 469
## # ℹ 14 more variables: release_year <dbl>, genre <chr>, type <chr>,
## # weeks_on_billboard <dbl>, peak_billboard_position <dbl>,
## # spotify_popularity <dbl>, spotify_url <chr>, artist_member_count <dbl>,
## # artist_gender <chr>, artist_birth_year_sum <dbl>,
## # debut_album_release_year <dbl>, ave_age_at_top_500 <dbl>,
## # years_between <dbl>, album_id <chr>
Remove duplicate Artists Name: Since “sort_name” and “clean_name” represent the same information, we can choose to keep one, while removing the other. In this case, we will keep “clean_name” as it sets the name already to be in a readable format without unecessary punctuation
Pivot rank column: We reshape the three rank columns into a long format that produces a “Year: column that will hold values 2003, 2012, and 2020) and a”Rank” column
rolling_tidy <- rolling %>%
select(-sort_name) %>% #removing the duplicate column and just keeping "clean_name"
pivot_longer(
cols = starts_with("rank_"), #select columns that begin with "rank_" to make it faster to deal with
names_to = "year", #creating a new column that will hold the year information
names_prefix = "rank_", #removing the "rank_" prefix from the year values
values_to = "rank" #creating a new column for the rank values
) %>%
filter(!is.na(rank)) #removing any rows with missing rank values
head(rolling_tidy)
## # A tibble: 6 × 19
## clean_name album differential release_year genre type weeks_on_billboard
## <chr> <chr> <dbl> <dbl> <chr> <chr> <dbl>
## 1 Frank Sinatra In the… -182 1955 Big … Stud… 14
## 2 Frank Sinatra In the… -182 1955 Big … Stud… 14
## 3 Frank Sinatra In the… -182 1955 Big … Stud… 14
## 4 Bo Diddley Bo Did… -241 1955 Rock… Stud… NA
## 5 Bo Diddley Bo Did… -241 1955 Rock… Stud… NA
## 6 Bo Diddley Bo Did… -241 1955 Rock… Stud… NA
## # ℹ 12 more variables: peak_billboard_position <dbl>, spotify_popularity <dbl>,
## # spotify_url <chr>, artist_member_count <dbl>, artist_gender <chr>,
## # artist_birth_year_sum <dbl>, debut_album_release_year <dbl>,
## # ave_age_at_top_500 <dbl>, years_between <dbl>, album_id <chr>, year <chr>,
## # rank <dbl>
Now, we can see that this new updated tidy dataset only has one column with the artists name and now two columns, one titled “year” and another titled “rank” that has broken down the three columns holding rank values and have transformed it into a long data format.
Using this new tidy version of data, we can now begin got perform some analysis on the data. For this project, I will be answering these questions: - What is the average rank per year? - Which genres are consistently ranked the highest? - Which artists have the most albums on the list?
#What is the average rank per year?
avg_rank <- rolling_tidy %>%
group_by(year) %>%
summarize(avg_rank = mean(rank, na.rm = TRUE))
avg_rank_year <- ggplot(avg_rank, aes(x = as.numeric(year), y = avg_rank)) +
geom_line() +
geom_point() +
labs(title = "Average Rank Over Years",
x = "year",
y = "Average Rank")
autoplot(avg_rank_year)
#Which genres are consistently ranked the highest?
genre_rank <- rolling_tidy %>%
group_by(genre) %>%
summarize(avg_rank = mean(rank, na.rm = TRUE)) %>%
arrange(avg_rank)
genre_rank
## # A tibble: 17 × 2
## genre avg_rank
## <chr> <dbl>
## 1 Blues/Blues ROck 169.
## 2 Reggae 182.
## 3 Rock n' Roll/Rhythm & Blues 189.
## 4 Big Band/Jazz 216.
## 5 Country/Folk/Country Rock/Folk Rock 216.
## 6 Funk/Disco 222.
## 7 Hard Rock/Metal 230.
## 8 Soul/Gospel/R&B 242.
## 9 Blues/Blues Rock 242.
## 10 <NA> 248.
## 11 Hip-Hop/Rap 249.
## 12 Singer-Songwriter/Heartland Rock 261.
## 13 Punk/Post-Punk/New Wave/Power Pop 280.
## 14 Indie/Alternative Rock 285.
## 15 Electronic 337.
## 16 Latin 427.
## 17 Afrobeat 434.
p_genre <- ggplot(genre_rank, aes(x = reorder(genre, avg_rank), y = avg_rank)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(title = "Average Rank by Genre",
x = "Genre",
y = "Average Rank")
autoplot(p_genre)
#Which artists have the most albums on the list?
artist_album_count <- rolling %>%
group_by(clean_name) %>%
summarize(album_count = n_distinct(album)) %>%
arrange(desc(album_count))
artist_album_count
## # A tibble: 386 × 2
## clean_name album_count
## <chr> <int>
## 1 Bob Dylan 11
## 2 The Beatles 11
## 3 Rolling Stones 10
## 4 Bruce Springsteen 9
## 5 David Bowie 7
## 6 The Who 7
## 7 Elton John 6
## 8 Kanye West 6
## 9 Neil Young 6
## 10 Bob Marley 5
## # ℹ 376 more rows
top_artists <- artist_album_count %>% top_n(10, album_count)
p_artist <- ggplot(top_artists, aes(x = reorder(clean_name, album_count), y = album_count)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(title = "Top 10 Artists with Most Albums on the List",
x = "Artist",
y = "Number of Albums")
autoplot(p_artist)
By tidying this dataset and by removing that one column and changing those three different columns into two, we were able to perform strong analysis on the dataset to find some insights.
This Coca Cola dataset stores quarterly sales data across multiple columns. I tried to go outside of my comfort zone and work with an excel file for this project, also with data that have data stored in multiple columns and are put into a very computer un-friendly way.
This dataset contains sales data for different cola brands as well that are spread across separate columns.
This dataset has several issues that make it untidy:
Messy excel files have headers, footers, or multiple tables in one sheet, so to avoid this we can use the read_excel() function with the range argument to isolate the table that I need
In this case, I only want to focus on the “Profit and Loss” portion of the data, so depending on which portion of the data that falls in for the excel sheet, we can isolate just that
cola_profloss <- read_excel("Cola.xlsx", range = "A3:K10", col_names = TRUE)
## New names:
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
## • `` -> `...7`
## • `` -> `...8`
## • `` -> `...9`
## • `` -> `...10`
## • `` -> `...11`
cola_profloss
## # A tibble: 7 × 11
## Profit & Loss statemen…¹ ...2 ...3 ...4 ...5 ...6 ...7 ...8 ...9 ...10
## <lgl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 NA in m… FY '… FY '… FY '… FY '… FY '… FY '… FY '… FY '…
## 2 NA NET … 30990 35119 46542 48017 46854 45998 44294 41863
## 3 NA Cost… 11088 12693 18215 19053 18421 17889 17482 16465
## 4 NA Gros… 19902 22426 28327 28964 28433 28109 26812 25398
## 5 NA Sell… 11358 13194 17422 17738 17310 17218 16427 15262
## 6 NA Othe… 313 819 732 447 895 1183 1657 1510
## 7 NA Oper… 8231 8413 10173 10779 10228 9708 8728 8626
## # ℹ abbreviated name: ¹`Profit & Loss statement`
## # ℹ 1 more variable: ...11 <chr>
Taking a look at this and given the output, the first row and columns contain extra headings rather than clean column and values. We need to: 1. Remove the first row 2. Use the new first row as column names 3. Rename the columns 4. Pivot from a wide format to a long format (by changing the years to become a single year column) 5. We need to parse the year strings into a numeric year
#convert data to a tibble for easier manipulation and remove the first row since it is unecessary
cola_profloss <- as_tibble(cola_profloss)
cola_profloss <- cola_profloss[ , -1]
cola_profloss
## # A tibble: 7 × 10
## ...2 ...3 ...4 ...5 ...6 ...7 ...8 ...9 ...10 ...11
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 in million USD FY '… FY '… FY '… FY '… FY '… FY '… FY '… FY '… FY '…
## 2 NET OPERATING REVENUES 30990 35119 46542 48017 46854 45998 44294 41863 35410
## 3 Cost of goods sold 11088 12693 18215 19053 18421 17889 17482 16465 13255
## 4 Gross Profit 19902 22426 28327 28964 28433 28109 26812 25398 22155
## 5 Selling, general and ad… 11358 13194 17422 17738 17310 17218 16427 15262 12654
## 6 Other operating charges 313 819 732 447 895 1183 1657 1510 1902
## 7 Operating Income 8231 8413 10173 10779 10228 9708 8728 8626 7599
#use the new first row as column names
colnames(cola_profloss) <- as.character(cola_profloss[1, ])
cola_profloss
## # A tibble: 7 × 10
## `in million USD` `FY '09` `FY '10` `FY '11` `FY '12` `FY '13` `FY '14`
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 in million USD FY '09 FY '10 FY '11 FY '12 FY '13 FY '14
## 2 NET OPERATING REVENUES 30990 35119 46542 48017 46854 45998
## 3 Cost of goods sold 11088 12693 18215 19053 18421 17889
## 4 Gross Profit 19902 22426 28327 28964 28433 28109
## 5 Selling, general and ad… 11358 13194 17422 17738 17310 17218
## 6 Other operating charges 313 819 732 447 895 1183
## 7 Operating Income 8231 8413 10173 10779 10228 9708
## # ℹ 3 more variables: `FY '15` <chr>, `FY '16` <chr>, `FY '17` <chr>
#since we used the first row as column names, we can now remove that row from the data
cola_profloss <- cola_profloss[-1, ]
cola_profloss
## # A tibble: 6 × 10
## `in million USD` `FY '09` `FY '10` `FY '11` `FY '12` `FY '13` `FY '14`
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 NET OPERATING REVENUES 30990 35119 46542 48017 46854 45998
## 2 Cost of goods sold 11088 12693 18215 19053 18421 17889
## 3 Gross Profit 19902 22426 28327 28964 28433 28109
## 4 Selling, general and ad… 11358 13194 17422 17738 17310 17218
## 5 Other operating charges 313 819 732 447 895 1183
## 6 Operating Income 8231 8413 10173 10779 10228 9708
## # ℹ 3 more variables: `FY '15` <chr>, `FY '16` <chr>, `FY '17` <chr>
cola_profloss <- cola_profloss %>%
rename(Category = `in million USD`)
cola_profloss
## # A tibble: 6 × 10
## Category `FY '09` `FY '10` `FY '11` `FY '12` `FY '13` `FY '14` `FY '15`
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 NET OPERATING … 30990 35119 46542 48017 46854 45998 44294
## 2 Cost of goods … 11088 12693 18215 19053 18421 17889 17482
## 3 Gross Profit 19902 22426 28327 28964 28433 28109 26812
## 4 Selling, gener… 11358 13194 17422 17738 17310 17218 16427
## 5 Other operatin… 313 819 732 447 895 1183 1657
## 6 Operating Inco… 8231 8413 10173 10779 10228 9708 8728
## # ℹ 2 more variables: `FY '16` <chr>, `FY '17` <chr>
cola_profloss_tidy <- cola_profloss %>%
pivot_longer(
cols = starts_with("FY"),
names_to = "Year",
values_to = "Value"
)
cola_profloss_tidy
## # A tibble: 54 × 3
## Category Year Value
## <chr> <chr> <chr>
## 1 NET OPERATING REVENUES FY '09 30990
## 2 NET OPERATING REVENUES FY '10 35119
## 3 NET OPERATING REVENUES FY '11 46542
## 4 NET OPERATING REVENUES FY '12 48017
## 5 NET OPERATING REVENUES FY '13 46854
## 6 NET OPERATING REVENUES FY '14 45998
## 7 NET OPERATING REVENUES FY '15 44294
## 8 NET OPERATING REVENUES FY '16 41863
## 9 NET OPERATING REVENUES FY '17 35410
## 10 Cost of goods sold FY '09 11088
## # ℹ 44 more rows
cola_profloss_tidy <- cola_profloss_tidy %>%
mutate(
Year = str_remove(Year, "FY "), #remove "FY "
Year = str_remove_all(Year, "'"), #remove symbols
Year = as.numeric(Year), #convert to numeric
Year = if_else(Year < 30, 2000 + Year, 1900 + Year)
)
cola_profloss_tidy
## # A tibble: 54 × 3
## Category Year Value
## <chr> <dbl> <chr>
## 1 NET OPERATING REVENUES 2009 30990
## 2 NET OPERATING REVENUES 2010 35119
## 3 NET OPERATING REVENUES 2011 46542
## 4 NET OPERATING REVENUES 2012 48017
## 5 NET OPERATING REVENUES 2013 46854
## 6 NET OPERATING REVENUES 2014 45998
## 7 NET OPERATING REVENUES 2015 44294
## 8 NET OPERATING REVENUES 2016 41863
## 9 NET OPERATING REVENUES 2017 35410
## 10 Cost of goods sold 2009 11088
## # ℹ 44 more rows
For my analysis, I will create a bar plot to compare the values of each category in the most recent year
latest_year_data <- cola_profloss_tidy %>%
filter(Year == max(Year, na.rm = TRUE))
p_bar <- ggplot(latest_year_data, aes(x = reorder(Category, Value), y = Value)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(
title = paste("Category Values in", max(latest_year_data$Year)),
x = "Category",
y = "Value (Millions USD)"
)
p_bar
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
By tidying the dataset, we are able to understand the data more and obtain stornger analysis on it depending on what type of information we are trying to see.
The Uncleaned_DS_jobs.csv file is accessible through my github repo which contains job posting data for Data Science jobs posted on Glassdoor that was web scraped into a csv file.
This dataset is untidy because: - The Salary Estimate column often includes a rage with symbols rather than seprate min/max numeric values - The Location column can contain both city and state in one cell rather than having a separate columns for each - The Job Description column has extremely long text that may contain multiples lines, symbols, and repeated content - There are other columns that may not be necessary for some analysis
library(stringr)
ds_jobs <- read_csv("https://raw.githubusercontent.com/silmaxk/DATA607_CUNYSPS/refs/heads/main/Project%232/Uncleaned_DS_jobs.csv")
## Rows: 672 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (12): Job Title, Salary Estimate, Job Description, Company Name, Locatio...
## dbl (3): index, Rating, Founded
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(ds_jobs)
## # A tibble: 6 × 15
## index `Job Title` `Salary Estimate` `Job Description` Rating `Company Name`
## <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 0 Sr Data Scien… $137K-$171K (Gla… "Description\n\n… 3.1 "Healthfirst\…
## 2 1 Data Scientist $137K-$171K (Gla… "Secure our Nati… 4.2 "ManTech\n4.2"
## 3 2 Data Scientist $137K-$171K (Gla… "Overview\n\n\nA… 3.8 "Analysis Gro…
## 4 3 Data Scientist $137K-$171K (Gla… "JOB DESCRIPTION… 3.5 "INFICON\n3.5"
## 5 4 Data Scientist $137K-$171K (Gla… "Data Scientist\… 2.9 "Affinity Sol…
## 6 5 Data Scientist $137K-$171K (Gla… "About Us:\n\nHe… 4.2 "HG Insights\…
## # ℹ 9 more variables: Location <chr>, Headquarters <chr>, Size <chr>,
## # Founded <dbl>, `Type of ownership` <chr>, Industry <chr>, Sector <chr>,
## # Revenue <chr>, Competitors <chr>
glimpse(ds_jobs)
## Rows: 672
## Columns: 15
## $ index <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, …
## $ `Job Title` <chr> "Sr Data Scientist", "Data Scientist", "Data Scien…
## $ `Salary Estimate` <chr> "$137K-$171K (Glassdoor est.)", "$137K-$171K (Glas…
## $ `Job Description` <chr> "Description\n\nThe Senior Data Scientist is respo…
## $ Rating <dbl> 3.1, 4.2, 3.8, 3.5, 2.9, 4.2, 3.9, 3.5, 4.4, 3.6, …
## $ `Company Name` <chr> "Healthfirst\n3.1", "ManTech\n4.2", "Analysis Grou…
## $ Location <chr> "New York, NY", "Chantilly, VA", "Boston, MA", "Ne…
## $ Headquarters <chr> "New York, NY", "Herndon, VA", "Boston, MA", "Bad …
## $ Size <chr> "1001 to 5000 employees", "5001 to 10000 employees…
## $ Founded <dbl> 1993, 1968, 1981, 2000, 1998, 2010, 1996, 1990, 19…
## $ `Type of ownership` <chr> "Nonprofit Organization", "Company - Public", "Pri…
## $ Industry <chr> "Insurance Carriers", "Research & Development", "C…
## $ Sector <chr> "Insurance", "Business Services", "Business Servic…
## $ Revenue <chr> "Unknown / Non-Applicable", "$1 to $2 billion (USD…
## $ Competitors <chr> "EmblemHealth, UnitedHealth Group, Aetna", "-1", "…
jobs_clean <- ds_jobs %>%
mutate(
Salary_Clean = str_remove_all(`Salary Estimate`, "\\$|K|\\(.*\\)"),
Salary_Min = as.numeric(str_extract(Salary_Clean, "^[0-9]+")),
Salary_Max = as.numeric(str_extract(Salary_Clean, "(?<=-)[0-9]+"))
)
jobs_clean %>% select(`Job Title`, `Salary Estimate`, Salary_Min, Salary_Max) %>% head()
## # A tibble: 6 × 4
## `Job Title` `Salary Estimate` Salary_Min Salary_Max
## <chr> <chr> <dbl> <dbl>
## 1 Sr Data Scientist $137K-$171K (Glassdoor est.) 137 171
## 2 Data Scientist $137K-$171K (Glassdoor est.) 137 171
## 3 Data Scientist $137K-$171K (Glassdoor est.) 137 171
## 4 Data Scientist $137K-$171K (Glassdoor est.) 137 171
## 5 Data Scientist $137K-$171K (Glassdoor est.) 137 171
## 6 Data Scientist $137K-$171K (Glassdoor est.) 137 171
jobs_clean <- jobs_clean %>%
separate(Location, into = c("City", "State"), sep = ", ", fill = "right", extra = "merge")
jobs_clean %>% select(`Job Title`, City, State) %>% head()
## # A tibble: 6 × 3
## `Job Title` City State
## <chr> <chr> <chr>
## 1 Sr Data Scientist New York NY
## 2 Data Scientist Chantilly VA
## 3 Data Scientist Boston MA
## 4 Data Scientist Newton MA
## 5 Data Scientist New York NY
## 6 Data Scientist Santa Barbara CA
jobs_clean <- jobs_clean %>%
select(-index, -Competitors, -Headquarters) # remove columns you don't need
colnames(jobs_clean)
## [1] "Job Title" "Salary Estimate" "Job Description"
## [4] "Rating" "Company Name" "City"
## [7] "State" "Size" "Founded"
## [10] "Type of ownership" "Industry" "Sector"
## [13] "Revenue" "Salary_Clean" "Salary_Min"
## [16] "Salary_Max"
jobs_clean <- jobs_clean %>%
mutate(`Job Description` = str_squish(`Job Description`))
jobs_clean %>% select(`Job Description`) %>% head(1)
## # A tibble: 1 × 1
## `Job Description`
## <chr>
## 1 Description The Senior Data Scientist is responsible for defining, building, …
Now that we have cleaned the data, we can now do some analysis on the data. For example, we can see which states have the most job postings
state_counts <- jobs_clean %>%
filter(!is.na(State)) %>%
count(State, sort = TRUE)
head(state_counts, 10)
## # A tibble: 10 × 2
## State n
## <chr> <int>
## 1 CA 165
## 2 VA 89
## 3 MA 62
## 4 NY 52
## 5 MD 39
## 6 IL 30
## 7 DC 26
## 8 TX 17
## 9 WA 16
## 10 OH 14
top_states <- head(state_counts, 10)
p_states <- ggplot(top_states, aes(x = reorder(State, n), y = n)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(title = "Top 10 States with DS Job Postings",
x = "State",
y = "Number of Postings")
autoplot(p_states)
Taking a look at this analysis, we can see that California has a
significant amount of job listings for Data Science
Throughout this project, we were able to examine three untidy datasets and performed comprehensive data tidying, transformation to make the data tidy and performed some analysis on the dataset
By following the rules for having tidy data, each dataset was transformed from an untidy, wide structure, into a clear and consistent format that allows for deeper and stronger statistical analysis and visualizations