First Untidy Data

Getting data from local machine

Ran into some trouble importing data initially due to file not being a csv file but a xlxs file. Fixed that really quick by saving the file again with .csv notation

Wilson: Looking at the data below. We can see that the price column is in a string format, which might not be ideal if we were going to do any calculations with the price variable. Also, there’s duplicate like apple that shows the same price and we can remove one of them since those two rows should be one distinct observation.

fruit_data <- read.csv("https://raw.githubusercontent.com/wilsonvetdev/DATA607/main/fruits.csv")

fruit_data
##         item   price calories
## 1     banana    "$1"      105
## 2      apple "$0.75"       95
## 3      apple "$0.75"       95
## 4      peach    "$3"       55
## 5      peach    "$4"       55
## 6 clementine  "$2.5"       35

find and count duplicated rows

duplicates <- fruit_data %>% duplicated() %>% table()
print(duplicates)
## .
## FALSE  TRUE 
##     5     1

remove duplicated rows, keep only unique rows

fruit_data <- fruit_data %>% distinct()

fruit_data
##         item   price calories
## 1     banana    "$1"      105
## 2      apple "$0.75"       95
## 3      peach    "$3"       55
## 4      peach    "$4"       55
## 5 clementine  "$2.5"       35

look at the data types of the columns in the data frame: we can see that price has a data type of “chr”, which isn’t ideal if we wanted to do any math or visualization. We have to first take out the dollar sign character and then transform the column into a numeric column.

str(fruit_data)
## 'data.frame':    5 obs. of  3 variables:
##  $ item    : chr  "banana" "apple" "peach" "peach" ...
##  $ price   : chr  "\"$1\"" "\"$0.75\"" "\"$3\"" "\"$4\"" ...
##  $ calories: int  105 95 55 55 35
updated_fruit_data <- fruit_data  %>%
  mutate(price=gsub('^\\s*["]', '', gsub('[$",]', '', price)))

updated_fruit_data
##         item price calories
## 1     banana     1      105
## 2      apple  0.75       95
## 3      peach     3       55
## 4      peach     4       55
## 5 clementine   2.5       35

Last step before analysis is to convert the price column into a numeric column.

updated_fruit_data <- updated_fruit_data %>% mutate(price = as.numeric(price))

updated_fruit_data
##         item price calories
## 1     banana  1.00      105
## 2      apple  0.75       95
## 3      peach  3.00       55
## 4      peach  4.00       55
## 5 clementine  2.50       35

We could use this data to find out which fruit has the most calories and it’s also the most affordable. Here we see that on the dollar_per_cal column that the apple cost the least when judging by dollar per calorie.

updated_fruit_data <- updated_fruit_data %>% mutate(dollar_per_cal = price/calories)

updated_fruit_data %>%
  arrange(desc(dollar_per_cal))
##         item price calories dollar_per_cal
## 1      peach  4.00       55    0.072727273
## 2 clementine  2.50       35    0.071428571
## 3      peach  3.00       55    0.054545455
## 4     banana  1.00      105    0.009523810
## 5      apple  0.75       95    0.007894737

Second Untidy Data

Getting data from gist provided by Jhalak Das

student_data <- read.csv('https://gist.githubusercontent.com/Kimmirikwa/b69d0ea134820ea52f8481991ffae93e/raw/4db7b1698035ee29885d10e1a59bd902716ae168/student_results.csv')

student_data
##    id   name phone sex.and.age test.number term.1 term.2 term.3
## 1   1   Mike   134        m_12      test 1     76     84     87
## 2   2  Linda   270        f_13      test 1     88     90     73
## 3   3    Sam   210        m_11      test 1     78     74     80
## 4   4 Esther   617        f_12      test 1     68     75     74
## 5   5   Mary   114        f_14      test 1     65     67     64
## 6   1   Mike   134        m_12      test 2     85     80     90
## 7   2  Linda   270        f_13      test 2     87     82     94
## 8   3    Sam   210        m_11      test 2     80     87     80
## 9   4 Esther   617        f_12      test 2     70     75     78
## 10  5   Mary   114        f_14      test 2     68     70     63

Tasks requested by Jhalak Das

  • The three terms be in the same column called terms or semesters.
  • “sex and age” should be two distinct columns; ‘sex’ and ‘age’.
  • Instead of taking “test number” as variable, we should consider two columns with headers ‘test1’ and ‘test 2’ accordingly.
  • Finally, we should split the entire table into two for two types of observational units: STUDENTS and RESULTS. In the students table, we can keep id, name, phone, sex and age columns. On the results table, we can take the rest; id, terms, test1 and test2 columns. In this way we can join the two tables at any instance with id as primary key.

Separating the sex_and_age column with the separate function

student_data <- separate(student_data, sex.and.age, into = c('sex', 'age'), sep = '_')

student_data
##    id   name phone sex age test.number term.1 term.2 term.3
## 1   1   Mike   134   m  12      test 1     76     84     87
## 2   2  Linda   270   f  13      test 1     88     90     73
## 3   3    Sam   210   m  11      test 1     78     74     80
## 4   4 Esther   617   f  12      test 1     68     75     74
## 5   5   Mary   114   f  14      test 1     65     67     64
## 6   1   Mike   134   m  12      test 2     85     80     90
## 7   2  Linda   270   f  13      test 2     87     82     94
## 8   3    Sam   210   m  11      test 2     80     87     80
## 9   4 Esther   617   f  12      test 2     70     75     78
## 10  5   Mary   114   f  14      test 2     68     70     63

Wilson: I would argue that we don’t create two different columns named ‘test1’ and ‘test2’, because those are two distinct observations. They can belong on the same table as is or be split up into two separate tables. I also don’t think the table will come out looking good and it will probably result in empty cells between the two columns.

test_1 <- student_data %>% filter(test.number == 'test 1')
test_2 <- student_data %>% filter(test.number == 'test 2')

test_1
##   id   name phone sex age test.number term.1 term.2 term.3
## 1  1   Mike   134   m  12      test 1     76     84     87
## 2  2  Linda   270   f  13      test 1     88     90     73
## 3  3    Sam   210   m  11      test 1     78     74     80
## 4  4 Esther   617   f  12      test 1     68     75     74
## 5  5   Mary   114   f  14      test 1     65     67     64
test_2
##   id   name phone sex age test.number term.1 term.2 term.3
## 1  1   Mike   134   m  12      test 2     85     80     90
## 2  2  Linda   270   f  13      test 2     87     82     94
## 3  3    Sam   210   m  11      test 2     80     87     80
## 4  4 Esther   617   f  12      test 2     70     75     78
## 5  5   Mary   114   f  14      test 2     68     70     63

Wilson: I think a good analysis here is to find out if each student did better on test 1 or test 2. Let’s go with that!

Wilson: We can see that everyone with the exception of Linda(dropped one point) have improved a little bit in test 2.

tests_average_1 <- 
  test_1 %>%
  group_by(name) %>%
  summarise(mean_grade_1 = mean(term.1, term.2, term.3))

tests_average_2 <-
  test_2 %>%
  group_by(name) %>%
  summarise(mean_grade_2 = mean(term.1, term.2, term.3))

merge(tests_average_1, tests_average_2)
##     name mean_grade_1 mean_grade_2
## 1 Esther           68           70
## 2  Linda           88           87
## 3   Mary           65           68
## 4   Mike           76           85
## 5    Sam           78           80

Third Untidy Data

Analysis requested by Enid Roman:

Compare salaries by gender and years of experience. Compare salaries compare for the same role in different locations.

managers <- read.csv('https://raw.githubusercontent.com/wilsonvetdev/DATA607/main/managers.csv')

names(managers)
##  [1] "Timestamp"                                                                                                                                                                                                                           
##  [2] "How.old.are.you."                                                                                                                                                                                                                    
##  [3] "What.industry.do.you.work.in."                                                                                                                                                                                                       
##  [4] "Job.title"                                                                                                                                                                                                                           
##  [5] "If.your.job.title.needs.additional.context..please.clarify.here."                                                                                                                                                                    
##  [6] "What.is.your.annual.salary...You.ll.indicate.the.currency.in.a.later.question..If.you.are.part.time.or.hourly..please.enter.an.annualized.equivalent....what.you.would.earn.if.you.worked.the.job.40.hours.a.week..52.weeks.a.year.."
##  [7] "How.much.additional.monetary.compensation.do.you.get..if.any..for.example..bonuses.or.overtime.in.an.average.year...Please.only.include.monetary.compensation.here..not.the.value.of.benefits."                                      
##  [8] "Please.indicate.the.currency"                                                                                                                                                                                                        
##  [9] "If..Other...please.indicate.the.currency.here.."                                                                                                                                                                                     
## [10] "If.your.income.needs.additional.context..please.provide.it.here."                                                                                                                                                                    
## [11] "What.country.do.you.work.in."                                                                                                                                                                                                        
## [12] "If.you.re.in.the.U.S...what.state.do.you.work.in."                                                                                                                                                                                   
## [13] "What.city.do.you.work.in."                                                                                                                                                                                                           
## [14] "How.many.years.of.professional.work.experience.do.you.have.overall."                                                                                                                                                                 
## [15] "How.many.years.of.professional.work.experience.do.you.have.in.your.field."                                                                                                                                                           
## [16] "What.is.your.highest.level.of.education.completed."                                                                                                                                                                                  
## [17] "What.is.your.gender."                                                                                                                                                                                                                
## [18] "What.is.your.race...Choose.all.that.apply.."

Clean up the column names and only take what I think i will need.

  • The most important ones are probably job_title, years_of_experience, country, city, salary, and gender.
  • I’m going to only look at cities in the US for simplicity, so I will be filtering out the rest of the observations that aren’t US.
managers <- managers %>%
  transmute(
    age = How.old.are.you.,
    job_title = Job.title,
    currency = Please.indicate.the.currency,
    salary = What.is.your.annual.salary...You.ll.indicate.the.currency.in.a.later.question..If.you.are.part.time.or.hourly..please.enter.an.annualized.equivalent....what.you.would.earn.if.you.worked.the.job.40.hours.a.week..52.weeks.a.year..,
    industry = What.industry.do.you.work.in.,
    country = What.country.do.you.work.in.,
    city = What.city.do.you.work.in.,
    gender = What.is.your.gender.,
    years_of_experience = How.many.years.of.professional.work.experience.do.you.have.overall.,
  )

head(managers)
##     age                                job_title currency salary
## 1 25-34       Research and Instruction Librarian      USD 55,000
## 2 25-34 Change & Internal Communications Manager      GBP 54,600
## 3 25-34                     Marketing Specialist      USD 34,000
## 4 25-34                          Program Manager      USD 62,000
## 5 25-34                       Accounting Manager      USD 60,000
## 6 25-34           Scholarly Publishing Librarian      USD 62,000
##                        industry        country        city     gender
## 1  Education (Higher Education)  United States      Boston      Woman
## 2             Computing or Tech United Kingdom   Cambridge Non-binary
## 3 Accounting, Banking & Finance             US Chattanooga      Woman
## 4                    Nonprofits            USA   Milwaukee      Woman
## 5 Accounting, Banking & Finance             US  Greenville      Woman
## 6  Education (Higher Education)            USA     Hanover        Man
##   years_of_experience
## 1           5-7 years
## 2        8 - 10 years
## 3         2 - 4 years
## 4        8 - 10 years
## 5        8 - 10 years
## 6        8 - 10 years

I noticed there were job titles that didn’t have the word manager, so let’s filter out those observations.

** trimming leading and trailing spaces as well because I realized the filter wasn’t working as expected. “United Kingdom” cell wasn’t getting filtered out in the beginning so I suspect it was actually “United Kingdom”.

managers$job_title <- trimws(managers$job_title, which = c("both"))
managers$country <- trimws(managers$country, which = c("both"))

managers <- managers %>% filter(grepl("manager", job_title))

managers <- filter(managers, grepl("^[uU]", country))

managers <- filter(managers, !grepl("^[uU]", country) | !grepl("^[uU].?[kK].?", country)) %>% filter(!grepl("[my]$", country))

distinct(managers, country)
##                     country
## 1                        US
## 2             United States
## 3                       USA
## 4             United states
## 5  united states of america
## 6                        Us
## 7                       Usa
## 8                      U.S.
## 9                       usa
## 10             United State
## 11            United statew
## 12            united states
## 13                       us
## 14 United States of America

Compare salaries compare for the same role in different locations.

  • There’s over 300 distinct job titles, but I think looking at a few will be easier to tackle.
  • Let’s find out which roles have more responses compared to the rest.
  • Looking at the table below - I am personally interested in analyzing project manager since it has the most observations.
managers %>% group_by(job_title) %>% summarize(count = n()) %>% arrange(desc(count))
## # A tibble: 307 × 2
##    job_title           count
##    <chr>               <int>
##  1 Project manager        31
##  2 Program manager        23
##  3 Office manager         16
##  4 Product manager        11
##  5 Senior manager         11
##  6 Operations manager      9
##  7 Marketing manager       8
##  8 Account manager         7
##  9 Case manager            7
## 10 Engineering manager     7
## # … with 297 more rows

Tidying the salary column and coverting it to a numeric column.

project_managers <- managers %>% filter(job_title == "Project manager")

project_managers <- project_managers %>%
  mutate(salary=gsub('^\\s*["]', '', gsub('[$",]', '', salary)))

project_managers <- project_managers %>% mutate(salary = as.numeric(salary))

project_managers
##      age       job_title currency salary                             industry
## 1  35-44 Project manager      USD 125300                    Computing or Tech
## 2  55-64 Project manager      USD 115000         Engineering or Manufacturing
## 3  25-34 Project manager      USD  65000                 Hospitality & Events
## 4  25-34 Project manager      USD  67000         Engineering or Manufacturing
## 5  35-44 Project manager      USD  88007                           Geospatial
## 6  25-34 Project manager      USD 105000                    Computing or Tech
## 7  45-54 Project manager      USD 200000                    Computing or Tech
## 8  25-34 Project manager      USD  70000                           Nonprofits
## 9  45-54 Project manager      USD  82000                    Computing or Tech
## 10 25-34 Project manager      USD 100800         Engineering or Manufacturing
## 11 35-44 Project manager      USD  77000               Business or Consulting
## 12 45-54 Project manager      USD 178000        Accounting, Banking & Finance
## 13 25-34 Project manager      USD  97000                    Computing or Tech
## 14 45-54 Project manager      USD 100000                          Health care
## 15 25-34 Project manager      USD 118000         Engineering or Manufacturing
## 16 35-44 Project manager      USD  88000 Government and Public Administration
## 17 35-44 Project manager      USD  91000               Business or Consulting
## 18 35-44 Project manager      USD  80000                        Construction 
## 19 45-54 Project manager      USD 182000                    Computing or Tech
## 20 35-44 Project manager      USD  46000                           Nonprofits
## 21 35-44 Project manager      USD  60000               Transport or Logistics
## 22 25-34 Project manager      USD  70000          Marketing, Advertising & PR
## 23 25-34 Project manager      USD  71379                           Nonprofits
## 24 35-44 Project manager      USD 125000                    Computing or Tech
## 25 25-34 Project manager      USD  62000                      Pharmaceuticals
## 26 35-44 Project manager      USD  84000             Property or Construction
## 27 45-54 Project manager      USD 126000         Education (Higher Education)
## 28 25-34 Project manager      USD  58750                    Computing or Tech
## 29 25-34 Project manager      USD  58750                    Computing or Tech
## 30 25-34 Project manager      USD 170000         Engineering or Manufacturing
## 31 35-44 Project manager      USD 102000             Property or Construction
##          country            city                        gender
## 1           U.S.         Horsham                         Woman
## 2            USA      Hackensack                         Woman
## 3            Usa        Fairfax                          Woman
## 4            USA Charlottesville                         Woman
## 5  United States       Corvallis                           Man
## 6  United States          Dallas                         Woman
## 7  United States              SF Other or prefer not to answer
## 8  United States       New York                          Woman
## 9            USA     Valparaiso                          Woman
## 10          U.S.         Raleigh                         Woman
## 11            US      Wyomissing                         Woman
## 12 United States       Baltimore                              
## 13           USA    Philadelphia                         Woman
## 14           USA         Raleigh                         Woman
## 15           USA  NYC metro area                         Woman
## 16           USA        Portland                         Woman
## 17 United States  Field employee                         Woman
## 18           USA         Seattle                           Man
## 19           USA       Cupertino                           Man
## 20 United States          Denver                           Man
## 21 United states       Anderson                          Woman
## 22           USA         Chicago                              
## 23           USA       New York                          Woman
## 24 United states  Washington DC                          Woman
## 25 United states       Souderton                         Woman
## 26 United states      Carbondale                         Woman
## 27           USA        St Louis                         Woman
## 28           USA        Richmond                         Woman
## 29           Usa        Richmond                         Woman
## 30            US         Midland                         Woman
## 31            Us       St. Louis                         Woman
##    years_of_experience
## 1        11 - 20 years
## 2        21 - 30 years
## 3         8 - 10 years
## 4          2 - 4 years
## 5        11 - 20 years
## 6         8 - 10 years
## 7        21 - 30 years
## 8         8 - 10 years
## 9        11 - 20 years
## 10      1 year or less
## 11       11 - 20 years
## 12       31 - 40 years
## 13           5-7 years
## 14        8 - 10 years
## 15        8 - 10 years
## 16       11 - 20 years
## 17       11 - 20 years
## 18       11 - 20 years
## 19       21 - 30 years
## 20       11 - 20 years
## 21        8 - 10 years
## 22           5-7 years
## 23        8 - 10 years
## 24       11 - 20 years
## 25        8 - 10 years
## 26       11 - 20 years
## 27       21 - 30 years
## 28         2 - 4 years
## 29         2 - 4 years
## 30        8 - 10 years
## 31       11 - 20 years

This visualization is helpful, but the data has so little responses for project managers(even though it has the most responses compared to other roles) that makes it difficult to draw any conclusions. It took quite some time to tidy this data set in order for any type of analysis.

viz <- ggplot(data = project_managers, aes(x = city, y = salary)) +
  geom_point() +
  theme(axis.text.x = element_text(angle = 90, size = 8))

viz