Ran into some trouble importing data initially due to file not being a csv file but a xlxs file. Fixed that really quick by saving the file again with .csv notation
Wilson: Looking at the data below. We can see that the price column is in a string format, which might not be ideal if we were going to do any calculations with the price variable. Also, there’s duplicate like apple that shows the same price and we can remove one of them since those two rows should be one distinct observation.
fruit_data <- read.csv("https://raw.githubusercontent.com/wilsonvetdev/DATA607/main/fruits.csv")
fruit_data
## item price calories
## 1 banana "$1" 105
## 2 apple "$0.75" 95
## 3 apple "$0.75" 95
## 4 peach "$3" 55
## 5 peach "$4" 55
## 6 clementine "$2.5" 35
duplicates <- fruit_data %>% duplicated() %>% table()
print(duplicates)
## .
## FALSE TRUE
## 5 1
fruit_data <- fruit_data %>% distinct()
fruit_data
## item price calories
## 1 banana "$1" 105
## 2 apple "$0.75" 95
## 3 peach "$3" 55
## 4 peach "$4" 55
## 5 clementine "$2.5" 35
str(fruit_data)
## 'data.frame': 5 obs. of 3 variables:
## $ item : chr "banana" "apple" "peach" "peach" ...
## $ price : chr "\"$1\"" "\"$0.75\"" "\"$3\"" "\"$4\"" ...
## $ calories: int 105 95 55 55 35
updated_fruit_data <- fruit_data %>%
mutate(price=gsub('^\\s*["]', '', gsub('[$",]', '', price)))
updated_fruit_data
## item price calories
## 1 banana 1 105
## 2 apple 0.75 95
## 3 peach 3 55
## 4 peach 4 55
## 5 clementine 2.5 35
updated_fruit_data <- updated_fruit_data %>% mutate(price = as.numeric(price))
updated_fruit_data
## item price calories
## 1 banana 1.00 105
## 2 apple 0.75 95
## 3 peach 3.00 55
## 4 peach 4.00 55
## 5 clementine 2.50 35
updated_fruit_data <- updated_fruit_data %>% mutate(dollar_per_cal = price/calories)
updated_fruit_data %>%
arrange(desc(dollar_per_cal))
## item price calories dollar_per_cal
## 1 peach 4.00 55 0.072727273
## 2 clementine 2.50 35 0.071428571
## 3 peach 3.00 55 0.054545455
## 4 banana 1.00 105 0.009523810
## 5 apple 0.75 95 0.007894737
student_data <- read.csv('https://gist.githubusercontent.com/Kimmirikwa/b69d0ea134820ea52f8481991ffae93e/raw/4db7b1698035ee29885d10e1a59bd902716ae168/student_results.csv')
student_data
## id name phone sex.and.age test.number term.1 term.2 term.3
## 1 1 Mike 134 m_12 test 1 76 84 87
## 2 2 Linda 270 f_13 test 1 88 90 73
## 3 3 Sam 210 m_11 test 1 78 74 80
## 4 4 Esther 617 f_12 test 1 68 75 74
## 5 5 Mary 114 f_14 test 1 65 67 64
## 6 1 Mike 134 m_12 test 2 85 80 90
## 7 2 Linda 270 f_13 test 2 87 82 94
## 8 3 Sam 210 m_11 test 2 80 87 80
## 9 4 Esther 617 f_12 test 2 70 75 78
## 10 5 Mary 114 f_14 test 2 68 70 63
student_data <- separate(student_data, sex.and.age, into = c('sex', 'age'), sep = '_')
student_data
## id name phone sex age test.number term.1 term.2 term.3
## 1 1 Mike 134 m 12 test 1 76 84 87
## 2 2 Linda 270 f 13 test 1 88 90 73
## 3 3 Sam 210 m 11 test 1 78 74 80
## 4 4 Esther 617 f 12 test 1 68 75 74
## 5 5 Mary 114 f 14 test 1 65 67 64
## 6 1 Mike 134 m 12 test 2 85 80 90
## 7 2 Linda 270 f 13 test 2 87 82 94
## 8 3 Sam 210 m 11 test 2 80 87 80
## 9 4 Esther 617 f 12 test 2 70 75 78
## 10 5 Mary 114 f 14 test 2 68 70 63
Wilson: I would argue that we don’t create two different columns named ‘test1’ and ‘test2’, because those are two distinct observations. They can belong on the same table as is or be split up into two separate tables. I also don’t think the table will come out looking good and it will probably result in empty cells between the two columns.
test_1 <- student_data %>% filter(test.number == 'test 1')
test_2 <- student_data %>% filter(test.number == 'test 2')
test_1
## id name phone sex age test.number term.1 term.2 term.3
## 1 1 Mike 134 m 12 test 1 76 84 87
## 2 2 Linda 270 f 13 test 1 88 90 73
## 3 3 Sam 210 m 11 test 1 78 74 80
## 4 4 Esther 617 f 12 test 1 68 75 74
## 5 5 Mary 114 f 14 test 1 65 67 64
test_2
## id name phone sex age test.number term.1 term.2 term.3
## 1 1 Mike 134 m 12 test 2 85 80 90
## 2 2 Linda 270 f 13 test 2 87 82 94
## 3 3 Sam 210 m 11 test 2 80 87 80
## 4 4 Esther 617 f 12 test 2 70 75 78
## 5 5 Mary 114 f 14 test 2 68 70 63
Wilson: I think a good analysis here is to find out if each student did better on test 1 or test 2. Let’s go with that!
Wilson: We can see that everyone with the exception of Linda(dropped one point) have improved a little bit in test 2.
tests_average_1 <-
test_1 %>%
group_by(name) %>%
summarise(mean_grade_1 = mean(term.1, term.2, term.3))
tests_average_2 <-
test_2 %>%
group_by(name) %>%
summarise(mean_grade_2 = mean(term.1, term.2, term.3))
merge(tests_average_1, tests_average_2)
## name mean_grade_1 mean_grade_2
## 1 Esther 68 70
## 2 Linda 88 87
## 3 Mary 65 68
## 4 Mike 76 85
## 5 Sam 78 80
Compare salaries by gender and years of experience. Compare salaries compare for the same role in different locations.
managers <- read.csv('https://raw.githubusercontent.com/wilsonvetdev/DATA607/main/managers.csv')
names(managers)
## [1] "Timestamp"
## [2] "How.old.are.you."
## [3] "What.industry.do.you.work.in."
## [4] "Job.title"
## [5] "If.your.job.title.needs.additional.context..please.clarify.here."
## [6] "What.is.your.annual.salary...You.ll.indicate.the.currency.in.a.later.question..If.you.are.part.time.or.hourly..please.enter.an.annualized.equivalent....what.you.would.earn.if.you.worked.the.job.40.hours.a.week..52.weeks.a.year.."
## [7] "How.much.additional.monetary.compensation.do.you.get..if.any..for.example..bonuses.or.overtime.in.an.average.year...Please.only.include.monetary.compensation.here..not.the.value.of.benefits."
## [8] "Please.indicate.the.currency"
## [9] "If..Other...please.indicate.the.currency.here.."
## [10] "If.your.income.needs.additional.context..please.provide.it.here."
## [11] "What.country.do.you.work.in."
## [12] "If.you.re.in.the.U.S...what.state.do.you.work.in."
## [13] "What.city.do.you.work.in."
## [14] "How.many.years.of.professional.work.experience.do.you.have.overall."
## [15] "How.many.years.of.professional.work.experience.do.you.have.in.your.field."
## [16] "What.is.your.highest.level.of.education.completed."
## [17] "What.is.your.gender."
## [18] "What.is.your.race...Choose.all.that.apply.."
managers <- managers %>%
transmute(
age = How.old.are.you.,
job_title = Job.title,
currency = Please.indicate.the.currency,
salary = What.is.your.annual.salary...You.ll.indicate.the.currency.in.a.later.question..If.you.are.part.time.or.hourly..please.enter.an.annualized.equivalent....what.you.would.earn.if.you.worked.the.job.40.hours.a.week..52.weeks.a.year..,
industry = What.industry.do.you.work.in.,
country = What.country.do.you.work.in.,
city = What.city.do.you.work.in.,
gender = What.is.your.gender.,
years_of_experience = How.many.years.of.professional.work.experience.do.you.have.overall.,
)
head(managers)
## age job_title currency salary
## 1 25-34 Research and Instruction Librarian USD 55,000
## 2 25-34 Change & Internal Communications Manager GBP 54,600
## 3 25-34 Marketing Specialist USD 34,000
## 4 25-34 Program Manager USD 62,000
## 5 25-34 Accounting Manager USD 60,000
## 6 25-34 Scholarly Publishing Librarian USD 62,000
## industry country city gender
## 1 Education (Higher Education) United States Boston Woman
## 2 Computing or Tech United Kingdom Cambridge Non-binary
## 3 Accounting, Banking & Finance US Chattanooga Woman
## 4 Nonprofits USA Milwaukee Woman
## 5 Accounting, Banking & Finance US Greenville Woman
## 6 Education (Higher Education) USA Hanover Man
## years_of_experience
## 1 5-7 years
## 2 8 - 10 years
## 3 2 - 4 years
## 4 8 - 10 years
## 5 8 - 10 years
## 6 8 - 10 years
** trimming leading and trailing spaces as well because I realized the filter wasn’t working as expected. “United Kingdom” cell wasn’t getting filtered out in the beginning so I suspect it was actually “United Kingdom”.
managers$job_title <- trimws(managers$job_title, which = c("both"))
managers$country <- trimws(managers$country, which = c("both"))
managers <- managers %>% filter(grepl("manager", job_title))
managers <- filter(managers, grepl("^[uU]", country))
managers <- filter(managers, !grepl("^[uU]", country) | !grepl("^[uU].?[kK].?", country)) %>% filter(!grepl("[my]$", country))
distinct(managers, country)
## country
## 1 US
## 2 United States
## 3 USA
## 4 United states
## 5 united states of america
## 6 Us
## 7 Usa
## 8 U.S.
## 9 usa
## 10 United State
## 11 United statew
## 12 united states
## 13 us
## 14 United States of America
managers %>% group_by(job_title) %>% summarize(count = n()) %>% arrange(desc(count))
## # A tibble: 307 × 2
## job_title count
## <chr> <int>
## 1 Project manager 31
## 2 Program manager 23
## 3 Office manager 16
## 4 Product manager 11
## 5 Senior manager 11
## 6 Operations manager 9
## 7 Marketing manager 8
## 8 Account manager 7
## 9 Case manager 7
## 10 Engineering manager 7
## # … with 297 more rows
project_managers <- managers %>% filter(job_title == "Project manager")
project_managers <- project_managers %>%
mutate(salary=gsub('^\\s*["]', '', gsub('[$",]', '', salary)))
project_managers <- project_managers %>% mutate(salary = as.numeric(salary))
project_managers
## age job_title currency salary industry
## 1 35-44 Project manager USD 125300 Computing or Tech
## 2 55-64 Project manager USD 115000 Engineering or Manufacturing
## 3 25-34 Project manager USD 65000 Hospitality & Events
## 4 25-34 Project manager USD 67000 Engineering or Manufacturing
## 5 35-44 Project manager USD 88007 Geospatial
## 6 25-34 Project manager USD 105000 Computing or Tech
## 7 45-54 Project manager USD 200000 Computing or Tech
## 8 25-34 Project manager USD 70000 Nonprofits
## 9 45-54 Project manager USD 82000 Computing or Tech
## 10 25-34 Project manager USD 100800 Engineering or Manufacturing
## 11 35-44 Project manager USD 77000 Business or Consulting
## 12 45-54 Project manager USD 178000 Accounting, Banking & Finance
## 13 25-34 Project manager USD 97000 Computing or Tech
## 14 45-54 Project manager USD 100000 Health care
## 15 25-34 Project manager USD 118000 Engineering or Manufacturing
## 16 35-44 Project manager USD 88000 Government and Public Administration
## 17 35-44 Project manager USD 91000 Business or Consulting
## 18 35-44 Project manager USD 80000 Construction
## 19 45-54 Project manager USD 182000 Computing or Tech
## 20 35-44 Project manager USD 46000 Nonprofits
## 21 35-44 Project manager USD 60000 Transport or Logistics
## 22 25-34 Project manager USD 70000 Marketing, Advertising & PR
## 23 25-34 Project manager USD 71379 Nonprofits
## 24 35-44 Project manager USD 125000 Computing or Tech
## 25 25-34 Project manager USD 62000 Pharmaceuticals
## 26 35-44 Project manager USD 84000 Property or Construction
## 27 45-54 Project manager USD 126000 Education (Higher Education)
## 28 25-34 Project manager USD 58750 Computing or Tech
## 29 25-34 Project manager USD 58750 Computing or Tech
## 30 25-34 Project manager USD 170000 Engineering or Manufacturing
## 31 35-44 Project manager USD 102000 Property or Construction
## country city gender
## 1 U.S. Horsham Woman
## 2 USA Hackensack Woman
## 3 Usa Fairfax Woman
## 4 USA Charlottesville Woman
## 5 United States Corvallis Man
## 6 United States Dallas Woman
## 7 United States SF Other or prefer not to answer
## 8 United States New York Woman
## 9 USA Valparaiso Woman
## 10 U.S. Raleigh Woman
## 11 US Wyomissing Woman
## 12 United States Baltimore
## 13 USA Philadelphia Woman
## 14 USA Raleigh Woman
## 15 USA NYC metro area Woman
## 16 USA Portland Woman
## 17 United States Field employee Woman
## 18 USA Seattle Man
## 19 USA Cupertino Man
## 20 United States Denver Man
## 21 United states Anderson Woman
## 22 USA Chicago
## 23 USA New York Woman
## 24 United states Washington DC Woman
## 25 United states Souderton Woman
## 26 United states Carbondale Woman
## 27 USA St Louis Woman
## 28 USA Richmond Woman
## 29 Usa Richmond Woman
## 30 US Midland Woman
## 31 Us St. Louis Woman
## years_of_experience
## 1 11 - 20 years
## 2 21 - 30 years
## 3 8 - 10 years
## 4 2 - 4 years
## 5 11 - 20 years
## 6 8 - 10 years
## 7 21 - 30 years
## 8 8 - 10 years
## 9 11 - 20 years
## 10 1 year or less
## 11 11 - 20 years
## 12 31 - 40 years
## 13 5-7 years
## 14 8 - 10 years
## 15 8 - 10 years
## 16 11 - 20 years
## 17 11 - 20 years
## 18 11 - 20 years
## 19 21 - 30 years
## 20 11 - 20 years
## 21 8 - 10 years
## 22 5-7 years
## 23 8 - 10 years
## 24 11 - 20 years
## 25 8 - 10 years
## 26 11 - 20 years
## 27 21 - 30 years
## 28 2 - 4 years
## 29 2 - 4 years
## 30 8 - 10 years
## 31 11 - 20 years
viz <- ggplot(data = project_managers, aes(x = city, y = salary)) +
geom_point() +
theme(axis.text.x = element_text(angle = 90, size = 8))
viz