This Quarto document is an activity for INFO 201 to practice basic data wrangling tools in R with dplyr. There are four substantive exercises labeled below to try out. In each one, write some come to get the information requested from the data frame loaded below. Once you’re done, render the document as an html file and upload for the daily review.
In each exercise, you will need to use multiple data wrangling functions – string them together with the pipe symbol (|>). Put your own code in the code chunks where the … is (but note that you will typically use multiple lines of code). The main data wrangling functions to use here are: filter, summarize, mutate, group_by, and arrange.
Set Up Environment
For this activity, we’ll be using dplyr tools. dplyr is part of the tidyverse, so you could use either of the lines below.
#library(dplyr)library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.1 ✔ stringr 1.5.2
✔ ggplot2 4.0.0 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Company Expense Data
For this analysis, you’ll look at data about all five large technology companies in an imaginary city. Specifically, you are looking at information about the monthly personnel-related expenses at each company last month. Your will be asked to report on a number of personnel related issues to a regional labor organization using this data set.
expenses_df <-read.csv("expenses.csv")expenses_df
Company Department NumEmployees Expenses
1 Aurelia Systems Research 198 1132842.3
2 Aurelia Systems Sales 224 1087892.6
3 Aurelia Systems Support 340 950695.4
4 Aurelia Systems HR 369 1052446.7
5 Aurelia Systems Engineering 161 1149223.8
6 Bluefin Robotics Engineering 469 4290294.0
7 Bluefin Robotics HR 323 859012.8
8 Bluefin Robotics Research 58 495394.7
9 Bluefin Robotics Finance 484 2348886.5
10 Cloudspire Labs Research 329 1985131.6
11 Cloudspire Labs Sales 140 525017.4
12 Cloudspire Labs Engineering 338 2471426.2
13 Cloudspire Labs Finance 176 568335.4
14 Cloudspire Labs HR 355 1065796.8
15 Nimbus Industries Research 464 4699627.5
16 Nimbus Industries Support 44 223644.4
17 Nimbus Industries HR 215 795257.8
18 Solara Tech HR 200 633002.6
19 Solara Tech Research 411 3002900.1
20 Solara Tech Engineering 313 2809942.7
21 Solara Tech Finance 280 1369971.1
The features in this data represent:
Company: The name of the company
Department: The name of the department for which records were recorded. All departments in each company are listed. Not all companies have the same number of distinct departments.
NumEmployees: The total number of employees working last month in a department at the company.
Expenses: The total amount of personnel-related expenses (mostly salary) accrued last month in each department. Shown in US dollars.
Exercise 0: Example Code
The following code chunk uses five data wrangling functions connected with pipe operators. It takes our original dataframe and produces a dataframe that would let us rank that total expenses for each of the large companies in town that have more than 1000 employees. Looks like Aurelia Systems has the smallest expenses of the four large companies.
To check your understanding, delete the last line and the final pipe operator and hit run. Then try this again for each subsequent line to double check what each step is doing! Afterward, use CTRL+Z or Edit->Undo to get the code back. Nothing else to do for this exercise.
expenses_df |>mutate(ExpensesMillions = Expenses /1000000) |># Create a new columns with expenses in terms of millions of dollarsgroup_by(Company) |># Split the data up by company (each company has multiple rows)summarize(TotalExpensesMillions =sum(ExpensesMillions), # Calculate summary statistics ('sum' in this case), do it for each company because of the group_by()TotalEmployees =sum(NumEmployees)) |>filter(TotalEmployees >1000) |># Look only at companies that have more than 1000 employeesarrange(desc(TotalExpensesMillions)) # Order data from high to low in expenses
Exercise 1: Rank Companies by Employees in Engineering
Use dplyr wrangling tools in the code chunk below to output a dataframe that shows you the requested information.
For all recorded companies with engineering departments, rank them from the largest to the smallest in terms of the number of employees in their engineering departments. This will probably take two data wrangling steps.
Company Department NumEmployees Expenses
1 Bluefin Robotics Engineering 469 4290294
2 Cloudspire Labs Engineering 338 2471426
3 Solara Tech Engineering 313 2809943
4 Aurelia Systems Engineering 161 1149224
Exercise 2: Rank Companies by Total Number of Employees
Use dplyr wrangling tools in the code chunk below to output a dataframe that shows you the requested information.
For the five companies in the city, rank them from the largest to the smallest in terms of the total number of employees at the company. This should take three data wrangling steps.
ranked_by_n_emp <- expenses_df |>group_by(Company) |># consider different companiessummarize(TotalEmployees =sum(NumEmployees)) |># calculate total employeesarrange(TotalEmployees)# rank by num employeesranked_by_n_emp
# A tibble: 5 × 2
Company TotalEmployees
<chr> <int>
1 Nimbus Industries 723
2 Solara Tech 1204
3 Aurelia Systems 1292
4 Bluefin Robotics 1334
5 Cloudspire Labs 1338
Exercise 3: Rank Departments by Average Per-Employee Expenses
Use dplyr wrangling tools in the code chunk below to output a dataframe that shows you the requested information.
Rank types of departments recorded at the companies in terms of the mean amount of expenses per employee in that kind of department. Additionally, keep track of the number of companies in your study that have each type of department. The output should have three columns (Department, AveragePerEmpExpenses, and NumCompanies) with six rows (Engineering, Finance, HR, Research, Salse, Support). This should take around four data wrangling steps.
ranked_by_peremp <- expenses_df |>group_by(Department) |># group things by dept typemutate(PerEmpExpenses = Expenses / NumEmployees) |># find expenses per employeesummarize(AveragePerEmpExpenses =mean(PerEmpExpenses),# take the average of expenses per employeeNumCompanies =n()) |>arrange(AveragePerEmpExpenses)ranked_by_peremp
# A tibble: 6 × 3
Department AveragePerEmpExpenses NumCompanies
<chr> <dbl> <int>
1 HR 3076. 5
2 Support 3939. 2
3 Sales 4303. 2
4 Finance 4325. 3
5 Research 7546. 5
6 Engineering 8144. 4
Exercise 4: Rank Departments by Average Per-Employee Expenses
Use dplyr wrangling tools in the code chunk below to output a dataframe that shows you the requested information.
How much greater are the overall per-employee expenses at the company with the highest per-employee expenses compared to the company with the lowest per-employee expenses? Output a data frame with a single column for “ExpenseRange” that has a single row. The final value should be $3750.7. This should require between three and five data wrangling steps depending on how you write the code.