INFO 201: Wrangling Practice

Author

Nitya Shankar

Overview

This Quarto document is an activity for INFO 201 to practice basic data wrangling tools in R with dplyr. There are four substantive exercises labeled below to try out. In each one, write some come to get the information requested from the data frame loaded below. Once you’re done, render the document as an html file and upload for the daily review.

In each exercise, you will need to use multiple data wrangling functions – string them together with the pipe symbol (|>). Put your own code in the code chunks where the … is (but note that you will typically use multiple lines of code). The main data wrangling functions to use here are: filter, summarize, mutate, group_by, and arrange.

Set Up Environment

For this activity, we’ll be using dplyr tools. dplyr is part of the tidyverse, so you could use either of the lines below.

#library(dplyr)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.1     ✔ stringr   1.5.2
✔ ggplot2   4.0.0     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Company Expense Data

For this analysis, you’ll look at data about all five large technology companies in an imaginary city. Specifically, you are looking at information about the monthly personnel-related expenses at each company last month. Your will be asked to report on a number of personnel related issues to a regional labor organization using this data set.

expenses_df <- read.csv("expenses.csv")
expenses_df 
             Company  Department NumEmployees  Expenses
1    Aurelia Systems    Research          198 1132842.3
2    Aurelia Systems       Sales          224 1087892.6
3    Aurelia Systems     Support          340  950695.4
4    Aurelia Systems          HR          369 1052446.7
5    Aurelia Systems Engineering          161 1149223.8
6   Bluefin Robotics Engineering          469 4290294.0
7   Bluefin Robotics          HR          323  859012.8
8   Bluefin Robotics    Research           58  495394.7
9   Bluefin Robotics     Finance          484 2348886.5
10   Cloudspire Labs    Research          329 1985131.6
11   Cloudspire Labs       Sales          140  525017.4
12   Cloudspire Labs Engineering          338 2471426.2
13   Cloudspire Labs     Finance          176  568335.4
14   Cloudspire Labs          HR          355 1065796.8
15 Nimbus Industries    Research          464 4699627.5
16 Nimbus Industries     Support           44  223644.4
17 Nimbus Industries          HR          215  795257.8
18       Solara Tech          HR          200  633002.6
19       Solara Tech    Research          411 3002900.1
20       Solara Tech Engineering          313 2809942.7
21       Solara Tech     Finance          280 1369971.1

The features in this data represent:

  • Company: The name of the company
  • Department: The name of the department for which records were recorded. All departments in each company are listed. Not all companies have the same number of distinct departments.
  • NumEmployees: The total number of employees working last month in a department at the company.
  • Expenses: The total amount of personnel-related expenses (mostly salary) accrued last month in each department. Shown in US dollars.

Exercise 0: Example Code

The following code chunk uses five data wrangling functions connected with pipe operators. It takes our original dataframe and produces a dataframe that would let us rank that total expenses for each of the large companies in town that have more than 1000 employees. Looks like Aurelia Systems has the smallest expenses of the four large companies.

To check your understanding, delete the last line and the final pipe operator and hit run. Then try this again for each subsequent line to double check what each step is doing! Afterward, use CTRL+Z or Edit->Undo to get the code back. Nothing else to do for this exercise.

expenses_df |>
  mutate(ExpensesMillions = Expenses / 1000000) |>          # Create a new columns with expenses in terms of millions of dollars
  group_by(Company) |>                                      # Split the data up by company (each company has multiple rows)
  summarize(TotalExpensesMillions = sum(ExpensesMillions),  # Calculate summary statistics ('sum' in this case), do it for each company because of the group_by()
            TotalEmployees = sum(NumEmployees)) |>
  filter(TotalEmployees > 1000) |>                          # Look only at companies that have more than 1000 employees
  arrange(desc(TotalExpensesMillions))                      # Order data from high to low in expenses
# A tibble: 4 × 3
  Company          TotalExpensesMillions TotalEmployees
  <chr>                            <dbl>          <int>
1 Bluefin Robotics                  7.99           1334
2 Solara Tech                       7.82           1204
3 Cloudspire Labs                   6.62           1338
4 Aurelia Systems                   5.37           1292

Exercise 1: Rank Companies by Employees in Engineering

Use dplyr wrangling tools in the code chunk below to output a dataframe that shows you the requested information.

For all recorded companies with engineering departments, rank them from the largest to the smallest in terms of the number of employees in their engineering departments. This will probably take two data wrangling steps.

ranked_by_n_eng <- expenses_df |>
  filter(Department == "Engineering") |>
  arrange(desc(NumEmployees))

ranked_by_n_eng
           Company  Department NumEmployees Expenses
1 Bluefin Robotics Engineering          469  4290294
2  Cloudspire Labs Engineering          338  2471426
3      Solara Tech Engineering          313  2809943
4  Aurelia Systems Engineering          161  1149224

Exercise 2: Rank Companies by Total Number of Employees

Use dplyr wrangling tools in the code chunk below to output a dataframe that shows you the requested information.

For the five companies in the city, rank them from the largest to the smallest in terms of the total number of employees at the company. This should take three data wrangling steps.

ranked_by_n_emp <- expenses_df |>
  group_by(Company) |> # consider different companies
  summarize(TotalEmployees = sum(NumEmployees)) |> # calculate total employees
  arrange(TotalEmployees)# rank by num employees

ranked_by_n_emp
# A tibble: 5 × 2
  Company           TotalEmployees
  <chr>                      <int>
1 Nimbus Industries            723
2 Solara Tech                 1204
3 Aurelia Systems             1292
4 Bluefin Robotics            1334
5 Cloudspire Labs             1338

Exercise 3: Rank Departments by Average Per-Employee Expenses

Use dplyr wrangling tools in the code chunk below to output a dataframe that shows you the requested information.

Rank types of departments recorded at the companies in terms of the mean amount of expenses per employee in that kind of department. Additionally, keep track of the number of companies in your study that have each type of department. The output should have three columns (Department, AveragePerEmpExpenses, and NumCompanies) with six rows (Engineering, Finance, HR, Research, Salse, Support). This should take around four data wrangling steps.

ranked_by_peremp <- expenses_df |>
  group_by(Department) |> # group things by dept type
  mutate(PerEmpExpenses = Expenses / NumEmployees) |> # find expenses per employee
  summarize(AveragePerEmpExpenses = mean(PerEmpExpenses),# take the average of expenses per employee
            NumCompanies = n()) |>
  arrange(AveragePerEmpExpenses)

ranked_by_peremp
# A tibble: 6 × 3
  Department  AveragePerEmpExpenses NumCompanies
  <chr>                       <dbl>        <int>
1 HR                          3076.            5
2 Support                     3939.            2
3 Sales                       4303.            2
4 Finance                     4325.            3
5 Research                    7546.            5
6 Engineering                 8144.            4

Exercise 4: Rank Departments by Average Per-Employee Expenses

Use dplyr wrangling tools in the code chunk below to output a dataframe that shows you the requested information.

How much greater are the overall per-employee expenses at the company with the highest per-employee expenses compared to the company with the lowest per-employee expenses? Output a data frame with a single column for “ExpenseRange” that has a single row. The final value should be $3750.7. This should require between three and five data wrangling steps depending on how you write the code.

departments_ranked <- expenses_df |>
  group_by(Company) |>
  summarize(TotalExpenses = sum(Expenses), TotalEmployees = sum(NumEmployees)) |>
  mutate(PerEmpExpenses = TotalExpenses / TotalEmployees) |>
  summarize(highest_exp = max(PerEmpExpenses), lowest_exp = min(PerEmpExpenses)) |>
  mutate(range_exp = highest_exp - lowest_exp)

departments_ranked
# A tibble: 1 × 3
  highest_exp lowest_exp range_exp
        <dbl>      <dbl>     <dbl>
1       7909.      4159.     3751.
departments_ranked <- expenses_df |>
  group_by(Company) |>
  summarize(TotalExpenses = sum(Expenses), TotalEmployees = sum(NumEmployees)) |>
  mutate(PerEmpExpenses = TotalExpenses / TotalEmployees) |>
  summarize(highest_exp = max(PerEmpExpenses), 
            lowest_exp = min(PerEmpExpenses),
            range_exp = max(PerEmpExpenses) - min(PerEmpExpenses))

departments_ranked
# A tibble: 1 × 3
  highest_exp lowest_exp range_exp
        <dbl>      <dbl>     <dbl>
1       7909.      4159.     3751.

All Done

When you’re all done, render this document and submit it for the daily review activity on Canvas!