Due Monday 9/11/17 before the start of class.

You are to work alone on this assignment. (We will discuss.)

Universities are typically opaque, bureaucratic institutions. Many public schools such as the University of North Carolina system make data about university employees publically available. Using this dataset what can we learn about the dynamics of the university system?

dplyr

To download the RMarkdown file for this assignment, right-click on .Rmd file for this assignment, and select open link in new tab. Then proceed to download by saving the file to your Working Directory

The best way to to learn the dplyr package is to just use it a lot. Read through the following dplyr tutorial the complete questions in this .Rmd document. I suggest doing the coding in an R script then putting the answers into this documents.

Instructions

Q: Return the sum of the first 10 positive integers?

Q: What is the sum of the first 10 positive integers?

sum(1:10)
## [1] 55

Create a variable x with the sum of the first 10 positive integers.

x <- sum(1:10)

The data

First download the data set from Ryan Thornburg’s website. You may want to set cache=TRUE for this chunk so you don’t have to download the data set every time you run the script.

library(tidyverse)

# Load the UNC departments data
data <- read_csv(url("http://ryanthornburg.com/wp-content/uploads/2015/05/UNC_Salares_NandO_2015-05-06.csv"))

Take a first look at the data

colnames(data)
##  [1] "name"     "campus"   "dept"     "position" "exempt2"  "employed"
##  [7] "hiredate" "fte"      "status"   "stservyr" "statesal" "nonstsal"
## [13] "totalsal" "age"
str(data)
## Classes 'tbl_df', 'tbl' and 'data.frame':    12287 obs. of  14 variables:
##  $ name    : chr  "AARON, NANCY G" "ABARBANELL, JEFFERY S" "ABARE, BETSY" "ABATE, AARON B" ...
##  $ campus  : chr  "UNC-CH" "UNC-CH" "UNC-CH" "UNC-CH" ...
##  $ dept    : chr  "Romance Languages" "Kenan-Flagler Business School" "Institute of Marine Sciences" "Medicine Administration" ...
##  $ position: chr  "Senior Lecturer" "Associate Professor" "Research Technician" "Accounting Technician" ...
##  $ exempt2 : chr  "Exempt" "Exempt" "Subject to State Personnel Act" "Subject to State Personnel Act" ...
##  $ employed: int  9 9 12 12 12 12 12 12 12 12 ...
##  $ hiredate: int  20030701 19990101 20110912 20090420 20120103 20051003 19960923 20130401 19870101 20120702 ...
##  $ fte     : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ status  : chr  "Fixed-Term" "Continuing" "Permanent" "Permanent" ...
##  $ stservyr: int  11 17 3 5 2 15 34 11 27 2 ...
##  $ statesal: int  46350 173000 0 0 41696 56588 41707 0 0 0 ...
##  $ nonstsal: int  0 0 38170 50070 0 4412 0 80227 55803 32889 ...
##  $ totalsal: int  46350 173000 38170 50070 41696 61000 41707 80227 55803 32889 ...
##  $ age     : int  55 57 54 29 35 41 62 36 64 26 ...
##  - attr(*, "spec")=List of 2
##   ..$ cols   :List of 14
##   .. ..$ name    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ campus  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ dept    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ position: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ exempt2 : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ employed: list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ hiredate: list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ fte     : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ status  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ stservyr: list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ statesal: list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ nonstsal: list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ totalsal: list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ age     : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   ..$ default: list()
##   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
##   ..- attr(*, "class")= chr "col_spec"
head(data)
## # A tibble: 6 x 14
##                        name campus                          dept
##                       <chr>  <chr>                         <chr>
## 1            AARON, NANCY G UNC-CH             Romance Languages
## 2     ABARBANELL, JEFFERY S UNC-CH Kenan-Flagler Business School
## 3              ABARE, BETSY UNC-CH  Institute of Marine Sciences
## 4            ABATE, AARON B UNC-CH       Medicine Administration
## 5        ABATEMARCO, JODI M UNC-CH           School of Education
## 6 ABBOTT-LUNSFORD, SHELBY L UNC-CH       Medicine Administration
## # ... with 11 more variables: position <chr>, exempt2 <chr>,
## #   employed <int>, hiredate <int>, fte <dbl>, status <chr>,
## #   stservyr <int>, statesal <int>, nonstsal <int>, totalsal <int>,
## #   age <int>

Assignment

Question 1

Return a data frame with columns: name, dept, age,totalsal

#

Return a data frame with columns: statesal, nonstsal,totalsal using one of the select_helpers functions. If needed, google ‘select_helpers in R’.

#

Question 2

Rename the fte column to fulltime. Make sure this change is saved (i.e. data <- ...).

#

Question 3

What is the mean salary in the Neurosurgery department?

#

Return a data frame with employee’s in the Neurosurgery department making more than $500,000. Why might these professors be so well paid?

#

Question 4

What is the total amount that full time Dermatology employees get paid?

#

Question 5

Create a data frame called radio_dept whose rows are the employees from the Radiology department.

  • include only the following columns: name, position, age, nonstsal, totalsal.
  • order the employees by salary

First without pipes

#

Next with pipes

#

Make a histogram of Radiology salaries

#

Question 6

Create a data frame called dept_summary whose rows are the departments and whose columns are: department size, mean department salary, median department salary, and maximum salary (using totalsal for salary).

#

Order the departments by highest mean salary and print the 10 highest paid departments.

#

Order the departments by highest median salary and print the 10 highest paid departments.

#

Why do these lists differ? If you were asked for the top 10 best paid departments at UNC which summary would you choose and why?

Question 7

How many departments have at least 10 employees?

#

Question 8

Which department hired the most people in 2010? Hint: you probably have to modify hiredate.

#

Question 9

Make a list of all the department names and sort this list alphabetically. What is the 42nd department in this list?

#

Question 10

Plot number of people hired by the CS dept per year vs. year

#

Now add STOR, Math, Biostatistics, SILS and Physics to the above plot

#

Open question

Examine the relationship between age and salary at UNC. You can interpret this however you like e.g.

Requirements

hints

Every question can be answered using: filter, arrange, select, mutate, summarize and group_by functions.