Download the cv2010on.csv file from Moodle, save it on your computer, and upload it to the data folder of your RStudio. Revise the read.csv code line so that it matches both the name and the location of the data file. The data contain civil cases filed in the federal district courts of the six New England states from 2011 onward. Each row represents a civil case, and each column one of its characteristics. DEF stands for defendant, PLT for plaintiff, and nature_of_suit for the type of lawsuit. Name the data object civilCases throughout the code.

Q1 How many cases are there?

36643

# Load packages
library(dplyr) # for dplyr functions such as glimpse(), mutate(), and filter()
library(ggplot2) # for ggplot2 functions such as ggplot()

# Import data
civilCases <- read.csv("/resources/rstudio/businessstatistics/data/cv2010on.csv") 

# Convert data to a tibble (tbl_df() is deprecated since dplyr 1.0.0; as_tibble() replaces it)
civilCases <- as_tibble(civilCases)
str(civilCases)
## Classes 'tbl_df', 'tbl' and 'data.frame':    36643 obs. of  6 variables:
##  $ DISTRICT      : Factor w/ 6 levels "CT","MA","ME",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ PLT           : Factor w/ 19900 levels "-8",":WALKER EL: VENUS-ANTOINETTE",..: 6393 3300 5130 19442 7175 3482 6269 4384 12436 13162 ...
##  $ DEF           : Factor w/ 19496 levels "-8","'47 BRAND, LLC",..: 8018 11968 5576 10445 5251 14988 7759 1510 8210 13180 ...
##  $ FILEYEAR      : int  2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
##  $ NOS           : int  445 385 442 440 440 190 440 442 190 110 ...
##  $ nature_of_suit: Factor w/ 44 levels "ADMINISTRATIVE PROCEDURE ACT/REVIEW OR APPEAL OF AGENCY\nDECISION",..: 7 37 9 28 28 29 28 9 29 19 ...
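
The str() output above shows the text columns as factors. read.csv() does that by default only in R versions before 4.0.0; from 4.0.0 on, text columns are read as character unless stringsAsFactors = TRUE is passed, and levels() then returns NULL. A minimal sketch of the difference, using a small inline CSV with made-up rows rather than the course data:

```r
# Toy CSV text standing in for cv2010on.csv (hypothetical rows, not the real data)
csv_text <- "DISTRICT,FILEYEAR\nME,2011\nNH,2011\nMA,2012"

# Request factors explicitly so the result does not depend on the R version
toy_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)
toy_char   <- read.csv(text = csv_text, stringsAsFactors = FALSE)

levels(toy_factor$DISTRICT)  # "MA" "ME" "NH"
levels(toy_char$DISTRICT)    # NULL, because a character column has no levels
```

If levels() unexpectedly returns NULL on a column that does exist, checking class() of that column is a quick way to see whether it was read as character.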

Q2 List all values in the DISTRICT variable.

Revise the levels code below so that R returns all levels (values) of the DISTRICT variable.

CT MA ME NH RI VT

Q3 How many HEALTH CARE / PHARM cases were filed in New Hampshire?

Revise the table code below so that R returns the answer to the question.

445 cases

# Check the levels of DISTRICT
levels(civilCases$DISTRICT)
## [1] "CT" "MA" "ME" "NH" "RI" "VT"

# Create a 2-way contingency table of suit type by district
tab <- table(civilCases$nature_of_suit, civilCases$DISTRICT)

# Print tab (rows are suit types, columns are districts)
tab
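
As a hedged illustration of how a single cell can be read from a two-way table, the sketch below builds a toy data frame with the same column names as civilCases (the rows are hypothetical) and looks up one suit type in one district, first by indexing the table and then with dplyr:

```r
library(dplyr)

# Hypothetical toy data using the same column names as civilCases
toy <- data.frame(
  DISTRICT = c("NH", "NH", "MA", "NH", "MA"),
  nature_of_suit = c("HEALTH CARE / PHARM", "OTHER CIVIL RIGHTS",
                     "HEALTH CARE / PHARM", "HEALTH CARE / PHARM",
                     "OTHER CONTRACT ACTIONS")
)

# Two-way table: rows are suit types, columns are districts
toy_tab <- table(toy$nature_of_suit, toy$DISTRICT)
toy_tab["HEALTH CARE / PHARM", "NH"]  # 2 in this toy data

# The same count with dplyr: keep matching rows, then count them
nh_health <- toy %>%
  filter(DISTRICT == "NH", nature_of_suit == "HEALTH CARE / PHARM") %>%
  nrow()
```

Indexing by row and column name avoids scanning a large printed table by eye, but the names must match the data values exactly.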

Q4 What is the district that handles the largest number of civil cases?

Revise the bar chart code below to find the answer.

MA

# Create bar chart of the number of cases by district
ggplot(civilCases, aes(x = DISTRICT)) + 
  geom_bar()
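
A bar chart answer can be cross-checked numerically. The sketch below is one way to do that, assuming dplyr is loaded; the toy data frame is hypothetical and only borrows the DISTRICT column name:

```r
library(dplyr)

# Hypothetical toy data: one row per case
toy <- data.frame(DISTRICT = c("MA", "MA", "CT", "ME", "MA", "CT"))

# Count cases per district, largest first
district_counts <- toy %>%
  count(DISTRICT, sort = TRUE)

district_counts$DISTRICT[1]  # "MA" in this toy data
```

Applied to civilCases, the first row of the sorted counts should name the same district as the tallest bar.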

Q5 Revise the code as instructed below. Don’t be surprised if your chart doesn’t look right. Explain the reason why.

The fill variable has too many categories: nature_of_suit has 44 levels, so the stacked segments and the legend are too crowded to read individual values.

# Plot proportion of nature_of_suit, conditional on DISTRICT
ggplot(civilCases, aes(x = DISTRICT, fill = nature_of_suit)) + 
  geom_bar(position = "fill")
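
Before plotting a fill variable, it can help to check how many categories it has and which ones are most frequent. A minimal base R sketch with a hypothetical toy factor (the labels A, B, and C are placeholders, not real suit types):

```r
# Hypothetical toy factor standing in for nature_of_suit
toy_suit <- factor(c("A", "A", "A", "B", "B", "C"))

# Number of distinct categories that the fill aesthetic would have to show
n_categories <- nlevels(toy_suit)

# Category names ordered from most to least frequent; keep the top two
top_two <- names(sort(table(toy_suit), decreasing = TRUE))[1:2]
top_two  # "A" "B"
```

On civilCases, the same idea with nature_of_suit and [1:5] would list the five most frequent suit types.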

Q6 Which district has the largest share of HEALTH CARE / PHARM cases?

MA

Simplify the chart in Q5 by filtering for the top five categories in the nature of suit:

1. HEALTH CARE / PHARM
2. OTHER CIVIL RIGHTS
3. OTHER CONTRACT ACTIONS
4. CIVIL RIGHTS JOBS
5. PERSONAL INJURY -PRODUCT LIABILITY

Revise the filter code below.

# Category names must match the values in nature_of_suit exactly (case and spacing)
civilCases_filtered <-
  civilCases %>%
  filter(nature_of_suit %in% c("HEALTH CARE / PHARM", "OTHER CIVIL RIGHTS",
                               "OTHER CONTRACT ACTIONS", "CIVIL RIGHTS JOBS",
                               "PERSONAL INJURY -PRODUCT LIABILITY"))

ggplot(civilCases_filtered, aes(x = DISTRICT, fill = nature_of_suit)) +
  geom_bar(position = "fill")
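
The shares that position = "fill" draws can also be computed as numbers, which makes close calls between districts easier to judge. The sketch below is one possible approach, assuming dplyr is loaded; the toy data are hypothetical and only reuse the column names:

```r
library(dplyr)

# Hypothetical toy data: four cases in each of two districts
toy <- data.frame(
  DISTRICT = c("MA", "MA", "MA", "MA", "NH", "NH", "NH", "NH"),
  nature_of_suit = c("HEALTH CARE / PHARM", "HEALTH CARE / PHARM",
                     "HEALTH CARE / PHARM", "OTHER CIVIL RIGHTS",
                     "HEALTH CARE / PHARM", "OTHER CIVIL RIGHTS",
                     "OTHER CIVIL RIGHTS", "OTHER CIVIL RIGHTS")
)

# Share of each suit type within its district, then keep one suit type
shares <- toy %>%
  count(DISTRICT, nature_of_suit) %>%
  group_by(DISTRICT) %>%
  mutate(share = n / sum(n)) %>%
  ungroup() %>%
  filter(nature_of_suit == "HEALTH CARE / PHARM") %>%
  arrange(desc(share))

shares$DISTRICT[1]  # "MA" in this toy data (share 0.75)
```

Note that the shares are relative to whichever rows are in the data at that point, so running this on civilCases_filtered gives shares among the five selected categories only.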