Data Preparation

# load packages and data
library(ggplot2)  # provides qplot(), used for the histograms below
schsb <- read.csv(file = 'https://raw.githubusercontent.com/pkofy/DATA606/main/Data%20Project/F_SCH_SB_2020_latest.csv')


Research question

You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.

Are pension plans with higher liabilities more likely to be better funded? How does that vary by pension plan type: Single, Multi-, or Multiple Employer?
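
One possible way to frame this question, offered only as a sketch, is a linear model of funding percentage on liability with an interaction by plan type. The data frame `toy` and its column names `funding`, `liability`, and `plan_type` are hypothetical and the values are made up; the log10 transform is an assumption to tame the wide range of liabilities, not something the dataset prescribes.

```r
# Hypothetical toy data: these values are made up for illustration only
toy <- data.frame(
  funding   = c(1.01, 0.95, 1.20, 0.88, 1.05, 0.97, 1.10, 0.92),
  liability = c(1e6, 5e6, 2e5, 8e7, 3e6, 4e7, 6e5, 9e6),
  plan_type = factor(rep(c("Single", "Multi"), each = 4))
)

# Funding modeled on log10(liability); the `*` adds an interaction so each
# plan type gets its own intercept and its own slope
fit <- lm(funding ~ log10(liability) * plan_type, data = toy)
coef(fit)
```

A positive slope on `log10(liability)` would suggest that larger plans tend to be better funded, and the interaction coefficient shows how that slope differs between plan types.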


Cases

What are the cases, and how many are there?

Each case is a Schedule SB that was filed with the Department of Labor regarding a defined benefit pension plan’s funding status using a 2020 Form 5500. We have 39,524 cases.


Data collection

Describe the method of data collection.

The data is compiled by the Employee Benefits Security Administration to comply with the Freedom of Information Act. The data is not processed; the fields are the raw values from the electronic submissions.


Type of study

What type of study is this (observational/experiment)?

This is an observational study.


Data Source

If you collected the data, state self-collected. If not, provide a citation/link.

The data is stored on the DOL’s website. I’ve selected the Schedule SB information from the 2020 form-year filings because not all of the 2021 filings have been submitted.


Dependent Variable

What is the response variable? Is it quantitative or qualitative?

The response variable is quantitative. It is the plan’s funding percentage, stored in the column SB_ADJ_FNDNG_TGT_PRCNT. For example, “101.49” means the plan is 101.49% funded, that is, over 100% funded.


Independent Variable

You should have two independent variables, one quantitative and one qualitative.

The quantitative independent variable is the size, or liability, of the plan. This is represented by the column SB_TOT_FNDNG_TGT_AMT. For example, “1011315” means the plan has a liability of $1,011,315.

The qualitative independent variable is the type of the plan, represented by the column SB_PLAN_TYPE_CODE. For example, “1” means the plan is a Single Employer plan; the other values, 2 and 3, represent the Multi- and Multiple Employer plan types. A multiemployer plan lets two or more similar employers share one benefit plan, so that, for example, a trucker who works for several trucking companies earns one benefit in one plan instead of small benefits in many plans. A multiple employer plan lets two or more unrelated employers share the administrative cost of running a pension plan.
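
The numeric codes are easier to read in tables and plots once they carry labels. The following is a minimal sketch; the helper name `label_plan_type` is hypothetical, and the mapping of 2 and 3 to Multi- and Multiple Employer simply follows the order given in the paragraph above, so it should be checked against the DOL file layout before use.

```r
# Assumed mapping, following the order given in the text:
# 1 = Single Employer, 2 = Multiemployer, 3 = Multiple Employer
label_plan_type <- function(codes) {
  factor(codes,
         levels = c(1, 2, 3),
         labels = c("Single Employer", "Multiemployer", "Multiple Employer"))
}

# Small hypothetical vector of codes, not taken from the dataset
label_plan_type(c(1, 3, 2, 1))
```

On the real data this might be applied as `schsb$plan_type <- label_plan_type(schsb$SB_PLAN_TYPE_CODE)`, where `plan_type` is a new, hypothetical column name.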


Relevant summary statistics

Provide summary statistics for each of the variables. Also include appropriate visualizations related to your research question (e.g. scatter plot, boxplots, etc). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.


Dependent Variable: Funding Percentage

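# convert to numeric and rescale so that 1.00 means 100% funded
schsb$SB_ADJ_FNDNG_TGT_PRCNT <- as.numeric(schsb$SB_ADJ_FNDNG_TGT_PRCNT)/100

# histogram of funding, limited to plans between 0% and 300% funded
qplot(schsb$SB_ADJ_FNDNG_TGT_PRCNT,
     xlab="Funding Percentage",
     xlim=c(0,3),
     bins = 50)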


Quantitative Independent Variable: Liability

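# convert the liability (funding target amount) to numeric dollars
schsb$SB_TOT_FNDNG_TGT_AMT <- as.numeric(schsb$SB_TOT_FNDNG_TGT_AMT)

# histogram of liability, limited to plans with up to $10 million in liabilities
qplot(schsb$SB_TOT_FNDNG_TGT_AMT,
     xlab="Pension Plan Liability in Dollars",
     xlim=c(0,10000000),
     bins = 50)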
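
Because the histograms use `xlim`, values outside the plotted range are dropped from the picture, so numeric summaries are a useful complement. The following is a minimal sketch, not part of the original analysis: the helper `describe` is hypothetical, and `toy_liability` holds made-up dollar amounts (apart from the example value 1011315 quoted earlier).

```r
# Hypothetical helper: numeric summary that ignores missing values
describe <- function(x) {
  x <- as.numeric(x)
  c(n       = sum(!is.na(x)),
    missing = sum(is.na(x)),
    mean    = mean(x, na.rm = TRUE),
    median  = median(x, na.rm = TRUE),
    sd      = sd(x, na.rm = TRUE))
}

# Made-up liabilities in dollars, for illustration only
toy_liability <- c(250000, 1011315, 4800000, NA, 72000000)
describe(toy_liability)

# On the real data this might be called as:
# describe(schsb$SB_TOT_FNDNG_TGT_AMT)
# describe(schsb$SB_ADJ_FNDNG_TGT_PRCNT)
```

Reporting the median alongside the mean helps here, since a few very large plans can pull the mean well above the typical plan.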


Qualitative Independent Variable: Plan Type

table(schsb$SB_PLAN_TYPE_CODE)
## 
##     1     2     3 
## 39378    58    88
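
The counts above can be turned into shares to show how unbalanced the plan types are. This is a small sketch that re-enters the three counts by hand; the names `plan_counts` and `plan_share` are hypothetical.

```r
# Counts copied from the table() output above
plan_counts <- c("1" = 39378, "2" = 58, "3" = 88)

# Total number of filings; this matches the 39,524 cases reported earlier
sum(plan_counts)

# Share of all filings in each plan type, as a percentage
plan_share <- round(100 * plan_counts / sum(plan_counts), 2)
plan_share
```

Plan type 1 accounts for well over 99% of the filings, so any comparison across plan types rests on only 58 and 88 cases in the other two groups.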