Reproducible Analysis Data Science Project

After reading a recent article and watching a debate on the topic, I chose to examine the relationship between pursuing continuing higher education and an individual’s salary.

Q1. Research Question:

Q2. What are the cases, and how many are there?

Q3. Describe the method of data collection.

Q4. What type of study is this (observational/experiment)?

Q5. Data Source: If you collected the data, state self-collected. If not, provide a citation/link.

Q6. Response: What is the response variable, and what type is it (numerical/categorical)?

Q7. Explanatory: What is the explanatory variable(s), and what type is it (numerical/categorical)?

Let’s begin by loading the original CSV file.

original_csv <- read.csv(file="csv_pny 2/ss09pny.csv", header=TRUE, sep=",")
dim(original_csv)
## [1] 188767    279

Next, we create a data frame containing only the columns of interest and give them descriptive names.

new_csv <- data.frame(original_csv[c("AGEP", "CIT", "COW", "SCHG", "SCHL", "SEX", "PERNP", "PINCP")])
names(new_csv) <- c("Age", "Citizenship_Status", "Worker_Class",  "School_Attending", "Educational_Attainment", "SEX", "Total_Personal_Earnings", "Total_Personal_Income")

Let’s preview the first two rows of the data frame.

head(new_csv, 2)
##   Age Citizenship_Status Worker_Class School_Attending
## 1  79                  4           NA               NA
## 2  75                  3           NA               NA
##   Educational_Attainment SEX Total_Personal_Earnings Total_Personal_Income
## 1                     15   1                       0                  9300
## 2                      1   2                       0                  3600

We can write the chosen subset to a new CSV file.

write.csv(new_csv, file = "/Users/s/Documents/MSDS/R Studio Projects/tutorial/Cuny S1/Data606/Project Proposal_data606/NY_Data.csv")

Load the CSV file from GitHub.

NY_Data2 <- read.csv(file="https://raw.githubusercontent.com/rickidonsingh/MSDS/master/Data%20606%20Statistics%20and%20Probability%20for%20Data%20Analytics/Projects/Project%20Proposal/NY_Data.csv", header=TRUE, sep=",")
head(NY_Data2)
##   X Age Citizenship_Status Worker_Class School_Attending
## 1 1  79                  4           NA               NA
## 2 2  75                  3           NA               NA
## 3 3  68                  4           NA               NA
## 4 4  68                  1           NA               NA
## 5 5  69                  1           NA               NA
## 6 6  46                  4            1               NA
##   Educational_Attainment SEX Total_Personal_Earnings Total_Personal_Income
## 1                     15   1                       0                  9300
## 2                      1   2                       0                  3600
## 3                     22   2                       0                  3800
## 4                     22   2                       0                 84200
## 5                     23   1                       0                 92500
## 6                     16   2                    3800                  3800
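The extra X column in the output above comes from the row names that write.csv stores by default. A minimal sketch of how row.names = FALSE avoids it, using a small made-up data frame and temporary files (both are assumptions for illustration, not part of the project data):

```r
# Toy data frame standing in for new_csv (values are illustrative only)
toy <- data.frame(Age = c(79, 75), SEX = c(1, 2))

# Write to temporary files rather than a real project path
with_rownames <- tempfile(fileext = ".csv")
without_rownames <- tempfile(fileext = ".csv")

write.csv(toy, file = with_rownames)                        # default: row names are written
write.csv(toy, file = without_rownames, row.names = FALSE)  # row names are suppressed

names(read.csv(with_rownames))     # includes a leading "X" column
names(read.csv(without_rownames))  # only "Age" and "SEX"
```

Dropping the row names keeps the reloaded file limited to the eight chosen variables.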

Q8. Relevant summary statistics:

summary(NY_Data2)
##        X               Age        Citizenship_Status  Worker_Class  
##  Min.   :     1   Min.   : 0.00   Min.   :1.000      Min.   :1.00   
##  1st Qu.: 47192   1st Qu.:20.00   1st Qu.:1.000      1st Qu.:1.00   
##  Median : 94384   Median :41.00   Median :1.000      Median :1.00   
##  Mean   : 94384   Mean   :40.04   Mean   :1.632      Mean   :2.18   
##  3rd Qu.:141576   3rd Qu.:58.00   3rd Qu.:1.000      3rd Qu.:3.00   
##  Max.   :188767   Max.   :94.00   Max.   :5.000      Max.   :9.00   
##                                                      NA's   :77675  
##  School_Attending Educational_Attainment      SEX       
##  Min.   : 1.0     Min.   : 1.00          Min.   :1.000  
##  1st Qu.: 6.0     1st Qu.:13.00          1st Qu.:1.000  
##  Median :11.0     Median :17.00          Median :2.000  
##  Mean   : 9.8     Mean   :15.88          Mean   :1.521  
##  3rd Qu.:15.0     3rd Qu.:20.00          3rd Qu.:2.000  
##  Max.   :16.0     Max.   :24.00          Max.   :2.000  
##  NA's   :141365   NA's   :6108                          
##  Total_Personal_Earnings Total_Personal_Income
##  Min.   : -7400          Min.   : -13200      
##  1st Qu.:     0          1st Qu.:   7000      
##  Median : 12000          Median :  22400      
##  Mean   : 31315          Mean   :  38837      
##  3rd Qu.: 42100          3rd Qu.:  50000      
##  Max.   :957000          Max.   :1225000      
##  NA's   :35877           NA's   :33251
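Several columns in this summary are integer codes rather than quantities, so a value such as the mean of SEX is not meaningful, and the NA counts need handling before any correlation is computed. A rough sketch on a toy data frame follows. It assumes that SEX code 1 means male and 2 means female, and that Educational_Attainment codes are ordered from least to most schooling; both should be checked against the survey's data dictionary. The rank-based (Spearman) correlation is an illustrative choice for an ordinal code, not necessarily the method this project will use.

```r
# Toy stand-in for NY_Data2 (values are illustrative, not from the survey)
toy <- data.frame(
  SEX = c(1, 2, 2, 1, 2),
  Educational_Attainment = c(15, 1, 22, NA, 16),
  Total_Personal_Earnings = c(0, 0, 52000, 41000, NA)
)

# Assumed coding: 1 = Male, 2 = Female (verify in the data dictionary)
toy$SEX <- factor(toy$SEX, levels = c(1, 2), labels = c("Male", "Female"))

# Counts per category instead of a numeric mean
table(toy$SEX)

# Rank-based correlation using only rows where both variables are present
edu_earn_cor <- cor(toy$Educational_Attainment,
                    toy$Total_Personal_Earnings,
                    use = "complete.obs",
                    method = "spearman")
edu_earn_cor
```

The same pattern would apply to the other coded columns, such as Citizenship_Status and Worker_Class, once their code meanings are confirmed.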