After reading a recent article and watching a debate on the topic, I chose to examine the relationship between an individual's pursuit of higher education and their salary.
Q1. Research Question:
Q2. What are the cases, and how many are there?
Q3. Describe the method of data collection.
Q4. What type of study is this (observational/experiment)?
Q5. Data Source: If you collected the data, state self-collected. If not, provide a citation/link.
Q6. Response: What is the response variable, and what type is it (numerical/categorical)?
Q7. Explanatory: What is the explanatory variable(s), and what type is it (numerical/categorical)?
Let’s begin by loading the original CSV file:
original_csv <- read.csv(file="csv_pny 2/ss09pny.csv", header=TRUE, sep=",")
dim(original_csv)
## [1] 188767    279
Next, we will create a data frame containing a subset of the columns:
new_csv <- data.frame(original_csv[c("AGEP", "CIT", "COW", "SCHG", "SCHL", "SEX", "PERNP", "PINCP")])
names(new_csv) <- c("Age", "Citizenship_Status", "Worker_Class", "School_Attending", "Educational_Attainment", "SEX", "Total_Personal_Earnings", "Total_Personal_Income")
Let’s see the data frame:
head(new_csv, 2)
##   Age Citizenship_Status Worker_Class School_Attending
## 1  79                  4           NA               NA
## 2  75                  3           NA               NA
##   Educational_Attainment SEX Total_Personal_Earnings Total_Personal_Income
## 1                     15   1                       0                  9300
## 2                      1   2                       0                  3600
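Several of these columns, such as SEX and Citizenship_Status, hold integer codes rather than measurements. A minimal sketch of recoding one of them as a factor follows. The toy data frame and the labels (1 = Male, 2 = Female) are assumptions for illustration and should be checked against the survey's data dictionary.

```r
# Toy stand-in for new_csv; the values are illustrative only.
toy <- data.frame(Age = c(79, 75, 46), SEX = c(1, 2, 2))

# Assumed coding: 1 = Male, 2 = Female (verify against the data dictionary).
toy$SEX <- factor(toy$SEX, levels = c(1, 2), labels = c("Male", "Female"))

# A factor is tabulated by level instead of being averaged like a number.
sex_counts <- table(toy$SEX)
print(sex_counts)
```

Treating such columns as factors keeps later summaries from reporting a mean of category codes.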
We can create a new CSV file containing the chosen subset:
write.csv(new_csv, file = "/Users/s/Documents/MSDS/R Studio Projects/tutorial/Cuny S1/Data606/Project Proposal_data606/NY_Data.csv")
Load the CSV file:
NY_Data2 <- read.csv(file="https://raw.githubusercontent.com/rickidonsingh/MSDS/master/Data%20606%20Statistics%20and%20Probability%20for%20Data%20Analytics/Projects/Project%20Proposal/NY_Data.csv", header=TRUE, sep=",")
head(NY_Data2)
##   X Age Citizenship_Status Worker_Class School_Attending
## 1 1  79                  4           NA               NA
## 2 2  75                  3           NA               NA
## 3 3  68                  4           NA               NA
## 4 4  68                  1           NA               NA
## 5 5  69                  1           NA               NA
## 6 6  46                  4            1               NA
##   Educational_Attainment SEX Total_Personal_Earnings Total_Personal_Income
## 1                     15   1                       0                  9300
## 2                      1   2                       0                  3600
## 3                     22   2                       0                  3800
## 4                     22   2                       0                 84200
## 5                     23   1                       0                 92500
## 6                     16   2                    3800                  3800
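The X column in this output is the row index that write.csv() adds by default. A small sketch, using a hypothetical toy data frame and a temporary file rather than the project's own paths, shows how row.names = FALSE would avoid it:

```r
# Toy data frame standing in for new_csv; the values are illustrative only.
toy <- data.frame(Age = c(79, 75), SEX = c(1, 2))

# Write without row names so no extra index column is created.
tmp <- tempfile(fileext = ".csv")
write.csv(toy, file = tmp, row.names = FALSE)

# Reading the file back yields only the original columns.
round_trip <- read.csv(tmp, header = TRUE)
print(names(round_trip))

unlink(tmp)  # remove the temporary file
```

With the default setting the index is harmless, but it shows up in summary() as if it were a variable.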
Q8. Relevant summary statistics
summary(NY_Data2)
##        X               Age        Citizenship_Status  Worker_Class
##  Min.   :     1   Min.   : 0.00   Min.   :1.000      Min.   :1.00
##  1st Qu.: 47192   1st Qu.:20.00   1st Qu.:1.000      1st Qu.:1.00
##  Median : 94384   Median :41.00   Median :1.000      Median :1.00
##  Mean   : 94384   Mean   :40.04   Mean   :1.632      Mean   :2.18
##  3rd Qu.:141576   3rd Qu.:58.00   3rd Qu.:1.000      3rd Qu.:3.00
##  Max.   :188767   Max.   :94.00   Max.   :5.000      Max.   :9.00
##                                                      NA's   :77675
##  School_Attending Educational_Attainment      SEX
##  Min.   : 1.0     Min.   : 1.00          Min.   :1.000
##  1st Qu.: 6.0     1st Qu.:13.00          1st Qu.:1.000
##  Median :11.0     Median :17.00          Median :2.000
##  Mean   : 9.8     Mean   :15.88          Mean   :1.521
##  3rd Qu.:15.0     3rd Qu.:20.00          3rd Qu.:2.000
##  Max.   :16.0     Max.   :24.00          Max.   :2.000
##  NA's   :141365   NA's   :6108
##  Total_Personal_Earnings Total_Personal_Income
##  Min.   : -7400          Min.   : -13200
##  1st Qu.:     0          1st Qu.:   7000
##  Median : 12000          Median :  22400
##  Mean   : 31315          Mean   :  38837
##  3rd Qu.: 42100          3rd Qu.:  50000
##  Max.   :957000          Max.   :1225000
##  NA's   :35877           NA's   :33251
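Because the research question compares income across education levels, a per-group summary may be more informative than the overall one. The sketch below assumes a toy data frame with the same column names; the values are made up for illustration and are not taken from the survey.

```r
# Toy data with the same column names as NY_Data2; values are hypothetical.
toy <- data.frame(
  Educational_Attainment = c(16, 16, 21, 21, 22, NA),
  Total_Personal_Income  = c(20000, 30000, 50000, NA, 80000, 10000)
)

# Median income within each education code, skipping missing incomes.
# Rows with a missing education code are dropped by tapply().
median_by_edu <- tapply(
  toy$Total_Personal_Income,
  toy$Educational_Attainment,
  median,
  na.rm = TRUE
)
print(median_by_edu)
```

The median is used here rather than the mean because the income columns have a long right tail (the maximum is far above the third quartile).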
Let’s visualize our findings
hist(NY_Data2$Educational_Attainment, xlab= "Education", main = "Educational Attainment")
hist(NY_Data2$Total_Personal_Income, xlab= "Personal Income", main = "Total Personal Income")
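To look at the two variables together, a rank correlation is one option, since educational attainment is a roughly ordered code and income is heavily skewed. This is a sketch on hypothetical toy values, not a result from the survey; with the real data, NY_Data2 would take the place of toy.

```r
# Toy data with the same column names as NY_Data2; values are hypothetical.
toy <- data.frame(
  Educational_Attainment = c(12, 16, 18, 21, 22, NA),
  Total_Personal_Income  = c(15000, 22000, 30000, 52000, 90000, 40000)
)

# Spearman rank correlation, using only rows where both values are present.
rho <- cor(
  toy$Educational_Attainment,
  toy$Total_Personal_Income,
  method = "spearman",
  use = "complete.obs"
)
print(rho)
```

The use = "complete.obs" argument matters for this dataset because both columns contain many NA values, and cor() returns NA by default when any are present.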