This is a final project to show off what you have learned. Select your data set from the list below: http://vincentarelbundock.github.io/Rdatasets/ (click on the csv index for a list). Another good source is found here: https://archive.ics.uci.edu/ml/datasets.html

Start with a problem statement at the beginning and make sure to answer it at the end with what you learned. The presentation approach is up to you, but it should contain the following:

Problem Statement: Are wages affected by education and experience, and do full/part-time status, location, or ethnicity impact an employee’s wage?

Dataset: Determinants of Wages Data (CPS 1988)

  1. Data Exploration: This should include summary statistics, means, medians, quartiles, or any other relevant information about the data set. Please include some conclusions in the R Markdown text.
csvurl <- "https://vincentarelbundock.github.io/Rdatasets/csv/AER/CPS1988.csv"
# Determinants of Wages Data (CPS 1988)
mydata <- read.table(file=csvurl, header=TRUE, sep=",")
print(head(mydata, 20))
##     X    wage education experience ethnicity smsa    region parttime
## 1   1  354.94         7         45      cauc  yes northeast       no
## 2   2  123.46        12          1      cauc  yes northeast      yes
## 3   3  370.37         9          9      cauc  yes northeast       no
## 4   4  754.94        11         46      cauc  yes northeast       no
## 5   5  593.54        12         36      cauc  yes northeast       no
## 6   6  377.23        16         22      cauc  yes northeast       no
## 7   7  284.90         8         51      cauc  yes northeast       no
## 8   8  561.13        12         34      cauc  yes northeast       no
## 9   9  264.06        12          0      cauc  yes northeast       no
## 10 10 1643.83        14         18      cauc  yes northeast       no
## 11 11  474.83        12         17      cauc  yes northeast       no
## 12 12  299.15         8         42      cauc  yes northeast       no
## 13 13  244.88        10         10      cauc  yes northeast       no
## 14 14  474.83        14         19      cauc   no northeast       no
## 15 15  213.68        12         40      cauc   no northeast       no
## 16 16  864.20        16         42      cauc   no northeast       no
## 17 17  841.93        14         27      cauc   no northeast       no
## 18 18  301.96        16         -1      cauc   no northeast       no
## 19 19  669.28        12         17      cauc   no northeast       no
## 20 20  403.61        12         42      cauc   no northeast       no
summary(mydata)
##        X              wage            education       experience  
##  Min.   :    1   Min.   :   50.05   Min.   : 0.00   Min.   :-4.0  
##  1st Qu.: 7040   1st Qu.:  308.64   1st Qu.:12.00   1st Qu.: 8.0  
##  Median :14078   Median :  522.32   Median :12.00   Median :16.0  
##  Mean   :14078   Mean   :  603.73   Mean   :13.07   Mean   :18.2  
##  3rd Qu.:21117   3rd Qu.:  783.48   3rd Qu.:15.00   3rd Qu.:27.0  
##  Max.   :28155   Max.   :18777.20   Max.   :18.00   Max.   :63.0  
##   ethnicity             smsa              region            parttime        
##  Length:28155       Length:28155       Length:28155       Length:28155      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
## 

Note: smsa refers to ‘Standard Metropolitan Statistical Area’, according to the US Office of Management and Budget.

A cursory glance at rows 16 and 20 shows a few key initial differences. Employee 16 (E16) and Employee 20 (E20) both have 42 years of experience, live in a non-SMSA area in the northeast, and work full-time. However, E16 has 16 years of education and E20 has 12, and E16 earns more than twice as much (864.20 vs. 403.61).

Similarly, E11 and E19 both have 12 years of education and 17 years of experience, and both work full-time in the northeast. However, E11 lives in an SMSA and E19 doesn’t, yet E19’s wage is higher (669.28 vs. 474.83). One would expect a metropolitan area to command a higher wage, but this doesn’t always seem to be the case.

In the summary statistics, the average wage across the 28,155 employees is 603.73, and the maximum is 18,777.20, which I would presume belongs to a CEO. The mean exceeds the median (522.32), so the wage distribution is not symmetric: it is right-skewed, with a long tail for the few employees who command a large salary. Education ranges from 0 to 18 years, with a median of 12 and a mean of 13.07; the mean again sits above the median, which suggests that the values above 12 pull the average up. The quartiles bear this out: the 1st quartile and the median are both 12, so roughly a quarter or more of the employees have exactly 12 years of education, the 3rd quartile is 15, and the maximum is 18.
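The mean-versus-median comparison used above can be sketched in a few lines. This is only a rough heuristic, not a formal skewness test, and `skew_direction` and `wage_sample` are hypothetical names introduced here; the sample holds the 20 wages printed by `head(mydata, 20)`.

```r
# Wages from the first 20 rows printed above (illustration only).
wage_sample <- c(354.94, 123.46, 370.37, 754.94, 593.54, 377.23, 284.90,
                 561.13, 264.06, 1643.83, 474.83, 299.15, 244.88, 474.83,
                 213.68, 864.20, 841.93, 301.96, 669.28, 403.61)

# Hypothetical helper: a mean above the median hints at a right skew,
# a mean below the median hints at a left skew.
skew_direction <- function(x) {
  m <- mean(x)
  med <- median(x)
  if (m > med) "right" else if (m < med) "left" else "none"
}

skew_direction(wage_sample)  # direction suggested by the sample
quantile(wage_sample)        # min, quartiles, and max in one call
```

On the full data, the same check would be `skew_direction(mydata$wage)`.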

In terms of experience, the values range from -4 to 63 years. A negative value is confusing, since negative work experience should be impossible, so I will remove those instances. (The dataset documentation describes this field as potential experience, computed as age minus education minus 6, which would explain the negative values.) From my own analysis of the raw data, the 438 employees with negative experience values have 13-18 years of education and wages ranging from 51.44 to 3806.58. These rows are a small fraction of the original 28,155 (about 1.6%), so I have no qualms about subsetting them out.
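The check on negative experience values could look roughly like the sketch below. It runs on a small frame built from rows 16 to 20 of the printed output (row 18 has experience -1), and `sample_rows` and the other variable names are assumptions for illustration; on the real data the same expressions would be applied to `mydata`.

```r
# Rows 16-20 from the head() output above (illustration only).
sample_rows <- data.frame(
  wage       = c(864.20, 841.93, 301.96, 669.28, 403.61),
  education  = c(16, 14, 16, 12, 12),
  experience = c(42, 27, -1, 17, 42)
)

# Isolate the rows with negative experience and measure how many there are.
negative       <- subset(sample_rows, experience < 0)
n_negative     <- nrow(negative)
share_negative <- n_negative / nrow(sample_rows)

# Inspect the education and wage ranges of the rows about to be dropped.
range(negative$education)
range(negative$wage)
```

Looking at the count and share before subsetting makes it easier to justify dropping the rows.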

  1. Data wrangling: Please perform some basic transformations. They will need to make sense but could include column renaming, creating a subset of the data, replacing values, or creating new columns with derived data (for example – if it makes sense you could sum two columns together)
myframe <- data.frame(mydata,stringsAsFactors=FALSE)
myframe <- subset(myframe,experience >= 0) 
colnames(myframe) <- c("Employee","Wage","Education_Yrs","Experience_Yrs","Ethnicity","Metro","Region","Part_Time")
myframe$Wage <- round(myframe$Wage,digits = 0)  # Round to nearest whole number
head(myframe)
##   Employee Wage Education_Yrs Experience_Yrs Ethnicity Metro    Region
## 1        1  355             7             45      cauc   yes northeast
## 2        2  123            12              1      cauc   yes northeast
## 3        3  370             9              9      cauc   yes northeast
## 4        4  755            11             46      cauc   yes northeast
## 5        5  594            12             36      cauc   yes northeast
## 6        6  377            16             22      cauc   yes northeast
##   Part_Time
## 1        no
## 2       yes
## 3        no
## 4        no
## 5        no
## 6        no
  1. Graphics: Please make sure to display at least one scatter plot, box plot and histogram. Don’t be limited to this. Please explore the many other options in R packages such as ggplot2.
library(scatterplot3d)
library(rgl)
library(ggplot2)
#library(ggthemes)

# Histogram
hist(myframe$Education_Yrs, breaks= 20, main = "Frequency of Years of Education",xlab="Education",col="lightgreen",labels=TRUE)

# Box Plot
boxplot(myframe$Education_Yrs,main="Years of Education Box Plot",xlab="Education")

# Scatterplots
ggplot(myframe, aes(x=Wage, y=Region)) + geom_point() + facet_wrap(~Region)

ggplot(myframe, aes(x=Wage, y=Education_Yrs)) + geom_point() 

ggplot(myframe, aes(x=Wage, y=Experience_Yrs)) + geom_point() 

ggplot(myframe, aes(x=Experience_Yrs, y=Education_Yrs)) + geom_point(aes(color=Wage)) 

scatterplot3d(myframe$Education_Yrs,myframe$Experience_Yrs,myframe$Wage, pch=16, highlight.3d=TRUE, type="h", main="Education & Experience vs Wage", xlab="Education",ylab="Experience",zlab="Wage") # 3D scatterplot of Education, Experience, Wage

plot3d(myframe$Education_Yrs,myframe$Experience_Yrs,myframe$Wage,col="black", size=3,xlab="Education",ylab="Experience",zlab="Wage") 
# Interactive 3D scatterplot of Education, Experience, Wage


ggplot(myframe, aes(x=Experience_Yrs, y=Education_Yrs)) + geom_point(aes(color=Metro)) + facet_wrap(~Region)

# From this we can see that region seemingly has little to no effect on the Education vs Experience determination on Wage.
ggplot(myframe, aes(x=Experience_Yrs, y=Education_Yrs)) + geom_point(aes(color=Part_Time)) + facet_wrap(~Region)

# Part/Full-time status has some bearing, but not as much as Education and Experience.
# Out of curiosity, let's check the 1-to-1 correlation between Wage and Education/Experience

cor(myframe$Wage, myframe$Education_Yrs, method = c("pearson"))  # Correlation is 0.3112436
## [1] 0.3112436
cor(myframe$Wage, myframe$Experience_Yrs, method = c("pearson")) # Correlation is 0.1793643
## [1] 0.1793643
cor(myframe$Wage, myframe$Education_Yrs, method = c("kendall"))  # Correlation is 0.2588596
## [1] 0.2588596
cor(myframe$Wage, myframe$Experience_Yrs, method = c("kendall")) # Correlation is 0.2071639
## [1] 0.2071639
cor(myframe$Wage, myframe$Education_Yrs, method = c("spearman"))  # Correlation is 0.3519162
## [1] 0.3519162
cor(myframe$Wage, myframe$Experience_Yrs, method = c("spearman")) # Correlation is 0.2942816
## [1] 0.2942816
# Regardless of the method used, Education correlates more strongly with Wage than Experience does.
  1. Meaningful question for analysis: Please state at the beginning a meaningful question for analysis. Use the first three steps and anything else that would be helpful to answer the question you are posing from the data set you chose. Please write a brief conclusion paragraph in R markdown at the end.

My initial question was which of the provided fields hold significant sway over an employee’s wages: years of education, years of experience, full/part-time status, location (metropolitan area and region), and ethnicity.

Full/part-time status and ethnicity seem to have little to no impact on wages, as there are several part-time employees earning more than full-time employees and vice versa. The downward-facing slope in the scatterplots suggests that wages depend more on an employee’s education and experience than on the other factors.
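A group summary is one way to check this visual impression. The sketch below computes the median wage per full/part-time group on the six rows printed by `head(myframe)`; `toy` and `median_by_status` are hypothetical names, and six rows are far too few to support any conclusion, so this only shows the mechanics.

```r
# First six rows of myframe as printed above (illustration only).
toy <- data.frame(
  Wage      = c(355, 123, 370, 755, 594, 377),
  Part_Time = c("no", "yes", "no", "no", "no", "no")
)

# Median wage within each Part_Time group.
median_by_status <- tapply(toy$Wage, toy$Part_Time, median)
median_by_status
```

On the full data, `tapply(myframe$Wage, myframe$Part_Time, median)` or a box plot of Wage by Part_Time would give a fairer comparison than reading individual points off a scatterplot.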

This trend holds whether or not the employee lives in a metropolitan area. While each region has different high-value outliers, the scatterplot still shows most wages falling between 0 and 5,000, with a few statistical outliers earning more than 7,500.

Looking at the correlations of Education and Experience with Wage, Education appears to have the stronger relationship. However, judging by the Wage vs Education and Wage vs Experience scatterplots, I believe this is skewed by a few statistical outliers, specifically those making above 7,500.
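One way to test the outlier suspicion is to recompute the correlation after dropping wages above a cap and compare it with the original. The helper `cor_with_cap` is a hypothetical name, not part of the analysis above. The demo uses the six rows from `head(myframe)` with an artificial cap of 700 so that one row is dropped; on the full data the cap would be 7500.

```r
# First six rows of myframe as printed above (illustration only).
toy <- data.frame(
  Wage          = c(355, 123, 370, 755, 594, 377),
  Education_Yrs = c(7, 12, 9, 11, 12, 16)
)

# Hypothetical helper: Pearson correlation of Wage and Education_Yrs,
# computed on all rows and again on rows with Wage at or below the cap.
cor_with_cap <- function(df, cap = 7500) {
  trimmed <- df[df$Wage <= cap, ]
  c(all    = cor(df$Wage, df$Education_Yrs),
    capped = cor(trimmed$Wage, trimmed$Education_Yrs))
}

result <- cor_with_cap(toy, cap = 700)  # artificial cap for the toy rows
result
```

If the two values are close on the full data, the outliers are probably not driving the result. The rank-based methods already computed above (Kendall and Spearman) are also less sensitive to extreme values than Pearson.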

From the 3D scatterplot visualized above, it appears that years of Experience have a higher impact on wages than Education, even though Education has the higher correlation, and that both matter more than all the other fields. When Experience is 0, those with more education have a higher wage. As Experience increases and Education stays the same, Wage increases as well.

When years of experience are the same, having more education tends to come with higher wages. Regression analysis could assign a precise numerical value to how strongly each field weighs on an employee’s wages.
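A minimal sketch of such a regression, assuming a plain linear model of Wage on the two numeric fields, is shown below. It is fitted to the six rows printed by `head(myframe)` purely to demonstrate the call, so the coefficients are not meaningful; `toy` and `fit` are hypothetical names.

```r
# First six rows of myframe as printed above (illustration only).
toy <- data.frame(
  Wage           = c(355, 123, 370, 755, 594, 377),
  Education_Yrs  = c(7, 12, 9, 11, 12, 16),
  Experience_Yrs = c(45, 1, 9, 46, 36, 22)
)

# Ordinary least squares: each slope estimates the change in Wage for
# one extra year of that field, holding the other field fixed.
fit <- lm(Wage ~ Education_Yrs + Experience_Yrs, data = toy)
coef(fit)
```

On the full data, `lm(Wage ~ Education_Yrs + Experience_Yrs + Part_Time + Metro + Region + Ethnicity, data = myframe)` followed by `summary()` would cover every field in the problem statement. Because wages are right-skewed, modeling `log(Wage)` instead is a common choice, though that is a suggestion rather than something tested here.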

  1. BONUS – place the original .csv in a github file and have R read from the link. This will be a very useful skill as you progress in your data science education and career.
library(readr)
mygiturl <- "https://raw.githubusercontent.com/RonBalaban/CUNY-SPS-R/main/CPS1988.csv"
mydata <- read_csv(url(mygiturl))
## New names:
## Rows: 28155 Columns: 8
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (4): ethnicity, smsa, region, parttime dbl (4): ...1, wage, education,
## experience
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
head(mydata)
## # A tibble: 6 × 8
##    ...1  wage education experience ethnicity smsa  region    parttime
##   <dbl> <dbl>     <dbl>      <dbl> <chr>     <chr> <chr>     <chr>   
## 1     1  355.         7         45 cauc      yes   northeast no      
## 2     2  123.        12          1 cauc      yes   northeast yes     
## 3     3  370.         9          9 cauc      yes   northeast no      
## 4     4  755.        11         46 cauc      yes   northeast no      
## 5     5  594.        12         36 cauc      yes   northeast no      
## 6     6  377.        16         22 cauc      yes   northeast no