This is a final project to show off what you have learned. Select your data set from the list below: http://vincentarelbundock.github.io/Rdatasets/ (click on the csv index for a list). Another good source is found here: https://archive.ics.uci.edu/ml/datasets.html
Start with a problem statement at the beginning and make sure to answer it at the end with what you learned. The presentation approach is up to you but it should contain the following:
Problem Statement: This project looks for insight into whether wages are affected by education and experience, and whether full/part-time status, location, or ethnicity affects an employee’s wage.
csvurl <- "https://vincentarelbundock.github.io/Rdatasets/csv/AER/CPS1988.csv"
# Determinants of Wages Data (CPS 1988)
mydata <- read.table(file=csvurl, header=TRUE, sep=",")
print(head(mydata, 20))
## X wage education experience ethnicity smsa region parttime
## 1 1 354.94 7 45 cauc yes northeast no
## 2 2 123.46 12 1 cauc yes northeast yes
## 3 3 370.37 9 9 cauc yes northeast no
## 4 4 754.94 11 46 cauc yes northeast no
## 5 5 593.54 12 36 cauc yes northeast no
## 6 6 377.23 16 22 cauc yes northeast no
## 7 7 284.90 8 51 cauc yes northeast no
## 8 8 561.13 12 34 cauc yes northeast no
## 9 9 264.06 12 0 cauc yes northeast no
## 10 10 1643.83 14 18 cauc yes northeast no
## 11 11 474.83 12 17 cauc yes northeast no
## 12 12 299.15 8 42 cauc yes northeast no
## 13 13 244.88 10 10 cauc yes northeast no
## 14 14 474.83 14 19 cauc no northeast no
## 15 15 213.68 12 40 cauc no northeast no
## 16 16 864.20 16 42 cauc no northeast no
## 17 17 841.93 14 27 cauc no northeast no
## 18 18 301.96 16 -1 cauc no northeast no
## 19 19 669.28 12 17 cauc no northeast no
## 20 20 403.61 12 42 cauc no northeast no
summary(mydata)
## X wage education experience
## Min. : 1 Min. : 50.05 Min. : 0.00 Min. :-4.0
## 1st Qu.: 7040 1st Qu.: 308.64 1st Qu.:12.00 1st Qu.: 8.0
## Median :14078 Median : 522.32 Median :12.00 Median :16.0
## Mean :14078 Mean : 603.73 Mean :13.07 Mean :18.2
## 3rd Qu.:21117 3rd Qu.: 783.48 3rd Qu.:15.00 3rd Qu.:27.0
## Max. :28155 Max. :18777.20 Max. :18.00 Max. :63.0
## ethnicity smsa region parttime
## Length:28155 Length:28155 Length:28155 Length:28155
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
Note: smsa refers to ‘Standard Metropolitan Statistical Area’, according to the US Office of Management and Budget.
From a cursory glance at rows 16 and 20, a few differences stand out. Employee 16 (E16) and Employee 20 (E20) both have 42 years of experience, live outside an SMSA in the northeast, and work full-time. However, E16 has 16 years of education and E20 has 12, and E16’s wage is higher (864.20 vs 403.61).
Similarly, E11 and E19 both have 12 years of education and 17 years of experience, and both work full-time in the northeast. However, E11 lives in an SMSA and E19 doesn’t, yet E19’s wage is higher (669.28 vs 474.83). One would expect a metropolitan area to command a higher wage, but this doesn’t always seem to be the case.
From the summary statistics, the average wage across the 28,155 employees is 603.73, with a maximum of 18,777.20, which I would presume belongs to a CEO. The mean sits above the median (522.32), so rather than following a normal distribution the wages show a heavy right skew, driven by the few employees who command a large salary. Education ranges from 0 to 18 years, with a median of 12 and a mean of 13.07, and most employees fall in the 12-18 range. The quartiles bear this out: the first quarter of employees spans 0-12 years, the second sits entirely at 12 (the 1st quartile and the median are both 12), the third spans 12-15, and the last spans 15-18.
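The mean-versus-median reading and the quartile boundaries can be computed directly instead of being read off `summary()`. The following is a minimal sketch that uses only the wage and education values from the 20 rows printed earlier; the variable names `edu_quartiles`, `wage_gap`, and `log_wage_gap` are illustrative, and on the full data `mydata$wage` and `mydata$education` would take the place of the two vectors.

```r
# Values copied from the 20 rows printed above; on the full data,
# use mydata$wage and mydata$education instead.
wage <- c(354.94, 123.46, 370.37, 754.94, 593.54, 377.23, 284.90, 561.13,
          264.06, 1643.83, 474.83, 299.15, 244.88, 474.83, 213.68, 864.20,
          841.93, 301.96, 669.28, 403.61)
education <- c(7, 12, 9, 11, 12, 16, 8, 12, 12, 14,
               12, 8, 10, 14, 12, 16, 14, 16, 12, 12)

# Quartile boundaries of education (minimum, Q1, median, Q3, maximum).
edu_quartiles <- quantile(education, probs = c(0, 0.25, 0.5, 0.75, 1))
print(edu_quartiles)

# Mean minus median: a positive gap points to a right skew.
wage_gap <- mean(wage) - median(wage)
print(wage_gap)

# The same gap on the log scale, where wage data is often closer to symmetric.
log_wage_gap <- mean(log(wage)) - median(log(wage))
print(log_wage_gap)
```

A histogram of `log(wage)` would be a natural visual follow-up to the same check.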
In terms of experience, the values range from -4 to 63 years. A negative value for experience looks impossible at first; the AER documentation for this data set describes the field as potential experience, computed as age - education - 6, which is how negative values can arise. I will still remove those instances. From my own analysis of the raw data, the 438 employees with negative experience have 13-18 years of education and wages ranging from 51.44 to 3806.58. These rows are a small share of the original 28,155 (about 1.6%), so I have no qualms subsetting them out.
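The count and ranges quoted for the negative-experience rows can be reproduced with a few lines of base R. This is a sketch on a small stand-in data frame named `demo`, which is an assumption for illustration: a few rows mirror rows printed earlier and the rest are invented. On the real data, `mydata` would replace `demo`.

```r
# Hypothetical stand-in for mydata; some rows mirror rows printed above,
# the others are invented for illustration.
demo <- data.frame(
  wage       = c(301.96, 354.94, 120.00, 864.20, 95.50),
  education  = c(16, 7, 18, 16, 17),
  experience = c(-1, 45, -3, 42, -2)
)

# Rows that the subset() step would drop.
negative <- demo[demo$experience < 0, ]

n_negative     <- nrow(negative)             # how many rows would be dropped
share_negative <- n_negative / nrow(demo)    # as a fraction of all rows
education_span <- range(negative$education)  # min and max education among them
wage_span      <- range(negative$wage)       # min and max wage among them

print(list(n = n_negative, share = share_negative,
           education = education_span, wage = wage_span))
```

Checking the share before dropping rows makes it easier to judge whether the exclusion could change the later results.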
myframe <- data.frame(mydata,stringsAsFactors=FALSE)
myframe <- subset(myframe,experience >= 0)
colnames(myframe) <- c("Employee","Wage","Education_Yrs","Experience_Yrs","Ethnicity","Metro","Region","Part_Time")
myframe$Wage <- round(myframe$Wage,digits = 0) # Round to nearest whole number
head(myframe)
## Employee Wage Education_Yrs Experience_Yrs Ethnicity Metro Region
## 1 1 355 7 45 cauc yes northeast
## 2 2 123 12 1 cauc yes northeast
## 3 3 370 9 9 cauc yes northeast
## 4 4 755 11 46 cauc yes northeast
## 5 5 594 12 36 cauc yes northeast
## 6 6 377 16 22 cauc yes northeast
## Part_Time
## 1 no
## 2 yes
## 3 no
## 4 no
## 5 no
## 6 no
library(scatterplot3d)
library(rgl)
library(ggplot2)
#library(ggthemes)
# Histogram
hist(myframe$Education_Yrs, breaks= 20, main = "Frequency of Years of Education",xlab="Education",col="lightgreen",labels=TRUE)
# Box Plot
boxplot(myframe$Education_Yrs, main="Years of Education Box Plot", ylab="Education")
# Scatterplots
ggplot(myframe, aes(x=Wage, y=Region)) + geom_point() + facet_wrap(~Region)
ggplot(myframe, aes(x=Wage, y=Education_Yrs)) + geom_point()
ggplot(myframe, aes(x=Wage, y=Experience_Yrs)) + geom_point()
ggplot(myframe, aes(x=Experience_Yrs, y=Education_Yrs)) + geom_point(aes(color=Wage))
scatterplot3d(myframe$Education_Yrs,myframe$Experience_Yrs,myframe$Wage, pch=16, highlight.3d=TRUE, type="h", main="Education & Experience vs Wage", xlab="Education",ylab="Experience",zlab="Wage") # 3D scatterplot of Education, Experience, Wage
plot3d(myframe$Education_Yrs,myframe$Experience_Yrs,myframe$Wage,col="black", size=3,xlab="Education",ylab="Experience",zlab="Wage")
# Interactive 3D scatterplot of Education, Experience, Wage
ggplot(myframe, aes(x=Experience_Yrs, y=Education_Yrs)) + geom_point(aes(color=Metro)) + facet_wrap(~Region)
# From this we can see that region seemingly has little to no effect on how Education and Experience relate; note that Wage itself is not encoded in this plot.
ggplot(myframe, aes(x=Experience_Yrs, y=Education_Yrs)) + geom_point(aes(color=Part_Time)) + facet_wrap(~Region)
# Part/Full-time status has some bearing, but not as much as Education and Experience.
# Out of curiosity, let's check the 1-to-1 correlation between Wage and Education/Experience
cor(myframe$Wage, myframe$Education_Yrs, method = c("pearson")) # Correlation is 0.3112436
## [1] 0.3112436
cor(myframe$Wage, myframe$Experience_Yrs, method = c("pearson")) # Correlation is 0.1793643
## [1] 0.1793643
cor(myframe$Wage, myframe$Education_Yrs, method = c("kendall")) # Correlation is 0.2588596
## [1] 0.2588596
cor(myframe$Wage, myframe$Experience_Yrs, method = c("kendall")) # Correlation is 0.2071639
## [1] 0.2071639
cor(myframe$Wage, myframe$Education_Yrs, method = c("spearman")) # Correlation is 0.3519162
## [1] 0.3519162
cor(myframe$Wage, myframe$Experience_Yrs, method = c("spearman")) # Correlation is 0.2942816
## [1] 0.2942816
# Regardless of the method used, Education is more strongly correlated with Wage than Experience is.
My initial question was which variables hold significant sway over an employee’s wages, among the fields provided in the data set: years of education, years of experience, full/part-time status, ethnicity, region, and location within a metropolitan area.
Full/part-time status and ethnicity seem to have little to no impact on wages, as there are several part-time employees earning more than full-time employees and vice versa. The downward-facing slope in the scatterplots suggests that wages depend more on an employee’s education and experience than on any other factor.
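Individual part-time employees out-earning full-time ones does not by itself settle the question, so a group-level summary is a useful cross-check. The sketch below computes median wages per group on a hypothetical data frame `demo` that borrows the renamed column names of `myframe`; all values, and the ethnicity label `afam`, are assumptions for illustration.

```r
# Hypothetical rows using the renamed columns of myframe; values are illustrative.
demo <- data.frame(
  Wage      = c(355, 123, 370, 755, 210, 640, 180, 500),
  Part_Time = c("no", "yes", "no", "no", "yes", "no", "yes", "no"),
  Ethnicity = c("cauc", "cauc", "afam", "cauc", "afam", "afam", "cauc", "cauc")
)

# Median wage for each Part_Time / Ethnicity combination.
wage_by_group <- aggregate(Wage ~ Part_Time + Ethnicity, data = demo, FUN = median)
print(wage_by_group)

# Median wage by Part_Time alone.
median_by_status <- tapply(demo$Wage, demo$Part_Time, median)
print(median_by_status)
```

Medians are used rather than means so that a handful of very high earners does not dominate the comparison.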
The same holds for whether or not the employee lives within a metropolitan area. While each region has different high-value outliers, the scatterplots still show the bulk of wages falling between 0 and 5000, with a few statistical outliers earning more than 7500.
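The 5000 and 7500 thresholds read off the scatterplots can be turned into counts. The following sketch works on a hypothetical data frame `demo` with invented values; the names `share_under_5000`, `n_over_7500`, and `p95_by_region` are illustrative only.

```r
# Hypothetical rows using the renamed columns of myframe; values are illustrative.
demo <- data.frame(
  Wage   = c(355, 594, 8200, 410, 1644, 520, 300, 5600),
  Metro  = c("yes", "yes", "yes", "no", "yes", "no", "no", "yes"),
  Region = c("northeast", "northeast", "south", "south",
             "west", "west", "midwest", "midwest")
)

# Fraction of employees earning at most 5000.
share_under_5000 <- mean(demo$Wage <= 5000)

# Number of employees earning more than 7500.
n_over_7500 <- sum(demo$Wage > 7500)

# 95th percentile of Wage within each region, to compare the upper tails.
p95_by_region <- tapply(demo$Wage, demo$Region, quantile, probs = 0.95)

print(share_under_5000)
print(n_over_7500)
print(p95_by_region)
```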
Looking at the correlations of Education and Experience with Wage, Education appears to have the greater impact. However, judging by the Wage vs Education and Wage vs Experience scatterplots, I believe this is skewed by statistical outliers, specifically those earning above 7500.
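One way to test the suspicion about outliers is to recompute the correlations with the high earners excluded and compare. The sketch below uses simulated data, not CPS1988, with a few extreme wages planted by hand, so it illustrates the mechanism only and says nothing about what the real data would show. The 7500 cutoff comes from the text; everything else is an assumption.

```r
# Simulated data, not CPS1988: wages driven by education, plus planted outliers.
set.seed(1)
n <- 500
Education_Yrs <- sample(8:18, n, replace = TRUE)
Wage <- 200 + 40 * Education_Yrs + rnorm(n, sd = 100)
Wage[1:5] <- c(9000, 12000, 15000, 8000, 18000)  # planted extreme earners
sim <- data.frame(Wage, Education_Yrs)

cutoff <- 7500                         # threshold mentioned in the text
trimmed <- subset(sim, Wage <= cutoff) # same data without the extreme earners

# Pearson is sensitive to extreme values; Spearman works on ranks and is less so.
pearson_all     <- cor(sim$Wage, sim$Education_Yrs, method = "pearson")
pearson_trimmed <- cor(trimmed$Wage, trimmed$Education_Yrs, method = "pearson")
spearman_all    <- cor(sim$Wage, sim$Education_Yrs, method = "spearman")

print(c(pearson_all = pearson_all,
        pearson_trimmed = pearson_trimmed,
        spearman_all = spearman_all))
```

On the real data the same comparison would use `myframe` in place of `sim`, for both `Education_Yrs` and `Experience_Yrs`.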
From the 3D scatterplot above, it seems that years of experience have a higher impact on wages than education, and that both matter more than all other fields. When Experience is 0, those with more education have a higher wage; as Experience increases while Education stays the same, Wage increases as well.
When years of experience are the same, having more education can lead to higher wages. With regression analysis it may be possible to assign a precise numerical value to how strongly each field weighs on an employee’s wages.
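A linear regression is one way to put numbers on these relationships. The sketch below fits `lm()` to simulated data, not CPS1988; the coefficients used to generate the data are assumptions, and modelling `log(Wage)` is a choice made here because wages are right-skewed, not something taken from the analysis above.

```r
# Simulated data, not CPS1988; the generating coefficients are assumptions.
set.seed(42)
n <- 300
sim <- data.frame(
  Education_Yrs  = sample(8:18, n, replace = TRUE),
  Experience_Yrs = sample(0:40, n, replace = TRUE),
  Part_Time      = sample(c("no", "yes"), n, replace = TRUE, prob = c(0.9, 0.1))
)
sim$Wage <- exp(4.5 + 0.08 * sim$Education_Yrs + 0.03 * sim$Experience_Yrs -
                0.8 * (sim$Part_Time == "yes") + rnorm(n, sd = 0.4))

# Regress log wage on all three fields at once, so each coefficient is
# estimated while holding the other fields fixed.
fit <- lm(log(Wage) ~ Education_Yrs + Experience_Yrs + Part_Time, data = sim)

# Estimates, standard errors, t values, and p values for each term.
print(summary(fit)$coefficients)
```

On the real data, a comparable call might be `lm(log(Wage) ~ Education_Yrs + Experience_Yrs + Ethnicity + Metro + Region + Part_Time, data = myframe)`. With a log outcome, each coefficient is roughly the proportional change in wage per one-unit change in that field.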
library(readr)
mygiturl <- "https://raw.githubusercontent.com/RonBalaban/CUNY-SPS-R/main/CPS1988.csv"
mydata <- read_csv(mygiturl) # read_csv() accepts a URL directly
## New names:
## • `` -> `...1`
## Rows: 28155 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): ethnicity, smsa, region, parttime
## dbl (4): ...1, wage, education, experience
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(mydata)
## # A tibble: 6 × 8
## ...1 wage education experience ethnicity smsa region parttime
## <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr>
## 1 1 355. 7 45 cauc yes northeast no
## 2 2 123. 12 1 cauc yes northeast yes
## 3 3 370. 9 9 cauc yes northeast no
## 4 4 755. 11 46 cauc yes northeast no
## 5 5 594. 12 36 cauc yes northeast no
## 6 6 377. 16 22 cauc yes northeast no