Introduction: This is a data of wages of the employees of a company briefing about the salary with respect to the relevant detailing. With different exploratory analysis methods, lets predict the conclusion from the given data set.
Descriptive statistics on the data set: Using the readr() and read_csv function to read the wage csv file.
library(readr)
wage_with_na <- read_csv("wage.csv")
## Rows: 525 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): gender, race
## dbl (5): married, hourly_wage, years_in_education, years_in_employment, num_...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(wage_with_na) # displaying with NA values
## # A tibble: 6 × 7
## married hourly_wage years_in_education years_in_employm… num_dependents gender
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 1 3.24 12 2 3 female
## 2 0 3 11 0 2 male
## 3 1 6 8 28 0 male
## 4 1 5.3 12 2 1 male
## 5 1 8.75 16 8 0 male
## 6 0 11.2 18 7 0 male
## # … with 1 more variable: race <chr>
summary(wage_with_na) #summary of the data
## married hourly_wage years_in_education years_in_employment
## Min. :0.0000 Min. : 0.530 Min. : 0.00 Min. : 0.000
## 1st Qu.:0.0000 1st Qu.: 3.350 1st Qu.:12.00 1st Qu.: 0.000
## Median :1.0000 Median : 4.670 Median :12.00 Median : 2.000
## Mean :0.6092 Mean : 5.918 Mean :12.56 Mean : 5.152
## 3rd Qu.:1.0000 3rd Qu.: 6.880 3rd Qu.:14.00 3rd Qu.: 7.000
## Max. :1.0000 Max. :24.980 Max. :18.00 Max. :44.000
## NA's :3 NA's :8 NA's :3 NA's :6
## num_dependents gender race
## Min. :0.000 Length:525 Length:525
## 1st Qu.:0.000 Class :character Class :character
## Median :1.000 Mode :character Mode :character
## Mean :1.044
## 3rd Qu.:2.000
## Max. :6.000
## NA's :5
dim(wage_with_na) # rows and columns of the data
## [1] 525 7
Calculating the number of cells with NA value.
sum(is.na(wage_with_na))
## [1] 39
Compared to the number of entries , the values with NAs’ are eliminated since its less than 10% of the given data. It could also be interpolated with the mean or median but those values could impact the summary of the model, hence it is omitted using the function na.omit()
wage<- na.omit(wage_with_na)
sum(is.na((wage))) #Confirming that everything is omitted
## [1] 0
dim(wage) #Number of rows and columns after omitting null values
## [1] 487 7
To define the descriptive statistics of the data the functions used are str() and summary().
str(wage)
## tibble [487 × 7] (S3: tbl_df/tbl/data.frame)
## $ married : num [1:487] 1 0 1 1 1 0 0 0 1 0 ...
## $ hourly_wage : num [1:487] 3.24 3 6 5.3 8.75 ...
## $ years_in_education : num [1:487] 12 11 8 12 16 18 12 12 17 16 ...
## $ years_in_employment: num [1:487] 2 0 28 2 8 7 3 4 21 2 ...
## $ num_dependents : num [1:487] 3 2 0 1 0 0 0 2 0 0 ...
## $ gender : chr [1:487] "female" "male" "male" "male" ...
## $ race : chr [1:487] "white" "white" "white" "white" ...
## - attr(*, "na.action")= 'omit' Named int [1:38] 17 28 34 42 53 60 63 100 111 135 ...
## ..- attr(*, "names")= chr [1:38] "17" "28" "34" "42" ...
summary(wage)
## married hourly_wage years_in_education years_in_employment
## Min. :0.0000 Min. : 0.530 Min. : 0.00 Min. : 0.000
## 1st Qu.:0.0000 1st Qu.: 3.350 1st Qu.:12.00 1st Qu.: 0.000
## Median :1.0000 Median : 4.690 Median :12.00 Median : 2.000
## Mean :0.6119 Mean : 5.912 Mean :12.58 Mean : 4.986
## 3rd Qu.:1.0000 3rd Qu.: 6.880 3rd Qu.:14.00 3rd Qu.: 6.000
## Max. :1.0000 Max. :22.860 Max. :18.00 Max. :44.000
## num_dependents gender race
## Min. :0.000 Length:487 Length:487
## 1st Qu.:0.000 Class :character Class :character
## Median :1.000 Mode :character Mode :character
## Mean :1.033
## 3rd Qu.:2.000
## Max. :6.000
Thus by observing the statistical report, it is observed that the data deals with four continuous variables and three categorical variables. It also displays the mean,median and different quadrant of each numerical column(excluding NA).
For the rest of the analysis, installing the most common packages of R.
#install.packages("dplyr")
#install.packages("ggplot2")
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
wage_race<-filter(wage, `race` =='nonwhite') #extracting relevant tibbles using filter()
ggplot(wage_race, aes(x=hourly_wage,y=years_in_education))+geom_point(col="red",size=3)+ggtitle(" Hourly Wage for non white vs Years in Education ") +xlab("Hourly wage") + ylab("Years in education") #displaying the scatter plot using geom_point() of ggplot package.
The graph displays hourly wage with respect to years in education for non white race.
wage_gender<-filter(wage, `gender` =='male')
ggplot(wage_gender, aes(x=hourly_wage))+geom_bar(col="blue",size=2)+ggtitle(" Hourly Wage for male vs Years in Employement ") +xlab("Hourly wage") + ylab("Years In employement") #displaying the bar chart using geom_bar() of ggplot package.
The graph displays hourly wage with respect to years in employment for male gender.
wage_add<-filter(wage, `gender` =='male',`race`=='white')
wage_add
## # A tibble: 228 × 7
## married hourly_wage years_in_education years_in_employ… num_dependents gender
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 0 3 11 0 2 male
## 2 1 6 8 28 0 male
## 3 1 5.3 12 2 1 male
## 4 1 8.75 16 8 0 male
## 5 0 11.2 18 7 0 male
## 6 1 18.2 17 21 0 male
## 7 1 8.77 12 0 2 male
## 8 0 5.5 12 3 0 male
## 9 1 22.2 12 15 1 male
## 10 1 17.3 16 0 1 male
## # … with 218 more rows, and 1 more variable: race <chr>
ggplot(wage_add, aes(x=hourly_wage, y=num_dependents))+geom_bin_2d(col="sky blue",size=2)+ggtitle("Hourly Wage for white male vs Number of dependents(Add. info)")
The graph displays hourly wage with respect to Number of dependents for male belonging to white race.
ggplot(wage,aes(y=years_in_education))+geom_boxplot(fill="yellow")+ggtitle("Years in Education")+ylab("Number of Years")
ggplot(wage,aes(y=years_in_employment))+geom_boxplot(fill="green")+ggtitle("Years in Employment")+ylab("Number of Years")
ggplot(wage,aes(y=hourly_wage))+geom_boxplot(fill="cyan")+ggtitle("Hourly Wage")+ylab("Hourly Wage")
The above box plots show the quartiles of the data while the whisker denotes the minimum, first and third quartile and median of the data with the points representing the outlines.
5.Visualizing the relationship in a pair of continuous variables and determining the correlation between them.
Using tapply() function to each of the ragged array by certain level of factors.
mean_emp=wage%>%group_by(years_in_employment)%>%summarise(mean(hourly_wage))
mean_emp
## # A tibble: 34 × 2
## years_in_employment `mean(hourly_wage)`
## <dbl> <dbl>
## 1 0 4.55
## 2 1 4.49
## 3 2 5.56
## 4 3 6.27
## 5 4 6.45
## 6 5 7.05
## 7 6 5.46
## 8 7 8.99
## 9 8 7.20
## 10 9 6.37
## # … with 24 more rows
Thus extracting the average hourly wage with respect to the work experience.
ggplot(mean_emp,aes(x=years_in_employment,y=`mean(hourly_wage)`))+geom_line()+ggtitle("Experience vs Avg Hourly Wage")+xlab("Years of Experience")+ylab("Avg Wage")
Looking at the visualization there is no relationship between years of experience and Average hourly wage.
mean_edu=wage%>%group_by(years_in_education)%>%summarise(mean(hourly_wage))
mean_edu
## # A tibble: 18 × 2
## years_in_education `mean(hourly_wage)`
## <dbl> <dbl>
## 1 0 3.53
## 2 2 3.75
## 3 3 2.92
## 4 4 3.17
## 5 5 2.9
## 6 6 3.98
## 7 7 4.39
## 8 8 4.94
## 9 9 3.28
## 10 10 3.93
## 11 11 4.31
## 12 12 5.39
## 13 13 5.49
## 14 14 6.53
## 15 15 5.88
## 16 16 8.04
## 17 17 11.3
## 18 18 9.88
ggplot(mean_edu,aes(x=years_in_education,y=`mean(hourly_wage)`))+geom_line()+ggtitle("Education vs Avg Hourly Wage")+xlab("Years of Education")+ylab("Avg Wage")
The visualization suggests that, thought there are few outliers, as the years of education increased hourly wage too increased comparatively.
6.Display unique values of categorical variables.
Using the unique function and dplyr’s pipe,group by and summarize function to display the unique values and its relative count.
unique(wage$married)
## [1] 1 0
unique(wage$race)
## [1] "white" "nonwhite"
unique(wage$gender)
## [1] "female" "male"
wage %>%group_by(gender) %>% summarize(Number = n())
## # A tibble: 2 × 2
## gender Number
## <chr> <int>
## 1 female 231
## 2 male 256
wage %>%group_by(married) %>% summarize(Number = n())
## # A tibble: 2 × 2
## married Number
## <dbl> <int>
## 1 0 189
## 2 1 298
wage %>%group_by(race) %>% summarize(Number = n())
## # A tibble: 2 × 2
## race Number
## <chr> <int>
## 1 nonwhite 50
## 2 white 437
con_tab=table(wage$gender,wage$married)
con_tab
##
## 0 1
## female 107 124
## male 82 174
The contingency table represents the number of occurrences of different values relatively between the two given values here are Gender and martial status. From which we could see majority of male and female are married.
chisq.test(con_tab)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: con_tab
## X-squared = 9.8472, df = 1, p-value = 0.001701
The p-value is less than the usual significance level of 0.05. Here the p-value is 0.001701. Therefore, we reject the null hypothesis that there is no dependence between the gender and whether the person is married or not.
8.Descriptive statistics on the subset(s).
i). Displaying the first subset with data of only female who are education atleast 8 years and are atmost 30 years experienced.
sub1<-filter(wage,gender=="female" & years_in_education>=8 & years_in_employment<=30)
head(sub1)
## # A tibble: 6 × 7
## married hourly_wage years_in_education years_in_employm… num_dependents gender
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 1 3.24 12 2 3 female
## 2 0 5 12 3 0 female
## 3 0 3.6 12 4 2 female
## 4 0 6.25 16 2 0 female
## 5 0 8.13 13 0 0 female
## 6 1 7.5 12 0 0 female
## # … with 1 more variable: race <chr>
summary(sub1)
## married hourly_wage years_in_education years_in_employment
## Min. :0.0000 Min. : 0.530 Min. : 8.00 Min. : 0.000
## 1st Qu.:0.0000 1st Qu.: 3.100 1st Qu.:12.00 1st Qu.: 0.000
## Median :1.0000 Median : 3.810 Median :12.00 Median : 2.000
## Mean :0.5381 Mean : 4.667 Mean :12.56 Mean : 3.552
## 3rd Qu.:1.0000 3rd Qu.: 5.815 3rd Qu.:14.00 3rd Qu.: 4.000
## Max. :1.0000 Max. :21.630 Max. :18.00 Max. :26.000
## num_dependents gender race
## Min. :0.000 Length:223 Length:223
## 1st Qu.:0.000 Class :character Class :character
## Median :1.000 Mode :character Mode :character
## Mean :0.991
## 3rd Qu.:2.000
## Max. :5.000
ii). Displaying the second subset with data of only male who are education atmost 16 years and are greater than 15 years experienced.
sub2<-filter(wage,gender=="male" & years_in_education<=16 & years_in_employment>15)
head(sub2)
## # A tibble: 6 × 7
## married hourly_wage years_in_education years_in_employm… num_dependents gender
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 1 6 8 28 0 male
## 2 1 20.0 14 23 2 male
## 3 1 11.9 13 19 2 male
## 4 1 9.85 8 24 2 male
## 5 1 11.8 14 39 0 male
## 6 1 13.2 12 22 0 male
## # … with 1 more variable: race <chr>
summary(sub2)
## married hourly_wage years_in_education years_in_employment
## Min. :0.0000 Min. : 1.500 Min. : 3.00 Min. :16.00
## 1st Qu.:1.0000 1st Qu.: 6.312 1st Qu.: 9.50 1st Qu.:20.25
## Median :1.0000 Median : 9.300 Median :12.00 Median :23.50
## Mean :0.8667 Mean :10.053 Mean :11.83 Mean :24.73
## 3rd Qu.:1.0000 3rd Qu.:12.995 3rd Qu.:14.00 3rd Qu.:29.50
## Max. :1.0000 Max. :21.860 Max. :16.00 Max. :44.00
## num_dependents gender race
## Min. :0.000 Length:30 Length:30
## 1st Qu.:0.000 Class :character Class :character
## Median :0.000 Mode :character Mode :character
## Mean :1.167
## 3rd Qu.:2.000
## Max. :5.000
sub_female=wage[wage$`gender`=="female",]$hourly_wage
M1=mean(sub_female)
M1
## [1] 4.623333
sub_male=wage[wage$`gender`=="male",]$hourly_wage
M2=mean(sub_male)
M2
## [1] 7.074531
t.test(sub_male,sub_female)
##
## Welch Two Sample t-test
##
## data: sub_male and sub_female
## t = 8.0372, df = 431.5, p-value = 8.896e-15
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1.851761 3.050635
## sample estimates:
## mean of x mean of y
## 7.074531 4.623333
Using the recently introduced package to CRAN, pivottabler, to form a table within groups.
#install.packages("pivottabler")
library(pivottabler)
tab <- PivotTable$new(tableStyle=list("border-color"="black"),headingStyle=list("color"="white", "background-color"="green","font-style"="bold", "border-color"="black"),cellStyle=list("color"="black", "background-color"="cornsilk","border-color"="maroon"),totalStyle=list("color"="maroon", "background-color"="cornsilk","border-color"="maroon", "font-weight"="bold"))
tab$addData(wage)
tab$addColumnDataGroups("race")
tab$addRowDataGroups("gender")
tab$addRowDataGroups("married")
tab$defineCalculation(calculationName="hourly_wage", caption="Avg Hourly Rate",summariseExpression="mean(hourly_wage)")
tab$defineCalculation(calculationName="years_in_employment", caption="Avg years of Experience",summariseExpression="mean(years_in_employment)")
tab$defineCalculation(calculationName="years_in_education", caption="Avg years of Education",summariseExpression="mean(years_in_education)")
tab$renderPivot()
The table here represents gender based on the race and marital status across their average hourly wage,average years in education and average years in employment.
Replacing dummy values using fastdummies from the library for the character variables to perform regression models.
library(fastDummies)
wage <- dummy_cols(wage, select_columns = c('race','gender'))
glimpse(wage)
## Rows: 487
## Columns: 11
## $ married <dbl> 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1,…
## $ hourly_wage <dbl> 3.24, 3.00, 6.00, 5.30, 8.75, 11.25, 5.00, 3.60, 1…
## $ years_in_education <dbl> 12, 11, 8, 12, 16, 18, 12, 12, 17, 16, 13, 12, 12,…
## $ years_in_employment <dbl> 2, 0, 28, 2, 8, 7, 3, 4, 21, 2, 0, 0, 3, 15, 0, 0,…
## $ num_dependents <dbl> 3, 2, 0, 1, 0, 0, 0, 2, 0, 0, 0, 2, 0, 1, 1, 0, 3,…
## $ gender <chr> "female", "male", "male", "male", "male", "male", …
## $ race <chr> "white", "white", "white", "white", "white", "whit…
## $ race_nonwhite <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ race_white <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ gender_female <int> 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1,…
## $ gender_male <int> 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0,…
As per the glimpse of the data, the gender and race has been given dummy values in a seperate column.
cor.test(wage$hourly_wage, wage$race_white, method="pearson")
##
## Pearson's product-moment correlation
##
## data: wage$hourly_wage and wage$race_white
## t = 0.49871, df = 485, p-value = 0.6182
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.06634835 0.11126993
## sample estimates:
## cor
## 0.02263944
From the correlation test, we could conclude that there is positive correlation between the variables- Hourly wage and employee from white race.
Most of the cases ,multiple linear model is run to get a better understanding of the linearity between the models.
model = lm(hourly_wage~gender_female, data=wage)
model
##
## Call:
## lm(formula = hourly_wage ~ gender_female, data = wage)
##
## Coefficients:
## (Intercept) gender_female
## 7.075 -2.451
summary(model)
##
## Call:
## lm(formula = hourly_wage ~ gender_female, data = wage)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.5745 -1.8245 -0.8733 1.5411 17.0067
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.0745 0.2149 32.918 < 2e-16 ***
## gender_female -2.4512 0.3121 -7.855 2.59e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.439 on 485 degrees of freedom
## Multiple R-squared: 0.1129, Adjusted R-squared: 0.111
## F-statistic: 61.7 on 1 and 485 DF, p-value: 2.591e-14
The R-squared and the adjusted R-squared value indicates that the hourly wage for a female gender is not a very good fit of the model to the data and suggests other factors influence the hourly wage, which our model does not take into account.
sl_resd=rstandard(model)
hist(sl_resd)
The histogram distribution suggests that the residuals are almost distributed around 0 but not eqaully and normally, and also we could see some outliers towards the far end of the distribution.
multi_model = lm(hourly_wage ~ gender_male + years_in_education + years_in_employment + num_dependents, data=wage)
summary(multi_model)
##
## Call:
## lm(formula = hourly_wage ~ gender_male + years_in_education +
## years_in_employment + num_dependents, data = wage)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.6546 -1.8046 -0.5301 1.0495 14.1479
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.73705 0.67512 -4.054 5.87e-05 ***
## gender_male 1.76555 0.27344 6.457 2.61e-10 ***
## years_in_education 0.53474 0.04903 10.905 < 2e-16 ***
## years_in_employment 0.16161 0.01908 8.469 3.00e-16 ***
## num_dependents 0.18262 0.10923 1.672 0.0952 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.947 on 482 degrees of freedom
## Multiple R-squared: 0.3523, Adjusted R-squared: 0.3469
## F-statistic: 65.53 on 4 and 482 DF, p-value: < 2.2e-16
The coefficients are the values represented in the second table as all the values are positive it denotes that gender being male positivly affect by the hourly wages. And also the R^2 and the adjusted R^2 values have gone up from the 0.1 rate of the simple linear regression to around 0.3. This indicates that the addition of extra independent variables has helped to improve the model.
st_resids = rstandard(multi_model)
plot(x=multi_model$fitted.values, y=st_resids, abline(h=0), xlab="Standardized residuals", ylab="Fitted values", main = "Residual plot",col="red")
The residual plot suggests that most of the residuals are distributed around the horizontal line but also as seen in the simple model we do have outliners towards the end extreme ends.
library(moments)
jarque.test(st_resids)
##
## Jarque-Bera Normality Test
##
## data: st_resids
## JB = 673.02, p-value < 2.2e-16
## alternative hypothesis: greater
The Jarque-Bera test indicates that the residuals are normally distributed and the p value is less the 0.05 significance level.
Thus,by adding additional variables the quality and assumptions of the linear regression method with respect to the model.
As the outcome of the data analysis process it is evident that the entries are mostly independent but our target (salary) is dependet on various factors like age, gender and education.