Exploratory Data Analysis- Wages of employees in a company.

Introduction: This is a data of wages of the employees of a company briefing about the salary with respect to the relevant detailing. With different exploratory analysis methods, lets predict the conclusion from the given data set.
Descriptive statistics on the data set: Using the readr() and read_csv function to read the wage csv file.

library(readr)
wage_with_na <- read_csv("wage.csv")

## Rows: 525 Columns: 7

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): gender, race
## dbl (5): married, hourly_wage, years_in_education, years_in_employment, num_...

## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(wage_with_na) # displaying with NA values

## # A tibble: 6 × 7
##   married hourly_wage years_in_education years_in_employm… num_dependents gender
##     <dbl>       <dbl>              <dbl>             <dbl>          <dbl> <chr> 
## 1       1        3.24                 12                 2              3 female
## 2       0        3                    11                 0              2 male  
## 3       1        6                     8                28              0 male  
## 4       1        5.3                  12                 2              1 male  
## 5       1        8.75                 16                 8              0 male  
## 6       0       11.2                  18                 7              0 male  
## # … with 1 more variable: race <chr>

summary(wage_with_na) #summary of the data

##     married        hourly_wage     years_in_education years_in_employment
##  Min.   :0.0000   Min.   : 0.530   Min.   : 0.00      Min.   : 0.000     
##  1st Qu.:0.0000   1st Qu.: 3.350   1st Qu.:12.00      1st Qu.: 0.000     
##  Median :1.0000   Median : 4.670   Median :12.00      Median : 2.000     
##  Mean   :0.6092   Mean   : 5.918   Mean   :12.56      Mean   : 5.152     
##  3rd Qu.:1.0000   3rd Qu.: 6.880   3rd Qu.:14.00      3rd Qu.: 7.000     
##  Max.   :1.0000   Max.   :24.980   Max.   :18.00      Max.   :44.000     
##  NA's   :3        NA's   :8        NA's   :3          NA's   :6          
##  num_dependents     gender              race          
##  Min.   :0.000   Length:525         Length:525        
##  1st Qu.:0.000   Class :character   Class :character  
##  Median :1.000   Mode  :character   Mode  :character  
##  Mean   :1.044                                        
##  3rd Qu.:2.000                                        
##  Max.   :6.000                                        
##  NA's   :5

dim(wage_with_na) # rows and columns of the data

## [1] 525   7

Check and handle the missing data as appropriate if needed.

Calculating the number of cells with NA value.

sum(is.na(wage_with_na))

## [1] 39

Compared to the number of entries , the values with NAs’ are eliminated since its less than 10% of the given data. It could also be interpolated with the mean or median but those values could impact the summary of the model, hence it is omitted using the function na.omit()

wage<- na.omit(wage_with_na)
sum(is.na((wage))) #Confirming that everything is omitted

## [1] 0

dim(wage) #Number of rows and columns after omitting null values

## [1] 487   7

To define the descriptive statistics of the data the functions used are str() and summary().

str(wage)

## tibble [487 × 7] (S3: tbl_df/tbl/data.frame)
##  $ married            : num [1:487] 1 0 1 1 1 0 0 0 1 0 ...
##  $ hourly_wage        : num [1:487] 3.24 3 6 5.3 8.75 ...
##  $ years_in_education : num [1:487] 12 11 8 12 16 18 12 12 17 16 ...
##  $ years_in_employment: num [1:487] 2 0 28 2 8 7 3 4 21 2 ...
##  $ num_dependents     : num [1:487] 3 2 0 1 0 0 0 2 0 0 ...
##  $ gender             : chr [1:487] "female" "male" "male" "male" ...
##  $ race               : chr [1:487] "white" "white" "white" "white" ...
##  - attr(*, "na.action")= 'omit' Named int [1:38] 17 28 34 42 53 60 63 100 111 135 ...
##   ..- attr(*, "names")= chr [1:38] "17" "28" "34" "42" ...

summary(wage)

##     married        hourly_wage     years_in_education years_in_employment
##  Min.   :0.0000   Min.   : 0.530   Min.   : 0.00      Min.   : 0.000     
##  1st Qu.:0.0000   1st Qu.: 3.350   1st Qu.:12.00      1st Qu.: 0.000     
##  Median :1.0000   Median : 4.690   Median :12.00      Median : 2.000     
##  Mean   :0.6119   Mean   : 5.912   Mean   :12.58      Mean   : 4.986     
##  3rd Qu.:1.0000   3rd Qu.: 6.880   3rd Qu.:14.00      3rd Qu.: 6.000     
##  Max.   :1.0000   Max.   :22.860   Max.   :18.00      Max.   :44.000     
##  num_dependents     gender              race          
##  Min.   :0.000   Length:487         Length:487        
##  1st Qu.:0.000   Class :character   Class :character  
##  Median :1.000   Mode  :character   Mode  :character  
##  Mean   :1.033                                        
##  3rd Qu.:2.000                                        
##  Max.   :6.000

Thus by observing the statistical report, it is observed that the data deals with four continuous variables and three categorical variables. It also displays the mean,median and different quadrant of each numerical column(excluding NA).

Visualizing the distribution of one or more individual continuous variables of the dataset.

For the rest of the analysis, installing the most common packages of R.

#install.packages("dplyr")
#install.packages("ggplot2")
library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

wage_race<-filter(wage, `race` =='nonwhite') #extracting relevant tibbles using filter()
ggplot(wage_race, aes(x=hourly_wage,y=years_in_education))+geom_point(col="red",size=3)+ggtitle(" Hourly Wage for non white vs Years in Education ") +xlab("Hourly wage") + ylab("Years in education") #displaying the scatter plot using geom_point() of ggplot package.

The graph displays hourly wage with respect to years in education for non white race.

wage_gender<-filter(wage, `gender` =='male') 
ggplot(wage_gender, aes(x=hourly_wage))+geom_bar(col="blue",size=2)+ggtitle(" Hourly Wage for male vs Years in Employement ") +xlab("Hourly wage") + ylab("Years In employement") #displaying the bar chart using geom_bar() of ggplot package.

The graph displays hourly wage with respect to years in employment for male gender.

wage_add<-filter(wage, `gender` =='male',`race`=='white')
wage_add

## # A tibble: 228 × 7
##    married hourly_wage years_in_education years_in_employ… num_dependents gender
##      <dbl>       <dbl>              <dbl>            <dbl>          <dbl> <chr> 
##  1       0        3                    11                0              2 male  
##  2       1        6                     8               28              0 male  
##  3       1        5.3                  12                2              1 male  
##  4       1        8.75                 16                8              0 male  
##  5       0       11.2                  18                7              0 male  
##  6       1       18.2                  17               21              0 male  
##  7       1        8.77                 12                0              2 male  
##  8       0        5.5                  12                3              0 male  
##  9       1       22.2                  12               15              1 male  
## 10       1       17.3                  16                0              1 male  
## # … with 218 more rows, and 1 more variable: race <chr>

ggplot(wage_add, aes(x=hourly_wage, y=num_dependents))+geom_bin_2d(col="sky blue",size=2)+ggtitle("Hourly  Wage for white male vs Number of dependents(Add. info)")

The graph displays hourly wage with respect to Number of dependents for male belonging to white race.

ggplot(wage,aes(y=years_in_education))+geom_boxplot(fill="yellow")+ggtitle("Years in Education")+ylab("Number of Years")

ggplot(wage,aes(y=years_in_employment))+geom_boxplot(fill="green")+ggtitle("Years in Employment")+ylab("Number of Years")

ggplot(wage,aes(y=hourly_wage))+geom_boxplot(fill="cyan")+ggtitle("Hourly Wage")+ylab("Hourly Wage")

The above box plots show the quartiles of the data while the whisker denotes the minimum, first and third quartile and median of the data with the points representing the outlines.

5.Visualizing the relationship in a pair of continuous variables and determining the correlation between them.

Using tapply() function to each of the ragged array by certain level of factors.

mean_emp=wage%>%group_by(years_in_employment)%>%summarise(mean(hourly_wage))
mean_emp

## # A tibble: 34 × 2
##    years_in_employment `mean(hourly_wage)`
##                  <dbl>               <dbl>
##  1                   0                4.55
##  2                   1                4.49
##  3                   2                5.56
##  4                   3                6.27
##  5                   4                6.45
##  6                   5                7.05
##  7                   6                5.46
##  8                   7                8.99
##  9                   8                7.20
## 10                   9                6.37
## # … with 24 more rows

Thus extracting the average hourly wage with respect to the work experience.

ggplot(mean_emp,aes(x=years_in_employment,y=`mean(hourly_wage)`))+geom_line()+ggtitle("Experience vs Avg Hourly Wage")+xlab("Years of Experience")+ylab("Avg Wage")

Looking at the visualization there is no relationship between years of experience and Average hourly wage.

mean_edu=wage%>%group_by(years_in_education)%>%summarise(mean(hourly_wage))
mean_edu

## # A tibble: 18 × 2
##    years_in_education `mean(hourly_wage)`
##                 <dbl>               <dbl>
##  1                  0                3.53
##  2                  2                3.75
##  3                  3                2.92
##  4                  4                3.17
##  5                  5                2.9 
##  6                  6                3.98
##  7                  7                4.39
##  8                  8                4.94
##  9                  9                3.28
## 10                 10                3.93
## 11                 11                4.31
## 12                 12                5.39
## 13                 13                5.49
## 14                 14                6.53
## 15                 15                5.88
## 16                 16                8.04
## 17                 17               11.3 
## 18                 18                9.88

ggplot(mean_edu,aes(x=years_in_education,y=`mean(hourly_wage)`))+geom_line()+ggtitle("Education vs Avg Hourly Wage")+xlab("Years of Education")+ylab("Avg Wage")

The visualization suggests that, thought there are few outliers, as the years of education increased hourly wage too increased comparatively.

6.Display unique values of categorical variables.

Using the unique function and dplyr’s pipe,group by and summarize function to display the unique values and its relative count.

unique(wage$married)

## [1] 1 0

unique(wage$race)

## [1] "white"    "nonwhite"

unique(wage$gender)

## [1] "female" "male"

wage %>%group_by(gender) %>% summarize(Number = n())

## # A tibble: 2 × 2
##   gender Number
##   <chr>   <int>
## 1 female    231
## 2 male      256

wage %>%group_by(married) %>% summarize(Number = n())

## # A tibble: 2 × 2
##   married Number
##     <dbl>  <int>
## 1       0    189
## 2       1    298

wage %>%group_by(race) %>% summarize(Number = n())

## # A tibble: 2 × 2
##   race     Number
##   <chr>     <int>
## 1 nonwhite     50
## 2 white       437

Building a contingency table of two potentially related categorical variables and Conducting a statistical test of the independence between them.

con_tab=table(wage$gender,wage$married)
con_tab

##         
##            0   1
##   female 107 124
##   male    82 174

The contingency table represents the number of occurrences of different values relatively between the two given values here are Gender and martial status. From which we could see majority of male and female are married.

chisq.test(con_tab)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  con_tab
## X-squared = 9.8472, df = 1, p-value = 0.001701

The p-value is less than the usual significance level of 0.05. Here the p-value is 0.001701. Therefore, we reject the null hypothesis that there is no dependence between the gender and whether the person is married or not.

8.Descriptive statistics on the subset(s).

i). Displaying the first subset with data of only female who are education atleast 8 years and are atmost 30 years experienced.

sub1<-filter(wage,gender=="female" & years_in_education>=8 & years_in_employment<=30)
head(sub1)

## # A tibble: 6 × 7
##   married hourly_wage years_in_education years_in_employm… num_dependents gender
##     <dbl>       <dbl>              <dbl>             <dbl>          <dbl> <chr> 
## 1       1        3.24                 12                 2              3 female
## 2       0        5                    12                 3              0 female
## 3       0        3.6                  12                 4              2 female
## 4       0        6.25                 16                 2              0 female
## 5       0        8.13                 13                 0              0 female
## 6       1        7.5                  12                 0              0 female
## # … with 1 more variable: race <chr>

summary(sub1)

##     married        hourly_wage     years_in_education years_in_employment
##  Min.   :0.0000   Min.   : 0.530   Min.   : 8.00      Min.   : 0.000     
##  1st Qu.:0.0000   1st Qu.: 3.100   1st Qu.:12.00      1st Qu.: 0.000     
##  Median :1.0000   Median : 3.810   Median :12.00      Median : 2.000     
##  Mean   :0.5381   Mean   : 4.667   Mean   :12.56      Mean   : 3.552     
##  3rd Qu.:1.0000   3rd Qu.: 5.815   3rd Qu.:14.00      3rd Qu.: 4.000     
##  Max.   :1.0000   Max.   :21.630   Max.   :18.00      Max.   :26.000     
##  num_dependents     gender              race          
##  Min.   :0.000   Length:223         Length:223        
##  1st Qu.:0.000   Class :character   Class :character  
##  Median :1.000   Mode  :character   Mode  :character  
##  Mean   :0.991                                        
##  3rd Qu.:2.000                                        
##  Max.   :5.000

ii). Displaying the second subset with data of only male who are education atmost 16 years and are greater than 15 years experienced.

sub2<-filter(wage,gender=="male" & years_in_education<=16 & years_in_employment>15)
head(sub2)

## # A tibble: 6 × 7
##   married hourly_wage years_in_education years_in_employm… num_dependents gender
##     <dbl>       <dbl>              <dbl>             <dbl>          <dbl> <chr> 
## 1       1        6                     8                28              0 male  
## 2       1       20.0                  14                23              2 male  
## 3       1       11.9                  13                19              2 male  
## 4       1        9.85                  8                24              2 male  
## 5       1       11.8                  14                39              0 male  
## 6       1       13.2                  12                22              0 male  
## # … with 1 more variable: race <chr>

summary(sub2)

##     married        hourly_wage     years_in_education years_in_employment
##  Min.   :0.0000   Min.   : 1.500   Min.   : 3.00      Min.   :16.00      
##  1st Qu.:1.0000   1st Qu.: 6.312   1st Qu.: 9.50      1st Qu.:20.25      
##  Median :1.0000   Median : 9.300   Median :12.00      Median :23.50      
##  Mean   :0.8667   Mean   :10.053   Mean   :11.83      Mean   :24.73      
##  3rd Qu.:1.0000   3rd Qu.:12.995   3rd Qu.:14.00      3rd Qu.:29.50      
##  Max.   :1.0000   Max.   :21.860   Max.   :16.00      Max.   :44.00      
##  num_dependents     gender              race          
##  Min.   :0.000   Length:30          Length:30         
##  1st Qu.:0.000   Class :character   Class :character  
##  Median :0.000   Mode  :character   Mode  :character  
##  Mean   :1.167                                        
##  3rd Qu.:2.000                                        
##  Max.   :5.000

Conducting a statistical test of the significance of the difference between the means of two subsets of the data. The independent t test is used here to find the significant difference between the mean of two different samples pf a dataset.

sub_female=wage[wage$`gender`=="female",]$hourly_wage
M1=mean(sub_female)
M1

## [1] 4.623333

sub_male=wage[wage$`gender`=="male",]$hourly_wage
M2=mean(sub_male)
M2

## [1] 7.074531

t.test(sub_male,sub_female)

## 
##  Welch Two Sample t-test
## 
## data:  sub_male and sub_female
## t = 8.0372, df = 431.5, p-value = 8.896e-15
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.851761 3.050635
## sample estimates:
## mean of x mean of y 
##  7.074531  4.623333

Summarized information for each group (e.g. the mean or sum within the group)

Using the recently introduced package to CRAN, pivottabler, to form a table within groups.

#install.packages("pivottabler")
library(pivottabler)

tab <- PivotTable$new(tableStyle=list("border-color"="black"),headingStyle=list("color"="white", "background-color"="green","font-style"="bold", "border-color"="black"),cellStyle=list("color"="black", "background-color"="cornsilk","border-color"="maroon"),totalStyle=list("color"="maroon", "background-color"="cornsilk","border-color"="maroon", "font-weight"="bold"))
tab$addData(wage)
tab$addColumnDataGroups("race")
tab$addRowDataGroups("gender")
tab$addRowDataGroups("married")
tab$defineCalculation(calculationName="hourly_wage", caption="Avg Hourly Rate",summariseExpression="mean(hourly_wage)")
tab$defineCalculation(calculationName="years_in_employment", caption="Avg years of Experience",summariseExpression="mean(years_in_employment)")
tab$defineCalculation(calculationName="years_in_education", caption="Avg years of Education",summariseExpression="mean(years_in_education)")
tab$renderPivot()

The table here represents gender based on the race and marital status across their average hourly wage,average years in education and average years in employment.

Linear regression model and interpreting its output.

Replacing dummy values using fastdummies from the library for the character variables to perform regression models.

library(fastDummies)
wage <- dummy_cols(wage, select_columns = c('race','gender'))
glimpse(wage)

## Rows: 487
## Columns: 11
## $ married             <dbl> 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1,…
## $ hourly_wage         <dbl> 3.24, 3.00, 6.00, 5.30, 8.75, 11.25, 5.00, 3.60, 1…
## $ years_in_education  <dbl> 12, 11, 8, 12, 16, 18, 12, 12, 17, 16, 13, 12, 12,…
## $ years_in_employment <dbl> 2, 0, 28, 2, 8, 7, 3, 4, 21, 2, 0, 0, 3, 15, 0, 0,…
## $ num_dependents      <dbl> 3, 2, 0, 1, 0, 0, 0, 2, 0, 0, 0, 2, 0, 1, 1, 0, 3,…
## $ gender              <chr> "female", "male", "male", "male", "male", "male", …
## $ race                <chr> "white", "white", "white", "white", "white", "whit…
## $ race_nonwhite       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ race_white          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ gender_female       <int> 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1,…
## $ gender_male         <int> 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0,…

As per the glimpse of the data, the gender and race has been given dummy values in a seperate column.

cor.test(wage$hourly_wage, wage$race_white, method="pearson")

## 
##  Pearson's product-moment correlation
## 
## data:  wage$hourly_wage and wage$race_white
## t = 0.49871, df = 485, p-value = 0.6182
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.06634835  0.11126993
## sample estimates:
##        cor 
## 0.02263944

From the correlation test, we could conclude that there is positive correlation between the variables- Hourly wage and employee from white race.

Most of the cases ,multiple linear model is run to get a better understanding of the linearity between the models.

model = lm(hourly_wage~gender_female, data=wage)
model

## 
## Call:
## lm(formula = hourly_wage ~ gender_female, data = wage)
## 
## Coefficients:
##   (Intercept)  gender_female  
##         7.075         -2.451

summary(model)

## 
## Call:
## lm(formula = hourly_wage ~ gender_female, data = wage)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.5745 -1.8245 -0.8733  1.5411 17.0067 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     7.0745     0.2149  32.918  < 2e-16 ***
## gender_female  -2.4512     0.3121  -7.855 2.59e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.439 on 485 degrees of freedom
## Multiple R-squared:  0.1129, Adjusted R-squared:  0.111 
## F-statistic:  61.7 on 1 and 485 DF,  p-value: 2.591e-14

The R-squared and the adjusted R-squared value indicates that the hourly wage for a female gender is not a very good fit of the model to the data and suggests other factors influence the hourly wage, which our model does not take into account.

sl_resd=rstandard(model)
hist(sl_resd)

The histogram distribution suggests that the residuals are almost distributed around 0 but not eqaully and normally, and also we could see some outliers towards the far end of the distribution.

Multiple regression model:

multi_model = lm(hourly_wage ~ gender_male + years_in_education + years_in_employment + num_dependents, data=wage)
summary(multi_model)

## 
## Call:
## lm(formula = hourly_wage ~ gender_male + years_in_education + 
##     years_in_employment + num_dependents, data = wage)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.6546 -1.8046 -0.5301  1.0495 14.1479 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -2.73705    0.67512  -4.054 5.87e-05 ***
## gender_male          1.76555    0.27344   6.457 2.61e-10 ***
## years_in_education   0.53474    0.04903  10.905  < 2e-16 ***
## years_in_employment  0.16161    0.01908   8.469 3.00e-16 ***
## num_dependents       0.18262    0.10923   1.672   0.0952 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.947 on 482 degrees of freedom
## Multiple R-squared:  0.3523, Adjusted R-squared:  0.3469 
## F-statistic: 65.53 on 4 and 482 DF,  p-value: < 2.2e-16

The coefficients are the values represented in the second table as all the values are positive it denotes that gender being male positivly affect by the hourly wages. And also the R^2 and the adjusted R^2 values have gone up from the 0.1 rate of the simple linear regression to around 0.3. This indicates that the addition of extra independent variables has helped to improve the model.

st_resids = rstandard(multi_model)

plot(x=multi_model$fitted.values, y=st_resids, abline(h=0), xlab="Standardized residuals", ylab="Fitted values", main = "Residual plot",col="red")

The residual plot suggests that most of the residuals are distributed around the horizontal line but also as seen in the simple model we do have outliners towards the end extreme ends.

library(moments)
jarque.test(st_resids)

## 
##  Jarque-Bera Normality Test
## 
## data:  st_resids
## JB = 673.02, p-value < 2.2e-16
## alternative hypothesis: greater

The Jarque-Bera test indicates that the residuals are normally distributed and the p value is less the 0.05 significance level.

Thus,by adding additional variables the quality and assumptions of the linear regression method with respect to the model.

Conclusion:

As the outcome of the data analysis process it is evident that the entries are mostly independent but our target (salary) is dependet on various factors like age, gender and education.