library(haven)
## Warning: package 'haven' was built under R version 3.5.3
DVLFS <- read_dta("C:/Users/Saira Rasul/Desktop/data-viz/DVLFS.dta")
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.5.3
library(MASS)
library(plyr)
## Warning: package 'plyr' was built under R version 3.5.3

“Storytelling Using Real Data from Pakistan: The Labor Force Survey of Pakistan (LFS) for the Year 2017-18”

The data used here come from the Labor Force Survey of Pakistan (LFS) for the year 2017-18. We mainly study wages, using different variables and different plots to check the following effects. The main objectives are:

1-The overall comparison of wages by gender.

2-The comparison of wages between genders (female/male) at the provincial as well as the regional level.

3-The overall labor involvement at the provincial as well as the regional level with respect to gender.

4-The impact of formal and non-formal education on wages.

5-The contribution of labor by age to the labor force of Pakistan.

1-The overall comparison of wages by gender.

First, we check the overall impact of gender on wages, i.e., we compare male and female salaries in Pakistan. The graphs show that, on average, males earn more than females in Pakistan. For this analysis, we first draw a density plot.

DVLFS[, c("wages", "gender")]
## # A tibble: 27,345 x 2
##    wages gender
##    <dbl>  <dbl>
##  1  82.5      1
##  2  60.2      1
##  3  73.8      1
##  4  65.4      1
##  5  96.2      1
##  6 248.       1
##  7 265.       1
##  8 248.       1
##  9  85.9      1
## 10  95.6      1
## # ... with 27,335 more rows
DVLFS$gender<-factor(DVLFS$gender)
DVLFS$gender<-revalue(DVLFS$gender, c("0"="Female", "1"="Male"))
p<-ggplot(DVLFS, aes(x=wages, colour=gender)) + geom_density()
p+labs(title="Comparison of wages among gender", x="Wage",y="Density")
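
The density plot compares the shape of the two wage distributions. A short numeric summary can back up the "on average" statement. The sketch below is only an illustration: `toy` is a made-up data frame standing in for `DVLFS`, and it is assumed that the real data has the same `wages` and `gender` columns.

```r
# Toy data standing in for DVLFS (values are made up for illustration only)
toy <- data.frame(
  wages  = c(80, 95, 250, 60, 70, 120),
  gender = factor(c("Male", "Male", "Male", "Female", "Female", "Female"))
)

# Mean and median wage for each gender
mean_wage   <- tapply(toy$wages, toy$gender, mean)
median_wage <- tapply(toy$wages, toy$gender, median)

print(mean_wage)
print(median_wage)
```

On the survey data the same two calls would be `tapply(DVLFS$wages, DVLFS$gender, mean)` and `tapply(DVLFS$wages, DVLFS$gender, median)`. The median is worth reporting next to the mean because wage distributions are usually right-skewed.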

Now we draw a scatter plot of wages against age, colored by gender.

DVLFS[, c("wages", "gender")]
## # A tibble: 27,345 x 2
##    wages gender
##    <dbl> <fct> 
##  1  82.5 Male  
##  2  60.2 Male  
##  3  73.8 Male  
##  4  65.4 Male  
##  5  96.2 Male  
##  6 248.  Male  
##  7 265.  Male  
##  8 248.  Male  
##  9  85.9 Male  
## 10  95.6 Male  
## # ... with 27,335 more rows
# gender is already a factor with the levels Female/Male from the recoding above, so it is not recoded again
p<-ggplot(DVLFS, aes(x=Age, y=wages,colour=gender)) + geom_point()
p

p+labs(title="Age-wage distribution by gender", x="Age in years",y="Wages")

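# Map colour to the column name inside aes(); DVLFS$gender is not needed there
sps<- ggplot(DVLFS,aes(x=Age,y=wages,color=gender))+geom_point()+scale_color_brewer(palette = "Set1")
# sps already contains the points, so only the lines are added
sps+geom_line()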

The graphs above show that scatter and line plots represent this type of data poorly: they demand a lot of attention from the viewer before any pattern emerges. So we move to bar charts, which are much easier to read. Each bar adds up the log wages of all workers of that gender, so the picture reflects both pay and headcount: the male bar is taller because males earn more and also make up a larger share of the labor market.

DVLFS[, c("wages", "gender")]
## # A tibble: 27,345 x 2
##    wages gender
##    <dbl> <fct> 
##  1  82.5 Male  
##  2  60.2 Male  
##  3  73.8 Male  
##  4  65.4 Male  
##  5  96.2 Male  
##  6 248.  Male  
##  7 265.  Male  
##  8 248.  Male  
##  9  85.9 Male  
## 10  95.6 Male  
## # ... with 27,335 more rows
# gender already has the levels Female/Male, so no further recoding is needed
# fill=gender gives scale_fill_brewer() below something to colour
p<-ggplot(data=DVLFS, aes(x=gender, y=lnwages, fill=gender))+geom_bar(stat="identity")
p + scale_fill_brewer(palette="Greens") + theme_minimal()+labs(title="Bar plot Gender wise")
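
Because `geom_bar(stat="identity")` stacks one segment per row, the bar height above is the sum of `lnwages` over all workers of a gender, so it mixes wage level with the number of workers. To compare average pay alone, the means can be computed first and then plotted. This is a minimal sketch, not the original analysis: `toy` is a made-up data frame standing in for `DVLFS`, and `avg` and `p_avg` are names introduced here.

```r
library(ggplot2)

# Toy data standing in for DVLFS (made-up values, for illustration only)
toy <- data.frame(
  lnwages = c(4.4, 4.1, 5.5, 4.0, 4.2),
  gender  = factor(c("Male", "Male", "Male", "Female", "Female"))
)

# One row per gender holding the mean log wage
avg <- aggregate(lnwages ~ gender, data = toy, FUN = mean)

# geom_col() draws bar heights taken directly from the summarised data
p_avg <- ggplot(avg, aes(x = gender, y = lnwages, fill = gender)) +
  geom_col() +
  scale_fill_brewer(palette = "Greens") +
  theme_minimal() +
  labs(title = "Mean log wage by gender", x = "Gender", y = "Mean log wage")
p_avg
```

With the survey data, replacing `toy` by `DVLFS` in the `aggregate()` call should give one bar per gender whose height is the average log wage, independent of how many workers fall in each group.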

2-The comparison of wages between genders (female/male) at the provincial as well as the regional level.

DVLFS[, c("wages", "PROVINCE")]
## # A tibble: 27,345 x 2
##    wages               PROVINCE
##    <dbl>              <dbl+lbl>
##  1  82.5 1 [KHYBER PAKHTUNKHWA]
##  2  60.2 1 [KHYBER PAKHTUNKHWA]
##  3  73.8 1 [KHYBER PAKHTUNKHWA]
##  4  65.4 1 [KHYBER PAKHTUNKHWA]
##  5  96.2 1 [KHYBER PAKHTUNKHWA]
##  6 248.  1 [KHYBER PAKHTUNKHWA]
##  7 265.  1 [KHYBER PAKHTUNKHWA]
##  8 248.  1 [KHYBER PAKHTUNKHWA]
##  9  85.9 1 [KHYBER PAKHTUNKHWA]
## 10  95.6 1 [KHYBER PAKHTUNKHWA]
## # ... with 27,335 more rows
DVLFS$PROVINCE<-factor(DVLFS$PROVINCE)
DVLFS$PROVINCE<-revalue(DVLFS$PROVINCE, c("1"="KPK", "2"="Punjab" , "3"="Sindh" , "4"= "Baloch"))
p<-ggplot(DVLFS, aes(x=PROVINCE, y=lnwages, fill=gender))+geom_bar(stat="identity")
p + scale_fill_brewer(palette="Greens") + theme_minimal()+labs(title="Bar plot Province wise")

The provinces are also analyzed with a violin plot, where the results are clearer:

qplot(PROVINCE, lnwages, data = DVLFS, geom = "violin")  

Now we consider the regional impact on wages:

DVLFS[, c("wages", "Region")]
## # A tibble: 27,345 x 2
##    wages    Region
##    <dbl> <dbl+lbl>
##  1  82.5 2 [Urban]
##  2  60.2 2 [Urban]
##  3  73.8 2 [Urban]
##  4  65.4 2 [Urban]
##  5  96.2 2 [Urban]
##  6 248.  2 [Urban]
##  7 265.  2 [Urban]
##  8 248.  2 [Urban]
##  9  85.9 2 [Urban]
## 10  95.6 2 [Urban]
## # ... with 27,335 more rows
DVLFS$Region<-factor(DVLFS$Region)
DVLFS$Region<-revalue(DVLFS$Region, c("1"="Rural", "2"="Urban"))
p<-ggplot(data=DVLFS, aes(x=Region, y=lnwages, fill=gender))+geom_bar(stat="identity")
p + scale_fill_brewer(palette="Greens") + theme_minimal()+labs(title="Bar plot Region wise")

The regions are also analyzed with a violin plot, where the results are clearer:

qplot(Region, lnwages, data = DVLFS, geom = "violin")

3-The overall labor involvement at the provincial as well as the regional level with respect to gender.

In this third section, the impact of gender on wages is analyzed across provinces and regions with scatter plots. Both the provincial and the regional graphs show that males earn more than females. At the provincial level, Punjab contributes the most to the labor force; in the regional comparison, urban areas contribute more.

p<-ggplot(DVLFS, aes(x = PROVINCE, y = lnwages, color=gender)) + geom_point()
p+labs(title="Gender-based comparison of wages at the provincial level")

p<-ggplot(DVLFS , aes(x = Region, y = lnwages, color=gender)) + geom_point()
p+labs(title="Gender-based comparison of wages at the regional level")
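
Statements about which province or region contributes most to the labor force are statements about counts, which are hard to read off overlapping points. A cross-tabulation makes them explicit. The sketch below uses a made-up data frame `toy` in place of `DVLFS`; the column names `PROVINCE` and `gender` follow the code above, and `counts` and `shares` are names introduced here.

```r
# Toy data standing in for DVLFS (made-up values, for illustration only)
toy <- data.frame(
  PROVINCE = factor(c("KPK", "Punjab", "Punjab", "Punjab", "Sindh", "Baloch")),
  gender   = factor(c("Male", "Male", "Female", "Male", "Male", "Female"))
)

# Number of workers in each province/gender cell
counts <- table(toy$PROVINCE, toy$gender)

# Each cell as a share of all workers in the data
shares <- prop.table(counts)

print(counts)
print(round(shares, 2))
```

On the survey data, `table(DVLFS$PROVINCE, DVLFS$gender)` and `table(DVLFS$Region, DVLFS$gender)` would give the corresponding provincial and regional counts.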

4-The impact of formal and non-formal education on wages.

In this fourth section, we compare people with formal education against people with non-formal education in terms of overall earnings, using a box plot. The analysis shows that, on average, people with formal education earn more than people with non-formal education.

DVLFS[, c("wages", "nfe")]
## # A tibble: 27,345 x 2
##    wages   nfe
##    <dbl> <dbl>
##  1  82.5     0
##  2  60.2     0
##  3  73.8     0
##  4  65.4     0
##  5  96.2     0
##  6 248.      0
##  7 265.      0
##  8 248.      0
##  9  85.9     0
## 10  95.6     1
## # ... with 27,335 more rows
DVLFS$nfe<-factor(DVLFS$nfe)
DVLFS$nfe<-revalue(DVLFS$nfe, c("0"="noForEdu", "1"="ForEdu"))
p<-ggplot(DVLFS, aes(x=nfe, y=Age,fill=nfe)) + geom_boxplot()+ labs(x = "Formal and non-formal education", y = "Completed years of age", title = "Box plot of age by education type")
p
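
The box plot above puts `Age` on the y-axis, so it shows how old the workers in each education group are. To compare earnings between the two groups directly, the same plot can be drawn with the log wage on the y-axis. This is a hedged sketch rather than the original analysis: `toy` is a made-up data frame standing in for `DVLFS`, and `p_wage` is a name introduced here.

```r
library(ggplot2)

# Toy data standing in for DVLFS (made-up values, for illustration only)
toy <- data.frame(
  lnwages = c(4.4, 4.1, 5.5, 4.0, 4.2, 4.8),
  nfe     = factor(c("noForEdu", "noForEdu", "ForEdu",
                     "noForEdu", "ForEdu", "ForEdu"))
)

# Box plot of log wage for each education group
p_wage <- ggplot(toy, aes(x = nfe, y = lnwages, fill = nfe)) +
  geom_boxplot() +
  labs(x = "Education type", y = "Log wage",
       title = "Log wage by education type")
p_wage

# Median log wage per group, the line drawn inside each box
group_median <- tapply(toy$lnwages, toy$nfe, median)
print(group_median)
```

With the survey data, swapping `toy` for `DVLFS` should be enough, assuming `nfe` has already been converted to a factor as in the code above.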

5-The contribution of labor by age to the labor force of Pakistan.

Finally, in the last section, we compare the different age groups to check which one contributes most to the labor force of Pakistan. We find that people below 40 years of age make up the larger share of the labor force.

# "colors" is not a ggplot2 aesthetic; a fixed bar colour is set with fill= outside aes()
p <- ggplot(DVLFS, aes(x = Age)) + geom_histogram(fill = "green")
p +labs(title = "Age based distribution")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

p
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
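
The `stat_bin()` message above appears because no bin width was chosen. Setting `binwidth` explicitly removes the message and makes the age groups easier to interpret. The sketch below assumes five-year bins starting at age 15; both choices are illustrative, and `toy` is randomly generated data standing in for `DVLFS`.

```r
library(ggplot2)

# Randomly generated ages standing in for DVLFS$Age (illustration only)
set.seed(1)
toy <- data.frame(Age = sample(15:65, 200, replace = TRUE))

# Five-year bins aligned so that one bin edge falls at age 15
p_age <- ggplot(toy, aes(x = Age)) +
  geom_histogram(binwidth = 5, boundary = 15,
                 fill = "green", colour = "white") +
  labs(title = "Age based distribution", x = "Age in years", y = "Count")
p_age

# Share of workers younger than 40
share_under_40 <- mean(toy$Age < 40)
print(share_under_40)
```

On the survey data, `mean(DVLFS$Age < 40)` gives the share of workers below 40 directly, which puts a number on the pattern seen in the histogram.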

Taken together, the graphical analysis shows that the majority of the labor force in Pakistan consists of males, especially in Punjab. Overall, people from rural regions have a greater contribution, as seen in the recent labor force survey data of Pakistan. The labor force is most active up to about 40 years of age, and participation declines after 50. Finally, there is a clear-cut difference in overall earnings between people with formal and non-formal education.