Introduction

Today we will be looking at a dataset provided by the Australian national government. It reports the number of managers in Australia, broken down by gender and age group, for 2014-15 through 2016-17. We can tell that this dataset is open because it is licensed under Creative Commons, which signals that it is freely available for reuse. Here is the source to download the dataset and the link to the actual website. However, the data is untidy: the column headers contain values (gender and year) rather than variable names, so it is a good idea to fix this before proceeding with the analysis.

suppressPackageStartupMessages(library("tidyverse"))
library("knitr")
library("kableExtra")
library("broom")
library("jtools")
main <- read.csv("https://data.gov.au/dataset/03da4508-0db1-4ea1-bd00-b3e8e5cd7089/resource/c15ab147-318f-40fc-a5d1-c32eed092314/download/sdg2018-5-5-2-number-of-managers-by-gender-and-age.csv", stringsAsFactors = FALSE)
kable(main) %>% kable_styling(bootstrap_options = c("striped", "hover"))
Age                Female.2014.15  Male.2014.15  Female.2015.16  Male.2015.16  Female.2016.17  Male.2016.17
15-34              158.6           212.6         152.2           205.6         151.9           204.1
35-44              140.0           254.4         152.0           255.1         142.6           259.4
45-54              146.0           261.9         134.5           258.1         145.8           261.8
55-64              81.2            182.8         86.9            193.3         89.8            187.2
15-64              525.8           911.7         525.6           912.1         530.1           912.5
65+                24.1            67.0          29.4            72.1          28.2            71.2
15 years and over  549.9           978.7         555.0           984.2         558.3           983.6
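# Work on a copy so the raw data stays untouched
main_1 <- main
# Rename columns to "gender-year"; each year label is the start of the financial year
names(main_1) <- c("age", "Female-2014", "Male-2014", "Female-2015", "Male-2015",
                   "Female-2016", "Male-2016")
# Reshape from wide to long: one row per age group, gender, and year
main_1 <- gather(main_1, "gender-year", "numberOfManagers", -"age")
# Split the combined key into separate gender and year columns
main_1 <- separate(main_1, "gender-year", c("gender", "year"), "-")
# Shorten the label of the all-ages row
main_1$age <- str_replace(main_1$age, "15 years and over", "total")
# Preview the first seven rows
m1 <- head(main_1, 7)
kable(m1) %>% kable_styling(bootstrap_options = c("striped", "hover"))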
age    gender  year  numberOfManagers
15-34  Female  2014  158.6
35-44  Female  2014  140.0
45-54  Female  2014  146.0
55-64  Female  2014  81.2
15-64  Female  2014  525.8
65+    Female  2014  24.1
total  Female  2014  549.9

Here is the previous dataset, tidied up and ready to be used for analysis (only the first seven rows are shown). Although the original layout gives a better overview of the data, its structure makes analysis in R much more difficult.
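The same reshaping can also be written with `pivot_longer()`, which tidyr now recommends over `gather()` for new code. The following is a minimal sketch, not the code used for this report: it rebuilds two rows of the table by hand (the `wide` and `long` names are illustrative) and splits the column names on the hyphen in a single step.

```r
library(tidyr)

# Illustrative subset: two age groups and two years copied from the table above
wide <- data.frame(
  age = c("15-34", "35-44"),
  `Female-2014` = c(158.6, 140.0),
  `Male-2014`   = c(212.6, 254.4),
  `Female-2015` = c(152.2, 152.0),
  `Male-2015`   = c(205.6, 255.1),
  check.names = FALSE  # keep the hyphens in the column names
)

# Reshape to long form and split "gender-year" into two columns at once
long <- pivot_longer(
  wide,
  cols = -age,
  names_to = c("gender", "year"),
  names_sep = "-",
  names_transform = list(year = as.integer),  # store year as a number
  values_to = "numberOfManagers"
)
long
```

The rows come out grouped by age group rather than by original column, as they do with `gather()`, but the contents are the same.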

main_1$year <- as.numeric(main_1$year)
ggplot(data = main_1 %>% filter(age == "total"),
       aes(year, numberOfManagers, col = gender)) +
  geom_line() +
  geom_point() +
  scale_x_continuous(breaks = c(2014, 2015, 2016)) +
  scale_y_continuous(limits = c(0, 1200), breaks = seq(0, 1200, by = 200)) +
  labs(x = "Year",
       y = "Total managers employed",
       title = "Female vs Male managers in Australia",
       caption = "Chart 1") +
  theme_minimal()

Here is a time-series chart of the total number of managers employed in each year, separated by gender.

slopem<- main_1%>%filter(age== "total" & gender== "Male" )
slopef<- main_1%>%filter(age== "total" & gender== "Female")

femaleGrowthRate <-lm(slopef$numberOfManagers ~ slopef$year)
maleGrowthRate <- lm(slopem$numberOfManagers ~ slopem$year)

summary(femaleGrowthRate)
## 
## Call:
## lm(formula = slopef$numberOfManagers ~ slopef$year)
## 
## Residuals:
##    1    2    3 
## -0.3  0.6 -0.3 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -7908.6000  1047.0248  -7.553   0.0838 .
## slopef$year     4.2000     0.5196   8.083   0.0784 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7348 on 1 degrees of freedom
## Multiple R-squared:  0.9849, Adjusted R-squared:  0.9698 
## F-statistic: 65.33 on 1 and 1 DF,  p-value: 0.07836
summary(maleGrowthRate)
## 
## Call:
## lm(formula = slopem$numberOfManagers ~ slopem$year)
## 
## Residuals:
##      1      2      3 
## -1.017  2.033 -1.017 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3954.583   3548.251  -1.115    0.466
## slopem$year     2.450      1.761   1.391    0.397
## 
## Residual standard error: 2.49 on 1 degrees of freedom
## Multiple R-squared:  0.6594, Adjusted R-squared:  0.3187 
## F-statistic: 1.936 on 1 and 1 DF,  p-value: 0.3967

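Here are simple linear regressions of total managers on year, fitted separately for women and men. As the previous chart suggested, there is practically no trend in the data: the estimated slopes (about 4.2 per year for women and 2.45 for men) are small relative to totals of roughly 550 and 980.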
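With only three observations per gender, a plain percentage change may be easier to read than a regression slope. The sketch below recomputes it from the total rows of the table at the top; the `totals` data frame and the `pct_change` column are names introduced here for illustration.

```r
# Totals ("15 years and over") copied from the table above
totals <- data.frame(
  gender = c("Female", "Male"),
  first  = c(549.9, 978.7),  # 2014-15
  last   = c(558.3, 983.6)   # 2016-17
)

# Percentage change between the first and last year
totals$pct_change <- 100 * (totals$last - totals$first) / totals$first
totals
```

Both changes come out below 2% over the two-year span, which is consistent with the flat lines in Chart 1.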

main_avg <-aggregate(main_1[,-c(1:3)] ,by= list(main_1$age),mean)
main_avg <-main_avg%>% filter(Group.1 != "15-64")%>%filter(Group.1 != "total")
ggplot(data= main_avg, aes(Group.1,x))+ labs(
  x= "age",
  y= "count of managers",
  title= "Managers by age over 2014-2016 average",
  caption= "Chart 2"
)+ geom_col(position = "dodge", fill = "lightblue") + theme_minimal()

Here is a column chart showing the average number of managers in each age group, with the 15-64 subtotal and the overall total excluded.
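Note that averaging all rows of an age group mixes the male and female counts, so each bar is the mean of six gender-year cells rather than the average number of managers per year. One way to get the latter, sketched here on two age groups copied from the table (the `cells`, `per_year`, `avg_total` and `avg_cell` names are illustrative), is to sum the genders within each year first and then average over the years.

```r
# Illustrative subset of the tidy data: two age groups, both genders, three years
cells <- data.frame(
  age    = rep(c("35-44", "65+"), each = 6),
  gender = rep(rep(c("Female", "Male"), each = 3), times = 2),
  year   = rep(2014:2016, times = 4),
  numberOfManagers = c(140.0, 152.0, 142.6, 254.4, 255.1, 259.4,
                       24.1, 29.4, 28.2, 67.0, 72.1, 71.2)
)

# Step 1: total managers per age group and year (women + men)
per_year <- aggregate(numberOfManagers ~ age + year, data = cells, FUN = sum)

# Step 2: average those yearly totals over the three years
avg_total <- aggregate(numberOfManagers ~ age, data = per_year, FUN = mean)

# For comparison: the mean over all six gender-year cells, as in Chart 2
avg_cell <- aggregate(numberOfManagers ~ age, data = cells, FUN = mean)

avg_total
avg_cell
```

Because every age group has the same six cells, the yearly-total averages are exactly twice the cell averages, so the shape of Chart 2 is unchanged and only the scale of the y-axis differs.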

Takeaways

Looking at the linear regressions, we can tell that there has been barely any growth in the number of managers in Australia over the three years from 2014-15 to 2016-17. However, since each regression line is based on only three data points, it would serve as a poor predictor of future growth in manager numbers. Furthermore, both p-values are above 0.05 (about 0.08 for women and 0.40 for men), so neither slope is statistically significant at the 5% level. There is also a large gap between the number of male and female managers, as shown in Chart 1: men outnumber women in every year. Looking at Chart 2, the largest age groups are 35-44 and 45-54. After 45-54 there is a steep decline in the number of managers. A possible explanation is that managers retire somewhat earlier than 65 because of their higher salaries. Overall, I would guess that these numbers are fairly typical of countries of similar size and GDP.

One issue I had while preparing the data was getting R to show the code that loads the packages and dataset without also showing the conflict message. I tried using results='hide', but the conflict message still appeared, most likely because that option hides printed results rather than messages. I fixed this by wrapping the call in the suppressPackageStartupMessages() function. I also looked at packages that give a cleaner display of the regression output, but I could not get the output I was looking for with them.
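For a more compact regression display, one option is `tidy()` from the broom package (already loaded at the top), which returns the coefficient table as a data frame that `kable()` can print. This is a sketch of that idea rather than the output used in this report; the `totals` data frame and the `fit_by_gender` helper are hypothetical names, with values copied from the total rows of the table.

```r
library(broom)
library(knitr)

# Total rows ("15 years and over") copied from the table above
totals <- data.frame(
  gender = rep(c("Female", "Male"), each = 3),
  year = rep(2014:2016, times = 2),
  numberOfManagers = c(549.9, 555.0, 558.3, 978.7, 984.2, 983.6)
)

# Hypothetical helper: fit one gender and return its coefficient table
fit_by_gender <- function(g) {
  fit <- lm(numberOfManagers ~ year, data = totals[totals$gender == g, ])
  coefs <- as.data.frame(tidy(fit))  # term, estimate, std.error, statistic, p.value
  cbind(gender = g, coefs)
}

# Stack both tables and print them as one compact table
coef_table <- rbind(fit_by_gender("Female"), fit_by_gender("Male"))
kable(coef_table, digits = 3)
```

Passing the data through the `data` argument instead of writing `slopef$year` in the formula also gives the coefficient the plain name `year` in the output.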