DATA FELLOW SURVEY

BACKGROUND

Data Fellow is a community of data analysts and data scientists, where members can improve their analytic skills through hands-on, real-life projects. The community is currently experiencing poor participation in projects. This survey was conducted to identify the factors responsible for the low participation. It covers the number of respondents, level of proficiency, level of experience, members by project, and members by specialization.

DATASET

The data came from two tables. The survey table was cleaned using dplyr to remove missing data and then merged with the second table on the respondent's email address.

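# Load the packages used in this report
library(readxl)   # read_xlsx()
library(dplyr)    # filter(), left_join(), select(), count(), ...
library(ggplot2)  # charts
library(DT)       # datatable()

# Import data from Google Sheets (needs library(googlesheets4) if uncommented)
#data_l <- read_sheet("https://docs.google.com/spreadsheets/d/1RpZAFLFT84gitnUg9s0mB0KEcYTuIWr9jKrpdj4vSo8/edit#gid=1061786364")
#data_r <- read_sheet("https://docs.google.com/spreadsheets/d/1IVheNIn_Ns1w4kUd1-IMI3aVEUCui5N94JLQXBnDtCQ/edit#gid=908840308")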

# read data from folder
data_r <- read_xlsx("data/data_r.xlsx")
data_l <- read_xlsx("data/data_l.xlsx")

# Clean survey data

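# Keep respondents interested in Data Science or Data Analysis,
# and drop rows with a missing proficiency rating (real NA or the text "NA")
data_l <- data_l %>%
  filter(`Please state your interest of specialization` %in% c("Data Science", "Data Analysis")) %>%
  filter(!is.na(`rate your level of proficiency`),
         `rate your level of proficiency` != "NA")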

# Merge Data
data_merge <- data_l %>%
  left_join(data_r, by = c("Email Address" = "Email")) %>%
  select(2:9, 11:15)  # keep columns 2-9 and 11-15 by position

datatable(data_merge, fillContainer = FALSE)
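
One thing worth checking with a left join is whether the right-hand table repeats an email address, because each repeat adds an extra row for that respondent and would inflate the response count. The sketch below uses two small made-up tables (the emails and values are hypothetical, not survey data) to show the effect and one possible way to guard against it.

```r
library(dplyr)

# Toy stand-ins for data_l and data_r (hypothetical values, not survey data)
left <- tibble(
  `Email Address` = c("a@x.com", "b@x.com", "c@x.com"),
  level = c("Beginner", "Expert", "Beginner")
)
right <- tibble(
  Email = c("a@x.com", "b@x.com", "b@x.com"),
  project = c("Yes", "No", "Yes")
)

# Emails that appear more than once in the right-hand table
dup_keys <- right %>% count(Email) %>% filter(n > 1)

# A plain left join repeats the left row once per matching right row
merged <- left %>% left_join(right, by = c("Email Address" = "Email"))

# Keeping only the first row per email restores one row per respondent
merged_unique <- left %>%
  left_join(distinct(right, Email, .keep_all = TRUE),
            by = c("Email Address" = "Email"))

nrow(merged)         # more rows than `left`
nrow(merged_unique)  # same number of rows as `left`
```

Whether keeping the first row per email is the right rule depends on how the second table was collected, so treat that step as one option rather than the method used in this report.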

Number of respondents

Only 44 people responded to the survey, out of a membership of over 150. This means participation among group members is quite low.

responses <- nrow(data_merge)  # one row per respondent
responses

## [1] 44
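
As a rough illustration of the response rate behind this statement, the sketch below divides the 44 responses by an assumed membership of 150. The figure of 150 is an assumption, taken as the lower bound of "over 150", so the true rate is probably somewhat lower.

```r
responses <- 44   # respondents counted above
members <- 150    # assumed lower bound for community size

# Share of members who responded
response_rate <- responses / members
rate_label <- sprintf("%.1f%%", response_rate * 100)
rate_label
```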

Members by Project

Almost 60% of respondents have not participated in a project, while all data scientists have participated in one.

data_merge %>%
  # Keep the two columns of interest under shorter names
  select(
    project = `Have you worked on a project or you are currently working on one?`,
    specialization = `Please state your interest of specialization`
  ) %>%
  # Treat every answer other than "No" as "Yes"
  mutate(project = case_when(
    project == "No" ~ "No",
    TRUE ~ "Yes"
  )) %>%
  count(project, specialization) %>%
  # Share of all respondents in each project/specialization group
  mutate(percentage = n / sum(n) * 100) %>%
  ggplot(aes(x = project, y = percentage, fill = specialization)) +
  geom_col(position = "dodge")
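
The percentages in this chart are shares of all respondents, so the bars for one specialization do not add up to 100%. A statement such as "all data scientists have participated" is a share within a specialization instead. The sketch below contrasts the two denominators on made-up counts (the numbers are hypothetical and are not the survey results).

```r
library(dplyr)

# Hypothetical counts, for illustration only
toy <- tibble(
  project = c("No", "Yes", "Yes"),
  specialization = c("Data Analysis", "Data Analysis", "Data Science"),
  n = c(5, 3, 2)
)

shares <- toy %>%
  # Denominator: every respondent
  mutate(pct_of_all = n / sum(n) * 100) %>%
  # Denominator: respondents in the same specialization
  group_by(specialization) %>%
  mutate(pct_within = n / sum(n) * 100) %>%
  ungroup()

shares
```

In this toy table the Data Science row is 20% of all respondents but 100% of its own specialization, which is the distinction to keep in mind when reading the chart.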

Proficiency by level

The chart below shows that most community members are beginners, which also contributes to the low participation in projects. However, although every data scientist has worked on a project, some data scientists are still beginners.

# Format a proportion (0 to 1) as a percentage string, e.g. pct(0.25) gives "25.0%"
pct <- function(x, digits = 1) {
  sprintf(paste0("%1.", digits, "f%%"), x * 100)
}

# Count respondents by proficiency level and specialization
level_pro <- data_merge %>%
  count(`rate your level of proficiency`, `Please state your interest of specialization`) %>%
  mutate(p = n / sum(n))  # share of all respondents

ggplot(level_pro, aes(x = `rate your level of proficiency`, y = n,
                      fill = `Please state your interest of specialization`)) +
  geom_col(position = "dodge") +
  # Use the same dodge width as the bars (0.9) so each label sits on its own bar
  geom_text(aes(label = pct(p, 2), y = n / 2), position = position_dodge(width = 0.9)) +
  theme(
    legend.position = "top",
    legend.direction = "vertical",
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank()
  ) +
  labs(x = "", y = "")

Proficiency by tools

The chart shows that Python is the only tool common to all groups, and in particular to the data scientists, all of whom have worked on a project.

prof_tool <- data_merge %>%
  select(
    specialization = `Please state your interest of specialization`,
    proficiency = `rate your level of proficiency`,
    tool_raw = `Which of these tools do you have high proficiency in?`
  ) %>%
  # Normalise free-text answers to one tool per respondent;
  # answers not listed here become NA
  mutate(tools = case_when(
    tool_raw %in% c("Excel", "Microsoft Excel", "None, Just Excel") ~ "Excel",
    # This multi-tool answer is counted under Excel only
    tool_raw == "Excel, tableau and python" ~ "Excel",
    tool_raw %in% c("SQL", "R", "Python", "PowerBI", "Tableau", "None") ~ tool_raw
  )) %>%
  count(specialization, proficiency, tools) %>%
  arrange(n) %>%
  mutate(p = n / sum(n))  # share of all respondents

ggplot(prof_tool, aes(x = reorder(tools, -n), y = n)) +
  geom_col(fill = "steelblue") +
  geom_text(aes(label = pct(p, 2), y = n / 2)) +
  labs(x = "", y = "") +
  theme(
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank()
  ) +
  facet_wrap(proficiency ~ specialization)
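
Because the tool question accepted free text, an answer such as "Excel, tableau and python" names several tools but is counted under one. As an alternative sketch, such answers could be split so that each tool is counted separately. The example below uses base R only; the `answers` vector and the `canonical` lookup table are hypothetical illustrations, not part of the survey pipeline.

```r
# Hypothetical free-text answers, for illustration only
answers <- c("Python", "Excel, tableau and python", "Microsoft Excel")

# Lookup from lower-case spellings to one canonical tool name (assumed mapping)
canonical <- c(
  "excel" = "Excel",
  "microsoft excel" = "Excel",
  "python" = "Python",
  "tableau" = "Tableau"
)

# Split each answer on commas and on the word "and"
parts <- strsplit(answers, ",\\s*|\\s+and\\s+")

# Flatten, tidy the spelling, and map to canonical names
tools <- unname(canonical[tolower(trimws(unlist(parts)))])

# Count mentions per tool (one respondent can now contribute several)
tool_counts <- table(tools)
tool_counts
```

Counting mentions this way changes the denominator from respondents to tool mentions, so percentages from it would not be directly comparable with the chart above.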

Recommendations

About 60% of respondents have not participated in a project. They should be placed in groups with a data scientist, since all data scientists have project experience.

All data scientists have worked on a project, and most of them used Python. I recommend that the community organize Python sessions for data analysts, because this would give them more opportunity to work on projects.

Experts should be paired with beginners in groups to guide them.

Projects should not be too difficult, since the majority of respondents are beginners and have not participated in a project.

Awareness should be created about using R for data science and data analysis.