1 Data Acquisition

The data set used in this project is the Kaggle ML and Data Science Survey 2017. I downloaded the multiple-choice survey results in CSV format and placed the file in a GitHub repo (https://github.com/keshaws/CUNY_MSDS_2020/tree/master/DATA607).

Importing Multiple Choice data

library(tidyverse)  # read_csv, dplyr verbs, spread
library(knitr)      # kable

surverydata_link <- "https://raw.githubusercontent.com/keshaws/CUNY_MSDS_2020/master/DATA607/multipleChoiceResponses.csv"
surverydata_df <- read_csv(surverydata_link)
# keep a copy without the id column for the analysis below
survey.data <- surverydata_df
# create a unique ID variable
surverydata_df$id <- seq.int(nrow(surverydata_df))
dim(surverydata_df)
## [1] 16716   229


2 Research Question

Which are the most valued data science skills?

2.1 Understanding Features

Let’s start gaining some insights by exploring the demographic features of the dataset.

survey.demographics <- survey.data %>%
  select(GenderSelect, Country, Age, EmploymentStatus) %>%
  # drop rows with a missing or blank country or a missing age; keep Male/Female only
  filter(!is.na(Country), trimws(Country) != '', !is.na(Age),
         trimws(GenderSelect) %in% c('Male', 'Female'))

# respondents by age and gender
survey.dem.age.plot <- survey.demographics %>%
  group_by(Age, GenderSelect) %>%
  summarise(count = n()) %>%
  arrange(desc(count))

# respondents by age, country, gender, and employment status
survey.dem.plot <- survey.demographics %>%
  group_by(Age, Country, GenderSelect, EmploymentStatus) %>%
  summarise(count = n()) %>%
  arrange(desc(count))

# respondents by gender
survey.dem.gen.plot <- survey.demographics %>%
  group_by(GenderSelect) %>%
  summarise(count = n()) %>%
  arrange(desc(count))
head(survey.dem.gen.plot)
## # A tibble: 2 x 2
##   GenderSelect count
##   <chr>        <int>
## 1 Male         13416
## 2 Female        2708

Reshaping the summary to wide format and computing the percentage of each gender group

survey.dem.tidy <- survey.dem.gen.plot %>%
    spread(GenderSelect,'count')
head(survey.dem.tidy)
## # A tibble: 1 x 2
##   Female  Male
##    <int> <int>
## 1   2708 13416
gender_percentile <- mutate(survey.dem.tidy,
                            male_percent = round(Male / (Female + Male) * 100, 2),
                            female_percent = round(Female / (Female + Male) * 100, 2))
kable(gender_percentile)
| Female|  Male| male_percent| female_percent|
|------:|-----:|------------:|--------------:|
|   2708| 13416|        83.21|          16.79|

The survey has 16,716 respondents, and the dataset shows a large gender gap: of the 16,124 who reported a country, an age, and a gender of Male or Female, over 83% are male and only about 17% are female. Most respondents are employed full-time, followed by those who are not employed but looking for work.
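The percentages can also be computed without reshaping to wide format. The following is a minimal sketch that works directly on the two counts printed above; the names `gender_counts` and `gender_share` are introduced here for illustration and are not part of the original analysis.

```r
library(dplyr)

# Counts copied from the survey.dem.gen.plot output shown above.
gender_counts <- data.frame(
  GenderSelect = c("Male", "Female"),
  count = c(13416, 2708),
  stringsAsFactors = FALSE
)

# Share of each group among respondents who reported Male or Female.
gender_share <- gender_counts %>%
  mutate(percent = round(count / sum(count) * 100, 2))

print(gender_share)
```

Dividing by `sum(count)` makes the denominator explicit: it is the 16,124 filtered respondents, not the full 16,716.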

3 EDA

Let’s take a look at the data science activity attributes: TimeGatheringData, TimeModelBuilding, TimeProduction, TimeVisualizing, and TimeFindingInsights.

3.1 Data Science Activities

Analysis of the US respondents shows that gathering data is the most time-consuming activity, at 37.81% of time on average. Model building ranks second at 19.23%, followed by finding insights (14.48%) and data visualization (13.73%). Production activities take only 10.22%.

|DSActivity          | mean_precent|
|:-------------------|------------:|
|TimeGatheringData   |     37.81491|
|TimeModelBuilding   |     19.23414|
|TimeFindingInsights |     14.48332|
|TimeVisualizing     |     13.72629|
|TimeProduction      |     10.22008|
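The code behind this table is not shown, so the following is only a sketch of one way such a summary could be produced. It assumes the five Time* columns hold numeric percentages and that US respondents are identified by Country == "United States". The data frame `toy_survey` and its values are made up for illustration; in the real analysis `survey.data` would take its place.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for survey.data; the values are invented for illustration only.
toy_survey <- data.frame(
  Country             = c("United States", "United States", "India"),
  TimeGatheringData   = c(40, 30, 50),
  TimeModelBuilding   = c(20, 24, 10),
  TimeProduction      = c(10, 10, 10),
  TimeVisualizing     = c(10, 16, 10),
  TimeFindingInsights = c(20, 20, 20),
  stringsAsFactors    = FALSE
)

# The five activity columns named in the text.
activity_cols <- c("TimeGatheringData", "TimeModelBuilding", "TimeProduction",
                   "TimeVisualizing", "TimeFindingInsights")

activity_means <- toy_survey %>%
  filter(Country == "United States") %>%
  select(all_of(activity_cols)) %>%
  # one row per (respondent, activity) pair
  pivot_longer(everything(), names_to = "DSActivity", values_to = "percent") %>%
  group_by(DSActivity) %>%
  # column name spelled as in the table above
  summarise(mean_precent = mean(percent, na.rm = TRUE)) %>%
  arrange(desc(mean_precent))

print(activity_means)
```

Listing the columns explicitly, rather than matching on the "Time" prefix, avoids picking up any other column whose name happens to start the same way.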

3.2 Learning Platform

| lid|Country       |EmploymentStatus                                     |LPlatform                               |LP_count        |LearningPlatform |
|---:|:-------------|:----------------------------------------------------|:---------------------------------------|:---------------|:----------------|
|   1|United States |Not employed, but looking for work                   |LearningPlatformUsefulnessKaggle        |Somewhat useful |Kaggle           |
|   3|United States |Independent contractor, freelancer, or self-employed |LearningPlatformUsefulnessBlogs         |Very useful     |Blogs            |
|   3|United States |Independent contractor, freelancer, or self-employed |LearningPlatformUsefulnessCollege       |Very useful     |College          |
|   3|United States |Independent contractor, freelancer, or self-employed |LearningPlatformUsefulnessConferences   |Very useful     |Conferences      |
|   3|United States |Independent contractor, freelancer, or self-employed |LearningPlatformUsefulnessFriends       |Very useful     |Friends          |
|   3|United States |Independent contractor, freelancer, or self-employed |LearningPlatformUsefulnessDocumentation |Very useful     |Documentation    |

The survey respondents used different learning platforms, and learners appear to have benefited most from personal projects, which the majority of responses rate as very useful. Online courses rank second, followed by StackOverflow and Kaggle. Blogs, textbooks, and college also appear to be very useful, whereas newsletters, podcasts, and trade books rank low.
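The reshaping and ranking code is not shown, so the following is a hedged sketch of how the long table above and a "very useful" ranking might be built. It assumes one LearningPlatformUsefulness* column per platform, as the LPlatform values suggest. The data frame `toy_survey` and its ratings are made up for illustration, and `platform_long` and `very_useful_rank` are hypothetical names.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for survey.data; the ratings are invented for illustration only.
toy_survey <- data.frame(
  LearningPlatformUsefulnessKaggle  = c("Very useful", "Very useful", NA),
  LearningPlatformUsefulnessBlogs   = c("Very useful", "Very useful", "Very useful"),
  LearningPlatformUsefulnessCollege = c(NA, "Very useful", "Somewhat useful"),
  stringsAsFactors = FALSE
)

platform_long <- toy_survey %>%
  mutate(lid = row_number()) %>%
  # one row per (respondent, platform) rating; unanswered items are dropped
  pivot_longer(
    starts_with("LearningPlatformUsefulness"),
    names_to = "LPlatform",
    values_to = "LP_count",
    values_drop_na = TRUE
  ) %>%
  # strip the common prefix to get a readable platform name
  mutate(LearningPlatform = sub("^LearningPlatformUsefulness", "", LPlatform))

# Rank platforms by how many respondents rated them "Very useful".
very_useful_rank <- platform_long %>%
  filter(LP_count == "Very useful") %>%
  count(LearningPlatform, name = "very_useful", sort = TRUE)

print(very_useful_rank)
```

Counting only the "Very useful" responses matches how the paragraph above compares platforms; ranking by the share of "Very useful" among all ratings for a platform would be an alternative when response counts differ a lot between platforms.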