The dataset used in this project is the Kaggle ML and Data Science Survey 2017. I downloaded the multiple-choice survey results in CSV format and placed them in a GitHub repo (https://github.com/keshaws/CUNY_MSDS_2020/tree/master/DATA607).
Importing Multiple Choice data
library(tidyverse)  # read_csv(), dplyr verbs, and tidyr::spread()
library(knitr)      # kable()
surverydata_link <- "https://raw.githubusercontent.com/keshaws/CUNY_MSDS_2020/master/DATA607/multipleChoiceResponses.csv"
surverydata_df <- read_csv(surverydata_link)
survey.data <- surverydata_df
# Let's create a unique ID variable
surverydata_df$id <- seq.int(nrow(surverydata_df))
dim(surverydata_df)
## [1] 16716   229
Which are the most valued data science skills?
Let’s start gaining some insights by exploring the demographic features of the dataset.
survey.demographics <- survey.data %>%
  select(GenderSelect, Country, Age, EmploymentStatus) %>%
  # Keep rows with a known country and age, and gender reported as Male or Female
  filter(!is.na(Country), trimws(Country) != '', !is.na(Age),
         trimws(GenderSelect) %in% c('Male', 'Female'))
survey.dem.age.plot <- survey.demographics %>%
  group_by(Age, GenderSelect) %>%
  summarise(count = n()) %>%
  arrange(desc(count))
survey.dem.plot <- survey.demographics %>%
  group_by(Age, Country, GenderSelect, EmploymentStatus) %>%
  summarise(count = n()) %>%
  arrange(desc(count))
survey.dem.gen.plot <- survey.demographics %>%
  group_by(GenderSelect) %>%
  summarise(count = n()) %>%
  arrange(desc(count))
head(survey.dem.gen.plot)
## # A tibble: 2 x 2
##   GenderSelect count
##   <chr>        <int>
## 1 Male         13416
## 2 Female        2708
Tidying the dataset and finding the percentage for each gender group
survey.dem.tidy <- survey.dem.gen.plot %>%
  spread(GenderSelect, count)
head(survey.dem.tidy)
## # A tibble: 1 x 2
##   Female  Male
##    <int> <int>
## 1   2708 13416
gender_percentile <- mutate(survey.dem.tidy,
  male_percent = round(Male / (Female + Male) * 100, 2),
  female_percent = round(Female / (Female + Male) * 100, 2))
kable(gender_percentile)
| Female | Male | male_percent | female_percent |
|---|---|---|---|
| 2708 | 13416 | 83.21 | 16.79 |
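The same percentages can also be computed without reshaping to wide format first. The following is a minimal sketch, assuming only the two counts reported in the output above; the names `gender_counts` and `gender_share` are illustrative and not part of the original analysis.

```r
library(dplyr)

# Counts copied from the gender summary above
gender_counts <- tibble(
  GenderSelect = c("Male", "Female"),
  count = c(13416L, 2708L)
)

# Each group's share of the filtered total, as a percentage
gender_share <- gender_counts %>%
  mutate(percent = round(count / sum(count) * 100, 2))

print(gender_share)
```

Dividing by `sum(count)` keeps the data in long form, which tends to scale better if more categories are added later.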
There are 16,716 survey respondents in total. Among the 16,124 who remain after the filtering above (known country and age, gender reported as male or female), there is a large gender gap: over 83% are male and only ~17% are female. Also, most respondents are employed full-time, followed by people who are not employed but looking for work.
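The employment ranking can be checked with a simple frequency count. The sketch below uses a small made-up data frame (`toy_demographics`, hypothetical) in place of `survey.demographics`, so the numbers it prints are illustrative only.

```r
library(dplyr)

# Hypothetical toy rows standing in for survey.demographics
toy_demographics <- tibble(
  EmploymentStatus = c(
    "Employed full-time", "Employed full-time", "Employed full-time",
    "Not employed, but looking for work",
    "Not employed, but looking for work",
    "Employed part-time"
  )
)

# count(sort = TRUE) is shorthand for group_by() + summarise(n()) + arrange(desc())
employment_counts <- toy_demographics %>%
  count(EmploymentStatus, sort = TRUE, name = "count")

print(employment_counts)
```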
Let’s take a look at the data science activity attributes: TimeGatheringData, TimeModelBuilding, TimeProduction, TimeVisualizing, and TimeFindingInsights.
Analysis of the US respondents' data shows that gathering data is the most time-consuming activity, at 37.81% of time on average. Model building ranks second at 19.23%, followed by finding insights (14.48%) and data visualization (13.73%). Production activities account for only 10.22%.
| DSActivity | mean_precent |
|---|---|
| TimeGatheringData | 37.81491 |
| TimeModelBuilding | 19.23414 |
| TimeFindingInsights | 14.48332 |
| TimeVisualizing | 13.72629 |
| TimeProduction | 10.22008 |
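The code that produced the table above is not shown, so the following is only a sketch of one way such per-activity means could be computed. It assumes the five `Time*` columns hold percentages per respondent; `toy_survey` and its values are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy responses; each row's time shares sum to 100
toy_survey <- tibble(
  Country = c("United States", "United States", "India"),
  TimeGatheringData = c(40, 30, 50),
  TimeModelBuilding = c(20, 24, 10),
  TimeProduction = c(10, 10, 5),
  TimeVisualizing = c(15, 15, 20),
  TimeFindingInsights = c(15, 21, 15)
)

activity_means <- toy_survey %>%
  filter(Country == "United States") %>%
  select(TimeGatheringData, TimeModelBuilding, TimeProduction,
         TimeVisualizing, TimeFindingInsights) %>%
  # Stack the five activity columns into (DSActivity, percent) pairs
  pivot_longer(everything(), names_to = "DSActivity", values_to = "percent") %>%
  group_by(DSActivity) %>%
  summarise(mean_percent = mean(percent, na.rm = TRUE)) %>%
  arrange(desc(mean_percent))

print(activity_means)
```

The columns are selected by name rather than by a `Time` prefix so that other columns sharing the prefix would not be averaged in by accident.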
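Next, the learning platform usefulness ratings in long format, one row per respondent and platform (first rows shown):
| lid | Country | EmploymentStatus | LPlatform | LP_count | LearningPlatform |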
|---|---|---|---|---|---|
| 1 | United States | Not employed, but looking for work | LearningPlatformUsefulnessKaggle | Somewhat useful | Kaggle |
| 3 | United States | Independent contractor, freelancer, or self-employed | LearningPlatformUsefulnessBlogs | Very useful | Blogs |
| 3 | United States | Independent contractor, freelancer, or self-employed | LearningPlatformUsefulnessCollege | Very useful | College |
| 3 | United States | Independent contractor, freelancer, or self-employed | LearningPlatformUsefulnessConferences | Very useful | Conferences |
| 3 | United States | Independent contractor, freelancer, or self-employed | LearningPlatformUsefulnessFriends | Very useful | Friends |
| 3 | United States | Independent contractor, freelancer, or self-employed | LearningPlatformUsefulnessDocumentation | Very useful | Documentation |
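A long table like the one above can be built by stacking the `LearningPlatformUsefulness*` columns and stripping the common prefix. This is a sketch under assumptions: the original reshaping code is not shown, and `toy_learning` is a made-up stand-in with only three of the platform columns.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide-format rows; NA means the platform was not rated
toy_learning <- tibble(
  lid = c(1L, 3L),
  Country = c("United States", "United States"),
  LearningPlatformUsefulnessKaggle = c("Somewhat useful", NA),
  LearningPlatformUsefulnessBlogs = c(NA, "Very useful"),
  LearningPlatformUsefulnessCollege = c(NA, "Very useful")
)

learning_long <- toy_learning %>%
  pivot_longer(
    starts_with("LearningPlatformUsefulness"),
    names_to = "LPlatform",
    values_to = "LP_count",
    values_drop_na = TRUE  # keep only platforms the respondent rated
  ) %>%
  # Derive the short platform name from the column name
  mutate(LearningPlatform = sub("^LearningPlatformUsefulness", "", LPlatform))

print(learning_long)
```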
The survey respondents used different learning platforms, and it appears that learners benefited most from personal projects, as the majority of responses rate them very useful. Online courses rank second, followed by StackOverflow and Kaggle. Blogs, textbooks, and college also appear to be very useful, whereas newsletters, podcasts, and trade books rank low.
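One way to rank platforms as described is by the share of ratings that are "Very useful". The sketch below assumes long-format ratings with `LearningPlatform` and `LP_count` columns as in the table above; `toy_ratings` and its values are made up, so the resulting order is illustrative only.

```r
library(dplyr)

# Hypothetical long-format ratings, three per platform
toy_ratings <- tibble(
  LearningPlatform = c("Projects", "Projects", "Projects",
                       "Courses", "Courses", "Courses",
                       "Podcasts", "Podcasts", "Podcasts"),
  LP_count = c("Very useful", "Very useful", "Very useful",
               "Very useful", "Very useful", "Somewhat useful",
               "Very useful", "Somewhat useful", "Not Useful")
)

platform_rank <- toy_ratings %>%
  group_by(LearningPlatform) %>%
  summarise(
    responses = n(),
    # Fraction of this platform's ratings that are "Very useful"
    very_useful_share = mean(LP_count == "Very useful")
  ) %>%
  arrange(desc(very_useful_share))

print(platform_rank)
```

Ranking by share rather than raw count avoids favoring platforms that simply received more ratings.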