1 Data Acquisition

The data set used in this project is the Kaggle ML and Data Science Survey 2017. I downloaded the multiple-choice survey results in CSV format and placed them in a GitHub repo (https://github.com/keshaws/CUNY_MSDS_2020/tree/master/DATA607).

Importing Multiple Choice data

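library(tidyverse)  # provides read_csv, the dplyr verbs, and tidyr's spread
library(knitr)      # provides kable

surverydata_link <- "https://raw.githubusercontent.com/keshaws/CUNY_MSDS_2020/master/DATA607/multipleChoiceResponses.csv"
surverydata_df <- read_csv(surverydata_link)
# working copy, taken before the id column is added
survey.data <- surverydata_df
# create a unique ID variable, one per respondent row
surverydata_df$id <- seq.int(nrow(surverydata_df))
dim(surverydata_df)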

## [1] 16716   229

2 Research Question

Which are the most valued data science skills?

2.1 Understanding Features

Let’s start gaining some insights by exploring the demographic features of the dataset.

survey.demographics <- survey.data %>%
  select(GenderSelect, Country, Age, EmploymentStatus) %>%
  # keep rows with a known country and age, restricted to Male/Female responses
  filter(!is.na(Country), trimws(Country) != '', !is.na(Age),
         trimws(GenderSelect) %in% c('Male', 'Female'))

survey.dem.age.plot <- survey.demographics %>%
    group_by(Age,GenderSelect) %>%
    summarise(count=n()) %>%
    arrange(desc(count))

survey.dem.plot <- survey.demographics %>%
  group_by(Age,Country,GenderSelect,EmploymentStatus) %>%
  summarise(count=n()) %>%
  arrange(desc(count))

survey.dem.gen.plot <- survey.demographics %>%
    group_by(GenderSelect) %>%
    summarise(count=n()) %>%
    arrange(desc(count))
head(survey.dem.gen.plot)

## # A tibble: 2 x 2
##   GenderSelect count
##   <chr>        <int>
## 1 Male         13416
## 2 Female        2708

Tidying the dataset and finding the percentage for each gender group

survey.dem.tidy <- survey.dem.gen.plot %>%
    spread(GenderSelect,'count')
head(survey.dem.tidy)

## # A tibble: 1 x 2
##   Female  Male
##    <int> <int>
## 1   2708 13416

gender_percentile <- mutate(survey.dem.tidy,male_percent=round((Male/(Female+Male)*100),2),
                            female_percent=round((Female/(Female+Male))*100,2))
kable(gender_percentile)

Female	Male	male_percent	female_percent
2708	13416	83.21	16.79

The survey has 16,716 respondents. Among the 16,124 who remain after filtering, there is a large gender gap: over 83% are male and only about 17% are female. Most respondents are employed full-time, followed by those who are not employed but looking for work.
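The percentages above can also be derived without reshaping the table to wide format. The following is a minimal sketch, not the report's own code: it reuses the two counts printed earlier, and the names `gender_counts`, `gender_share`, and `percent` are illustrative assumptions.

```r
library(dplyr)

# Counts copied from the survey.dem.gen.plot output shown earlier
gender_counts <- tibble(
  GenderSelect = c("Male", "Female"),
  count = c(13416, 2708)
)

# Share of each group as a percentage of the filtered total
# (gender_share and percent are hypothetical names for this sketch)
gender_share <- gender_counts %>%
  mutate(percent = round(count / sum(count) * 100, 2))

gender_share
```

Dividing by `sum(count)` keeps the table in long format, so the same step would carry over unchanged if more gender categories were kept in the filter.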

3 EDA

Let’s take a look at the data science activity attributes: TimeGatheringData, TimeModelBuilding, TimeProduction, TimeVisualizing, and TimeFindingInsights.

3.1 Data Science Activities

Analysis of the US respondents' data shows that gathering data is the main activity, taking the largest share of time at 37.81%. Model building ranks second at 19.23%, followed by finding insights (14.48%) and data visualization (13.73%). Production activities take only 10.22% of the total.

DSActivity	mean_precent
TimeGatheringData	37.81491
TimeModelBuilding	19.23414
TimeFindingInsights	14.48332
TimeVisualizing	13.72629
TimeProduction	10.22008
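The code that produced this table is not shown in the report. As a rough sketch of one way such per-activity means could be computed, the example below uses a small made-up data frame (`toy_survey` and its values are invented for illustration, not survey data) and assumes the time columns hold percentages with possible missing values.

```r
library(dplyr)
library(tidyr)

# Made-up rows for illustration only; the real survey has many more columns
toy_survey <- tibble(
  Country = c("United States", "United States", "India"),
  TimeGatheringData = c(40, 30, 50),
  TimeModelBuilding = c(20, 20, 10),
  TimeProduction = c(10, NA, 5)
)

activity_means <- toy_survey %>%
  filter(Country == "United States") %>%       # US respondents only
  select(starts_with("Time")) %>%              # keep the activity columns
  gather(key = "DSActivity", value = "percent") %>%  # wide to long
  group_by(DSActivity) %>%
  summarise(mean_percent = mean(percent, na.rm = TRUE)) %>%
  arrange(desc(mean_percent))

activity_means
```

Passing `na.rm = TRUE` matters here: a single missing answer would otherwise turn the mean for that activity into `NA`.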

3.2 Learning platform

lid	Country	EmploymentStatus	LPlatform	LP_count	LearningPlatform
1	United States	Not employed, but looking for work	LearningPlatformUsefulnessKaggle	Somewhat useful	Kaggle
3	United States	Independent contractor, freelancer, or self-employed	LearningPlatformUsefulnessBlogs	Very useful	Blogs
3	United States	Independent contractor, freelancer, or self-employed	LearningPlatformUsefulnessCollege	Very useful	College
3	United States	Independent contractor, freelancer, or self-employed	LearningPlatformUsefulnessConferences	Very useful	Conferences
3	United States	Independent contractor, freelancer, or self-employed	LearningPlatformUsefulnessFriends	Very useful	Friends
3	United States	Independent contractor, freelancer, or self-employed	LearningPlatformUsefulnessDocumentation	Very useful	Documentation

The survey respondents used a range of learning platforms, and learners appear to have benefited most from personal projects, which the majority of responses rate as very useful. Online courses rank second, followed by StackOverflow and Kaggle. Blogs, textbooks, and college also appear to be very useful, whereas newsletters, podcasts, and trade books rank low.
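A ranking like this could be obtained by reshaping the LearningPlatformUsefulness columns to long format and counting the "Very useful" responses per platform. The sketch below is an assumption about the approach rather than the report's actual code; `toy_platforms` and its ratings are made up, and `usefulness` is a hypothetical column name.

```r
library(dplyr)
library(tidyr)

# Made-up ratings for illustration only
toy_platforms <- tibble(
  id = 1:3,
  LearningPlatformUsefulnessKaggle = c("Somewhat useful", "Very useful", NA),
  LearningPlatformUsefulnessBlogs = c("Very useful", "Very useful", "Not Useful")
)

# One row per respondent and platform, dropping unanswered items
platform_long <- toy_platforms %>%
  gather(key = "LPlatform", value = "usefulness", -id, na.rm = TRUE) %>%
  mutate(LearningPlatform = sub("LearningPlatformUsefulness", "", LPlatform))

# Number of "Very useful" ratings per platform, highest first
platform_rank <- platform_long %>%
  filter(usefulness == "Very useful") %>%
  count(LearningPlatform, sort = TRUE)

platform_rank
```

Counting only the top rating is one possible ranking rule; a share of "Very useful" among all answers per platform would correct for platforms that fewer respondents rated.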

DATA607 - Project-3

Keshaw Sahay

March 22, 2020