In this assignment, I examined the functionality of several tidyverse packages used in R for data munging, cleaning, and visualization. To do this, I wrote a function named datasciencetools that takes two parameters: a data frame and a question. The function receives a question, prints a summary data frame, and returns a plot showing the count and proportion of data scientists in the United States associated with each answer to that question. While building the function, I used functions from dplyr, ggplot2, and tidyr, which are all part of the tidyverse.
The dataset comes from the 2018 multiple-choice survey conducted by Kaggle to gather information on the state of data science and machine learning around the world. It contains about 23,859 observations, collected from respondents across the globe. The data can be viewed on data.
library(tidyverse)

datasciencetools <- function(ds, question) {
  # The first row holds the question text: append it to the column names, then drop that row
  names(ds) <- paste(names(ds), ds[1, ], sep = "_")
  ds <- ds[-c(1), ]
  # filter() restricts the respondents to those residing in the USA whose current role is Data Scientist
  ds <- ds %>%
    filter((`Q3_In which country do you currently reside?` == "United States of America") &
           (`Q6_Select the title most similar to your current role (or most recent title if retired): - Selected Choice` == "Data Scientist"))
  # select() picks the columns for the question; pivot_longer() restructures the wide data set into a long one
  ds_tool <- ds %>%
    select(starts_with(question), -contains("OTHER_TEXT")) %>%
    pivot_longer(starts_with(question), names_to = "ToolName", values_to = "Tool")
  # count() counts the occurrences of each unique value of a variable
  ds_tool_tib <- ds_tool %>%
    count(Tool) %>%
    rename(VisualTools = Tool, Count = n)
  # filter() drops empty responses; arrange() sorts the rows in ascending or descending order
  ds_tool_tib <- ds_tool_tib %>%
    filter(!(VisualTools == "")) %>%
    arrange(desc(Count))
  # mutate() adds a new variable to the data frame
  prop_n_count <- ds_tool_tib %>%
    mutate(proportion = round((Count / sum(Count)) * 100, 2)) %>%
    arrange(desc(proportion))
  print(prop_n_count)
  # ggplot() and geom_col() draw the bar chart for the analysis
  prop_n_count_view <- prop_n_count %>%
    ggplot(aes(reorder(VisualTools, proportion), proportion)) +
    geom_col(fill = "#A7ADBE") + coord_flip() +
    geom_text(aes(label = proportion), color = "red") + labs(x = "Tools/Technique") + theme_bw()
  return(prop_n_count_view)
}
datasciencetools(survey_tib,"Q31")
#> # A tibble: 12 x 3
#> VisualTools Count proportion
#> <chr> <int> <dbl>
#> 1 Numerical Data 642 19.1
#> 2 Categorical Data 563 16.8
#> 3 Time Series Data 524 15.6
#> 4 Text Data 507 15.1
#> 5 Tabular Data 480 14.3
#> 6 Geospatial Data 191 5.68
#> 7 Image Data 182 5.41
#> 8 Sensor Data 128 3.81
#> 9 Video Data 44 1.31
#> 10 Genetic Data 43 1.28
#> 11 Audio Data 32 0.95
#> 12 Other Data 26 0.77
In dplyr, filter() takes the data frame you are operating on and one or more conditional expressions that evaluate to TRUE or FALSE for each row. It tests every row against the conditions and keeps only the rows for which they evaluate to TRUE. Conditions can be built with the comparison operators <, >, <=, >=, ==, and !=, and combined with & or |. filter() played two roles in the datasciencetools function: first, it selected the country of interest (USA) and the job role (Data Scientist) from the global survey dataset; second, it removed empty responses from the summarized data frame.
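The two uses of filter() described above can be sketched on a small toy data frame. The data and the column names (country, role, tool) are made up for illustration and are much shorter than the survey's real column names.

```r
library(dplyr)

# Hypothetical toy data; not taken from the Kaggle survey
respondents <- data.frame(
  country = c("United States of America", "India",
              "United States of America", "Canada"),
  role    = c("Data Scientist", "Data Scientist", "Student", "Data Scientist"),
  tool    = c("ggplot2", "Matplotlib", "", "Shiny"),
  stringsAsFactors = FALSE
)

# Keep only the rows where both conditions are TRUE
us_ds <- respondents %>%
  filter(country == "United States of America" & role == "Data Scientist")

# Drop the rows whose answer is an empty string
non_empty <- respondents %>%
  filter(tool != "")
```

Here us_ds keeps only the respondents who satisfy both conditions, and non_empty keeps every row with a non-empty tool value.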
datasciencetools(survey_tib,"Q38")
#> # A tibble: 20 x 3
#> VisualTools Count proportion
#> <chr> <int> <dbl>
#> 1 Medium Blog Posts 338 13.6
#> 2 FiveThirtyEight.com 252 10.1
#> 3 Kaggle forums 230 9.24
#> 4 KDnuggets Blog 217 8.72
#> 5 Twitter 188 7.55
#> 6 ArXiv & Preprints 176 7.07
#> 7 r/machinelearning 173 6.95
#> 8 Hacker News 146 5.87
#> 9 O'Reilly Data Newsletter 123 4.94
#> 10 Other 100 4.02
#> 11 Journal Publications 98 3.94
#> 12 The Data Skeptic Podcast 78 3.13
#> 13 None/I do not know 77 3.09
#> 14 Linear Digressions Podcast 71 2.85
#> 15 Siraj Raval YouTube Channel 61 2.45
#> 16 FastML Blog 47 1.89
#> 17 Fastai forums 46 1.85
#> 18 Partially Derivative Podcast 46 1.85
#> 19 DataTau News Aggregator 16 0.64
#> 20 Cloud AI Adventures (YouTube) 6 0.24
select() in dplyr chooses columns (variables) from a data frame. Columns can be picked by name, by position, or with helper functions such as starts_with(), ends_with(), contains(), and matches(). In the datasciencetools function, select() worked together with starts_with() and contains(). starts_with() selected the variables whose names begin with the question number passed to the function (for example, Q31). contains(), negated with a minus sign inside select(), excluded the variables whose names contain OTHER_TEXT.
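A minimal sketch of this column selection follows. The toy data frame and its column names are assumptions that only imitate the naming pattern of the survey columns.

```r
library(dplyr)

# Hypothetical wide data; column names only mimic the survey's pattern
survey <- data.frame(
  Q31_Part_1     = c("Numerical Data", ""),
  Q31_Part_2     = c("", "Text Data"),
  Q31_OTHER_TEXT = c("-1", "-1"),
  Q38_Part_1     = c("Twitter", ""),
  stringsAsFactors = FALSE
)

# Keep the columns whose names start with "Q31", except the free-text column
q31 <- survey %>%
  select(starts_with("Q31"), -contains("OTHER_TEXT"))

names(q31)
```

Only the Q31 answer columns remain; the Q38 column and the OTHER_TEXT column are left out.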
datasciencetools(survey_tib,"Q39")
#> # A tibble: 6 x 3
#> VisualTools Count proportion
#> <chr> <int> <dbl>
#> 1 Neither better nor worse 350 23
#> 2 No opinion; I do not know 347 22.8
#> 3 Slightly worse 298 19.6
#> 4 Slightly better 233 15.3
#> 5 Much better 165 10.8
#> 6 Much worse 129 8.48
The count() function quickly counts the unique values of one or more variables. The datasciencetools function used count() to tally how many times each tool or answer option was chosen for the given question.
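Because count() works on a single column, the function first reshapes the answers into a long format with pivot_longer(). The sketch below imitates that sequence on made-up data; the column names Q_Part_1 to Q_Part_3 are hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide answers: one column per option, "" when not selected
wide <- data.frame(
  Q_Part_1 = c("MySQL", "MySQL", ""),
  Q_Part_2 = c("SQLite", "", "SQLite"),
  Q_Part_3 = c("", "", "Oracle Database"),
  stringsAsFactors = FALSE
)

# Stack all answer columns into a single Tool column
long <- wide %>%
  pivot_longer(everything(), names_to = "ToolName", values_to = "Tool")

# Count each distinct answer, rename the columns, and drop the empty answers
tool_counts <- long %>%
  count(Tool) %>%
  rename(VisualTools = Tool, Count = n) %>%
  filter(VisualTools != "")
```

count() names its result column n by default, which is why the rename() step is needed to get the Count column seen in the printed tables.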
datasciencetools(survey_tib,"Q29")
#> # A tibble: 28 x 3
#> VisualTools Count proportion
#> <chr> <int> <dbl>
#> 1 MySQL 419 19.2
#> 2 PostgresSQL 334 15.3
#> 3 SQLite 280 12.8
#> 4 Microsoft SQL Server 245 11.2
#> 5 Oracle Database 160 7.34
#> 6 AWS Relational Database Service 112 5.14
#> 7 Microsoft Access 87 3.99
#> 8 Other 84 3.85
#> 9 AWS DynamoDB 74 3.39
#> 10 Google Cloud Bigtable 53 2.43
#> # ... with 18 more rows
arrange() is one of the dplyr functions in the tidyverse. It sorts the rows of a data frame by one or more variables. Once the library is loaded, call arrange() with the data frame and the variable name. In the datasciencetools function, arrange() sorted the data frame by the Count variable, and later by proportion, with desc() wrapped around the variable to sort in descending order.
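As a small illustration, the sorting step and the proportion calculation from the function can be reproduced on made-up counts. The numbers below are hypothetical and chosen so the percentages are easy to check by hand.

```r
library(dplyr)

# Hypothetical counts; not taken from the survey
tool_counts <- data.frame(
  VisualTools = c("SQLite", "MySQL", "Oracle Database"),
  Count       = c(30L, 50L, 20L),
  stringsAsFactors = FALSE
)

# Sort from the largest to the smallest count, then add the percentage share
ranked <- tool_counts %>%
  arrange(desc(Count)) %>%
  mutate(proportion = round(Count / sum(Count) * 100, 2))
```

Without desc(), arrange(Count) would sort the same rows in ascending order.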
datasciencetools(survey_tib,"Q27")
#> # A tibble: 20 x 3
#> VisualTools Count proportion
#> <chr> <int> <dbl>
#> 1 AWS Elastic Compute Cloud (EC2) 383 26.8
#> 2 None 154 10.8
#> 3 AWS Lambda 148 10.4
#> 4 Google Compute Engine 142 9.93
#> 5 Azure Virtual Machines 103 7.2
#> 6 Google App Engine 65 4.55
#> 7 AWS Elastic Beanstalk 54 3.78
#> 8 Google Cloud Functions 54 3.78
#> 9 Google Kubernetes Engine 50 3.5
#> 10 AWS Batch 48 3.36
#> 11 Azure Container Service 42 2.94
#> 12 Azure Functions 40 2.8
#> 13 Other 27 1.89
#> 14 IBM Cloud Virtual Servers 26 1.82
#> 15 IBM Cloud Foundry 21 1.47
#> 16 Azure Batch 19 1.33
#> 17 Azure Kubernetes Service 18 1.26
#> 18 IBM Cloud Kubernetes Service 16 1.12
#> 19 IBM Cloud Container Registry 15 1.05
#> 20 Azure Event Grid 5 0.35
ggplot() is part of the ggplot2 package of the tidyverse. It is a visualization function built on three components: data, geometry, and aesthetic mapping. In the datasciencetools function, the summarized survey table is the data component and the bar plot drawn by geom_col() is the geometry component. The aesthetic mapping, set with aes(), maps each tool name and its proportion to the axes, while the hex color code passed to geom_col() sets the fill color of the bars. Other possible geometries include scatter plots, histograms, smooth densities, and boxplots. A plot uses several visual cues to represent the information in the dataset. The two most important cues in this plot are the position of each bar on the category axis and its length along the proportion axis.
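The three components can be seen in a stripped-down version of the plotting step. This is only a sketch on a hypothetical three-row data frame; the y-axis label is an addition that is not in the original function.

```r
library(ggplot2)

# Data component: hypothetical summarized table
shares <- data.frame(
  VisualTools = c("MySQL", "SQLite", "Oracle Database"),
  proportion  = c(50, 30, 20)
)

# Aesthetic mapping: tool name (ordered by proportion) on x, proportion on y
p <- ggplot(shares, aes(x = reorder(VisualTools, proportion), y = proportion)) +
  geom_col(fill = "#A7ADBE") +                          # geometry: bars with a fixed fill color
  coord_flip() +                                        # horizontal bars
  geom_text(aes(label = proportion), color = "red") +   # value labels
  labs(x = "Tools/Technique", y = "Proportion (%)") +   # y label is an assumption
  theme_bw()
```

Printing p (or calling print(p) inside a script) draws the chart; swapping geom_col() for another geom changes the geometry while the data and mapping stay the same.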