Vignette That Demonstrates the Functions in the Tidyverse

Nnaemeka Newman Okereafor

2021-10-24

About the Assignment

In this assignment, I examined the functionality of several Tidyverse packages used in R for data munging, cleaning, and visualization. To do this, I wrote a function named datasciencetools that takes two parameters: a data frame and a question. The function receives a question identifier (for example, "Q31"), prints a data frame containing the count and proportion of each answer given by data scientists in the United States, and returns a plot of those proportions. While building the function, I used functions from dplyr, ggplot2, and tidyr, all of which are part of the Tidyverse.

Data

The dataset is the 2018 multiple-choice survey conducted by Kaggle to gather information on the state of data science and machine learning around the world. It contains about 23,859 observations, collected from respondents across the globe. The data can be viewed on data.

Function


library(dplyr)    # filter(), select(), count(), rename(), arrange(), mutate()
library(tidyr)    # pivot_longer()
library(ggplot2)  # ggplot(), geom_col(), geom_text()

datasciencetools <- function(ds, question) {
  # The first data row holds the question text: append it to the column names, then drop that row
  names(ds) <- paste(names(ds), ds[1, ], sep = "_")
  ds <- ds[-1, ]

  # filter() keeps only respondents who live in the USA and whose current role is Data Scientist
  ds <- ds %>%
    filter(
      `Q3_In which country do you currently reside?` == "United States of America",
      `Q6_Select the title most similar to your current role (or most recent title if retired): - Selected Choice` == "Data Scientist"
    )

  # select() keeps the columns for the question (minus free-text columns);
  # pivot_longer() reshapes them from wide to long
  ds_tool <- ds %>%
    select(starts_with(question), -contains("OTHER_TEXT")) %>%
    pivot_longer(starts_with(question), names_to = "ToolName", values_to = "Tool")

  # count() tallies each distinct answer; rename() gives the columns descriptive names
  ds_tool_tib <- ds_tool %>%
    count(Tool) %>%
    rename(VisualTools = Tool, Count = n)

  # filter() drops blank answers; arrange() with desc() sorts by count, largest first
  ds_tool_tib <- ds_tool_tib %>%
    filter(VisualTools != "") %>%
    arrange(desc(Count))

  # mutate() adds a new column holding each answer's share of all answers, in percent
  prop_n_count <- ds_tool_tib %>%
    mutate(proportion = round(Count / sum(Count) * 100, 2)) %>%
    arrange(desc(proportion))
  print(prop_n_count)

  # ggplot() with geom_col() draws a horizontal bar chart of the proportions
  prop_n_count_view <- prop_n_count %>%
    ggplot(aes(reorder(VisualTools, proportion), proportion)) +
    geom_col(fill = "#A7ADBE") +
    coord_flip() +
    geom_text(aes(label = proportion), color = "red") +
    labs(x = "Tools/Technique") +
    theme_bw()

  return(prop_n_count_view)
}
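
Two steps inside the function, reshaping with pivot_longer() and computing proportions with mutate(), can be sketched on a small made-up table. The `wide` data frame and its values are hypothetical and only loosely mirror the survey's column-name pattern; the sketch also uses count()'s name argument as a shortcut instead of a separate rename().

```r
library(dplyr)
library(tidyr)

# Hypothetical wide data: one column per answer option, blank when not selected
wide <- data.frame(
  Q31_Part_1 = c("Numerical Data", "Numerical Data", ""),
  Q31_Part_2 = c("Text Data", "", "Text Data"),
  stringsAsFactors = FALSE
)

# Wide to long: every cell becomes one row, with the source column name kept in ToolName
long <- wide %>%
  pivot_longer(starts_with("Q31"), names_to = "ToolName", values_to = "Tool")

# Drop blanks, tally each answer, then add its percentage share as a new column
shares <- long %>%
  filter(Tool != "") %>%
  count(Tool, name = "Count") %>%
  mutate(proportion = round(Count / sum(Count) * 100, 2))

print(long)
print(shares)
```

The long table has one row per respondent and answer option, which is the shape count() needs to tally answers across all of the question's columns at once.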

What types of data do you (Data Scientists) interact with most often at work or school?

datasciencetools(survey_tib,"Q31")
#> # A tibble: 12 x 3
#>    VisualTools      Count proportion
#>    <chr>            <int>      <dbl>
#>  1 Numerical Data     642      19.1 
#>  2 Categorical Data   563      16.8 
#>  3 Time Series Data   524      15.6 
#>  4 Text Data          507      15.1 
#>  5 Tabular Data       480      14.3 
#>  6 Geospatial Data    191       5.68
#>  7 Image Data         182       5.41
#>  8 Sensor Data        128       3.81
#>  9 Video Data          44       1.31
#> 10 Genetic Data        43       1.28
#> 11 Audio Data          32       0.95
#> 12 Other Data          26       0.77

Filter Function

In dplyr, the filter() function takes the data frame you are operating on, followed by one or more conditional expressions that evaluate to TRUE or FALSE. Under the hood, filter() tests each row against the conditions and keeps only the rows for which every condition evaluates to TRUE; rows that evaluate to FALSE or NA are dropped. Conditions can be built with the comparison operators <, >, <=, >=, ==, and !=. The filter() function played two roles in the datasciencetools function: first, to select the country of interest (USA) and the job role (Data Scientist) from the global survey dataset; second, to remove blank answers from the summary table.
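
As a minimal sketch of the row-by-row test, the example below applies filter() to a small made-up data frame. The `toy` and `answers` data and the short column names `country` and `role` are hypothetical; they stand in for the much longer survey column names.

```r
library(dplyr)

# Hypothetical stand-in for the survey data (not the real Kaggle columns)
toy <- data.frame(
  country = c("United States of America", "India", "United States of America", "Canada"),
  role    = c("Data Scientist", "Data Scientist", "Student", "Data Scientist"),
  stringsAsFactors = FALSE
)

# Each row is tested against both conditions; only rows where both are TRUE are kept
usa_ds <- toy %>%
  filter(country == "United States of America", role == "Data Scientist")

# A second use of filter(): drop blank answers from a summary table
answers <- data.frame(VisualTools = c("MySQL", "", "SQLite"), stringsAsFactors = FALSE)
non_blank <- answers %>% filter(VisualTools != "")

print(usa_ds)
print(non_blank)
```

Separating conditions with a comma inside filter() has the same effect as joining them with &.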

What are the favorite media sources that report on data science topics?

datasciencetools(survey_tib,"Q38")
#> # A tibble: 20 x 3
#>    VisualTools                   Count proportion
#>    <chr>                         <int>      <dbl>
#>  1 Medium Blog Posts               338      13.6 
#>  2 FiveThirtyEight.com             252      10.1 
#>  3 Kaggle forums                   230       9.24
#>  4 KDnuggets Blog                  217       8.72
#>  5 Twitter                         188       7.55
#>  6 ArXiv & Preprints               176       7.07
#>  7 r/machinelearning               173       6.95
#>  8 Hacker News                     146       5.87
#>  9 O'Reilly Data Newsletter        123       4.94
#> 10 Other                           100       4.02
#> 11 Journal Publications             98       3.94
#> 12 The Data Skeptic Podcast         78       3.13
#> 13 None/I do not know               77       3.09
#> 14 Linear Digressions Podcast       71       2.85
#> 15 Siraj Raval YouTube Channel      61       2.45
#> 16 FastML Blog                      47       1.89
#> 17 Fastai forums                    46       1.85
#> 18 Partially Derivative Podcast     46       1.85
#> 19 DataTau News Aggregator          16       0.64
#> 20 Cloud AI Adventures (YouTube)     6       0.24

Select Function

The select() function from the dplyr package selects variables (columns) of a data frame. Columns can be chosen by name, by position, or with helper functions that match column names, such as starts_with(), ends_with(), contains(), and matches(). In the datasciencetools function, select() works together with starts_with() and contains(). The starts_with() helper selects the variables whose names start with the question prefix (for example, "Q31"). The contains() helper, negated with a minus sign inside select(), excludes the variables whose names contain OTHER_TEXT.
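
The column selection can be sketched on a made-up data frame. The column names below are hypothetical, shortened versions of the survey headers, and the cell values are placeholders.

```r
library(dplyr)

# Hypothetical columns that mimic the shape of the renamed survey headers
toy <- data.frame(
  Q31_Part_1     = c("Numerical Data", ""),
  Q31_Part_2     = c("", "Text Data"),
  Q31_OTHER_TEXT = c("-1", "-1"),
  Q32            = c("Text Data", "Numerical Data"),
  stringsAsFactors = FALSE
)

# Keep columns whose names start with "Q31", then drop any that contain "OTHER_TEXT"
picked <- toy %>%
  select(starts_with("Q31"), -contains("OTHER_TEXT"))

print(names(picked))
```

Because starts_with() matches on a prefix, a short prefix such as "Q3" would also match Q31 and Q32 here, so the full question number is the safer argument.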

How do you perceive the quality of online learning platforms and in-person bootcamps as compared to the quality of the education provided by traditional brick and mortar institutions? - Online learning platforms and MOOCs:

datasciencetools(survey_tib,"Q39")
#> # A tibble: 6 x 3
#>   VisualTools               Count proportion
#>   <chr>                     <int>      <dbl>
#> 1 Neither better nor worse    350      23   
#> 2 No opinion; I do not know   347      22.8 
#> 3 Slightly worse              298      19.6 
#> 4 Slightly better             233      15.3 
#> 5 Much better                 165      10.8 
#> 6 Much worse                  129       8.48

Count Function

The count() function quickly counts the unique values of one or more variables. The datasciencetools function uses count() to tally how many times each answer to the question occurs.
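
A minimal sketch of the tally step, using a made-up vector of answers (the `answers` data frame is hypothetical):

```r
library(dplyr)

# Hypothetical long-format answers, including one blank response
answers <- data.frame(
  Tool = c("MySQL", "SQLite", "MySQL", "", "MySQL", "SQLite"),
  stringsAsFactors = FALSE
)

# count() returns one row per distinct value, with the tally in a column named n;
# rename() then gives both columns more descriptive names
tally <- answers %>%
  count(Tool) %>%
  rename(VisualTools = Tool, Count = n)

print(tally)
```

The blank answer is counted like any other value, which is why datasciencetools filters it out afterwards.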

Which of the following relational database products have you used at work or school in the last 5 years?

datasciencetools(survey_tib,"Q29")
#> # A tibble: 28 x 3
#>    VisualTools                     Count proportion
#>    <chr>                           <int>      <dbl>
#>  1 MySQL                             419      19.2 
#>  2 PostgresSQL                       334      15.3 
#>  3 SQLite                            280      12.8 
#>  4 Microsoft SQL Server              245      11.2 
#>  5 Oracle Database                   160       7.34
#>  6 AWS Relational Database Service   112       5.14
#>  7 Microsoft Access                   87       3.99
#>  8 Other                              84       3.85
#>  9 AWS DynamoDB                       74       3.39
#> 10 Google Cloud Bigtable              53       2.43
#> # ... with 18 more rows

Arrange Function

The arrange() function is one of the dplyr functions in the Tidyverse. It sorts a data frame by one or more variables. After loading the library, call arrange() and pass it the data frame and the variable name. In the datasciencetools function, arrange() sorts the data frame by the Count variable, and desc() is wrapped around the variable to sort in descending order.
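
The sorting step can be sketched with three rows taken from the Q29 output above, entered here out of order:

```r
library(dplyr)

# Three rows from the relational database results, deliberately unsorted
tally <- data.frame(
  VisualTools = c("SQLite", "MySQL", "Oracle Database"),
  Count       = c(280, 419, 160),
  stringsAsFactors = FALSE
)

# arrange() sorts ascending by default; desc() reverses the order for that variable
sorted <- tally %>% arrange(desc(Count))

print(sorted)
```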

Which of the following cloud computing products have you used at work or school in the last 5 years?

datasciencetools(survey_tib,"Q27")
#> # A tibble: 20 x 3
#>    VisualTools                     Count proportion
#>    <chr>                           <int>      <dbl>
#>  1 AWS Elastic Compute Cloud (EC2)   383      26.8 
#>  2 None                              154      10.8 
#>  3 AWS Lambda                        148      10.4 
#>  4 Google Compute Engine             142       9.93
#>  5 Azure Virtual Machines            103       7.2 
#>  6 Google App Engine                  65       4.55
#>  7 AWS Elastic Beanstalk              54       3.78
#>  8 Google Cloud Functions             54       3.78
#>  9 Google Kubernetes Engine           50       3.5 
#> 10 AWS Batch                          48       3.36
#> 11 Azure Container Service            42       2.94
#> 12 Azure Functions                    40       2.8 
#> 13 Other                              27       1.89
#> 14 IBM Cloud Virtual Servers          26       1.82
#> 15 IBM Cloud Foundry                  21       1.47
#> 16 Azure Batch                        19       1.33
#> 17 Azure Kubernetes Service           18       1.26
#> 18 IBM Cloud Kubernetes Service       16       1.12
#> 19 IBM Cloud Container Registry       15       1.05
#> 20 Azure Event Grid                    5       0.35

ggplot Function

The ggplot() function is part of the ggplot2 package in the Tidyverse. It is a visualization function. A plot has three main components: data, geometry, and aesthetic mapping. In the datasciencetools function, the summarized survey table is the data component, and the bars drawn by geom_col() are the geometry component. The aesthetic mapping, set with aes(), assigns the answers to one axis and their proportions to the other; the hex color code passed to geom_col() is a fixed fill setting rather than a mapping. Other possible geometries include scatterplots, histograms, smooth densities, and boxplots. The plot uses several visual cues to represent the information in the dataset; the two most important in this plot are the position of each bar along the category axis and its length along the proportion axis.
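
As a rough sketch of how the three components fit together, the example below rebuilds the bar chart from the top three rows of the Q31 output above. The `props` data frame is typed in by hand for illustration; the plotting calls are the same ones used inside datasciencetools.

```r
library(ggplot2)

# Data component: top three rows of the Q31 summary table
props <- data.frame(
  VisualTools = c("Numerical Data", "Categorical Data", "Time Series Data"),
  proportion  = c(19.1, 16.8, 15.6)
)

# Aesthetic mapping: answers on x (ordered by proportion), proportion on y
p <- ggplot(props, aes(x = reorder(VisualTools, proportion), y = proportion)) +
  geom_col(fill = "#A7ADBE") +                          # geometry: bars with a fixed fill color
  coord_flip() +                                        # horizontal bars
  geom_text(aes(label = proportion), color = "red") +   # second geometry: value labels
  labs(x = "Tools/Technique") +
  theme_bw()

print(p)
```

Settings placed outside aes(), such as fill = "#A7ADBE", apply the same value to every bar, while anything inside aes() is driven by a column of the data.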