Project 3

Loading and Exploring the Data

Let’s load the dataset and take a quick look at the first few rows to understand its structure.

# Load the tidyverse (readr, dplyr, tidyr) and read the data
library(tidyverse)
data <- read_csv("https://raw.githubusercontent.com/simonchy/DATA607/refs/heads/main/week%208/Cleaned_Augmented_Data_Science_Skills.csv")

## Rows: 32 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (14): Timestamp, First Name or Nickname, Email Address, Any data science...
## dbl  (1): Age
## lgl  (1): List the 5 most valuable data science skills (separated by commas)
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Display the first few rows of the data
head(data)

## # A tibble: 6 × 16
##   Timestamp  First Name or Nickna…¹ List the 5 most valu…² `Email Address`   Age
##   <chr>      <chr>                  <lgl>                  <chr>           <dbl>
## 1 10/23/24 … Md Asaduzzaman         NA                     m.zaman3201@gm…    42
## 2 10/23/24 … Alex                   NA                     alexander.ptac…    27
## 3 10/23/24 … Inna                   NA                     innayedzinovic…    29
## 4 10/23/24 … Cindy                  NA                     cindylin90@gma…    34
## 5 10/23/24 … Sarika                 NA                     ssgupta.phd@gm…    46
## 6 10/24/24 … Halyna                 NA                     shgalin11@gmai…    27
## # ℹ abbreviated names: ¹`First Name or Nickname`,
## #   ²`List the 5 most valuable data science skills (separated by commas)`
## # ℹ 11 more variables: `Any data science/data analytics experience?` <chr>,
## #   `Any software engineering experience?` <chr>,
## #   `Which programming languages do you use most frequently?` <chr>,
## #   `What resources do you use for learning new data science skills?` <chr>,
## #   `What areas of data science are you most interested in learning more about?` <chr>, …
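The readr message above suggests specifying column types explicitly. A minimal sketch of that follows, using a small inline CSV as a stand-in for the real file; the toy rows and the shortened `Skills` column name are illustrative assumptions, not the survey data. It also shows one way to spot a column that is entirely NA, like the logical column in the output above.

```r
library(readr)

# Toy CSV standing in for the survey file (hypothetical rows and column names)
csv_text <- "Timestamp,Age,Skills\n10/23/24,42,\n10/23/24,27,\n"

# Read everything as character except Age; supplying col_types
# also suppresses the column specification message
toy <- read_csv(
  I(csv_text),
  col_types = cols(.default = col_character(), Age = col_double())
)

# Names of columns that contain only missing values
all_na_cols <- names(toy)[colSums(!is.na(toy)) == 0]
all_na_cols
```

Declaring the types up front keeps an all-empty text column from being guessed as logical, which makes later reshaping steps more predictable.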

Data Cleaning

We’ll standardize the column names, reshape the five ranked skill columns into long format, and drop rows where the skill is missing.

# Standardize column names to avoid spaces and special characters
colnames(data) <- make.names(colnames(data))

# Select only the columns with the top skills
skills_data <- data %>%
  select(Name..1.most.most.valuable.data.science.skill,
         Name..2.most.most.valuable.data.science.skill,
         Name..3.most.most.valuable.data.science.skill,
         Name..4.most.most.valuable.data.science.skill,
         Name..5.most.most.valuable.data.science.skill) %>%
  pivot_longer(cols = everything(),
               names_to = "skill_rank",
               values_to = "skill") %>%
  filter(!is.na(skill)) # Remove rows with NA skills

# Check cleaned and reshaped data
head(skills_data)

## # A tibble: 6 × 2
##   skill_rank                                    skill                    
##   <chr>                                         <chr>                    
## 1 Name..1.most.most.valuable.data.science.skill R language skill         
## 2 Name..2.most.most.valuable.data.science.skill Python skill             
## 3 Name..3.most.most.valuable.data.science.skill Statistics and math skill
## 4 Name..4.most.most.valuable.data.science.skill Data Visualization skill 
## 5 Name..5.most.most.valuable.data.science.skill SQL skill                
## 6 Name..1.most.most.valuable.data.science.skill data cleaning
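To see what `make.names()` does to the survey headers, the sketch below applies it to three column names that appear in the `head(data)` output. It then shows one possible way to turn a long `skill_rank` label into an integer rank; the `rank_label` value is copied from the reshaped output, while the regular expression and the `skill_rank_num` name are illustrative choices.

```r
# Column names as they appear in the raw data
raw_names <- c("First Name or Nickname",
               "Email Address",
               "Any data science/data analytics experience?")

# Spaces and special characters are each replaced with "."
clean_names <- make.names(raw_names)
clean_names

# Pull the rank digit out of a reshaped skill_rank label
rank_label <- "Name..3.most.most.valuable.data.science.skill"
skill_rank_num <- as.integer(sub("^Name\\.\\.([0-9]+)\\..*$", "\\1", rank_label))
skill_rank_num
```

A numeric rank is easier to filter or weight by than the full column name, should the analysis later need to distinguish first choices from fifth choices.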

Analysis: Most Valued Data Science Skills

Let’s calculate the frequency of each skill to determine which are the most valued.

# Count the occurrences of each skill, most frequent first
skill_counts <- skills_data %>%
  count(skill, sort = TRUE, name = "count")

# Display the top skills
head(skill_counts, 10)

## # A tibble: 10 × 2
##    skill                count
##    <chr>                <int>
##  1 Resourcefulness         10
##  2 Critical Thinking        9
##  3 Collaboration            8
##  4 Creativity               8
##  5 Data Visualization       8
##  6 Data Cleaning            7
##  7 Python                   7
##  8 SQL                      7
##  9 Statistical Analysis     7
## 10 Teamwork                 7
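The counts treat labels as exact strings, so variants such as `data cleaning` and `Data Cleaning`, or `Python skill` and `Python`, would be tallied separately if both forms are present in the data. A hedged sketch of one way to normalize labels before counting follows. The toy values and the `skill_key` column are hypothetical, and dropping a trailing " skill" is an assumption based on the labels visible in the reshaped output.

```r
library(dplyr)
library(stringr)

# Toy labels (hypothetical) with case, whitespace, and suffix variants
toy_skills <- tibble(
  skill = c("Data Cleaning", "data cleaning", " Python ", "Python skill", "SQL")
)

normalized_counts <- toy_skills %>%
  mutate(
    skill_key = skill %>%
      str_squish() %>%          # trim and collapse whitespace
      str_to_lower() %>%        # ignore case differences
      str_remove(" skill$")     # drop a trailing " skill" suffix
  ) %>%
  count(skill_key, sort = TRUE, name = "count")

normalized_counts
```

Lowercasing is the simplest grouping key, though it loses the original capitalization of acronyms such as SQL; a lookup table of preferred display labels would be one way to restore it.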

Visualization

Finally, we visualize the most valued data science skills using a bar plot.

# Plot the top 10 most valued skills
library(ggplot2)

ggplot(head(skill_counts, 10), aes(x = reorder(skill, count), y = count)) +
  geom_col(fill = "steelblue") +  # geom_col() plots the precomputed counts
  coord_flip() +
  labs(title = "Top 10 Most Valued Data Science Skills",
       x = "Skill",
       y = "Count") +
  theme_minimal()
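Taking the first 10 rows cuts the list off arbitrarily when several skills share the same count at the boundary, as the skills with 7 mentions do in the table above. One option, sketched here with made-up toy counts, is `dplyr::slice_max()`, which keeps ties by default.

```r
library(dplyr)

# Toy counts (hypothetical, not the survey results)
toy_counts <- tibble(
  skill = c("skill_a", "skill_b", "skill_c", "skill_d"),
  count = c(10L, 7L, 7L, 7L)
)

# Ask for the top 2; ties at the cutoff are kept, so all 4 rows come back
top_with_ties <- slice_max(toy_counts, count, n = 2)

# Force exactly 2 rows when a fixed-size plot is needed
top_exact <- slice_max(toy_counts, count, n = 2, with_ties = FALSE)
```

Keeping ties avoids implying that the last skill shown outranks an equally frequent one that was dropped, at the cost of a plot whose length can vary.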

Conclusion

Based on our analysis, the top skills identified in the dataset are:

# Display the top 10 skills as a table
skill_counts[1:10, ]

## # A tibble: 10 × 2
##    skill                count
##    <chr>                <int>
##  1 Resourcefulness         10
##  2 Critical Thinking        9
##  3 Collaboration            8
##  4 Creativity               8
##  5 Data Visualization       8
##  6 Data Cleaning            7
##  7 Python                   7
##  8 SQL                      7
##  9 Statistical Analysis     7
## 10 Teamwork                 7

According to the survey responses, these are the most valued abilities in the data science field. Resourcefulness ranks first with 10 mentions, followed by Critical Thinking with 9.