stat wrang

Part 1: Introduction to AI in Data Wrangling

AI-assisted data wrangling uses tools like ChatGPT to simplify data preparation. AI saves time by automating repetitive tasks, suggesting optimizations, and generating code snippets on demand.

library(readr)  # provides read_csv()
library(dplyr)  # provides %>% and mutate()
data_path <- "C:\\Users\\ntonu\\OneDrive\\Documents\\ai_job_market_insights.csv"

data <- read_csv(data_path)

## Rows: 500 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): Job_Title, Industry, Company_Size, Location, AI_Adoption_Level, Aut...
## dbl (1): Salary_USD
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Replace missing values in every numeric column with that column's mean
data <- data %>% mutate(across(where(is.numeric), ~if_else(is.na(.x), mean(.x, na.rm = TRUE), .x)))

Activity: Prompt ChatGPT for Code

Ask ChatGPT for different missing-value strategies:

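Replace Missing Values with Medians:

# Same pattern as the mean version; across() supersedes mutate_if() in current dplyr
data <- data %>% mutate(across(where(is.numeric), ~if_else(is.na(.x), median(.x, na.rm = TRUE), .x)))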
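
Before choosing between the mean and the median, it can help to count the missing values and see how far the two fill values differ. The following is a minimal base R sketch on a made-up data frame (`toy` and its values are hypothetical, not the real dataset):

```r
# Toy data frame standing in for the real dataset (values are made up)
toy <- data.frame(
  Job_Title  = c("Analyst", NA, "Engineer", "Analyst"),
  Salary_USD = c(50000, 60000, NA, 200000)
)

# Count missing values per column before imputing
na_counts <- colSums(is.na(toy))
print(na_counts)

# Candidate fill values: the mean is pulled upward by the 200000 outlier
mean_fill   <- mean(toy$Salary_USD, na.rm = TRUE)
median_fill <- median(toy$Salary_USD, na.rm = TRUE)

# Fill the missing salary with the median
toy$Salary_USD[is.na(toy$Salary_USD)] <- median_fill
print(toy)
```

With a skewed column such as salary, the median is usually the more conservative choice, because a single extreme value does not shift it.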

Remove Rows with Many Missing Values:

threshold <- ncol(data) * 0.5
data <- data[rowSums(is.na(data)) <= threshold, ]
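
To see what the threshold actually keeps, here is a small base R sketch on a made-up four-column data frame (`toy` is hypothetical). Note that `<=` keeps rows where exactly half of the values are missing:

```r
# Toy data: row 1 has 0 missing values, row 2 has 3, row 3 has 2
toy <- data.frame(
  a = c(1, NA, NA),
  b = c("x", NA, "z"),
  c = c(TRUE, NA, NA),
  d = c(10, 20, 30)
)

threshold  <- ncol(toy) * 0.5        # 4 columns, so the threshold is 2
na_per_row <- rowSums(is.na(toy))    # number of missing values in each row

# Keep rows whose missing count does not exceed the threshold
kept <- toy[na_per_row <= threshold, ]
print(kept)
```

Rows 1 and 3 survive; row 2, with three of four values missing, is dropped.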

Impute Categorical Missing Values:

# Replace missing values in the Job_Title column with the most frequent value
data$Job_Title <- ifelse(
  is.na(data$Job_Title), 
  names(sort(table(data$Job_Title), decreasing = TRUE))[1], 
  data$Job_Title
)
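
The same most-frequent-value logic can be wrapped in a small helper so it is reusable across character columns. This is a sketch only; `impute_mode` is a hypothetical name, and the input vector is made up:

```r
# Hypothetical helper: fill NA in a character vector with its most frequent value
impute_mode <- function(x) {
  counts <- table(x)                             # table() ignores NA by default
  mode_value <- names(counts)[which.max(counts)] # on ties, the first name alphabetically
  ifelse(is.na(x), mode_value, x)
}

titles <- c("Analyst", NA, "Engineer", "Analyst", NA)
filled <- impute_mode(titles)
print(filled)
```

Mode imputation inflates the count of the most common category, so it is worth checking how many values were filled before relying on the result.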

AI streamlines data wrangling by automating repetitive tasks, such as handling missing values or transforming data, saving time and reducing errors. It suggests optimized solutions, introduces creative approaches, and explains best practices, enabling users to work more efficiently. By acting as both a coding assistant and a tutor, AI empowers users to focus on higher-level analysis and decision-making.

Part 2: AI-Assisted Data Transformation

Example: Creating a New Column Based on Conditional Logic

# Adding a new column 'Risk_Level' based on Automation_Risk and AI_Adoption_Level
data <- data %>% mutate(Risk_Level = case_when(
  Automation_Risk == "High" & AI_Adoption_Level == "High" ~ "Very Risky",
  Automation_Risk == "Medium" & AI_Adoption_Level == "Medium" ~ "Moderately Risky",
  TRUE ~ "Low Risk"
))
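
It is worth checking what the catch-all `TRUE ~ "Low Risk"` branch absorbs. The sketch below applies the same rules to a made-up tibble (`toy` is hypothetical) that includes a mixed combination and a missing value:

```r
library(dplyr)

# Made-up rows covering a match for each rule, a mixed combination, and an NA
toy <- tibble(
  Automation_Risk   = c("High", "Medium", "High", "Low"),
  AI_Adoption_Level = c("High", "Medium", "Low", NA)
)

toy <- toy %>% mutate(Risk_Level = case_when(
  Automation_Risk == "High" & AI_Adoption_Level == "High" ~ "Very Risky",
  Automation_Risk == "Medium" & AI_Adoption_Level == "Medium" ~ "Moderately Risky",
  TRUE ~ "Low Risk"
))

print(toy)
```

The third row has high automation risk but is still labelled "Low Risk", because it matches neither explicit rule. The row with a missing adoption level lands there too. If that is not the intended meaning, the rules need more branches or a more neutral default label.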

Binning Continuous Variables: Categorize employees’ Salary_USD into Low, Medium, and High salary bands:

data <- data %>% mutate(Salary_Band = cut(
  Salary_USD, 
  breaks = c(0, 80000, 120000, Inf), 
  labels = c("Low", "Medium", "High")
))
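
The behaviour of `cut()` at the break points is easy to get wrong, so here is a short base R sketch with made-up boundary values. By default the intervals are closed on the right, so 80000 falls in "Low" and a salary of exactly 0 falls outside every band:

```r
# Made-up salaries chosen to sit on and around the break points
salaries <- c(0, 50000, 80000, 80001, 120000, 150000)

bands <- cut(
  salaries,
  breaks = c(0, 80000, 120000, Inf),
  labels = c("Low", "Medium", "High")
)
print(data.frame(salaries, bands))

# include.lowest = TRUE closes the first interval on the left, so 0 becomes "Low"
bands_incl <- cut(
  salaries,
  breaks = c(0, 80000, 120000, Inf),
  labels = c("Low", "Medium", "High"),
  include.lowest = TRUE
)
print(data.frame(salaries, bands_incl))
```

In the dataset used here the minimum salary is well above 0, so the default is harmless, but the option matters for columns that can contain zeros.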

Combining Columns: Create a Job_Description column that combines Job_Title and Industry:

data <- data %>% mutate(Job_Description = paste(Job_Title, Industry, sep = " - "))

Applying Multiple Conditions to Create a New Variable: Add a column Growth_Opportunity based on Job_Growth_Projection and Remote_Friendly:

data <- data %>% mutate(Growth_Opportunity = case_when(
  Job_Growth_Projection == "Growth" & Remote_Friendly == "Yes" ~ "High Opportunity",
  Job_Growth_Projection == "Stable" ~ "Moderate Opportunity",
  TRUE ~ "Low Opportunity"
))

Creating Composite Metrics: Generate a Tech_Readiness_Score based on AI_Adoption_Level and Required_Skills:

data <- data %>% mutate(Tech_Readiness_Score = case_when(
  AI_Adoption_Level == "High" & Required_Skills %in% c("Machine Learning", "Python") ~ 90,
  AI_Adoption_Level == "Medium" ~ 70,
  TRUE ~ 50
))

AI simplified creating complex transformations like Risk_Level, combining multiple variables such as Automation_Risk and AI_Adoption_Level. It also helped categorize continuous data into meaningful groups like Salary_Band and merge columns such as Job_Title and Industry into Job_Description. These transformations saved time and introduced efficient, creative approaches that would have been harder to develop manually.

Part 3: Visualizing Data with AI Assistance

3.1 Basic Visualizations Using AI

Example 1: Histogram of Salary_USD

Visualize the distribution of salaries in your dataset:

library(ggplot2)

# Histogram of Salary_USD
ggplot(data, aes(x = Salary_USD)) +
  geom_histogram(binwidth = 10000, fill = 'blue', color = 'black') +
  labs(title = "Distribution of Salaries", x = "Salary (USD)", y = "Count")

Example 2: Scatter Plot of Salary_USD vs. AI_Adoption_Level

Examine the relationship between salaries and AI adoption levels:

# Scatter plot of Salary_USD vs AI_Adoption_Level
ggplot(data, aes(x = AI_Adoption_Level, y = Salary_USD, color = AI_Adoption_Level)) +
  geom_point(size = 3) +
  labs(title = "Salary vs. AI Adoption Level", x = "AI Adoption Level", y = "Salary (USD)")

Example 3: Box Plot of Salary_USD by Company_Size

Analyze salary distribution across different company sizes:

# Box plot of Salary_USD by Company_Size
ggplot(data, aes(x = Company_Size, y = Salary_USD, fill = Company_Size)) +
  geom_boxplot() +
  labs(title = "Salary Distribution by Company Size", x = "Company Size", y = "Salary (USD)")

Activity: Ask AI to Generate More Visualizations

Prompt: “Generate a line chart of Salary_USD by Location.”

# Line chart of average Salary_USD by Location
# stat_summary() computes the mean per location so the plot matches its title;
# linewidth replaces the deprecated size argument for lines (ggplot2 >= 3.4.0)
ggplot(data, aes(x = Location, y = Salary_USD, group = 1)) +
  stat_summary(fun = mean, geom = "line", color = "blue", linewidth = 1) +
  stat_summary(fun = mean, geom = "point", size = 3, color = "red") +
  labs(title = "Average Salary by Location", x = "Location", y = "Average Salary (USD)")
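
A chart titled "Average Salary by Location" needs one value per location. One way to make that explicit is to compute the averages first and plot the summary table. The sketch below uses made-up rows (`toy` and `Avg_Salary` are hypothetical names, and the figures are not from the real dataset):

```r
library(dplyr)
library(ggplot2)

# Made-up rows: two salaries each for Berlin and Dubai, one for Singapore
toy <- tibble(
  Location   = c("Berlin", "Berlin", "Dubai", "Dubai", "Singapore"),
  Salary_USD = c(90000, 110000, 80000, 100000, 95000)
)

# One row per location with its mean salary
avg_by_location <- toy %>%
  group_by(Location) %>%
  summarise(Avg_Salary = mean(Salary_USD), .groups = "drop")
print(avg_by_location)

# Plot the aggregated values; linewidth is the current argument for line thickness
p <- ggplot(avg_by_location, aes(x = Location, y = Avg_Salary, group = 1)) +
  geom_line(color = "blue", linewidth = 1) +
  geom_point(size = 3, color = "red") +
  labs(title = "Average Salary by Location", x = "Location", y = "Average Salary (USD)")
p
```

Because locations have no natural order, a bar chart of the same summary table may communicate the comparison more honestly than a connected line.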

AI greatly simplified the visualization process by providing ready-to-use, efficient ggplot2 code tailored to the dataset. It saved time by automatically selecting appropriate plot types and aesthetics, such as using a histogram for distributions and box plots for categorical comparisons. The results aligned with expectations, showcasing clear, visually appealing representations of the data. Additionally, AI offered customization options like colors, labels, and themes, further enhancing the clarity of the visualizations.

3.2 Customizing Visualizations

Explore how to customize AI-generated visualizations to match your style preferences (colors, themes, titles).

Example: Scatter Plot with Trendline for Your Dataset

Let’s create a scatter plot of Salary_USD against Automation_Risk with a trendline, customized axis labels, and an updated theme.

library(ggplot2)

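# Customized scatter plot with trendline
# Automation_Risk is categorical, so aes(group = 1) is needed for a single fitted line;
# categories sit on the x-axis in alphabetical order unless converted to an ordered factor
ggplot(data, aes(x = Automation_Risk, y = Salary_USD, color = Automation_Risk)) +
  geom_point(size = 3) +
  geom_smooth(aes(group = 1), method = 'lm', color = 'red', se = FALSE) +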
  labs(
    x = 'Automation Risk Level',
    y = 'Salary (USD)',
    title = 'Impact of Automation Risk on Salary'
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
    axis.title = element_text(size = 14),
    legend.position = "top"
  )

## `geom_smooth()` using formula = 'y ~ x'
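
A straight-line fit needs a numeric x. One option, sketched here on made-up data, is to map the risk categories to ordered integers before fitting. The level names Low, Medium, and High and their order are assumptions about the column, and `Risk_Num` is a hypothetical helper column:

```r
# Made-up salaries for each (assumed) risk level
toy <- data.frame(
  Automation_Risk = c("Low", "Medium", "High", "Low", "Medium", "High"),
  Salary_USD      = c(60000, 80000, 100000, 70000, 90000, 110000)
)

# Encode the categories as 1, 2, 3 in a meaningful order (assumed: Low < Medium < High)
toy$Risk_Num <- as.integer(
  factor(toy$Automation_Risk, levels = c("Low", "Medium", "High"))
)

# Fit salary against the numeric encoding and inspect the coefficients
fit <- lm(Salary_USD ~ Risk_Num, data = toy)
print(coef(fit))
```

The slope is then the average salary change per step in risk level. This treats the steps between categories as equal, which is a modelling choice rather than a fact about the data.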

Modify Legends and Colors: Customize a box plot of Salary_USD by Company_Size with updated colors and a repositioned legend:

ggplot(data, aes(x = Company_Size, y = Salary_USD, fill = Company_Size)) +
  geom_boxplot() +
  labs(
    title = "Salary Distribution by Company Size",
    x = "Company Size",
    y = "Salary (USD)"
  ) +
  scale_fill_manual(values = c("Small" = "blue", "Medium" = "green", "Large" = "purple")) +
  theme_light() +
  theme(
    legend.position = "bottom",
    plot.title = element_text(hjust = 0.5, face = "bold"),
    axis.title = element_text(size = 12)
  )

Customize Themes and Titles: Update a histogram of Salary_USD with a dark theme and larger labels:

ggplot(data, aes(x = Salary_USD)) +
  geom_histogram(binwidth = 10000, fill = 'cyan', color = 'black') +
  labs(
    title = "Salary Distribution",
    x = "Salary (USD)",
    y = "Count"
  ) +
  theme_dark() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 16, face = "italic"),
    axis.text = element_text(size = 10),
    axis.title = element_text(size = 12)
  )

AI-generated code made it easy to apply meaningful customizations, such as adding trendlines, repositioning legends, and updating themes for clarity and style. These enhancements improved plot readability and ensured that key insights, such as relationships between Automation_Risk and Salary_USD, were communicated effectively. Additionally, AI provided efficient syntax for incorporating personal preferences, like choosing colors and adjusting text elements, saving time and reducing guesswork.

Part 4: Advanced Visualizations with AI

4.1 Generating Interactive Visualizations

Here’s how to create an interactive scatter plot of Salary_USD vs. Automation_Risk with color coding for AI_Adoption_Level using plotly:

library(plotly)

# Interactive scatter plot
fig <- plot_ly(
  data = data, 
  x = ~Automation_Risk, 
  y = ~Salary_USD, 
  color = ~AI_Adoption_Level, 
  type = 'scatter', 
  mode = 'markers',
  marker = list(size = 10)
) %>%
  layout(
    title = "Automation Risk vs. Salary with AI Adoption Level",
    xaxis = list(title = "Automation Risk Level"),
    yaxis = list(title = "Salary (USD)")
  )

fig

Adding Hover Information: Enhance the scatter plot to display additional details like Job_Title and Location when hovering:

fig <- plot_ly(
  data = data, 
  x = ~Automation_Risk, 
  y = ~Salary_USD, 
  color = ~AI_Adoption_Level, 
  type = 'scatter', 
  mode = 'markers',
  text = ~paste("Job Title:", Job_Title, "<br>Location:", Location),
  hoverinfo = 'text',
  marker = list(size = 10)
) %>%
  layout(
    title = "Automation Risk vs. Salary with Hover Information",
    xaxis = list(title = "Automation Risk Level"),
    yaxis = list(title = "Salary (USD)")
  )

fig
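
The hover label is an ordinary character vector built by `paste()`, one string per row, with `<br>` acting as a line break inside the tooltip. A quick base R sketch shows what gets passed to `text`; the two example rows reuse values visible in the `str(data)` output, but the vectors themselves are illustrative:

```r
# Two illustrative rows
job_title <- c("AI Researcher", "Sales Manager")
location  <- c("Singapore", "Berlin")

# paste() joins its arguments with a single space by default
hover_text <- paste("Job Title:", job_title, "<br>Location:", location)
print(hover_text)
```

Setting `hoverinfo = 'text'` then tells plotly to show only this string and suppress the default x and y readout.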

Interactive Box Plot: Create an interactive box plot to visualize the distribution of Salary_USD across Company_Size:

fig <- plot_ly(
  data = data, 
  x = ~Company_Size, 
  y = ~Salary_USD, 
  type = 'box', 
  color = ~Company_Size
) %>%
  layout(
    title = "Salary Distribution by Company Size",
    xaxis = list(title = "Company Size"),
    yaxis = list(title = "Salary (USD)")
  )

fig

AI simplified the process of generating interactive visualizations by providing well-structured plotly code tailored to my dataset. It efficiently integrated hover information, color coding, and customizable axis labels, making it easier to highlight key data relationships, such as the impact of Automation_Risk on Salary_USD. Additionally, AI offered options for dynamic exploration, like zooming and filtering, which enhanced the analytical experience and allowed deeper insights without manual trial and error.

Part 5: Project

5.1 AI-Assisted Data Wrangling and Visualization Project

Step 1: Load and Explore the Dataset

Start by loading your dataset and performing initial exploration.

# Load necessary libraries
library(readr)    # read_csv()
library(dplyr)
library(ggplot2)
library(plotly)

data_path <- "C:\\Users\\ntonu\\OneDrive\\Documents\\ai_job_market_insights.csv"

data <- read_csv(data_path)

## Rows: 500 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): Job_Title, Industry, Company_Size, Location, AI_Adoption_Level, Aut...
## dbl (1): Salary_USD
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(data)

str(data)

## spc_tbl_ [500 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Job_Title            : chr [1:500] "Cybersecurity Analyst" "Marketing Specialist" "AI Researcher" "Sales Manager" ...
##  $ Industry             : chr [1:500] "Entertainment" "Technology" "Technology" "Retail" ...
##  $ Company_Size         : chr [1:500] "Small" "Large" "Large" "Small" ...
##  $ Location             : chr [1:500] "Dubai" "Singapore" "Singapore" "Berlin" ...
##  $ AI_Adoption_Level    : chr [1:500] "Medium" "Medium" "Medium" "Low" ...
##  $ Automation_Risk      : chr [1:500] "High" "High" "High" "High" ...
##  $ Required_Skills      : chr [1:500] "UX/UI Design" "Marketing" "UX/UI Design" "Project Management" ...
##  $ Salary_USD           : num [1:500] 111392 93793 107170 93028 87753 ...
##  $ Remote_Friendly      : chr [1:500] "Yes" "No" "Yes" "No" ...
##  $ Job_Growth_Projection: chr [1:500] "Growth" "Decline" "Growth" "Growth" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Job_Title = col_character(),
##   ..   Industry = col_character(),
##   ..   Company_Size = col_character(),
##   ..   Location = col_character(),
##   ..   AI_Adoption_Level = col_character(),
##   ..   Automation_Risk = col_character(),
##   ..   Required_Skills = col_character(),
##   ..   Salary_USD = col_double(),
##   ..   Remote_Friendly = col_character(),
##   ..   Job_Growth_Projection = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

summary(data)

##   Job_Title           Industry         Company_Size         Location        
##  Length:500         Length:500         Length:500         Length:500        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  AI_Adoption_Level  Automation_Risk    Required_Skills      Salary_USD    
##  Length:500         Length:500         Length:500         Min.   : 31970  
##  Class :character   Class :character   Class :character   1st Qu.: 78512  
##  Mode  :character   Mode  :character   Mode  :character   Median : 91998  
##                                                           Mean   : 91222  
##                                                           3rd Qu.:103971  
##                                                           Max.   :155210  
##  Remote_Friendly    Job_Growth_Projection
##  Length:500         Length:500           
##  Class :character   Class :character     
##  Mode  :character   Mode  :character     
##                                          
##                                          
##
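
For character columns, `summary()` reports only length and class, so it says little about nine of the ten columns here. Frequency tables are one way to explore them. A minimal base R sketch on a made-up data frame (`toy` is hypothetical, though the category labels mirror those seen in the `str()` output):

```r
# Made-up rows with two categorical columns
toy <- data.frame(
  Company_Size    = c("Small", "Large", "Large", "Small", "Medium"),
  Automation_Risk = c("High", "High", "High", "High", "Low"),
  stringsAsFactors = FALSE
)

# Counts per category in one column
size_counts <- table(toy$Company_Size)
print(size_counts)

# Cross-tabulation of two categorical columns
cross <- table(toy$Company_Size, toy$Automation_Risk)
print(cross)
```

Running the same calls on the real columns also reveals the exact spelling of each category, which the later `case_when()` and `scale_fill_manual()` steps depend on.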

Step 2: Data Wrangling with AI Assistance

1. Handling Missing Values

Impute missing values in Salary_USD with the median and remove rows with more than 50% missing values.

# Replace missing values in Salary_USD with the median
data <- data %>%
  mutate(Salary_USD = ifelse(is.na(Salary_USD), median(Salary_USD, na.rm = TRUE), Salary_USD))

# Remove rows with more than 50% missing values
data <- data[rowSums(is.na(data)) <= ncol(data) / 2, ]

2. Transforming Columns

Create a new column Risk_Level based on Automation_Risk and AI_Adoption_Level.

# Add Risk_Level column
data <- data %>% mutate(Risk_Level = case_when(
  Automation_Risk == "High" & AI_Adoption_Level == "High" ~ "Very Risky",
  Automation_Risk == "Medium" & AI_Adoption_Level == "Medium" ~ "Moderately Risky",
  TRUE ~ "Low Risk"
))

3. Categorizing Continuous Variables

Bin Salary_USD into Low, Medium, and High salary bands.

# Categorize Salary_USD into bands
data <- data %>% mutate(Salary_Band = cut(
  Salary_USD,
  breaks = c(0, 80000, 120000, Inf),
  labels = c("Low", "Medium", "High")
))

Step 3: Visualizations with AI Assistance

1. Histogram of Salary Distribution

Visualize the distribution of salaries.

# Histogram
ggplot(data, aes(x = Salary_USD)) +
  geom_histogram(binwidth = 10000, fill = "blue", color = "black") +
  labs(title = "Salary Distribution", x = "Salary (USD)", y = "Count")

2. Box Plot of Salary by Company Size

Analyze how salaries vary across company sizes.

# Box plot
ggplot(data, aes(x = Company_Size, y = Salary_USD, fill = Company_Size)) +
  geom_boxplot() +
  labs(title = "Salary Distribution by Company Size", x = "Company Size", y = "Salary (USD)")

3. Scatter Plot of Salary vs. Automation Risk

Explore the relationship between Salary_USD and Automation_Risk.

# Scatter plot
ggplot(data, aes(x = Automation_Risk, y = Salary_USD, color = AI_Adoption_Level)) +
  geom_point(size = 3) +
  labs(title = "Salary vs Automation Risk", x = "Automation Risk Level", y = "Salary (USD)")

Step 4: Interactive Visualization

Create an interactive scatter plot with plotly.

# Interactive scatter plot
fig <- plot_ly(
  data = data,
  x = ~Automation_Risk,
  y = ~Salary_USD,
  color = ~AI_Adoption_Level,
  text = ~paste("Job Title:", Job_Title, "<br>Location:", Location),
  type = 'scatter',
  mode = 'markers',
  marker = list(size = 10)
) %>%
  layout(
    title = "Interactive Scatter Plot: Salary vs Automation Risk",
    xaxis = list(title = "Automation Risk Level"),
    yaxis = list(title = "Salary (USD)")
  )

fig

Prompts Used:

“Suggest R code to handle missing values by replacing them with the median.”
“How can I create a column combining Automation_Risk and AI_Adoption_Level to assess risk?”
“Generate a box plot of salaries by company size using ggplot2.”
“Create an interactive scatter plot with hover information using plotly in R.”

Results and Insights:

Salary distribution shows most employees earn between $80,000 and $120,000.
Smaller companies often have lower salary ranges than larger companies.
High automation risk is correlated with higher salaries in some cases, suggesting specialized roles.
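
Observations like these are easier to trust when backed by a number. The sketch below, on made-up vectors (the salaries and company sizes are hypothetical, not the real data), computes the share of salaries inside a band and the median salary per company size:

```r
# Made-up salaries and company sizes, for illustration only
salaries     <- c(45000, 76000, 82000, 88000, 95000, 101000, 118000, 130000)
company_size <- c("Small", "Small", "Small", "Medium", "Medium", "Large", "Large", "Large")

# Share of salaries between 80,000 and 120,000 (the mean of a logical vector is a proportion)
share_in_band <- mean(salaries >= 80000 & salaries <= 120000)
print(share_in_band)

# Median salary within each company size
median_by_size <- tapply(salaries, company_size, median)
print(median_by_size)
```

Applied to the real columns, the same two calls would show whether "most" and "often" hold and by how much.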

Visualizations Included: Histogram, Box Plot, Scatter Plot, and Interactive Scatter Plot.

Evaluation Questions

1. How did AI improve your data wrangling workflow?

AI significantly improved the data wrangling process by automating repetitive tasks like handling missing values and categorizing data into meaningful groups. It provided efficient solutions, such as using mutate() and case_when() for creating new columns based on conditional logic, saving time and reducing errors. Additionally, AI suggested optimized methods and introduced creative transformations that enhanced the dataset’s structure for analysis.

2. What was the most surprising aspect of using AI for data visualization?

The most surprising aspect was AI’s ability to generate detailed and visually appealing plots with minimal input. It not only suggested standard visualizations like histograms and scatter plots but also offered advanced customizations, such as adding trendlines, hover information, and dynamic interactivity using plotly. The ease with which AI incorporated these features, particularly for interactive visualizations, made the process more engaging and insightful than expected.

3. How can you integrate AI assistance into your daily workflow as a data scientist?

AI can be integrated into daily workflows by serving as an on-demand assistant for coding, debugging, and exploring new techniques. It can streamline common tasks like data cleaning, feature engineering, and visualization generation, allowing more time for analysis and interpretation. Furthermore, AI can help keep workflows efficient by recommending best practices and introducing new libraries or tools, making it a valuable companion in both routine and complex data science projects.

Oma Tonukari

2024-11-20