The objective of this project is to design an AI-powered data profiling tool that can automatically summarize, assess, and highlight issues within any structured dataset. The goal is to help analysts quickly gain insight into the quality and structure of their data and prepare it for analysis or modeling.
This tool, titled SmartInsights, leverages R and generative AI logic to automate the "first glance" analysis process (inspecting missing values, data types, outliers, unique values, and format issues) and produce human-readable reports.
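```{r load-packages, message=FALSE, warning=FALSE}
# tidyr is included because the cleaning step uses replace_na()
required_packages <- c("readr", "dplyr", "tidyr", "janitor", "skimr", "ggplot2", "lubridate", "rmarkdown")

# Install any package that is missing, then attach it
for (pkg in required_packages) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, quiet = TRUE)
  }
  library(pkg, character.only = TRUE)
}
```

```{r load-data, message=FALSE, warning=FALSE}
# Read the dataset from the path supplied as a report parameter and preview its structure
data <- read_csv(params$dataset_path)
glimpse(data)
```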
# Stage 3 - Process: Clean the Data
```{r clean-data, message=FALSE, warning=FALSE}
# Load the dataset using the path passed from the R script
dataset <- read_csv(params$dataset_path) %>%
  clean_names()

# Remove columns where every value is missing
dataset <- dataset[, colSums(is.na(dataset)) < nrow(dataset)]

# Replace missing values with "missing" in character columns only;
# numeric columns keep NA so they stay numeric for the profiling step
dataset <- dataset %>%
  mutate(across(where(is.character), ~ tidyr::replace_na(.x, "missing")))

# Standardize text columns to lowercase
dataset <- dataset %>%
  mutate(across(where(is.character), tolower))
```
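As a quick sanity check of the cleaning logic, the sketch below applies the same three steps (drop fully empty columns, fill missing text with "missing", lowercase text) to a small made-up data frame. It is written in base R so it runs without any packages. The `toy` data frame and its values are assumptions for illustration only and are not part of the project's dataset.

```r
# Toy data frame with made-up values (illustration only)
toy <- data.frame(
  name  = c("Alice", NA, "CAROL"),
  score = c(90, NA, 75),
  empty = c(NA, NA, NA),
  stringsAsFactors = FALSE
)

# Drop columns where every value is missing
toy <- toy[, colSums(is.na(toy)) < nrow(toy), drop = FALSE]

# Fill and lowercase character columns only; numeric columns keep their NA
is_text <- vapply(toy, is.character, logical(1))
toy[is_text] <- lapply(toy[is_text], function(x) {
  x[is.na(x)] <- "missing"
  tolower(x)
})

str(toy)
```

Restricting the fill to character columns matters because writing the string "missing" into a numeric column would force that column to text, and the numeric summary in the profiling step would then skip it.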
```{r profile, message=FALSE, warning=FALSE}
# Structure of the dataset
str(dataset)

# Summary statistics for numeric columns
dataset %>%
  select(where(is.numeric)) %>%
  summary()

# Count of unique values in each column
unique_counts <- sapply(dataset, function(x) length(unique(x)))
unique_counts
```
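The project description also lists missing values and outliers among the checks, which the profile chunk above does not yet report. One possible way to add them is sketched below in base R. The helper `count_iqr_outliers()` is hypothetical, and the 1.5 × IQR rule it uses is an assumed choice of outlier definition, not one stated by the project. The `toy` data frame holds made-up values.

```r
# Hypothetical helper: count values outside the 1.5 * IQR fences
count_iqr_outliers <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) return(0L)
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  fence <- 1.5 * (q[2] - q[1])
  sum(x < q[1] - fence | x > q[2] + fence)
}

# Toy data with made-up values (illustration only)
toy <- data.frame(
  amount = c(10, 12, 11, 13, 12, 200, NA),
  city   = c("a", "b", NA, "a", "b", "a", "b"),
  stringsAsFactors = FALSE
)

# Share of missing values per column
missing_share <- colMeans(is.na(toy))

# Outlier counts for numeric columns only
numeric_cols <- vapply(toy, is.numeric, logical(1))
outlier_counts <- vapply(toy[numeric_cols], count_iqr_outliers, integer(1))

missing_share
outlier_counts
```

In the report itself, the same two calculations could be run on `dataset` in place of `toy`. Other outlier rules, such as z-scores, may suit some columns better.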