This document is an R Markdown Notebook containing an analysis of a public dataset of Midjourney text-to-image prompts collected over a period in 2022. The original data was scraped from Discord and stored in JSON format; it is available here: [https://www.kaggle.com/datasets/succinctlyai/midjourney-texttoimage]. A transformed CSV version of the data set, with much of the extraneous data removed, is here: [https://www.kaggle.com/datasets/ldmtwo/midjourney-250k-csv]
It aims to show some basic information about how often prompts were repeated, whether as standalone prompts or as modifiers to an existing image, and to show the most common aspect ratios specified.
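We start with code that streamlines package installation. The rjson package handles JSON data, but it is only needed for the optional JSON example further down and is not part of the package list here.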
# Install packages
list_of_packages <- c("ggplot2", "gdata", "dplyr", "scales", "stringr", "tidyr", "readr")
new_packages <- list_of_packages[!(list_of_packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)
# Load libraries by applying library to list
lapply(list_of_packages, library, character.only=TRUE)
## gdata: read.xls support for 'XLS' (Excel 97-2004) files ENABLED.
##
## gdata: read.xls support for 'XLSX' (Excel 2007+) files ENABLED.
##
## Attaching package: 'gdata'
## The following object is masked from 'package:stats':
##
## nobs
## The following object is masked from 'package:utils':
##
## object.size
## The following object is masked from 'package:base':
##
## startsWith
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:gdata':
##
## combine, first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
##
## Attaching package: 'readr'
## The following object is masked from 'package:scales':
##
## col_factor
## [[1]]
## [1] "ggplot2" "stats" "graphics" "grDevices" "utils" "datasets"
## [7] "methods" "base"
##
## [[2]]
## [1] "gdata" "ggplot2" "stats" "graphics" "grDevices" "utils"
## [7] "datasets" "methods" "base"
##
## [[3]]
## [1] "dplyr" "gdata" "ggplot2" "stats" "graphics" "grDevices"
## [7] "utils" "datasets" "methods" "base"
##
## [[4]]
## [1] "scales" "dplyr" "gdata" "ggplot2" "stats" "graphics"
## [7] "grDevices" "utils" "datasets" "methods" "base"
##
## [[5]]
## [1] "stringr" "scales" "dplyr" "gdata" "ggplot2" "stats"
## [7] "graphics" "grDevices" "utils" "datasets" "methods" "base"
##
## [[6]]
## [1] "tidyr" "stringr" "scales" "dplyr" "gdata" "ggplot2"
## [7] "stats" "graphics" "grDevices" "utils" "datasets" "methods"
## [13] "base"
##
## [[7]]
## [1] "readr" "tidyr" "stringr" "scales" "dplyr" "gdata"
## [7] "ggplot2" "stats" "graphics" "grDevices" "utils" "datasets"
## [13] "methods" "base"
# Alternative: load libraries with an explicit loop
# for (p in list_of_packages) {
#   library(p, character.only = TRUE)
# }
NOTE: R is terribly inefficient at combining multiple large raw JSON files, so for practical purposes we only use the CSV data set. If you need to review the original data set, here is some example code showing how to combine those files and read the JSON.
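# library(rjson)  # provides fromJSON(file = ...)
# combined_data_name <- "combined_data.json"
# force_write <- FALSE
#
# # Rebuild the combined file only if it is missing or a rewrite is forced
# if (!file.exists(combined_data_name) || force_write) {
#   raw_json_text <- character(0)
#   json_files <- list.files(path = "data", full.names = TRUE)
#   for (f in json_files) {
#     file_text <- readLines(f)
#     raw_json_text <- c(raw_json_text, file_text)
#   }
#
#   writeLines(raw_json_text, combined_data_name)
# }
# # Caveat: plain concatenation only parses if the combined text is
# # itself a single valid JSON document
# full_json <- fromJSON(file = combined_data_name)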
mj_data <- read.csv("midjourney_2022_250k_raw.csv")
First, we use distinct on the raw data set to get an idea of how many prompts are re-run because the user wants a different output for the same prompt.
mj_distinct <- distinct(mj_data, X_message)
total_prompts <- nrow(mj_data)
total_unique_prompts <- nrow(mj_distinct)
# Share of rows that repeat an earlier prompt
dupe_percent <- 100 - (total_unique_prompts / total_prompts * 100)
ggplot(mj_data) +
  geom_bar(mapping = aes(x = "Prompts"), fill = "green") +
  geom_bar(data = mj_distinct, mapping = aes(x = "Unique Prompts"), fill = "blue") +
  labs(title = "Total vs Unique Prompts")
We see the share of duplicate prompts:
dupe_percent
## [1] 27.37872
So about 27.4% of the prompts are duplicates, and we posit that for over 1/4 of prompts the user wanted a new image generated without having to alter any of the parameters of the request.
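To see which prompts drive the duplication, one option is to count occurrences per message and sort. The sketch below uses a small made-up data frame (`toy`, a hypothetical stand-in for `mj_data`) so that it runs on its own; the messages are invented for illustration, and only the column name `X_message` comes from the data set.

```r
library(dplyr)

# Hypothetical stand-in for mj_data; the messages are invented
toy <- data.frame(
  X_message = c("**a red fox**", "**a red fox**", "**a red fox**",
                "**city at night --ar 16:9**", "**city at night --ar 16:9**",
                "**an old lighthouse**")
)

# Count how often each message occurs, most frequent first
repeat_counts <- toy %>%
  count(X_message, name = "times_run", sort = TRUE)

# Same duplicate share as dupe_percent: rows beyond the first occurrence
toy_dupe_percent <- 100 - (n_distinct(toy$X_message) / nrow(toy) * 100)

print(repeat_counts)
print(toy_dupe_percent)
```

Running the same `count()` call on `mj_data` should list the most frequently re-run prompts at the top, which helps check whether the duplicates are spread evenly or concentrated in a few messages.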
Next, we do some work on the table to distinguish between text-to-image prompts and image-to-image prompts with additional keywords applied, and parse out some useful data.
mjd <- mj_data
# Prompt text: everything between the opening ** and the next asterisk
mjd <- mutate(mjd, prompt = str_replace(str_extract(X_message, "\\*\\*[^\\*]*"), "\\*\\*", ""))
# Leading image link, if the prompt starts with <...>
mjd <- mutate(mjd, mjd_prompt_img = str_replace(str_extract(prompt, "^\\<[^\\>]*"), "\\<", ""))
# Aspect ratio from the --ar flag; prompts without the flag are treated as 1:1
mjd <- mutate(mjd, aspect = str_replace(str_extract(prompt, "--ar [:digit:]+:[:digit:]+"), "--ar ", ""))
mjd <- mutate(mjd, aspect_with_default = replace_na(aspect, "1:1"))
mjd <- mutate(mjd, ar_width = parse_number(str_extract(aspect_with_default, "^[:digit:]+")))
mjd <- mutate(mjd, ar_height = parse_number(str_replace(str_extract(aspect_with_default, ":[:digit:]+"), ":", "")))
# Flags: the message starts with an image link / the link is hosted on s.mj.run
mjd <- mutate(mjd, is_img_to_img = startsWith(X_message, "**<http"))
mjd <- mutate(mjd, is_img_from_mj = grepl("https://s.mj.run/", X_message, fixed = TRUE))
ggplot(mjd) +
  geom_bar(mapping = aes(x = "Prompts", fill = is_img_to_img), position = "stack")
Here we find that only half of the prompts are standalone text instructions that do not refer to a prior image.
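The extraction rules are easier to follow on a few hand-written messages. The sketch below applies the same ideas to invented strings (the URLs and prompt texts are hypothetical, and `[0-9]` stands in for the digit class) and then computes the image-to-image share as the mean of the logical flag.

```r
library(stringr)

# Invented messages that open with ** like the X_message column
msgs <- c(
  "**<https://s.mj.run/abc123> castle on a hill --ar 16:9**",
  "**a watercolor owl**",
  "**<https://example.com/cat.png> cat as an astronaut --ar 2:3**",
  "**misty forest --ar 3:2**"
)

# Text between the opening ** and the next asterisk
prompt <- str_replace(str_extract(msgs, "\\*\\*[^\\*]*"), "\\*\\*", "")

# Aspect ratio after --ar; NA when the flag is absent
aspect <- str_replace(str_extract(prompt, "--ar [0-9]+:[0-9]+"), "--ar ", "")

# TRUE when the message opens with an image link
is_img_to_img <- startsWith(msgs, "**<http")

# Share of image-to-image prompts: the mean of a logical vector
img_share <- mean(is_img_to_img)

print(data.frame(prompt, aspect, is_img_to_img))
print(img_share)
```

On the real table, `mean(mjd$is_img_to_img)` would give the exact proportion that the stacked bar shows visually.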
Now we look for how common different aspect ratios are, with the scatterplot point size representing how frequent the ratio is.
ggplot(mjd, mapping = aes(x = ar_width, y = ar_height, color = ar_width / ar_height)) + geom_count()
Some outliers make our chart hard to read, so let’s zoom in a bit and ignore extreme ratios (2 to 1 or greater in either direction).
mjd_filtered <- filter(mjd, (ar_width / ar_height) < 2 & (ar_height / ar_width) < 2)
ggplot(mjd_filtered, mapping = aes(x = ar_width, y = ar_height, color = ar_width / ar_height)) + geom_count()
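A plain frequency table can complement the scatterplot when the goal is to name the most common ratios. The following is a minimal sketch on an invented vector of ratios (`toy_aspects` is a hypothetical stand-in for the `aspect_with_default` column of `mjd`), so the counts it prints say nothing about the real data.

```r
library(dplyr)

# Hypothetical stand-in for the aspect_with_default column built earlier
toy_aspects <- data.frame(
  aspect_with_default = c("1:1", "16:9", "1:1", "2:3", "16:9", "1:1", "3:2")
)

# Tally each ratio and express it as a share of all prompts
aspect_counts <- toy_aspects %>%
  count(aspect_with_default, name = "prompts", sort = TRUE) %>%
  mutate(share = prompts / sum(prompts))

print(aspect_counts)
```

Applied to `mjd`, the same two steps would rank the ratios by how often they were requested, with prompts lacking an `--ar` flag counted under 1:1.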
This notebook shows some basic manipulation of the public data set of Midjourney prompt commands, looking specifically at prompt repetition and aspect ratios.