This post takes the description text from IMDb’s top 50 grossing movies and visualizes it using a few modern text-analysis techniques.
length(description_data)
## [1] 50
head(description_data)
## [1] " T'Challa, heir to the hidden but advanced kingdom of Wakanda, must step forward to lead his people into a new future and must confront a challenger from his country's past."
## [2] " The Avengers and their allies must be willing to sacrifice all in an attempt to defeat the powerful Thanos before his blitz of devastation and ruin puts an end to the universe."
## [3] " The Incredibles hero family takes on a new mission, which involves a change in family roles: Bob Parr (Mr Incredible) must manage the house while his wife Helen (Elastigirl) goes out to save the world."
## [4] " When the island's dormant volcano begins roaring to life, Owen and Claire mount a campaign to rescue the remaining dinosaurs from this extinction-level event."
## [5] " Foul-mouthed mutant mercenary Wade Wilson (AKA. Deadpool), brings together a team of fellow mutant rogues to protect a young boy with supernatural abilities from the brutal, time-traveling cyborg, Cable."
## [6] " Arthur Curry learns that he is the heir to the underwater kingdom of Atlantis, and must step forward to lead his people and be a hero to the world."
First, we need to get rid of all the commas, since we’re going to use them to separate the descriptions when we collapse them into a single string for analysis. So let’s strip them out (each comma is already followed by a space, so the words stay separated).
description_data <- gsub(",", "", description_data)
head(description_data)
## [1] " T'Challa heir to the hidden but advanced kingdom of Wakanda must step forward to lead his people into a new future and must confront a challenger from his country's past."
## [2] " The Avengers and their allies must be willing to sacrifice all in an attempt to defeat the powerful Thanos before his blitz of devastation and ruin puts an end to the universe."
## [3] " The Incredibles hero family takes on a new mission which involves a change in family roles: Bob Parr (Mr Incredible) must manage the house while his wife Helen (Elastigirl) goes out to save the world."
## [4] " When the island's dormant volcano begins roaring to life Owen and Claire mount a campaign to rescue the remaining dinosaurs from this extinction-level event."
## [5] " Foul-mouthed mutant mercenary Wade Wilson (AKA. Deadpool) brings together a team of fellow mutant rogues to protect a young boy with supernatural abilities from the brutal time-traveling cyborg Cable."
## [6] " Arthur Curry learns that he is the heir to the underwater kingdom of Atlantis and must step forward to lead his people and be a hero to the world."
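To make the comma step concrete, here is a minimal sketch on a made-up sentence (`toy` is a hypothetical stand-in, not one of the real descriptions). It compares deleting the commas with swapping them for spaces; `fixed = TRUE` is an optional extra that treats the pattern as a literal string rather than a regular expression.

```r
# Hypothetical stand-in for a single description
toy <- "T'Challa, heir to Wakanda, must lead."

# Delete the commas outright; the spaces that followed them are kept
no_commas <- gsub(",", "", toy, fixed = TRUE)

# Swap each comma for a space instead; this leaves double spaces behind
spaced <- gsub(",", " ", toy, fixed = TRUE)

print(no_commas)
print(spaced)
```

Either version works for the later steps, since extra whitespace gets stripped further down anyway.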
Looks like that did the trick. Now let’s collapse all of these into a single string.
string <- paste(description_data, collapse = ", ")
length(string)
## [1] 1
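If the `## [1] 1` looks odd, here is a small sketch of what is going on, using a made-up three-element vector (`toy` is hypothetical, not the movie data). `length()` counts the elements of a vector, so a collapsed string always has length 1, while `nchar()` reports how many characters it holds.

```r
# Hypothetical stand-in for description_data
toy <- c("first description.", "second description.", "third description.")

# collapse = ", " joins every element into one string
joined <- paste(toy, collapse = ", ")

print(length(toy))     # number of elements before collapsing
print(length(joined))  # a single string is a character vector of length 1
print(nchar(joined))   # number of characters in the collapsed string

# With the original commas gone, splitting on ", " recovers the pieces
pieces <- strsplit(joined, ", ", fixed = TRUE)[[1]]
print(identical(pieces, toy))
```

This round trip is the reason for removing the commas first: the separator then only appears between descriptions.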
Great, looks like it worked! But the text still contains clutter such as periods and other punctuation marks. If we want to analyze the text coherently, we’ll have to get rid of them.
I followed this guide: http://www.sthda.com/english/wiki/text-mining-and-word-cloud-fundamentals-in-r-5-simple-steps-you-should-know where someone has already written code that walks through these steps.
# Load the text mining package that provides Corpus() and tm_map()
library(tm)
# Transform into a corpus (basically a text data format)
docs <- Corpus(VectorSource(string))
inspect(docs)
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 1
##
## [1] T'Challa heir to the hidden but advanced kingdom of Wakanda must step forward to lead his people into a new future and must confront a challenger from his country's past., The Avengers and their allies must be willing to sacrifice all in an attempt to defeat the powerful Thanos before his blitz of devastation and ruin puts an end to the universe., The Incredibles hero family takes on a new mission which involves a change in family roles: Bob Parr (Mr Incredible) must manage the house while his wife Helen (Elastigirl) goes out to save the world., When the island's dormant volcano begins roaring to life Owen and Claire mount a campaign to rescue the remaining dinosaurs from this extinction-level event., Foul-mouthed mutant mercenary Wade Wilson (AKA. Deadpool) brings together a team of fellow mutant rogues to protect a young boy with supernatural abilities from the brutal time-traveling cyborg Cable., Arthur Curry learns that he is the heir to the underwater kingdom of Atlantis and must step forward to lead his people and be a hero to the world., Ethan Hunt and his IMF team along with some familiar allies race against time after a mission gone wrong., As Scott Lang balances being both a Super Hero and a father Hope van Dyne and Dr. 
Hank Pym present an urgent new mission that finds the Ant-Man fighting alongside The Wasp to uncover secrets from their past., During an adventure into the criminal underworld Han Solo meets his future co-pilot Chewbacca and encounters Lando Calrissian years before joining the Rebellion., When Eddie Brock acquires the powers of a symbiote he will have to release his alter-ego "Venom" to save his life., After escaping an attack by what he claims was a 70-foot shark Jonas Taylor must confront his fears to save those trapped in a sunken submersible., Debbie Ocean gathers an all-female crew to attempt an impossible heist at New York City's yearly Met Gala., When the creator of a virtual reality world called the OASIS dies he releases a video in which he challenges all OASIS users to find his Easter Egg which will give the finder his fortune., Teen Miles Morales becomes Spider-Man of his reality crossing his path with five counterparts from other dimensions to stop a threat for all realities., Robert McCall serves an unflinching justice for the exploited and oppressed but how far will he go when that is someone he loves?, When three different animals become infected with a dangerous pathogen a primatologist and a geneticist team up to stop them from destroying Chicago., On the run in the year of 1987 Bumblebee finds refuge in a junkyard in a small Californian beach town. Charlie on the cusp of turning 18 and trying to find her place in the world discovers Bumblebee battle-scarred and broken., America's third political party the New Founding Fathers of America comes to power and conducts an experiment: no laws for 12 hours on Staten Island. 
No one has to stay on the island but $5000 is given to anyone who does., A group of friends who meet regularly for game nights find themselves entangled in a real-life mystery when the shady brother of one of them is seemingly kidnapped by dangerous gangsters., A security expert must infiltrate a burning skyscraper 225 stories above ground when his family is trapped inside by criminals., Jake Pentecost son of Stacker Pentecost reunites with Mako Mori to lead a new generation of Jaeger pilots including rival Lambert and 15-year-old hacker Amara against a new Kaiju threat., Young hero Thomas embarks on a mission to find a cure for a deadly disease known as "The Flare"., Lara Croft the fiercely independent daughter of a missing adventurer must push herself beyond her limits when she discovers the island where her father disappeared., When a young boy accidentally triggers the universe's most lethal hunters' return to Earth only a ragtag crew of ex-soldiers and a disgruntled scientist can prevent the end of the human race., The drug war on the U.S.-Mexico border has escalated as the cartels have begun trafficking terrorists across the US border. To fight the war federal agent Matt Graver re-teams with the mercurial Alejandro., Ballerina Dominika Egorova is recruited to 'Sparrow School' a Russian intelligence service where she is forced to use her body as a weapon. Her first mission targeting a C.I.A. agent threatens to unravel the security of both nations., A woman fights to protect her family during a home invasion., 12 Strong tells the story of the first Special Forces team deployed to Afghanistan after 9/11; under the leadership of a new captain the team must work with an Afghan warlord to take down the Taliban., A gritty crime saga which follows the lives of an elite unit of the LA County Sheriff's Dept. 
and the state's most successful bank robbery crew as the outlaws plan a seemingly impossible heist on the Federal Reserve Bank., An Insurance Salesman/Ex-Cop is caught up in a criminal conspiracy during his daily commute home., An elite American intelligence officer aided by a top-secret tactical command unit tries to smuggle a mysterious police officer with sensitive information out of Indonesia., Five years after her husband and daughter are killed in a senseless act of violence a woman comes back from self-imposed exile to seek revenge against those responsible and the system that let them go free., Dr. Paul Kersey is an experienced trauma surgeon a man who has spent his life saving lives. After an attack on his family Paul embarks on his own mission for justice., Audrey and Morgan are best friends who unwittingly become entangled in an international conspiracy when one of the women discovers the boyfriend who dumped her was actually a spy., A true story of survival as a young couple's chance encounter leads them first to love and then on the adventure of a lifetime as they face one of the most catastrophic hurricanes in recorded history., A war-hardened Crusader and his Moorish commander mount an audacious revolt against the corrupt English crown in a thrilling action-adventure packed with gritty battlefield exploits mind-blowing fight choreography and a timeless romance., A villain's maniacal plan for world domination sidetracks five teenage superheroes who dream of Hollywood stardom., A small group of American soldiers find horror behind enemy lines on the eve of D-Day., Mary (Taraji P. 
Henson) is a hit woman working for an organized crime family in Boston whose life is completely turned around when she meets a young boy whose path she crosses when a professional hit goes bad., When the puppet cast of a '90s children's TV show begin to get murdered one by one a disgraced LAPD detective-turned-private eye puppet takes on the case., With retirement on his mind a successful young drug dealer sets up one last big job while dealing with trigger-happy colleagues and the police., In a post-apocalyptic world where cities ride on wheels and consume each other to survive two people meet in London and try to stop a conspiracy., An untested American submarine captain teams with U.S. Navy Seals to rescue the Russian president who has been kidnapped by a rogue general., Young computer hacker Lisbeth Salander and journalist Mikael Blomkvist find themselves caught in a web of spies cybercriminals and corrupt government officials., Set in the near-future technology controls nearly all aspects of life. But when Grey a self-identified technophobe has his world turned upside down his only hope for revenge is an experimental computer chip implant called Stem., When the Universe decides what it wants it's pointless to resist. With his family's life at stake Joseph Steadman finds himself the unwilling test subject of a maniacal scientist in a battle that could save the world or destroy it., Set in riot-torn near-future Los Angeles 'Hotel Artemis' follows the Nurse who runs a secret members-only emergency room for criminals., A.X.L. is a top-secret robotic dog who develops a special friendship with Miles and will go to any length to protect his new companion., Thieves attempt a massive heist against the U.S. Treasury as a Category 5 hurricane approaches one of its Mint facilities., All Might and Deku accept an invitation to visit a floating man-made city called I Island where they meet a girl and battle against a villain who takes the island hostage.
# Convert the text to lower case
docs <- tm_map(docs, content_transformer(tolower))
## Warning in tm_map.SimpleCorpus(docs, content_transformer(tolower)):
## transformation drops documents
# Remove numbers
docs <- tm_map(docs, removeNumbers)
## Warning in tm_map.SimpleCorpus(docs, removeNumbers): transformation drops
## documents
# Remove english common stopwords
docs <- tm_map(docs, removeWords, stopwords("english"))
## Warning in tm_map.SimpleCorpus(docs, removeWords, stopwords("english")):
## transformation drops documents
# Remove custom stop words, given as a character vector
# ("blabla1" and "blabla2" are placeholders from the guide; substitute your own)
docs <- tm_map(docs, removeWords, c("blabla1", "blabla2"))
## Warning in tm_map.SimpleCorpus(docs, removeWords, c("blabla1", "blabla2")):
## transformation drops documents
# Remove punctuation
docs <- tm_map(docs, removePunctuation)
## Warning in tm_map.SimpleCorpus(docs, removePunctuation): transformation
## drops documents
# Eliminate extra white spaces
docs <- tm_map(docs, stripWhitespace)
## Warning in tm_map.SimpleCorpus(docs, stripWhitespace): transformation drops
## documents
# Text stemming
# docs <- tm_map(docs, stemDocument)
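To see roughly what those transformations do, here is a base R sketch that applies the same steps, in the same order, to a made-up sentence. This is only an approximation under my own assumptions, not how tm implements them: the sentence and the tiny `stop_words` list are hypothetical, and the real `stopwords("english")` list is much longer.

```r
# Made-up sentence, not taken from the movie data
x <- "The Avengers fight Thanos in 2018, and the world must be saved!"

# Lower case
x <- tolower(x)

# Remove numbers
x <- gsub("[[:digit:]]+", "", x)

# Remove stop words (tiny hypothetical list), matching whole words only
stop_words <- c("the", "in", "and", "be")
pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
x <- gsub(pattern, "", x, perl = TRUE)

# Remove punctuation
x <- gsub("[[:punct:]]+", "", x)

# Collapse runs of whitespace and trim the ends
x <- trimws(gsub("\\s+", " ", x))

print(x)
```

What survives is just the content words, which is exactly the shape we want before counting anything.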
Time for the fun stuff. Let’s start with a simple word cloud of the most used words. To do this I’ll need the help of several packages.
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
head(d, 10)
## word freq
## must must 9
## new new 9
## one one 8
## world world 8
## young young 7
## family family 6
## find find 6
## life life 6
## mission mission 6
## island island 5
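The frequency table above is really just "split into words, count, sort". Here is a base R sketch of that idea on a made-up string (`cleaned` is a hypothetical stand-in for the cleaned corpus text). It is not what `TermDocumentMatrix()` does internally; for one thing, tm drops words shorter than three characters by default, which this sketch does not.

```r
# Made-up cleaned text standing in for the corpus contents
cleaned <- "world must new world hero must world"

# Split on single spaces to get one word per element
words <- strsplit(cleaned, " ", fixed = TRUE)[[1]]

# Count each distinct word, then sort from most to least frequent
counts <- table(words)
v_toy <- sort(counts, decreasing = TRUE)

# Same two-column layout as the data frame d
d_toy <- data.frame(word = names(v_toy), freq = as.vector(v_toy))
print(d_toy)
```

For a single collapsed document like ours, the counts from the term-document matrix boil down to the same thing.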
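# Packages for drawing the word cloud and for its color palette
library(wordcloud)
library(RColorBrewer)
# Fix the random seed so the layout is reproducible
set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 3,
          max.words = 200, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))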
You can kind of tell that the top 50 movies all have an element of action to them. “Must”, “new”, “one”, and “world” dominate the word cloud. Seems like someone must do something in the world at some point.
That concludes this super quick demo of making a word cloud from a previous dataset. Next I’ll do something way cooler…I think.