Gabriela Gamez Sept 14, 2022
csggame001
Notes From HW1 R Markdown
When I loaded PetNames.tsv into OpenRefine, I noticed it had some
issues compared to the in-class exercise (inconsistent names and
duplicated names). My first attempt at cleaning the data was to use
edit cells -> cluster and edit, since this method seems to be very
effective at fixing columns that contain inconsistent and duplicated
names. After I went through all the options for cluster and edit, I
applied another Facet strategy.
I selected the method key collision and the keying function
fingerprint. After I clicked Merge Selected & Re-Cluster, it said
that 1635 cells were edited.
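To make the key collision idea concrete, here is a minimal Python sketch of a fingerprint-style key. It only approximates what OpenRefine does internally; the function names and the sample values are hypothetical and chosen for illustration.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Approximate a fingerprint key: normalize, then sort the unique tokens."""
    text = value.strip().lower()
    # Fold accented characters to their closest ASCII form.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Drop punctuation so "dog." and "dog" produce the same key.
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Group values that share a key; keep only groups with differing spellings."""
    groups = defaultdict(list)
    for v in values:
        groups[fingerprint(v)].append(v)
    return [g for g in groups.values() if len(set(g)) > 1]


# Illustrative values only, not rows copied from PetNames.tsv.
sample = ["Dog", "dog ", "DOG.", "Cat", " cat", "Bird"]
clusters = cluster(sample)
print(clusters)
```

Values that differ only in case, spacing, or punctuation end up with the same key, which is why this method catches so many cells in one pass.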
cluster and edit
I selected the method key collision and the keying function
ngram-fingerprint. It returned "No clusters were found with the
selected method".
cluster and edit
I selected the method key collision and the keying function
metaphone3. I got a lot of names that did not seem to be similar.
I used the Google search engine in case some of the detected names
were the same word in a different language. After I clicked
Merge Selected & Re-Cluster, it said that 1122 cells were edited.
Dog(1115 rows)
Dig
Doggo
Cats(2 rows)
Katze
Ca
Cow
Rabbit(12 rows)
Robot
Cat(492 rows)
God
cluster and edit
I selected the method key collision and the keying function
cologne-phonetic. It just gave me the same options that I had
refused to merge from metaphone3.
cluster and edit
I selected the method key collision and the keying function
Daitch-Mokotoff. It gave clusters that were not even related, for
example Bird and Ferret. So I proceeded to the next keying function,
Beider-Morse. After I clicked Merge Selected & Re-Cluster, it said
that 496 cells were edited.
cluster and edit
I selected the method nearest neighbor and the distance function
levenshtein. It returned "No clusters were found with the selected
method". I tried the distance function PPM, but it said the same
thing. At this point, I realized I had exhausted all the possible
options for cluster and edit, so I proceeded to Facet -> Text facet.
Opening the Text facet
Format of merging is
New Name
To merge inside the Text facet, I had to look at the
Breed column and determine from there what kind of animal it was.
Puppy
Pit bull
Sog
Dog, dog, dog , cat
God
dlg
Phoebe
Luna
(blank)
Mona
Kitten
Kitty Meow
I noticed that for some rows of type Other the breed was blank, so instead of losing some of the data I decided to use the following format.
Other: animal type
Other: Fish
Beta fish
Fish
Goldfish
Other (fish)
Indoor goldfish
Other - notice the breed said Goldfish
Other - notice the breed said Beta fish
Other: Chinchilla
Chinchilla
Chinchilla (Other)
Other - notice the breed said American standard Chinchilla
Other: Rabbit
Rabbit
Bunny
Other: Snake
Other: snake
Snake
Other: Guinea pig
Guinea pig
Other - notice the breed said Guinea pig
Other - notice the breed said American guinea pig
Other - notice the breed said Abyssinian Guineapig
Other: Prairie dog
Other: Lizard
Lizard
Leopard Gecko - this one did not have a breed, but I remembered that Lizard had this name as a breed
Gecko
other - notice the breed said Leopard gecko
Other: Hamster
Hamster
Other - notice the breed said Hamster
Other: Horse
Horse
other - notice the breed said Mustang (horse)
Other: Bees
Other: Hermit Crab
Other: Chicken
Chicken
other - notice the breed said Speckledys
Other: Cow
Other: Elephant
Other: Ferret
Other: Frog
Other: Gerbil
Gerbil
Other - notice the breed said gerbil
Other: Hedgehog
Other: Tortoise
Other: Rat
Rat
Other - notice the breed said Rat
Other: Spider
Other: Spiney leaf insect
I noticed that for some rows of type Other the breed was not an animal.
Other: None animal
Robot
Roomba
Server
Virus
Car
Card Board Poster
I selected the method key collision and the keying function
fingerprint. After I clicked Merge Selected & Re-Cluster, it said
that 41 cells were edited.
cluster and edit
After fingerprint, I tried a different keying function,
ngram-fingerprint, but I felt that its accuracy is not as good as
fingerprint. For example, there is a pet name Chich and another
Ichi. I did not feel comfortable merging these since I feel they
are entirely different names. After I clicked Merge Selected &
Re-Cluster, it said that 6 cells were edited.
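The Chich and Ichi suggestion can be explained with a small sketch of an n-gram fingerprint. This is a simplified Python approximation, assuming an n-gram size of 2; the function name is hypothetical and OpenRefine's own implementation may differ in details.

```python
import string


def ngram_fingerprint(value: str, n: int = 2) -> str:
    """Approximate n-gram fingerprint: sorted, de-duplicated character n-grams."""
    text = value.lower()
    # Remove whitespace and punctuation before cutting the text into n-grams.
    text = "".join(ch for ch in text if not ch.isspace() and ch not in string.punctuation)
    grams = {text[i:i + n] for i in range(len(text) - n + 1)}
    return "".join(sorted(grams))


key_a = ngram_fingerprint("Chich")  # 2-grams: ch, hi, ic, ch
key_b = ngram_fingerprint("Ichi")   # 2-grams: ic, ch, hi
print(key_a, key_b)
```

Both names reduce to the same set of 2-grams (ch, hi, ic), so they share a key even though they read as different names. That is consistent with treating ngram-fingerprint suggestions with more caution than fingerprint ones.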
cluster and edit
After ngram-fingerprint, I tried a different keying function, but
the accuracy became even worse than before, so I decided to change
the method to nearest neighbor instead of key collision. For the
distance function I picked levenshtein. After I clicked
Merge Selected & Re-Cluster, it said that 23 cells were edited.
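For reference, the levenshtein distance counts the single-character insertions, deletions, and substitutions needed to turn one string into another. The following is a plain Python sketch of the standard dynamic-programming version, not OpenRefine's own code; the example words are taken from the pet type values seen earlier.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b using a rolling row of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca with cb
        prev = curr
    return prev[-1]


print(levenshtein("Dog", "Dig"))    # one substitution
print(levenshtein("Dog", "Doggo"))  # two insertions
print(levenshtein("Dog", "God"))    # two substitutions
```

A small distance only says that two spellings are close. It does not say they mean the same thing, so each suggested cluster still needs a manual check before merging.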
cluster and edit
Lastly, I tried the distance function PPM with the nearest neighbor
method. PPM found Stella and Stellabella; I think these are not the
same.
cluster and edit
I realized I had exhausted all the possible options for
cluster and edit, so I proceeded to the next phase using a
combination of edit cells -> Common transforms -> To text,
edit cells -> Common transforms -> To titlecase,
Facet -> Text facet and Facet -> Custom text facet. I am doing
this because there are too many names to edit with just the
Text facet.
Bella Sitka LASTNAM
if (value.contains(" LASTNAME"),value.replace(" LASTNAME",""),value)
if(value.contains(" \("),value.match(/(.*?)\((.*)\)(.*)/)[0],value)
GREL
For some odd reason it was not saving the changes, so after 6 hours I decided to make every change manually.
Format of merging is
New Name
Parentheses within the name and trimming
Ljus
Kookie Pants
Tessie Joan
Henry
Jenson Button
Professor McGongall
Alpha
Mao
Mya
Watson & Crick
Tyler John
Kitty
Lillie
Layla
Myles
Bella Nina
commas within the name and trimming
Buttercup Maplesde
Dr. Wolf Barker
Mephistopheles
Wyatt Daniel
Xena
Blank rows
I removed line 333 because it did not have any information other than the kind of pet
I entered Not available for the other blanks
Blanks FullName
I selected the method key collision and the keying function
fingerprint. After I clicked Merge Selected & Re-Cluster, it said
that 105 cells were edited.
cluster and edit
I selected the method key collision and the keying function
ngram-fingerprint. After I clicked Merge Selected & Re-Cluster, it
said that 10 cells were edited.
cluster and edit
I selected the method nearest neighbor and the distance function
levenshtein. After I clicked Merge Selected & Re-Cluster, it said
that 10 cells were edited.
cluster and edit
I selected the method nearest neighbor and the distance function
PPM. I did not merge Prince and Princess because they are names
for different sexes. After I clicked Merge Selected & Re-Cluster,
it said that 13 cells were edited.
cluster and edit
I recognized I had finished all the possible options for
cluster and edit, so I proceeded to the next phase using
Edit Cells -> Transformation. I think I was using the Facet
before, which was why it was not saving the changes. I began with
a combination of edit cells -> Common transforms -> To text and
edit cells -> Common transforms -> To titlecase.
if(value.contains(" \("),value.match(/(.*?)\((.*)\)(.*)/)[0],value)
if(value.contains(" ,"),value.split(",")[0],value)
if(value.contains(","),value.split(",")[0],value)
if(value.contains("/"),value.split("/")[0],value)
if(value.contains(" ").and(3 < length(value.split(" ")[0])).and(value != null),value.split(" ")[0],value)
if(value.contains(" Or"),value.split(" Or")[0],value)
if(value.contains(" And"),value.split(" And")[0],value)
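To show what this chain of expressions does to a single name, here is a rough Python mirror of it. It is a sketch for illustration rather than a replacement for the GREL: the function name and the sample names are hypothetical, and the final strip() is an addition that the expressions above do not perform.

```python
import re


def trim_name(value: str) -> str:
    """Loosely mirror the GREL chain: cut at "(", ",", "/", " Or" and " And"."""
    m = re.fullmatch(r"(.*?)\((.*)\)(.*)", value)
    if " (" in value and m:
        value = m.group(1)              # keep the text before the parenthesis
    for sep in (",", "/"):
        if sep in value:
            value = value.split(sep)[0]
    first = value.split(" ")[0]
    if " " in value and len(first) > 3:
        value = first                   # keep the first word only if it is longer than 3 characters
    for sep in (" Or", " And"):
        if sep in value:
            value = value.split(sep)[0]
    return value.strip()                # extra whitespace cleanup, not in the GREL above


# Hypothetical names, except "Dr. Wolf Barker", which appears in the lists above.
for name in ["Max (Maximilian)", "Luna, Lulu", "Dr. Wolf Barker", "Salt And Pepper", "Mo And Jo"]:
    print(repr(name), "->", repr(trim_name(name)))
```

The length check on the first word is what keeps a name such as Dr. Wolf Barker intact, because its first token has only three characters.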
After doing the major transformations with GREL, I moved to
Facet -> Text facet and made any additional modifications
manually.
Format for name
New Name
Tom
Moo
The Squeak
Blanks
Even though the Pet’s everyday name (e.g. “Church”) column is empty for these rows, they contain information in other columns.
Banks
For the Pet’s age column I followed instructions similar to the
class exercise. I first converted the Pet’s age to numeric with
edit cells -> Common transforms -> To number. Then I opened the
Number facet with Facet -> Number facet and selected
Non-numeric.
Pet’s Age
I selected the method key collision and the keying function
fingerprint. After I clicked Merge Selected & Re-Cluster, it said
that 27 cells were edited.
Pet’s Age
I selected the method key collision and the keying function
ngram-fingerprint. After I clicked Merge Selected & Re-Cluster, it
said that 10 cells were edited.
Pet’s Age
I selected the method nearest neighbor and the distance function
PPM. After I clicked Merge Selected & Re-Cluster, it said that
2 cells were edited.
Pet’s Age
I converted the Pet’s age once again with edit cells ->
Common transforms -> To number. Then I opened
the Number facet with Facet -> Number facet
and selected Non-numeric.
I manually edited some values. Format of merging is
New Name
Tom
Dead
Deceased at 13
No longer with us :(
11-12 (deceased)
She died at age 13 in 2005 :(
Died
~15 years (would be 17 years today, now deceased, rescue)
Died
I think they were 11 when they passed
Just passed away at age 11
Deceaced 12 years :(
30 (deceased at 13)
Dead pet
8.5
1.5
12.5
0
A few months
0,5
Unknown
Unsure
So so old
We rescued from a fountain. Unknown.
3-Feb
4-Mar
2
Two
2 YO
20
6
8
Eight
8(ish) (rescue)
about 8 or 9
Almost 8
We aren’t sure- around 8-10
7
Seven
I think he’s around 7 or 8
7yrs; 4yrs; 4mos
7y 9 mo
7 (?-picked up as a stray)
≈7
between 5 and 9 years
1
1 year
1+
<1
2.5
2.5 yrs
2,5
3.5
3-4
3 1/2
6.5
5.5
4.5
4 1/2
4-5
3
3, we think (he’s a rescue)
About 3
3y
9
13
12
15
Pet age years old
if(value.contains(" years").and(value != null),value.split(" ")[0],value)
if(value.contains(" yesrs").and(value != null),value.split(" ")[0],value)
if(value.endsWith("ish").and(value != null),value.split("ish")[0],value)
if(value.endsWith("yrs").and(value != null),value.split("yrs")[0],value)
if(value.endsWith("ish?").and(value != null),value.split("ish?")[0],value)
if(value.endsWith("?").and(value != null),value.split("?")[0],value)
if(value.startsWith("~").and(value != null),value.replace("~",""),value)
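The same cleanup can be sketched in Python to check which age strings become numeric after the suffixes are stripped. This is an illustrative approximation of the expressions above, with a hypothetical function name; anything that still fails to parse is returned unchanged so it can be reviewed by hand.

```python
def clean_age(value: str):
    """Strip "~", " years", "ish", "yrs" and "?" and return a number when possible."""
    text = value.strip()
    if text.startswith("~"):
        text = text[1:]
    if " years" in text:
        text = text.split(" ")[0]
    for suffix in ("ish?", "ish", "yrs", "?"):
        if text.endswith(suffix):
            text = text[: -len(suffix)]
    try:
        return float(text.strip())
    except ValueError:
        return value  # leave the original text for manual editing


for raw in ["~15 years", "2.5 yrs", "8ish?", "Almost 8", "Unknown"]:
    print(repr(raw), "->", clean_age(raw))
```

Values such as Almost 8 or Unknown pass through untouched, which matches the need for a manual pass in the Number facet afterwards.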
I converted the Pet’s age once again with edit cells ->
Common transforms -> To number. Then I opened
the Number facet with Facet -> Number facet
and selected Non-numeric.
if(value.contains(" months").and(value != null),(toNumber(value.split(" ")[0])/12.0).round(),value)
if(value.contains(" mos").and(value != null),(toNumber(value.split(" ")[0])/12.0).round(),value)
if(value.contains(" mo").and(value != null),(toNumber(value.split(" ")[0])/12.0).round(),value)
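One side effect of dividing by 12 and rounding is worth seeing with numbers. The Python sketch below assumes that the rounding goes half up to the nearest whole year; the function name and the whole_years switch are hypothetical and only serve the illustration.

```python
import math


def months_to_years(value: str, whole_years: bool = True):
    """Convert text such as "4 months" to years, optionally rounded to whole years."""
    months = float(value.split(" ")[0])
    years = months / 12.0
    if whole_years:
        return math.floor(years + 0.5)  # round half up (assumed rounding behaviour)
    return round(years, 2)


print(months_to_years("4 months"))                     # 0
print(months_to_years("9 mos"))                        # 1
print(months_to_years("4 months", whole_years=False))  # 0.33
```

With whole-year rounding, any pet younger than six months ends up with an age of 0, so keeping the fraction may be preferable if the youngest pets need to be told apart.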
I converted the Pet’s age once again with edit cells ->
Common transforms -> To number. Then I opened
the Number facet with Facet -> Number facet
and selected Non-numeric.
Blanks
See the ages
if (value.match(/.*(\d{2}).*/)[0]!= null, value.match(/.*(\d{2}).*/)[0],value)
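A pattern of this shape can pick up a different pair of digits than expected, because the leading .* is greedy. The Python sketch below uses re.fullmatch as a stand-in for a whole-string match with capture groups; the function names are hypothetical, and the test sentence is one of the age values listed above.

```python
import re


def last_two_digits(value: str) -> str:
    """Greedy leading .* : the capture group lands on the LAST pair of digits."""
    m = re.fullmatch(r".*(\d{2}).*", value)
    return m.group(1) if m else value


def first_two_digits(value: str) -> str:
    """Lazy leading .*? : the capture group lands on the FIRST pair of digits."""
    m = re.fullmatch(r".*?(\d{2}).*", value)
    return m.group(1) if m else value


text = "She died at age 13 in 2005 :("
print(last_two_digits(text))   # 05
print(first_two_digits(text))  # 13
```

When a cell mentions both an age and a year, the lazy form returns the age, while the greedy form returns the last two digits of the year. Cells without two consecutive digits are returned unchanged in this sketch.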
I selected the method key collision and the keying function
fingerprint. After I clicked Merge Selected & Re-Cluster, it said
that 718 cells were edited.
cluster and edit
I selected the method key collision and the keying function
ngram-fingerprint. After I clicked Merge Selected & Re-Cluster, it
said that 148 cells were edited.
is a mixed breed dog — a cross between the American Pit Bull Terrier and the Labrador Retriever dog breeds
cluster and edit
I selected the method key collision and the keying function
cologne-phonetic. I scanned through the clusters it found and used
the Google search engine to determine the appropriate name. After I
clicked Merge Selected & Re-Cluster, it said that 87 cells were
edited.
is a mixed breed dog a cross between the Shih Tzu and Toy Poodle dog breeds.
cluster and edit
I selected the method key collision and the keying function
cologne-phonetic. After I clicked Merge Selected & Re-Cluster, it
said that 279 cells were edited.
is a mixed breed dog–a cross between the Chihuahua and Yorkshire Terrier dog breeds.
cluster and edit
I selected the method nearest neighbor and the distance function
levenshtein. After I clicked Merge Selected & Re-Cluster, it said
that 178 cells were edited.
cluster and edit
I selected the method nearest neighbor and the distance function
PPM.
I did not merge:
I don’t know much about dog breeds, so I used the Google search engine and found “Purebred Lab Or Not? How Can You Tell If Your Labrador Is Pure Bred?” and “Lab Mix Breeds: Top 25 Labrador Mixes”. My understanding now is that a purebred Labrador Retriever and a Labrador Retriever Mix are not the same thing.
Treeing Walker Coonhound
Treeing Walker Coonhound Mix
I used the Google search engine and found “What is the Difference Between Labrador and Labrador Retriever”. Because it is the Labrador Retriever, my judgement is the same as the previous decision.
I used the Google search engine and found “6 Different Types of Calico Cats With Pictures”. I don’t think Shorthair cat and Shorthair calico are the same.
I merge:
Pitbull/Terrier mix
Pitbull/Terrier
I used the Google search engine and found “What Is a Pit Bull Terrier Mix?” My understanding is that this is a mixture of an Old English Bull and an Old English Terrier. I merged and renamed to American Pit Bull Terrier
I merged and renamed to Tortoiseshell
After I clicked Merge Selected & Re-Cluster, it said that 29 cells were edited.
cluster and edit
After merging all the possible clusters from cluster and edit, I
proceeded to the next phase using Edit Cells -> Transformation.
I began with a combination of edit cells -> Common transforms ->
To text and edit cells -> Common transforms -> To titlecase.
Format of merging is
New Name
To merge inside the Text facet, I had to look at the
Breed column and determine from there what kind of animal it was.
Replacing DSH with Domestic Shorthair using a transformation
if(value.contains("DSH"),value.split(" ")[0].replace("DSH","Domestic Shorthair"),value)
if(value.contains("Domestic Shorthair"),value.split("Domestic Shorthair")[0],value)
Removing the parentheses within the name
if(value.contains(" \("),value.match(/(.*?)\((.*)\)(.*)/)[0],value)
Edit Manually
YorkiePoo
Torkie- is a mix between a Yorkshire Terrier
Yorkie mix
Toy Fox Terrier / Yorkie
Brittany Siberian Husky Mix
Shorthair calico
Unknown
8 years - move this 8 to the Pets age
Cat, he’s a cat, we found him on my dad’s tire… I think you can guess what brand tires my dad had
Unknown unknown Bali dog mix and tabby cat
who knows?
Street corner
Yes
Cat
Golden boy
fluffy
Evil
Dog
Black
Kitty cat
Just a regular cat
A mess
Gr
None
A
-
Taco
N /A
Shih Tzu Mix
Winter white hamster
Great Pyrenees
Great Pyr
Labrador & Great Pieraneese
German Shepherd Collie mix
American Pitbull Terrier
American Pit Bull Terrier
American pitbull terrier
American Staffordshire Terrier
Lab
Labrador
Labrador retriever
Black Lab
Chocolate lab
Chocolate Labrador
Yellow Lab
Yellow Labrador Retriever
Yellow Labrador
Black Labrador Retriever
Black Labrador
Black Lab/Angel
Golden Retriever Mix
Golden Mix
Golden lab mix
Retriever X labrador - Goldador is a mix between a purebred Golden Retriever and a purebred Labrador Retriever
Pit bull mix
Pit bull ish
Pit mix
Russian Dwarf Hamster
Jack Russell Terrier
Golden Labrador
Corgi Husky Mix
Tortie/Tortishell
Domestic Shorthair - Tortoiseshell
Domestic Shorthair-tortoise
Jack Russell Lab mix
German Shepherd Lab mix
Lab/German Shepard mix
Lab and German shepherd mix
Lab/shepherd
Kerry Blue Terrier
Terrier mix
Terrier something or other
Trrrier mixed
Terrierist
-tabby
Tabby
Orange tabby
edit cells -> Common transforms -> To titlecase
if(value.contains("Tabby"),value.split(" ")[0].replace(" ","Tabby Cat"),value)
Black Cat - is a domestic cat with black fur that may be a mixed or specific breed
stray black cat
-black
Black Shorthair Cat
Minnie Jack is a mix between a Jack Russel Terrier and a Miniature Pinscher
Jack Chi is a mixed breed dog a cross between the Jack Russell Terrier and Chihuahua
Great Pyrenees Lab
Tuxedo Cat
Tuxedo garbage cat
Tuxedo moggy
Tuxedo shorthair
Tuxedo
Black and White
Black cat with white feet
Frenchton is a mixed breed dog a cross between the Boston Terrier and French Bulldog
French Bulldog/boston Terrier Mix
French/english Bulldog
Mutt - is any dog that’s a combination of different breeds
Pit bull/Lab/Beagle/Chihuahua Mix/General mutt
mutt/mixed
Some kind of mix
Golden Retriever/Husky/Shepherd/Jack Russell
German Shepherd/Husky/Beagle/Chow Chow
Husky/malamute/sneaky neighbor’s dog mix
Husky/sneaky neighbor’s dog mix
Lab/APBT/Great Dane/GSD/Australian Cattle Dog/Rottweiler
Rottweiler-German Shepherd-Mix
Mix
Beagle & Lab/German Shepherd mix
German Shepherd x Border Collie; not sure; German Shepherd
Mixed
Mixed breed
Mix/mutt/disaster
Mixed breed Boomer
Mutt - mostly Jack Russell
Rescue Mutt
Treeing walker coonhound/ Chinese Sharpei
Shih tzu/corgi
Shih Tzu x Bichon
mixes
Border collie/weimerheimer/mixed breed
English / French Bulldog mix
Pomeranian - Indian Mongrel Mix
Removing the /
I noticed that some of the mixed breeds tend to have a name similar to the breed
if(value.contains("/"),value.replace("/"," "),value)
edit cells -> Common transforms -> To titlecase
Removing the X
This is a notation for mixed breeds; they tend to have a name similar to the breed
if(value.contains("X"),value.replace("X"," "),value)
edit cells -> Common transforms -> To titlecase
Replacing Lab with Labrador
This is a notation used in mixed breed names; they tend to have a name similar to the breed
if(value.contains("Lab "),value.replace("Lab ","Labrador "),value)
edit cells -> Common transforms -> To titlecase
Blanks
if(isBlank(value.trim()), cells['What kind of pet is this (Dog, Cat, Bird, Other)'].value, value)
Removing the Other:
if(value.contains("Other:"),value.replace("Other: ",""),value)
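The blank-filling step and the removal of the Other: prefix can be pictured on a few rows with a short Python sketch. The short column keys kind and breed are hypothetical stand-ins for the long column names in the dataset, and the rows are made up for illustration.

```python
def fill_breed(row: dict, kind_col: str = "kind", breed_col: str = "breed") -> str:
    """Use the pet kind when the breed is blank, then drop the "Other: " prefix."""
    breed = (row.get(breed_col) or "").strip()
    if not breed:
        breed = row.get(kind_col, "")
    if "Other:" in breed:
        breed = breed.replace("Other: ", "")
    return breed


# Made-up rows for illustration only.
rows = [
    {"kind": "Dog", "breed": ""},
    {"kind": "Other: Fish", "breed": "  "},
    {"kind": "Cat", "breed": "Tabby"},
]
filled = [fill_breed(r) for r in rows]
print(filled)
```

Rows that already have a breed keep it, and blank rows inherit the kind of pet, so no row is left empty in the breed column.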
edit cells -> Common transforms -> To titlecase
Open the Text facet for “What kind of pet is this (Dog, Cat,
Bird, Other)” with Facet -> Text facet. It
said I have 28 kinds.
It said there are 1129.
With Dog selected in “What kind of pet is this (Dog,
Cat, Bird, Other)”, open the Text facet for
“Pet’s breed (if applicable)” with Facet ->
Text facet. It said 329.
Have the Text facet for “Pet’s breed (if applicable)”
sorted by count: Facet -> Text facet, then
select sort by count. It said Golden
Retriever.
Having the “Pet’s breed (if applicable)” sorted from
largest first gives the oldest dog alive, and the
youngest dog in the system is not even a year old. To see the youngest
dog, open Facet -> Number facet and select
Non-numeric.
Range is: 22 years - 6 weeks
Since there are only 13, I did not use the sort. Range is: 5 years - 1 year
There used to be a pet that was 30 years old, but because it is dead I removed its age and wrote Dead. Currently the oldest pet is 24 years (cat).
There are 11 betta fish and 5 goldfish. The betta fish are the most popular.
The most popular everyday name for a cat is Kitty.
I have two most popular full names for a dog: Maggie and Sadie.
Every report must list the references that you consulted while completing the assignment. If you consulted a webpage, you must include the URL.
Reference 1,https://github.com/jgolbeck/petnames.git
Reference 2,https://openrefine.org
Reference 3,https://stackoverflow.com/questions/56519862/remove-outermost-parentheses
Reference 4,https://openrefine.org/download.html
Reference 5,https://docs.openrefine.org/manual/grelfunctions
Reference 6,https://www.youtube.com/watch?v=wGVtycv3SS0
Reference 7,https://docs.openrefine.org/manual/columnediting
Reference 8,https://groups.google.com/g/openrefine/c/20eDwEJwpn8
Reference 9,https://canvas.odu.edu/courses/115922/external_tools/39
Reference 10,http://web.archive.org/web/20190105063215/enipedia.tudelft.nl/wiki/OpenRefine_Tutorial
Reference 11,http://web.archive.org/web/20190512224358/https://github.com/OpenRefine/OpenRefine/wiki/Cell-Editing
Reference 12,https://evanwill.github.io/clean-your-data/4-demo.html
Reference 13,https://guides.library.illinois.edu/openrefine/grel
Reference 14,https://github.com/OpenRefine/OpenRefine/wiki/Recipes#combining-datasets
Reference 15,https://docs.openrefine.org/manual/grelfunctions
Reference 16,https://www.dictionary.com/e/slang/doggo/
Reference 17,https://en.wikipedia.org/wiki/Katze
Reference 18,https://www.thelabradorsite.com/is-my-dog-purebred/
Reference 19,https://www.marvelousdogs.com/labrador-mixes/
Reference 21,https://excitedcats.com/different-types-of-calico-cats/
Reference 23,https://www.britannica.com/animal/pit-bull-terrier
Reference 24,https://www.thesprucepets.com/tortoiseshell-cat-profile-554703
Reference 25,https://pets.thenest.com/dsh-cat-4310.html
Reference 26,https://wagwalking.com/breed/torkie
Reference 27,https://en.wikipedia.org/wiki/Labrador_Retrieve
Reference 28,https://www.losgatosvet.com/domestic-medium-hair-cats/
Reference 29,https://www.thelabradorsite.com/goldador/#:~:text=As%20we%20know%2C%20a%20Goldador,working%20dog%2C%20for%20many%20roles
Reference 30,https://dogable.net/shih-tzu-dachshund-mix-schweenie/
Reference 31,https://en.wikipedia.org/wiki/Tortoiseshell_cat
Reference 32,https://thehappypuppysite.com/jackapoo/
Reference 33,https://www.hepper.com/shih-poo/
Reference 34,https://www.hillspet.com/dog-care/dog-breeds/miniature-pinscher
Reference 35,https://en.wikipedia.org/wiki/Tabby_cat
Reference 36,https://en.wikipedia.org/wiki/Black_cat
Reference 37,https://jackrussellowner.com/minniejack-jack-russell-and-min-pin-mix-hybrid
Reference 38,https://animalcorner.org/dog-breeds/jack-russell-chihuahua-mix/
Reference 39,https://en.wikipedia.org/wiki/Bicolor_cat
Reference 40,https://animalcorner.org/dog-breeds/frenchton/
Reference 41,https://www.vocabulary.com/dictionary/mutt
Reference 42,https://www.petmd.com/dog/breeds/c_dg_golden_retriever
Reference 43,https://github.com/jgolbeck/petnames
Reference 44,https://docs.openrefine.org/manual/expressions#grel-general-refine-expression-language
Reference 45,https://github.com/odu-cs625-datavis/fall22-hw1-csggame001
Reference 46,https://github.com/odu-cs625-datavis/fall22-hw1-csggame001/tree/feedback