HW 2, CS 625, Fall 2022

Gabriela Gamez Sept 14, 2022

Git, GitHub

  1. What is your GitHub username?

csggame001


Notes From HW1 R Markdown


Data Cleaning

When I loaded PetNames.tsv into OpenRefine, I noticed it had more issues than the in-class exercise (inconsistent names and duplicated names). My first attempt at cleaning the data was Edit cells -> Cluster and edit, since this method seems very effective on columns with inconsistent and duplicated names. After exhausting all the Cluster and edit options, I applied a Facet strategy.

Edit the column “What kind of pet is this (Dog, Cat, Bird, Other)”

I selected the key collision method with the fingerprint keying function. After I clicked Merge Selected & Re-Cluster, it reported that 1635 cells were edited.

[Screenshot: cluster and edit, fingerprint]
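To make the key collision idea concrete, here is a minimal Python sketch of a fingerprint-style key. It approximates the behavior of OpenRefine's fingerprint keying function and is not its exact implementation; the function names and the sample values are hypothetical and are not taken from PetNames.tsv.

```python
import string
import unicodedata


def fingerprint(value: str) -> str:
    """Approximate fingerprint key: order- and punctuation-insensitive."""
    # Trim, lowercase, and fold accented characters to plain ASCII.
    text = unicodedata.normalize("NFKD", value.strip().lower())
    text = text.encode("ascii", "ignore").decode("ascii")
    # Remove punctuation, then split on whitespace.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Sort the unique tokens so word order does not matter.
    return " ".join(sorted(set(text.split())))


def key_collision_clusters(values):
    """Group values whose keys collide; keep groups with more than one spelling."""
    groups = {}
    for value in values:
        groups.setdefault(fingerprint(value), []).append(value)
    return {key: vals for key, vals in groups.items() if len(set(vals)) > 1}


# Hypothetical cell values, not taken from PetNames.tsv.
sample = ["Dog", "dog ", "DOG.", "Cat", "cat", "Guinea Pig", "pig, guinea", "Bird"]
clusters = key_collision_clusters(sample)
for key, members in clusters.items():
    print(key, members)
```

Values that differ only in case, surrounding whitespace, punctuation, or word order end up under the same key, which is why this method can merge so many cells in one pass.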

I selected the key collision method with the ngram-fingerprint keying function. It returned "No clusters were found with the selected method".

[Screenshot: cluster and edit, ngram-fingerprint]

I selected the key collision method with the metaphone3 keying function. I got a lot of clusters whose names did not seem similar. I used Google in case some of the detected names were the same word in different languages. After I clicked Merge Selected & Re-Cluster, it reported that 1122 cells were edited.

[Screenshot: cluster and edit, metaphone3]

I selected the key collision method with the cologne-phonetic keying function. It just gave me the same options that I had refused to merge with metaphone3.

[Screenshot: cluster and edit, metaphone3]

I selected the key collision method with the Daitch-Mokotoff keying function. It gave clusters that were not even related, for example Bird and Ferret. So I proceeded to the next keying function, Beider-Morse. After I clicked Merge Selected & Re-Cluster, it reported that 496 cells were edited.

[Screenshot: cluster and edit, Beider-Morse]

I selected the nearest neighbor method with the levenshtein distance function. It returned "No clusters were found with the selected method". I tried the PPM distance function, but it said the same thing. At this point, I realized I had exhausted all the possible options for Cluster and edit, so I proceeded to Facet -> Text facet.

[Screenshot: opening the Text facet]

Format of merging is

To merge inside the Text facet, I had to look at the Breed column and determine from there what kind of animal it was.

I noticed that some rows of type Other had a blank breed, so instead of losing some of the data I decided to reformat them.

I noticed that for some rows of type Other, the breed was not an animal.

Completing the column “What kind of pet is this (Dog, Cat, Bird, Other)”

[Screenshot: Text facet]

Edit the column “Pet’s Full Name (you don’t have to include your last name. Think “Winston Churchill” and not “Winston Churchill MYLASTNAME”)“

I selected the key collision method with the fingerprint keying function. After I clicked Merge Selected & Re-Cluster, it reported that 41 cells were edited.

[Screenshot: cluster and edit, fingerprint]

After fingerprint, I tried a different keying function, ngram-fingerprint, but I felt its accuracy was not as good as fingerprint's. For example, it clustered the pet names Chich and Ichi. I did not feel comfortable merging these since I feel they are entirely different names. After I clicked Merge Selected & Re-Cluster, it reported that 6 cells were edited.

[Screenshot: cluster and edit, ngram-fingerprint]
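The Chich and Ichi collision can be reproduced with a small Python sketch of an n-gram fingerprint. This is an approximation of the idea (lowercase, drop everything except letters and digits, collect the unique n-grams, sort, join) and not OpenRefine's exact code; the function name and the default n-gram size of 2 are assumptions.

```python
import re


def ngram_fingerprint(value: str, n: int = 2) -> str:
    """Approximate n-gram fingerprint key."""
    # Lowercase and keep only letters and digits (drops spaces and punctuation).
    text = re.sub(r"[^a-z0-9]", "", value.lower())
    # Collect the unique n-grams, then sort and join them into one key.
    grams = {text[i:i + n] for i in range(len(text) - n + 1)}
    return "".join(sorted(grams))


print(ngram_fingerprint("Chich"))
print(ngram_fingerprint("Ichi"))
```

Both names contain exactly the 2-grams ch, hi, and ic, so they share a key even though they read as different names. The key only records which n-grams occur, not how often or in what order, so a collision like this is a suggestion to review rather than proof of a duplicate.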

After ngram-fingerprint, I tried the other keying functions, but the accuracy became even worse than before, so I decided to change the method from key collision to nearest neighbor. For the distance function I picked levenshtein. After I clicked Merge Selected & Re-Cluster, it reported that 23 cells were edited.

[Screenshot: cluster and edit, levenshtein]
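For reference, the Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. The sketch below is a standard dynamic-programming version in Python, not OpenRefine's own code, and it ignores the blocking and radius settings that the nearest neighbor dialog also uses.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insert, delete, substitute cost 1 each)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Name pairs that appear in this report.
print(levenshtein("Prince", "Princess"))
print(levenshtein("Stella", "Stellabella"))
```

A small distance only says the spellings are close. Prince and Princess are two edits apart, yet they are different names, so each suggested cluster still needs a manual check.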

Lastly, I tried the PPM distance function with the nearest neighbor method. PPM clustered Stella and Stellabella; I think these are not the same.

[Screenshot: cluster and edit, PPM]

I realized I had exhausted all the possible options for Cluster and edit, so I proceeded to the next phase using a combination of Edit cells -> Common transforms -> To text, Edit cells -> Common transforms -> To titlecase, Facet -> Text facet, and Facet -> Custom text facet. I did this because there are too many names to edit with just the Text facet.

if (value.contains(" LASTNAME"),value.replace(" LASTNAME",""),value)

if(value.contains(" \("),value.match(/(.*?)\((.*)\)(.*)/)[0],value)

[Screenshots: GREL expressions]
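As a cross-check of what the two GREL expressions above do, here is a rough Python equivalent. The function name and the sample names are hypothetical, and the final strip() is an added assumption to remove the space left in front of the parenthesis.

```python
import re


def clean_full_name(value: str) -> str:
    """Rough Python mirror of the two GREL expressions for the full name."""
    # First expression: drop the literal " LASTNAME" placeholder.
    value = value.replace(" LASTNAME", "")
    # Second expression: keep only the text before the first "(".
    match = re.match(r"(.*?)\((.*)\)(.*)", value)
    if " (" in value and match:
        value = match.group(1)
    # Assumption: trim the whitespace left behind.
    return value.strip()


# Hypothetical names, not taken from PetNames.tsv.
print(clean_full_name("Winston Churchill LASTNAME"))
print(clean_full_name("Sir Barks (Barky)"))
```

Note that the first rule only removes the exact text " LASTNAME" with a leading space, so a value such as "Winston Churchill MYLASTNAME" is left unchanged and still needs a manual fix.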

For some odd reason it was not saving the changes I made, so after 6 hours I decided to make every change manually.

Format of merging is

Parentheses within the name and trimming

Commas within the name and trimming

Blank rows

[Screenshot: Blanks, FullName, fingerprint]

Edit the column “Pet’s everyday name (e.g. “Church”)”

I selected the key collision method with the fingerprint keying function. After I clicked Merge Selected & Re-Cluster, it reported that 105 cells were edited.

[Screenshot: cluster and edit, fingerprint]

I selected the key collision method with the ngram-fingerprint keying function. After I clicked Merge Selected & Re-Cluster, it reported that 10 cells were edited.

[Screenshot: cluster and edit, ngram-fingerprint]

I selected the nearest neighbor method with the levenshtein distance function. After I clicked Merge Selected & Re-Cluster, it reported that 10 cells were edited.

[Screenshot: cluster and edit, levenshtein]

I selected the nearest neighbor method with the PPM distance function. I did not merge Prince and Princess because they are names for different sexes. After I clicked Merge Selected & Re-Cluster, it reported that 13 cells were edited.

[Screenshot: cluster and edit, PPM]

I recognized I had finished all the possible options for Cluster and edit, so I proceeded to the next phase using Edit cells -> Transform. I think I was using the Facet before, which was why it was not saving the changes. I began with a combination of Edit cells -> Common transforms -> To text and Edit cells -> Common transforms -> To titlecase.

if(value.contains(" \("),value.match(/(.*?)\((.*)\)(.*)/)[0],value)

if(value.contains(" ,"),value.split(",")[0],value)

if(value.contains(","),value.split(",")[0],value)

if(value.contains("/"),value.split("/")[0],value)

if(value.contains(" ").and(3 < length(value.split(" ")[0])).and(value != null),value.split(" ")[0],value)

if(value.contains(" Or"),value.split(" Or")[0],value)

if(value.contains(" And"),value.split(" And")[0],value)
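The chain of GREL expressions above can be read as one pipeline. The following Python sketch mirrors the comma, slash, first-word, " Or", and " And" rules in the same order (the parenthesis rule is left out for brevity). The function name and sample values are hypothetical, and the closing strip() is an added assumption.

```python
def clean_everyday_name(value: str) -> str:
    """Rough Python mirror of the everyday-name GREL rules, applied in order."""
    # Keep the text before the first comma, then before the first slash.
    for separator in (",", "/"):
        value = value.split(separator)[0]
    # If the first word is longer than 3 characters, keep only that word.
    first_word = value.split(" ")[0]
    if " " in value and len(first_word) > 3:
        value = first_word
    # Drop alternatives introduced by " Or" / " And" (values are in titlecase).
    for separator in (" Or", " And"):
        value = value.split(separator)[0]
    return value.strip()


# Hypothetical names, not taken from PetNames.tsv.
for name in ["Bella, Bell", "Max/Maxie", "Charlie Brown", "Bo Or Bobo", "Mr Orange"]:
    print(repr(name), "->", repr(clean_everyday_name(name)))
```

One caveat this makes visible: splitting on " Or" also cuts names whose second word merely starts with "Or", so the hypothetical "Mr Orange" becomes "Mr". Cases like that are a reason to finish with a manual pass in the Text facet.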

After doing the major transformations with GREL, I moved to Facet -> Text facet and made any additional modifications manually.

Format for name

Blanks

Even though the column “Pet’s everyday name (e.g. “Church”)” is empty for some rows, those rows contain information in other columns.

[Screenshot: Blanks, BlanksEvery]

Edit the column Pet’s age

For the “Pet’s age” column I followed instructions similar to the class exercise. I first converted Pet’s age to numeric with Edit cells -> Common transforms -> To number. Then I opened the numeric facet with Facet -> Numeric facet and selected Non-numeric.

[Screenshot: Pet’s age]

I selected the key collision method with the fingerprint keying function. After I clicked Merge Selected & Re-Cluster, it reported that 27 cells were edited.

[Screenshot: Pet’s age]

I selected the key collision method with the ngram-fingerprint keying function. After I clicked Merge Selected & Re-Cluster, it reported that 10 cells were edited.

[Screenshot: Pet’s age]

I selected the nearest neighbor method with the PPM distance function. After I clicked Merge Selected & Re-Cluster, it reported that 2 cells were edited.

[Screenshot: Pet’s age]

I converted Pet’s age once again with Edit cells -> Common transforms -> To number. Then I opened the numeric facet with Facet -> Numeric facet and selected Non-numeric.

I manually edited some of the formats.

[Screenshot: dead pet, Pet’s age]

if(value.contains(" years").and(value != null),value.split(" ")[0],value)

if(value.contains(" yesrs").and(value != null),value.split(" ")[0],value)

if(value.endsWith("ish").and(value != null),value.split("ish")[0],value)

if(value.endsWith("yrs").and(value != null),value.split("yrs")[0],value)

if(value.endsWith("ish?").and(value != null),value.split("ish?")[0],value)

if(value.endsWith("?").and(value != null),value.split("?")[0],value)

if(value.startsWith("~").and(value != null),value.replace("~",""),value)
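The age expressions above all strip some kind of noise around a number. A compact Python sketch of the same idea follows; the function name and sample values are hypothetical, and the order of the steps is a simplification of the one-at-a-time GREL transforms.

```python
def strip_age_noise(value: str) -> str:
    """Remove approximate-age markers and year suffixes around a number."""
    # Drop a leading "~" and a trailing "?".
    text = value.strip().lstrip("~").rstrip("?")
    # Drop the "ish" and "yrs" suffixes.
    for suffix in ("ish", "yrs"):
        if text.endswith(suffix):
            text = text[: -len(suffix)]
    # For "N years" (or the misspelling "yesrs"), keep the first token.
    if " years" in text or " yesrs" in text:
        text = text.split(" ")[0]
    return text.strip()


# Hypothetical age strings, not taken from PetNames.tsv.
for raw in ["~5", "3 years", "10ish?", "7yrs", "2 yesrs", "12"]:
    print(repr(raw), "->", repr(strip_age_noise(raw)))
```

The result is still text, so it needs the same To number conversion afterwards before it shows up on the numeric side of the facet.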

I converted Pet’s age once again with Edit cells -> Common transforms -> To number. Then I opened the numeric facet with Facet -> Numeric facet and selected Non-numeric.

if(value.contains(" months").and(value != null),(toNumber(value.split(" ")[0])/12.0).round(),value)

if(value.contains(" mos").and(value != null),(toNumber(value.split(" ")[0])/12.0).round(),value)

if(value.contains(" mo").and(value != null),(toNumber(value.split(" ")[0])/12.0).round(),value)
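The month rules divide by 12 and round, which means ages below roughly half a year become 0. Here is a small Python sketch of that arithmetic. It assumes GREL's round() rounds halves up (Python's built-in round() does not, so floor(x + 0.5) is used instead); the function name and sample values are hypothetical.

```python
import math


def months_to_years(value: str):
    """Convert 'N months' / 'N mos' / 'N mo' to whole years; pass others through."""
    parts = value.split(" ")
    if len(parts) >= 2 and parts[1].startswith("mo"):
        months = float(parts[0])
        # Assumption: halves round up, as with Java-style rounding.
        return math.floor(months / 12.0 + 0.5)
    return value


# Hypothetical age strings, not taken from PetNames.tsv.
for raw in ["6 months", "18 mos", "4 mo", "3"]:
    print(repr(raw), "->", months_to_years(raw))
```

Because young animals collapse to 0 or 1 under this rule, the exact youngest age has to be read from the original text value rather than from the rounded number.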

I converted Pet’s age once again with Edit cells -> Common transforms -> To number. Then I opened the numeric facet with Facet -> Numeric facet and selected Non-numeric.

[Screenshot: Blanks]

See the ages

if (value.match(/.*(\d{2}).*/)[0]!= null, value.match(/.*(\d{2}).*/)[0],value)
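One detail of the expression above is worth spelling out: the leading .* is greedy, so when a cell contains several numbers the capture group picks up the last two digits, not the first. The Python sketch below uses re.fullmatch to imitate GREL's whole-string match; the function name and sample values are hypothetical.

```python
import re


def last_two_digits(value: str) -> str:
    """Return the last two consecutive digits in value, or value if none."""
    # fullmatch imitates a whole-string match; the first .* is greedy.
    match = re.fullmatch(r".*(\d{2}).*", value)
    return match.group(1) if match else value


# Hypothetical age strings, not taken from PetNames.tsv.
for raw in ["about 15", "12 or 13 years", "123", "7"]:
    print(repr(raw), "->", repr(last_two_digits(raw)))
```

Single-digit ages do not match the pattern at all and are returned unchanged, so this rule only helps with ages of 10 or more.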

Edit the column “Pet’s breed (if applicable)”

I selected the key collision method with the fingerprint keying function. After I clicked Merge Selected & Re-Cluster, it reported that 718 cells were edited.

[Screenshot: cluster and edit, fingerprint]

I selected the key collision method with the ngram-fingerprint keying function. After I clicked Merge Selected & Re-Cluster, it reported that 148 cells were edited.

[Screenshot: cluster and edit, ngram-fingerprint]

I selected the key collision method with the cologne-phonetic keying function. I scanned through the clusters it found and used Google to determine the appropriate name. After I clicked Merge Selected & Re-Cluster, it reported that 87 cells were edited.

I selected the key collision method with the cologne-phonetic keying function. After I clicked Merge Selected & Re-Cluster, it reported that 279 cells were edited.

The breed in question is a mixed-breed dog, a cross between the Chihuahua and Yorkshire Terrier dog breeds.

[Screenshot: cluster and edit, cologne-phonetic]

I selected the nearest neighbor method with the levenshtein distance function. After I clicked Merge Selected & Re-Cluster, it reported that 178 cells were edited.

[Screenshot: cluster and edit, levenshtein]

I selected the nearest neighbor method with the PPM distance function.

I did not merge:

I don't know much about dog breeds, so I used Google and found "Purebred Lab Or Not? How Can You Tell If Your Labrador Is Pure Bred?" and "Lab Mix Breeds: Top 25 Labrador Mixes". My understanding now is that a purebred Labrador Retriever and a Labrador Retriever Mix are not the same thing.

I used Google and found "What is the Difference Between Labrador and Labrador Retriever". Because it is the Labrador Retriever, my judgement is the same as the previous decision.

I used Google and found "6 Different Types of Calico Cats With Pictures". I don't think Shorthair cat and Shorthair calico are the same.

I merged:

I used Google and found "What Is a Pit Bull Terrier Mix?" My understanding is that this is a mixture of an Old English Bull and an Old English Terrier. I merged and renamed to American Pit Bull Terrier.

I merged and renamed to Tortoiseshell.

After I clicked Merge Selected & Re-Cluster, it reported that 29 cells were edited.

[Screenshot: cluster and edit, PPM]

After merging everything possible with Cluster and edit, I proceeded to the next phase using Edit cells -> Transform. I began with a combination of Edit cells -> Common transforms -> To text and Edit cells -> Common transforms -> To titlecase.

Format of merging is

To merge inside the Text facet, I had to look at the Breed column and determine from there what kind of animal it was.

if(value.contains(" \("),value.match(/(.*?)\((.*)\)(.*)/)[0],value)

Edit Manually


Analyze Cleaned Data

1. How many types (kinds) of pets are there?

I opened the Text facet for “What kind of pet is this (Dog, Cat, Bird, Other)” with Facet -> Text facet. It showed 28 kinds.

[Screenshot: kinds]

2. How many dogs?

There are 1129 dogs.

[Screenshot: dogs]

3. How many breeds of dogs?

With Dog selected in “What kind of pet is this (Dog, Cat, Bird, Other)”, I opened the Text facet for “Pet’s breed (if applicable)” with Facet -> Text facet. It showed 329 breeds.

[Screenshot: breeds of dogs]

4. What is the most popular dog breed?

I sorted the Text facet for “Pet’s breed (if applicable)” by count (Facet -> Text facet, then sort by count). The most popular dog breed is Golden Retriever.

[Screenshot: most popular dog breed]
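The facet counts used for these questions amount to counting distinct values with and without a filter. A minimal Python sketch with collections.Counter shows the same logic; the rows are hypothetical stand-ins, so the numbers it prints are not the report's results.

```python
from collections import Counter

# Hypothetical (kind, breed) rows; the real project has far more.
rows = [
    ("Dog", "Golden Retriever"),
    ("Dog", "Golden Retriever"),
    ("Dog", "Beagle"),
    ("Cat", "Tabby"),
    ("Cat", "Siamese"),
    ("Fish", "Betta"),
]

# Text facet on the kind column: one entry per distinct kind.
kind_counts = Counter(kind for kind, _ in rows)
# Text facet on the breed column with Dog selected in the kind facet.
dog_breeds = Counter(breed for kind, breed in rows if kind == "Dog")

print(len(kind_counts))           # number of kinds
print(kind_counts["Dog"])         # number of dogs
print(len(dog_breeds))            # number of dog breeds
print(dog_breeds.most_common(1))  # most popular dog breed with its count
```

Sorting a Text facet by count corresponds to most_common() here. Any spelling variants left over from cleaning would show up as separate entries and inflate the breed count.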

5. What’s the age range of the dogs?

I sorted “Pet’s age” largest first to find the oldest living dog; the youngest dog in the system is not even a year old. To see the youngest dog, I opened Facet -> Numeric facet and selected Non-numeric.

Range is: 22 years - 6 weeks

[Screenshot: oldest]

[Screenshot: youngest]

6. What’s the age range of the guinea pigs?

Since there are only 13, I did not use the sort. Range is: 5 years - 1 year

[Screenshot: age range of the guinea pigs]

7. What is the oldest pet?

There used to be a pet that was 30 years old, but because it is dead I removed its age and wrote Dead. The oldest pet is now 24 years old (a cat).

[Screenshot: oldest pet]
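The age-range answers depend on how non-numeric cells are treated. The Python sketch below computes a range while skipping blanks and text such as "Dead", which is roughly what the numeric side of a numeric facet does; the function name and the sample values are hypothetical.

```python
def age_range(ages):
    """Return (min, max) over the numeric ages, skipping everything else."""
    numeric = []
    for age in ages:
        try:
            numeric.append(float(age))
        except (TypeError, ValueError):
            continue  # skip blanks, None, and notes such as "Dead"
    return (min(numeric), max(numeric)) if numeric else None


# Hypothetical age cells, not taken from PetNames.tsv.
sample_ages = ["24", "Dead", "", "3", "6 weeks", 12, None]
print(age_range(sample_ages))
```

A value such as "6 weeks" is skipped here just like "Dead", which is why the youngest animal has to be looked up under Non-numeric instead of being read off the numeric range.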

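8. Which is more popular, betta fish or goldfish?

There are 11 betta fish and 5 goldfish. Betta fish are more popular.

[Screenshot: betta fish or goldfish]

9. What is the most popular everyday name for a cat?

The most popular everyday name for a cat is Kitty.

[Screenshot: most popular everyday name]

10. What is the most popular full name for a dog?

Two full names are tied as the most popular for a dog: Maggie and Sadie.

[Screenshot: most popular full name]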

