Description
In this programming assignment, you are required to implement the Apriori algorithm and apply it to mine frequent itemsets from a real-life data set.
The provided input file (“categories.txt”) consists of the category lists of 77,185 places in the US. Each line corresponds to the category list of one place, where the list consists of a number of category instances (e.g., hotels, restaurants, etc.) that are separated by semicolons.
An example line is provided below:
Local Services; IT Services & Computer Repair
In the example above, the corresponding place has two category instances: “Local Services” and “IT Services & Computer Repair”.
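The split described above can be sketched in base R. The helper name `parse_categories` is hypothetical, and trimming whitespace around each category is an assumption made so that the example line (which has a space after the semicolon) parses cleanly.

```r
# Hypothetical helper: split one line of the input file into category names.
parse_categories <- function(line) {
  # Split on literal semicolons, then trim surrounding whitespace.
  parts <- trimws(strsplit(line, ";", fixed = TRUE)[[1]])
  # Drop empty pieces and repeated categories within the same place.
  unique(parts[nzchar(parts)])
}

example <- parse_categories("Local Services; IT Services & Computer Repair")
print(example)

# Assumed usage on the real file (not run here):
# transactions <- lapply(readLines("categories.txt"), parse_categories)
```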
You need to implement the Apriori algorithm and use it to mine category sets that are frequent in the input data. When implementing the Apriori algorithm, you may use any programming language you like. We only need your result pattern file, not your source code file.
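As an illustration only, here is a minimal level-wise Apriori sketch in base R. It is not the code used in this report (the chunks further down rely on the arules package), the function and variable names are hypothetical, and it favours readability over speed, so it may be slow on the full 77,185 lists.

```r
# Minimal Apriori sketch (illustrative; names are hypothetical).
# transactions: list of character vectors; min_count: absolute support threshold.
apriori_sketch <- function(transactions, min_count) {
  transactions <- lapply(transactions, unique)

  # Number of transactions that contain every item of `itemset`.
  support_of <- function(itemset) {
    sum(vapply(transactions, function(t) all(itemset %in% t), logical(1)))
  }

  # Keep only the candidates whose support reaches the threshold.
  keep_frequent <- function(candidates) {
    out <- list()
    for (cand in candidates) {
      s <- support_of(cand)
      if (s >= min_count) out[[length(out) + 1]] <- list(items = cand, support = s)
    }
    out
  }

  # Level 1: frequent single items, in sorted order.
  level <- keep_frequent(as.list(sort(unique(unlist(transactions)))))
  frequent <- list()

  while (length(level) > 0) {
    frequent <- c(frequent, level)
    sets <- lapply(level, `[[`, "items")
    keys <- vapply(sets, paste, character(1), collapse = ";")
    k <- length(sets[[1]])

    # Join step: merge two frequent k-itemsets that share their first k-1 items.
    candidates <- list()
    for (i in seq_along(sets)) for (j in seq_along(sets)) {
      if (i >= j) next
      a <- sets[[i]]
      b <- sets[[j]]
      if (!identical(a[-k], b[-k])) next
      cand <- sort(c(a, b[k]))
      # Prune step: every k-subset of the candidate must itself be frequent.
      subsets_frequent <- all(vapply(
        seq_along(cand),
        function(d) paste(cand[-d], collapse = ";") %in% keys,
        logical(1)
      ))
      if (subsets_frequent) candidates[[length(candidates) + 1]] <- cand
    }
    level <- keep_frequent(candidates)
  }
  frequent
}

# Toy data, made up for this example.
toy <- list(
  c("Fast Food", "Restaurants"),
  c("Fast Food", "Restaurants", "Burgers"),
  c("Pizza", "Restaurants"),
  c("Coffee & Tea", "Food"),
  c("Restaurants")
)
result <- apriori_sketch(toy, min_count = 2)
patterns <- vapply(
  result,
  function(r) paste0(r$support, ":", paste(r$items, collapse = ";")),
  character(1)
)
print(patterns)
```

The prune step is what distinguishes Apriori from brute-force counting: a candidate is only counted against the data if all of its smaller subsets were already found frequent.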
After implementing the Apriori algorithm, please set the relative minimum support to 0.01 and run it on the 77,185 category lists. In other words, you need to extract all the category sets that have an absolute support no smaller than 771.
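The arithmetic behind the threshold can be checked directly. The variable names below are only illustrative. Note that 0.01 × 77,185 is not a whole number, so the figure of 771 corresponds to rounding down; rounding up would give 772. Following the assignment's stated value of 771 is the assumption made here.

```r
n_transactions <- 77185   # number of category lists
rel_min_support <- 0.01   # relative minimum support

exact_threshold <- rel_min_support * n_transactions  # 771.85
abs_min_support <- floor(exact_threshold)            # rounded down, as in the assignment
strict_threshold <- ceiling(exact_threshold)         # rounded up, for comparison

print(c(exact = exact_threshold, floor = abs_min_support, ceiling = strict_threshold))
```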
Please output all the length-1 frequent categories with their absolute supports into a text file named “patterns.txt”. Every line corresponds to exactly one frequent category and should be in the following format:
support:category
For example, suppose a category (Fast Food) has an absolute support of 3000; then the line corresponding to this frequent category in “patterns.txt” should be:
3000:Fast Food
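A possible way to produce the length-1 lines in base R is sketched below. The toy data and the threshold of 2 are made up for the example, and the commented `writeLines()` call shows where the real file would be written.

```r
# Toy category lists (made up); each element is one place.
toy <- list(
  c("Fast Food", "Restaurants"),
  c("Pizza", "Restaurants"),
  c("Fast Food", "Restaurants", "Burgers")
)
min_count <- 2  # illustrative threshold; the assignment uses 771

# Count each category at most once per place, then tabulate.
counts <- table(unlist(lapply(toy, unique)))
frequent <- counts[counts >= min_count]

# One "support:category" line per frequent category.
lines <- paste0(as.integer(frequent), ":", names(frequent))
writeLines(lines)
# writeLines(lines, "patterns.txt")  # assumed file output, not run here
```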
Please write all the frequent category sets along with their absolute supports into a text file named “patterns.txt”. Every line corresponds to exactly one frequent category set and should be in the following format:
support:category_1;category_2;category_3;...
For example, suppose a category set (Fast Food; Restaurants) has an absolute support of 2851; then the line corresponding to this frequent category set in “patterns.txt” should be:
2851:Fast Food;Restaurants
Make sure that you format each line correctly in the output file. For instance, use a semicolon instead of another character to separate the categories for each frequent category set.
In the result pattern file, the order of the categories does not matter. For example, the following two cases will be considered equivalent by the grader:
Case 1:
2851:Fast Food;Restaurants
Case 2:
2851:Restaurants;Fast Food
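The line format and the order-insensitivity can be illustrated with two small helpers. Both names are hypothetical, and sorting the categories is only one way to compare lines; how the grader actually does it is not stated in the assignment.

```r
# Build one output line: "support:cat_1;cat_2;...".
format_pattern <- function(items, support) {
  paste0(support, ":", paste(items, collapse = ";"))
}

# Put the categories of a line into sorted order so that two lines that
# differ only in category order become identical strings.
normalize_pattern <- function(line) {
  pos <- regexpr(":", line, fixed = TRUE)          # first colon ends the support
  support <- substr(line, 1, pos - 1)
  items <- strsplit(substr(line, pos + 1, nchar(line)), ";", fixed = TRUE)[[1]]
  paste0(support, ":", paste(sort(items), collapse = ";"))
}

case1 <- format_pattern(c("Fast Food", "Restaurants"), 2851)
case2 <- format_pattern(c("Restaurants", "Fast Food"), 2851)
print(case1)
print(normalize_pattern(case1) == normalize_pattern(case2))
```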
graph 2.1.1a : explore the top 20 items in the dataset.
graph 2.1.1b : explore the top 20 items in the dataset.
inspect(transDat[1:10]) # view the observations
## items transactionID
## 1 {Fashion} Accessories
## 2 {Professional Services} Accountants
## 3 {Active Life,
## Amateur Sports Teams,
## American (New),
## American (Traditional),
## Amusement Parks,
## Aquariums,
## Arcades,
## Archery,
## Arts & Entertainment,
## ATV Rentals/Tours,
## Automotive,
## Barre Classes,
## Bars,
## Batting Cages,
## Beaches,
## Beauty & Spas,
## Beer, Wine & Spirits,
## Bike Rentals,
## Boating,
## Boot Camps,
## Bowling,
## Boxing,
## Cafes,
## Car Wash,
## Challenge Courses,
## Churches,
## Climbing,
## Colleges & Universities,
## Counseling & Mental Health,
## Country Clubs,
## Cycling Classes,
## Dance Studios,
## Day Camps,
## Day Spas,
## Department Stores,
## Disc Golf,
## Diving,
## Dog Parks,
## Education,
## Escape Games,
## Event Planning & Services,
## Fast Food,
## Fire Departments,
## Fitness & Instruction,
## Food,
## Go Karts,
## Golf,
## Golf Lessons,
## Gun/Rifle Ranges,
## Gymnastics,
## Gyms,
## Hair Salons,
## Health & Medical,
## Hiking,
## Horse Boarding,
## Horse Racing,
## Horseback Riding,
## Hot Air Balloons,
## Hotels & Travel,
## Italian,
## Kids Activities,
## Lakes,
## Landmarks & Historical Buildings,
## Landscaping,
## Laser Tag,
## Leisure Centers,
## Local Flavor,
## Local Services,
## Martial Arts,
## Massage Therapy,
## Mini Golf,
## Mountain Biking,
## Museums,
## Music & DVDs,
## Nightlife,
## Nutritionists,
## Paddleboarding,
## Paintball,
## Parks,
## Party & Event Planning,
## Party Supplies,
## Performing Arts,
## Persian/Iranian,
## Pets,
## Pilates,
## Playgrounds,
## Pool Cleaners,
## Pool Halls,
## Preschools,
## Races & Competitions,
## Rafting/Kayaking,
## Recreation Centers,
## Resorts,
## Restaurants,
## Saunas,
## Shopping,
## Skate Parks,
## Skating Rinks,
## Skydiving,
## Soccer,
## Specialty Schools,
## Sporting Goods,
## Sports Clubs,
## Summer Camps,
## Sushi Bars,
## Swimming Pools,
## Tai Chi,
## Tennis,
## Thai,
## Trainers,
## Trampoline Parks,
## Venues & Event Spaces,
## Videos & Video Game Rental,
## Vocational & Technical School,
## Wedding Planning,
## Weight Loss Centers,
## Yoga,
## Zoos} Active Life
## 4 {Beauty & Spas,
## Colonics,
## Day Spas,
## Doctors,
## Hair Removal,
## Health & Medical,
## Massage Therapy,
## Medical Centers,
## Skin Care,
## Traditional Chinese Medicine} Acupuncture
## 5 {Arts & Entertainment,
## Bars,
## Breakfast & Brunch,
## Nightlife,
## Party & Event Planning} Adult Entertainment
## 6 {Halal,
## Mediterranean,
## Pakistani,
## Persian/Iranian,
## Restaurants} Afghan
## 7 {Caribbean,
## Moroccan,
## Southern} African
## 8 {Bars} Airport Lounges
## 9 {Limos} Airport Shuttles
## 10 {Doctors} Allergists
length(transDat) # get number of observations
## [1] 950
size(transDat[1:10]) # number of items in each observation
## [1] 1 1 118 10 5 5 3 1 1 1
## LIST() ran for 3 hours without finishing because of the length of the list, so here I omit LIST() and only run inspect().
#'@ LIST(transDat) # convert 'transactions' to a list, note the LIST in CAPS
inspect(transDat2[1:100]) # view the observations
## items
## 1 {American (Traditional),
## Breakfast & Brunch,
## Restaurants}
## 2 {Restaurants,
## Sandwiches}
## 3 {IT Services & Computer Repair,
## Local Services}
## 4 {Italian,
## Restaurants}
## 5 {Coffee & Tea,
## Food}
## 6 {Fast Food,
## Restaurants}
## 7 {Home Services,
## Mortgage Brokers,
## Real Estate}
## 8 {Brasseries,
## Restaurants}
## 9 {American (New),
## Bars,
## Chicken Wings,
## Nightlife,
## Restaurants,
## Sports Bars}
## 10 {Auto Detailing,
## Automotive,
## Wheel & Rim Repair,
## Windshield Installation & Repair}
## 11 {Auto Parts & Supplies,
## Automotive}
## 12 {CSA,
## Farmers Market,
## Food,
## Grocery}
## 13 {CPR Classes,
## Education,
## First Aid Classes,
## Specialty Schools}
## 14 {Event Planning & Services,
## Venues & Event Spaces}
## 15 {Furniture Stores,
## Home & Garden,
## Home Decor,
## Shopping}
## 16 {Books, Mags, Music & Video,
## Bookstores,
## Shopping}
## 17 {Auto Repair,
## Automotive}
## 18 {Dry Cleaning & Laundry,
## Local Services}
## 19 {American (New),
## Burgers,
## Restaurants}
## 20 {Pizza,
## Restaurants}
## 21 {Beauty & Spas,
## Massage}
## 22 {Food,
## Juice Bars & Smoothies}
## 23 {Pizza,
## Restaurants}
## 24 {Bars,
## Lounges,
## Nightlife}
## 25 {Bars,
## Champagne Bars,
## Lounges,
## Nightlife}
## 26 {Burgers,
## Restaurants}
## 27 {American (Traditional),
## Bars,
## Nightlife,
## Restaurants}
## 28 {Event Photography,
## Event Planning & Services,
## Photographers,
## Session Photography}
## 29 {Beauty & Spas,
## Convenience Stores,
## Cosmetics & Beauty Supply,
## Drugstores,
## Food,
## Shopping}
## 30 {Active Life,
## Parks}
## 31 {Food,
## Ice Cream & Frozen Yogurt}
## 32 {Fast Food,
## Restaurants}
## 33 {Beauty & Spas,
## Hair Salons}
## 34 {Doctors,
## Eyewear & Opticians,
## Health & Medical,
## Shopping}
## 35 {Eyewear & Opticians,
## Shopping}
## 36 {Beauty & Spas,
## Dermatologists,
## Doctors,
## Health & Medical,
## Skin Care}
## 37 {Community Service/Non-Profit,
## Local Services}
## 38 {Art Galleries,
## Arts & Entertainment,
## Shopping}
## 39 {Cafes,
## Coffee & Tea,
## Food,
## Restaurants}
## 40 {Barbers,
## Hair Salons,
## Mens Hair Salons;Beauty & Spas}
## 41 {Beauty & Spas,
## Blow Dry/Out Services,
## Hair Extensions,
## Hair Salons}
## 42 {Bakeries,
## Desserts,
## Food}
## 43 {Automotive,
## Car Wash}
## 44 {Barbers,
## Beauty & Spas,
## Hair Extensions,
## Hair Salons}
## 45 {Arts & Entertainment,
## Casinos,
## Event Planning & Services,
## Hotels,
## Hotels & Travel}
## 46 {Pizza,
## Restaurants}
## 47 {Bagels,
## Coffee & Tea,
## Food,
## Restaurants,
## Sandwiches}
## 48 {Beauty & Spas,
## Hair Salons}
## 49 {Delis,
## Fast Food,
## Restaurants,
## Sandwiches}
## 50 {Active Life,
## Fitness & Instruction,
## Yoga}
## 51 {Books, Mags, Music & Video,
## Shopping,
## Video Game Stores,
## Videos & Video Game Rental}
## 52 {Beauty & Spas,
## Nail Salons}
## 53 {Beauty & Spas,
## Nail Salons}
## 54 {Department Stores,
## Fashion,
## Shopping}
## 55 {Food,
## Ice Cream & Frozen Yogurt}
## 56 {Portuguese,
## Restaurants}
## 57 {Beauty & Spas,
## Spray Tanning,
## Tanning}
## 58 {British,
## Restaurants}
## 59 {Creperies,
## Restaurants}
## 60 {Food,
## Ice Cream & Frozen Yogurt}
## 61 {Bakeries,
## Food}
## 62 {Department Stores,
## Fashion,
## Shopping}
## 63 {American (New),
## Restaurants}
## 64 {Pizza,
## Restaurants}
## 65 {Italian,
## Pizza,
## Restaurants}
## 66 {Doctors,
## Health & Medical}
## 67 {Drugstores,
## Shopping}
## 68 {Chinese,
## Restaurants}
## 69 {Beauty & Spas,
## Hair Removal,
## Makeup Artists,
## Skin Care}
## 70 {Home Services,
## Real Estate,
## Real Estate Agents,
## Real Estate Services}
## 71 {Arts & Crafts,
## Fabric Stores,
## Home & Garden,
## Home Decor,
## Shopping}
## 72 {Womens Clothing;Fashion;Shopping}
## 73 {Auto Repair,
## Automotive,
## Tires}
## 74 {Italian,
## Restaurants}
## 75 {Contractors,
## Home Services}
## 76 {Dance Clubs,
## Nightlife}
## 77 {Active Life,
## Aquarium Services,
## Aquariums,
## Pet Services,
## Pet Stores,
## Pets}
## 78 {Health & Medical,
## Massage Therapy}
## 79 {Beer, Wine & Spirits,
## Food,
## Grocery}
## 80 {Active Life,
## Fitness & Instruction,
## Gyms,
## Trainers,
## Yoga}
## 81 {Fast Food,
## Restaurants}
## 82 {Bars,
## Italian,
## Nightlife,
## Restaurants,
## Wine Bars}
## 83 {Bakeries,
## Candy Stores,
## Coffee & Tea,
## Food,
## Specialty Food}
## 84 {Food,
## Meat Shops,
## Specialty Food}
## 85 {Coffee & Tea,
## Food,
## Restaurants,
## Sandwiches}
## 86 {Buffets,
## Restaurants}
## 87 {American (New),
## Restaurants}
## 88 {Active Life,
## Sports Clubs}
## 89 {Italian,
## Pizza,
## Restaurants}
## 90 {Coffee & Tea,
## Food}
## 91 {Doctors,
## Family Practice,
## Health & Medical}
## 92 {Restaurants,
## Steakhouses}
## 93 {Active Life,
## Fitness & Instruction,
## Gyms}
## 94 {Automotive,
## Marinas}
## 95 {Auto Repair,
## Automotive,
## Tires}
## 96 {Restaurants,
## Thai}
## 97 {Mexican,
## Restaurants}
## 98 {Active Life,
## Fitness & Instruction,
## Yoga}
## 99 {Doctors,
## Health & Medical,
## Medical Centers}
## 100 {Mexican,
## Restaurants}
length(transDat2) # get number of observations
## [1] 77185
size(transDat2[1:100]) # number of items in each observation
## [1] 3 2 2 2 2 2 3 2 6 4 2 4 4 2 4 3 2 2 3 2 2 2 2 3 4 2 4 4 6 2 2 2 2 4 2
## [36] 5 2 3 4 3 4 3 2 4 5 2 5 2 4 3 4 2 2 3 2 2 3 2 2 2 2 3 2 2 3 2 2 2 4 4
## [71] 5 1 3 2 2 2 6 2 3 5 2 5 5 3 4 2 2 2 3 2 3 2 3 2 3 2 2 3 3 2
## LIST() ran for 3 hours without finishing because of the length of the list, so here I omit LIST() and only run inspect().
#'@ LIST(transDat) # convert 'transactions' to a list, note the LIST in CAPS
head(transDat)
## transactions in sparse format with
## 6 transactions (rows) and
## 830 items (columns)
head(transDat2)
## transactions in sparse format with
## 6 transactions (rows) and
## 1048 items (columns)
frequentItems <- eclat(transDat, parameter = list(supp = 0.07, maxlen = 15)) # calculates support for frequent items
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.07 1 15 frequent itemsets FALSE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 66
##
## create itemset ...
## set transactions ...[830 item(s), 950 transaction(s)] done [0.00s].
## sorting and recoding items ... [3 item(s)] done [0.00s].
## creating bit matrix ... [3 row(s), 950 column(s)] done [0.00s].
## writing ... [3 set(s)] done [0.00s].
## Creating S4 object ... done [0.00s].
itemFrequencyPlot(transDat, topN = 10, type = 'absolute', col = rainbow(4)) # plot frequent items
graph 2.1.2a : top 10 items in dataset.
frequentItems <- eclat(transDat2, parameter = list(supp = 0.07, maxlen = 15)) # calculates support for frequent items
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.07 1 15 frequent itemsets FALSE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 5402
##
## create itemset ...
## set transactions ...[1048 item(s), 77185 transaction(s)] done [0.03s].
## sorting and recoding items ... [4 item(s)] done [0.00s].
## creating bit matrix ... [4 row(s), 77185 column(s)] done [0.00s].
## writing ... [4 set(s)] done [0.00s].
## Creating S4 object ... done [0.00s].
itemFrequencyPlot(transDat2, topN = 10, type = 'absolute', col = rainbow(4)) # plot frequent items
graph 2.1.2b : top 10 items in dataset.
# Get the rules
rules <- apriori(transDat, parameter = list(supp = 0.01, conf = 0.5))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport support minlen maxlen
## 0.5 0.1 1 none FALSE TRUE 0.01 1 10
## target ext
## rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 9
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[830 item(s), 950 transaction(s)] done [0.00s].
## sorting and recoding items ... [75 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.00s].
## writing ... [2236 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
# Take the first 5 rules, keep those with lift > 6, and sort them by confidence (printed with 2 digits)
options(digits=2)
inspect(sort(subset(rules[1:5], subset = lift > 6), by = 'confidence'))
## lhs rhs support confidence lift
## 4 {Salad} => {Restaurants} 0.011 1.00 9.3
## 1 {Buffets} => {Restaurants} 0.013 0.92 8.6
## 3 {Chicken Wings} => {Restaurants} 0.012 0.92 8.5
## 2 {Chicken Wings} => {Barbeque} 0.011 0.83 24.7
## 5 {Caribbean} => {Restaurants} 0.011 0.83 7.8
# Get the rules
rules <- apriori(transDat2, parameter = list(supp = 0.01, conf = 0.5))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport support minlen maxlen
## 0.5 0.1 1 none FALSE TRUE 0.01 1 10
## target ext
## rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 771
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[1048 item(s), 77185 transaction(s)] done [0.02s].
## sorting and recoding items ... [49 item(s)] done [0.00s].
## creating transaction tree ... done [0.04s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [57 rule(s)] done [0.00s].
## creating S4 object ... done [0.01s].
# Take the first 5 rules, keep those with lift > 6, and sort them by confidence (printed with 2 digits)
options(digits=2)
inspect(sort(subset(rules[1:5], subset = lift > 6), by = 'confidence'))
## lhs rhs support confidence lift
## 3 {Ice Cream & Frozen Yogurt} => {Food} 0.013 1.00 8.3
## 4 {General Dentistry} => {Dentists} 0.011 1.00 64.6
## 5 {Dentists} => {General Dentistry} 0.011 0.69 64.6
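To make the three columns concrete, the sketch below recomputes support, confidence, and lift for a single rule from a list of transactions, using the standard definitions: support is the share of transactions containing both sides, confidence is that support divided by the support of the left-hand side, and lift is confidence divided by the support of the right-hand side. The function name and the toy data are made up; this is not how arules computes them internally.

```r
# Hypothetical helper: metrics for the rule lhs => rhs.
rule_metrics <- function(transactions, lhs, rhs) {
  n <- length(transactions)
  # TRUE for each transaction that contains all of `items`.
  has <- function(items) {
    vapply(transactions, function(t) all(items %in% t), logical(1))
  }
  supp_lhs <- sum(has(lhs)) / n
  supp_rhs <- sum(has(rhs)) / n
  supp_both <- sum(has(c(lhs, rhs))) / n
  confidence <- supp_both / supp_lhs
  c(support = supp_both, confidence = confidence, lift = confidence / supp_rhs)
}

# Toy data (made up): 4 places.
toy <- list(
  c("Food", "Ice Cream & Frozen Yogurt"),
  c("Food", "Coffee & Tea"),
  c("Restaurants", "Pizza"),
  c("Restaurants", "Italian")
)
m <- rule_metrics(toy, lhs = "Ice Cream & Frozen Yogurt", rhs = "Food")
print(m)
```

Read the same way, a rule with confidence 1.00 and lift 8.3 implies that its right-hand side appears in roughly 1/8.3, or about 12%, of all transactions.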
I will also conduct a market basket analysis in Betting Strategy and Model Validation with regard to sportsbook betting.
It’s useful to record some information about how your file was created.
[1] "2016-09-22 01:39:40 JST"
None of the plot() functions for arules work when knitting in R Markdown, but running the chunks one by one works fine.