A thesis presented to Bulacan State University, Malolos, in fulfillment of the requirements for the degree of Master of Business Administration.
MBA student of Bulacan State University
Fundamental Information Technology Engineer (FE)
Consultant Data Science Architect of Peritus Knowledge Services Corporation
Business Manager for IT and Support of Mediablast Corporation
Specializations completed while taking MBA:
The study targets general consumers in the province of Bulacan. Respondents were not limited to customers of popular stores or of small players. To avoid bias, most of the survey was conducted at food chains rather than inside the stores.
POS transactions from cashiers are excluded.
The survey covers basic necessities, prime commodities, and the Department of Trade and Industry (DTI) list, which were merged to form the selection of items.
Frequency is the number of times an item is bought together with the target item; it shows which item is most related to the target. Support is the share of all transactions that contain the pair, so it shows which pairs are purchased often enough to back a relation to the target item, although low support does not mean the items will never be ordered together. Confidence is the most important measure, as it relates the support of the pair to how often the target item is bought: it is the share of transactions containing the target item that also contain the paired item. The most interesting pairs, however, are those where support and confidence are both high or both low. This is where lift matters, because lift compares the observed confidence with how often the paired item is bought overall, and values above 1 support an association.
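As a minimal sketch of these four measures, the base R snippet below computes frequency, support, confidence, and lift for one pair of items. The five baskets are hypothetical toy data, not the survey responses, and the helper `support()` is an illustrative name rather than a function used in the study.

```r
# Toy baskets (hypothetical, not the survey data)
baskets <- list(
  c("Rice", "Fresh Eggs", "Cooking Oil"),
  c("Rice", "Fresh Eggs"),
  c("Rice", "Coffee", "Sugar"),
  c("Coffee", "Sugar"),
  c("Rice", "Cooking Oil")
)

# Share of baskets that contain every item in `items`
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

target <- "Rice"
other  <- "Fresh Eggs"

# Frequency: number of baskets containing both items
frequency <- sum(vapply(baskets,
                        function(b) all(c(target, other) %in% b),
                        logical(1)))

supp_pair  <- support(c(target, other), baskets)      # support of the pair
confidence <- supp_pair / support(target, baskets)    # P(other | target)
lift       <- confidence / support(other, baskets)    # above 1: bought together more than chance

print(c(frequency = frequency, support = supp_pair,
        confidence = confidence, lift = lift))
```

In this toy set the pair appears in 2 of 5 baskets, so a lift above 1 indicates that eggs show up with rice more often than their overall share would suggest.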
Association analysis is a type of analysis that looks for items that go together, especially within purchase transactions. The recorded data are analyzed for how they relate to one another. When this data mining technique is applied to retail business, it is known as market basket analysis. Other applications include fraud examination and detection, which flag cases that deviate from the rule sets inferred by the model.
The goal of recommendation is to predict the item most relevant to a customer's previous choices. Hardesty (2014) reported on a group at MIT’s Laboratory for Information and Decision Systems (LIDS), which specializes in analyzing how social networks process information. The group makes predictions using collaborative filtering, but not the typical kind that simply assigns a likelihood of choice across samples. Their new approach produces sorted probability clusters, and customers who do not select an item are also assigned to a cluster, so that non-selection still affects the correlation among customers' choices. According to Sujay Sanghavi, a professor of electrical and computer engineering at the University of Texas, Austin, this approach is more principled. Recommendation is also applied in services such as Netflix and Amazon. Pennock et al. (2000) studied collaborative filtering (CF), which is widely used on the internet, as confirmed by Microsoft Research; they also found that the performance of CF is hard to measure.
This is a quantitative study of consumers in Bulacan. The questionnaire combines multiple-choice questions and rating scales. Most market basket analysis studies use POS transactions, whereas this study uses answers collected through the questionnaire. This approach removes bias toward brands.
Respondents are from Bulacan, which has a population of 3,292,000 people. For a sample to be valid, it should be less than ten percent (10%) of the population and have at least 30 respondents. A modified Slovin formula was used to compute the sample size. The towns selected for the study are Malolos, Baliwag, Bustos, Plaridel, and Pulilan, with care taken to keep the proper proportion of respondents per town.
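The modification applied to the Slovin formula is not described here, so the sketch below uses the standard form, n = N / (1 + N e^2), only as a reference point. The margin of error `e = 0.05` is an assumption, and the town populations used for the proportional allocation are hypothetical placeholders, not figures for the five towns in the study.

```r
# Standard Slovin formula (NOT the modified variant used in the study)
slovin <- function(N, e) ceiling(N / (1 + N * e^2))

N <- 3292000               # population of Bulacan, as stated in the text
n <- slovin(N, e = 0.05)   # e = 0.05 is an assumed margin of error

# Proportional allocation across towns.
# Populations below are hypothetical placeholders for illustration only.
town_pop   <- c(TownA = 50000, TownB = 30000, TownC = 20000)
allocation <- round(n * town_pop / sum(town_pop))

print(n)
print(allocation)
```

With these assumptions the standard formula gives a larger sample than the 328 respondents reported, which is consistent with the study using a different (modified) computation.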
This study uses the survey-questionnaire method. The R statistical language was used for data analysis and for presenting the data.
The survey used random sampling while maintaining the proper proportion of respondents per town, based on each town's population. An interviewer asked the questions and recorded the answers on a paper questionnaire. The data were then encoded through Google Forms to simplify the encoding.
Data processing was done in R, an open-source software environment. R is a popular statistical programming language used by international schools and universities in Australia, New Zealand, and the US. To speed up the computations, the likert package was used, along with ggplot2 for visualization, flexdashboard for presentation and data storytelling, and shiny for the dashboard application. The statistical treatment follows standard data analysis, with data science and analytics best practices applied so that the output is correct and every step can be reproduced on demand, even during this presentation. RStudio, a graphical user interface for R, aided in debugging the statistical code used for the analysis.
[1] "Timestamp"
[2] "Name..Optional."
[3] "Location"
[4] "Civil.Status."
[5] "Age"
[6] "Gender"
[7] "Monthly.Net.Income"
[8] "Highest.Educational.Attainment"
[9] "Public.Transportation"
[10] "Proximity"
[11] "Parking.Space"
[12] "Cheap.Prices"
[13] "Grocery.Store"
[14] "Sari.sari.Store"
[15] "Mall...Hypermarket"
[16] "Supermarket"
[17] "Wet.Market"
[18] "Dry.Market"
[19] "Items"
[20] "Who.are.the.consumers.of.purchased.products."
[21] "Most.preferred.Method.of.Payment"
[22] "Most.effective.Sales.Device.for.You"
These are the variable names in the CSV File.
'data.frame': 328 obs. of 22 variables:
$ Timestamp : Factor w/ 328 levels "2017/07/16 11:51:23 AM GMT+8",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Name..Optional. : Factor w/ 315 levels "---","----","-----",..: 296 244 236 247 291 259 270 286 127 277 ...
$ Location : Factor w/ 5 levels "Baliwag","Bustos",..: 2 2 2 2 2 2 2 2 2 2 ...
$ Civil.Status. : Factor w/ 4 levels "Married","Separated",..: 1 1 3 1 3 3 2 4 3 1 ...
$ Age : int 63 62 65 78 22 22 54 62 28 50 ...
$ Gender : Factor w/ 2 levels "Female","Male": 1 1 2 2 1 1 1 1 1 2 ...
$ Monthly.Net.Income : int 2000 20000 6000 5000 9000 8000 5000 15000 10000 2000 ...
$ Highest.Educational.Attainment : Factor w/ 4 levels "College","Elementary",..: 2 1 1 3 1 1 3 1 1 1 ...
$ Public.Transportation : int 2 3 2 2 1 2 3 2 1 2 ...
$ Proximity : int 2 3 2 2 2 2 5 3 1 2 ...
$ Parking.Space : int 3 3 2 1 1 2 4 2 1 2 ...
$ Cheap.Prices : int 1 1 1 1 2 3 2 1 1 1 ...
$ Grocery.Store : int 3 2 1 1 1 1 2 2 3 2 ...
$ Sari.sari.Store : int 2 2 1 1 1 1 2 2 2 1 ...
$ Mall...Hypermarket : int 3 3 1 2 4 4 4 2 2 4 ...
$ Supermarket : int 3 1 2 2 4 3 3 3 3 4 ...
$ Wet.Market : int 2 1 1 2 4 4 2 3 3 1 ...
$ Dry.Market : int 2 1 1 2 4 4 2 2 3 1 ...
$ Items : Factor w/ 322 levels "Bread in any shape and name; excluding pastries and cakes;Coffee;Sugar;Powdered; liquid; bar; laundry and detergent soap;Candie"| __truncated__,..: 152 308 293 58 317 123 37 101 200 181 ...
$ Who.are.the.consumers.of.purchased.products.: Factor w/ 25 levels "Children","Children;Parents;Relatives",..: 19 17 5 16 23 23 15 3 10 15 ...
$ Most.preferred.Method.of.Payment : Factor w/ 3 levels "Cash","Credit Card",..: 1 1 1 1 1 1 1 1 2 1 ...
$ Most.effective.Sales.Device.for.You : Factor w/ 4 levels "Demonstratrion (free taste, free sample, etc.)",..: 3 2 3 3 3 3 3 4 3 3 ...
NULL
A quick overview of the data.
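A listing like the two above can be produced with `names()` and `str()` after reading the exported CSV. The sketch below is self-contained: it writes a two-row stand-in file to a temporary path (the rows and the reduced set of columns are illustrative, not the real export) and reads it back. `stringsAsFactors = TRUE` is an assumption inferred from the `Factor` columns in the listing; since R 4.0 the default is `FALSE`.

```r
# Write a tiny stand-in CSV (illustrative rows, not the real survey export)
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Timestamp,Location,Age,Monthly Net Income,Cheap Prices",
  "2017/07/16 11:51:23 AM GMT+8,Bustos,63,2000,1",
  "2017/07/16 11:55:02 AM GMT+8,Malolos,22,9000,2"
), csv_path)

# read.csv() turns spaces in headers into dots (check.names = TRUE),
# which is why names such as Monthly.Net.Income appear in the listing.
survey <- read.csv(csv_path, stringsAsFactors = TRUE)

names(survey)  # variable names
str(survey)    # compact overview of types and first values
```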
Baliwag | Bustos | Malolos | Plaridel | Pulilan |
---|---|---|---|---|
74 | 33 | 123 | 53 | 45 |
[1] "Location" "Civil.Status."
[3] "Age" "Gender"
[5] "Monthly.Net.Income" "Highest.Educational.Attainment"
Demographic Variables
Town | Frequency | Percentage |
---|---|---|
Baliwag | 74 | 22.56098 |
Bustos | 33 | 10.06098 |
Malolos | 123 | 37.50000 |
Plaridel | 53 | 16.15854 |
Pulilan | 45 | 13.71951 |
Respondents were drawn from five towns. The largest share is from Malolos (37.50%), followed by Baliwag (22.56%), Plaridel (16.16%), Pulilan (13.72%), and Bustos (10.06%). In total there were 328 respondents, from a population of 3,292,000 in Bulacan.
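The frequency and percentage columns can be derived with `table()` and `prop.table()`. The sketch below rebuilds the town vector from the counts reported above, since the raw responses are not part of this document; the variable names are illustrative.

```r
# Rebuild the Location column from the reported counts
town <- rep(c("Baliwag", "Bustos", "Malolos", "Plaridel", "Pulilan"),
            times = c(74, 33, 123, 53, 45))

freq <- table(town)                        # counts per town
pct  <- round(100 * prop.table(freq), 2)   # share of all respondents, in percent

town_table <- data.frame(
  Town       = names(freq),
  Frequency  = as.vector(freq),
  Percentage = as.vector(pct)
)
print(town_table)
```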
Civil Status | Frequency | Percentage |
---|---|---|
Single | 96 | 29.268293 |
Married | 196 | 59.756098 |
Widowed | 22 | 6.707317 |
Separated | 14 | 4.268293 |
Most of the respondents are married (59.76%); the fewest are separated (4.27%).
[1] "Note: [10,20) means 10 <= age < 20"
Min. 1st Qu. Median Mean 3rd Qu. Max.
19.00 28.00 36.00 38.61 46.00 81.00
[1] "Estimated mean from interval: 38.6585365853659; Estimtated SD from interval: 13.6815861092013"
Age Range | Frequency | Percentage |
---|---|---|
[10,20) | 3 | 0.9146341 |
[20,30) | 105 | 32.0121951 |
[30,40) | 93 | 28.3536585 |
[40,50) | 64 | 19.5121951 |
[50,60) | 29 | 8.8414634 |
[60,70) | 29 | 8.8414634 |
[70,80) | 3 | 0.9146341 |
[80,90) | 2 | 0.6097561 |
The youngest respondent was 19 years old and the oldest was 81. The mean age is about 39 years (38.61) and the median age is 36 years.
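The half-open ranges in the table, where `[10,20)` means 10 <= age < 20, correspond to `cut()` with `right = FALSE`. The sketch below shows that binning on a few illustrative ages, then recomputes the interval-based estimates from the frequency table, assuming each respondent sits at the midpoint of the range and that the SD uses the n - 1 denominator. Under those assumptions the results come out at about 38.66 and 13.68, in line with the estimates printed above.

```r
breaks <- seq(10, 90, by = 10)

# Binning: right = FALSE gives [lower, upper) intervals
ages <- c(19, 20, 36, 81)   # illustrative ages only
bins <- cut(ages, breaks = breaks, right = FALSE)
print(bins)

# Grouped estimates from the frequency table in the text
freq <- c(3, 105, 93, 64, 29, 29, 3, 2)
mid  <- head(breaks, -1) + 5          # assumed interval midpoints: 15, 25, ..., 85
n    <- sum(freq)

est_mean <- sum(freq * mid) / n
est_sd   <- sqrt(sum(freq * (mid - est_mean)^2) / (n - 1))

print(c(mean = est_mean, sd = est_sd))
```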
Gender | Frequency | Percentage |
---|---|---|
Female | 178 | 54.26829 |
Male | 150 | 45.73171 |
There are 178 female respondents (54.27%) and 150 male respondents (45.73%).
Min. 1st Qu. Median
54.0000000000000 3375.0000000000000 6000.0000000000000
Mean 3rd Qu. Max.
7731.2621951219517 10250.0000000000000 24000.0000000000000
[1] "Estimated mean from interval: 8064.0243902439; Estimtated SD from interval: 5494.13239632921"
Monthly Income Range | Frequency | Percentage |
---|---|---|
[0,5000) | 118 | 35.9756097560975618 |
[5000,10000) | 106 | 32.3170731707317103 |
[10000,15000) | 65 | 19.8170731707317067 |
[15000,20000) | 27 | 8.2317073170731714 |
[20000,25000) | 12 | 3.6585365853658534 |
Among the respondents, the median monthly net income is 6,000.00 pesos and the mean monthly net income is 7,731.26 pesos.
Highest Education Level | Frequency | Percentage |
---|---|---|
Elementary | 32 | 9.7560975609756095 |
High School | 150 | 45.7317073170731732 |
Vocational | 1 | 0.3048780487804878 |
College | 145 | 44.2073170731707350 |
Master’s Degree | 0 | 0.0000000000000000 |
Doctor’s Degree | 0 | 0.0000000000000000 |
Others | 0 | 0.0000000000000000 |
The respondents’ highest educational attainment is as follows: the largest group completed high school (45.73%), followed by college (44.21%), elementary (9.76%), and vocational (0.30%). No respondent in the sample holds a graduate degree. This may be because the survey was conducted during working hours, when potential respondents with graduate degrees were teaching in educational institutions or working in large firms.
transpo HI: 47 CI:170 SI:100 LI: 11 NI: 0
proximity HI: 43 CI: 77 SI: 97 LI:101 NI: 10
parking HI: 19 CI: 70 SI:121 LI: 81 NI: 37
cheap HI:245 CI: 80 SI: 3 LI: 0 NI: 0
Item | HI | CI | SI | LI | NI |
---|---|---|---|---|---|
transpo | 14.3292682926829258 | 51.829268292682926 | 30.48780487804878092 | 3.3536585365853662 | 0.0000000000000000 |
proximity | 13.1097560975609753 | 23.475609756097558 | 29.57317073170731447 | 30.7926829268292686 | 3.0487804878048781 |
parking | 5.7926829268292686 | 21.341463414634145 | 36.89024390243902474 | 24.6951219512195124 | 11.2804878048780495 |
cheap | 74.6951219512195053 | 24.390243902439025 | 0.91463414634146334 | 0.0000000000000000 | 0.0000000000000000 |
Abbreviation | Description |
---|---|
HI | Highly Important |
CI | Considerably Important |
SI | Somewhat Important |
LI | Little Importance |
NI | Not at All |
Four buying factors were rated: access to public transportation, proximity of the store, ample parking space, and cheap prices. Cheap prices were rated highly important by most respondents (74.70%), while parking space was rated highly important by the fewest (5.79%).
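The percentage table for the buying factors follows directly from the response counts. The study used the likert package for this step; the sketch below is a plain base R equivalent, built from the counts listed above, that converts each row to percentages and picks out the factors most and least often rated highly important.

```r
# Response counts per buying factor, taken from the summary above
counts <- rbind(
  transpo   = c(HI = 47,  CI = 170, SI = 100, LI = 11,  NI = 0),
  proximity = c(HI = 43,  CI = 77,  SI = 97,  LI = 101, NI = 10),
  parking   = c(HI = 19,  CI = 70,  SI = 121, LI = 81,  NI = 37),
  cheap     = c(HI = 245, CI = 80,  SI = 3,   LI = 0,   NI = 0)
)

# Row-wise percentages: each factor sums to 100
pct <- round(100 * prop.table(counts, margin = 1), 2)
print(pct)

most_hi  <- names(which.max(pct[, "HI"]))   # factor most often rated HI
least_hi <- names(which.min(pct[, "HI"]))   # factor least often rated HI
print(c(most = most_hi, least = least_hi))
```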
Grocery AT : 43 MT :138 NO/NS:115 ST : 31
Sari-sari Store AT : 69 MT :157 NO/NS: 87 ST : 14
Hypermarket AT : 10 MT : 71 NO/NS:125 ST :120
Supermarket AT : 18 MT : 67 NO/NS:122 ST :119
Wet Market AT : 19 MT :100 NO/NS:128 ST : 77
Dry Market AT : 21 MT : 97 NO/NS:126 ST : 81
Grocery NA : 1
Sari-sari Store NA : 1
Hypermarket NA : 2
Supermarket NA : 2
Wet Market NA : 4
Dry Market NA : 3
Item | AT | MT | NO/NS | ST | NA |
---|---|---|---|---|---|
Grocery | 13.1097560975609753 | 42.073170731707314 | 35.060975609756099 | 9.4512195121951219 | 0.30487804878048780 |
Sari-sari Store | 21.0365853658536572 | 47.865853658536587 | 26.524390243902442 | 4.2682926829268295 | 0.30487804878048780 |
Hypermarket | 3.0487804878048781 | 21.646341463414632 | 38.109756097560975 | 36.5853658536585371 | 0.60975609756097560 |
Supermarket | 5.4878048780487809 | 20.426829268292682 | 37.195121951219512 | 36.2804878048780495 | 0.60975609756097560 |
Wet Market | 5.7926829268292686 | 30.487804878048781 | 39.024390243902438 | 23.4756097560975583 | 1.21951219512195119 |
Dry Market | 6.4024390243902438 | 29.573170731707314 | 38.414634146341463 | 24.6951219512195124 | 0.91463414634146334 |
Abbreviation | Description |
---|---|
AT | All the Time |
MT | Most of the Time |
NO/NS | Not Often / Not Seldom |
ST | Some Time |
NA | Not at All |
Consumers visit different store types with different frequency. The questionnaire covers the following store types: grocery store, sari-sari store, hypermarket, supermarket, wet market, and dry market. Respondents buy most often from sari-sari stores (21.04% all the time, 47.87% most of the time) and least often from hypermarkets (3.05% all the time, 21.65% most of the time).
Apriori
Parameter specification:
confidence minval smax arem aval originalSupport
0.80000000000000004 0.10000000000000001 1 none FALSE TRUE
maxtime support minlen maxlen target ext
5 0.050000000000000003 1 20 rules FALSE
Algorithmic control:
filter tree heap memopt load sort verbose
0.10000000000000001 TRUE TRUE FALSE TRUE 2 TRUE
Absolute minimum support count: 16
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[37 item(s), 328 transaction(s)] done [0.00s].
sorting and recoding items ... [30 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 5 6 7 8 9 10 done [0.03s].
writing ... [44926 rule(s)] done [0.04s].
creating S4 object ... done [0.07s].
Item | Count |
---|---|
Rice | 308 |
Fresh Vegetables / Root Crops | 212 |
Powdered,liquid,bar,laundry and detergent soap | 198 |
Poultry Meat (like chicken) | 182 |
Fresh Eggs | 181 |
Cooking Oil | 169 |
Fresh Fish,Dried Fish,Canned Fish | 148 |
Fresh Pork / Beef | 145 |
Instant Noodles and other noodles | 131 |
Coffee | 120 |
Sugar | 114 |
Fresh Fruits | 110 |
Bread in any shape and name,excluding pastries and cakes | 106 |
Garlic | 98 |
Onions | 98 |
Bath Shampoo | 96 |
Fresh and Processed Milk | 90 |
Patis / Fish sauce | 80 |
Candies / Chocolates | 77 |
Soy sauce | 73 |
Junk foods / chips | 69 |
Vinegar | 64 |
Toilet Soap | 52 |
Diaper (baby or adult) | 43 |
Salt | 41 |
Coffee creamer | 37 |
Pepper | 30 |
Pet foods | 29 |
Chili | 28 |
Other Marine products (crabs,mussels,oysters) | 21 |
Pasta | 14 |
Other Spices | 10 |
Corn | 8 |
Charcoal | 7 |
Other sauces | 4 |
Flour | 2 |
Batteries | 1 |
Association Rules mining, parameters used: support: 0.05, confidence: 0.8
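A hedged sketch of how the `Items` answers could be turned into baskets and mined with these parameters is shown below. The four answer strings are toy data. The splitting rule (a semicolon not followed by a space separates items, while `"; "` occurs inside an item name) is an assumption inferred from the truncated `Items` value in the data overview, and the study's actual preprocessing may differ. The rule mining step runs only if the arules package is installed.

```r
# Toy multi-select answers in the same shape as the Items column (not real data)
items_raw <- c(
  "Rice;Coffee;Sugar;Powdered; liquid; bar; laundry and detergent soap",
  "Rice;Fresh Eggs;Cooking Oil",
  "Rice;Coffee;Sugar",
  "Rice;Fresh Eggs"
)

# Assumed rule: ";" without a following space separates items
baskets <- strsplit(items_raw, ";(?! )", perl = TRUE)
# Replace the "; " inside item names with "," as in the item count table
baskets <- lapply(baskets, function(b) gsub("; ", ",", b, fixed = TRUE))

# Item counts, most frequent first
item_counts <- sort(table(unlist(baskets)), decreasing = TRUE)
print(item_counts)

# Association rules with the stated parameters (requires the arules package)
if (requireNamespace("arules", quietly = TRUE)) {
  library(arules)
  trans <- as(baskets, "transactions")
  rules <- apriori(trans,
                   parameter = list(support = 0.05, confidence = 0.8),
                   control   = list(verbose = FALSE))
  inspect(head(sort(rules, by = "lift"), 5))
}
```

Sorting by lift puts the rules that depart most from independent purchasing at the top, which is usually the first view worth reading.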
Network Graph of all items. Please use the Shiny App for this presentation.
Application URL: https://wenmi-escience.shinyapps.io/market_basket_analysis_in_bulacan_2017_rowen_remis_r_iral/
For Whom | Counts | Percentage |
---|---|---|
Own Self | 322 | 36.3841807909604 |
Children | 219 | 24.7457627118644 |
Partner (Wife/Husband) | 201 | 22.7118644067797 |
Parents | 94 | 10.6214689265537 |
Relatives | 38 | 4.29378531073446 |
Senior Citizen (age 60 and above) | 10 | 1.12994350282486 |
Friends | 1 | 0.112994350282486 |
Consumers purchase products mostly for themselves (36.38%), followed by children (24.75%), partner (wife/husband) (22.71%), parents (10.62%), relatives (4.29%), senior citizens (1.13%), and friends (0.11%).
Payment Method | Frequency | Percentage |
---|---|---|
Cash | 325 | 99.0853658536585300 |
Credit Card | 2 | 0.6097560975609756 |
Debit Card | 1 | 0.3048780487804878 |
Almost all respondents in Bulacan prefer to pay in cash (99.09%), followed by credit card (0.61%) and debit card (0.30%).
Sales Device | Frequency | Percentage |
---|---|---|
Demonstration (free taste, free sample, etc.) | 76 | 23.1707317073170742 |
Display (color, labels, shelf arrangement) | 32 | 9.7560975609756095 |
Pricing (discounts, buy 1 take 1 promos, etc.) | 190 | 57.9268292682926784 |
Sales Talk | 30 | 9.1463414634146343 |
Consumers in Bulacan respond most to pricing, such as discounts and promos (57.93%), followed by product demonstrations (23.17%); display (9.76%) and sales talk (9.15%) were rated lowest.
Thank you.
Rowen Remis R. Iral
http://wenup.wordpress.com
2017 December
Created using R and Flex Dashboard package.