This data set consists of 2380 purchase records of yogurt products. For each record we have the id of the household, the time of purchase, the number of yogurts of each flavor purchased, and the price recorded for the purchase.

Here, I have used histograms, line plots and scatter plots to understand the purchase behavior.

setwd("D:/Raviteja/Raviteja Professional/Data Science/EDA_Course_Materials")
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.2.3
#1. Yogurt data set:
  
yo<-read.csv("yogurt.csv")
yo$id<-factor(yo$id)
str(yo)
## 'data.frame':    2380 obs. of  9 variables:
##  $ obs        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ id         : Factor w/ 332 levels "2100081","2100370",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ time       : int  9678 9697 9825 9999 10015 10029 10036 10042 10083 10091 ...
##  $ strawberry : int  0 0 0 0 1 1 0 0 0 0 ...
##  $ blueberry  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ pina.colada: int  0 0 0 0 1 2 0 0 0 0 ...
##  $ plain      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ mixed.berry: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ price      : num  59 59 65 65 49 ...
#2. Price histogram:
  
unique(yo$price)
##  [1] 58.96 65.04 48.96 68.96 39.04 24.96 50.00 45.04 33.04 44.00 33.36
## [12] 55.04 62.00 20.00 49.60 49.52 33.28 63.04 33.20 33.52
# So, there are only 20 distinct price levels. Let us see which prices account for the most purchases.

qplot(data= yo,x= price, binwidth= 0.25)+scale_x_continuous(limits=c(20,70),breaks=seq(20,70,1))+scale_y_continuous(limits=c(0,800),breaks=seq(0,800,25))
## Warning: Removed 2 rows containing missing values (geom_bar).

# Most purchases were made at prices of 65.04 and 68.96.
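
The histogram reading can be cross-checked by counting purchases per price directly. The following is a minimal sketch on a made-up toy data frame (`toy` and its values are assumptions for illustration, not the real yogurt data):

```r
# Toy stand-in for the yogurt data; these prices are made up for illustration.
toy <- data.frame(price = c(65.04, 65.04, 68.96, 48.96, 65.04, 68.96, 58.96))

# table() counts how many purchases fall on each distinct price;
# sort() puts the most frequent price first.
price.counts <- sort(table(toy$price), decreasing = TRUE)
price.counts

# The most frequent price in the toy data, as a character label.
top.price <- names(price.counts)[1]
top.price
```

On the real data the same idea would be `sort(table(yo$price), decreasing = TRUE)`, which lists every price level with its purchase count.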

#3. Getting more clarity on consumer behavior/price:

# As shown in the data set, there are 5 flavors of yogurt. Let us see how many yogurts are typically bought in a single purchase.

# As shown below, a separate variable named all.purchases is created, holding the total number of yogurts in each purchase record.

yo<- transform(yo, all.purchases= strawberry+blueberry +pina.colada +plain + mixed.berry) 

qplot(x= all.purchases,data=yo,binwidth=0.1)+scale_x_continuous(limits=c(1,11),breaks=seq(1,11,1))
## Warning: Removed 2 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).

# From this graph we can see that most customers bought 1 or 2 yogurts at a time.
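
The per-purchase total can also be tabulated rather than read off the histogram. This is a small sketch on made-up toy rows (the `toy` data frame and the `flavors` vector are assumptions introduced here; only the column names match the real data):

```r
# Toy stand-in with the same flavor columns as the yogurt data; values are made up.
toy <- data.frame(
  strawberry  = c(0, 1, 0, 2),
  blueberry   = c(0, 0, 1, 0),
  pina.colada = c(0, 1, 0, 0),
  plain       = c(0, 0, 0, 1),
  mixed.berry = c(1, 0, 1, 0)
)

flavors <- c("strawberry", "blueberry", "pina.colada", "plain", "mixed.berry")

# rowSums() adds the flavor columns row by row, giving the same total
# as summing the five columns explicitly.
toy$all.purchases <- rowSums(toy[, flavors])

# Count how many purchases contain 1, 2, 3, ... yogurts.
table(toy$all.purchases)

# Total yogurts sold per flavor across all purchases.
colSums(toy[, flavors])
```

With the real data, `table(yo$all.purchases)` would give the exact counts behind each bar of the histogram.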

# So far we have seen that customers most often buy 1 or 2 yogurts at a time, and that the most common purchase prices are above 65.

#4. Let us see how the price varies over time.

ggplot(aes(x=time/3600,y=price),data=yo)+geom_jitter(alpha=1/2)

# From this graph we can see that purchase prices increase over the period. The lower values scattered through the same period suggest that some customers might have used coupons.
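
One way to check the upward trend numerically is to compare a typical price in earlier and later periods. The sketch below uses made-up toy values; the two-period split and the `early`/`late` labels are assumptions chosen for illustration:

```r
# Toy stand-in for the yogurt data; times and prices are made up.
toy <- data.frame(
  time  = c(9700, 9800, 9900, 10300, 10400, 10500),
  price = c(48.96, 58.96, 58.96, 65.04, 39.04, 68.96)
)

# Split the time range into two equal-width periods.
toy$period <- cut(toy$time, breaks = 2, labels = c("early", "late"))

# Median price per period; the median is less affected by a few
# discounted purchases than the mean would be.
median.by.period <- tapply(toy$price, toy$period, median)
median.by.period
```

On the real data, a larger number of breaks in `cut(yo$time, ...)` would give a finer view of how the typical price moves over time.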

#5. Let us see how often individual households buy yogurts and how many they buy:
  
set.seed(4230)
sample.ids<- sample(levels(yo$id),16)

sample.ids
##  [1] "2107953" "2123463" "2167320" "2127605" "2124750" "2133066" "2134676"
##  [8] "2141341" "2107706" "2151829" "2119693" "2122705" "2115006" "2143271"
## [15] "2101980" "2101758"
ggplot(aes(x= time/3600,y= price),data=subset(yo,id %in% sample.ids))+facet_wrap(~id,scales= "free_x")+geom_line()+ geom_point(aes(size= all.purchases),pch=2)

# This graph shows the purchase behavior of individual households. For example, the household with id 2126847 (which may not appear in the graph shown above) bought 1 yogurt at around 2.8 hrs at a price of around 65.
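
The "how often" question can also be summarised per household instead of being read from the facets. This is a rough sketch on made-up toy data (the ids `A`, `B`, `C` and all values are hypothetical):

```r
# Toy stand-in for the yogurt data; ids, times and prices are made up.
toy <- data.frame(
  id    = factor(c("A", "A", "A", "B", "B", "C")),
  time  = c(9700, 9800, 9900, 9750, 10100, 10000),
  price = c(65.04, 65.04, 48.96, 68.96, 68.96, 58.96)
)

# Number of purchase records per household.
n.purchases <- table(toy$id)
n.purchases

# Median gap between consecutive purchases for each household,
# in the same units as the time column. A household with a single
# purchase has no gap, so its value is NA.
median.gap <- tapply(toy$time, toy$id, function(t) median(diff(sort(t))))
median.gap
```

Applied to the real data, `table(yo$id)` and the same `tapply()` call on `yo$time` and `yo$id` would cover all 332 households rather than a sample of 16.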