Raw daily performance data for media-buying campaigns, covering 10/1 to 10/9.
114 entries and 12 columns: 9 numerical variables, 2 object variables, and 1 datetime variable.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from plotly import graph_objects as go
df=pd.read_excel('C:/data Challenge/CelerX-User Acquisition Data Analyst Assignment.xlsx',sheet_name='Raw')
catCol = df.select_dtypes(include=["object"]).columns #create an index for all categorical columns
numCol=df.select_dtypes(include=[np.number]).columns #create an index for all numerical columns
print(catCol, numCol)
df.head()
df.info()
df.describe() #descriptive statistics for numerical val
Question: how are the missing values distributed, and how should they be handled?
# count missing value
df.isnull().sum().sort_values(ascending=False)
# look for patterns in Date, Media_Source, Campaign, Installs, and Depositor/Installs among the missing values
miss = df[df.isnull().any(axis=1)]  # all rows containing at least one missing value
print('There are ' + str(len(miss)) + ' rows with missing values in total.')
for column in ['Date', 'Media_Source', 'Campaign']:
    # count missing values per level of the factor, count total values per level, and merge the two
    table = pd.merge(left=miss[column].value_counts().reset_index(),
                     right=df[column].value_counts().reset_index(),
                     how='left', on='index')
    table.columns = [column, 'missing count', 'value count']  # rename columns
    print(table)
1. There is no obvious pattern of missing values by date.
2. All missing values occur in the 'Organic' campaign and in 'Campaign_3'.
3. Most missing values belong to the 'Organic' campaign and the 'Cross Promotion' media source.
4. Every row from the 'Cross Promotion', 'Website Download', 'Organic', and 'App Referral' media sources has missing values (checked with the sketch below).
1. Campaign_3 is technically hard to track.
2. Very few users come from Campaign_3; it is ineffective, and its missing values stand for 0.
3. There is no cost or eCPI for organic users, so these should be 0.
4. Most organic users install our game directly, so there are no records of Impressions, Clicks, or CTR; these should also be 0.
5. The 'Cross Promotion', 'Website Download', 'Organic', and 'App Referral' media sources are the sources of our organic users, which explains their missing values.
df[df['Campaign']=='Campaign_3'].describe()
Installs, Depositor/Installs, and Depositor are still available for Campaign_3; everything else is missing, which supports the "technically hard to track" hypothesis.
Installs, Depositor/Installs, and Depositor are all below average, so the campaign is ineffective; the "ineffective campaign" hypothesis also holds.
We can fill these missing values with 0.
df[df['Campaign']=='Organic'].describe()
The hypotheses hold, so we can fill the missing values with 0. The missing-value pattern for 'Organic' is the same as for Campaign_3, which suggests an additional hypothesis: some 'Organic' rows may have been mis-recorded as 'Campaign_3'.
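One quick way to probe the mis-recording hypothesis is to compare which columns are missing for the two groups; a minimal sketch (run before the fillna below):
for name in ['Organic', 'Campaign_3']:
    has_na = df[df['Campaign'] == name].isnull().any()  # columns with at least one missing value
    print(name, has_na[has_na].index.tolist())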
df = df.fillna(0)  # fill missing values with 0
df.isnull().sum()
1. Campaign_3 is technically hard to track.
2. Very few users come from Campaign_3; it is ineffective.
3. There is no cost or eCPI for organic users, and most organic users install our game directly, so there are no records of Impressions, Clicks, or CTR.
4. The 'Cross Promotion', 'Website Download', 'Organic', and 'App Referral' media sources are the sources of our organic users.
5. Some 'Organic' rows may have been mis-recorded as 'Campaign_3'.
len(df.drop_duplicates(subset=['Date','Media_Source','Campaign'])) == len(df)  # check that (Date, Media_Source, Campaign) is unique
# to see which campaigns are duplicates.
df.groupby(['Date','Media_Source','Campaign']).filter(lambda g: len(g) > 1).groupby(['Date','Media_Source','Campaign']).size()
df[df['Click_Through_Rate']>=1]
# mis-recorded data: clicks are greater than impressions
df[df['Campaign']=='Campaign_9']['Click_Through_Rate'].mean()
Compared with the average CTR for Campaign_9, it is very likely that the Impressions value was mis-entered as '5' instead of '50'.
df.loc[df['Click_Through_Rate']>=1, 'Impressions']=50
df.loc[df['Click_Through_Rate']>=1, 'Click_Through_Rate']=0.18
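As an alternative to hard-coding 0.18, the CTR could be recomputed from the corrected Impressions; a sketch of that variant (meant to replace the two lines above, under the same assumption that the true Impressions value is 50):
mask = df['Click_Through_Rate'] >= 1               # same suspect rows as above: clicks exceed impressions
df.loc[mask, 'Impressions'] = 50                   # assumed fix: '5' was mis-entered for '50'
df.loc[mask, 'Click_Through_Rate'] = df.loc[mask, 'Clicks'] / df.loc[mask, 'Impressions']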
df[df['Click_t_ Install_Conversion_Rate']>=1]
df[df['Depositor/Installs']>=1]
df.describe()
org=df[df['Campaign']=='Organic'].groupby('Date').agg({'Impressions':'sum','Clicks':'sum','Installs':'sum','Depositor':'sum',"Cost":'sum'}).reset_index()
org['Click_Through_Rate']=0
org['Click_t_ Install_Conversion_Rate']=org['Installs']/org['Clicks']
org['Depositor/Installs']=org['Depositor']/org['Installs']
org['eCPI']=org['Cost']/org['Installs']
org['Campaign']='Organic'
org['Media_Source']='Organic'
data=pd.concat([df[df['Campaign']!='Organic'],org], axis=0,ignore_index=True)
data.to_csv('C:/data Challenge/data_celer.csv',index=False)
org
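Since organic rows can have zero Clicks or Installs, the derived ratio columns above may contain inf or NaN; a small guard, assuming the same column names, that would be applied before the concat in practice:
ratio_cols = ['Click_t_ Install_Conversion_Rate', 'Depositor/Installs', 'eCPI']
org[ratio_cols] = org[ratio_cols].replace([np.inf, -np.inf], np.nan).fillna(0)  # turn inf/NaN from 0-division into 0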
funnel_s=df.agg({'Impressions':'sum','Clicks':'sum','Installs':'sum','Depositor':'sum',"Cost":'sum'}).reset_index()
funnel_s
CTR = 1.255359e+05 / 2.219688e+06         # Clicks / Impressions
Conversion = 1.361176e+04 / 1.255359e+05  # Installs / Clicks
PPU = 8.933791e+02 / 1.361176e+04         # Depositor / Installs
eCPI = 8.753978e+04 / 1.361176e+04        # Cost / Installs
d={'Impressions':2.219688e+06,'CTR':CTR,'Conversion_Rate':Conversion, 'PPU':PPU, 'eCPI':eCPI }
d
KPI=pd.DataFrame({'Impressions':2.219688e+06,'CTR':CTR,'Conversion_Rate':Conversion, 'PPU':PPU, 'eCPI':eCPI },index=[0])
KPI.to_csv('C:/data Challenge/KPI.csv',index=False)
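The scientific-notation constants above are copied by hand from funnel_s; a sketch that derives the same KPIs directly from the column sums, which avoids transcription errors (assumes the column names used throughout this notebook):
totals = df[['Impressions', 'Clicks', 'Installs', 'Depositor', 'Cost']].sum()
kpi_check = pd.DataFrame({'Impressions': totals['Impressions'],
                          'CTR': totals['Clicks'] / totals['Impressions'],           # Clicks / Impressions
                          'Conversion_Rate': totals['Installs'] / totals['Clicks'],  # Installs / Clicks
                          'PPU': totals['Depositor'] / totals['Installs'],           # Depositor / Installs
                          'eCPI': totals['Cost'] / totals['Installs']},              # Cost / Installs
                         index=[0])
kpi_check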
fig = go.Figure()
fig.add_trace(go.Funnel(
name = 'Funnel Analysis',
y = ["Impressions", "Clicks", "Installs", "Depositor"],
x = [2.219688e+06, 1.255359e+05, 1.361176e+04, 8.933791e+02],
textinfo = "value+percent previous+percent total"))
fig.show()
The overall click-through rate is about 6%.
m1 = df.groupby('Media_Source')[['Impressions', 'Clicks', 'Installs', 'Depositor', 'Cost']].sum().sort_values(by='Impressions')
m2 = df[df['Media_Source'].str.contains('Source')].groupby('Media_Source')[['Impressions', 'Clicks', 'Installs', 'Depositor', 'Cost']].sum().sort_values(by='Impressions').reset_index()
m3=m1[m1['Impressions']==0 ].sum().to_frame().T.reset_index().rename(columns={"index": "Media_Source"})
m3.iloc[0,0]='Organic'
#make table for funnel analysis of Media Source
funnel=pd.concat([m2,m3],axis=0,ignore_index=True)
funnel.to_csv('C:/data Challenge/funnel.csv',index=False)
Issue: why do Clicks and Installs have decimals? After reviewing the original data, I found that Installs = Cost / eCPI in the sheet.
from plotly.subplots import make_subplots
fig = go.Figure()
fig.add_trace(go.Funnel(
name = 'Source_1',
y = ["Impressions", "Clicks", "Installs", "Depositor"],
x = [874309, 101099, 2938, 84],
textinfo = "value+percent previous"))
fig.add_trace(go.Funnel(
name = 'Source_3',
orientation = "h",
y = ["Impressions", "Clicks", "Installs", "Depositor"],
x = [856077, 9737, 3522, 315],
textposition = "inside",
textinfo = "value+percent previous"))
fig.add_trace(go.Funnel(
name = 'Source_4',
orientation = "h",
y = ["Impressions", "Clicks", "Installs", "Depositor"],
x = [415681, 11527, 2244, 92],
textposition = "inside",
textinfo = "value+percent previous"))
fig.add_trace(go.Funnel(
name = 'Source_2',
orientation = "h",
y = ["Impressions", "Clicks", "Installs", "Depositor"],
x = [73621, 3143, 2071, 59],
textposition = "inside",
textinfo = "value+percent previous"))
fig.add_trace(go.Funnel(
name = 'Organic',
orientation = "h",
y = ["Impressions", "Clicks", "Installs", "Depositor"],
x = [0, 30, 2837, 343],
textposition = "inside",
textinfo = "value+percent previous"))
fig.update_layout( title_text="Funnel Analysis for Media Source")
fig.show()
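The trace values above are typed in by hand from the funnel table; the same figure can be built in a loop from that table, a sketch assuming the funnel DataFrame created earlier (Media_Source plus the four stage columns):
stages = ['Impressions', 'Clicks', 'Installs', 'Depositor']
fig = go.Figure()
for _, row in funnel.iterrows():
    fig.add_trace(go.Funnel(name=row['Media_Source'],
                            orientation='h',
                            y=stages,
                            x=[row[s] for s in stages],
                            textposition='inside',
                            textinfo='value+percent previous'))
fig.update_layout(title_text='Funnel Analysis for Media Source')
fig.show()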
# analyze organic users and paid campaigns separately
df1=df[df['Campaign']!="Organic"]
# make table for funnel analysis of Campaign
c1 = df[df['Campaign'] != 'Organic'].groupby('Campaign')[['Impressions', 'Clicks', 'Installs', 'Depositor', 'Cost']].sum().sort_values(by='Impressions')
c1
fig = go.Figure()
fig.add_trace(go.Funnel(
name = 'Campaign_8',
y = ["Impressions", "Clicks", "Installs", "Depositor"],
x = [758870, 76826, 2500, 80],
textinfo = "value+percent previous"))
fig.add_trace(go.Funnel(
name = 'Campaign_10',
orientation = "h",
y = ["Impressions", "Clicks", "Installs", "Depositor"],
x = [415681, 11527, 2244, 92],
textposition = "inside",
textinfo = "value+percent previous"))
fig.add_trace(go.Funnel(
name = 'Campaign_5',
orientation = "h",
y = ["Impressions", "Clicks", "Installs", "Depositor"],
x = [367802, 2607, 864, 104],
textposition = "inside",
textinfo = "value+percent previous"))
fig.add_trace(go.Funnel(
name = 'Campaign_7',
orientation = "h",
y = ["Impressions", "Clicks", "Installs", "Depositor"],
x = [262319, 1856, 695, 127],
textposition = "inside",
textinfo = "value+percent previous"))
fig.add_trace(go.Funnel(
name = 'Campaign_4',
orientation = "h",
y = ["Impressions", "Clicks", "Installs", "Depositor"],
x = [152281, 4587, 1705, 45],
textposition = "inside",
textinfo = "value+percent previous"))
fig.add_trace(go.Funnel(
name = 'Campaign_9',
orientation = "h",
y = ["Impressions", "Clicks", "Installs", "Depositor"],
x = [115439, 24273, 438, 4],
textposition = "inside",
textinfo = "value+percent previous"))
fig.add_trace(go.Funnel(
name = 'Campaign_6',
orientation = "h",
y = ["Impressions", "Clicks", "Installs", "Depositor"],
x = [73675, 686, 252, 38],
textposition = "inside",
textinfo = "value+percent previous"))
fig.add_trace(go.Funnel(
name = 'Campaign_2',
orientation = "h",
y = ["Impressions", "Clicks", "Installs", "Depositor"],
x = [69726, 2834, 960, 54],
textposition = "inside",
textinfo = "value+percent previous"))
fig.add_trace(go.Funnel(
name = 'Campaign_1',
orientation = "h",
y = ["Impressions", "Clicks", "Installs", "Depositor"],
x = [3895, 309,1110 , 5],
textposition = "inside",
textinfo = "value+percent previous"))
fig.update_layout(height=600, width=1000, title_text="Funnel Analysis for Campaign")
fig.show()
perf = data[['Media_Source', 'Campaign']].groupby(['Media_Source', 'Campaign']).size().to_frame('Duration')  # number of days each (source, campaign) pair appears
perf
# manual qualitative labels based on the funnel results: 'High' vs. 0 (low)
perf['Impressions'] = [0, 'High', 0, 0, 0, 0, 0, 'High', 0, 'High', 'High']
perf['New_User'] = [0, 'High', 0, 0, 'High', 0, 'High', 0, 0, 0, 'High']
perf['Revenue'] = ['High', 'High', 0, 0, 0, 0, 0, 'High', 0, 'High', 'High']
perf
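The 'High'/0 labels above are filled in manually; a sketch that pulls the underlying totals for each (Media_Source, Campaign) pair so the labels can be sanity-checked (uses the data frame built earlier; Installs and Depositor stand in for new users and revenue):
perf_check = (data.groupby(['Media_Source', 'Campaign'])
                  .agg(Duration=('Date', 'size'),
                       Impressions=('Impressions', 'sum'),
                       Installs=('Installs', 'sum'),
                       Depositor=('Depositor', 'sum')))
perf_check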
1. Funnel (impressions -> clicks -> installs) for 7 days.
2. Funnel analysis (impressions -> clicks -> installs) for each media source.
3. Funnel analysis (impressions -> clicks -> installs) for each campaign.
4. Impressions -> clicks -> installs for each campaign/media source by date (see the sketch below).
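For the last item above, one possible starting point is a date-by-source pivot; a minimal sketch assuming the data frame built earlier:
daily_installs = data.pivot_table(index='Date', columns='Media_Source',
                                  values='Installs', aggfunc='sum', fill_value=0)
daily_installs.plot()  # one line of installs per media source across 10/1-10/9
plt.show()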
daily_cost = data.groupby('Date')['Cost'].sum().to_frame('Cost')  # daily total cost (avoid reusing the name df)
plt.style.use('ggplot')
daily_cost.plot()
plt.show()
newdf=data.groupby('Date').agg({'Cost':'sum', 'Campaign':'size'})
newdf
source=data.groupby(['Date','Media_Source']).size().to_frame('count').reset_index()
one=source[source['Media_Source']=='Source_1']['count'].values
two=source[source['Media_Source']=='Source_2']['count'].values
three=source[source['Media_Source']=='Source_3']['count'].values
four=source[source['Media_Source']=='Source_4']['count'].values
# each source appears once per date, so the count arrays line up with newdf's Date index (groupby output is sorted by Date)
newdf['Source_1'] = one
newdf['Source_2'] = two
newdf['Source_3'] = three
newdf['Source_4'] = four
newdf['Weekday'] = [1, 1, 0, 0, 1, 1, 1, 1, 1]  # 1 = weekday, 0 = weekend, for 10/1-10/9
newdf.reset_index()
newdf.to_excel('C:/data Challenge/new.xlsx',index=True)
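The Weekday column above is typed in by hand; since newdf is indexed by Date, the flag can also be derived from the index, a sketch assuming the dates parse as datetimes with the correct year:
newdf['Weekday'] = (pd.to_datetime(newdf.index).dayofweek < 5).astype(int)  # Mon-Fri -> 1, Sat/Sun -> 0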
# the counts for Source_1, Source_2, and Source_4 are identical on every date, so they carry no information for prediction
del newdf['Source_1']
del newdf['Source_2']
del newdf['Source_4']
newdf.corr()
# check the correlation between the campaign count, the Source_3 count, and the weekday flag
np.corrcoef(three, newdf['Campaign'].values)
np.corrcoef(three, newdf['Weekday'].values)
np.corrcoef(newdf['Cost'].values, newdf['Weekday'].values)
del newdf['Campaign']
del newdf['Source_3']  # the campaign count, Source_3 count, and weekday flag are highly correlated, so drop the redundant columns