Ⅰ. Background and Purpose

A raw data for media buying campaigns' daily performance from 10/1-10/9.

  1. Understand the daily performance for different media sources and campaigns.
  2. Are there any interesting/alarming metrics/performance metrics?
  3. Project the October monthly total ads spending based on current data from 10/1-10/9.

Ⅱ. Understanding the Raw Data

114 entries, 12 columns including 9 numerical variables, 2 object variables and 1 datetime variable.

In [151]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from plotly import graph_objects as go
In [152]:
df=pd.read_excel('C:/data Challenge/CelerX-User Acquisition Data Analyst Assignment.xlsx',sheet_name='Raw')
In [153]:
catCol = df.select_dtypes(include=["object"]).columns #create an index for all categorical columns
numCol=df.select_dtypes(include=[np.number]).columns #create an index for all numerical columns 
print(catCol, numCol)
Index(['Media_Source', 'Campaign'], dtype='object') Index(['Impressions', 'Clicks', 'Click_Through_Rate', 'Installs',
       'Click_t_ Install_Conversion_Rate', 'Depositor', 'Depositor/Installs',
       'Cost', 'eCPI'],
      dtype='object')
In [154]:
df.head()
Out[154]:
Date Media_Source Campaign Impressions Clicks Click_Through_Rate Installs Click_t_ Install_Conversion_Rate Depositor Depositor/Installs Cost eCPI
0 2020-10-01 Source_2 Campaign_1 272.0 22.0 0.0809 136.986301 0.4545 1.0 0.1000 1300.00 9.4900
1 2020-10-01 Source_4 Campaign_10 41701.0 1204.0 0.0289 223.000465 0.1852 6.0 0.0269 1341.66 6.0164
2 2020-10-01 Source_2 Campaign_2 4375.0 267.0 0.0610 96.000000 0.3596 4.0 0.0417 956.73 9.9659
3 2020-10-01 Source_3 Campaign_3 NaN NaN NaN 1.000000 NaN 1.0 1.0000 NaN NaN
4 2020-10-01 Source_3 Campaign_6 38882.0 345.0 0.0089 131.000000 0.3797 24.0 0.1832 2416.21 18.4444
In [155]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 114 entries, 0 to 113
Data columns (total 12 columns):
Date                                114 non-null datetime64[ns]
Media_Source                        114 non-null object
Campaign                            114 non-null object
Impressions                         63 non-null float64
Clicks                              72 non-null float64
Click_Through_Rate                  63 non-null float64
Installs                            114 non-null float64
Click_t_ Install_Conversion_Rate    72 non-null float64
Depositor                           114 non-null float64
Depositor/Installs                  105 non-null float64
Cost                                63 non-null float64
eCPI                                63 non-null float64
dtypes: datetime64[ns](1), float64(9), object(2)
memory usage: 10.8+ KB
In [ ]:
 
In [156]:
df.describe() #descriptive statistics for numerical val
Out[156]:
Impressions Clicks Click_Through_Rate Installs Click_t_ Install_Conversion_Rate Depositor Depositor/Installs Cost eCPI
count 63.000000 72.00000 63.000000 114.000000 72.000000 114.000000 105.000000 63.000000 63.000000
mean 35232.428571 1743.55360 0.095881 119.401369 0.252293 7.836659 0.100293 1389.520317 10.368398
std 31306.429734 3615.40309 0.236435 136.024350 0.196997 9.189105 0.159273 718.222144 6.370049
min 5.000000 1.00000 0.005400 0.000000 0.000000 0.000000 0.000000 13.450000 2.830700
25% 3809.500000 39.50000 0.016400 17.750000 0.035550 0.000000 0.008500 1011.835000 4.829550
50% 38992.000000 317.50000 0.056700 98.500000 0.289650 5.500000 0.050500 1383.700000 8.816800
75% 51246.500000 1154.50000 0.092650 171.250000 0.381350 13.000000 0.123800 1507.575000 14.743250
max 92086.000000 20899.00000 1.800000 911.446998 1.000000 61.000000 1.000000 3400.000000 30.273900

Question:

  1. The large variance indicates that there may have outliers in impressions,clicks.
  2. The max "Click Through Rate" is 1.8, it is impossible.
  3. The "Click to Install Conversion Rate" and "Depositor/Installs" is 1 which indicates that there are some very efficient campaigns with 100% conversion. It is possible but very rare.

data completeness (missing value)

In [157]:
# count missing value
df.isnull().sum().sort_values(ascending=False)
Out[157]:
eCPI                                51
Cost                                51
Click_Through_Rate                  51
Impressions                         51
Click_t_ Install_Conversion_Rate    42
Clicks                              42
Depositor/Installs                   9
Depositor                            0
Installs                             0
Campaign                             0
Media_Source                         0
Date                                 0
dtype: int64
In [158]:
# patterns in date, media source, campaigns, installs, deporsitor/installs for missing value
miss=df[df.isnull().T.any()] # return all the rows with missing value
print('There are total '+str(len(miss))+' rows with missing value.')

for column in ['Date','Media_Source','Campaign']:
    table=pd.merge(left=miss[column].value_counts().reset_index(),right=df[column].value_counts().reset_index(),how='left',on='index')
    # count values for each factor, count missing value for each factor. merge the result 
    table.columns=[column, 'missing count','value count']
    # rename columns 
    print(table)
   
There are total 51 rows with missing value.
        Date  missing count  value count
0 2020-10-05              7           14
1 2020-10-04              6           14
2 2020-10-03              6           14
3 2020-10-02              6           12
4 2020-10-08              6           13
5 2020-10-01              6           12
6 2020-10-09              5           12
7 2020-10-06              5           12
8 2020-10-07              4           11
       Media_Source  missing count  value count
0   Cross Promotion             19           19
1      App Referral              9            9
2  Website Download              9            9
3           Organic              9            9
4          Source_3              5           23
     Campaign  missing count  value count
0     Organic             46           46
1  Campaign_3              5            5

Findings:

1. There's no obvious missing trends for date.
2. All the missings are made up of 'organic' campaign and 'Campaign 3" . 
3. Most missings are for "Organic" Campaign and from "Cross Promotion" media source.  
4. All "Cross Promotion", "Website Download", "Organic", "App Referral" media source are missing.    


Hypothesis: Does the missing value stands for untrackable or 0?

1. Campaign 3 is technically hard to track. 
2. Very few users from campaign 3. It is ineffective. Missing values stand for 0.
3. There's no cost and eCPI for organic users, It should be 0.
4. Most organic users directly install our game, so there's no records for Impressions, Click, CTR. They should be 0. 
5. "Cross Promotion", "Website Download", "Organic", "App Referral" media source are the source of our organic user. So there are missing values. 
In [159]:
df[df['Campaign']=='Campaign_3'].describe()
Out[159]:
Impressions Clicks Click_Through_Rate Installs Click_t_ Install_Conversion_Rate Depositor Depositor/Installs Cost eCPI
count 0.0 0.0 0.0 5.000000 0.0 5.000000 5.000000 0.0 0.0
mean NaN NaN NaN 1.200000 NaN 0.200000 0.200000 NaN NaN
std NaN NaN NaN 0.447214 NaN 0.447214 0.447214 NaN NaN
min NaN NaN NaN 1.000000 NaN 0.000000 0.000000 NaN NaN
25% NaN NaN NaN 1.000000 NaN 0.000000 0.000000 NaN NaN
50% NaN NaN NaN 1.000000 NaN 0.000000 0.000000 NaN NaN
75% NaN NaN NaN 1.000000 NaN 0.000000 0.000000 NaN NaN
max NaN NaN NaN 2.000000 NaN 1.000000 1.000000 NaN NaN

Installs , Depositor/Installs, Depositor are still avaiable for campaign 3. Others are all missing. "Technically error" is validate.
The installs, Depositor/Installs, Depositor are lower than the average. It is ineffective. "Ineffective campaign" is validate. We can fill these missing with 0.

In [160]:
df[df['Campaign']=='Organic'].describe()
Out[160]:
Impressions Clicks Click_Through_Rate Installs Click_t_ Install_Conversion_Rate Depositor Depositor/Installs Cost eCPI
count 0.0 9.000000 0.0 46.000000 9.0 46.000000 37.000000 0.0 0.0
mean NaN 3.333333 NaN 61.673913 0.0 7.464763 0.149392 NaN NaN
std NaN 1.870829 NaN 73.681312 0.0 10.992466 0.197339 NaN NaN
min NaN 1.000000 NaN 0.000000 0.0 0.000000 0.000000 NaN NaN
25% NaN 2.000000 NaN 5.000000 0.0 0.000000 0.000000 NaN NaN
50% NaN 3.000000 NaN 44.000000 0.0 1.000000 0.071400 NaN NaN
75% NaN 4.000000 NaN 60.750000 0.0 13.498950 0.222200 NaN NaN
max NaN 7.000000 NaN 228.000000 0.0 61.000000 1.000000 NaN NaN

Hypothesis are validate. We can fill na with 0. And the missing trends of organic are the same as the campaign 3. Hypothesis: there's probability that miss record the "organic" as "campaign 3".

In [161]:
df=df.fillna(0) # fill blank
In [162]:
df.isnull().sum()
Out[162]:
Date                                0
Media_Source                        0
Campaign                            0
Impressions                         0
Clicks                              0
Click_Through_Rate                  0
Installs                            0
Click_t_ Install_Conversion_Rate    0
Depositor                           0
Depositor/Installs                  0
Cost                                0
eCPI                                0
dtype: int64

Insights:

1. Campaign 3 is technically hard to track. 
2. Very few users from campaign 3. It is ineffective.
3. There's no cost and eCPI for organic users and most of organic users directly install our game, so there's no records for Impressions, Click, CTR. 
4. "Cross Promotion", "Website Download", "Organic", "App Referral" media source are the source of our organic user.
5. there's probability that miss record the "organic" as "campaign 3".

Data Uniqueness

In [163]:
len(df.drop_duplicates(subset=['Date','Media_Source','Campaign']))==len(df[['Date','Media_Source','Campaign']]) # check uniqueness
Out[163]:
False
In [164]:
# to see which campaigns are duplicates. 
df.groupby(['Date','Media_Source','Campaign']).filter(lambda g: len(g) > 1).groupby(['Date','Media_Source','Campaign']).size()
Out[164]:
Date        Media_Source     Campaign
2020-10-01  Cross Promotion  Organic     2
2020-10-02  Cross Promotion  Organic     2
2020-10-03  Cross Promotion  Organic     2
2020-10-04  Cross Promotion  Organic     2
2020-10-05  Cross Promotion  Organic     3
2020-10-06  Cross Promotion  Organic     2
2020-10-08  Cross Promotion  Organic     3
2020-10-09  Cross Promotion  Organic     2
dtype: int64

Outliers

In [165]:
df[df['Click_Through_Rate']>=1]
# mis-record the data. clicks greater than impression.
Out[165]:
Date Media_Source Campaign Impressions Clicks Click_Through_Rate Installs Click_t_ Install_Conversion_Rate Depositor Depositor/Installs Cost eCPI
95 2020-10-08 Source_1 Campaign_9 5.0 9.0 1.8 4.0 0.4444 0.0 0.0 34.22 8.555
In [166]:
df[df['Campaign']=='Campaign_9']['Click_Through_Rate'].mean()
Out[166]:
0.38666666666666666

When compared the average ctr for campaign 9, there's great probability that mis-enter the 'impression' from "50" to '5'

In [167]:
df.loc[df['Click_Through_Rate']>=1, 'Impressions']=50
df.loc[df['Click_Through_Rate']>=1, 'Click_Through_Rate']=0.18
In [168]:
df[df['Click_t_ Install_Conversion_Rate']>=1] 
Out[168]:
Date Media_Source Campaign Impressions Clicks Click_Through_Rate Installs Click_t_ Install_Conversion_Rate Depositor Depositor/Installs Cost eCPI
84 2020-10-07 Source_1 Campaign_9 79.0 6.0 0.0759 6.0 1.0 0.0 0.0 29.12 4.8533
In [169]:
df[df['Depositor/Installs']>=1]
Out[169]:
Date Media_Source Campaign Impressions Clicks Click_Through_Rate Installs Click_t_ Install_Conversion_Rate Depositor Depositor/Installs Cost eCPI
3 2020-10-01 Source_3 Campaign_3 0.0 0.0 0.0 1.0 0.0 1.0 1.0 0.0 0.0
110 2020-10-09 Cross Promotion Organic 0.0 0.0 0.0 61.0 0.0 61.0 1.0 0.0 0.0
In [ ]:
 
In [170]:
df.describe()
Out[170]:
Impressions Clicks Click_Through_Rate Installs Click_t_ Install_Conversion_Rate Depositor Depositor/Installs Cost eCPI
count 114.000000 114.000000 114.000000 114.000000 114.000000 114.000000 114.000000 114.000000 114.000000
mean 19470.947368 1101.191747 0.038776 119.401369 0.159343 7.836659 0.092375 767.892807 5.729904
std 29109.151520 2987.718744 0.076774 136.024350 0.198307 9.189105 0.155194 874.412088 7.005505
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.000000 17.750000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 276.500000 26.500000 0.007050 98.500000 0.033050 5.500000 0.047500 268.350000 4.131680
75% 41089.000000 372.750000 0.061675 171.250000 0.345675 13.000000 0.119450 1420.362500 9.508750
max 92086.000000 20899.000000 0.666700 911.446998 1.000000 61.000000 1.000000 3400.000000 30.273900

Data Exploration

In [171]:
org=df[df['Campaign']=='Organic'].groupby('Date').agg({'Impressions':'sum','Clicks':'sum','Installs':'sum','Depositor':'sum',"Cost":'sum'}).reset_index()
org['Click_Through_Rate']=0
org['Click_t_ Install_Conversion_Rate']=org['Installs']/org['Clicks']
org['Depositor/Installs']=org['Depositor']/org['Installs']
org['eCPI']=org['Cost']/org['Installs']
org['Campaign']='Organic'
org['Media_Source']='Organic'
data=pd.concat([df[df['Campaign']!='Organic'],org], axis=0,ignore_index=True)
data.to_csv('C:/data Challenge/data_celer.csv',index=False)
C:\Users\Vivi\Anaconda3\lib\site-packages\ipykernel_launcher.py:8: FutureWarning:

Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.

To retain the current behavior and silence the warning, pass 'sort=True'.


In [172]:
org
Out[172]:
Date Impressions Clicks Installs Depositor Cost Click_Through_Rate Click_t_ Install_Conversion_Rate Depositor/Installs eCPI Campaign Media_Source
0 2020-10-01 0.0 3.0 306.0 36.8160 0.0 0 102.000000 0.120314 0.0 Organic Organic
1 2020-10-02 0.0 4.0 316.0 26.2500 0.0 0 79.000000 0.083070 0.0 Organic Organic
2 2020-10-03 0.0 1.0 285.0 32.6665 0.0 0 285.000000 0.114619 0.0 Organic Organic
3 2020-10-04 0.0 2.0 345.0 27.6000 0.0 0 172.500000 0.080000 0.0 Organic Organic
4 2020-10-05 0.0 2.0 340.0 31.2486 0.0 0 170.000000 0.091908 0.0 Organic Organic
5 2020-10-06 0.0 7.0 264.0 35.7500 0.0 0 37.714286 0.135417 0.0 Organic Organic
6 2020-10-07 0.0 5.0 266.0 26.3322 0.0 0 53.200000 0.098993 0.0 Organic Organic
7 2020-10-08 0.0 4.0 383.0 39.7158 0.0 0 95.750000 0.103697 0.0 Organic Organic
8 2020-10-09 0.0 2.0 332.0 87.0000 0.0 0 166.000000 0.262048 0.0 Organic Organic
In [173]:
funnel_s=df.agg({'Impressions':'sum','Clicks':'sum','Installs':'sum','Depositor':'sum',"Cost":'sum'}).reset_index()
funnel_s
Out[173]:
index 0
0 Impressions 2.219688e+06
1 Clicks 1.255359e+05
2 Installs 1.361176e+04
3 Depositor 8.933791e+02
4 Cost 8.753978e+04
In [174]:
CTR=1.255359e+05/2.219688e+06
Conversion=1.361176e+04/1.255359e+05
PPU=8.933791e+02/1.361176e+04
eCPI=8.753978e+04/1.361176e+04
In [175]:
d={'Impressions':2.219688e+06,'CTR':CTR,'Conversion_Rate':Conversion, 'PPU':PPU, 'eCPI':eCPI }
d
Out[175]:
{'Impressions': 2219688.0,
 'CTR': 0.056555651064473925,
 'Conversion_Rate': 0.10842922223841946,
 'PPU': 0.0656328865627957,
 'eCPI': 6.431187443798597}
In [176]:
KPI=pd.DataFrame({'Impressions':2.219688e+06,'CTR':CTR,'Conversion_Rate':Conversion, 'PPU':PPU, 'eCPI':eCPI },index=[0])
KPI.to_csv('C:/data Challenge/KPI.csv',index=False)
In [177]:
fig = go.Figure()



fig.add_trace(go.Funnel(
    name = 'Funnel Analysis',
    y = ["Impressions", "Clicks", "Installs", "Depositor"],
    x = [2.219688e+06, 1.255359e+05, 1.361176e+04, 8.933791e+02],
    textinfo = "value+percent previous+percent total"))


fig.show()

The CTR is 6%.

In [178]:
m1=df.groupby('Media_Source')['Impressions','Clicks','Installs','Depositor',"Cost"].sum().sort_values(by='Impressions')
m2=df[df['Media_Source'].str.contains('Source')].groupby('Media_Source')['Impressions','Clicks','Installs','Depositor','Cost'].sum().sort_values(by='Impressions').reset_index()
m3=m1[m1['Impressions']==0 ].sum().to_frame().T.reset_index().rename(columns={"index": "Media_Source"})
m3.iloc[0,0]='Organic'
#make table for funnel analysis of Media Source
In [179]:
funnel=pd.concat([m2,m3],axis=0,ignore_index=True)
funnel.to_csv('C:/data Challenge/funnel.csv',index=False)

Issues: Why does Clicks and Installs have decimals? After reviewing the original data, I found install= Cost/ eCPI in the sheet.

In [180]:
from plotly.subplots import make_subplots

fig = go.Figure()



fig.add_trace(go.Funnel(
    name = 'Source_1',
    y = ["Impressions", "Clicks", "Installs", "Depositor"],
    x = [874309, 101099, 2938, 84],
    textinfo = "value+percent previous"))

fig.add_trace(go.Funnel(
    name = 'Source_3',
    orientation = "h",
    y = ["Impressions", "Clicks", "Installs", "Depositor"],
    x = [856077, 9737, 3522, 315],
    textposition = "inside",
    textinfo = "value+percent previous"))

fig.add_trace(go.Funnel(
    name = 'Source_4',
    orientation = "h",
    y = ["Impressions", "Clicks", "Installs", "Depositor"],
    x = [415681, 11527, 2244, 92],
    textposition = "inside",
    textinfo = "value+percent previous"))

fig.add_trace(go.Funnel(
    name = 'Source_2',
    orientation = "h",
    y = ["Impressions", "Clicks", "Installs", "Depositor"],
    x = [73621, 3143, 2071, 59],
    textposition = "inside",
    textinfo = "value+percent previous"))

fig.add_trace(go.Funnel(
    name = 'Organic',
    orientation = "h",
    y = ["Impressions", "Clicks", "Installs", "Depositor"],
    x = [0, 30, 2837, 343],
    textposition = "inside",
    textinfo = "value+percent previous"))
fig.update_layout( title_text="Funnel Analysis for Media Source")

fig.show()
In [181]:
# Separately analyze the organic user and campaign. 
df1=df[df['Campaign']!="Organic"]
In [182]:
# make table for funnel analysis of Campaign
c1=df[df['Campaign']!='Organic'].groupby('Campaign')['Impressions','Clicks','Installs','Depositor',"Cost"].sum().sort_values(by='Impressions')
#m2=df[df['Media_Source'].str.contains('Source')].groupby('Media_Source')['Impressions','Clicks','Installs','Depositor','Cost'].sum().sort_values(by='Impressions').reset_index()
#m3=m1[m1['Impressions']==0 ].sum().to_frame().T.reset_index().rename(columns={"index": "Media_Source"})
#m3.iloc[0,0]='Organic'
c1
Out[182]:
Impressions Clicks Installs Depositor Cost
Campaign
Campaign_3 0.0 0.0000 6.000000 1.0 0.00
Campaign_1 3895.0 309.0000 1110.631186 5.0 12817.73
Campaign_2 69726.0 2834.0000 960.000000 54.0 9655.98
Campaign_6 73675.0 686.0000 252.000000 38.0 4853.63
Campaign_9 115439.0 24273.0000 438.000000 4.0 2243.97
Campaign_4 152281.0 4587.8592 1705.123878 45.0 6500.00
Campaign_7 262319.0 1856.0000 695.000000 127.0 15062.57
Campaign_5 367802.0 2607.0000 864.000000 104.0 11990.09
Campaign_10 415681.0 11527.0000 2244.001039 92.0 12950.71
Campaign_8 758870.0 76826.0000 2500.000000 80.0 11465.10
In [183]:
fig = go.Figure()
fig.add_trace(go.Funnel(
    name = 'Campaign_8',
    y = ["Impressions", "Clicks", "Installs", "Depositor"],
    x = [758870, 76826, 2500, 80],
    textinfo = "value+percent previous"))

fig.add_trace(go.Funnel(
    name = 'Campaign_10',
    orientation = "h",
    y = ["Impressions", "Clicks", "Installs", "Depositor"],
    x = [415681, 11527, 2244, 92],
    textposition = "inside",
    textinfo = "value+percent previous"))

fig.add_trace(go.Funnel(
    name = 'Campaign_5',
    orientation = "h",
    y = ["Impressions", "Clicks", "Installs", "Depositor"],
    x = [367802, 2607, 864, 104],
    textposition = "inside",
    textinfo = "value+percent previous"))

fig.add_trace(go.Funnel(
    name = 'Campaign_7',
    orientation = "h",
    y = ["Impressions", "Clicks", "Installs", "Depositor"],
    x = [262319, 1856, 695, 127],
    textposition = "inside",
    textinfo = "value+percent previous"))

fig.add_trace(go.Funnel(
    name = 'Campaign_4	',
    orientation = "h",
    y = ["Impressions", "Clicks", "Installs", "Depositor"],
    x = [152281, 4587, 1705, 45],
    textposition = "inside",
    textinfo = "value+percent previous"))

fig.add_trace(go.Funnel(
    name = 'Campaign_9',
    orientation = "h",
    y = ["Impressions", "Clicks", "Installs", "Depositor"],
    x = [115439, 24273, 438, 4],
    textposition = "inside",
    textinfo = "value+percent previous"))

fig.add_trace(go.Funnel(
    name = 'Campaign_6	',
    orientation = "h",
    y = ["Impressions", "Clicks", "Installs", "Depositor"],
    x = [73675, 686, 252, 38],
    textposition = "inside",
    textinfo = "value+percent previous"))

fig.add_trace(go.Funnel(
    name = 'Campaign_2	',
    orientation = "h",
    y = ["Impressions", "Clicks", "Installs", "Depositor"],
    x = [69726, 2834, 960, 54],
    textposition = "inside",
    textinfo = "value+percent previous"))

fig.add_trace(go.Funnel(
    name = 'Campaign_1	',
    orientation = "h",
    y = ["Impressions", "Clicks", "Installs", "Depositor"],
    x = [3895, 309,1110 , 5],
    textposition = "inside",
    textinfo = "value+percent previous"))
fig.update_layout(height=600, width=1000, title_text="Funnel Analysis for Media Source")

fig.show()
In [184]:
perf=data[['Media_Source', 'Campaign']].groupby(['Media_Source', 'Campaign']).size().to_frame('Duration')
perf
Out[184]:
Duration
Media_Source Campaign
Organic Organic 9
Source_1 Campaign_8 9
Campaign_9 9
Source_2 Campaign_1 9
Campaign_2 9
Source_3 Campaign_3 5
Campaign_4 2
Campaign_5 8
Campaign_6 2
Campaign_7 6
Source_4 Campaign_10 9
In [185]:
perf['Impressions']=[0,'High',0,0,0,0,0,'High',0,'High','High']
perf['New_User']=[0,'High',0,0,'High',0,'High',0,0,0,'High']
perf['Revenue']=['High',"High",0,0,0,0,0,'High',0,'High','High']
perf
Out[185]:
Duration Impressions New_User Revenue
Media_Source Campaign
Organic Organic 9 0 0 High
Source_1 Campaign_8 9 High High High
Campaign_9 9 0 0 0
Source_2 Campaign_1 9 0 0 0
Campaign_2 9 0 High 0
Source_3 Campaign_3 5 0 0 0
Campaign_4 2 0 High 0
Campaign_5 8 High 0 High
Campaign_6 2 0 0 0
Campaign_7 6 High 0 High
Source_4 Campaign_10 9 High High High

Ⅲ. Method

  1. User Acquisition Health Overview:
    funnel (impressions->clicks->installs) for 7 days.
  2. Which Media Source is Better for User Acquisition:
    funnel analysis (impressions->clicks->installs) for each media source
  3. Which Campaign is Better for User Acquisition:
    funnel analysis (impressions->clicks->installs) for each campaign
  4. Were the campaign and media's performance Funnel analysis affected by data (weekends, holiday)? Is it same as the overall performance ?
    (impressions->clicks->installs) for campaign/media by date

Ⅳ Prediction

In [186]:
df=data.groupby('Date')['Cost'].sum().to_frame('Cost')
plt.style.use('ggplot') 
df.plot()
plt.show()
In [187]:
newdf=data.groupby('Date').agg({'Cost':'sum', 'Campaign':'size'})
#newdf=data.groupby('Date').agg({'Cost':'sum', 'Campaign':'size'})
newdf
Out[187]:
Cost Campaign
Date
2020-10-01 7251.21 8
2020-10-02 6201.27 8
2020-10-03 12350.00 10
2020-10-04 12452.42 10
2020-10-05 10061.74 9
2020-10-06 9455.32 8
2020-10-07 10081.58 8
2020-10-08 10054.15 8
2020-10-09 9632.09 8
In [188]:
source=data.groupby(['Date','Media_Source']).size().to_frame('count').reset_index()
In [189]:
one=source[source['Media_Source']=='Source_1']['count'].values
two=source[source['Media_Source']=='Source_2']['count'].values
three=source[source['Media_Source']=='Source_3']['count'].values
four=source[source['Media_Source']=='Source_4']['count'].values
In [190]:
newdf['Source_1']=one
newdf['Source_2']=two
newdf['Source_3']=three
newdf['Source_4']=four
newdf['Weekday']=[1,1,0,0,1,1,1,1,1]
In [191]:
newdf.reset_index()
Out[191]:
Date Cost Campaign Source_1 Source_2 Source_3 Source_4 Weekday
0 2020-10-01 7251.21 8 2 2 2 1 1
1 2020-10-02 6201.27 8 2 2 2 1 1
2 2020-10-03 12350.00 10 2 2 4 1 0
3 2020-10-04 12452.42 10 2 2 4 1 0
4 2020-10-05 10061.74 9 2 2 3 1 1
5 2020-10-06 9455.32 8 2 2 2 1 1
6 2020-10-07 10081.58 8 2 2 2 1 1
7 2020-10-08 10054.15 8 2 2 2 1 1
8 2020-10-09 9632.09 8 2 2 2 1 1
In [192]:
newdf.to_excel('C:/data Challenge/new.xlsx',index=True)
In [193]:
# since the count of Source 1,2,4 are all the same. They are meaningless for prediction.
del newdf['Source_1']
del newdf['Source_2']
del newdf['Source_4']
In [194]:
newdf.corr()
Out[194]:
Cost Campaign Source_3 Weekday
Cost 1.000000 0.766548 0.766548 -0.743267
Campaign 0.766548 1.000000 1.000000 -0.928571
Source_3 0.766548 1.000000 1.000000 -0.928571
Weekday -0.743267 -0.928571 -0.928571 1.000000
In [195]:
# check correlation between campaigns and source and weekday
np.corrcoef(three, newdf['Campaign'].values)
np.corrcoef(three, newdf['Weekday'].values)
np.corrcoef(newdf['Cost'].values, newdf['Weekday'].values)
Out[195]:
array([[ 1.        , -0.74326723],
       [-0.74326723,  1.        ]])
In [196]:
del newdf['Campaign'] 
del newdf['Source_3'] # Since Campaign and Source and weekdays or not are highly correlated, del
In [ ]:
 
In [ ]: