### Importing the dataset
Throughout this course, you will use pandas to manipulate data and calculate summary statistics!
In this exercise, you will begin by importing pandas and the marketing dataset into your environment.
Import pandas into your environment with the alias pd. Import the ‘marketing.csv’ dataset.
# Import pandas into the environment
import pandas as pd
# Import marketing.csv
marketing = pd.read_csv('marketing.csv')
Now that you’ve imported the data, you will examine it to ensure it is valid. While it may be tempting to dive into analysis, by validating your dataset immediately, you ensure that the insights you derive are both true and actionable. Ensuring the validity of your dataset should always be the first step you take when beginning analysis on a new dataset.
The marketing DataFrame is available in your workspace.
# Print the first five rows of the DataFrame
print(marketing.head())
user_id date_served marketing_channel variant converted \
0 a100000029 1/1/18 House Ads personalization True
1 a100000030 1/1/18 House Ads personalization True
2 a100000031 1/1/18 House Ads personalization True
3 a100000032 1/1/18 House Ads personalization True
4 a100000033 1/1/18 House Ads personalization True
language_displayed language_preferred age_group date_subscribed \
0 English English 0-18 years 1/1/18
1 English English 19-24 years 1/1/18
2 English English 24-30 years 1/1/18
3 English English 30-36 years 1/1/18
4 English English 36-45 years 1/1/18
date_canceled subscribing_channel is_retained
0 NaN House Ads True
1 NaN House Ads True
2 NaN House Ads True
3 NaN House Ads True
4 NaN House Ads True
# Print the statistics of all columns
print(marketing.describe())
user_id date_served marketing_channel variant converted \
count 10037 10021 10022 10037 10022
unique 7309 31 5 2 2
top a100000882 1/15/18 House Ads control False
freq 12 789 4733 5091 8946
language_displayed language_preferred age_group date_subscribed \
count 10037 10037 10037 1856
unique 4 4 7 31
top English English 19-24 years 1/16/18
freq 9793 9275 1682 163
date_canceled subscribing_channel is_retained
count 577 1856 1856
unique 115 5 2
top 4/2/18 Instagram True
freq 15 600 1279
# Check column data types and non-missing values
print(marketing.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10037 entries, 0 to 10036
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 user_id 10037 non-null object
1 date_served 10021 non-null object
2 marketing_channel 10022 non-null object
3 variant 10037 non-null object
4 converted 10022 non-null object
5 language_displayed 10037 non-null object
6 language_preferred 10037 non-null object
7 age_group 10037 non-null object
8 date_subscribed 1856 non-null object
9 date_canceled 577 non-null object
10 subscribing_channel 1856 non-null object
11 is_retained 1856 non-null object
dtypes: object(12)
memory usage: 941.1+ KB
None
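The .info() output already hints at missing data (for example, date_subscribed has only 1856 non-null entries). As a quick follow-up, not part of the original exercise, you can count the missing values per column directly:
# Count missing values in each column
print(marketing.isnull().sum())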
## Data types and data merging
You will often want to modify the data type of columns to suit your needs. Pandas and Python have special functionalities around manipulating data types that can make your analysis easier and more robust, such as concatenating strings and summing boolean values. However, this functionality is only possible if columns are stored as the intended data type.
In this exercise, you will modify the data type of the is_retained column.
# Check the data type of is_retained
print(marketing['is_retained'].dtype)
object
Update the data type of the is_retained column to boolean and then check its data type again.
# Convert is_retained to a boolean
marketing['is_retained'] = marketing['is_retained'].astype("bool")
# Check the data type of is_retained, again
print(marketing['is_retained'].dtype)
bool
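To see why the boolean dtype matters, here is a small, self-contained sketch (not part of the exercise): booleans behave like 0 and 1 in arithmetic, so summing a boolean column counts the True values and taking the mean gives the rate of True values.
# Booleans act as 0/1: .sum() counts True values, .mean() gives their rate
flags = pd.Series([True, False, True, True])
print(flags.sum())   # 3
print(flags.mean())  # 0.75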
Adding new columns that derive information from existing data or from domain knowledge is known as feature engineering. Even in relatively simple datasets, there are always new characteristics you could pull out to create a more in-depth analysis.
One of the most critical skills a data scientist needs to learn is how to identify opportunities for feature engineering.
In this exercise, you will add two columns to marketing:
- channel_code: maps the values in the subscribing_channel column to a numeric scale (the channel_dict dictionary defined below maps each channel to a number from 1 to 5 for this purpose)
- is_correct_lang: conveys whether the ad was shown to the user in their preferred language
Add a new column, channel_code, which maps the values in the subscribing_channel column to a numeric scale using the channel_dict dictionary.
# Mapping for channels
channel_dict = {"House Ads": 1, "Instagram": 2,
"Facebook": 3, "Email": 4, "Push": 5}
# Map the channel to a channel code
marketing['channel_code'] = marketing['subscribing_channel'].map(channel_dict)
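As a quick sanity check (a sketch, not part of the exercise), you can confirm that each channel maps to a single code:
# Each subscribing_channel should correspond to exactly one channel_code
print(marketing[['subscribing_channel', 'channel_code']].drop_duplicates())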
Import numpy with the alias np. Add a new column, is_correct_lang, which is ‘Yes’ if the user was shown the ad in their preferred language, ‘No’ otherwise.
# Import numpy
import numpy as np
# Add the new column is_correct_lang
marketing['is_correct_lang'] = np.where(
marketing['language_preferred'] == marketing['language_displayed'],
'Yes',
'No'
)
Currently, the date columns in the marketing DataFrame are being incorrectly read as objects. We need to convert these columns to date columns to be able to use Python and pandas’ robust date manipulation and formatting capabilities.
In this exercise, you will practice reading the CSV with proper date columns and create a day of the week column.
Import pandas into the environment with the alias pd. Read ‘marketing.csv’ into your environment, identifying the date columns ‘date_served’, ‘date_subscribed’ and ‘date_canceled’ within your call to read_csv(). Create a day of week column from ‘date_subscribed’ using dt.dayofweek.
# Import pandas into the environment
import pandas as pd
# Import marketing.csv with date columns
marketing = pd.read_csv("marketing.csv", parse_dates = ['date_served', 'date_subscribed', 'date_canceled'])
# Add a DoW column
marketing['DoW'] = marketing["date_subscribed"].dt.dayofweek
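If the CSV had already been loaded without parse_dates, the same conversion could be done afterwards; a minimal sketch using pd.to_datetime:
# Alternative: convert already-loaded object columns to datetime in place
for col in ['date_served', 'date_subscribed', 'date_canceled']:
    marketing[col] = pd.to_datetime(marketing[col])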
In this exercise, you’ll determine how many users are seeing the marketing assets each day. This is crucial to understand how effective our marketing efforts have been over the past month.
You’ll use .groupby() and .nunique() to calculate daily users.
Group the marketing DataFrame by ‘date_served’ and count the number of unique user IDs.
# Group by date_served and count number of unique user_id's
daily_users = marketing.groupby(['date_served'])['user_id'].nunique()
# Print head of daily_users
print(daily_users.head())
date_served
2018-01-01 362
2018-01-02 374
2018-01-03 348
2018-01-04 323
2018-01-05 319
Name: user_id, dtype: int64
In the previous exercise, you created daily_users, which contains the number of unique users who saw marketing assets each day. While this was a great first step, it is challenging to interpret daily trends by looking at a table. To make it easier for you and your business stakeholders to notice trends, you will visualize your results using a line plot.
Use the .plot() method to visualize daily_users. Add the title ‘Daily users’ and the y-axis label ‘Number of users’. Rotate the x-axis labels by 45 degrees. Display the plot.
import matplotlib.pyplot as plt
# Plot daily_users
daily_users.plot()
# Include a title and y-axis label
plt.title('Daily users')
plt.ylabel('Number of users')
# Rotate the x-axis labels by 45 degrees
plt.xticks(rotation = 45)
# Display the plot
plt.show()
In this exercise, you will practice calculating conversion rate, which is often the first metric you’ll want to calculate when evaluating how a marketing campaign performed.
On marketing teams, conversion rate is typically the most important metric. It is one of the best ways to determine how effective a marketing team was at gaining new customers.
As a reminder, conversion rate is the percentage of the users who saw our marketing assets and subsequently became subscribers.
Calculate the number of unique user_ids in marketing DataFrame. Calculate the number of people who subscribed using the converted column. Calculate the conversion rate.
# Calculate the number of people we marketed to
total = marketing["user_id"].nunique()
# Calculate the number of people who subscribed
subscribers = marketing[marketing['converted'] == True]['user_id'].nunique()
# Calculate the conversion rate
conversion_rate = subscribers/total
print(round(conversion_rate*100, 2), "%")
In this exercise, you will calculate the retention rate, or the number of remaining subscribers from the users who converted to your product. This can give you a sense of whether your marketing campaign converted subscribers who were actually interested in the product.
Conversion rate and retention rate function hand-in-hand; you could create a business with a high conversion rate by giving users a free trial, but have a low retention rate once users are charged for your services. This isn’t inherently a bad thing, but it is important to provide your business stakeholders with insight into what percentage of users remain subscribers.
Calculate the number of subscribers using the user_id and converted columns in the marketing DataFrame. Calculate the number of retained subscribers using the user_id column and the boolean is_retained column. Calculate the retention rate.
# Calculate the number of subscribers
total_subscribers = marketing[marketing["converted"] == True]\
['user_id'].nunique()
# Calculate the number of people who remained subscribed
retained = marketing[marketing['is_retained'] == True]\
['user_id'].nunique()
# Calculate the retention rate
retention_rate = retained/total_subscribers
print(round(retention_rate*100, 2), "%")
The marketing team wants to determine how effective the campaign was on converting English speakers.
In this exercise, you will isolate the data for English speakers and calculate the conversion rate much like in the previous exercises. Remember, the conversion rate is the number of converted users divided by the total number of users who saw the marketing assets.
Once you have the conversion rate for English speakers, you can compare it to the overall conversion rate to gain a sense of how effective the marketing campaign was among this group compared to the overall population.
Using the marketing DataFrame, include only the rows where language_displayed is English. Calculate the total number of users in the english_speakers DataFrame. Calculate the number of conversions in the english_speakers DataFrame.
# Isolate english speakers
english_speakers = marketing[marketing['language_displayed'] == 'English']
# Calculate the total number of english speaking users
total = english_speakers['user_id'].nunique()
# Calculate the number of english speakers who converted
subscribers = english_speakers[english_speakers['converted'] == True]\
['user_id'].nunique()
# Calculate conversion rate
conversion_rate = subscribers/total
print('English speaker conversion rate:', round(conversion_rate*100,2), '%')
Next, you want to look at the conversion rate by the language that the marketing asset was shown in. While many of your users speak English, some prefer another language. Let’s check to make sure marketing material translated well across languages.
You can analyze metrics by specific demographics using .groupby(). Rather than looking at the overall conversion rate in the dataset, you instead group by language preference, which allows you to determine whether the marketing campaign was more effective in certain languages.
Group the marketing DataFrame by language_displayed and count the number of unique users in the dataset. Group marketing by language_displayed again, this time calculating the number of unique users who converted. Calculate the conversion rate for all languages.
# Group by language_displayed and count unique users
total = marketing.groupby(['language_displayed'])\
['user_id'].nunique()
# Group by language_displayed and count unique conversions
subscribers = marketing[marketing['converted'] == True]\
.groupby(['language_displayed'])\
['user_id'].nunique()
# Calculate the conversion rate for all languages
language_conversion_rate = subscribers/total
print(language_conversion_rate)
The marketing team wants to know whether there is any difference in the conversion rate based on when in the month your users saw an ad. In this exercise, you will practice .groupby() again, this time looking at how metrics have evolved over time.
Group the marketing DataFrame by date_served and count the number of unique users per day. Isolate converted users in marketing while grouping by date_served and counting the number of unique converted users per day.
# Group by date_served and count unique users
total = marketing.groupby(['date_served'])['user_id']\
.nunique()
# Group by date_served and count unique converted users
subscribers = marketing[marketing['converted'] == True]\
.groupby(['date_served'])\
['user_id'].nunique()
# Calculate the conversion rate per day
daily_conversion_rate = subscribers/total
print(daily_conversion_rate)
When you calculated conversion rate by language in the previous exercise, you printed out a DataFrame called language_conversion_rate containing your results. While this table provided useful information, it is much easier to compare relative conversion rates visually.
It is critical for data scientists to communicate results clearly to business stakeholders. A strong foundation in data visualization is a key aspect to conveying results, especially when engaging with people who are not as comfortable interpreting data on their own.
In this exercise, you will build upon the work you did to create language_conversion_rate by visualizing your results in a bar chart.
Create a bar chart using the language_conversion_rate DataFrame. Add the title ‘Conversion rate by language’ to your chart with font size 16. Add an x-axis label, ‘Language’, and a y-axis label, ‘Conversion rate (%)’, both with font size 14. Display the plot.
# Create a bar chart using language_conversion_rate DataFrame
language_conversion_rate.plot(kind = "bar")
# Add a title and x and y-axis labels
plt.title('Conversion rate by language\n', size = 16)
plt.xlabel('Language', size = 14)
plt.ylabel('Conversion rate (%)', size = 14)
# Display the plot
plt.show()
To understand trends over time, you will create a new DataFrame that includes the conversion rate each day. You will follow essentially the same steps as before when you calculated the overall conversion rate, this time also grouping by the date each user was served an ad.
Looking at the daily conversion rate is crucial to contextualize whether the conversion rate on a particular day was good or bad. Additionally, looking at conversion rate over time can help to surface trends such as a conversion rate that appears to be going down over time. These kinds of trends are crucial to identify for your marketing stakeholders as early as possible.
Group marketing by ‘date_served’ and calculate the unique number of user IDs. Select only the rows in marketing where converted equals True. Group the result by ‘date_served’ and calculate the unique number of user IDs.
# Group by date_served and count unique users
print(marketing.head())
total = marketing.groupby(['date_served'])['user_id']\
.nunique()
# Group by date_served and calculate subscribers
subscribers = marketing[marketing['converted'] == True]\
.groupby(['date_served'])\
['user_id'].nunique()
# Calculate the conversion rate per day
daily_conversion_rates = subscribers/total
When you want to understand how your campaign performed, it is essential to look at how key metrics changed throughout the campaign. Your key metrics can help you catch problems that may have happened during the campaign, such as a bug in the checkout system that led to a dip in conversion toward the end of your campaign. Metrics over time can also surface trends like gaining more subscribers over the weekends or on specific holidays.
In this exercise, you will build upon the daily conversion rate Series daily_conversion_rates you built in a previous exercise. Before you can begin visualizing, you need to transform your data into an easier format to use with pandas and matplotlib.
Reset the index of the daily_conversion_rates Series and use pd.DataFrame() to convert the results into a DataFrame named daily_conversion_rate. Rename the columns in the new DataFrame daily_conversion_rate to be ‘date_served’ and ‘conversion_rate’.
# Reset index to turn the results into a DataFrame
daily_conversion_rate = pd.DataFrame(daily_conversion_rates.reset_index())
# Rename columns
daily_conversion_rate.columns = ['date_served', 'conversion_rate']
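An equivalent shortcut (a sketch, assuming the same daily_conversion_rates Series) is to name the values column while resetting the index:
# reset_index(name=...) names the values column in a single step
daily_conversion_rate = daily_conversion_rates.reset_index(name='conversion_rate')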
Now that you've formatted the data into a more manageable format for visualization, you will create a line chart. Strong visualization skills are crucial for a data scientist because they allow you and your marketing stakeholders to derive deeper insights from the data. In this case, creating a line plot will make it much easier to notice peaks and valleys in the conversion rate over time, as well as any overall trends.
Create a line chart using the daily_conversion_rate DataFrame. Set the y-axis of your chart to begin at 0. Display the chart.
# Create a line chart using daily_conversion_rate
daily_conversion_rate.plot("date_served", "conversion_rate")
plt.title('Daily conversion rate\n', size = 16)
plt.ylabel('Conversion rate (%)', size = 14)
plt.xlabel('Date', size = 14)
# Set the y-axis to begin at 0
plt.ylim(0)
# Display the plot
plt.show()
Some marketing stakeholders want to know if their marketing channels are reaching all users equally or if some marketing channels are serving specific age demographics.
Within a marketing team, it is common to get requests that require quick analysis and visualization. The better you are at visualizing the results, the more likely that you will effectively communicate your findings to your stakeholders.
In this exercise, you will create a grouped bar chart showing how many people each marketing channel reached by age group.
Unstack channel_age with level = 1 and transform the result into a DataFrame. Plot channel_age as a grouped bar chart. Add a legend in the upper right and set the labels equal to channel_age_df.columns.values.
channel_age = marketing.groupby(['marketing_channel', 'age_group'])\
['user_id'].count()
# Unstack channel_age and transform it into a DataFrame
channel_age_df = pd.DataFrame(channel_age.unstack(level = 1))
# Plot channel_age
channel_age_df.plot(kind = 'bar')
plt.title('Marketing channels by age group')
plt.xlabel('Age Group')
plt.ylabel('Users')
# Add a legend to the plot
plt.legend(loc = 'upper right',
labels = channel_age_df.columns.values)
plt.show()
Stakeholders have begun competing to see whose channel had the best retention rate from the campaign. You must first determine how many subscribers came from the campaign and how many of those subscribers have stayed on the service.
It’s important to identify how each marketing channel is performing because this will affect company strategy going forward. If one channel is outperforming others, it might signal the channel merits further investment.
You will build on what we have learned about .groupby() in previous exercises, this time grouping by multiple columns.
Use .groupby() to calculate subscribers by subscribing_channel and date_subscribed.
retention_total = marketing.groupby(['date_subscribed', 'subscribing_channel'])['user_id'].nunique()
# Print results
print(retention_total.head())
Use .groupby() to calculate retained subscribers by subscribing_channel and date_subscribed.
# Count the retained subs by subscribing channel and date subscribed
retention_subs = marketing[marketing['is_retained'] == True]\
.groupby(['date_subscribed',
'subscribing_channel'])\
['user_id'].nunique()
# Print results
print(retention_subs.head())
Now that you’ve calculated the total subscribers and retained subscribers, you can settle the question of which channel had the highest retention rate.
In this exercise, you will calculate the retention rate for each subscription channel and visualize your results.
Divide retention_subs by retention_total to get the retention rate by subscribing channel by date subscribed. Add the title ‘Retention Rate by Subscribing Channel’. Name the x-axis label and y-axis label ‘Date Subscribed’ and ‘Retention Rate (%)’ respectively. Place the legend in the upper right and display the plot.
# Divide retained subscribers by total subscribers
retention_rate = retention_subs/retention_total
retention_rate_df = pd.DataFrame(retention_rate.unstack(level=1))
# Plot retention rate
retention_rate_df.plot()
# Add a title, x-label, y-label, legend and display the plot
plt.title('Retention Rate by Subscribing Channel')
plt.xlabel('Date Subscribed')
plt.ylabel('Retention Rate (%)')
plt.legend(loc = 'upper right',
labels = retention_rate_df.columns.values)
plt.show()
You’ve been doing a lot of repetitive calculations. Anytime you notice repetition in your work, consider automation. The more of the low-variance work you can automate, the more time you will have to explore new and interesting data science topics at work. This will both impress your marketing stakeholders and be more fun!
Since you know the format of the marketing DataFrame will remain the same over time, you can build a function to enable you to calculate conversion rate across any sub-segment you want on the fly.
In this exercise, you will build a function that takes a DataFrame and list of column names and outputs the conversion rate across the column(s).
Isolate the rows of the input dataframe where users converted, then group by the input list of column_names and count the number of unique converted users. Group the input dataframe by the same column_names and calculate the total number of unique users. Fill any missing values in conversion_rate with 0.
def conversion_rate(dataframe, column_names):
# Total number of converted users
column_conv = dataframe[dataframe['converted'] == True]\
.groupby(column_names)['user_id'].nunique()
# Total number users
column_total = dataframe.groupby(column_names)['user_id'].nunique()
# Conversion rate
conversion_rate = column_conv/column_total
# Fill missing values with 0
conversion_rate = conversion_rate.fillna(0)
return conversion_rate
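As a quick check (not part of the exercise), the function should reproduce the language-level conversion rates you calculated step by step earlier:
# Should match the language_conversion_rate Series computed manually above
print(conversion_rate(marketing, ['language_displayed']))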
### Test and visualize conversion function
You’ve done the hard work of building your conversion rate function; now it’s time to test it out! Automating your analyses can be time-consuming up front, but this is where it all pays off.
In this exercise, you’ll see how quickly you can calculate the conversion rate, a task that took multiple steps in previous lessons. By automating the repetitive parts of your work, you’ll be able to spend more time doing complex analyses.
Use your conversion_rate() function to calculate the conversion rate in marketing by date_served and age_group and store your results in age_group_conv. Unstack age_group_conv at level equal to 1 and wrap that in a call to pd.DataFrame() to create age_group_df. Create a line chart to display your results from age_group_df.
# Calculate conversion rate by age_group
age_group_conv = conversion_rate(marketing, ['date_served', 'age_group'])
print(age_group_conv)
# Unstack and create a DataFrame
age_group_df = pd.DataFrame(age_group_conv.unstack(level = 1))
# Visualize conversion by age_group
age_group_df.plot()
plt.title('Conversion rate by age group\n', size = 16)
plt.ylabel('Conversion rate', size = 14)
plt.xlabel('Age group', size = 14)
plt.show()
Now that you’ve looked at conversion rate by age, you want to see if that trend has changed over time. Marketing has been changing their strategy and wants to make sure that their new method isn’t alienating age groups that are less comfortable with their product. However, to do so, you need to create a plotting function to make it easier to visualize your results.
In this exercise, you will build a function to plot the results of your conversion rate function.
Create a for loop for each column in the dataframe. Plot a line chart of the column by the DataFrame’s index. Show the plot.
def plotting_conv(dataframe):
for column in dataframe:
# Plot column by dataframe's index
plt.plot(dataframe.index, dataframe[column])
plt.title('Daily ' + str(column) + ' conversion rate\n',
size = 16)
plt.ylabel('Conversion rate', size = 14)
plt.xlabel('Date', size = 14)
# Show plot
plt.show()
plt.clf()
Your marketing stakeholders have requested a report of the daily conversion rate for each age group, and they need it as soon as possible. They want you to refresh this report on a monthly cadence. This is a perfect opportunity to utilize your functions. Not only will the functions help you get this report out promptly today, but it will also help each month when it’s time for a refresh of the data.
Remember, conversion_rate() takes a DataFrame and a list of columns to calculate the conversion rate.
Using your conversion_rate() function, create a new DataFrame called age_group_conv which contains conversion rate by date_served and age_group from the marketing DataFrame. Unstack age_group_conv to create a DataFrame with each age group as a column. This step has already been done for you. Use your plotting_conv() function to plot the conversion rates for each age group.
# Calculate conversion rate by date served and age group
age_group_conv = conversion_rate(marketing, ['date_served', 'age_group'])
# Unstack age_group_conv and create a DataFrame
age_group_df = pd.DataFrame(age_group_conv.unstack(level=1))
# Plot the results
plotting_conv(age_group_df)
### House ads conversion rate
The house ads team has become worried about some irregularities they’ve noticed in conversion rate. It is common for stakeholders to come to you with concerns they’ve noticed around changing metrics. As a data scientist, it’s your job to determine whether these changes are natural fluctuations or if they require further investigation.
In this exercise, you’ll try your conversion_rate() and plotting_conv() functions out on marketing, looking at conversion rate by ‘date_served’ and ‘marketing_channel’.
Use your conversion_rate() function on marketing to determine conversion rate by ‘date_served’ and ‘marketing_channel’.
# Calculate conversion rate by date served and channel
daily_conv_channel = conversion_rate(marketing, ['date_served', 'marketing_channel'])
print(daily_conv_channel.head())
date_served marketing_channel
2018-01-01 Email 1.000000
Facebook 0.117647
House Ads 0.084656
Instagram 0.106667
Push 0.083333
Name: user_id, dtype: float64
Unstack daily_conv_channel with level equal to one and convert the result into a DataFrame. Try out your plotting_conv() function on daily_conv_channel.
# Calculate conversion rate by date served and channel
daily_conv_channel = conversion_rate(marketing, ['date_served',
'marketing_channel'])
# Unstack daily_conv_channel and convert it to a DataFrame
daily_conv_channel = pd.DataFrame(daily_conv_channel.unstack(level = 1))
# Plot results of daily_conv_channel
plotting_conv(daily_conv_channel)
Now that you have confirmed that house ads conversion has been down since January 11, you will try to identify potential causes for the decrease.
As a data scientist supporting a marketing team, you will run into fluctuating metrics all the time. It’s vital to identify if the fluctuations are due to expected shifts in user behavior (i.e., differences across the day of the week) versus a larger problem in technical implementation or marketing strategy.
In this exercise, we will begin by checking whether users are more likely to convert on weekends compared with weekdays and determine if that could be the cause for the changing house ads conversion rate.
Add a day of week column to the marketing DataFrame using dt.dayofweek based on the ‘date_served’ column. Use conversion_rate to calculate conversion by the day of week and marketing channel and store the results in DoW_conversion. Create a line plot of the results, set the y-axis to begin at 0 and display the plot.
# Add day of week column to marketing
marketing['DoW_served'] = marketing['date_served'].dt.dayofweek
print(marketing.head())
# Calculate conversion rate by day of week
DoW_conversion = conversion_rate(marketing, ['DoW_served', 'marketing_channel'])
# Unstack channels
DoW_df = pd.DataFrame(DoW_conversion.unstack(level=1))
# Plot conversion rate by day of week
DoW_df.plot()
plt.title('Conversion rate by day of week\n')
plt.ylim(0)
plt.show()
user_id date_served marketing_channel variant converted \
0 a100000029 2018-01-01 House Ads personalization True
1 a100000030 2018-01-01 House Ads personalization True
2 a100000031 2018-01-01 House Ads personalization True
3 a100000032 2018-01-01 House Ads personalization True
4 a100000033 2018-01-01 House Ads personalization True
language_displayed language_preferred age_group date_subscribed \
0 English English 0-18 years 2018-01-01
1 English English 19-24 years 2018-01-01
2 English English 24-30 years 2018-01-01
3 English English 30-36 years 2018-01-01
4 English English 36-45 years 2018-01-01
date_canceled subscribing_channel is_retained DoW DoW_served
0 NaT House Ads True 0.0 0.0
1 NaT House Ads True 0.0 0.0
2 NaT House Ads True 0.0 0.0
3 NaT House Ads True 0.0 0.0
4 NaT House Ads True 0.0 0.0
Nice. As you can see, email conversion is particularly high and may reflect a tracking error, but house ads appear stable across the week, with a slight peak on Tuesday. You will investigate further.
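One quick way to probe the suspicious email numbers (a sketch, not part of the original exercise) is to check how many unique users the email channel actually reached; a very small denominator can produce extreme conversion rates:
# A small user base can make the email conversion rate look inflated
email_reach = marketing[marketing['marketing_channel'] == 'Email']['user_id'].nunique()
print('Unique users reached by email:', email_reach)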
### House ads conversion by language
Now that you’ve ruled out natural fluctuations across the day of the week a user saw our marketing assets as the cause of the decreasing house ads conversion, you will take a look at conversion by language over time. Perhaps the new marketing campaign does not apply broadly across different cultures.
Ideally, the marketing team will consider cultural differences prior to launching a campaign, but sometimes mistakes are made, and it will be your job to identify the cause. Often data scientists are the first line of defense to determine what went wrong with a marketing campaign. It’s your job to think creatively to identify the cause.
Create a new DataFrame house_ads by isolating the rows from marketing where marketing_channel equals ‘House Ads’. Run your conversion_rate() function on the house_ads DataFrame grouping by date_served and language_displayed. Use the plotting_conv() function on conv_lang_df to display your results.
# Isolate the rows where marketing channel is House Ads (use .copy() to avoid modifying a view of marketing)
house_ads = marketing[marketing['marketing_channel'] == 'House Ads'].copy()
# Calculate conversion by date served, and language displayed
conv_lang_channel = conversion_rate(house_ads, ['date_served', 'language_displayed'])
# Unstack conv_lang_channel
conv_lang_df = pd.DataFrame(conv_lang_channel.unstack(level=1))
# Use your plotting function to display results
plotting_conv(conv_lang_df)
Interesting. As you can see, the English conversion rate drops around the 11th, and there do not appear to be ads served in other languages for a two week period. You will investigate further.
The house ads team is concerned because they’ve seen their conversion rate drop suddenly in the past few weeks. In the previous exercises, you confirmed that conversion is down because you noticed a pattern around language preferences.
As a data scientist, it is your job to provide your marketing stakeholders with as specific feedback as possible as to what went wrong to maximize their ability to correct the problem. It is vital that you not only say “looks like there’s a language problem,” but instead identify what the problem is specifically so that the team doesn’t repeat their mistake.
Use np.where() to create a new column in house_ads called ‘is_correct_lang’ whose values are ‘Yes’ if ‘language_displayed’ is equal to ‘language_preferred’ and ‘No’ otherwise. Group by date_served and is_correct_lang to get a daily count of the ads served.
# Add the new column is_correct_lang
house_ads['is_correct_lang'] = np.where(
house_ads['language_preferred'] == house_ads['language_displayed'],
'Yes',
'No')
# Groupby date_served and is_correct_lang
language_check = house_ads.groupby(['date_served','is_correct_lang'])['is_correct_lang'].count()
# Unstack language_check and fill missing values with 0's
language_check_df = pd.DataFrame(language_check.unstack(level=1)).fillna(0)
# Print results
print(language_check_df)
is_correct_lang No Yes
date_served
2018-01-01 2.0 189.0
2018-01-02 3.0 247.0
2018-01-03 0.0 220.0
2018-01-04 0.0 168.0
2018-01-05 0.0 160.0
2018-01-06 1.0 151.0
2018-01-07 2.0 130.0
2018-01-08 0.0 154.0
2018-01-09 0.0 157.0
2018-01-10 0.0 170.0
2018-01-11 20.0 135.0
2018-01-12 18.0 130.0
2018-01-13 26.0 122.0
2018-01-14 20.0 131.0
2018-01-15 16.0 192.0
2018-01-16 28.0 127.0
2018-01-17 21.0 127.0
2018-01-18 31.0 121.0
2018-01-19 22.0 127.0
2018-01-20 28.0 124.0
2018-01-21 14.0 100.0
2018-01-22 13.0 72.0
2018-01-23 16.0 69.0
2018-01-24 13.0 83.0
2018-01-25 19.0 74.0
2018-01-26 24.0 92.0
2018-01-27 18.0 149.0
2018-01-28 28.0 136.0
2018-01-29 24.0 142.0
2018-01-30 23.0 145.0
2018-01-31 23.0 135.0
As you can see, almost no house ads were served in the wrong language before January 11; from the 11th onward, a noticeable share of house ads were shown in a language other than the user’s preference. You will investigate the impact of this bug next.
Now that you’ve created a DataFrame that checks whether users see ads in the correct language, let’s calculate what percentage of users were served house ads in their preferred language each day and plot the results.
Add a pct column to language_check_df by dividing the count of ads served in the correct language (the ‘Yes’ column) by the row sum, computed with the .sum() method. Make a line plot with the date as the x-axis and the pct column as the y-axis and show your results.
# Divide the count where language is correct by the row sum
language_check_df['pct'] = language_check_df['Yes']/language_check_df.sum(axis=1)
# Plot and show your results
plt.plot(language_check_df.index.values, language_check_df['pct'])
plt.show()
Great work! You have shown that house ads have been underperforming because all ads were served in English rather than in each user’s preferred language.
### Setting up conversion indexes
Now that you’ve determined that language is, in fact, the issue with House Ads conversion, stakeholders need to know how many subscribers they lost as a result of this bug.
In this exercise, you will index non-English language conversion rates against English conversion rates in the time period before the language bug arose.
Create a new DataFrame, house_ads_bug, that contains only the rows from house_ads with ‘date_served’ prior to ‘2018-01-11’. Use your conversion_rate() function on the house_ads_bug DataFrame and the ‘language_displayed’ column. Using the appropriate row, divide the Spanish, Arabic and German columns of lang_conv by the English column.
# Calculate pre-error conversion rate
house_ads_bug = house_ads[house_ads['date_served'] < '2018-01-11']
print(house_ads_bug.head())
lang_conv = conversion_rate(house_ads_bug, "language_displayed")
print(lang_conv.head())
# Index other language conversion rate against English
spanish_index = lang_conv['Spanish']/lang_conv['English']
arabic_index = lang_conv['Arabic']/lang_conv['English']
german_index = lang_conv['German']/lang_conv['English']
print("Spanish index:", spanish_index)
print("Arabic index:", arabic_index)
print("German index:", german_index)
user_id date_served marketing_channel variant converted \
0 a100000029 2018-01-01 House Ads personalization True
1 a100000030 2018-01-01 House Ads personalization True
2 a100000031 2018-01-01 House Ads personalization True
3 a100000032 2018-01-01 House Ads personalization True
4 a100000033 2018-01-01 House Ads personalization True
language_displayed language_preferred age_group date_subscribed \
0 English English 0-18 years 2018-01-01
1 English English 19-24 years 2018-01-01
2 English English 24-30 years 2018-01-01
3 English English 30-36 years 2018-01-01
4 English English 36-45 years 2018-01-01
date_canceled subscribing_channel is_retained DoW DoW_served \
0 NaT House Ads True 0.0 0.0
1 NaT House Ads True 0.0 0.0
2 NaT House Ads True 0.0 0.0
3 NaT House Ads True 0.0 0.0
4 NaT House Ads True 0.0 0.0
is_correct_lang
0 Yes
1 Yes
2 Yes
3 Yes
4 Yes
language_displayed
Arabic 0.500000
English 0.099093
German 0.444444
Spanish 0.166667
Name: user_id, dtype: float64
Spanish index: 1.681924882629108
Arabic index: 5.045774647887324
German index: 4.485133020344287
Great! Now that you’ve created indexes for each language compared with English, you can assess what conversion rate should have been during the rest of the month.
### Analyzing user preferences
To understand the true impact of the bug, it is crucial to determine how many subscribers we would have expected had there been no language error. This establishes the scale of the problem and how important it is to prevent this kind of error in the future.
In this step, you will create a new DataFrame that you can perform calculations on to determine the expected number of subscribers. This DataFrame will include how many users prefer each language by day. Once you have the DataFrame, you can begin calculating how many subscribers you would have expected to have had the language bug not occurred.
Group house_ads by date_served and language_preferred. Use a dictionary within a call to .agg() to calculate the number of unique users and sum the number of converted users. Unstack converted at level = 1.
# Group house_ads by date and language
converted = house_ads.groupby(["date_served", "language_preferred"])\
.agg({'user_id':'nunique',
'converted':'sum'})
# Unstack converted
converted_df = pd.DataFrame(converted.unstack(level = 1))
print(converted_df.head())
user_id converted \
language_preferred Arabic English German Spanish Arabic English German
date_served
2018-01-01 2.0 171.0 5.0 11.0 2 13 1
2018-01-02 3.0 200.0 5.0 10.0 0 14 3
2018-01-03 2.0 179.0 3.0 8.0 0 15 1
2018-01-04 2.0 149.0 2.0 14.0 0 12 0
2018-01-05 NaN 143.0 1.0 14.0 NaN 17 False
language_preferred Spanish
date_served
2018-01-01 0
2018-01-02 0
2018-01-03 1
2018-01-04 3
2018-01-05 3
Now that you’ve created an index to compare English conversion rates against all other languages, you will build out a DataFrame that will estimate what daily conversion rates should have been if users were being served the correct language.
An expected conversion DataFrame named converted has been created for you grouping house_ads by date and preferred language. It contains a count of unique users as well as the number of conversions for each language, each day.
For example, you can access the number of Spanish-speaking users who received house ads using converted_df[('user_id', 'Spanish')].
Use .loc to create the column english_conv_rate in converted with the English conversion rate between ‘2018-01-11’ and ‘2018-01-31’. Create expected conversion columns for each language by multiplying english_conv_rate by each language index (spanish_index, arabic_index or german_index). Multiply each language’s expected conversion rate by the number of users who should have received house ads.
# Create English conversion rate column for affected period
converted['english_conv_rate'] = converted_df.loc['2018-01-11':'2018-01-31'][('converted','English')]
# Create expected conversion rates for each language
converted['expected_spanish_rate'] = converted['english_conv_rate']*spanish_index
converted['expected_arabic_rate'] = converted['english_conv_rate']*arabic_index
converted['expected_german_rate'] = converted['english_conv_rate']*german_index
# Multiply number of users by the expected conversion rate
converted['expected_spanish_conv'] = converted['expected_spanish_rate']/100*converted_df[('user_id','Spanish')]
converted['expected_arabic_conv'] = converted['expected_arabic_rate']/100*converted_df[('user_id','Arabic')]
converted['expected_german_conv'] = converted['expected_german_rate']/100*converted_df[('user_id','German')]
It’s time to calculate how many subscribers were lost due to mistakenly serving users English rather than their preferred language. Once the team has an estimate of the impact of this error, they can determine whether it’s worth putting additional checks in place to avoid this in the future—you might be thinking, of course, it’s worth it to try to prevent errors! In a way, you’re right, but every choice a company makes requires work and funding. The more information your team has, the better they will be able to evaluate this trade-off.
The DataFrame converted has already been loaded for you. It contains expected subscribers columns for Spanish, Arabic and German language speakers named expected_spanish_conv, expected_arabic_conv and expected_german_conv respectively.
Create the converted DataFrame by using .loc to select only rows where the date is between ‘2018-01-11’ and ‘2018-01-31’. Sum the expected subscribers columns for each language in converted and store the results in expected_subs. Sum the actual subscribers for each language in converted and store the results in actual_subs. Subtract actual_subs from expected_subs to determine how many subscribers were lost due to the bug.
# Use .loc to slice only the relevant dates
converted = converted.loc['2018-01-11':'2018-01-31']
# Sum expected subscribers for each language
expected_subs = converted['expected_spanish_conv'].sum() + converted['expected_arabic_conv'].sum() + converted['expected_german_conv'].sum()
# Calculate how many subscribers we actually got
actual_subs = converted_df[('converted','Spanish')].sum() + converted_df[('converted','Arabic')].sum() + converted_df[('converted','German')].sum()
# Subtract how many subscribers we got despite the bug
lost_subs = expected_subs - actual_subs
print(lost_subs)
-62
The email portion of this campaign was actually run as an A/B test. Half the emails sent out were generic upsells to your product while the other half contained personalized messaging around the users’ usage of the site.
Before you begin analyzing the results, you will check to ensure users were allocated equally to the test and control groups.
Isolate the rows of marketing where the ‘marketing_channel’ is ‘Email’ and store the results in email. Group marketing by variant and sum the unique users and store the results in alloc. Plot the results of alloc in a bar chart.
# Subset the DataFrame
email = marketing[marketing['marketing_channel'] == 'Email']
# Group the email DataFrame by variant
alloc = email.groupby(['variant'])['user_id'].nunique()
# Plot a bar chart of the test allocation
alloc.plot(kind = "bar")
plt.title('Personalization test allocation')
plt.ylabel('# participants')
plt.show()
Now that we know allocation is relatively even, let’s look at the conversion rate for the control and personalization variants. Since we chose conversion rate as the key metric for this test, it is important to evaluate whether conversion was higher in the personalization treatment compared with the control. While we will dive in deeper in subsequent exercises, measuring the difference in the key metric between the control and the treatment is the most important part of evaluating the success of an A/B test.
The DataFrame email, which contains only the rows from the marketing DataFrame where marketing_channel is ‘Email’, has been loaded in your workspace.
Group the email DataFrame by user_id and variant while selecting the maximum value of the converted column and store the results in subscribers. Drop missing values from the control column of subscribers_df. Drop missing values from the personalization column of subscribers_df. Calculate the conversion rate for both personalization and control using the appropriate function for each.
# Group marketing by user_id and variant
subscribers = email.groupby(['user_id',
'variant'])['converted'].max()
subscribers_df = pd.DataFrame(subscribers.unstack(level=1))
# Drop missing values from the control column
control = subscribers_df['control'].dropna()
# Drop missing values from the personalization column
personalization = subscribers_df['personalization'].dropna()
print('Control conversion rate:', np.mean(control))
print('Personalization conversion rate:', np.mean(personalization))
Control conversion rate: 0.2814814814814815
Personalization conversion rate: 0.3908450704225352
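A natural follow-up, not covered in the exercise above, is to quantify the difference between the two groups directly; a minimal sketch computing the lift of personalization over control, using the control and personalization Series defined above:
# Lift: relative improvement of the treatment over the control
control_rate = np.mean(control)
personalization_rate = np.mean(personalization)
lift = (personalization_rate - control_rate) / control_rate
print('Lift:', str(round(lift * 100, 2)) + '%')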