Introduction to Quarto
1 Introduction
This is a Quarto file. When you execute code within a Quarto file, the results appear beneath the code. Within RStudio, try executing the following chunk by clicking the Run button within the chunk (the green triangle on the right-hand side) or by placing your cursor inside it and pressing Ctrl+Shift+Enter.
plot(cars) #cars is a built-in dataset available in R.
You can add a new chunk by clicking the Insert Chunk button on the toolbar (the green box with the c) or by pressing Ctrl+Alt+I.
To produce a complete report containing all commentary, code and output, click the Render button in the toolbar across the top of RStudio (look for the blue arrow). This will create an HTML file, which should open automatically and also appear in the folder where you have saved this Quarto file.
1.1 How to Format Text
The visual editor in RStudio provides a WYSIWYM interface for adjusting the content of your report. This includes formatting your text similar to how you would in MS Word, i.e. you can underline text, put words in bold font or italics, add bullet points, insert images, etc… While the visual editor displays your content with your desired formatting, if you choose the source editor, then you can see your content in plain Markdown language. You can switch back and forth between the visual and source editors to view and edit your content using either tool.
If you are editing text in the source editor, then the following summarises how to apply various formatting:
Putting the hashtag symbol “#” followed by a space creates a new section. Use “##” for a second level heading and “###” for a third level heading.
Putting a pair of asterisks around a word causes the word to be printed in italics.
Putting a pair of double asterisks around a word causes the word to be printed in bold.
You can use single dashes to create bullet points, for example:
- This is the first bullet point.
- This is the second bullet point.
If you want to use numbered bullet points, then you use the following:
1. This is the first numbered bullet point.
2. This is the second numbered bullet point.
If you wish to include a hyperlink with a name as opposed to the actual website address, then write what you want to call the hyperlink inside square brackets and then include the link in round brackets directly after, for example RTE News.
You can also include .png image files in your Quarto file. This is similar to how you include a hyperlink, except you put an exclamation point at the start. The caption for the image goes in the square brackets and then the name of the .png file goes in the round brackets. Note that the .png file must be saved in the same location as where you have saved this Quarto file.
2 Example of R Code, Output & Commentary Together
Let’s take the dataset from an article called Predict Customer Churn with R. The dataset is called churn.csv. To import this dataset into R, we first load a few required packages (tidyverse, grid, etc…) and then use the usual read_csv() function. It’s important to make sure the churn.csv file is saved in the same location as where you have saved this Quarto file - setting the working directory doesn’t apply when using Quarto.
REMEMBER: When carrying out your analysis, all R code must be included in an R code chunk, which is most easily created by pressing Ctrl+Alt+I. All of your commentary and interpretation goes outside the R code chunk.
library(tidyverse)
library(grid)
library(gridExtra) #Contains the grid.arrange function for laying out several plots in the same plot window.
churn <- read_csv("churn.csv")
Within RStudio, look at the Quarto file and notice that we used message = F and warning = F in the chunk option area. Render the file with and without these options to see what they do. With Quarto, it is good practice to prevent messages and warnings from appearing in the output HTML document. This is usually achieved by adding message = F and warning = F to the R chunk option area. See Section 29.5 from R for Data Science for more information on chunk options.
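As a rough sketch of what these two options suppress, the chunk body below produces one message and one warning. The `#|` lines are Quarto's comment-style way of writing chunk options, shown here as an assumed alternative to writing message = F and warning = F in the chunk header; the variable name charges and its values are made up for illustration.

```r
#| message: false
#| warning: false

# Body of a hypothetical R chunk. The two "#|" lines above are chunk
# options; to R itself they are ordinary comments.

# message() writes an informational message, similar to the start-up
# text printed when library(tidyverse) is loaded.
message("Loading data...")

# Converting non-numeric text gives NA and raises a coercion warning.
charges <- as.numeric(c("29.85", "56.95", "unknown"))

# The result is still computed as normal; with both options switched off,
# only the message and warning text are left out of the rendered document.
mean(charges, na.rm = TRUE)
```

Rendering once with these options and once without them should make the difference visible in the HTML output.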
The data contains 7032 rows (customers) and 19 columns (variables). Each row represents a customer and each column represents the following customer attributes:
- churn: whether the customer churned or not (Yes, No).
- tenure_group: how long the customer has been with the company (Up to One Year, One to Two Years, Two to Four Years, Four to Five Years, More than Five Years).
- gender: gender of customer (female, male).
- senior_citizen: whether the customer is a senior citizen or not (Yes, No).
- partner: whether the customer has a partner or not (Yes, No).
- dependents: whether the customer has dependents or not (Yes, No).
- phone_service: whether the customer has a phone service or not (Yes, No).
- multiple_lines: whether the customer has multiple lines or not (Yes, No).
- internet_service: customer’s internet service provider (DSL, Fiber optic, No).
- online_security: whether the customer has online security or not (Yes, No).
- online_backup: whether the customer has online backup or not (Yes, No).
- device_protection: whether the customer has device protection or not (Yes, No).
- tech_support: whether the customer has tech support or not (Yes, No).
- streaming_tv: whether the customer has streaming TV or not (Yes, No).
- streaming_movies: whether the customer has streaming movies or not (Yes, No).
- contract: the contract term of the customer (Month-to-month, One year, Two year).
- paperless_billing: whether the customer has paperless billing or not (Yes, No).
- payment_method: the customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic)).
- monthly_charges: the amount charged to the customer monthly.
In case you are unfamiliar with the term churn: customer churn occurs when customers or subscribers stop doing business with a company or service, also known as customer attrition. It is also referred to as loss of clients or customers. One industry in which churn rates are particularly useful is the telecommunications industry, because most customers have multiple options from which to choose within a geographic location. Later in this module we will predict which customers are likely to churn. For now, we are going to get a feel for the data by carrying out some exploratory data analysis using the ggplot2 and dplyr packages (i.e. create some graphs and summary tables).
2.1 Initial Graphs
First we can start off by making some bar charts of the following categorical variables: Gender, Senior Citizen, Partner and Dependents (the same could be done for the remaining categorical variables including Phone Service, Internet Services, etc…). From these graphs, we can see the following:
- There is an almost equal number of males and females.
- There are far fewer senior citizens than non-senior citizens.
- Close to an equal number of customers do and do not have a partner.
- Nearly twice as many customers do not have a dependent when compared to the customers who do have a dependent.
p1 <- ggplot(churn) +
  geom_bar(aes(y = gender), width = 0.5) +
  ggtitle("Gender") +
  ylab("Gender") +
  xlab("# Customers") +
  theme_minimal()

p2 <- ggplot(churn) +
  geom_bar(aes(y = senior_citizen), width = 0.5) +
  ggtitle("Senior Citizen") +
  ylab("Senior Citizen") +
  xlab("# Customers") +
  theme_minimal()

p3 <- ggplot(churn) +
  geom_bar(aes(y = partner), width = 0.5) +
  ggtitle("Partner") +
  ylab("Partner") +
  xlab("# Customers") +
  theme_minimal()

p4 <- ggplot(churn) +
  geom_bar(aes(y = dependents), width = 0.5) +
  ggtitle("Dependents") +
  ylab("Dependents") +
  xlab("# Customers") +
  theme_minimal()

grid.arrange(p1, p2, p3, p4, ncol = 2)
The above bar charts show the number of customers in each category. However, it can often be more informative to show the percentage of customers in each category. To do this, you could run the following code instead, where x takes a special formula that calculates the percentage of rows in each category of the y variable (i.e. internet_service). Now we can see that roughly 22% of customers have no internet services, whereas roughly 44% of customers have fiber optic internet.
# after_stat(count) is the current replacement for the older ..count.. notation.
ggplot(churn) +
  geom_bar(aes(x = 100 * after_stat(count) / sum(after_stat(count)), y = internet_service), width = 0.5) +
  ggtitle("Internet Service") +
  ylab("Internet Service") +
  xlab("Percentage") +
  theme_minimal()
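The percentages shown in a chart like this can also be cross-checked as a table. The following is a minimal sketch on a small made-up data frame (toy_churn and pct_table are hypothetical names, and the ten rows are not taken from churn.csv); it mirrors the count divided by the sum of counts calculation used in the plot.

```r
library(dplyr)

# Hypothetical stand-in for the churn data: 10 made-up customers.
toy_churn <- data.frame(
  internet_service = c("DSL", "Fiber optic", "No", "Fiber optic", "DSL",
                       "Fiber optic", "No", "DSL", "Fiber optic", "Fiber optic")
)

pct_table <- toy_churn %>%
  count(internet_service, name = "customers") %>%          # rows per category
  mutate(percentage = 100 * customers / sum(customers))    # share of all rows

pct_table
```

Swapping toy_churn for churn should give the figures behind the bars, which is a quick way to confirm values read off the chart.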
Now we are interested in looking at the monthly charges variable for different groups, but first we’ll create a histogram to see how monthly charges are distributed. From the histogram below we can see there is a spike in the number of customers having a monthly charge around €15-25, although there is a generally even distribution of customers having monthly charges from €45-100.
ggplot(churn) +
geom_histogram(aes(x = monthly_charges), bins = 25) +
ggtitle("Distribution of Monthly Charges") +
xlab("Monthly Charges")
2.2 Formatting Tables
Let’s see how the average monthly charge differs by tenure group. The table below shows that the longer the tenure of the customer, the higher the average monthly charge.
churn %>%
  group_by(tenure_group) %>%
  summarise(avg_monthly_charge = mean(monthly_charges)) %>%
  arrange(tenure_group)
# A tibble: 5 × 2
tenure_group avg_monthly_charge
<chr> <dbl>
1 Four to Five Years 70.6
2 More than Five Years 76.0
3 One to Two Years 61.4
4 Two to Four Years 65.9
5 Up to One Year 56.2
Notice that tenure_group is an ordinal variable, but the categories are not sorted appropriately, i.e. they should range from the shortest time period to the longest. We need to tell R how the categories of the tenure_group variable should be ordered. We do this using a function called factor(), which specifies the order in which the categories of the tenure group should be listed.
After using factor() to specify the correct order of the categories, we can run the exact same code from above, but this time the categories will be sorted correctly.
churn$tenure_group <- factor(churn$tenure_group, levels = c("Up to One Year", "One to Two Years", "Two to Four Years", "Four to Five Years", "More than Five Years"))

churn %>%
  group_by(tenure_group) %>%
  summarise(avg_monthly_charge = mean(monthly_charges)) %>%
  arrange(tenure_group)
# A tibble: 5 × 2
tenure_group avg_monthly_charge
<fct> <dbl>
1 Up to One Year 56.2
2 One to Two Years 61.4
3 Two to Four Years 65.9
4 Four to Five Years 70.6
5 More than Five Years 76.0
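To see why the first table came out in the wrong order, here is a small sketch with made-up values (the names tenure, tenure_levels and tenure_fct are hypothetical): text is sorted alphabetically, whereas a factor is sorted by the order of its levels.

```r
# Made-up tenure labels, stored as plain text.
tenure <- c("Two to Four Years", "Up to One Year", "More than Five Years",
            "One to Two Years", "Four to Five Years")

# Text is sorted alphabetically, so "Four to Five Years" comes first.
sort(tenure)

# Supplying levels fixes the order from the shortest to the longest tenure.
tenure_levels <- c("Up to One Year", "One to Two Years", "Two to Four Years",
                   "Four to Five Years", "More than Five Years")
tenure_fct <- factor(tenure, levels = tenure_levels)

# A factor is sorted by its levels rather than alphabetically.
sort(tenure_fct)
levels(tenure_fct)
```

The same logic applies inside arrange() and in ggplot2 axes, which is why converting the column once is enough to fix every later table and chart.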
Notice that the table appears in the HTML file in the same format as it appears in the console window. However, we can make the table look a lot nicer by using a function called kable() from the knitr package and setting the following parameters:
- col.names is used to rename the column headings.
- digits is used to specify the number of decimal places that each column should be rounded to. Here we have set digits = c(0,2), which applies 0 decimal places to the first column (this is harmless because the first column contains text) and rounds the second column to 2 decimal places.
- align is used to specify the alignment of each column. Here we use "lr", which means the first column should be left-aligned and the second column should be right-aligned (as a rule, text columns are left-aligned and numerical columns are right-aligned).
- caption is used to give the table a caption.
table_1 <- churn %>%
  group_by(tenure_group) %>%
  summarise(avg_monthly_charge = mean(monthly_charges)) %>%
  arrange(tenure_group)

knitr::kable(table_1,
             digits = c(0,2),
             align = "lr",
             col.names = c("Tenure Group", "Avg Monthly Charge"),
             caption = "Average Monthly Charge by Tenure Group")
Tenure Group | Avg Monthly Charge |
---|---|
Up to One Year | 56.17 |
One to Two Years | 61.36 |
Two to Four Years | 65.93 |
Four to Five Years | 70.55 |
More than Five Years | 75.95 |
If rendering to HTML, the table can sometimes appear “squashed”, making it difficult to read each individual column, or it can appear too spread out. However, we can use functions from the kableExtra package to apply further formatting. For example, if rendering to HTML, we can tell Quarto that the table should not span the width of the screen by (i) specifying format = "html", (ii) telling Quarto not to process the table by adding table.attr = 'data-quarto-disable-processing = "true"' within the kable() function (Quarto has a special processing method that we don’t want to apply here) and (iii) adding kable_styling(full_width = F) using %>%.
You can also use functions such as row_spec() and column_spec() to format the table in different ways. This can be particularly helpful if you want to draw the reader’s attention to a particularly interesting figure in the table.
library(kableExtra) #You may need to install this package using install.packages("kableExtra")
knitr::kable(table_1,
             format = "html",
             digits = c(0,2),
             align = "lr",
             col.names = c("Tenure Group", "Avg Monthly Charge"),
             caption = "Average Monthly Charge by Tenure Group",
             table.attr = 'data-quarto-disable-processing = "true"') %>%
  kable_styling(full_width = F) %>%
  row_spec(0, bold = TRUE, color = "white", background = "black") %>%
  column_spec(1, underline = TRUE, italic = TRUE, bold = TRUE, color = "red")
Tenure Group | Avg Monthly Charge |
---|---|
Up to One Year | 56.17 |
One to Two Years | 61.36 |
Two to Four Years | 65.93 |
Four to Five Years | 70.55 |
More than Five Years | 75.95 |
2.3 Compare Churn Rates
More relevant to this analysis is how the rate of churn varies for different variables. If we can find variables with categories having very different churn rates, then these could be useful for our predictive modelling.
Let’s compare the churn rate for gender, tenure group, contract type and internet service by running the following code.
p1 <- ggplot(churn) +
  geom_bar(aes(y = gender, fill = churn), position = "fill") +
  ylab("Gender") +
  xlab("% Customers")

p2 <- ggplot(churn) +
  geom_bar(aes(y = tenure_group, fill = churn), position = "fill") +
  ylab("Tenure Group") +
  xlab("% Customers")

p3 <- ggplot(churn) +
  geom_bar(aes(y = contract, fill = churn), position = "fill") +
  ylab("Contract Type") +
  xlab("% Customers")

p4 <- ggplot(churn) +
  geom_bar(aes(y = internet_service, fill = churn), position = "fill") +
  ylab("Internet Service") +
  xlab("% Customers")

grid.arrange(p1, p2, p3, p4, ncol = 2)
The barcharts above show the following:
- The % of males and females who churn is almost equal, so it does not appear that gender has any impact on the likelihood of a customer to churn.
- For Tenure Group, there is an obvious trend that the longer the customer has been with the company, the lower the churn rate.
- For Contract Type, there again seems to be a trend that the longer the contract the customer takes out, the less likely they are to churn (which makes sense really…).
- For Internet Service, it seems that customers with Fiber Optic broadband are the most likely to churn with nearly 40% of customers in this group churning.
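The proportions read off stacked bars like these can also be computed directly. Below is a minimal sketch on a made-up data frame (toy_churn, churn_rates and all values are hypothetical, not taken from churn.csv): the churn rate within a group is simply the share of its rows where churn equals "Yes".

```r
library(dplyr)

# Hypothetical stand-in for the churn data: 10 made-up customers.
toy_churn <- data.frame(
  contract = c(rep("Month-to-month", 4), rep("One year", 4), rep("Two year", 2)),
  churn    = c("Yes", "Yes", "No", "No",
               "Yes", "No", "No", "No",
               "No", "No")
)

churn_rates <- toy_churn %>%
  group_by(contract) %>%
  summarise(customers  = n(),                          # group size
            churn_rate = 100 * mean(churn == "Yes"))   # % of the group that churned

churn_rates
```

Replacing contract with another categorical variable gives the figures behind any of the four panels, which helps when two bars look too similar to compare by eye.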
Finally, we could also compare the average of numerical variables for customers who did and did not churn. For example, the following code compares the average monthly charges for the two groups. We can see that customers who churned had a higher average monthly charge (€74.44) than those who did not (€61.31). Maybe they found a cheaper provider?
table_2 <- churn %>%
  group_by(churn) %>%
  summarise(avg_monthly_charge = mean(monthly_charges))

knitr::kable(table_2,
             format = "html",
             digits = 2,
             align = "lr",
             col.names = c("Churn Group", "Avg Monthly Charge"),
             caption = "Average Monthly Charge by Churn Group",
             table.attr = 'data-quarto-disable-processing = "true"') %>%
  kable_styling(full_width = F) %>%
  row_spec(0, bold = TRUE, color = "white", background = "grey")
Churn Group | Avg Monthly Charge |
---|---|
No | 61.31 |
Yes | 74.44 |
2.4 Further Analysis
We could also investigate the following:
- How many customers are availing of the different contract types? Have paperless billing?
- What percentage of customers stream movies? Avail of device protection?
- What is the average monthly charge for customers who churn compared to those who do not churn?
- Does payment method appear to have any impact on a customer’s likelihood of churning? What about whether or not they avail of tech support?
3 Publishing Your Report
There are many ways to publish documents created using Quarto, which are discussed at this link. The following gives instructions on how to publish your report using two options: RPubs and Quarto Pub.
3.1 RPubs
- Create a free account with RPubs and make sure you are logged into your RPubs account.
- Open the Quarto file that you want to publish within RStudio.
- Across the top of RStudio, look for a blue icon beside the green “Add Chunk” and “Run” buttons.
- Click the drop-down and select “Publish Document”.
- A box will appear asking where you want to publish your report. Choose “RPubs”. Then click “Publish”.
- Your default internet browser will open a page, asking you to confirm various details such as the title and description of your report. After entering these details, click “Continue”.
- Your report should now be available to view within RPubs.
3.2 Quarto Pub
- Create a free account with Quarto Pub.
- Open the Quarto file that you want to publish within RStudio.
- In the bottom left of RStudio, click the Terminal tab and copy and paste in the following code: quarto publish quarto-pub
- You will be asked to authorise your account and to choose the name you want to use for the published file.
- RStudio will then render your file and make it available within your Quarto Pub account. You will then be able to share a hyperlink to your file with relevant stakeholders.
4 Further Reading
To learn more about Quarto, please see the following resources:
- Quarto Chapter in the R for Data Science book by Wickham and Grolemund.
- Quarto formats Chapter in the R for Data Science book by Wickham and Grolemund.
- Quarto Website.
The following articles use R (specifically the tidyverse) to analyse datasets. I advise that you read through these articles to see some examples of what you can do when analysing data. You may not be able to replicate all of this code because data may not be available and sometimes there are bugs in the code, but you should still find some useful tips on visualising and manipulating data using R. The analyses are pretty interesting too!
Predict Customer Churn with R. Note that we have not yet covered the predictive analytics techniques that are shown in this article. So, for now, I advise that you focus on the content up to the section called Logistic Regression.
Three Different Premier League Stories. This has nothing to do with marketing and is a bit outdated, but might be of interest to sports fans…
Analysis of FIFA 18 data. Again, nothing to do with marketing, but may be of interest… The R code used here is pretty good too.
ggplot ’Em All - Pokemon! Even less related to marketing, but some good code used here.