Introduction

My name is İlker Zeybek. I am currently a senior in the Industrial Engineering Department of Boğaziçi University. This RMarkdown document was prepared for the hiring process of Narcade. It contains answers to the 6 questions given in the Narcade Case Study. All questions are solved in R.

Question 1

In this question, we are asked to fit a simple linear regression model to the given data. Our predictor variable (x) is the training time of a worker and our response variable (y) is the time to complete the project. Workers are assumed to be identical, so it makes sense to fit a simple linear regression without considering additional variability from other sources.

First, we load the package needed for this question.

# Loading the necessary package for visualization
library(ggplot2)

After that, we create the predictor and response vectors and build a data frame from them. The df variable is our data frame with columns TrainingTime and CompleteTime.

# Creating the predictor and response variables.
training_time <- c(22, 18, 30, 16, 25, 20, 10, 14)
complete_time <- c(18.4, 19.2, 14.5, 19.0, 16.6, 17.7, 24.4, 21.0)
df <- data.frame(TrainingTime = training_time, CompleteTime = complete_time)

Part a

The estimated regression line is found with the lm() function in the base R distribution.

# Simple Linear Regression
fit <- lm(CompleteTime ~ TrainingTime, data = df)
summary(fit)
## 
## Call:
## lm(formula = CompleteTime ~ TrainingTime, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.35086 -0.41411  0.00559  0.46054  1.38093 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  27.46608    1.13830  24.129 3.33e-07 ***
## TrainingTime -0.44470    0.05617  -7.916 0.000216 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9431 on 6 degrees of freedom
## Multiple R-squared:  0.9126, Adjusted R-squared:  0.8981 
## F-statistic: 62.67 on 1 and 6 DF,  p-value: 0.0002157
  • The simple linear regression gives \(R^2\) = 0.9126 (adjusted \(R^2\) = 0.8981). This means training time as a predictor variable explains 91.26% of the variability in the completion time of the project.
  • The p-value of the slope is 0.000216, which matches the p-value of the overall F-test (0.0002157) because the model has a single predictor. This means training time is a statistically significant predictor in the simple linear regression fit.
  • Together, the high \(R^2\) and the small p-value suggest a strong negative linear relationship between training time and the completion time of the project: each additional hour of training is associated with an estimated reduction of about 0.44 hours in completion time.
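
The quantities quoted above can also be read directly from the summary object rather than from the printed output. The following is a minimal sketch that rebuilds the data of this question so it runs on its own; the names r_squared, adj_r_squared, and slope_p_value are chosen here for illustration only.

```r
# Rebuild the data and the fit from this question
training_time <- c(22, 18, 30, 16, 25, 20, 10, 14)
complete_time <- c(18.4, 19.2, 14.5, 19.0, 16.6, 17.7, 24.4, 21.0)
df <- data.frame(TrainingTime = training_time, CompleteTime = complete_time)
fit <- lm(CompleteTime ~ TrainingTime, data = df)

# summary() returns a list whose elements can be accessed by name
s <- summary(fit)
r_squared <- s$r.squared          # "Multiple R-squared" in the printed output
adj_r_squared <- s$adj.r.squared  # "Adjusted R-squared" in the printed output

# coef() on the summary gives the coefficient table as a matrix
slope_p_value <- coef(s)["TrainingTime", "Pr(>|t|)"]

round(c(R2 = r_squared, AdjR2 = adj_r_squared, p = slope_p_value), 4)
```

Reading the values by name avoids confusing the multiple and the adjusted \(R^2\), which sit next to each other in the printed summary.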

Part b and c

To predict outcomes from this simple linear regression fit, the predict() function is used. Training times of 28 and 50 hours are supplied as a data frame through the newdata argument.

# Predicted Completion Times for identical workers with 28 and 50 hours of training
predictions <- predict(fit, newdata = data.frame(TrainingTime = c(28, 50)))

Prediction for 28 hours of training time.

predictions[1]
##        1 
## 15.01446

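Prediction for 50 hours of training time. Note that 50 hours lies outside the observed training times (10 to 30 hours), so this value is an extrapolation and should be treated with caution.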

predictions[2]
##        2 
## 5.231042
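
A point prediction alone does not show how uncertain it is. As a rough sketch, predict() can also return 95% prediction intervals through its interval argument. The data are rebuilt here so the block runs on its own, and the names new_workers, pred_int, and outside_range are illustrative only.

```r
# Rebuild the data and the fit from this question
training_time <- c(22, 18, 30, 16, 25, 20, 10, 14)
complete_time <- c(18.4, 19.2, 14.5, 19.0, 16.6, 17.7, 24.4, 21.0)
df <- data.frame(TrainingTime = training_time, CompleteTime = complete_time)
fit <- lm(CompleteTime ~ TrainingTime, data = df)

# Point predictions with 95% prediction intervals (columns: fit, lwr, upr)
new_workers <- data.frame(TrainingTime = c(28, 50))
pred_int <- predict(fit, newdata = new_workers,
                    interval = "prediction", level = 0.95)
pred_int

# Flag training times that fall outside the observed range (extrapolation)
outside_range <- new_workers$TrainingTime < min(df$TrainingTime) |
        new_workers$TrainingTime > max(df$TrainingTime)
outside_range
```

The interval for 50 hours is wider than the one for 28 hours, because 50 lies much further from the mean observed training time.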

Visualization of Question 1

To see the linear fit between the training time of a worker and the completion time of the project clearly, we visualize it using the ggplot2 package. The red line is the fitted least squares regression line and the grey band around it is its 95% confidence interval (the geom_smooth() default).

# Visualization of predictor vs response
g <- ggplot(df, aes(x = TrainingTime, y = CompleteTime)) +
        geom_point() +
        geom_smooth(formula = y ~ x, method = "lm", col = "red", lwd = 1.25) +
        xlab("Training Time (hr)") +
        ylab("Complete Time (hr)") +
        labs(title = "Training Time vs Complete Time") +
        theme(plot.title = element_text(hjust = 0.5))
g

Question 2

Ad-Campaign: Success or Failure?

To decide whether the ad-campaign is successful, we have to determine whether its revenues exceed its expenses. This is the simplest criterion that comes to mind. At day 90, Average Revenue Per User is 1.7$ and the churn rate of the customers gained through the ad-campaign is 95%. To calculate Life Time Value:

  • Life Time Value = Average Revenue Per User / Churn Rate

can be used.

CPI <- 2
ARPU <- 1.7
churn_rate <- 0.95
LTV <- ARPU / churn_rate
paste(round(LTV, 2), "$", sep = "")
## [1] "1.79$"

As can be seen above, the Life Time Value of a user is 1.79$. To check whether our ad-campaign is successful at day 90, we have to compare it with the Cost Per Install value.

ratio <- LTV / CPI
round(ratio, 2)
## [1] 0.89

The LTV / CPI ratio is below 1, which indicates that the income generated by a user who installed our application through the ad-campaign is lower than the cost of acquiring that user. This means our ad-campaign is a failure at day 90.

What are the Solutions?

There are two simple ways to increase the LTV / CPI ratio, which determines how successful our ad-campaign is:

  • Increase the Life Time Value of the user
  • Lower the Cost Per Install
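
To make these two levers concrete, the following sketch computes the break-even value of each input under the LTV formula used above, holding the other inputs fixed. The names breakeven_cpi, breakeven_arpu, and breakeven_churn are introduced here for illustration.

```r
# Inputs from the question
CPI <- 2
ARPU <- 1.7
churn_rate <- 0.95
LTV <- ARPU / churn_rate

# Break-even means LTV / CPI = 1, i.e. ARPU / churn_rate = CPI
breakeven_cpi <- LTV                # highest CPI we can afford
breakeven_arpu <- CPI * churn_rate  # lowest ARPU that covers the CPI
breakeven_churn <- ARPU / CPI       # highest churn rate that covers the CPI

round(c(CPI = breakeven_cpi, ARPU = breakeven_arpu, Churn = breakeven_churn), 2)
```

Under these assumptions, the campaign breaks even if CPI falls to about 1.79$, if ARPU rises to 1.9$, or if the churn rate drops to 85%.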

Increasing the LTV

To increase LTV, we should either generate more revenue per user on average or increase our retention rate. To increase retention, we can send push notifications to users at pre-determined intervals. These notifications can offer free in-game boosts or limited-time promotions on in-game purchases. Another way to increase retention is to add extra excitement to the game: side quests, side achievements, or even weekly or monthly competitions. People usually love competition, so small contests that keep them playing are a good idea.

Lower the Cost Per Install

This may be the harder option, since it varies a lot with the ad provider. I do not have extensive knowledge of this field, e.g. which companies are the industry leaders. But simply achieving good reach at a relatively low ad-campaign cost always improves the LTV / CPI ratio.

Question 3

We are asked to produce a series of plots with the data set named Salaries, which is available online. The first step is downloading the data set and loading it into R.

url <- "http://rcs.bu.edu/examples/python/data_analysis/Salaries.csv"
download.file(url, "Salaries.csv")
data <- read.csv("Salaries.csv")
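
Each time the document is knitted, the code above downloads the file again. One possible refinement, sketched below, is to download only when the file is not already present. The helper name fetch_once is hypothetical and not part of any package.

```r
# Download a file only if it does not already exist locally.
# fetch_once is a hypothetical helper written for this sketch.
fetch_once <- function(url, destfile) {
        if (!file.exists(destfile)) {
                download.file(url, destfile, mode = "wb")
        }
        # Return the local path invisibly so it can be passed on to read.csv()
        invisible(destfile)
}
```

With this helper, the loading step could be written as read.csv(fetch_once(url, "Salaries.csv")), which reuses the local copy on later runs.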

After that we will load the necessary packages for visualizations.

library(ggplot2)
library(plotly)
library(ggbeeswarm)
library(GGally)
library(htmltools)

The first 10 observations of the data are shown below with the head() function to explore the data set.

head(data, 10)
##         rank discipline phd service  sex salary
## 1       Prof          B  56      49 Male 186960
## 2       Prof          A  12       6 Male  93000
## 3       Prof          A  23      20 Male 110515
## 4       Prof          A  40      31 Male 131205
## 5       Prof          B  20      18 Male 104800
## 6       Prof          A  20      20 Male 122400
## 7  AssocProf          A  20      17 Male  81285
## 8       Prof          A  18      18 Male 126300
## 9       Prof          A  29      19 Male  94350
## 10      Prof          A  51      51 Male  57800

Basic summary statistics of the numerical variables in the data set are calculated with the summary() function.

summary(data$phd)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   10.25   18.50   19.71   27.75   56.00
summary(data$service)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    5.25   14.50   15.05   20.75   51.00
summary(data$salary)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   57800   88613  104671  108024  126775  186960

Histograms of the salary variable are created with:

# Histogram with base plotting system
hist(data$salary, xlab = "Salary", main = "Histogram of Salaries", col = 'skyblue')

# Histogram with ggplot2 package
g <- ggplot(data, aes(x = salary)) +
        geom_histogram(bins = 8, col = "black", fill = "skyblue") +
        xlab("Salary") +
        ylab("Frequency") +
        labs(title = "Histogram of Salaries") +
        theme(plot.title = element_text(hjust = 0.5))
g

# Histogram with plotly package
g2 <- plot_ly(x = ~salary, data = data, type = "histogram")
g2 <- g2 %>% layout(barmode = "overlay", bargap = 0.1)
g2 <- div(g2, align = "center")
g2

Bar plots of the categorical variables are created with the ggplot2 package.

# Barplot with ggplot2
g3 <- ggplot(data, aes(x = rank)) +
        geom_bar(col = "black", fill = "skyblue") +
        xlab("Rank") +
        ylab("Frequency") +
        labs(title = "Barplot of Rank") +
        theme(plot.title = element_text(hjust = 0.5))
g3

g4 <- ggplot(data, aes(x = discipline)) +
        geom_bar(col = "black", fill = "skyblue") +
        xlab("Discipline") +
        ylab("Frequency") +
        labs(title = "Barplot of Discipline") +
        theme(plot.title = element_text(hjust = 0.5))
g4

g5 <- ggplot(data, aes(x = sex)) +
        geom_bar(col = "black", fill = "skyblue") +
        xlab("Sex") +
        ylab("Frequency") +
        labs(title = "Barplot of Sex") +
        theme(plot.title = element_text(hjust = 0.5))
g5

g6 <- ggplot(data, aes(x = rank)) +
        geom_bar(aes(fill = discipline)) +
        xlab("Rank") +
        ylab("Frequency") +
        labs(title = "Barplot of Rank with Discipline") +
        theme(plot.title = element_text(hjust = 0.5))
g6

g7 <- ggplot(data, aes(x = rank)) +
        geom_bar(aes(fill = sex)) +
        xlab("Rank") +
        ylab("Frequency") +
        labs(title = "Barplot of Rank with Sex") +
        theme(plot.title = element_text(hjust = 0.5))
g7

Scatter plots of the numerical variables, with an embedded loess smoothing curve or a linear regression line, are created with:

# Scatterplot with base plotting system
plot(data$phd, data$salary, xlab = "PhD", ylab = "Salary", main = "PhD vs Salary",
     pch = 19)

plot(data$service, data$salary, xlab = "Service", ylab = "Salary", main = "Service vs. Salary",
     pch = 19)

# Scatterplot with ggplot2 package
g8 <- ggplot(data, aes(x = phd, y = salary)) +
        geom_point(aes(color = sex)) +
        geom_smooth(formula = y ~ x, method = "loess", lwd = 0.5, col = "red") +
        xlab("PhD") +
        ylab("Salary") +
        labs(title = "PhD vs Salary") +
        theme(plot.title = element_text(hjust = 0.5))
g8

g9 <- ggplot(data, aes(x = service, y = salary)) +
        geom_point(aes(color = sex)) +
        geom_smooth(formula = y ~ x, method = "loess", lwd = 0.5, col = "red") +
        xlab("Service") +
        ylab("Salary") +
        labs(title = "Service vs Salary") +
        theme(plot.title = element_text(hjust = 0.5))
g9

g10 <- ggplot(data, aes(x = phd, y = salary)) +
        geom_point(aes(color = sex)) +
        geom_smooth(formula = y ~ x, method = "lm", lwd = 0.5, col = "red") +
        xlab("PhD") +
        ylab("Salary") +
        labs(title = "PhD vs Salary") +
        theme(plot.title = element_text(hjust = 0.5))
g10

g11 <- ggplot(data, aes(x = service, y = salary)) +
        geom_point(aes(color = sex)) +
        geom_smooth(formula = y ~ x, method = "lm", lwd = 0.5, col = "red") +
        xlab("Service") +
        ylab("Salary") +
        labs(title = "Service vs Salary") +
        theme(plot.title = element_text(hjust = 0.5))
g11

# Scatterplot with plotly package, simple but effective. Makes the plot interactive.
g8_plotly <- ggplotly(g8)
g8_plotly <- div(g8_plotly, align = "center")
g8_plotly
g9_plotly <- ggplotly(g9)
g9_plotly <- div(g9_plotly, align = "center")
g9_plotly
g10_plotly <- ggplotly(g10)
g10_plotly <- div(g10_plotly, align = "center")
g10_plotly
g11_plotly <- ggplotly(g11)
g11_plotly <- div(g11_plotly, align = "center")
g11_plotly

Boxplots are created with:

# Boxplot with base plotting system
boxplot(data$salary, main = "Boxplot of Salary", xlab = "Salary", ylab = "Salary in $")

# Boxplot with ggplot2 package
g12 <- ggplot(data, aes(x = sex, y = salary)) +
        geom_boxplot() +
        xlab("Sex") +
        ylab("Salary") +
        labs(title = "Boxplot of Salary with Sex") +
        theme(legend.position = "none", plot.title = element_text(hjust = 0.5))
g12

# Boxplot with plotly package
g13 <- plot_ly(x = ~sex, y = ~salary, color = ~sex, data = data, type = "box")
g13 <- div(g13, align = "center")
g13

A swarm plot is created with the ggbeeswarm package, which is an extension to the ggplot2 package.

# Swarm plot with ggbeeswarm
g14 <- ggplot(data, aes(x = sex, y = salary)) +
        geom_beeswarm(aes(col = sex)) +
        xlab("Sex") +
        ylab("Salary") +
        labs(title = "Swarm Plot of Salary with Sex") +
        theme(plot.title = element_text(hjust = 0.5))
g14

Pairs plots are created with:

# Pairs plot with base plotting system
pairs(~ phd + service + salary, data = data)

# Pairs plot with GGally
g15 <- ggpairs(data, columns = c(3, 4, 6), aes(col = sex))
g15

Question 4

Winner of A/B Test

If we calculate the remaining users by chain multiplication of the retention rates given in the A/B test table, the remaining users are:

remaining_control <- 10000 * 0.51 * 0.28 * 0.18 * 0.11
remaining_a <- 10000 * 0.50 * 0.29 * 0.18 * 0.13
remaining_b <- 10000 * 0.51 * 0.29 * 0.18 * 0.10
remaining <- c(Control = remaining_control, A = remaining_a, B = remaining_b)
which.max(remaining)
## A 
## 2

The purchases per user are equal, so the case with the most users remaining at the end of day 14 wins. That is case A.

If the day 8-14 retention rates are instead the final retention rates of the 10000 initial users, case A is the winner again, because it kept 13% of the users, while the control case kept 11% and case B kept 10%.
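
Both readings of the table can be checked side by side. The sketch below stores the four retention rates of each case in the order used above and computes the remaining users under each interpretation; the names retention, remaining_chained, remaining_final, and winners are illustrative.

```r
initial_users <- 10000

# Retention rates per case, in the same order as in the calculation above
retention <- list(
        Control = c(0.51, 0.28, 0.18, 0.11),
        A       = c(0.50, 0.29, 0.18, 0.13),
        B       = c(0.51, 0.29, 0.18, 0.10)
)

# Interpretation 1: each rate applies to the users left from the previous period
remaining_chained <- sapply(retention, function(r) initial_users * prod(r))

# Interpretation 2: the last rate is the share of the initial users still active
remaining_final <- sapply(retention, function(r) initial_users * r[length(r)])

# Name of the case with the most remaining users under each interpretation
winners <- c(chained = names(which.max(remaining_chained)),
             final = names(which.max(remaining_final)))
winners
```

Keeping the rates in one list makes it easy to add further test cases without repeating the multiplication by hand.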

Comments on Winner Case

Case A is the winner of this A/B test, which means that case A has the largest impact on keeping gamers hooked on the game. The changes in case A should be kept in the game as an official update. However, there may be other aspects to the A/B test. We do not know the actual maintenance costs of these cases in this question. Hence, the above results are only valid if the maintenance costs of the improvements in the test cases are assumed to be equal.

Question 5

(IF(B2="Farm Bubbles IOS",VLOOKUP(F2,'CPM 7'!A:AG,2,FALSE),IF(B2="Farm Bubbles Android",VLOOKUP(F2,'CPM 7'!A:AG,5,FALSE))))

Excel evaluates the outer IF() first. For the table row given in this question, the first condition (B2 equals "Farm Bubbles IOS") is FALSE, so Excel moves on to the second IF nested inside the first. The second IF only specifies a value for the TRUE case, because its condition (B2 equals "Farm Bubbles Android") is expected to be TRUE whenever the first condition is FALSE. The VLOOKUP function is then executed: it searches for the country code in cell F2 in column A of the sheet named CPM 7, using exact matching (the FALSE argument), and returns the value from column E (the fifth column of the range A:AG) in the matching row.
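
The same logic can be mirrored in R as a rough analogy. In the sketch below, the cpm7 data frame is a made-up stand-in for the CPM 7 sheet (its country codes and values are invented), and vlookup_exact and lookup_cpm are hypothetical helper names.

```r
# Hypothetical stand-in for the 'CPM 7' sheet: column 1 holds country codes,
# column 2 the iOS value and column 5 the Android value (numbers are made up).
cpm7 <- data.frame(
        A = c("US", "TR", "DE"),
        B = c(10.0, 2.0, 6.0),
        C = NA,
        D = NA,
        E = c(8.0, 1.5, 5.0),
        stringsAsFactors = FALSE
)

# Exact-match lookup, like VLOOKUP(key, table, col_index, FALSE)
vlookup_exact <- function(key, table, col_index) {
        row <- match(key, table[[1]])
        if (is.na(row)) return(NA)  # Excel would show #N/A here
        table[[col_index]][row]
}

# Nested IF: pick the column by application name
lookup_cpm <- function(app, country, table) {
        if (app == "Farm Bubbles IOS") {
                vlookup_exact(country, table, 2)
        } else if (app == "Farm Bubbles Android") {
                vlookup_exact(country, table, 5)
        } else {
                FALSE  # Excel's IF returns FALSE when value_if_false is omitted
        }
}

lookup_cpm("Farm Bubbles Android", "TR", cpm7)
```

The last branch shows what happens when neither condition holds: the Excel formula has no value for that case, so the cell displays FALSE.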

Question 6

We are asked to plot the data contained in a table, which is a multivariate time series. The table was copied into an Excel file and exported as a .csv file. It is then loaded into R.

# Loading the necessary packages for plotting
library(ggplot2)
library(plotly)
# Reading the data into R
data <- read.csv("q6.csv", header = TRUE, sep = ";")

Since the Date column contains dates, it should be converted into the Date class to display it properly on the x-axis of the plot.

# Converting the Date column to Date class
data$Date <- as.Date(data$Date, format = "%d.%m.%Y")

After that, since the dates are in the wrong order, the whole data frame is flipped. Then, to be able to plot the lines, the count data contained in the Total, Organic, and Paid columns are converted into numeric values. The gsub() function is used to remove the commas, and the result is converted with the as.numeric() function.

# Reversing the row order of the data frame, since the dates are in the wrong order.
data <- data[nrow(data):1, ]
# Transforming the columns to numeric class. This is done to be able to plot numeric values.
data$Total <- as.numeric(gsub(",", "", data$Total))
data$Organic <- as.numeric(gsub(",", "", data$Organic))
data$Paid <- as.numeric(gsub(",", "", data$Paid))
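
The cleaning steps can be tried on a small example without the original file. The sketch below uses a made-up data frame in the same layout (all values are invented) and sorts by date with order() instead of flipping the rows, which gives the same result when the dates are in descending order and also works when they are not. The names raw, parse_count, and count_cols are illustrative.

```r
# Hypothetical rows in the same layout as q6.csv (all values are made up)
raw <- data.frame(
        Date = c("03.01.2021", "02.01.2021", "01.01.2021"),
        Total = c("1,500", "1,200", "2,050"),
        Organic = c("900", "700", "1,050"),
        Paid = c("600", "500", "1,000"),
        stringsAsFactors = FALSE
)

# Remove thousands separators and convert to numeric
parse_count <- function(x) as.numeric(gsub(",", "", x, fixed = TRUE))

# Convert the Date column to Date class
raw$Date <- as.Date(raw$Date, format = "%d.%m.%Y")

# Apply the conversion to all count columns at once
count_cols <- c("Total", "Organic", "Paid")
raw[count_cols] <- lapply(raw[count_cols], parse_count)

# Sort chronologically instead of relying on the original row order
raw <- raw[order(raw$Date), ]
raw
```

Sorting by the Date column does not depend on how the table was exported, so it stays correct if new rows are appended in a different order.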

A time series plot of the data is created with the ggplot2 package and converted into plotly form to make the plot interactive.

# Time series plot of the data using ggplot2 package
g16 <- ggplot(data, aes(x = Date, y = Total)) +
        geom_line(aes(col = "Total"), lwd = 1.5) +
        geom_line(aes(y = Organic, col = "Organic"), lwd = 1.5) +
        geom_line(aes(y = Paid, col = "Paid"), lwd = 1.5) +
        scale_x_date(date_labels = "%Y %b %d") +
        xlab("Date") +
        ylab("Install") +
        labs(title = "Time Series Plot of Install Types") +
        theme(plot.title = element_text(hjust = 0.5))
g16_plotly <- ggplotly(g16)
g16_plotly <- div(g16_plotly, align = "center")
g16_plotly