Problem/Task Description

Financial marketing data from 267 start-up companies in the tech sector are analyzed to inform a new tech start-up that is interested in the relationship between advertising spending and revenue.

The Data

The data are stored in a .csv file containing 267 observations. Each observation records a tech company’s advertising spending and the revenue it generated during its first year. Figure 1 shows all 267 observations in a scatterplot.

Figure 1

One observation to note is the red colored point in the bottom right of the scatterplot. This is the following observation:

##       Spend Revenue
## 257 1255897       0

The data point above can be considered an outlier. Perhaps the data entry was incorrect or the .csv file was corrupted, or perhaps a tech start-up truly spent this amount on advertising, earned $0 in revenue, and went bankrupt. In either case, a single point this extreme would distort the fitted relationship between advertising spending and revenue during a tech start-up’s first year, so all further analysis excludes it.
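
The exclusion step can be sketched in R as follows. This is a minimal illustration, not the report's actual data: the first three rows of the toy data frame are invented, and only the last row mirrors the flagged observation. The column names Spend and Revenue match the output above.

```r
# Toy data: the first three rows are invented for illustration;
# the last row mirrors the flagged observation (Spend 1255897, Revenue 0).
toy <- data.frame(
  Spend   = c(400000, 600000, 800000, 1255897),
  Revenue = c(570000, 790000, 1010000, 0)
)

# Flag rows with zero revenue, inspect them, then drop them.
is_outlier <- toy$Revenue == 0
print(toy[is_outlier, ])

clean <- toy[!is_outlier, ]
nrow(clean)
```

Filtering on Revenue == 0 is the same rule used in the code in Appendix B; it assumes that zero revenue occurs only for the flagged point.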

Relationship between Advertising Spending and Revenue in 1st year tech start-ups

From Figure 1 above we can see an approximately linear relationship between Revenue and Spending: on average, revenue increased linearly as spending increased. We therefore calculate the least squares regression line for this data.

Figure 2

In Figure 2 we see that the fitted line has a slope of 1.09. We interpret the slope as follows: for each additional dollar of advertising spending, we expect revenue to increase by 1.09 dollars on average. Our model is valid only for tech companies that spend between 340,098 and 926,768 dollars on advertising in their first year, which is the range of spending in the data used for the fit. No inference can be made with the current model for tech companies that spend less or more than those amounts.
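
The slope interpretation and the range restriction can be made concrete with a small sketch. The helper predict_revenue is hypothetical (it is not part of the report's code), and the coefficients are the rounded values quoted in the text, so results are approximate.

```r
# Rounded coefficients and observed spending range, as reported in the text.
slope <- 1.09
intercept <- 134559
spend_min <- 340098
spend_max <- 926768

# Hypothetical helper: point prediction inside the observed range,
# NA outside it (the model should not be used to extrapolate).
predict_revenue <- function(spend) {
  ifelse(spend >= spend_min & spend <= spend_max,
         slope * spend + intercept,
         NA_real_)
}

# One extra dollar of spending changes predicted revenue by the slope.
predict_revenue(600001) - predict_revenue(600000)

# Spending amounts outside the observed range are not predicted.
predict_revenue(c(300000, 1000000))
```

Returning NA rather than a number is one possible design choice; it makes accidental extrapolation visible instead of silently producing a value.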

For completeness we include an analysis of the validity of the Least Squares Regression model. See Appendix A.

Spending $500,000 or $700,000 on advertising

We will use our linear model \(Revenue = 1.09 \times Spent + 134{,}559\) to compute prediction intervals for Revenue at 500,000 and 700,000 dollars Spent, respectively. The intervals are at the 95% level, which is the default of R's predict function.
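
As a rough sketch of how such intervals are obtained, the example below fits a line to simulated data and asks predict for prediction intervals. The data are a stand-in, not the real file: the sample size (266, i.e. 267 minus the excluded point), the noise standard deviation, and the seed are all assumptions, so the printed bounds will not match Table 1. With the rounded coefficients, the point predictions are about 679,559 and 897,559 dollars.

```r
set.seed(1)  # arbitrary seed, only to make the simulation reproducible

# Point predictions from the rounded coefficients quoted in the text.
point_pred <- 1.09 * c(500000, 700000) + 134559
print(point_pred)

# Simulated stand-in data (assumption): spending inside the reported range,
# revenue from the rounded line plus normal noise of an assumed size.
n <- 266
spend <- runif(n, min = 340098, max = 926768)
revenue <- 134559 + 1.09 * spend + rnorm(n, sd = 95000)

fit <- lm(revenue ~ spend)

# Prediction intervals; level = 0.95 is the default of predict.lm.
new_spend <- data.frame(spend = c(500000, 700000))
pi95 <- predict(fit, newdata = new_spend, interval = "prediction", level = 0.95)
print(pi95)  # columns: fit, lwr, upr
```

A prediction interval describes where the revenue of a single new company is expected to fall, which is why it is wider than a confidence interval for the mean revenue at the same spending level.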

Figure 3

The lower and upper prediction bounds for 500k and 700k spending are plotted as red and green diamonds, respectively, in Figure 3. See Table 1 below for the exact lower and upper revenue bounds one could expect from 500k and 700k spending. The columns Lower.Profit and Upper.Profit contain the corresponding profit, calculated as Profit = Revenue - Advertisement Spending.

Table 1:

##   Amount Lower.Rev Upper.Rev Lower.Profit Upper.Profit
## 1   500k  490347.1  869659.6    -9652.861     369659.6
## 2   700k  708591.9 1087770.0     8591.943     387770.0

It appears that spending 700k is the sounder strategy, since the lower bound on profit would still be positive (about 8.6k) and the upper bound would be about 388k. Compare these profits to the bounds for a strategy of spending 500k: the lower bound on profit would be negative (about -9.7k), and the upper bound would be about 370k, which is less than the upper bound for the 700k strategy.
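
The profit figures in this comparison follow from Table 1 by subtracting the spending from each revenue bound. A short sketch of that arithmetic, with the revenue bounds copied from Table 1 (variable names here are illustrative):

```r
# Revenue bounds copied from Table 1.
amount    <- c("500k", "700k")
spending  <- c(500000, 700000)
lower_rev <- c(490347.1, 708591.9)
upper_rev <- c(869659.6, 1087770.0)

# Profit = Revenue - Advertisement Spending, applied to each bound.
profit <- data.frame(
  Amount       = amount,
  Lower.Profit = lower_rev - spending,
  Upper.Profit = upper_rev - spending
)
print(profit)
```

Because the revenue bounds in Table 1 are rounded to one decimal place, the recomputed profits agree with the table only up to that rounding.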

One important caveat applies to this claim:

It is possible that the tech companies that were willing to spend more than the sample median spending amount of 603k in their first year had reason to believe that their revenues would exceed that amount. Without other variables to analyze, it is impossible to know which factors may have influenced their decision.

Perhaps the new tech start-up’s accounting and budgeting departments should review revenues quarterly and adjust advertisement spending accordingly.

Appendix A: LS Regression Analysis

In this section we will analyze the validity of our least squares regression line that we used to model the relationship between Revenue and Spending.

Figure 1

In Figure 1 (top left) we can see that the average value of our residuals is approximately 0 and that they are evenly distributed above and below this mean value of 0. This supports assumption (1) as stated in Lecture, namely \(E(\epsilon|X) = 0\) (the errors have average value 0).

In Figure 1 (bottom left) we can see that the standardized residuals, plotted against the fitted values, are also evenly spread and that their spread does not grow (or shrink) with the fitted values \(\hat{y}\). This supports assumption (2) as stated in Lecture, namely \(Var(Y|X) = \sigma^{2}_\epsilon\) (our data is homoscedastic).

In Figure 1 (top right) we can see that the residuals are approximately normally distributed. If we are to be critical, we may argue that the tails are slightly heavier than those of a normal distribution, which means we have more data in the extremes than a normal distribution would produce, so extreme events are somewhat more likely in this tech start-up marketing data. However, in my professional opinion this does not violate assumption (3) from Lecture, namely \(\epsilon|X \sim Normal\) (the errors are normally distributed).

Lastly, we assume that the observations are independent. In other words, we assume that the marketing data of one company does not affect another company’s marketing data. Formally, \(Cov(Y_i, Y_j|X) = 0\) for \(i \neq j\).

Since all four assumptions needed for Least Squares Linear Regression appear to hold, we consider our model valid.
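
The visual checks above can be complemented with a few numeric summaries. The sketch below is only an illustration on simulated data (the sample size, noise level, and seed are assumptions, and the half-split variance comparison is an informal device, not a method from Lecture). Independence cannot be checked from the residuals alone and remains an assumption about how the data were collected.

```r
set.seed(2)  # arbitrary seed for the simulated example

# Simulated stand-in data (assumption): linear signal plus normal noise.
spend <- runif(266, min = 340098, max = 926768)
revenue <- 134559 + 1.09 * spend + rnorm(266, sd = 95000)
fit <- lm(revenue ~ spend)
res <- residuals(fit)

# (1) Mean-zero errors: with an intercept, least squares residuals average
# to about 0 by construction, so the residuals-vs-fitted plot is the more
# informative check for patterns.
mean(res)

# (2) Constant variance: compare the residual spread in the lower and upper
# halves of the fitted values; a ratio near 1 is consistent with
# homoscedasticity.
lower_half <- fitted(fit) <= median(fitted(fit))
sd(res[lower_half]) / sd(res[!lower_half])

# (3) Normality: Shapiro-Wilk test on the residuals; a small p-value would
# be evidence against normality.
shapiro.test(res)$p.value
```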

Appendix B: Code used

# Setup
knitr::opts_chunk$set(echo = TRUE)
library(formatR)

# Figure 1: scatterplot of all observations; the zero-revenue point is drawn in red
data <- read.csv("hw2.csv")
plot(data$Spend, data$Revenue, xlab = "Advertising Spending", ylab = "Revenue in 1st Year",
    main = "Scatterplot of Advertising Spending and Revenue within 1st Year",
    col = ifelse(data$Revenue == 0, "red", "black"))

# Print the zero-revenue observation (the outlier)
data[data$Revenue == 0, ]

# Fit the least squares line on the data without the outlier
spend = data$Spend[data$Revenue != 0]
revenue = data$Revenue[data$Revenue != 0]
model1 = lm(revenue ~ spend)
intercept = as.numeric(coef(model1)[1])
slope = as.numeric(coef(model1)[2])

# Figure 2: data with the fitted regression line
options(scipen = 5)
plot(spend, revenue, xlab = "Advertising Spending in 1st Year", ylab = "Revenue in 1st Year",
    main = "Revenue ~ 1.09*Spent + 134,559")
lines(spend, fitted(model1), col = "blue")
# text(x=8.3e+5, y=700000, labels=paste('rev=', round(slope, 2),'*spend + ', round(intercept,2), sep=''))
# text(x=8.3e+5, y=600000, labels=paste('for Advertising Spent between '))
# text(x=8.3e+5, y=500000, labels=paste('$', floor(min(spend)), ' and $', ceiling(max(spend))))

# Prediction intervals (default 95% level) around 500k and at exactly 500k
newdata5 = data.frame(spend = seq(475000, 525000, length.out = 30))
predict.interval5 = predict(model1, newdata5, interval = "prediction")
newdata5.only = data.frame(spend = 500000)
predict.interval5.only = predict(model1, newdata5.only, interval = "prediction")

# Prediction intervals around 700k and at exactly 700k
newdata7 = data.frame(spend = seq(675000, 725000, length.out = 30))
predict.interval7 = predict(model1, newdata7, interval = "prediction")
newdata7.only = data.frame(spend = 700000)
predict.interval7.only = predict(model1, newdata7.only, interval = "prediction")

# Figure 3: data, fitted line, and prediction bounds (column 2 = lower, column 3 = upper)
options(scipen = 5)
plot(spend, revenue, xlab = "Advertising Spending", ylab = "Revenue in 1st Year",
    main = "Prediction Intervals for Spent = 500k & 700k")
lines(spend, fitted(model1), col = "blue")
lines(newdata5[, 1], predict.interval5[, 2], col = "orange", lty = 2)
lines(newdata5[, 1], predict.interval5[, 3], col = "orange", lty = 2)
points(500000, predict.interval5.only[, 2], pch = 23, col = "red")
points(500000, predict.interval5.only[, 3], pch = 23, col = "red")
lines(newdata7[, 1], predict.interval7[, 2], col = "green", lty = 2)
lines(newdata7[, 1], predict.interval7[, 3], col = "green", lty = 2)
points(700000, predict.interval7.only[, 2], pch = 23, col = "green")
points(700000, predict.interval7.only[, 3], pch = 23, col = "green")

# Table 1: revenue bounds and the implied profit (revenue minus spending)
fivelow = as.numeric(predict.interval5.only[, 2])
fiveup = as.numeric(predict.interval5.only[, 3])
sevenlow = as.numeric(predict.interval7.only[, 2])
sevenup = as.numeric(predict.interval7.only[, 3])

FIVE = 500000
SEVEN = 700000
Lower.Rev = c(fivelow, sevenlow)
Upper.Rev = c(fiveup, sevenup)
Lower.Profit = c(fivelow - FIVE, sevenlow - SEVEN)
Upper.Profit = c(fiveup - FIVE, sevenup - SEVEN)
Amount = c("500k", "700k")
df = data.frame(Amount, Lower.Rev, Upper.Rev, Lower.Profit, Upper.Profit)
print(df)

# median(data$Spend)

# Appendix A: regression diagnostic plots in a 2 x 2 grid
par(mfrow = c(2, 2))
plot(model1)