Introduction

Suppose you are the CEO of a restaurant franchise and are considering different cities for opening a new outlet. The chain already has food trucks in various cities and you have data for profits and populations from cities. We would like to use this data to help us select which city to expand to next.

The attached txt file contains the dataset for our linear regression problem. The first column is the population of a city and the second column is the profit of a food truck in that city. A negative value for profit indicates a loss. The first column refers to the population size in 10,000s and the second column refers to the profit in $10,000s.

Load Data and View Summaries

data <- read.csv("r-foodex-data.txt", header=F)
names(data) <- c("Population","Profit")
head(data)
##   Population  Profit
## 1     6.1101 17.5920
## 2     5.5277  9.1302
## 3     8.5186 13.6620
## 4     7.0032 11.8540
## 5     5.8598  6.8233
## 6     8.3829 11.8860

Looking at structure and summary of data

str(data)
## 'data.frame':    97 obs. of  2 variables:
##  $ Population: num  6.11 5.53 8.52 7 5.86 ...
##  $ Profit    : num  17.59 9.13 13.66 11.85 6.82 ...
summary(data)
##    Population         Profit      
##  Min.   : 5.027   Min.   :-2.681  
##  1st Qu.: 5.708   1st Qu.: 1.987  
##  Median : 6.589   Median : 4.562  
##  Mean   : 8.160   Mean   : 5.839  
##  3rd Qu.: 8.578   3rd Qu.: 7.047  
##  Max.   :22.203   Max.   :24.147

Plotting and Visualing data

a<-summary(data$Profit)
profit_bucket <- cut(data$Profit, breaks = c(a[1],a[2],a[4],a[5],a[6]), labels = c("1st Quarter","2nd Quarter","3rd Quarter","4th Quarter"))
g<-ggplot(data) + aes(x = Population, y=Profit, col=profit_bucket) + geom_point(stat = "identity")
ggplotly(g)
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`

Looking at outliers in the data

prof_pop <- data$Profit/data$Population
mean_pp <- mean(prof_pop)
sd_pp <- mean(prof_pop)

outliers <- prof_pop>(mean_pp + 2*sd_pp) | prof_pop<(mean_pp - 2*sd_pp)
g<-ggplot(data) + aes(x = Population, y=Profit, col=outliers) + geom_point(stat = "identity")
ggplotly(g)
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`

Regression - 1

lm1 <- lm(Profit ~ Population, data = data)
summ_lm1 <- summary(lm1)
summ_lm1
## 
## Call:
## lm(formula = Profit ~ Population, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.8540 -1.9686 -0.5407  1.5360 14.1982 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.89578    0.71948  -5.415 4.61e-07 ***
## Population   1.19303    0.07974  14.961  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.024 on 95 degrees of freedom
## Multiple R-squared:  0.702,  Adjusted R-squared:  0.6989 
## F-statistic: 223.8 on 1 and 95 DF,  p-value: < 2.2e-16

The R-square value is 0.7020316 and the adjusted R-square is 0.698895. The summary also shows hat the population is a significant variable (which is trivial in the case of single variable linear regression).

The summary also shows that Profit will increase by $11,930.30 for every population increase of 10,000.

g<-ggplot(data) + aes(x = Population, y=Profit) + geom_point(stat = "identity") + geom_abline(slope = 1.19303, intercept = -3.89578, col="blue")
ggplotly(g)
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`

Diagnostic Plot

par(mfrow=c(2,2))

plot(lm1)

Looking at the first plot we can say that the model is linear as it is not showing any pattern. Also, heteroskedasticity is not present.

From QQ plot, we can easily infer that the data is normally distributed.

Test for Auto-correlation (AR(1))

Durbin Watson Test

dwtest(data$Profit ~ data$Population)
## 
##  Durbin-Watson test
## 
## data:  data$Profit ~ data$Population
## DW = 0.9941, p-value = 4.267e-08
## alternative hypothesis: true autocorrelation is greater than 0

The test shows(0<DW<2, p-value is very low) that a positive autocorrelation exists in residuals from regression. Thus the our model is not a very good estimater.