Suppose you are the CEO of a restaurant franchise and are considering different cities for opening a new outlet. The chain already has food trucks in various cities and you have data for profits and populations from cities. We would like to use this data to help us select which city to expand to next.
The attached txt file contains the dataset for our linear regression problem. The first column is the population of a city and the second column is the profit of a food truck in that city. A negative value for profit indicates a loss. The first column refers to the population size in 10,000s and the second column refers to the profit in $10,000s.
data <- read.csv("r-foodex-data.txt", header=F)
names(data) <- c("Population","Profit")
head(data)
## Population Profit
## 1 6.1101 17.5920
## 2 5.5277 9.1302
## 3 8.5186 13.6620
## 4 7.0032 11.8540
## 5 5.8598 6.8233
## 6 8.3829 11.8860
Looking at structure and summary of data
str(data)
## 'data.frame': 97 obs. of 2 variables:
## $ Population: num 6.11 5.53 8.52 7 5.86 ...
## $ Profit : num 17.59 9.13 13.66 11.85 6.82 ...
summary(data)
## Population Profit
## Min. : 5.027 Min. :-2.681
## 1st Qu.: 5.708 1st Qu.: 1.987
## Median : 6.589 Median : 4.562
## Mean : 8.160 Mean : 5.839
## 3rd Qu.: 8.578 3rd Qu.: 7.047
## Max. :22.203 Max. :24.147
a<-summary(data$Profit)
profit_bucket <- cut(data$Profit, breaks = c(a[1],a[2],a[4],a[5],a[6]), labels = c("1st Quarter","2nd Quarter","3rd Quarter","4th Quarter"))
g<-ggplot(data) + aes(x = Population, y=Profit, col=profit_bucket) + geom_point(stat = "identity")
ggplotly(g)
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
Looking at outliers in the data
prof_pop <- data$Profit/data$Population
mean_pp <- mean(prof_pop)
sd_pp <- mean(prof_pop)
outliers <- prof_pop>(mean_pp + 2*sd_pp) | prof_pop<(mean_pp - 2*sd_pp)
g<-ggplot(data) + aes(x = Population, y=Profit, col=outliers) + geom_point(stat = "identity")
ggplotly(g)
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
lm1 <- lm(Profit ~ Population, data = data)
summ_lm1 <- summary(lm1)
summ_lm1
##
## Call:
## lm(formula = Profit ~ Population, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.8540 -1.9686 -0.5407 1.5360 14.1982
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.89578 0.71948 -5.415 4.61e-07 ***
## Population 1.19303 0.07974 14.961 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.024 on 95 degrees of freedom
## Multiple R-squared: 0.702, Adjusted R-squared: 0.6989
## F-statistic: 223.8 on 1 and 95 DF, p-value: < 2.2e-16
The R-square value is 0.7020316 and the adjusted R-square is 0.698895. The summary also shows hat the population is a significant variable (which is trivial in the case of single variable linear regression).
The summary also shows that Profit will increase by $11,930.30 for every population increase of 10,000.
g<-ggplot(data) + aes(x = Population, y=Profit) + geom_point(stat = "identity") + geom_abline(slope = 1.19303, intercept = -3.89578, col="blue")
ggplotly(g)
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
par(mfrow=c(2,2))
plot(lm1)
Looking at the first plot we can say that the model is linear as it is not showing any pattern. Also, heteroskedasticity is not present.
From QQ plot, we can easily infer that the data is normally distributed.
Durbin Watson Test
dwtest(data$Profit ~ data$Population)
##
## Durbin-Watson test
##
## data: data$Profit ~ data$Population
## DW = 0.9941, p-value = 4.267e-08
## alternative hypothesis: true autocorrelation is greater than 0
The test shows(0<DW<2, p-value is very low) that a positive autocorrelation exists in residuals from regression. Thus the our model is not a very good estimater.