SRout Discussion 12

Question

Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?

What are residuals?

In a simple linear regression model, the residuals are the differences between the observed values of the dependent variable y and the fitted values \(\hat y\).

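The residuals from a linear regression model should be homoscedastic, that is, they should have roughly constant variance across the range of the predictor. They should also scatter randomly around zero and be approximately normally distributed. A violation of these conditions indicates an issue with the model; a curved pattern in the residuals, for example, suggests non-linearity in the data.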
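
As a minimal sketch of this definition, the snippet below computes residuals by hand and compares them with R's resid(). The data is simulated purely for illustration (the variable names, seed, and coefficients are assumptions, not the ice cream dataset).

```r
# Simulated data (an assumption for illustration; not the ice cream dataset)
set.seed(1)
x <- runif(100, min = 0, max = 40)
y <- 50 + 20 * x + rnorm(100, sd = 25)

fit <- lm(y ~ x)

# Residual = observed value minus fitted value
manual_residuals <- y - fitted(fit)

# resid() returns the same quantity
same <- all.equal(unname(manual_residuals), unname(resid(fit)))
print(same)

# With an intercept in the model, least-squares residuals sum to zero
# (up to floating-point error)
print(abs(sum(manual_residuals)) < 1e-8)
```

The zero-sum property is why residual plots are read relative to a horizontal line at 0.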

Data Loading and EDA

The dataset has 2 columns and 500 observations. The two columns are:

Independent variable X: Outside Air Temperature
Dependent variable Y: Overall daily revenue generated in dollars

The data was downloaded from the link below:
https://www.kaggle.com/vinicius150987/ice-cream-revenue

# load the CSV data into R
data <- read.csv("/Users/subhalaxmirout/DATA 605/IceCreamData.csv")
# first 6 rows of the data
head(data)

##   Temperature  Revenue
## 1    24.56688 534.7990
## 2    26.00519 625.1901
## 3    27.79055 660.6323
## 4    20.59534 487.7070
## 5    11.50350 316.2402
## 6    14.35251 367.9407

Next, we check whether any missing values exist.

# rows with at least one missing value
data[!complete.cases(data), ]

## [1] Temperature Revenue    
## <0 rows> (or 0-length row.names)

No missing values are present in the dataset.
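
A per-column count of missing values is another way to run the same check. The sketch below uses a small hypothetical data frame (toy, with one NA inserted on purpose) rather than the real data, so that the behaviour of both checks is visible.

```r
# Hypothetical toy data frame with one missing value (not the ice cream data)
toy <- data.frame(
  Temperature = c(24.6, NA, 27.8),
  Revenue     = c(535, 625, 661)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Rows containing at least one missing value
incomplete_rows <- toy[!complete.cases(toy), ]
print(incomplete_rows)
```

Applied to a data frame without missing values, colSums(is.na(...)) returns all zeros and the row subset has 0 rows.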

Both variables are numeric.

str(data)

## 'data.frame':    500 obs. of  2 variables:
##  $ Temperature: num  24.6 26 27.8 20.6 11.5 ...
##  $ Revenue    : num  535 625 661 488 316 ...

The plot below shows the relationship between Temperature and Revenue.

# scatterplot matrix of the variables
pairs(data)

The two variables are positively correlated. A scatter plot further below shows the relationship between them in more detail.
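
The strength of a positive relationship like this can also be quantified with the Pearson correlation coefficient. The sketch below uses simulated data (the seed, sample size, and coefficients are assumptions chosen only to resemble a temperature/revenue relationship) and also checks that, in simple linear regression, the squared correlation equals the model's R-squared.

```r
# Simulated data (an assumption for illustration; not the ice cream dataset)
set.seed(42)
temperature <- runif(200, min = 0, max = 45)
revenue <- 45 + 21 * temperature + rnorm(200, sd = 25)

# Pearson correlation between the two variables
r <- cor(temperature, revenue)
print(r)

# In simple linear regression, r^2 equals the multiple R-squared
fit <- lm(revenue ~ temperature)
r_squared_matches <- all.equal(r^2, summary(fit)$r.squared)
print(r_squared_matches)
```

On the real data the equivalent call would be cor(data$Temperature, data$Revenue).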

The summary of the data shows the distribution of each variable.

summary(data)

##   Temperature       Revenue      
##  Min.   : 0.00   Min.   :  10.0  
##  1st Qu.:17.12   1st Qu.: 405.6  
##  Median :22.39   Median : 529.4  
##  Mean   :22.23   Mean   : 521.6  
##  3rd Qu.:27.74   3rd Qu.: 642.3  
##  Max.   :45.00   Max.   :1000.0

library(ggplot2)
ggplot(aes(x = Temperature, y = Revenue), data = data) +
  geom_point(color = 'dark red') +
  ggtitle('Ice cream Sales') +
  xlab('Outside Air Temperature (DegC)') +
  ylab('Overall daily revenue generated (dollar)') +
  theme(plot.title = element_text(hjust = 0.5))

In the car package, the function powerTransform estimates the Box-Cox power transformation that makes a variable (or the residuals of a regression) as close to normal as possible. An estimated power close to 1 means that no transformation is needed.

library(car)
powerTransform(data$Revenue)

## Estimated transformation parameter 
## data$Revenue 
##    0.9987643

Model building

Create a linear regression model.

lm = lm(Revenue ~ Temperature, data = data)
summary(lm)

## 
## Call:
## lm(formula = Revenue ~ Temperature, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -73.303 -15.596  -0.167  16.811  91.294 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  44.8313     3.2718    13.7   <2e-16 ***
## Temperature  21.4436     0.1383   155.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.01 on 498 degrees of freedom
## Multiple R-squared:  0.9797, Adjusted R-squared:  0.9797 
## F-statistic: 2.404e+04 on 1 and 498 DF,  p-value: < 2.2e-16

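The residuals are distributed fairly symmetrically around zero (median -0.167, with first and third quartiles of -15.6 and 16.8). The p-value for Temperature is < 2e-16, well below 0.05, so the slope is statistically significant. The multiple R-squared of 0.9797 means the model explains about 98% of the variance in Revenue, which indicates a good fit. So, we can write the linear model equation: \[ \widehat{Revenue} = 44.83 + 21.44 \times Temperature \]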
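
As a short worked example of using this equation, the sketch below plugs a temperature into the fitted line. The coefficients are copied from the summary output above; predict_revenue is a hypothetical helper name, and 30 degrees is an arbitrary input chosen for illustration.

```r
# Coefficients copied from the summary(lm) output
intercept <- 44.8313
slope     <- 21.4436

# Hypothetical helper: predicted revenue (dollars) at a temperature (deg C)
predict_revenue <- function(temperature) {
  intercept + slope * temperature
}

# Predicted revenue at 30 deg C: 44.8313 + 21.4436 * 30 = 688.1393
print(predict_revenue(30))

# Each additional degree adds the slope to the prediction
print(predict_revenue(31) - predict_revenue(30))
```

With the fitted model object available, predict(lm, newdata = data.frame(Temperature = 30)) should give the same value up to rounding of the coefficients.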

plot(data$Temperature,data$Revenue, pch=16,cex=1.3, col="dark red",
     xlab="Temperature",ylab="Revenue",main="Linear regression model")
abline(lm)

Most of the data points lie close to the regression line.

Residual Analysis

residual = resid(lm)
plot(data$Temperature, residual, xlab="Temperature",
     ylab="Residuals",
     main="Residual Plot" ) 
abline(h = 0, col = "blue", lwd=2, lty=2)
abline(h = 70, col = "dark red", lwd=2, lty=2)
abline(h = -70, col = "dark red", lwd=2, lty=2)

hist(residual, col = "steelblue")

qqnorm(residual)
qqline(residual)

From the residual analysis above, we found:

The residual plot shows that the spread of the residuals remains roughly constant over the entire range of the explanatory variable. Most of the residuals fall between y = -70 and y = 70, centered on y = 0.
The histogram of the residuals is unimodal and symmetric.
The Q-Q plot shows that most observations fall on the line, with minimal deviation.

These results indicate that the linear model is appropriate for this data.
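
The visual checks above can be complemented by formal tests. The sketch below is one possible approach, run on simulated data (the seed and coefficients are assumptions, not the ice cream dataset): a Shapiro-Wilk test for normality of the residuals, and a hand-computed studentized Breusch-Pagan-style statistic for constant variance, obtained by regressing the squared residuals on the predictor.

```r
# Simulated data (an assumption for illustration; not the ice cream dataset)
set.seed(123)
n <- 500
temperature <- runif(n, min = 0, max = 45)
revenue <- 45 + 21 * temperature + rnorm(n, sd = 25)

fit <- lm(revenue ~ temperature)
res <- resid(fit)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal
sw <- shapiro.test(res)
print(sw$p.value)

# Studentized Breusch-Pagan-style check, computed by hand:
# regress squared residuals on the predictor; under constant variance,
# n * R^2 is approximately chi-squared with 1 degree of freedom
aux <- lm(res^2 ~ temperature)
bp_stat <- n * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
print(bp_p)
```

Large p-values in both checks would be consistent with normal, homoscedastic residuals; small p-values would call for a closer look at the plots. With 500 observations these tests can flag small departures, so they are best read alongside the plots rather than in place of them.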