Question

Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?

What are residuals?

A residual in a simple linear regression model is the difference between an observed value of the dependent variable \(y\) and the corresponding fitted value \(\hat y\), that is, \(e_i = y_i - \hat y_i\).
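
As a minimal sketch of this definition, the snippet below fits a simple linear regression on simulated data and checks that the residuals returned by `resid()` equal the observed values minus the fitted values. The variables `x` and `y` and the object `fit` are made up for illustration and are not taken from the ice cream dataset.

```r
# Simulated data for illustration only (not the ice cream dataset)
set.seed(1)
x <- runif(50, min = 0, max = 40)
y <- 10 + 2 * x + rnorm(50, sd = 5)

fit <- lm(y ~ x)

# Residual = observed value minus fitted value
manual_resid <- y - fitted(fit)

# Compare with the residuals R computes (names are dropped before comparing)
all.equal(unname(manual_resid), unname(resid(fit)))
```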

The residuals from a linear regression model should be homoscedastic, meaning their spread is roughly constant across the range of the predictor. If it is not, this indicates a problem with the model, such as non-linearity in the data.
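
One way to check this assumption numerically is a Breusch-Pagan style test, sketched here in base R on simulated data. The data and all object names are assumptions for illustration. The squared residuals are regressed on the predictor; if the predictor explains them, the residual variance is not constant.

```r
# Simulated data with constant error variance (illustration only)
set.seed(2)
n <- 200
x <- runif(n, 0, 40)
y <- 10 + 2 * x + rnorm(n, sd = 5)

fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on the predictor
aux <- lm(resid(fit)^2 ~ x)

# Studentized (Koenker) form of the statistic: n * R^2 of the auxiliary
# regression, compared against a chi-squared distribution with 1 degree of freedom
bp_stat <- n * summary(aux)$r.squared
bp_pval <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

c(statistic = bp_stat, p.value = bp_pval)
```

A small p-value would suggest heteroscedasticity. The lmtest package offers a ready-made version (`bptest`), which is not used here so that the sketch has no dependencies.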

Data Loading and EDA

The dataset has 2 columns and 500 observations. The two columns are Temperature and Revenue.

The data was downloaded from the link below:
https://www.kaggle.com/vinicius150987/ice-cream-revenue

# load the CSV data into R
data <- read.csv("/Users/subhalaxmirout/DATA 605/IceCreamData.csv")
# first 6 rows of data
head(data)
##   Temperature  Revenue
## 1    24.56688 534.7990
## 2    26.00519 625.1901
## 3    27.79055 660.6323
## 4    20.59534 487.7070
## 5    11.50350 316.2402
## 6    14.35251 367.9407

Next, check whether any missing values exist.

# rows with missing values
data[!complete.cases(data),]
## [1] Temperature Revenue    
## <0 rows> (or 0-length row.names)

No missing values are present in the dataset.
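
A per-column count of missing values is another quick check. The sketch below uses a small made-up data frame `toy` (hypothetical, with one NA inserted on purpose) rather than the ice cream data.

```r
# Hypothetical toy data frame with one missing value, for illustration only
toy <- data.frame(
  Temperature = c(20, NA, 30),
  Revenue     = c(400, 500, 600)
)

# Number of missing values in each column
colSums(is.na(toy))

# Rows that contain at least one missing value
toy[!complete.cases(toy), ]

# Total number of incomplete rows
sum(!complete.cases(toy))
```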

Both variables are numeric.

str(data)
## 'data.frame':    500 obs. of  2 variables:
##  $ Temperature: num  24.6 26 27.8 20.6 11.5 ...
##  $ Revenue    : num  535 625 661 488 316 ...

The plot below shows the relationship between Temperature and Revenue.

# scatter plot matrix of all variables
pairs(data)

The two variables are positively correlated. The scatter plot after the data summary shows the relationship between them in more detail.
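
The strength of a positive relationship like this one can also be quantified with a correlation coefficient. The following is a minimal sketch on simulated data; the vectors `temperature` and `revenue` and their generating values are assumptions for illustration, not the ice cream data.

```r
# Simulated data for illustration only (not the ice cream dataset)
set.seed(3)
temperature <- runif(100, 0, 45)
revenue <- 45 + 21 * temperature + rnorm(100, sd = 25)

# Pearson correlation coefficient, between -1 and 1
r <- cor(temperature, revenue)

# Test of the null hypothesis that the true correlation is zero
ct <- cor.test(temperature, revenue)

r
ct$p.value
```

For a simple linear regression with one predictor, the squared correlation equals the model's R-squared.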

The summary of the data shows the distribution of each variable.

summary(data)
##   Temperature       Revenue      
##  Min.   : 0.00   Min.   :  10.0  
##  1st Qu.:17.12   1st Qu.: 405.6  
##  Median :22.39   Median : 529.4  
##  Mean   :22.23   Mean   : 521.6  
##  3rd Qu.:27.74   3rd Qu.: 642.3  
##  Max.   :45.00   Max.   :1000.0

# scatter plot of revenue against temperature
library(ggplot2)
ggplot(aes(x = Temperature, y = Revenue), data = data) +
  geom_point(color = 'dark red') +
  ggtitle('Ice cream Sales') +
  xlab('Outside Air Temperature (DegC)') +
  ylab('Overall daily revenue generated (dollar)') +
  theme(plot.title = element_text(hjust = 0.5))

The car package provides the function powerTransform, which estimates the power (Box-Cox) transformation parameter that makes a variable, or the residuals of a regression, as close to normal as possible. An estimate close to 1 means that no transformation is needed.

library(car)
powerTransform(data$Revenue)
## Estimated transformation parameter 
## data$Revenue 
##    0.9987643
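
To make the idea behind such an estimate concrete, here is a rough base-R sketch of the Box-Cox profile log-likelihood for a single positive variable, maximized with `optimize`. It illustrates the technique and is not the internals of `powerTransform`; the simulated variable `y` and the helper `boxcox_loglik` are hypothetical. The simulated data are log-normal, so the estimate should land near 0, which corresponds to a log transformation.

```r
# Simulated strictly positive, right-skewed data (illustration only)
set.seed(4)
y <- rlnorm(500, meanlog = 3, sdlog = 0.5)

# Box-Cox profile log-likelihood (up to an additive constant) for a given lambda
boxcox_loglik <- function(lambda, y) {
  n <- length(y)
  # Box-Cox transform; the limit as lambda -> 0 is log(y)
  z <- if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
  sigma2 <- mean((z - mean(z))^2)
  -n / 2 * log(sigma2) + (lambda - 1) * sum(log(y))
}

# Maximize over a plausible range of lambda values
opt <- optimize(boxcox_loglik, interval = c(-2, 2), y = y, maximum = TRUE)
lambda_hat <- opt$maximum
lambda_hat
```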

Model Building

Create a linear regression model.

lm <- lm(Revenue ~ Temperature, data = data)
summary(lm)
## 
## Call:
## lm(formula = Revenue ~ Temperature, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -73.303 -15.596  -0.167  16.811  91.294 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  44.8313     3.2718    13.7   <2e-16 ***
## Temperature  21.4436     0.1383   155.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.01 on 498 degrees of freedom
## Multiple R-squared:  0.9797, Adjusted R-squared:  0.9797 
## F-statistic: 2.404e+04 on 1 and 498 DF,  p-value: < 2.2e-16

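The residuals are roughly symmetric around zero (median -0.167, with first and third quartiles of -15.596 and 16.811). The p-value for Temperature is < 0.05, so the coefficient is statistically significant. The multiple R-squared is 0.9797, meaning the model explains about 98% of the variance in Revenue, which indicates a good fit. So, we can write the linear model equation: \[ \widehat{\text{Revenue}} = 44.83 + 21.44 \times \text{Temperature} \]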
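
As a worked example of using this equation, the snippet below plugs a temperature into the coefficients reported in the model summary. The temperature of 25 degrees Celsius and the helper `predict_revenue` are chosen only for illustration.

```r
# Coefficients as reported in the model summary
intercept <- 44.8313
slope     <- 21.4436

# Hypothetical helper: predicted revenue for a given temperature (DegC)
predict_revenue <- function(temperature) {
  intercept + slope * temperature
}

predict_revenue(25)  # 44.8313 + 21.4436 * 25 = 580.9213
```

Each additional degree adds about 21.44 dollars of predicted revenue. With the fitted model object itself, `predict(lm, newdata = data.frame(Temperature = 25))` would give the same kind of prediction without rounding the coefficients.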

plot(data$Temperature,data$Revenue, pch=16,cex=1.3, col="dark red",
     xlab="Temperature",ylab="Revenue",main="Linear regression model")
abline(lm)

Most of the data points fall close to the regression line.

Residual Analysis

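# residuals of the fitted model
residual = resid(lm)
# residuals against the predictor
plot(data$Temperature, residual, xlab="Temperature",
     ylab="Residuals",
     main="Residual Plot")
# reference line at zero
abline(h = 0, col = "blue", lwd=2, lty=2)
# visual bands at +/- 70 to highlight the largest residuals
abline(h = 70, col = "dark red", lwd=2, lty=2)
abline(h = -70, col = "dark red", lwd=2, lty=2)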

hist(residual, col = "steelblue")

qqnorm(residual)
qqline(residual)
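
The visual checks above can be complemented with numeric ones. The following sketch, on simulated data with hypothetical object names, runs a Shapiro-Wilk normality test on the residuals and counts how many standardized residuals fall outside plus or minus 2.

```r
# Simulated data for illustration only (not the ice cream dataset)
set.seed(5)
x <- runif(300, 0, 45)
y <- 45 + 21 * x + rnorm(300, sd = 25)
fit <- lm(y ~ x)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal
sw <- shapiro.test(resid(fit))
sw$p.value

# Standardized residuals; under normality about 5% fall outside +/- 2
std_res <- rstandard(fit)
mean(abs(std_res) > 2)
```

A large Shapiro-Wilk p-value gives no evidence against normality, while a share of large standardized residuals well above 5% would point to heavy tails or outliers.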

From the residual analysis above, we found: