Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?

I will be using an Ice Cream Sales data set to determine whether or not temperature (in Fahrenheit) has an effect on ice cream profits.

Load the data

library(tidyverse)   # for glimpse() and ggplot2
library(plotly)      # for the interactive scatter plot
ice_cream <- read.csv('https://raw.githubusercontent.com/Kingtilon1/DATA607/main/Ice%20Cream%20Sales%20-%20temperatures.csv')
glimpse(ice_cream)
## Rows: 365
## Columns: 2
## $ Temperature       <int> 39, 40, 41, 42, 43, 43, 44, 44, 45, 45, 45, 46, 46, …
## $ Ice.Cream.Profits <dbl> 13.17, 11.88, 18.82, 18.65, 17.02, 15.88, 19.07, 19.…
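
Before plotting, a quick check for missing values and the range of each variable can rule out obvious data problems. This step is my addition and was not part of the original write-up:

# Count missing values per column and summarize both variables
colSums(is.na(ice_cream))
summary(ice_cream)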

Is there some kind of visual correlation?

scatter_plot <- plot_ly(data = ice_cream, x = ~Temperature, y = ~Ice.Cream.Profits, type = "scatter", mode = "markers", 
                        marker = list(color = ~Ice.Cream.Profits, colorscale = "Viridis")) %>%
                layout(title = "Ice Cream Sales vs. Temperature",
                       xaxis = list(title = "Temperature"),
                       yaxis = list(title = "Ice Cream Profits"))

scatter_plot

There is a positive linear relationship between temperature and ice cream profits.
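
To put a number on that visual impression, one could compute the Pearson correlation between the two variables. This check is my addition; the exact value is not reported in the original output, but given the R-squared of 0.977 from the model below it should be close to 0.99:

# Pearson correlation between temperature and profits
cor(ice_cream$Temperature, ice_cream$Ice.Cream.Profits)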

Fit the model

lm_model <- lm(Ice.Cream.Profits ~ Temperature, data = ice_cream)
summary(lm_model)
## 
## Call:
## lm(formula = Ice.Cream.Profits ~ Temperature, data = ice_cream)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9404 -1.3804  0.0956  1.5976  8.7716 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -33.698166   0.702173  -47.99   <2e-16 ***
## Temperature   1.192009   0.009594  124.25   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.427 on 363 degrees of freedom
## Multiple R-squared:  0.977,  Adjusted R-squared:  0.977 
## F-statistic: 1.544e+04 on 1 and 363 DF,  p-value: < 2.2e-16

The fitted model predicts ice cream profits from temperature. The slope for temperature is 1.192, meaning each one-degree Fahrenheit increase in temperature is associated with an increase of roughly 1.19 in profits. The slope is highly statistically significant (p < 2e-16), and the model explains a very large share of the variance in profits (Multiple R-squared = 0.977). The residual summary is roughly centered on zero, a first indication that a linear fit is reasonable; the residual analysis below examines this more closely.
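
To make the slope concrete, the fitted coefficients can be used directly: at, say, 80°F the model predicts roughly -33.70 + 1.19 × 80 ≈ 61.7 in profits. A short sketch of that calculation with predict() follows; the 80-degree value is purely illustrative:

# Predicted profit on a hypothetical 80-degree day, with a 95% prediction interval
predict(lm_model, newdata = data.frame(Temperature = 80), interval = "prediction")

# 95% confidence intervals for the intercept and slope
confint(lm_model)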

Residual analysis

ice_cream$Predicted <- predict(lm_model)

ice_cream$Residuals <- residuals(lm_model)
residual_plot <- ggplot(ice_cream, aes(x = Temperature, y = Residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Residuals vs. Temperature",
       x = "Temperature",
       y = "Residuals")

residual_plot

The residuals appear evenly split between over-predictions and under-predictions, with no obvious pattern across temperatures. That is a good sign according to page 23 of the linear regression text, which says: “A model that fits the data well would tend to over-predict as often as it under-predicts. Thus, if we plot the residual values, we would expect to see them distributed normally around zero for a well-fitted model”. On that basis, the linear model appears appropriate for this data.
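
Since the quoted passage also mentions that residuals should be distributed normally around zero, a couple of additional diagnostics could check that more formally. These are my suggested additions rather than part of the original analysis:

# Q-Q plot of the residuals to check the normality assumption
qqnorm(ice_cream$Residuals)
qqline(ice_cream$Residuals, col = "red")

# Base R residuals-vs-fitted plot to check for non-constant variance
plot(lm_model, which = 1)

# Shapiro-Wilk test as a formal (if conservative) check of normality
shapiro.test(ice_cream$Residuals)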

Resource

Kaggle