Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?

I will be using an Ice Cream Sales data set to determine whether or not temperature (in Fahrenheit) has an effect on ice cream profits.

Load the data

library(tidyverse)   # for glimpse() and ggplot2
library(plotly)      # for the interactive scatter plot
ice_cream <- read.csv('https://raw.githubusercontent.com/Kingtilon1/DATA607/main/Ice%20Cream%20Sales%20-%20temperatures.csv')
glimpse(ice_cream)
## Rows: 365
## Columns: 2
## $ Temperature       <int> 39, 40, 41, 42, 43, 43, 44, 44, 45, 45, 45, 46, 46, …
## $ Ice.Cream.Profits <dbl> 13.17, 11.88, 18.82, 18.65, 17.02, 15.88, 19.07, 19.…
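
Before plotting, a quick check for missing values and the range of each variable can rule out obvious data problems. This step is my addition and was not part of the original write-up:

# Count missing values per column and summarize both variables
colSums(is.na(ice_cream))
summary(ice_cream)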

Is there some kind of visual correlation?

scatter_plot <- plot_ly(data = ice_cream, x = ~Temperature, y = ~Ice.Cream.Profits, type = "scatter", mode = "markers", 
                        marker = list(color = ~Ice.Cream.Profits, colorscale = "Viridis")) %>%
                layout(title = "Ice Cream Sales vs. Temperature",
                       xaxis = list(title = "Temperature"),
                       yaxis = list(title = "Ice Cream Profits"))

scatter_plot

There is a positive linear relationship between temperature and ice cream profits.
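
To put a number on that visual impression, one could compute the Pearson correlation between the two variables. This check is my addition; the exact value is not reported in the original output, but given the R-squared of 0.977 from the model below it should be close to 0.99:

# Pearson correlation between temperature and profits
cor(ice_cream$Temperature, ice_cream$Ice.Cream.Profits)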

Fit the model

lm_model <- lm(Ice.Cream.Profits ~ Temperature, data = ice_cream)
summary(lm_model)
## 
## Call:
## lm(formula = Ice.Cream.Profits ~ Temperature, data = ice_cream)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9404 -1.3804  0.0956  1.5976  8.7716 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -33.698166   0.702173  -47.99   <2e-16 ***
## Temperature   1.192009   0.009594  124.25   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.427 on 363 degrees of freedom
## Multiple R-squared:  0.977,  Adjusted R-squared:  0.977 
## F-statistic: 1.544e+04 on 1 and 363 DF,  p-value: < 2.2e-16

The fitted model predicts ice cream profits from temperature. The slope for temperature is 1.192, meaning each one-degree Fahrenheit increase in temperature is associated with an increase of roughly 1.19 in profits. The slope is highly statistically significant (p < 2e-16), and the model explains a very large share of the variance in profits (Multiple R-squared = 0.977). The residual summary is roughly centered on zero, a first indication that a linear fit is reasonable; the residual analysis below examines this more closely.
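
To make the slope concrete, the fitted coefficients can be used directly: at, say, 80°F the model predicts roughly -33.70 + 1.19 × 80 ≈ 61.7 in profits. A short sketch of that calculation with predict() follows; the 80-degree value is purely illustrative:

# Predicted profit on a hypothetical 80-degree day, with a 95% prediction interval
predict(lm_model, newdata = data.frame(Temperature = 80), interval = "prediction")

# 95% confidence intervals for the intercept and slope
confint(lm_model)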

Residual analysis

ice_cream$Predicted <- predict(lm_model)

ice_cream$Residuals <- residuals(lm_model)
residual_plot <- ggplot(ice_cream, aes(x = Temperature, y = Residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Residuals vs. Temperature",
       x = "Temperature",
       y = "Residuals")

residual_plot

The residuals appear evenly split between over-predictions and under-predictions, with no obvious pattern across temperatures. That is a good sign according to page 23 of the linear regression text, which says: “A model that fits the data well would tend to over-predict as often as it under-predicts. Thus, if we plot the residual values, we would expect to see them distributed normally around zero for a well-fitted model”. On that basis, the linear model appears appropriate for this data.
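
Since the quoted passage also mentions that residuals should be distributed normally around zero, a couple of additional diagnostics could check that more formally. These are my suggested additions rather than part of the original analysis:

# Q-Q plot of the residuals to check the normality assumption
qqnorm(ice_cream$Residuals)
qqline(ice_cream$Residuals, col = "red")

# Base R residuals-vs-fitted plot to check for non-constant variance
plot(lm_model, which = 1)

# Shapiro-Wilk test as a formal (if conservative) check of normality
shapiro.test(ice_cream$Residuals)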

Resource

Kaggle