Simple Regression

#Q1) When should we use regression instead of ANOVA?

#A1) Regression and ANOVA are both statistical methods used to analyze the relationship between a dependent variable and one or more independent variables. However, they differ in ways that can guide the choice of method in a particular situation.

#Regression is typically used when the goal is to predict or model the value of a dependent variable based on one or more independent variables. In regression, the focus is on estimating the parameters of a model that describes the relationship between the variables. Regression is often used in fields like finance, economics, and marketing to analyze and forecast trends in data.

#ANOVA, on the other hand, is typically used when the goal is to compare the means of two or more groups on a dependent variable. ANOVA tests whether the group means differ by more than would be expected from random variation alone. ANOVA is often used in fields like psychology, education, and medicine to compare the effectiveness of different treatments or interventions.

#In summary, we should use regression when the goal is to model or predict a dependent variable based on one or more independent variables, and should use ANOVA when the goal is to compare the means of two or more groups on a dependent variable.
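
#As a rough illustration of this distinction, the sketch below uses two datasets that ship with base R (cars and PlantGrowth; neither is part of this assignment). The first fit is a regression with a numeric predictor, and the second is a one-way ANOVA with a categorical predictor. Refitting the ANOVA with lm() should give the same F-statistic, since both methods are linear models underneath.

```r
# Regression: a numeric predictor (speed) modelling a numeric outcome (dist).
reg_fit <- lm(dist ~ speed, data = cars)
coef(reg_fit)  # intercept and slope

# One-way ANOVA: a categorical predictor (group) with three levels.
aov_fit <- aov(weight ~ group, data = PlantGrowth)
aov_f <- summary(aov_fit)[[1]][["F value"]][1]

# The same group comparison fitted as a linear model with a factor predictor.
lm_fit <- lm(weight ~ group, data = PlantGrowth)
lm_f <- anova(lm_fit)[["F value"]][1]

# The two F-statistics should agree up to floating-point rounding.
all.equal(aov_f, lm_f)
```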

#Q2) Please explain the relationship between SStotal, SSregression, and SSerror.

#A2) In regression analysis, the total sum of squares (SStotal) is the sum of the squared differences between each observation of the dependent variable and the overall mean of the dependent variable. The total sum of squares is divided into two components: the sum of squares due to regression (SSregression) and the sum of squares due to error (SSerror).

#The sum of squares due to regression (SSregression) is the sum of the squared differences between the predicted values of the dependent variable (obtained from the regression equation) and the overall mean of the dependent variable. SSregression represents the variation in the dependent variable that is explained by the regression model. In other words, SSregression measures the amount of variation in the dependent variable that can be attributed to the independent variables in the regression model.

#The sum of squares due to error (SSerror) is the sum of the squared differences between the observed values of the dependent variable and the predicted values of the dependent variable (obtained from the regression equation). SSerror represents the variation in the dependent variable that is not explained by the regression model. In other words, SSerror measures the amount of variation in the dependent variable that is due to random error or other factors not accounted for by the independent variables in the regression model.

#The relationship between SStotal, SSregression, and SSerror is given by the following equation:

#SStotal = SSregression + SSerror

#In other words, the total variation in the dependent variable (SStotal) can be partitioned into two components: the variation explained by the regression model (SSregression) and the unexplained variation (SSerror). The proportion of variation explained by the regression model can be quantified by the coefficient of determination (R-squared), which is defined as:

#R-squared = SSregression / SStotal

#Thus, for a least-squares model that includes an intercept, R-squared ranges from 0 to 1, with higher values indicating that the regression model explains a larger share of the variation in the dependent variable.
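
#The partition described above can be checked numerically. The sketch below uses the sugar and calories data from Q3 and computes each sum of squares directly from its definition. The variable names (fit, ss_total, ss_regression, ss_error, r_squared) are chosen here for illustration.

```r
# Data from Q3 of this assignment.
sugar <- c(5, 8, 9, 10, 15, 18, 14, 17, 20, 22, 24, 26, 30, 30, 32)
calories <- c(20, 30, 60, 70, 100, 95, 70, 83, 103, 112, 130, 80, 95, 130, 112)

fit <- lm(calories ~ sugar)

ss_total      <- sum((calories - mean(calories))^2)     # observed vs. mean
ss_regression <- sum((fitted(fit) - mean(calories))^2)  # predicted vs. mean
ss_error      <- sum((calories - fitted(fit))^2)        # observed vs. predicted

# The partition SStotal = SSregression + SSerror holds up to rounding.
all.equal(ss_total, ss_regression + ss_error)

# R-squared from the sums of squares matches the value reported by lm().
r_squared <- ss_regression / ss_total
all.equal(r_squared, summary(fit)$r.squared)
```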

#Q3) Please use the following data to build a regression model and write a summary. IV is sugar and DV is calories.

#Sugar: 5, 8, 9, 10, 15, 18, 14, 17, 20, 22, 24, 26, 30, 30, 32

#Calories: 20, 30, 60, 70, 100, 95, 70, 83, 103, 112, 130, 80, 95, 130, 112

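# Enter the data from Q3.
sugar <- c(5, 8, 9, 10, 15, 18, 14, 17, 20, 22, 24, 26, 30, 30, 32)
calories <- c(20, 30, 60, 70, 100, 95, 70, 83, 103, 112, 130, 80, 95, 130, 112)

# Inspect the relationship visually before fitting.
plot(sugar, calories, main = "Scatter plot of sugar and calories", xlab = "Sugar", ylab = "Calories")

# Fit a simple linear regression of calories on sugar and print the summary.
model <- lm(calories ~ sugar)
summary(model)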

## 
## Call:
## lm(formula = calories ~ sugar)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -28.332 -19.060   3.438  11.985  27.758 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  29.1542    12.4132   2.349 0.035315 *  
## sugar         3.0453     0.6074   5.013 0.000237 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.56 on 13 degrees of freedom
## Multiple R-squared:  0.6591, Adjusted R-squared:  0.6329 
## F-statistic: 25.13 on 1 and 13 DF,  p-value: 0.0002373

#Conclusion: The regression analysis of the relationship between sugar and calories shows that sugar is a significant predictor of calories (t-value = 5.013, p-value = 0.000237). The estimated regression equation is:

#Calories = 29.1542 + 3.0453*Sugar

#This equation implies that each one-unit increase in sugar is associated with a predicted increase of 3.0453 units in calories, on average. The intercept of 29.1542 is the predicted value of calories when sugar is zero. Because the smallest observed sugar value is 5, this is an extrapolation and should be interpreted with caution.
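
#To show how the fitted equation is used, the sketch below predicts calories at sugar = 20, an illustrative value chosen here because it lies inside the observed range (5 to 32). It also computes the same prediction by hand from the coefficients and prints 95% confidence intervals for both coefficients. The names new_data, pred, b, and by_hand are illustrative.

```r
# Data and model from Q3 of this assignment.
sugar <- c(5, 8, 9, 10, 15, 18, 14, 17, 20, 22, 24, 26, 30, 30, 32)
calories <- c(20, 30, 60, 70, 100, 95, 70, 83, 103, 112, 130, 80, 95, 130, 112)
model <- lm(calories ~ sugar)

# Predicted calories at sugar = 20 using predict().
new_data <- data.frame(sugar = 20)
pred <- predict(model, newdata = new_data)
pred

# The same prediction computed by hand: intercept + slope * sugar.
b <- coef(model)
by_hand <- b[["(Intercept)"]] + b[["sugar"]] * 20
by_hand

# 95% confidence intervals for the intercept and the slope.
confint(model, level = 0.95)
```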

#The R-squared value of 0.6591 indicates that 65.91% of the variance in the dependent variable (calories) is explained by the model, which is a reasonably good fit for a single predictor. The adjusted R-squared value of 0.6329, which penalizes the model for the number of predictors, is only slightly lower, so the model still explains a substantial proportion of the variance in calories.

#The residual standard error (RSE) of 19.56 on 13 degrees of freedom indicates that the observed values of calories typically deviate from the predicted values by about 19.56 units. Relative to the sample mean of 86 calories, this is roughly 23%, so individual predictions carry considerable uncertainty.
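
#The adjusted R-squared and the residual standard error discussed above can both be reproduced from their usual definitions. The sketch below is a hand calculation for checking purposes, using the Q3 data. The names n, p, r2, adj_r2, and rse are illustrative.

```r
# Data and model from Q3 of this assignment.
sugar <- c(5, 8, 9, 10, 15, 18, 14, 17, 20, 22, 24, 26, 30, 30, 32)
calories <- c(20, 30, 60, 70, 100, 95, 70, 83, 103, 112, 130, 80, 95, 130, 112)
model <- lm(calories ~ sugar)

n <- length(calories)  # number of observations (15)
p <- 1                 # number of predictors (sugar only)

# Adjusted R-squared: penalize R-squared for the number of predictors.
r2 <- summary(model)$r.squared
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Residual standard error: square root of SSerror over its degrees of freedom.
rse <- sqrt(sum(residuals(model)^2) / (n - p - 1))

# Both hand calculations should match what summary() reports.
all.equal(adj_r2, summary(model)$adj.r.squared)
all.equal(rse, summary(model)$sigma)
```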

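#Finally, the F-statistic of 25.13 on 1 and 13 degrees of freedom (p-value = 0.0002373) indicates that the model as a whole is statistically significant: it explains significantly more variation in calories than a model containing only an intercept. With a single predictor, this test is equivalent to the t-test for the sugar coefficient (5.013^2 is approximately 25.13), which is why the two p-values agree.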
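
#As a check on the F-test, the sketch below extracts the F-statistic from the model summary, compares it with the squared t-value for the sugar coefficient, and recomputes the p-value from the F distribution. The names s, f_stat, t_slope, and p_value are illustrative.

```r
# Data and model from Q3 of this assignment.
sugar <- c(5, 8, 9, 10, 15, 18, 14, 17, 20, 22, 24, 26, 30, 30, 32)
calories <- c(20, 30, 60, 70, 100, 95, 70, 83, 103, 112, 130, 80, 95, 130, 112)
model <- lm(calories ~ sugar)

s <- summary(model)

# F-statistic for the overall model and t-value for the slope.
f_stat <- s$fstatistic[["value"]]
t_slope <- coef(s)["sugar", "t value"]

# In simple regression the F-statistic equals the squared t-value.
all.equal(f_stat, t_slope^2)

# Upper-tail p-value from the F distribution with 1 and 13 degrees of freedom.
p_value <- pf(f_stat, s$fstatistic[["numdf"]], s$fstatistic[["dendf"]],
              lower.tail = FALSE)
p_value
```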

Apoorva Grover

2023-03-12