Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?

Dichotomous variable vs Quadratic Variable:

Here are some notes I took from ChatGPT: Dichotomous refers to something that is divided into two parts or categories. In statistics and research, it usually refers to a variable that has two distinct categories, such as "yes" or "no," "true" or "false," or "success" or "failure."

Quadratic, on the other hand, refers to a mathematical function or equation that has a degree of 2. In data science, it is often used to model nonlinear relationships between variables. For example, a quadratic model could be used to describe the relationship between a company's advertising budget and its sales revenue.

So, in summary, dichotomous refers to a variable with two distinct categories, while quadratic refers to a mathematical function or equation with a degree of 2, often used to model nonlinear relationships between variables. These are two different concepts and not directly related to each other.

A common approach to selecting quadratic terms is to use domain knowledge and intuition to identify potential nonlinear relationships between variables. For example, if you are modeling the relationship between a company's advertising budget and its sales revenue, you might suspect that the relationship is not linear and that there is a quadratic effect. In this case, you might choose to include a quadratic term for the advertising budget variable in your model.

Another approach to selecting quadratic terms is to use automated feature selection techniques, such as stepwise regression, which iteratively adds and removes variables from a model based on their statistical significance. These techniques can help to identify the most important quadratic terms for your model.
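
As a rough sketch of the automated approach, base R's step() function can add and drop terms from a candidate model. Note that step() ranks models by AIC rather than by p-values, and the candidate terms below (squared terms for wt and hp) are only an assumption chosen for illustration, as are the names full.lm and selected.lm.

```r
# Sketch: stepwise selection over a few candidate terms (AIC-based).
data("mtcars")

# Hypothetical starting model with two candidate quadratic terms
full.lm <- lm(mpg ~ wt + I(wt^2) + hp + I(hp^2) + vs, data = mtcars)

# direction = "both" lets step() drop terms and add them back; trace = 0 hides the log
selected.lm <- step(full.lm, direction = "both", trace = 0)

# Show which terms survived the selection
print(formula(selected.lm))
```

The selected formula should still be checked against domain knowledge, since stepwise selection can keep or drop terms for reasons that have little to do with the underlying relationship.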

It's important to note that adding quadratic terms to a model can increase its complexity and make it more difficult to interpret. Therefore, it's important to use good judgment and carefully evaluate the performance of your model when including quadratic terms.

For the mtcars dataset, I am choosing vs (engine shape) as my dichotomous variable and wt as the variable that gets a quadratic term. I will look at mpg vs wt because I think the lower the weight of the car, the more mpg it will have. My dependent variable is mpg, and my independent variables are wt, the square of wt, and the engine shape (vs).

data("mtcars")
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
car.lm <- lm(mpg ~ wt + I(wt^2) + vs, data = mtcars)
print(car.lm)
## 
## Call:
## lm(formula = mpg ~ wt + I(wt^2) + vs, data = mtcars)
## 
## Coefficients:
## (Intercept)           wt      I(wt^2)           vs  
##      44.990      -11.790        1.051        2.686

The I() function in R is used to specify the quadratic term. Here, I(wt^2) tells R to treat wt^2 as ordinary arithmetic, so the model gets a term equal to the squared value of wt, allowing for the possibility of a non-linear relationship between mpg and wt. The other parts of the call are the same as in a plain linear model (mpg is the response variable, wt is the weight of the car, vs is the engine type, and data specifies the data frame).
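
The role of I() can be seen in a small sketch. Inside a formula, ^ is a formula operator rather than arithmetic, so wt^2 without I() collapses to plain wt. The object names (with_I, without_I, by_hand) and the column name wt_sq are hypothetical, used only for this illustration.

```r
# Sketch: why I() is needed for a squared term in an R formula.
data("mtcars")

with_I    <- lm(mpg ~ wt + I(wt^2) + vs, data = mtcars)
without_I <- lm(mpg ~ wt + wt^2 + vs, data = mtcars)  # wt^2 is read as just wt

# Equivalent alternative: build the squared column by hand
cars2   <- transform(mtcars, wt_sq = wt^2)
by_hand <- lm(mpg ~ wt + wt_sq + vs, data = cars2)

print(names(coef(with_I)))     # includes the I(wt^2) term
print(names(coef(without_I)))  # no squared term was created
print(all.equal(unname(coef(with_I)), unname(coef(by_hand))))
```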

Adding a quadratic term assumes that the relationship between mpg and wt is not linear. However, it is important to evaluate the model to ensure that this is actually the case and that the assumptions of linear regression are still being met. My initial equation was just: car.lm <- lm(mpg ~ wt + vs, data = mtcars), which didn't have the nonlinear parts.
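
One way to check whether the data support the quadratic term is to compare the initial model with the quadratic one using a partial F-test. This is only a sketch; the names linear.lm and quadratic.lm are introduced here for illustration.

```r
# Sketch: does adding I(wt^2) improve on the plain linear model?
data("mtcars")

linear.lm    <- lm(mpg ~ wt + vs, data = mtcars)
quadratic.lm <- lm(mpg ~ wt + I(wt^2) + vs, data = mtcars)

# anova() on two nested models runs a partial F-test on the extra term
cmp <- anova(linear.lm, quadratic.lm)
print(cmp)
```

Because only one term is added, this F-test is equivalent to the t-test that summary() reports for I(wt^2), so a small p-value here says the same thing: the curvature is supported by the data.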

mpg is the response variable (i.e., the variable we are trying to predict). In this case, it refers to miles per gallon (mpg). wt and vs are the predictor variables (i.e., the variables used to predict the response variable). wt refers to the weight of the car, and vs refers to the engine type (0 = V-shaped, 1 = straight). I(wt^2) creates a new variable that is the squared value of wt, allowing for the possibility of a non-linear relationship between mpg and wt.

The equation indicates that we are fitting a linear regression model that predicts mpg from wt, wt^2, and vs. The model is linear in its coefficients, meaning that predicted mpg is a weighted sum of wt, wt^2, and vs. Because wt^2 is one of those terms, the fitted relationship between mpg and wt itself can be curved.

Note that including a quadratic term builds in the assumption that the relationship between mpg and wt is curved, so it is still important to check that the data actually support this. It is also important to check for other possible sources of non-linearity and to consider other transformations or predictor variables that may improve the accuracy of the model.
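
The assignment also asks for a dichotomous vs. quantitative interaction term, which the model above does not include. A minimal sketch of how one could be added is shown here, assuming wt:vs is the interaction of interest; the name inter.lm is hypothetical and no results are claimed for it.

```r
# Sketch: adding a dichotomous-by-quantitative interaction (wt:vs).
data("mtcars")

inter.lm <- lm(mpg ~ wt + I(wt^2) + vs + wt:vs, data = mtcars)

# The wt:vs row estimates how the linear part of the weight slope differs
# for straight engines (vs = 1) compared with V-shaped engines (vs = 0)
print(summary(inter.lm))
```

If the wt:vs coefficient is close to zero relative to its standard error, the weight effect is similar for both engine types and the simpler model may be enough.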

summary(car.lm)
## 
## Call:
## lm(formula = mpg ~ wt + I(wt^2) + vs, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5022 -1.6747 -0.8383  1.7018  5.5730 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  44.9903     4.3163  10.423 3.82e-11 ***
## wt          -11.7895     2.3865  -4.940 3.27e-05 ***
## I(wt^2)       1.0511     0.3327   3.159  0.00377 ** 
## vs            2.6861     1.0510   2.556  0.01631 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.429 on 28 degrees of freedom
## Multiple R-squared:  0.8533, Adjusted R-squared:  0.8376 
## F-statistic: 54.28 on 3 and 28 DF,  p-value: 8.619e-12

The intercept coefficient (44.99) represents the expected value of the response variable (mpg) when all the predictor variables are equal to zero, that is, for a car with a V-shaped engine (vs = 0) and a weight of zero. This interpretation is not very meaningful in this case because no car weighs zero; the intercept mainly anchors the fitted curve.

Because weight enters the model through both wt and I(wt^2), these two coefficients have to be read together. wt is measured in units of 1,000 lbs. Holding engine type constant, the expected change in mpg for a small increase in weight is approximately -11.79 + 2(1.051)(wt) mpg per 1,000 lbs. For example, at wt = 3 the slope is about -5.48, so mpg falls by roughly 5.5 for each additional 1,000 lbs at that weight. Since the coefficient for I(wt^2), 1.051, is positive, mpg decreases less and less rapidly as weight increases. The fitted curve bottoms out at about wt = 5.61, slightly beyond the heaviest car in the data, so within the observed range mpg decreases with weight throughout. This is how the I(wt^2) term allows for a nonlinear relationship between wt and mpg.

The coefficient for vs represents the difference in the expected value of mpg between cars with a straight engine (vs=1) and cars with a V-shaped engine (vs=0). In this case, the coefficient value of 2.686 indicates that cars with a straight engine are expected to get about 2.69 more mpg than cars with a V-shaped engine, holding the weight constant.
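
Since the effect of weight depends on the weight itself, it can help to compute the slope at a few values. The sketch below takes the derivative of the fitted quadratic; the helper wt_slope and the chosen weights are assumptions made for illustration.

```r
# Sketch: slope of mpg with respect to wt implied by the quadratic model.
data("mtcars")
car.lm <- lm(mpg ~ wt + I(wt^2) + vs, data = mtcars)
b <- coef(car.lm)

# d(mpg)/d(wt) = b_wt + 2 * b_wt2 * wt
wt_slope <- function(wt) b[["wt"]] + 2 * b[["I(wt^2)"]] * wt

weights <- c(2, 3, 4, 5)  # in 1,000 lbs
print(data.frame(wt = weights, slope = wt_slope(weights)))

# Weight at which the fitted curve stops decreasing
turning_point <- -b[["wt"]] / (2 * b[["I(wt^2)"]])
print(turning_point)
print(range(mtcars$wt))   # compare with the observed weights
```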

library(car)  # provides qqPlot() and spreadLevelPlot()
qqPlot(car.lm, main = "Normal Q-Q Plot")

##       Fiat 128 Toyota Corolla 
##             18             20
# Check for homoscedasticity of residuals
spreadLevelPlot(car.lm, main = "Spread-Level Plot")

## 
## Suggested power transformation:  0.9208308
# residuals vs fitted value plots
plot(x = fitted(car.lm), y = residuals(car.lm), 
     xlab = "Fitted Values", ylab = "Residuals",
     main = "Residuals vs. Fitted Values Plot")
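
The visual checks can be backed up with a couple of base R tools. This is a sketch only: it adds a zero line and a lowess smoother to the residuals vs. fitted plot, and runs a Shapiro-Wilk test as one possible numerical check of normality. The names res, fit, and sw are introduced here for illustration.

```r
# Sketch: extra residual checks using base R only.
data("mtcars")
car.lm <- lm(mpg ~ wt + I(wt^2) + vs, data = mtcars)

res <- residuals(car.lm)
fit <- fitted(car.lm)

# Residuals vs. fitted with a reference line and a smoother;
# a roughly flat smoother near zero suggests no leftover curvature
plot(fit, res, xlab = "Fitted Values", ylab = "Residuals",
     main = "Residuals vs. Fitted Values Plot")
abline(h = 0, lty = 2)
lines(lowess(fit, res), col = "red")

# Shapiro-Wilk test: a large p-value gives no evidence against normality
sw <- shapiro.test(res)
print(sw)
```

With only 32 cars, the test has limited power, so it is best read alongside the Q-Q plot rather than instead of it.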

Conclusion:

This assignment was quite a challenge for me, so I had to get ChatGPT to fix my equation, because mine was initially just a regular linear model. What I learned from this task was that the assumptions of the multiple regression model can be checked by examining the residual plots. If the residuals are approximately normally distributed, have a constant variance, and are randomly scattered around the horizontal zero line in the residuals vs. fitted values plot, then the linear model is appropriate. If these assumptions are violated, then the linear model may not be right for the situation. In this case, the residual plots show that the residuals are approximately normally distributed (the Q-Q plot labels Fiat 128 and Toyota Corolla as the two most extreme points), have a roughly constant variance (the spread-level plot suggests a power transformation of about 0.92, which is close to 1, meaning no transformation), and are randomly scattered around the horizontal line, indicating that the linear model is appropriate for the sample data set. And now that I have a better understanding of multiple linear regression, I can rely less on ChatGPT to fix my work next time.