rm(list = ls())
gc()
## used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 536646 28.7 1198565 64.1 NA 669417 35.8
## Vcells 990944 7.6 8388608 64.0 16384 1851813 14.2
cat("\f")
if (dev.cur() > 1) dev.off() # close any open graphics device; note that bare `dev.off` (no parentheses) only prints the function body instead of calling it
# first we set the WD and assign the dataset to a variable
setwd("/Users/ginaocchipinti/Documents/ADEC 7310 Data Analytics/Week 6")
wine_data <- read.csv("winequality-red.csv", sep = ";", quote = "") # quote = "" keeps the quoted header tokens, which is why the columns are named X.fixed.acidity., X.volatile.acidity., etc.
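If plain names (fixed.acidity, volatile.acidity, ...) are preferred, letting read.csv handle the quoted header should work; this is just a sketch of the alternative, kept separate since the rest of the analysis uses the X.-prefixed names.
# alternative read that parses the quoted header into plain names
wine_data_alt <- read.csv("winequality-red.csv", sep = ";")
names(wine_data_alt)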
# run summary stats using stargazer
library(stargazer)
##
## Please cite as:
## Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
stargazer(wine_data, type = "text", title = "Descriptive Stats", digits = 2)
##
## Descriptive Stats
## ========================================================
## Statistic N Mean St. Dev. Min Max
## --------------------------------------------------------
## X.fixed.acidity. 1,599 8.32 1.74 4.60 15.90
## X.volatile.acidity. 1,599 0.53 0.18 0.12 1.58
## X.citric.acid. 1,599 0.27 0.19 0.00 1.00
## X.residual.sugar. 1,599 2.54 1.41 0.90 15.50
## X.chlorides. 1,599 0.09 0.05 0.01 0.61
## X.free.sulfur.dioxide. 1,599 15.87 10.46 1.00 72.00
## X.total.sulfur.dioxide. 1,599 46.47 32.90 6.00 289.00
## X.density. 1,599 1.00 0.002 0.99 1.00
## X.pH. 1,599 3.31 0.15 2.74 4.01
## X.sulphates. 1,599 0.66 0.17 0.33 2.00
## X.alcohol. 1,599 10.42 1.07 8.40 14.90
## X.quality. 1,599 5.64 0.81 3 8
## --------------------------------------------------------
head(wine_data)
## X.fixed.acidity. X.volatile.acidity. X.citric.acid. X.residual.sugar.
## 1 7.4 0.70 0.00 1.9
## 2 7.8 0.88 0.00 2.6
## 3 7.8 0.76 0.04 2.3
## 4 11.2 0.28 0.56 1.9
## 5 7.4 0.70 0.00 1.9
## 6 7.4 0.66 0.00 1.8
## X.chlorides. X.free.sulfur.dioxide. X.total.sulfur.dioxide. X.density. X.pH.
## 1 0.076 11 34 0.9978 3.51
## 2 0.098 25 67 0.9968 3.20
## 3 0.092 15 54 0.9970 3.26
## 4 0.075 17 60 0.9980 3.16
## 5 0.076 11 34 0.9978 3.51
## 6 0.075 13 40 0.9978 3.51
## X.sulphates. X.alcohol. X.quality.
## 1 0.56 9.4 5
## 2 0.68 9.8 5
## 3 0.65 9.8 5
## 4 0.58 9.8 6
## 5 0.56 9.4 5
## 6 0.56 9.4 5
# create a new data frame that contains only the first four variables
wine_data2 <- wine_data[ , c(1,2, 3, 4)]
head(wine_data2)
## X.fixed.acidity. X.volatile.acidity. X.citric.acid. X.residual.sugar.
## 1 7.4 0.70 0.00 1.9
## 2 7.8 0.88 0.00 2.6
## 3 7.8 0.76 0.04 2.3
## 4 11.2 0.28 0.56 1.9
## 5 7.4 0.70 0.00 1.9
## 6 7.4 0.66 0.00 1.8
# check if predictor variables have a linear association with the response variable
#install.packages("GGally")
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggpairs(wine_data2)
The pairs plot suggests a linear relationship between some of the
variables, such as fixed acidity and citric acid, and between volatile
acidity and citric acid.
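To put numbers on those visual impressions, here is a quick sketch of the pairwise correlations for the same wine_data2:
# correlation matrix quantifying the linear associations seen in ggpairs
round(cor(wine_data2), 2)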
# fit the linear regression model
model <- lm(X.residual.sugar. ~ X.fixed.acidity. + X.volatile.acidity. + X.citric.acid., data = wine_data2)
# check our assumptions of the model
# distribution of model residuals should be approximately normal
hist(residuals(model), col = "steelblue")
The distribution of residuals is roughly normal, but with a clear right skew.
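A more formal check than the histogram is a Shapiro-Wilk test on the residuals (a sketch; with n = 1,599 even small departures from normality produce tiny p-values, so the plots remain the more useful guide):
# Shapiro-Wilk normality test on the model residuals
shapiro.test(residuals(model))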
# variance of residuals should be constant across observations, i.e. homoskedastic
# create a fitted value vs residual plot
plot(fitted(model), residuals(model))
abline(h = 0, lty = 2)
The spread of the residuals is not perfectly even across the fitted
values, though most are concentrated near the zero line, so there is no
strong funnel pattern.
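A more formal homoskedasticity check is the Breusch-Pagan test, sketched here with the lmtest package (not loaded earlier, so this assumes it is installed):
# install.packages("lmtest") # if not already installed
library(lmtest)
bptest(model) # a small p-value would suggest heteroskedasticity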
# create the regression output using Stargazer
stargazer(model, type = "text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## X.residual.sugar.
## -----------------------------------------------
## X.fixed.acidity. 0.007
## (0.027)
##
## X.volatile.acidity. 0.909***
## (0.237)
##
## X.citric.acid. 1.456***
## (0.284)
##
## Constant 1.602***
## (0.227)
##
## -----------------------------------------------
## Observations 1,599
## R2 0.030
## Adjusted R2 0.028
## Residual Std. Error 1.390 (df = 1595)
## F Statistic 16.534*** (df = 3; 1595)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
summary(model)
##
## Call:
## lm(formula = X.residual.sugar. ~ X.fixed.acidity. + X.volatile.acidity. +
## X.citric.acid., data = wine_data2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.6511 -0.6241 -0.3204 0.0703 12.9961
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.602470 0.226635 7.071 2.30e-12 ***
## X.fixed.acidity. 0.007474 0.027434 0.272 0.785333
## X.volatile.acidity. 0.908763 0.237093 3.833 0.000132 ***
## X.citric.acid. 1.455827 0.284366 5.120 3.43e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.39 on 1595 degrees of freedom
## Multiple R-squared: 0.03016, Adjusted R-squared: 0.02834
## F-statistic: 16.53 on 3 and 1595 DF, p-value: 1.396e-10
The slopes are 0.007 for fixed acidity, 0.91 for volatile acidity, and 1.46 for citric acid. Each represents the change in the dependent variable, residual sugar, associated with a one-unit change in that predictor, holding the others constant: a one-unit increase in fixed acidity is associated with a 0.007 increase in residual sugar in red wine, and so on. The sign of a coefficient gives the direction of the relationship, and here all three are positive. Magnitude gives the extent of the change: fixed acidity and volatile acidity have fairly small effects on residual sugar, while citric acid has a much stronger one. For significance we look at the p-values: citric acid (3.43e-07) and volatile acidity (0.000132) are below 0.1, 0.05, and 0.01, so both are highly significant. Fixed acidity (0.785) is above all of those thresholds, so it is not statistically significant, and we fail to reject the null hypothesis that fixed acidity has no effect on residual sugar.
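The same conclusions can be read off the 95% confidence intervals: an interval that contains zero, as fixed acidity's does, corresponds to a non-significant coefficient.
# 95% confidence intervals for the fitted coefficients
confint(model, level = 0.95)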
plot(model)
Many residuals sit near 0, consistent with normally distributed errors. There are some outliers, but they do not grow systematically as the fitted values get larger, so there is no clear heteroskedasticity and the fit looks generally linear. The Q-Q plot of the residuals mostly follows the reference line, again suggesting approximate normality with some right skew. The scale-location plot tells a similar story, with the values heavily concentrated around a fairly flat line. Finally, the residuals-vs-leverage plot shows that most of the data points do not exert much leverage on our model.
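To quantify that leverage impression, one common sketch flags influential observations by Cook's distance, using the rule-of-thumb 4/n cutoff:
# list observations whose Cook's distance exceeds 4/n
cooksd <- cooks.distance(model)
which(cooksd > 4 / nrow(wine_data2))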
Yes, the Gauss-Markov (GM) assumptions generally hold here: linearity (though weak), normal residuals, and constant variance. We can also assume the observations are independent, since one wine's measurements do not affect another's. Under GM we want the data to be linear, meaning there is some relationship between x1, x2, x3 and y for us to quantify. Normal residuals essentially tell us our observations do not deviate too erratically from their expected means. We also want the variability to be constant, rather than expanding as the inputs grow larger. Independence matters as well, so that the relationship is not affected by the previous occurrence of a dependent observation.
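Independence can be probed with a Durbin-Watson test for serial correlation in the residuals, using lmtest again (this is only really meaningful if the rows have an order, so treat it as a sanity check):
# Durbin-Watson test: a statistic near 2 suggests no first-order autocorrelation
dwtest(model)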
OLS BLUE, or ordinary least squares as the best linear unbiased estimator, means we find the line that minimizes the sum of squared errors, the differences between the values we expect and the values we actually observe. Being unbiased means the estimates equal the true population values in expectation: over repeated samples, the mean of the estimates equals the true population parameter. These best estimates are what we strive for in statistical analysis because they have the least variability around the true population values among linear unbiased estimators, and so the model best fits the real-world data.
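The least-squares solution has a closed form, beta = (X'X)^(-1) X'y, which we can verify against lm() as a sketch using the model fitted above:
# recover the OLS coefficients from the normal equations and compare with lm()
X <- model.matrix(model) # design matrix: intercept plus the three predictors
y <- wine_data2$X.residual.sugar.
beta_hat <- solve(t(X) %*% X, t(X) %*% y) # solves (X'X) beta = X'y
cbind(beta_hat, coef(model)) # the two columns should match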
Using logs in linear regression can be helpful for a few reasons. It can make nonlinear relationships between variables appear more linear, and thus better candidates for linear models. It is also easier to see how a change in a variable x affects a variable y when the two are on very different scales once we take logs. Logs can improve the fit of the model, reducing variance and heteroskedasticity, and they handle skewed data better, like our right-skewed wine data, by reducing the impact of outliers. Overall, logs make linear regressions easier to work with and more in line with the GM assumptions, yielding better results and better fit.
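A sketch of that idea on this data (residual sugar is strictly positive here, minimum 0.90, so the log is well defined; whether the log model actually fits better would still need checking):
# refit with a log-transformed response to tame the right skew
model_log <- lm(log(X.residual.sugar.) ~ X.fixed.acidity. + X.volatile.acidity. + X.citric.acid., data = wine_data2)
hist(residuals(model_log), col = "steelblue") # compare with the earlier residual histogram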