rm(list = ls())
gc()
## used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 536646 28.7 1198565 64.1 NA 669417 35.8
## Vcells 990944 7.6 8388608 64.0 16384 1851813 14.2
cat("\f")
if (dev.cur() > 1) dev.off() # close any open graphics device; note that bare `dev.off` (no parentheses) only prints the function body instead of calling it
# first we set the WD and assign the dataset to a variable
setwd("/Users/ginaocchipinti/Documents/ADEC 7310 Data Analytics/Week 6")
wine_data <- read.csv("winequality-red.csv", sep = ";", quote = "") # quote = "" keeps the quoted header tokens, which is why the columns are named X.fixed.acidity., X.volatile.acidity., etc.
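If plain names (fixed.acidity, volatile.acidity, ...) are preferred, letting read.csv handle the quoted header should work; this is just a sketch of the alternative, kept separate since the rest of the analysis uses the X.-prefixed names.
# alternative read that parses the quoted header into plain names
wine_data_alt <- read.csv("winequality-red.csv", sep = ";")
names(wine_data_alt)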
# run summary stats using stargazer
library(stargazer)
##
## Please cite as:
## Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
stargazer(wine_data, type = "text", title = "Descriptive Stats", digits = 2)
##
## Descriptive Stats
## ========================================================
## Statistic N Mean St. Dev. Min Max
## --------------------------------------------------------
## X.fixed.acidity. 1,599 8.32 1.74 4.60 15.90
## X.volatile.acidity. 1,599 0.53 0.18 0.12 1.58
## X.citric.acid. 1,599 0.27 0.19 0.00 1.00
## X.residual.sugar. 1,599 2.54 1.41 0.90 15.50
## X.chlorides. 1,599 0.09 0.05 0.01 0.61
## X.free.sulfur.dioxide. 1,599 15.87 10.46 1.00 72.00
## X.total.sulfur.dioxide. 1,599 46.47 32.90 6.00 289.00
## X.density. 1,599 1.00 0.002 0.99 1.00
## X.pH. 1,599 3.31 0.15 2.74 4.01
## X.sulphates. 1,599 0.66 0.17 0.33 2.00
## X.alcohol. 1,599 10.42 1.07 8.40 14.90
## X.quality. 1,599 5.64 0.81 3 8
## --------------------------------------------------------
head(wine_data)
## X.fixed.acidity. X.volatile.acidity. X.citric.acid. X.residual.sugar.
## 1 7.4 0.70 0.00 1.9
## 2 7.8 0.88 0.00 2.6
## 3 7.8 0.76 0.04 2.3
## 4 11.2 0.28 0.56 1.9
## 5 7.4 0.70 0.00 1.9
## 6 7.4 0.66 0.00 1.8
## X.chlorides. X.free.sulfur.dioxide. X.total.sulfur.dioxide. X.density. X.pH.
## 1 0.076 11 34 0.9978 3.51
## 2 0.098 25 67 0.9968 3.20
## 3 0.092 15 54 0.9970 3.26
## 4 0.075 17 60 0.9980 3.16
## 5 0.076 11 34 0.9978 3.51
## 6 0.075 13 40 0.9978 3.51
## X.sulphates. X.alcohol. X.quality.
## 1 0.56 9.4 5
## 2 0.68 9.8 5
## 3 0.65 9.8 5
## 4 0.58 9.8 6
## 5 0.56 9.4 5
## 6 0.56 9.4 5
# create a new data frame that contains only the first four variables
wine_data2 <- wine_data[ , c(1,2, 3, 4)]
head(wine_data2)
## X.fixed.acidity. X.volatile.acidity. X.citric.acid. X.residual.sugar.
## 1 7.4 0.70 0.00 1.9
## 2 7.8 0.88 0.00 2.6
## 3 7.8 0.76 0.04 2.3
## 4 11.2 0.28 0.56 1.9
## 5 7.4 0.70 0.00 1.9
## 6 7.4 0.66 0.00 1.8
# check if predictor variables have a linear association with the response variable
#install.packages("GGally")
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggpairs(wine_data2)
The pairs plot suggests a linear relationship between some of the
variables, such as fixed acidity and citric acid, and between volatile
acidity and citric acid.
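To put numbers on those visual impressions, here is a quick sketch of the pairwise correlations for the same wine_data2:
# correlation matrix quantifying the linear associations seen in ggpairs
round(cor(wine_data2), 2)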
# fit the linear regression model
model <- lm(X.residual.sugar. ~ X.fixed.acidity. + X.volatile.acidity. + X.citric.acid., data = wine_data2)
# check our assumptions of the model
# distribution of model residuals should be approximately normal
hist(residuals(model), col = "steelblue")
The distribution of residuals is roughly normal, but with a clear right skew.
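A more formal check than the histogram is a Shapiro-Wilk test on the residuals (a sketch; with n = 1,599 even small departures from normality produce tiny p-values, so the plots remain the more useful guide):
# Shapiro-Wilk normality test on the model residuals
shapiro.test(residuals(model))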
# variance of residuals should be constant across observations, i.e. homoskedastic
# create a fitted value vs residual plot
plot(fitted(model), residuals(model))
abline(h = 0, lty = 2)
The spread of the residuals is not perfectly even across the fitted
values, though most are concentrated near the zero line, so there is no
strong funnel pattern.
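A more formal homoskedasticity check is the Breusch-Pagan test, sketched here with the lmtest package (not loaded earlier, so this assumes it is installed):
# install.packages("lmtest") # if not already installed
library(lmtest)
bptest(model) # a small p-value would suggest heteroskedasticity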
# create the regression output using Stargazer
stargazer(model, type = "text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## X.residual.sugar.
## -----------------------------------------------
## X.fixed.acidity. 0.007
## (0.027)
##
## X.volatile.acidity. 0.909***
## (0.237)
##
## X.citric.acid. 1.456***
## (0.284)
##
## Constant 1.602***
## (0.227)
##
## -----------------------------------------------
## Observations 1,599
## R2 0.030
## Adjusted R2 0.028
## Residual Std. Error 1.390 (df = 1595)
## F Statistic 16.534*** (df = 3; 1595)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
summary(model)
##
## Call:
## lm(formula = X.residual.sugar. ~ X.fixed.acidity. + X.volatile.acidity. +
## X.citric.acid., data = wine_data2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.6511 -0.6241 -0.3204 0.0703 12.9961
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.602470 0.226635 7.071 2.30e-12 ***
## X.fixed.acidity. 0.007474 0.027434 0.272 0.785333
## X.volatile.acidity. 0.908763 0.237093 3.833 0.000132 ***
## X.citric.acid. 1.455827 0.284366 5.120 3.43e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.39 on 1595 degrees of freedom
## Multiple R-squared: 0.03016, Adjusted R-squared: 0.02834
## F-statistic: 16.53 on 3 and 1595 DF, p-value: 1.396e-10
The slopes are 0.007 for fixed acidity, 0.91 for volatile acidity, and 1.46 for citric acid. Each represents the change in the dependent variable, residual sugar, associated with a one-unit change in that predictor, holding the others constant: a one-unit increase in fixed acidity is associated with a 0.007 increase in residual sugar in red wine, and so on. The sign of a coefficient gives the direction of the relationship, and here all three are positive. Magnitude gives the extent of the change: fixed acidity and volatile acidity have fairly small effects on residual sugar, while citric acid has a much stronger one. For significance we look at the p-values: citric acid (3.43e-07) and volatile acidity (0.000132) are below 0.1, 0.05, and 0.01, so both are highly significant. Fixed acidity (0.785) is above all of those thresholds, so it is not statistically significant, and we fail to reject the null hypothesis that fixed acidity has no effect on residual sugar.
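The same conclusions can be read off the 95% confidence intervals: an interval that contains zero, as fixed acidity's does, corresponds to a non-significant coefficient.
# 95% confidence intervals for the fitted coefficients
confint(model, level = 0.95)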
plot(model)
Many residuals sit near 0, consistent with normally distributed errors. There are some outliers, but they do not grow systematically as the fitted values get larger, so there is no clear heteroskedasticity and the fit looks generally linear. The Q-Q plot of the residuals mostly follows the reference line, again suggesting approximate normality with some right skew. The scale-location plot tells a similar story, with the values heavily concentrated around a fairly flat line. Finally, the residuals-vs-leverage plot shows that most of the data points do not exert much leverage on our model.
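To quantify that leverage impression, one common sketch flags influential observations by Cook's distance, using the rule-of-thumb 4/n cutoff:
# list observations whose Cook's distance exceeds 4/n
cooksd <- cooks.distance(model)
which(cooksd > 4 / nrow(wine_data2))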
Yes, the Gauss-Markov (GM) assumptions generally hold here: linearity (though weak), normal residuals, and constant variance. We can also assume the observations are independent, since one wine's measurements do not affect another's. Under GM we want the data to be linear, meaning there is some relationship between x1, x2, x3 and y for us to quantify. Normal residuals essentially tell us our observations do not deviate too erratically from their expected means. We also want the variability to be constant, rather than expanding as the inputs grow larger. Independence matters as well, so that the relationship is not affected by the previous occurrence of a dependent observation.
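Independence can be probed with a Durbin-Watson test for serial correlation in the residuals, using lmtest again (this is only really meaningful if the rows have an order, so treat it as a sanity check):
# Durbin-Watson test: a statistic near 2 suggests no first-order autocorrelation
dwtest(model)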
OLS BLUE, or ordinary least squares as the best linear unbiased estimator, means we find the line that minimizes the sum of squared errors, the differences between the values we expect and the values we actually observe. Being unbiased means the estimates equal the true population values in expectation: over repeated samples, the mean of the estimates equals the true population parameter. These best estimates are what we strive for in statistical analysis because they have the least variability around the true population values among linear unbiased estimators, and so the model best fits the real-world data.
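The least-squares solution has a closed form, beta = (X'X)^(-1) X'y, which we can verify against lm() as a sketch using the model fitted above:
# recover the OLS coefficients from the normal equations and compare with lm()
X <- model.matrix(model) # design matrix: intercept plus the three predictors
y <- wine_data2$X.residual.sugar.
beta_hat <- solve(t(X) %*% X, t(X) %*% y) # solves (X'X) beta = X'y
cbind(beta_hat, coef(model)) # the two columns should match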
Using logs in linear regression can be helpful for a few reasons. It can make nonlinear relationships between variables appear more linear, and thus better candidates for linear models. It is also easier to see how a change in a variable x affects a variable y when the two are on very different scales once we take logs. Logs can improve the fit of the model, reducing variance and heteroskedasticity, and they handle skewed data better, like our right-skewed wine data, by reducing the impact of outliers. Overall, logs make linear regressions easier to work with and more in line with the GM assumptions, yielding better results and better fit.
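A sketch of that idea on this data (residual sugar is strictly positive here, minimum 0.90, so the log is well defined; whether the log model actually fits better would still need checking):
# refit with a log-transformed response to tame the right skew
model_log <- lm(log(X.residual.sugar.) ~ X.fixed.acidity. + X.volatile.acidity. + X.citric.acid., data = wine_data2)
hist(residuals(model_log), col = "steelblue") # compare with the earlier residual histogram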