The purpose of this week’s discussion topic is to build out a simple linear regression model and test the assumptions using any data set of interest.
The source of data is cited APA-style below:
The data considers 122 nations and explores the correlation between happiness (via Happiness as well as other social indices) and alcohol consumption (via beer, spirit, and wine consumption per capita).
I thought it might be interesting to explore whether there was any relationship between happiness and alcohol consideration.
Does drinking make us merry? OR does it drown down our mood?
More specifically we’ll explore the tie between wine consumption and happiness here.
After downloading the .csv file from Kaggle and uploading it to Github, we read the corresponding data (in raw form) and then familiarize ourselves with the dataset by displaying column names, column number, row number, the 1st 6 observations.
#Read .csv data
happy_alc <- read_csv("https://raw.githubusercontent.com/Magnus-PS/CUNY-SPS-DATA-607/tidyverse/HappinessAlcoholConsumption.csv")
##
## -- Column specification --------------------------------------------------------------------
## cols(
## Country = col_character(),
## Region = col_character(),
## Hemisphere = col_character(),
## HappinessScore = col_double(),
## HDI = col_double(),
## GDP_PerCapita = col_double(),
## Beer_PerCapita = col_double(),
## Spirit_PerCapita = col_double(),
## Wine_PerCapita = col_double()
## )
## [1] "Country" "Region" "Hemisphere" "HappinessScore"
## [5] "HDI" "GDP_PerCapita" "Beer_PerCapita" "Spirit_PerCapita"
## [9] "Wine_PerCapita"
## [1] 9
## [1] 122
## # A tibble: 6 x 9
## Country Region Hemisphere HappinessScore HDI GDP_PerCapita Beer_PerCapita
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Denmark Weste~ north 7.53 928 53.6 224
## 2 Switze~ Weste~ north 7.51 943 79.9 185
## 3 Iceland Weste~ north 7.50 933 60.5 233
## 4 Norway Weste~ north 7.50 951 70.9 169
## 5 Finland Weste~ north 7.41 918 43.4 263
## 6 Canada North~ north 7.40 922 42.3 240
## # ... with 2 more variables: Spirit_PerCapita <dbl>, Wine_PerCapita <dbl>
#Being that we're going to explore happiness vs. wine consumption we'll explore the summary statistics of both variables
summary(happy_alc$HappinessScore)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.069 4.528 5.542 5.525 6.477 7.526
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 5.0 16.0 66.6 112.8 370.0
From the above outputsm we see the variables we’re dealing with and more specifically we can observe important metrics for Happiness and Wine Consumption. The average Happiness Score is 5.525 with a max of 7.526 and a min of 3.069, whereas the average wine consumption is 66.6 with a max of 370.0 and a min of 1.0.
While we have an understanding that Happiness data was derived from the World Happiness Report, it’s harder to say where the Alcohol consumption data came from and what the resulting numbers represents (ie. mLs/day?). This is worth noting as we proceed.
Note: I’d used this data set for a DATA 607 assignment and have adapted a different application / exploration of the data here.
Now that we’ve familiarized ourselves with the data, we can build our simple linear regression model.
From the course text:
The simplest linear regression model finds the relationship between one input variable, which is called the predictor variable, and the output, which is called the system’s response. This type of model is known as a one-factor linear regression.
##
## Call:
## lm(formula = HappinessScore ~ Wine_PerCapita)
##
## Coefficients:
## (Intercept) Wine_PerCapita
## 5.133643 0.005874
We have a y-intercept of 5.133643 and a slope of 0.005874.
Now, we can explore what this would look like as a plot of wine consumption per capita vs. happiness:
plot(HappinessScore, Wine_PerCapita, main = "Happiness vs. Wine Consumption", xlab = "Happiness Score", ylab = "Wine Consumption (per capita)")
#plot(x, y, log="y")
abline(happy_alc.lm)
From the above plot it seems rather clear that there is not a linear relationship between our variables and thus this dataset should not be usable in practice. Still, we’ll test a few assumptions …
##
## Call:
## lm(formula = HappinessScore ~ Wine_PerCapita)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.15862 -0.90416 0.09174 0.79698 2.08049
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.133643 0.117038 43.863 < 2e-16 ***
## Wine_PerCapita 0.005874 0.001062 5.529 1.91e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.03 on 120 degrees of freedom
## Multiple R-squared: 0.203, Adjusted R-squared: 0.1964
## F-statistic: 30.57 on 1 and 120 DF, p-value: 1.911e-07
First we observe the summary statistics:
Residual values: for a good fit, we would expect residual values normally distributed around a mean of zero. Our min and max values are of a similar scale (in opposite directions), our 1Q and 3Q scores differ a little in scale but not significantly, and our median value is slightly above 0. A better model would have a median value nearer 0 and min-max and 1st quartile and 3rd quartile values closer in scale but it doesn’t seem our residuals are too far off of normal (based on the summary statistics). PASS
Coefficients: for a good model, we’d like to see a standard error on the scale of 5-10x smaller than corresponding coefficient. Ours appear to be ~5x smaller. PASS
$R^2 value: values closer to 1 indicate a better fit and representation of the data set of interest. Based on this value, our model explains only ~20% of the data’s variation and thus is not a very good fit … FAIL
At this point, we could rule out our simple linear regression model, but we’ll observe the plot (just for fun / confirmation):
And thus, we can confidently say that our model of the relationship between Wine Consumption and Happiness is not a linear one and thus we can not predict one using the other. We can rule out our our model.