Background

The purpose of this week’s discussion topic is to build out a simple linear regression model and test the assumptions using any data set of interest.

Data Source

The source of data is cited APA-style below:

The data considers 122 nations and explores the correlation between happiness (via Happiness as well as other social indices) and alcohol consumption (via beer, spirit, and wine consumption per capita).

I thought it might be interesting to explore whether there was any relationship between happiness and alcohol consideration.

Does drinking make us merry? OR does it drown down our mood?

More specifically we’ll explore the tie between wine consumption and happiness here.


Load data

After downloading the .csv file from Kaggle and uploading it to Github, we read the corresponding data (in raw form) and then familiarize ourselves with the dataset by displaying column names, column number, row number, the 1st 6 observations.

#Read .csv data
happy_alc <- read_csv("https://raw.githubusercontent.com/Magnus-PS/CUNY-SPS-DATA-607/tidyverse/HappinessAlcoholConsumption.csv")
## 
## -- Column specification --------------------------------------------------------------------
## cols(
##   Country = col_character(),
##   Region = col_character(),
##   Hemisphere = col_character(),
##   HappinessScore = col_double(),
##   HDI = col_double(),
##   GDP_PerCapita = col_double(),
##   Beer_PerCapita = col_double(),
##   Spirit_PerCapita = col_double(),
##   Wine_PerCapita = col_double()
## )
#Familiarize ourselves with the dataset
colnames(happy_alc)
## [1] "Country"          "Region"           "Hemisphere"       "HappinessScore"  
## [5] "HDI"              "GDP_PerCapita"    "Beer_PerCapita"   "Spirit_PerCapita"
## [9] "Wine_PerCapita"
ncol(happy_alc)
## [1] 9
nrow(happy_alc)
## [1] 122
head(happy_alc)
## # A tibble: 6 x 9
##   Country Region Hemisphere HappinessScore   HDI GDP_PerCapita Beer_PerCapita
##   <chr>   <chr>  <chr>               <dbl> <dbl>         <dbl>          <dbl>
## 1 Denmark Weste~ north                7.53   928          53.6            224
## 2 Switze~ Weste~ north                7.51   943          79.9            185
## 3 Iceland Weste~ north                7.50   933          60.5            233
## 4 Norway  Weste~ north                7.50   951          70.9            169
## 5 Finland Weste~ north                7.41   918          43.4            263
## 6 Canada  North~ north                7.40   922          42.3            240
## # ... with 2 more variables: Spirit_PerCapita <dbl>, Wine_PerCapita <dbl>
#Being that we're going to explore happiness vs. wine consumption we'll explore the summary statistics of both variables

summary(happy_alc$HappinessScore)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.069   4.528   5.542   5.525   6.477   7.526
summary(happy_alc$Wine_PerCapita)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0     5.0    16.0    66.6   112.8   370.0

From the above outputsm we see the variables we’re dealing with and more specifically we can observe important metrics for Happiness and Wine Consumption. The average Happiness Score is 5.525 with a max of 7.526 and a min of 3.069, whereas the average wine consumption is 66.6 with a max of 370.0 and a min of 1.0.

While we have an understanding that Happiness data was derived from the World Happiness Report, it’s harder to say where the Alcohol consumption data came from and what the resulting numbers represents (ie. mLs/day?). This is worth noting as we proceed.

Note: I’d used this data set for a DATA 607 assignment and have adapted a different application / exploration of the data here.

Linear regression model

Now that we’ve familiarized ourselves with the data, we can build our simple linear regression model.

From the course text:

The simplest linear regression model finds the relationship between one input variable, which is called the predictor variable, and the output, which is called the system’s response. This type of model is known as a one-factor linear regression.

attach(happy_alc)
happy_alc.lm <- lm(HappinessScore ~ Wine_PerCapita )
happy_alc.lm
## 
## Call:
## lm(formula = HappinessScore ~ Wine_PerCapita)
## 
## Coefficients:
##    (Intercept)  Wine_PerCapita  
##       5.133643        0.005874

We have a y-intercept of 5.133643 and a slope of 0.005874.

Now, we can explore what this would look like as a plot of wine consumption per capita vs. happiness:

plot(HappinessScore, Wine_PerCapita, main = "Happiness vs. Wine Consumption", xlab = "Happiness Score", ylab = "Wine Consumption (per capita)")
#plot(x, y, log="y")
abline(happy_alc.lm)

From the above plot it seems rather clear that there is not a linear relationship between our variables and thus this dataset should not be usable in practice. Still, we’ll test a few assumptions …

Assumption testing

summary(happy_alc.lm)
## 
## Call:
## lm(formula = HappinessScore ~ Wine_PerCapita)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.15862 -0.90416  0.09174  0.79698  2.08049 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    5.133643   0.117038  43.863  < 2e-16 ***
## Wine_PerCapita 0.005874   0.001062   5.529 1.91e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.03 on 120 degrees of freedom
## Multiple R-squared:  0.203,  Adjusted R-squared:  0.1964 
## F-statistic: 30.57 on 1 and 120 DF,  p-value: 1.911e-07
plot(fitted(happy_alc.lm),resid(happy_alc.lm))

First we observe the summary statistics:

  • Residual values: for a good fit, we would expect residual values normally distributed around a mean of zero. Our min and max values are of a similar scale (in opposite directions), our 1Q and 3Q scores differ a little in scale but not significantly, and our median value is slightly above 0. A better model would have a median value nearer 0 and min-max and 1st quartile and 3rd quartile values closer in scale but it doesn’t seem our residuals are too far off of normal (based on the summary statistics). PASS

  • Coefficients: for a good model, we’d like to see a standard error on the scale of 5-10x smaller than corresponding coefficient. Ours appear to be ~5x smaller. PASS

  • $R^2 value: values closer to 1 indicate a better fit and representation of the data set of interest. Based on this value, our model explains only ~20% of the data’s variation and thus is not a very good fit … FAIL

At this point, we could rule out our simple linear regression model, but we’ll observe the plot (just for fun / confirmation):

  • Residual plot: based on our residual plot, we see that our residuals are concentrated on the left side of the plot and not evenly distributed. Thus, the plot shows us that using the Happiness Score as the sole predictor in the regression model does not sufficiently or fully explain the data. This does not mean that our simple one-factor model is useless, just that a better model might be constructed. One solution might be to drop all non-drinking values (ie. 1 entries) and then reconsider the relationship …

And thus, we can confidently say that our model of the relationship between Wine Consumption and Happiness is not a linear one and thus we can not predict one using the other. We can rule out our our model.