Artyom Kulikov, Anastasia Vlasenko, Alexey Artyushin and Nadezhda Bykova
30.04.2019
Areas of members’ responsibilities:
While the country of our interest is still Germany, in this particular project we use data from two topics, namely Social Demographics and Social Inequalities in Helath. The data was collected in 2014.
Smoking is one of the most important problems of our time. Everyone knows that smoking entails a high risk to human health. Because of smoking, people have different kinds of diseases, whether it is cardiovascular or respiratory ones, but what is more smoking increases mortality around the whole world. This is only one side of the coin. It is important to remember that smoking is harmful not only for smokers and people around them, it harms the environment as well. And it happens by emission of toxic air pollutants into the atmosphere. Not only tobacco smoke is dangerous, but also cigarette butts are dangerous because their toxic chemicals seep into the soil and waterways, which respectively causes soil and water pollution. We should not forget that under the danger are also animals and vegetation, because they absorb toxic substances from cigarette butts.
We have already included the topic of environmental problems in one of our projects - we investigated if the opinion on the increase in taxes on fossil fuels such as oil, gas and coal was dependent on gender of respondents. Since, as it was mentioned, smoking causes air pollution, it is important not only to support increased taxes on fossil fuels polluting the atmosphere, but also to make individual contribution: start using cars less, drive a car working on electricity etc. Quitting smoking - is another step that can be made towards the future with a clear air.
Beyond a shadow of a doubt, smoking is an extremely serious problem and it is needed to be tackled. In our work, we would like to dig deeper into the smoking problem in such country as Germany and try to see which variables can be used as predictors of smoking.
There are quite a lot of articles written on this topic, so it is possible to say that smoking in Germany is not only a relevant, but also quite a pressing problem. Only in Germany between 100-120 thousands people die because of smoking every year (http://www.gbe-bund.de/pdf/DEGS1_Verbreitung_Rauchen_E.pdf). But what is more interesting is that Germany portrays itself as a country with people leading a really healthy lifestyle. The author of the article on the website The Local (https://www.thelocal.de/20181120/opinion-why-germany-needs-to-take-anti-smoking-laws-more-seriously) emphasizes that it seems astonishing that Germans spend a lot of time and money on their health and stand in long lines for healthy food, but still smoke quite a lot.
We start by loading the packages we will be using, namely “ggplot2”, “psych”, “foreign”, “dyplr”and “sjPlot”.
library(ggplot2)
library(psych)
library(foreign)
library(sjPlot)
library(dplyr)
library(sjmisc)
library(sjlabelled)We proceed by uploading out dataset and choosing variables of particular interest to us. We also remove ‘NA’ values from these variables and convert them to numeric type.
data <- read.spss("ESS7DE.sav", use.value.labels = TRUE, to.data.frame = TRUE)
save <- c("wkhtot", "njbspv", "cgtsday", "alcwkdy", "gndr", "eduyrs")
data1 <- data[save]
data1 <- data1[!is.na(data1$cgtsday),]
data1 <- data1[!is.na(data1$wkhtot),]
data1 <- data1[!is.na(data1$njbspv),]
data1 <- data1[!is.na(data1$alcwkdy),]
data1 <- data1[!is.na(data1$gndr),]
data1 <- data1[!is.na(data1$eduyrs),]
data1$wkhtot <- as.numeric(data1$wkhtot)
data1$njbspv <- as.numeric(data1$njbspv)
data1$cgtsday <- as.numeric(data1$cgtsday)
data1$alcwkdy <- as.numeric(data1$alcwkdy)
data1$eduyrs <- as.numeric(data1$eduyrs)Six variables in total are used in our linear model construction: 4 of predictor variables are continious and 1 is categorical.
The dependent variable in our model is ‘number of cigarettes smoked on typical day (cgtsday)’. It is a continious ratio variable. Descriptive statistics as well as distribution of variable are shown below.
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 235 14.72 7.49 14 14.56 10.38 1 32 31 0.16 -0.69
## se
## X1 0.49
From descriptive statistics we see that mean and median are close to be equal (app. 14 cigarettes per day) while the biggest number of cigarettes respondents smoke per day is 32. Even though kurtosis, as well as skew, indicates normality, as seen from histogram and density plot below, distribution does not seem normal, it is not bell-shaped and unimodal. In fact, we can say that it is bimodal (the biggest number of respondents reported smoking around 10 and around 20 cigarettes per day).
Then we proceed to the predictor variables used in our model.
The first predictor is ‘Total hours normally worked per week in main job overtime included (wkhtot)’. It is a continious ratio variable. Descriptive statistics as well as distribution of variable are shown below.
Why did we consider adding this variable to our model?
Both the number of working hours (overtime included) and the number of people one is responsible for were investigated as possible predictors in future models due to the stress a busy working week and many people under control might cause.
A lot of articles and papers suggest reasons why people smoke, and almost in every of them one of the most frequently reported reasons is stress relief. It is a quite common myth that tobacco helps to decrease a stress level in our lives (https://addictionblog.org/top-10/why-do-people-start-smoking/).
There is a tendency to think that cigarettes can help to cope with stress, but in the article of the american psychologist Andrew Parrot “Does Cigarette Smoking Cause Stress?”, this myth is refuted. Stress level of adult smokers is slightly higher than that of non-smokers, while adolescent smokers report an increase in the level of stress. It happens because these adolescents develop regular models of smoking. As it turns out, smoking cessation leads to stress reduction but not smoking itself. Dependence on nicotine not only does not help reduce stress but increases it. This is confirmed by the daily mood patterns described by smokers. Thus, it turns out that smoking relieves the stress that have been developed by smoking before. Roughly speaking, addicted smokers need nicotine to stay at normal stress level (https://www.researchgate.net/publication/12759310_Does_Cigarette_Smoking_Cause_Stress).
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 235 44.12 10.87 43 44.51 7.41 3 74 71 -0.55 2.07
## se
## X1 0.71
Here, mean and meadian also seem to be quite close to each other (44 and 43 respectively). While the maximal number of working hours (including overtime) repondents reported was 74, there were no people reporting 0 hours of work (3 hours is the smallest number).
Contrtrary to the previous variable, even though the kurtosis is bigger than 2, the distribution looks wuite unimodal and bell-shaped. It is a bit skewed to the left.
According to the information we found, a normal working day in Germany lasts 8 hours, which under certain conditions can be extended to 10 hours. The working week is set from Monday to Saturday, although most employees work from Monday to Friday. Night and shift work is also permitted.
In other words, we would expect people to work aroung 45 - 55 hours a week and this is what explains quite big numbers of people, judging by the histogram, reporting their working hours per week being around 40 - 50.
Since there are no people who don’t work at all (min = 3) we mean center the variable in order to make the further interpretation of the reression equation meaningful. Just to show that the shape of the distribution didn’t change we also plot the distribution again.
Our next predictor is ‘Number of people responsible for in job (njbspv)’. It is also a continious ratio variable. Descriptive statistics and distribution of variable are shown below.
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 235 10.56 13.97 5 7.39 4.45 1 77 76 2.45 6.24
## se
## X1 0.91
In this case mean and median are not equal at all (10.56 and 5 respectively). Kurtosis and skew are too large to talk about normality (kurtosis of 6.24). The distribution is unimodal, but not bell-shaped. In fact, it is strongly skewed to the right indicating that only a few respondent have a lot of people responsible for in job.
We also mean center this variable since the smallest number of people respondents reported being responsible for is 1.
One more variable that we took for our analysis is ‘Grams alcohol, last time drinking on a weekday, Monday to Thursday (alcwkdy)’. It is, again, continious ratio variable. Its descriptive statistics with visualizations are depicted below.
Why did we consider adding this variable to our model?
According to the article, published on the website of National Institute on Alcohol Abuse and Alcoholism, people who are dependent on alcohol are three times more likely then those in the general population to be smokers, and people who are dependent on tobacco are four times more likely than the general population to be dependent on alcohol. It actually seems like people themselves are aware of it. According to the article, published on the website Tobacco Free Life, people who both smoke and drink often mention that these two habits go together and complement each other: they feel more like smoking when they have an alcoholic drink and vice-versa.
There are many possible explanations for such phenomenon, including the fact that nicotine (that cigarettes contain) enhances the pleasurable effects of alcohol. Even though the reserach in this topic is not suffient to say if it can shed a light on a different reason for connection between smoking and drinking, it is known that both nicotine and alcohol work on the same brain systems. Finally, same genes might be responsible for predisposition to both smoking and drinking or the general reason might be the absence of willpower.
We specifically chose amount of alcohol drunk on a weekday and not on a weekend day. First of all, drinking on weekends is quite common even for people who do not smoke since they are likely to go on some corporate parties or meetings with friends, while drinking on weekdays is more likely to be a warning signal of a person having a predesposition to alcohol frinking. Since genes of smoking and drinking predisposition are similar (or even the same), it is more reasonable for us to use this particular variable.
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 235 17.22 23.35 8 11.94 10.38 1 117 116 2.27 5.2
## se
## X1 1.52
Descriptive statistics show big difference between median and mean (8 and 17 respectively). Both skew and kurtosis exceed the value of 2. The distribution is again unimodal and right-skewed - not that many people report drinking a lot on week days.
Again, we mean center this variable since the minimal reported number of grams of alcohol drunk last time on a week day is 1.
Another predictor in our model is ‘Years of full-time education completed (eduyrs)’. It is continious ratio variable. Descriptive statistics and plots are below as usual.
Why did we consider adding this variable to our model?
On the website Studential (https://www.studential.com/university/guides/correlation-between-education-and-quitting-smoking) the following expert opinions offering some reasons why uneducated people are more likely to smoke are posted.
Some of the reasons of people living in worse conditions being more likely to smoke, which might be connected with their education, are also offered:
Some statistics of smokers by level of education can be seen below
Among both genders, smoking is far less widespread in the groups with higher levels of education than in less educated ones. With the exception of the 65-plus age group, where no significant differences regarding the educational level appear, this clear link between smoking and education is evident across all age groups.
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 235 12.74 2.7 12 12.56 2.97 7 23 16 0.68 0.49 0.18
According to description, mean and median are 12.74 and 12 respectively. Kurtosis and skew are both small. The distribution looks bell-shaped and unimodal, but not fully normal. It is right-skewed a little bit since, while children are obliged to spend fixed number of years studying by the government and there are not so many children not doing so, there are still some people that spend more years studying than others.
The minimal number of years spent studying is 7 which is why we also mean center this variable.
The last variable that we took is ‘Gender (gndr)’. It is a categorical binominal variable. A barplot with distribution of gender among respondents is below.
Why did we consider adding this variable to our model?
According to the data from the microcensus of 2013, information of which can be found on Wikipedia (https://en.wikipedia.org/wiki/Smoking_in_Germany), 24.5% (almost fourth) of the total German population aged 15 and over are smokers, where 20.9% smoke regularly, and only 3.6% do so occasionally. It is also known from the statistics of 2013 that in Germany men smoke more than women do (29% and 20% respectively). The graph below shows the proportions of male and female smokers of different age groups in Germany. The average age when german people begin to smoke is 17.8. The most smoking age categories in this country are from 25 to 30, from 20 to 25, and from 30 to 35, while german people over 75 it contains only 8% of male and 3,6 of female smokers. It can be seen, that, as a rule, men smoke more than women with the difference between percentages of smokers in these two age groups (15-20 - 4.9%, 20-25 - 7,1%, 25-30 - 12,1%, 30-35 - 14,2%, 35-40 - 11,1%, 40-45 - 8%, 45-50 - 7,8%, 50-55 - 8,4%, 55-60 - 8.5%, 60-65 - 8,1%, 65-70 - 5,8%, 70-75 - 5%, over 75 - 4,4%) fluctuates from about 4% to 15%.
The article “Prevalence of smoking in the adult population of Germany” written by Thomas Lampert, Elena von der Lippe and Stephan Müters offers the data where categories of people who never smoked, ex-smokers, occasional and daily smokers among 18 to 79-years-old women and men.
What is more, the statistics of heavy smokers is given in the article. So, the most smoking category of German is people from 30 to 44 years, the least - 65-79 year-olds.
Another article (“Smoking Behaviour in Germany – Evidence from the SOEP” written by Daniela Heilert and Ashok Kaul) presents such statistics:
As it can be seen, the blue line (which represents men) is always above the red line representing women.
There were around 60% of men and nearly 40% of women among respondents.
We now proceed to the scatterplots visualizing relationships between a number of smoked cigarettes and our predictor variables. Note, that we do not use centered variables here since we want to interpret actual values of them.
##
## Pearson's product-moment correlation
##
## data: data1$cgtsday and data1$wkhtot
## t = 1.7188, df = 233, p-value = 0.08698
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.01631155 0.23648055
## sample estimates:
## cor
## 0.1118944
We start with a total working hours as our predictor variable. Reported pearson’s coefficient is equal to 0.1118944 which indicates a weak positive relationship between the variables. By looking at the scatter plot we can make the same conclution. Quite interestingly, the approxiation line for female respondents seems to have a bigger slope coefficient. In other words, while there is a weak positive relationship between number of total working hours and a number of cigarettes one smokes daily, this relationships seems to be a bit stronger for women. One of the possible explanations might be the fact that women have less stress tolerance.
##
## Pearson's product-moment correlation
##
## data: data1$cgtsday and data1$njbspv
## t = 1.9647, df = 233, p-value = 0.05064
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.0003195519 0.2515214269
## sample estimates:
## cor
## 0.1276581
While, judging by the obtained coefficient (0.1276581), there is a weak positive relationship between the number of cigarettes one smokes daily and a number of people he/she is responsible for, we can see that the approximating line for men is steaper than the one for women. We can also see that there are more males having more than 20 people they are responsible for in comparison to women, and more males smoking more than 20 cigarettes daily. One of the possible explanations is, again, stress tolerance.
##
## Pearson's product-moment correlation
##
## data: data1$cgtsday and data1$alcwkdy
## t = 2.6827, df = 233, p-value = 0.007827
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.04614499 0.29454306
## sample estimates:
## cor
## 0.1730952
The correlation coefficient in case of these variables is a bit bigger, but it still indicates a weak positive relationship between the number of cigarettes one smokes each day and grams of alcohol consumed last time during the week days. The approximating lines for males and females gave approximately the same slope. As we expected, people who drink are also more likely to smoke.
##
## Pearson's product-moment correlation
##
## data: data1$cgtsday and data1$eduyrs
## t = -1.8238, df = 233, p-value = 0.06947
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.242920810 0.009479834
## sample estimates:
## cor
## -0.1186365
This is the only pair of varibles the relationship between which appears to be weak but negative (-0.1186365). In other words, the bigger the number of years of education one obtained, the less cigarettes he/she smokes. The approxiating line for men seems to have a bigger slope in comparison to women which can be possibly explained by the fact that women are, in general, less likely to start smoking under the influence of the peers no matter for how long one continues her education.
Here is a boxplot showing how many cigarettes are smoked by people on each day according to their gender. It is seen that mean number of cigarettes for men is bigger that one for women almost by 5 units. We can also see that there are two women who smoke more than 30 cigarettes each day - those are marked as two outliers.
Linear assumptions for the model to check for bias:
First things first, additivity of predictors implies that the influence of a predictor variables on dependent variable is not influenced by any other external influence. Considering our examples, we conclude that the effects of different independent variables on expected value of the dependent variable are additive since when we add second independent variables it does not change the relationship between the first independent and the outcome variables (for example, the addition of gender does not change the slope of the line that much).
Talking about linearity of a relationship, we firstly examine bivariate plots (the scatterplots above).They show us straight lines, thus, we are satisfied and the conditions are met. Then we will look at the residuals to check if the model is applicable (We will consider this a bit later, after building the models).
Last but not least, we recall that the data we have consists of independent observations as it comes from the national survey and has been checked by professionals.
Now we proceed to constructing the models which consider two independent variables which constitute to the working conditions of a person: ‘Total hours normally worked per week in main job overtime included (wkhtot)’ and ‘Number of people responsible for in job (njbspv)’.
model1 <- lm(cgtsday ~ wkhtot_cent, data = data1)
model2 <- lm(cgtsday ~ njbspv_cent, data = data1)
model3 <- lm(cgtsday ~ wkhtot_cent + njbspv_cent + gndr, data = data1) We can now summarize the results of the models to get the statistic of each and compare them later.
summary(model1)##
## Call:
## lm(formula = cgtsday ~ wkhtot_cent, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.4114 -5.0225 -0.4824 5.9389 17.5176
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.72340 0.48678 30.246 <2e-16 ***
## wkhtot_cent 0.07716 0.04489 1.719 0.087 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.462 on 233 degrees of freedom
## Multiple R-squared: 0.01252, Adjusted R-squared: 0.008282
## F-statistic: 2.954 on 1 and 233 DF, p-value: 0.08698
summary(model2)##
## Call:
## lm(formula = cgtsday ~ njbspv_cent, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6577 -5.1715 -0.5482 5.9518 17.8627
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.72340 0.48585 30.304 <2e-16 ***
## njbspv_cent 0.06849 0.03486 1.965 0.0506 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.448 on 233 degrees of freedom
## Multiple R-squared: 0.0163, Adjusted R-squared: 0.01207
## F-statistic: 3.86 on 1 and 233 DF, p-value: 0.05064
summary(model3)##
## Call:
## lm(formula = cgtsday ~ wkhtot_cent + njbspv_cent + gndr, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.0569 -4.7958 -0.2175 5.3140 18.7842
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.71188 0.62051 25.321 <2e-16 ***
## wkhtot_cent 0.03356 0.04647 0.722 0.4709
## njbspv_cent 0.04896 0.03514 1.393 0.1649
## gndrFemale -2.61003 1.04049 -2.508 0.0128 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.348 on 231 degrees of freedom
## Multiple R-squared: 0.05074, Adjusted R-squared: 0.03841
## F-statistic: 4.116 on 3 and 231 DF, p-value: 0.007197
According to these, we will be able to construct the equations and also find the best model which explains the most share.The F-test statistics are significant with the third model having the biggest number and the smallest p-value. Thus, we conclude that third model is better than no model at all. We shall move to the adjusted R-squared which indicates how well the regression function fits with empirical data. In case of the first model we can conclude that the model explains about 3,8% of the variation in cigarettes smoked daily - and, again, the last model explains the most. Considering this, we should take up our third model for further
Let’s plot now! Shall we?
plot(model1)plot(model2)plot(model3)This is where we finally see that our residuals are nearing zero, the line in the Residuals vs Fitted graph is almost straight and, thus, our calcualtions are not wrong and we are on the way of getting to the best model!
You can also take a look at the normal probability plot that shows if residuals are distributed normally. As you can see, there are some outliers but this is not the worst case scenario - the line is still quite straight.
In case of the Scale-Location plot, we can check if residuals are spead equally along the ranges of predictors. We can see that the a more or less straight horizontal line with randomly (equally) spread points which is good.
Finally, the last plot is supposed to help us identify the highly influential observations that don’t get along with the majority of other cases. In this case, we should watch out for dots outside of the line of ‘Cook’s distance’.
Now we should compare the models with each other and check it by conducting ANOVA test comparison. We can compare our reduced and full models since the former one can be created by getting rid of one of the predictors in the full model. Here are the hypothesis:
anova(model1, model3)## Analysis of Variance Table
##
## Model 1: cgtsday ~ wkhtot_cent
## Model 2: cgtsday ~ wkhtot_cent + njbspv_cent + gndr
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 233 12974
## 2 231 12472 2 502.2 4.6506 0.01047 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(model2, model3)## Analysis of Variance Table
##
## Model 1: cgtsday ~ njbspv_cent
## Model 2: cgtsday ~ wkhtot_cent + njbspv_cent + gndr
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 233 12925
## 2 231 12472 2 452.58 4.1911 0.0163 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Looking at this, we can tell that in both cases p-value is less than 0.05 and, thus, we reject the null hypothesis which claims that reduced model is as good and conclude that addition of a predictor variable (number of people one is responsible for) did imporve our model and the full model is better.
Now we want to construct models based on such socially constructed factor as ‘Years of full-time education completed (eduyrs)’ and on such factor as ‘Grams alcohol, last time drinking on a weekday, Monday to Thursday (alcwkdy)’.
Let’s proceed to the construction of the models and see what we have got.
model4 <- lm(cgtsday ~ alcwkdy_cent, data = data1)
model5 <- lm(cgtsday ~ eduyrs_cent, data = data1)
model6 <- lm(cgtsday ~ alcwkdy_cent + eduyrs_cent + gndr, data = data1)summary(model4)##
## Call:
## lm(formula = cgtsday ~ alcwkdy_cent, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.5439 -4.7667 -0.1557 6.1221 18.1775
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.72340 0.48246 30.517 < 2e-16 ***
## alcwkdy_cent 0.05554 0.02070 2.683 0.00783 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.396 on 233 degrees of freedom
## Multiple R-squared: 0.02996, Adjusted R-squared: 0.0258
## F-statistic: 7.197 on 1 and 233 DF, p-value: 0.007827
summary(model5)##
## Call:
## lm(formula = cgtsday ~ eduyrs_cent, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.2962 -5.1317 -0.3088 5.7038 17.6659
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.7234 0.4864 30.270 <2e-16 ***
## eduyrs_cent -0.3291 0.1805 -1.824 0.0695 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.456 on 233 degrees of freedom
## Multiple R-squared: 0.01407, Adjusted R-squared: 0.009843
## F-statistic: 3.326 on 1 and 233 DF, p-value: 0.06947
summary(model6)##
## Call:
## lm(formula = cgtsday ~ alcwkdy_cent + eduyrs_cent + gndr, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.1945 -4.8574 -0.2264 5.0707 18.7513
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.71173 0.60442 25.995 <2e-16 ***
## alcwkdy_cent 0.04791 0.02067 2.318 0.0213 *
## eduyrs_cent -0.33704 0.17598 -1.915 0.0567 .
## gndrFemale -2.60961 0.99209 -2.630 0.0091 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.258 on 231 degrees of freedom
## Multiple R-squared: 0.0739, Adjusted R-squared: 0.06187
## F-statistic: 6.144 on 3 and 231 DF, p-value: 0.00049
Now we are interested in the F-statistics and p-values numbers again and, as we see, the sixth (the full) model provides the most explanation into our case and the R-squared states that it covers 6,1% of variation.
Again, looking at the residuals plots and the other we see that we are able to trust the results.
plot(model4)plot(model5)plot(model6)Our residuals are nearing zero, the line in the Residuals vs Fitted graph is almost straight.
Looking at the normal probability plot, we can see, there are some outliers but this is not still applicable and the line is quite straight.
On the Scale-Location plot we can see that the a more or less straight horizontal line with randomly spread points which is fine by us.
Last but not least, according to the last plot, we should watch out for dots outside of the line of ‘Cook’s distance’.
Now we again check if the reduced model is as good as the full model or not and use the ANOVA test.
anova(model4, model6)## Analysis of Variance Table
##
## Model 1: cgtsday ~ alcwkdy_cent
## Model 2: cgtsday ~ alcwkdy_cent + eduyrs_cent + gndr
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 233 12745
## 2 231 12168 2 577.3 5.4798 0.00473 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(model5, model6)## Analysis of Variance Table
##
## Model 1: cgtsday ~ eduyrs_cent
## Model 2: cgtsday ~ alcwkdy_cent + eduyrs_cent + gndr
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 233 12954
## 2 231 12168 2 786.05 7.4612 0.0007244 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Looking at this, we again claim that in both cases p-value is less than 0.05 and, thus, we reject the null hypothesis which claims that reduced model is as good and conclude that the full model is of a better explanation.
Now we know which models of previous two groups are the best, but we cannot compare them with each other by using anova since they use completely different predictors.
But what happens if we take all of those predictors and use them in our final model? Let’s take a look.
model7 <- lm(cgtsday ~ wkhtot_cent + njbspv_cent + alcwkdy_cent + eduyrs_cent + gndr, data = data1)
summary(model7)##
## Call:
## lm(formula = cgtsday ~ wkhtot_cent + njbspv_cent + alcwkdy_cent +
## eduyrs_cent + gndr, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.1389 -4.6141 -0.2642 5.0853 18.0107
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.48977 0.61438 25.212 <2e-16 ***
## wkhtot_cent 0.04373 0.04580 0.955 0.3407
## njbspv_cent 0.06084 0.03512 1.732 0.0846 .
## alcwkdy_cent 0.04778 0.02058 2.322 0.0211 *
## eduyrs_cent -0.41034 0.17869 -2.296 0.0226 *
## gndrFemale -2.02355 1.04129 -1.943 0.0532 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.222 on 229 degrees of freedom
## Multiple R-squared: 0.09099, Adjusted R-squared: 0.07114
## F-statistic: 4.584 on 5 and 229 DF, p-value: 0.0005265
anova(model3, model7)## Analysis of Variance Table
##
## Model 1: cgtsday ~ wkhtot_cent + njbspv_cent + gndr
## Model 2: cgtsday ~ wkhtot_cent + njbspv_cent + alcwkdy_cent + eduyrs_cent +
## gndr
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 231 12472
## 2 229 11944 2 528.78 5.0693 0.007011 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(model6, model7)## Analysis of Variance Table
##
## Model 1: cgtsday ~ alcwkdy_cent + eduyrs_cent + gndr
## Model 2: cgtsday ~ wkhtot_cent + njbspv_cent + alcwkdy_cent + eduyrs_cent +
## gndr
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 231 12168
## 2 229 11944 2 224.5 2.1523 0.1186
Having had a look at these, we see that p-value for the third and the last models is less than 0.05, thus, the full model explains our outcome better then the reduced one! Success! Now to the sixth and the full models, since p-value is not significant (0.1186 > 0.05), we can conclude that we should accept H0 and say that adding the new paramentrs did not help in that case.
We can see that we now have a dummy variable gndrFemale. Let’s see why we have women instead of men here.
contrasts(data1$gndr)## Female
## Male 0
## Female 1
R has actually created a gndrFemale dummy variable that takes on a value of 1 if the sex is Female, and 0 otherwise. This decision to code female as 1 and males as 0 (baseline) is arbitrary, and has no effect on the regression computation, but does alter the interpretation of the coefficients. So, we can change it.
data1 <- data1 %>%
mutate(gndr = relevel(gndr, ref = "Female"))
contrasts(data1$gndr)## Male
## Female 0
## Male 1
Now, we see that the summary of the model is a bit different
model7 <- lm(cgtsday ~ wkhtot_cent + njbspv_cent + alcwkdy_cent + eduyrs_cent + gndr, data = data1)
summary(model7)##
## Call:
## lm(formula = cgtsday ~ wkhtot_cent + njbspv_cent + alcwkdy_cent +
## eduyrs_cent + gndr, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.1389 -4.6141 -0.2642 5.0853 18.0107
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.46622 0.80028 16.827 <2e-16 ***
## wkhtot_cent 0.04373 0.04580 0.955 0.3407
## njbspv_cent 0.06084 0.03512 1.732 0.0846 .
## alcwkdy_cent 0.04778 0.02058 2.322 0.0211 *
## eduyrs_cent -0.41034 0.17869 -2.296 0.0226 *
## gndrMale 2.02355 1.04129 1.943 0.0532 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.222 on 229 degrees of freedom
## Multiple R-squared: 0.09099, Adjusted R-squared: 0.07114
## F-statistic: 4.584 on 5 and 229 DF, p-value: 0.0005265
We start the analysis of our model from the F-statistic. In this case we have the following hypothesis:
As it can be seen, the p-value is extremely small (< 0.05) which implies that the probablity to get such F-statistic as we do if H0 is correcct is quite small, therefore, we reject the null hypothesis and conclude that at least one of regression coefficients is not equal to zero. In other words, having such a model is better than having the model that predicts number of cigarettes a person smokes per day based on the mean value.
We proceed to the R-squared which indicates how well the regression function fits with empirical data. In this case, R-squared is equal to 0.07114 which, in other words, means that our model explains around 7% of the variance in the number of cigarettes people smoke per day.
If we were to write down the regression equation, it woulld look like this:
Predicted number of cigarettes one smokes per day = 13.46622 + 0.04373 * number of working hours + 0.06084 * number of people one is responsible for + 0.04778 * grams of alcohol drunk last time on a weekday - 0.41034 * number of years of education + 2.02355 * 1 (being a man)
By looking at this equation we can definitely see that in case of all x being equal to 0 (female, mean number of working hours, mean number of people he is responsible for, mean amount of alcohol drunk last time on a weekday, mean number of years of education attained) the predicted number of cigarettes smoked per day would be 13.46622.
Now, we need to take a look at the plots that help us analyze the model.
plot(model7)‘Residuals VS Fitted’ plot that shows if residuals are distributed linearly or not. We know that residuals represent the difference between number of cigarettes smoked per day predicted by our model and real numbers, and we know that residuals are distributed linearly if the points are equally spread around a straight horizontal line and there is no clear non-linear pattern forming there. The line on the plot is not perfectly straight, but it is still fine to conclude that residuals are distributed more or less linearly (especially compared to the plots of previous models).
The normal probability plot shows if residuals are distributed normally. As you can see, there are some outliers, namely, 37, 124 and 126, but this is not the worst case scenario and apart from those - the line is still quite straight.
In case of the Spread-Location plot, we can check if residuals are spead equally along the ranges of predictors. The line we obrain is not perfectly horizontal - it has a concave curve in the middle and the points are spread more densily in that area, but, overall, it is not as bad.
Finally, the last plot is supposed to help us identify the highly influential observations that don’t get along with the majority of other cases. In this case, we should watch out for dots outside of the line of ‘Cook’s distance’, which is not visible on the plot.
model3 <- lm(cgtsday ~ wkhtot_cent + njbspv_cent + gndr, data = data1)
model6 <- lm(cgtsday ~ alcwkdy_cent + eduyrs_cent + gndr, data = data1)
model7 <- lm(cgtsday ~ wkhtot_cent + njbspv_cent + alcwkdy_cent + eduyrs_cent + gndr, data = data1)
tab_model(model3, model6, model7)| cgtsday | cgtsday | cgtsday | |||||||
|---|---|---|---|---|---|---|---|---|---|
| Predictors | Estimates | CI | p | Estimates | CI | p | Estimates | CI | p |
| (Intercept) | 13.10 | 11.52 – 14.68 | <0.001 | 13.10 | 11.58 – 14.63 | <0.001 | 13.47 | 11.90 – 15.03 | <0.001 |
| wkhtot cent | 0.03 | -0.06 – 0.12 | 0.471 | 0.04 | -0.05 – 0.13 | 0.341 | |||
| njbspv cent | 0.05 | -0.02 – 0.12 | 0.165 | 0.06 | -0.01 – 0.13 | 0.085 | |||
| Male | 2.61 | 0.57 – 4.65 | 0.013 | 2.61 | 0.67 – 4.55 | 0.009 | 2.02 | -0.02 – 4.06 | 0.053 |
| alcwkdy cent | 0.05 | 0.01 – 0.09 | 0.021 | 0.05 | 0.01 – 0.09 | 0.021 | |||
| eduyrs cent | -0.34 | -0.68 – 0.01 | 0.057 | -0.41 | -0.76 – -0.06 | 0.023 | |||
| Observations | 235 | 235 | 235 | ||||||
| R2 / adjusted R2 | 0.051 / 0.038 | 0.074 / 0.062 | 0.091 / 0.071 | ||||||
Our team compared models and concluded that the best one is the final model containing ‘Total hours worked per week in main job overtime included’, ‘Number of people responsible for in job’, ‘Grams alcohol last time drinking on a weekday, Monday to Thursday’, ‘Years of full-time education completed’, and ‘Gender’. It can be seen from the table that the final model is really the best one on the basis of adjusted R squared (adjusted R squared in model 3 is equal to 0,038, in model 6 it is 0,062, and in model 7 it is 0,071) . We compared the final model and the sixth one and concluded that addition of variables related to stress on work did not improve our model significantly. While in case of the third model, addition of other variables worked out and improved the model.
In our work, we reffer to different papers and found out that one of the most common causes of cigarette consumption is stress relief. After checking all the models, we understood that the model that contains not only stress (‘Total hours normally worked per week in main job overtime included (wkhtot)’ and ‘Number of people responsible for in job (njbspv)’) but all the predictors at once works better than any other. That is why when we add ‘stress’ predictors to our second model (which is based on socially constructed factors), the model only slightly improves.
That is it, thank you for your attention!