In this research project we wanted to investigate the topic of environmental problems, people’s attitude towards it and their actual behavioural patterns which might worsen the situation. We chose Germany to be the country around which our projects would be organized as a country which is quite famous for its’ environemntal policies. For instance, in 2018 Germany was ranked the 13th in the world judging by the Environmental Performance Index. Water resources, sanitation, biodiversity, climate and energy are the areas in which Germany performs the best. It’s not only governmental insitutions promoting and executing such campaigns, people in Germany also tend to lead helathy lifestyles. According to the 2017 World Health Index that was released by the Bloomberg, Germany was ranked 16th with a health grade of 84.78 (out of 100). From 2006 to 2016, percentage of vegetarian and vegan Germans doubled from 5 to 10 percent.
We wanted to dive deepeer into the topic of environmental problems and campaigns in Germany and analyze some of the socio demographic trends that might have something to do with it.
Source:
https://epi.envirocenter.yale.edu https://www.kelownanow.com/news/news/National_News/17/03/20/2017_healthiest_country_index/ https://www.handelsblatt.com/today/companies/meat-market-germanys-eating-habits-die-hard/23580890.html?ticket=ST-3046082-VTR4Sbac9rdgiLu5ghCM-ap2
The data we are using in these projects comes from the official website of the European Social Survey, includes variables of such themes as “Socio demographics”, “Public attitude to climate change”, “Social inequalities of health”, “Media and social trust” and “Gender, household”. We are mostly using data collected in 2016 (round 8 of the ESS), but the data from the theme “Inequalities in health” was collected in 2014.
One of the many things we like about the ESS is that their team isaware of the importance of the quality of data, which is why they always monitor it using different measures such as Multitrait-Multimethod (MTMM) approach (to assess reliability, validity and method effects of the questions), Survey Quality Predictor (SQP). They additionally check Measurement equivalence and Measurement quality of concepts and publish quality reports where the methodologies are described. This makes us think that we can actually think of the assumption of observations’ indevendence as being met. We can also easily check the description of the variables we are using and since the data that is provided is well-structured it makes it a lot easier to get everything we need for our projects ready.
So what can you expect from this research project? Here is a little sneak peak:
Please, enjoy!
Disclaimer: We have decided for each project to leave data loading as it is since it would complicate the understanding of what variables we use for every project. Also, each topic covered is a piece of analysis itself so we felt like deleting this part would be useless if someone was to take our code and try to check it on his or her own data.
Team: Rteam&DA
Areas of members’ responsibilities:
In the following projects we will be using various variables, descriptives and graphs. In order to make clear how it is done here we present detailed description of graph construction and interpretation, variable types and basic descriptive statistics.
We start with loading packages that will be used.
library(foreign)
library(ggplot2)
library(knitr)
library(sjPlot)
library(corrplot)
library(gridExtra)
library(dplyr)
library(car)
library(psych)
library(magrittr)
library(kableExtra)
library(DescTools)
library(lsr)
library(RColorBrewer)
library(sjmisc)
library(sjlabelled)
library(margins)
Then we upload dataset and create separate frame with variables of interest. Also, we remove NA’s and convert variables to make it easier to work with them.
data <- read.spss("ESS8DE.sav", to.data.frame = TRUE, use.value.labels = TRUE)
save <- c("prtclede", "eisced", "lrscale", "eduyrs", "gndr", "nwspol")
data1 <- data[save]
data1 <- na.omit(data1)
data1$eduyrs <- as.numeric(as.character(data1$eduyrs))
data1$nwspol <- as.numeric(as.character(data1$nwspol))
data1$lrscale <- as.character(data1$lrscale)
data1$lrscale[data1$lrscale=="Left"] <- "0"
data1$lrscale[data1$lrscale=="Right"] <- "10"
data1$lrscale <- as.numeric(as.character(data1$lrscale))
data1$prtclede <- as.character(data1$prtclede)
data1$eisced <- as.character(data1$eisced)
As it can be seen, we took four core and two additional variables of different types of measurement for the analysis. Two summary tables with basic descriptive statistics of our variables are created and presented below.
| Code of variable | Name of variable | Type of variable | Level of measurement |
|---|---|---|---|
| prtclede | Political party pfererence | Qualitative | Niminal |
| eisced | Highest level of education obtained by a person | Qualitative | Ordinal |
| lrscale | Placement on political left-right scale | Quantitative | Interval |
| eduyrs | Number of years spent on education | Quantitative | Ratio |
| Variable | Mean | Median | Mode | Variance | SD |
|---|---|---|---|---|---|
| Political party pfererence |
|
CDU/CSU | CDU/CSU |
|
|
| Highest level of education obtained by a person |
|
ES-ISCED IV, advanced vocational, sub-degree | ES-ISCED IIIb, lower tier upper secondary |
|
|
| Placement on political left-right scale | 4.3 | 5 | 5 | 3.95 | 1.99 |
| Number of years spent on education | 14.6 | 14 | 13 | 11.63 | 3.41 |
We start with an interval variable which in our case is political party preference(prtclede). A list of political German parties was given to respondents in survey and they needed to answer the question ‘Which party do you feel closer to?’.
According to the table shown above, median of this variable is CDU/CSU and mode is CDU/CSU.
Below you can see two barplots visualizing the distribution of answers: one with pure numbers (frequences of such answers) and one with percentages.
As seen from the barplots, the most frequently reported as preferable party was CDU/CSU which is Christian democratic union/Christian social union. This option was chosen by one third of respondents in Germany. The next goes Social Democraic Party with 27.1% which is also quite a large share. Several parties were chosen by a relatively small number of people, but there are also two parties which people did not report as preferable almost at all. These are Piratenpartie (Pirate party) and National Democratic Party with less than 1% both.
As ordinal variable we took was the highest level of education obtained by a person according to the International Standard Classification of Education (eisced). It is ordinal because levels of education are ranged from the lowest to the highest ones.
Going back to the summary table, median of this variable is ‘ES-ISCED IV, advanced vocational, sub-degree’ and mode is ‘ES-ISCED IIIb, lower tier upper secondary’.
Barplots below represents distribution of levels of education in pure numbers and in percentages.
In barplots levels of education are ranged from the bottom to top of vertical axis. We see that lower tier upper secondary education is the most frequently reported level among German people, more than one third picked it to be precise. Levels that are higher than lower tier upper secondary are also more or less frequent in Germany. It is not so common to have education lower than lower tier upper secondary education. And to highlight, very little percentage (1.4%) of people have less than lower secondary education.
We decided to take placement on political left-right scale (lrscale) as an example of interval variable. We consider it as interval because it shows position in political spectrum in numbers the distances between which are equal. Also, zero here does not mean absolute zero or absence of something but marginal position on the scale.
Summary table at the beginning of our report tells that mean of the variable is equal to 4.3, median is 5 and mode is 5.
In the graph below the distribution of variable is represented.
From the histogram it is seen that in Germany people tend to be more at the center of political spectrum and there is small number of radical people. So, the distribution seems to be almost normal. It is unimodal, the median and the mode has the same value and the mean quite close to it.
Our example of ratio variable is number of years spent on education (eduyrs). It is a continious variable which has the value of absolute zero.
Central tendency measures are following: mean - 14.6, median - 14 and mode - 13.
In the figure below there is a histogram which shows distribution of a variable.
So, as it is seen from the histogram, German people spend about 15 years on education. There are some cases of larger number of years, but there are not so many people who spend less than 5 years on education. The distribution is unimodal; mean, median and mode are close to each other, but not the same. Generally, we can to say that distribution is normal.
And now we are going to visualize different bivariate distributions for two categorical, two continious and continious and categorical variables.
Let’s start with two continious variables. The visualization of this bivariate distribution will be a scatterplot. We will show distribution of time that was spent on education and time spent watching political news, which was taken additionally.
From the scatterplot it is hard to say if there is a correlation between two given varibles. We can assume that people who spent average number of years on education tend to spend more time watching political news (higher values of the variable are obsereved in case of people who have from 10 to 20 years of education). But those who spent a lot or almost no time on education do not spent much time on political news.
The next distribution is for one categorical and one continious variables. It can be visualized with the help of a boxplot which is given below. We decided to check relationship between gender and number of years spent on education.
As we see from boxplot, the difference between the number of years females and males spent on education is quite small. Medians seems to be almost the same, but 75th percentile for males is a little bit higher than for females. The box representing male respondents is a bit bigger which indicates that the values are a bit more spread out. While the median number of years of education for females is almost in the middle of the box the median of years of education for males divides the box in something like 60%/40% shares which means that there are a bit more males who have spent more years on education. There are relatively few outliers in case of both females and males.
And the last bivariate distribution is for two categorical variables. Again, we touch the topic of gender and education, bit now we want to see relationship between gender and the highest level of education attained by a person. The stacked barplot below represents distribution.
Here we decided to look at percentages of males and females within each category. That is why in every level of education total number of people is taken as 100%. So, as we see, in the lowest level, namely ‘less than lower secondary’ males are predominant and hold almost two thirds of a group. When it comes to the secondary education, more females reported this level as the highest level of education they obtained. However, as we move to the higher levels of educationthe share of males increases and on the level of lower tertiary education there are 70% of males. In case of higher tertiary education level, shares of males and females are almost the same.
Don’t forget about these shares! We will come back to them a bit later…
We have reviewed four variables with different levels of measurements, visulized and analyzed their distributions and how they can possibly interact with each other.
Team: Rteam&DA
Areas of members’ responsibilities:
This project is a starting point of our reserach on topic of public attitudes to climate change. Here we use data dated 2016 from two themes, namely, Social Demographics and Public attitudes to climate change to investigate the connection between them.
Our very first research question here is whether it’s females of males who actually worries more about the environemntal problems and in order to answer it we are using two categorical variables, namely, gender and people’s attitudes towards increasing taxes on fossil fuels, such as oil, gas and coal with the aim to reduce climate change.
We hypothesize that females are more likely to be in favor of an increase in taxes beacause of being more empathetic and emotional and therefore being more likely to be moved by watching commercials on topic of severe environemntal damage fossil fuels actually cause. According to the article published by the Journal of Consumer Research, greenness and environemntal behaviour are quite often stereotyped as being feminine in the eyes of both people of such behaviour and others. Because of that, men might feel like engaging in such activities or supporting them might make them less musculine in eyes of the others and tend to avoid such behaviours.
Source: https://academic.oup.com/jcr/article-abstract/43/4/567/2630509?redirectedFrom=fulltext
# Uploading data
data <- read.spss("ESS8DE.sav", use.value.labels = TRUE, to.data.frame = TRUE)
# Chosing variables to work with
save <- c( "gndr", "eduyrs", "inctxff")
data1 <- data[save]
# Deleting NA's
data1 <- na.omit(data1)
#Creating variables and assigning values to them
gender <- data1$gndr
att <- data1$inctxff
Here is a short description of our variables:
## Male Female
## 1493 1319
## Strongly in favour Somewhat in favour
## 241 816
## Neither in favour nor against Somewhat against
## 667 797
## Strongly against
## 291
We want to perform a Chi-squared test in order to find out if gender and attitudes towards increase of taxes on fossil fuels are independent or not independent.
Now we proceed to the visualization of the distribution of our variables. Here is a stacked barplot visualizing gender composition for each point of view and bar chart which shows frequences of answers in each category by gender.
#Creating plots
set_theme(legend.pos = "top", legend.inside = TRUE, axis.textsize = 0.8, title.align = "center")
plot1 <- sjp.xtab(att,gender, bar.pos = "stack", legend.title = "Gender",
axis.titles = "Attitude towards tax increase", title = "Gender composition of taxation attitude", show.total = FALSE, margin = "row",
geom.colors = (palette = "Pastel2"))
plot2 <- sjp.grpfrq(att, gender, type = "bar", legend.title = "Gender", geom.spacing = - 1,
axis.titles = "Attitude towards tax increase", title = "Attitude distribution by gender", show.prc = FALSE, geom.colors = (palette = "Pastel2"))
grid.arrange(plot1, plot2, ncol=2)
As it can be seen from the graphs, the proportion of men and proportion of women in all five of the options are not equal. The only option in which the proportion of women is bigger is “Neither in favour nor against”, while the biggest difference is observed in the “Strongly against”" category. The second biggest difference is observed in “Strongly in favour” category, while remaining two categories differ by around 10%. “Somewhat against” was the most popular option for men, “Neither in favour nor against” - for women. “Strongly in favour” was chosen by the fewest number of men and women.
Interestingly enough, it seems like, in general, women tend to stick to the “Neither in favour nor against” option, remaining neutral, while men are likely to express their opinion and chose a side.
In order to actually perform the chi-squared test and find out if these variables are independent we need to make sure that all necessary assumtions of the test are met:
Here are hypotheses for the test:
In order to check independece of chosen variables we need to apply Pearson’s Chi-squared test. For this it is necessary to create a contingency table which contains observed frequencies.
ct<-table(gender, att)
kable(ct)
| Strongly in favour | Somewhat in favour | Neither in favour nor against | Somewhat against | Strongly against | |
|---|---|---|---|---|---|
| Male | 143 | 440 | 281 | 446 | 183 |
| Female | 98 | 376 | 386 | 351 | 108 |
Then, we run Pearson’s Chi-squared test using function chisq.test().
test <-chisq.test(ct)
test
##
## Pearson's Chi-squared test
##
## data: ct
## X-squared = 50.03, df = 4, p-value = 3.56e-10
We obtained the Chi-square statistic of 50.32 and a p-value equal to 3.096e-10 (0.0000000003096) having the degree of freedom 4. The critical value of x2 with degree of freedom 4 and significance level 0.05 is 9.49. Obtained Chi-square statistic exceeds such critical value, and p-value is a lot smaller than the significance level of 0.05, meaning that the probability to obtain the observed, or more extreme, results if the null hypothesis (H0) of a study question is true (variables are independent) is extremely low. Therefore we conclude that variables are not independent: the gender of a respondent and their opinion regarding the increase of taxes are not independent.
Since the Chi-square test statistic is significant, we also would like to take a look at residuals. So, let’s create tables with expected and observed frequences and then with residuals.
| Strongly in favour | Somewhat in favour | Neither in favour nor against | Somewhat against | Strongly against | |
|---|---|---|---|---|---|
| Male | 127.9563 | 433.2461 | 354.1362 | 423.1583 | 154.5032 |
| Female | 113.0437 | 382.7539 | 312.8638 | 373.8417 | 136.4968 |
| Strongly in favour | Somewhat in favour | Neither in favour nor against | Somewhat against | Strongly against | |
|---|---|---|---|---|---|
| Male | 143 | 440 | 281 | 446 | 183 |
| Female | 98 | 376 | 386 | 351 | 108 |
| Strongly in favour | Somewhat in favour | Neither in favour nor against | Somewhat against | Strongly against | |
|---|---|---|---|---|---|
| Male | 2.030799 | 0.5623422 | -6.497204 | 1.915285 | 3.535362 |
| Female | -2.030799 | -0.5623422 | 6.497204 | -1.915285 | -3.535362 |
According to the table, there are residuals with absolute value bigger than 2 (‘Strongly against’, ‘Neither’ and ‘Strongly in favour’). We then proceed to draw a plot in order to take a closer look at them.
We observe that in regards to male respondents “Neither in favour nor against” category contains less observations than we would expect in case of variables independence. For women it’s the other way around: we would expected more observations in this category. When it comes to the categories “Strognly and favour” and “Strogly against”, we would expect to have less observations there in case of men and more observations in case of women.
The same situation can be observed by using a plot drawn below. There is a strong positive association between female respondents and “Neither in favour nor against” while there is a strong negative association between male respondents and this category. There is a moderate positive strong association between “Strognly and favour” and “Strogly against” categories and male respondents, while in case of female respondents this association is negative.
Overall, we can conclude that chosen variables turned out to be not independent: attitude towards the increase in taxation on fossil fuels, such as oil, gas and coal with the aim to reduce climate change gender of respondets were not independent. Our hypothesis was, in fact, not verified: females tend to choose the “Neither in favour nor against” option, staying neutral, while males prefer to choose either of two sides of the argument, still having a tendency to be against the increase of taxes, more than we would expect them to in case of variable independence.
So what might be the reason behind this phenomenon? We decided to analyze gender and the total number of years spent on education, hoping to find an explanation.
Since we want to check if the mean number of total number of years spent on education is different for females and males we decided to apply the T-test. Samples in this case are independent but not paired since we just have two categories which are males/females where we measure the same thing. Let us get a grip on our data:
summary(data1$eduyrs)
## 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
## 2 3 1 5 4 0 47 67 138 226 421 427 257 276 272 191 165 113
## 20 21 22 23 24 25 26 27 28
## 86 37 27 26 7 9 3 1 1
So, first of all, our variable is of a factor class. Later during the analysis we will turn it into a numeric one, so all in all, it is a continuous variable, ratio scale. Moving on to the histogram of distribution of the variable, on the x-axis we see the values which show amount of years, while on the y-axis it is the number of times this value has been encountered. So, it can be seen from the graph that distribution seems normal, but we still will prove this assumption via QQ plot.
ggplot(data1, aes(x = as.numeric(eduyrs), fill = gndr)) +
geom_histogram(binwidth = 1, position = "identity", alpha = .8) +
theme_minimal() +
scale_fill_brewer(palette = "Pastel2") +
ggtitle("Variable's distribution") +
xlab("Years spent on education") +
ylab("Frequency") +
guides(fill=guide_legend(title="Gender"))
Our next step is a graphical depiction of numerical data groups through their quartiles.
ggplot(data, aes (x = gndr, y = as.numeric(eduyrs), fill = gndr)) +
geom_boxplot() +
ggtitle("Time spent on education for different sexes") +
xlab("Gender") +
ylab("Years spent") +
scale_fill_brewer(palette = "Pastel2")+
guides(fill=guide_legend(title="Gender"))
Here, we can see that the median figures of males and females are slightly different - with men having 13 years and women having 12 years.Though we do have some outliers - there are not many of them so we can go on to analysis without bootstrapping.
Let us check the normality a second time - with qqplots. Here we see QQ plot which compares two probability distributions due to plotting their quantiles against each other. So, the QQ plot shows that two compared distributions are similar and normal.
female <- subset(data1, data1$gndr == "Female")
male <- subset(data1, data1$gndr == "Male")
plot3 <- qqnorm(as.numeric(female$eduyrs)); qqline(as.numeric(female$eduyrs, col = 2))
plot4 <- qqnorm(as.numeric(male$eduyrs)); qqline(as.numeric(male$eduyrs, col = 2))
We then make sure to match all necessary assumptions of t-test - let’s test homogeneity of our variances:
var.test(as.numeric(data1$eduyrs) ~ data1$gndr)
##
## F test to compare two variances
##
## data: as.numeric(data1$eduyrs) by data1$gndr
## F = 1.0052, num df = 1492, denom df = 1318, p-value = 0.923
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.905048 1.116140
## sample estimates:
## ratio of variances
## 1.005241
Having done the test for testifying variances, we conclude that p-value is large (bigger than 0.05) and we have no right to reject H0, thus, variances are equal.
*The groups are sampled from normal distributions with equal variances - the assumption holds.
Here are hypotheses for the test:
t.test(as.numeric(data1$eduyrs) ~ data1$gndr, var.equal = TRUE)
##
## Two Sample t-test
##
## data: as.numeric(data1$eduyrs) by data1$gndr
## t = 2.6902, df = 2810, p-value = 0.007184
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.09108817 0.58085830
## sample estimates:
## mean in group Male mean in group Female
## 13.44742 13.11145
Since our p-value is very small (smaller than 0.05), we have to reject H0 which states that our means are not significantly different. According to the test, men get 13.4 years of education and women - 13.1.
Here we used the distribution-free nonparametric test, which is generally defined as the hypothesis test which is not based on underlying assumptions, because our independent variables are non-metric. So, here we presented Wilcoxon Test to check our results.
wilcox.test(as.numeric(data1$eduyrs) ~ data1$gndr)
##
## Wilcoxon rank sum test with continuity correction
##
## data: as.numeric(data1$eduyrs) by data1$gndr
## W = 1050800, p-value = 0.001959
## alternative hypothesis: true location shift is not equal to 0
Again, having conducted the test, we come to the conclusion that our means are significantly different since p-value is really small and we reject H0.
Overall, we state that, on average, men get more education (if we are to measure it in years). Since we have already discovered that men tend to choose to vote either for increasing taxes or not, whereas women tend to stay neutral, we hypothesize that one of the factors that influences such decision is their education. Hypothetically, bigger mean in terms of education in years for men might be used to explain their attitude towards the increase of taxation. For instance, education might allow them to gain more knowledge about both taxation system and environment problems, which may result in them being able to choose a side: to be against increasing taxes or support such a change. On the other hand, women who are less educated, might lack knowledge to make such a decision and prefer to stay neutral.
Team: Rteam&DA Areas of members’ responsibilities:
In this project we wanted to proceed with our reserach and investigate how education (mean years of which we found to be different for females and males in a previous project) is connected to the amount of time peopl spend watching political news. How politics are connected with environemntal problems you ask? Well, governemnt and the ruling political party in particular is one of the most important entities when it comes to establishing and increasing taxes of fossil fuels and other substances or activities that harm environment.
According to the acrticle published in “Deutschland.de” there are 10 main thing that Germany is doing for the environemnt including nuclear power plants being shutdown, Renewable Energy Act (EEG) being constantly updated and extensively used, taxating activities that are harmful for environment etc. Those changes would be impossible to make if it wasn’t for the german government tha cares about environemnt so much. In fact, there is a special word - “Energiewende” that is the name of the country’s planned transition to a low-carbon, nuclear-free economy governed by such entities as german federal government, federal legislature, federal states etc.
Sources: https://www.deutschland.de/en/topic/environment/10-things-germany-is-doing-for-the-environment https://www.cleanenergywire.org/easyguide
In other words, for people who actually care about environment policits is extremely important. But we want to answer a question of whether there is a connection between education of a person and time he/she spends on news related to politics.
Initially, our choice of variables was driven by the hypothetical assumption that people who are more educated tend to watch/read more about the politics. First of all, they have more free time in comparison with people with low educational level that have to start work as early as possible. Secondly, more educated people have better understanding of what is going on in the political spehere of life (and environmental issues as well, as it was noted previously) and are more interested in it because of it.
We have found that according to the survey published in the article “The conversation” in 2017, people on unstable job positions show less interest in politics in comparison to people of economically secure positions and are more likely to be an undecided voter or to not vote at all. Nearly 40% of people from economically insecure positions claimed that they can not change anything with their vote. Instead of trying to rebel against existing situation, these people just turn their backs to politics. Our data was collected in 2016, the article was published in 2017 - both before the presidential elections in 2017.
We can hypothesize that from sociological point of view these two ideas are connected: people with lower education tend to get unstable jobs and turn their backs to politics both because they feel like thier votes don’t count and becuase they have little knowledge of how the system works. It is the other way around for people with higher education.
Can our data support such conclusions? Let’s find out.
Here we upload data and delete NAs from our variable of interest which is the highest level of education obtained by a person according to the International Standard Classification of Education (ISCED).
data1 <- read.spss("ESS8DE.sav", use.value.labels = TRUE, to.data.frame = TRUE)
data1 <- data1[!is.na(data1$eisced),]
While there is also a continious variable for total years of education (eduyrs), whe have chosen the categorical variable (eisced) which indicates the highest level of education obtained by a person according to the International Standard Classification of Education (ISCED). According to ESS website, this variable serves as the best indicator of the education.
Lets’s have a look how this varible is structured.
## Not possible to harmonise into ES-ISCED
## 0
## ES-ISCED I , less than lower secondary
## 74
## ES-ISCED II, lower secondary
## 262
## ES-ISCED IIIb, lower tier upper secondary
## 1050
## ES-ISCED IIIa, upper tier upper secondary
## 123
## ES-ISCED IV, advanced vocational, sub-degree
## 597
## ES-ISCED V1, lower tertiary education, BA level
## 288
## ES-ISCED V2, higher tertiary education, >= MA level
## 440
## Other
## 0
Now, let’s have a look at the bar chart representing how the observations are distributed in groups.
Even on the ESS website it is sugested to recode this variable into a new variable with three levels: primary (1,2), secondary (3,4) and tertiary (5,6,7) education, so we decided to do so with a slight change in number of levels. We assigned first two categories (less than lower secondary, lower secondary) into “Low Secondary” level, following two categories (lower tier upper secondary, upper tier upper secondary) were assigned to the “Upper Secondary” level, advanced vocational categorie remained as it is, new level “Tertiary” was created for last two categories (lower tertiary education, higher tertiary education). All of the changes have been done in order to make groups more comparable.
We can now look at the way our new variable is distributed.
As seen from the barplot, the majority of people have upper secondary education; their share in population is a little bit more than 40%. The low secondary education on the contrary is the least widesread in Germany. Tertiary education is attained by a quarter of population. And people with advanced vocational education account for one fifth of the poopulation.
The second variable (nwspol) of our interest is connected with the topic of political news. This variable represents the time each respondent spends everyday reading, watching or listening to news connected with politics (in minutes).
And now let’s visualize this variable as a density plot and as a histogram.
ggplot(data1, aes(x = nwspo)) +
geom_histogram(fill="#b3e2cd")+
labs(title = "Distribution of time spent on political news", x = "Minutes", y = "Frequency")+
geom_vline(aes(xintercept = mean(data1$nwspo, na.rm = TRUE), colour="Mean"), lwd=1.1 )+
labs(colour="")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Well, it is seen from the histogram that the distribution does not look normal. It is right-skewed. Also, there is no general trend: big and small values alternate with each other. Speaking about density plot, it is double humped, as it is clearly seen. The mean takes the value of approximately 22.5 minutes. The most people chose an answer around 23-25 minutes, which is the mode.
Since we want to work with these variables together, we need to visualize descriptives of the continious variable by groups of categorical one.
| Education | N | Mean | SD | Median | Min | Max | Skew | Kurtosis | St.error |
|---|---|---|---|---|---|---|---|---|---|
| Advanced vocational | 597 | 66.64 | 75.18 | 60 | 0 | 960 | 6.52 | 67.46 | 3.08 |
| Low Secondary | 336 | 54.49 | 76.03 | 30 | 0 | 720 | 4.72 | 30.13 | 4.15 |
| Tertiary | 728 | 73.84 | 94.68 | 60 | 0 | 1119 | 6.33 | 53.33 | 3.51 |
| Upper Secondary | 1173 | 67.19 | 83.86 | 60 | 0 | 1200 | 6.99 | 72.87 | 2.45 |
We hypothesize that education of a respondent and the time he/she spends watching political news are not independent variables. But for this particular project we want to see if the division into groups would be useful in the explanation of variance.
mean(data1$nwspo)
## [1] 67.27488
As we see it from the aggregated data, difference in means between the categories and between the mean for the whole dataset is relatively small.
If we were to judge solely by the table, it would be quite hard to draw any conclusions: means of the groups seem to be quite similar (22,76; 18,79; 23,91; 22,57) to each other. Even though we might expect the F-ratio to be quite small, in order to check whether there is, in fact, a statistically significant difference in group means we need to run the formal test - ANOVA.
Let’s create a boxplot to see how these two variables look together.
Well, three out of four groups (advanced vocational, tertiary and upper secondary education) look quite similar with about the same means, however, outliers in these groups are distributed differently. The group of low secondary level of education looks differently from the other two with lower mean. So, according to this boxplot, it is possible to assume that having low secondary education implies spending less time on reading or watching political news, while other three levels of education do not seem to affect variability in time-spending, however all three groups show bigger numbers of minutes spent on watching/reading about politics.
We are aware of the fact that ANOVA is a parametric test, which is why we have to check these assumptions before working on it:
In order to check whether variances of our variables are equal, we now are moving on to Levene’s test for homogenity of variances.
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 3 0.2277 0.8771
## 2830
Since P-value is equal to 0.06 which is relatively big compared to 0.05, we have 0.06 probability to get the data we are working with if H0 is true for the population. We have no right to reject the null hypothesis and conclude that variances do not differ from each other.
We now start working with the actual F-test which is used in order to identify non-equality among the groups in general.
These are the hypothesis for the test:
##
## One-way analysis of means
##
## data: data1$nwspo and data1$educ
## F = 4.0716, num df = 3, denom df = 2830, p-value = 0.006758
## Df Sum Sq Mean Sq F value Pr(>F)
## data1$educ 3 86598 28866 4.072 0.00676 **
## Residuals 2830 20063657 7090
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
F(3, 2830) = 16.01, p-value < .001 means that the null hypothesis should be rejected (the probability to obtain such data that we have if H0 was true for the population is low), thus, the difference in the time spent watching news across education groups is statistically significant.
Now, in order to determine normality graphically we can use the output of a normal Q-Q Plot and check the third assumption: if residuals are normally distributed or not. So, we create a Q-Q plot to do this.
We can see how the data points are close to the diagonal line in Q-Q plot and almost straight red lines in upper two graphs. Therefore we can conclude that our residuals are distributed normally.
But we would like to check it again with a more formal procedure.
## vars n mean sd median trimmed mad min max range skew
## X1 1 2834 0 84.16 -13.84 -12.88 39.51 -73.84 1132.81 1206.65 6.57
## kurtosis se
## X1 63.21 1.58
Skew and kurtosis should be < 2, here we see that they are 0.62 and 0.67 respectively
We are now moving on to the Shapiro-Wilk normality test.
##
## Shapiro-Wilk normality test
##
## data: anova.res
## W = 0.51576, p-value < 2.2e-16
Shapiro test helps us to check hypothesis about normality. Unfortunately, p-value is extremely small (0.05<) which indicates that the residuals are not distributed normally because as a rule if p-value is less or equal to 0,05 then we should conclude that the residuals did not come from a Normal distribution. But let’s take a look at the histogram.
Overall, skew and kurtosis were OK and the histogram looks at least somewhat appropriate even though the Shapiro-Wilk test that our residuals were not distributed normally. As a conclusion of the visual analysis, we claim that normality assumption holds.
We have already established that F-ratio is statistically significant and now we want to find out which specific groups’ means (in comparison with each other) are statistically different. For this purpose we move on to the Tukey’s Honestly Significant Differences’ post hoc test. We chose it over the pairwise t-test with Bonferroni’s correction since we have already established that the variances of groups are equal.
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = data1$nwspo ~ data1$educ)
##
## $`data1$educ`
## diff lwr upr
## Low Secondary-Advanced vocational -12.1513969 -26.9126544 2.609861
## Upper Secondary-Advanced vocational 7.2041434 -4.7465863 19.154873
## Tertiary-Advanced vocational 0.5518899 -10.3296360 11.433416
## Upper Secondary-Low Secondary 19.3555403 5.0805828 33.630498
## Tertiary-Low Secondary 12.7032867 -0.6893261 26.095900
## Tertiary-Upper Secondary -6.6522535 -16.8643687 3.559862
## p adj
## Low Secondary-Advanced vocational 0.1481741
## Upper Secondary-Advanced vocational 0.4079735
## Tertiary-Advanced vocational 0.9992089
## Upper Secondary-Low Secondary 0.0028053
## Tertiary-Low Secondary 0.0702960
## Tertiary-Upper Secondary 0.3374114
By looking at adjusted p-values for all possible pairs, we can see that the difference between group means is statistically significant only in case of three pairs out of six. P-values of comparison between Tertiary and Advanced vocational, and Upper Secondary and Advanced vocational education are exceptionally big, while the p-value of comparison between Upper Secondary and Tertiary education is just slightly bigger than 0.05. We can also make sure that everything is right by looking at actual differences between grop means: group comparisons with large p-values have small difference in means, namely, 1.1560412, -0.1901665 and -1.3462077.
We now proceed to plot these differences in means.
By looking at the plot, it is quite clear that two lines (Tertiary - Advanced vocational and Upper Secondary - Advanced vocational) cross the zero value while it is unclear if the third line (Upper Secondary - Tertiary) crosses it or not. However, we already know that, even though the p-value reported for this pair is quite small, it is still bigger than 0.05 and, in fact, this line does cross the zero value. Remaining three lines do not cross the zero value which means that differences between these groups are statistically significant.
Now we want to double check our results by performing the non-parametric equivalent of ANOVA - the Kruskal-Wallis test. Results of it will be less reviable, but we wanted to compare it with the results of actual ANOVA. In this case we do not have to check any assumptions and can just proceed to the test and its hypothesis.
##
## Kruskal-Wallis rank sum test
##
## data: nwspo by educ
## Kruskal-Wallis chi-squared = 54.94, df = 3, p-value = 7.072e-12
As we can see, with KW chi-square(3) = 54.94, p-value is < .001, which which means that the null hypothesis is rejected and the differences between mean ranks of groups turn out to be statistically significant. This result confirms what we saw earlier in the ANOVA test.
Since the result of this test is significant, we would also like to run Dunn’s test.
##
## Dunn's test of multiple comparisons using rank sums : holm
##
## mean.rank.diff pval
## Low Secondary-Advanced vocational -313.73514 5.2e-08 ***
## Upper Secondary-Advanced vocational 76.73315 0.1718
## Tertiary-Advanced vocational -18.25290 0.6537
## Upper Secondary-Low Secondary 390.46829 1.5e-12 ***
## Tertiary-Low Secondary 295.48224 1.8e-08 ***
## Tertiary-Upper Secondary -94.98605 0.0385 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As it can be seen, there are three groups with statistically significant difference in their medians.
Now we want to know what is the substantive significance of group membership in determining the time spent watching political news. Therefore, we look at the share of variance that is explained by education groups in all the variance of time spent watching/reading political new - at the ‘effect size’.
## [1] 0.003240955
We get the result of 0.015 which indicates a low-medium effect size. It means that, even though there is a statistically significant difference in level of time spent watching political news across education groups and this difference is statistically significant across three pairs, in practical terms this effect is not large at all.
Overall, having conducted the tests, we have come to the conclusion that different groups actually DO spend different amounts of time on political news. We conclude that people with tertiary education spend more time than others reading/listening the news. At the same time people with low secondary education only spend the least amount of time on news. Thus, we also state that the difference for tertiary vs. advanced vocational and upper secondary vs. advanced vocational was not significant.
So how does this fit into our reserach? In the previous project we showed that men tend to have bigger number of years spent on education in comparison with women and hypothesized that this might be the reason why women tend to stay neutral when it comes to environemntal problems. Now we concluded that there is statistically significant difference across different educational groups when it comes to the time spent on political news. We then hypothesize that women might be neutral to environemntal problems not only directly because they simply do not know enough about them but because they do not watch political news in which environemental issues are quite often brough up (since people with lower education watch political news less than others).
These two projects were concerned with the question of whether men or women care more about environmental problems, pay attention to political news connected with them and support environemntly-friendly campaigns. We also suggested some of the probable reasons why they do so. In the following project we shall see if men not only say that they care about environemnt more often then women do but actually take action.
Team: Rteam&DA Areas of members’ responsibilities:
Smoking is one of the most important problems of our time. Everyone knows that smoking entails a high risk to human health. Because of smoking, people have different kinds of diseases, whether it is cardiovascular or respiratory ones, but what is more smoking increases mortality around the whole world. This is only one side of the coin. It is important to remember that smoking is harmful not only for smokers and people around them, it harms the environment as well. And it happens by emission of toxic air pollutants into the atmosphere. Not only tobacco smoke is dangerous, but also cigarette butts are dangerous because their toxic chemicals seep into the soil and waterways, which respectively causes soil and water pollution. We should not forget that under the danger are also animals and vegetation, because they absorb toxic substances from cigarette butts.
We have already included the topic of environmental problems in one of our projects - we investigated if the opinion on the increase in taxes on fossil fuels such as oil, gas and coal was dependent on gender of respondents. Since, as it was mentioned, smoking causes air pollution, it is important not only to support increased taxes on fossil fuels polluting the atmosphere, but also to make individual contribution: start using cars less, drive a car working on electricity etc. Quitting smoking - is another step that can be made towards the future with a clear air.
Beyond a shadow of a doubt, smoking is an extremely serious problem and it is needed to be tackled. In our work, we would like to dig deeper into the smoking problem in such country as Germany and try to see which variables can be used as predictors of smoking.
There are quite a lot of articles written on this topic, so it is possible to say that smoking in Germany is not only a relevant, but also quite a pressing problem. Only in Germany between 100-120 thousands people die because of smoking every year (http://www.gbe-bund.de/pdf/DEGS1_Verbreitung_Rauchen_E.pdf). But what is more interesting is that Germany portrays itself as a country with people leading a really healthy lifestyle. The author of the article on the website The Local (https://www.thelocal.de/20181120/opinion-why-germany-needs-to-take-anti-smoking-laws-more-seriously) emphasizes that it seems astonishing that Germans spend a lot of time and money on their health and stand in long lines for healthy food, but still smoke quite a lot.
We proceed by uploading out dataset and choosing variables of particular interest to us. We also remove ‘NA’ values from these variables and convert them to numeric type.
data <- read.spss("ESS7DE.sav", use.value.labels = TRUE, to.data.frame = TRUE)
save <- c("wkhtot", "njbspv", "cgtsday", "alcwkdy", "gndr", "eduyrs", "agea")
data1 <- data[save]
data1 <- na.omit(data1)
data1$wkhtot <- as.numeric(as.character(data1$wkhtot))
data1$njbspv <- as.numeric(as.character(data1$njbspv))
data1$cgtsday <- as.numeric(as.character(data1$cgtsday))
data1$alcwkdy <- as.numeric(as.character(data1$alcwkdy))
data1$eduyrs <- as.numeric(as.character(data1$eduyrs))
data1$agea <- as.numeric(as.character(data1$agea))
Six variables in total are used in our linear model construction: 4 of predictor variables are continious and 1 is categorical.
The dependent variable in our model is ‘number of cigarettes smoked on typical day (cgtsday)’. It is a continious ratio variable. Descriptive statistics as well as distribution of variable are shown below.
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 234 13.87 7.81 13 13.59 10.38 0 40 40 0.39 0.01
## se
## X1 0.51
From descriptive statistics we see that mean and median are close to be equal (app. 13-14 cigarettes per day) while the biggest number of cigarettes respondents smoke per day is 40. Even though kurtosis, as well as skew, indicates normality, as seen from histogram and density plot below, distribution does not seem normal, it is not bell-shaped and unimodal. In fact, we can say that it is bimodal (the biggest number of respondents reported smoking around 10 and around 20 cigarettes per day). Moreover, it is right-skewed because of the long tail (respondent smoking 40 cigarettes per day)
Then we proceed to the predictor variables used in our model.
The first predictor is ‘Total hours normally worked per week in main job overtime included (wkhtot)’. It is a continious ratio variable. Descriptive statistics as well as distribution of variable are shown below.
Why did we consider adding this variable to our model?
Both the number of working hours (overtime included) and the number of people one is responsible for were investigated as possible predictors in future models due to the stress a busy working week and many people under control might cause.
A lot of articles and papers suggest reasons why people smoke, and almost in every of them one of the most frequently reported reasons is stress relief. It is a quite common myth that tobacco helps to decrease a stress level in our lives (https://addictionblog.org/top-10/why-do-people-start-smoking/).
There is a tendency to think that cigarettes can help to cope with stress, but in the article of the american psychologist Andrew Parrot “Does Cigarette Smoking Cause Stress?”, this myth is refuted. Stress level of adult smokers is slightly higher than that of non-smokers, while adolescent smokers report an increase in the level of stress. It happens because these adolescents develop regular models of smoking. As it turns out, smoking cessation leads to stress reduction but not smoking itself. Dependence on nicotine not only does not help reduce stress but increases it. This is confirmed by the daily mood patterns described by smokers. Thus, it turns out that smoking relieves the stress that have been developed by smoking before. Roughly speaking, addicted smokers need nicotine to stay at normal stress level (https://www.researchgate.net/publication/12759310_Does_Cigarette_Smoking_Cause_Stress).
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 234 43.56 11.9 42.5 43.57 6.67 2 90 88 0.14 3.08
## se
## X1 0.78
Here, mean and meadian also seem to be quite close to each other (43.56 and 42.5 respectively). While the maximal number of working hours (including overtime) repondents reported was 90, there were no people reporting 0 hours of work (2 hours is the smallest number).
Contrtrary to the previous variable, even though the kurtosis is bigger than 2 (3.08), the distribution looks quite unimodal and bell-shaped.
According to the information we found, a normal working day in Germany lasts 8 hours, which under certain conditions can be extended to 10 hours. The working week is set from Monday to Saturday, although most employees work from Monday to Friday. Night and shift work is also permitted.
In other words, we would expect people to work aroung 45 - 55 hours a week and this is what explains quite big numbers of people, judging by the histogram, reporting their working hours per week being around 40 - 50.
Since there are no people who don’t work at all (min = 2) we mean center the variable in order to make the further interpretation of the reression equation meaningful. Just to show that the shape of the distribution didn’t change we also plot the distribution again.
Our next predictor is ‘Number of people responsible for in job (njbspv)’. It is also a continious ratio variable. Descriptive statistics and distribution of variable are shown below.
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 234 27.57 182.33 5 7.41 4.45 1 2699 2698 13.6 194.41
## se
## X1 11.92
In this case mean and median are not equal at all (27.57 and 5 respectively). Kurtosis and skew are too large to talk about normality (kurtosis of 194.41). The distribution is unimodal, but not bell-shaped. In fact, it is strongly skewed to the right indicating that only a few respondent have a lot of people responsible for in job.
We also mean center this variable since the smallest number of people respondents reported being responsible for is 1.
Here is the graph to show how it has changed after mean-centering.
One more variable that we took for our analysis is ‘Grams alcohol, last time drinking on a weekday, Monday to Thursday (alcwkdy)’. It is, again, continious ratio variable. Its descriptive statistics with visualizations are depicted below.
Why did we consider adding this variable to our model?
According to the article, published on the website of National Institute on Alcohol Abuse and Alcoholism, people who are dependent on alcohol are three times more likely then those in the general population to be smokers, and people who are dependent on tobacco are four times more likely than the general population to be dependent on alcohol. It actually seems like people themselves are aware of it. According to the article, published on the website Tobacco Free Life, people who both smoke and drink often mention that these two habits go together and complement each other: they feel more like smoking when they have an alcoholic drink and vice-versa.
There are many possible explanations for such phenomenon, including the fact that nicotine (that cigarettes contain) enhances the pleasurable effects of alcohol. Even though the reserach in this topic is not suffient to say if it can shed a light on a different reason for connection between smoking and drinking, it is known that both nicotine and alcohol work on the same brain systems. Finally, same genes might be responsible for predisposition to both smoking and drinking or the general reason might be the absence of willpower.
We specifically chose amount of alcohol drunk on a weekday and not on a weekend day. First of all, drinking on weekends is quite common even for people who do not smoke since they are likely to go on some corporate parties or meetings with friends, while drinking on weekdays is more likely to be a warning signal of a person having a predesposition to alcohol frinking. Since genes of smoking and drinking predisposition are similar (or even the same), it is more reasonable for us to use this particular variable.
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 234 23.81 33.02 15 17.16 15.57 0 263 263 3.53 16.84
## se
## X1 2.16
Descriptive statistics show big difference between median and mean (15 and 23.81 respectively). Both skew and kurtosis exceed the value of 2. The distribution is again unimodal and right-skewed - not that many people report drinking a lot on week days.
We actually do not have to mean centre this variable since the minimal reported number of grams of alcohol drunk last time on a week day is 0.
Another predictor in our model is ‘Years of full-time education completed (eduyrs)’. It is continious ratio variable. Descriptive statistics and plots are below as usual.
Why did we consider adding this variable to our model?
On the website Studential (https://www.studential.com/university/guides/correlation-between-education-and-quitting-smoking) the following expert opinions offering some reasons why uneducated people are more likely to smoke are posted.
Some of the reasons of people living in worse conditions being more likely to smoke, which might be connected with their education, are also offered:
Some statistics of smokers by level of education can be seen below
Among both genders, smoking is far less widespread in the groups with higher levels of education than in less educated ones. With the exception of the 65-plus age group, where no significant differences regarding the educational level appear, this clear link between smoking and education is evident across all age groups.
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 234 14.75 2.7 14 14.57 2.97 9 25 16 0.68 0.49 0.18
According to description, mean and median are 14.75 and 14 respectively. Kurtosis and skew are both small. The distribution looks bell-shaped and unimodal, but not fully normal. It is right-skewed a little bit since, while children are obliged to spend fixed number of years studying by the government and there are not so many children not doing so, there are still some people that spend more years studying than others.
The minimal number of years spent studying is 9 which is why we need to mean center this variable.
Here is the graph to show how it has changed after mean-centering.
The last variable that we took is ‘Gender (gndr)’. It is a categorical binominal variable. A barplot with distribution of gender among respondents is below.
Why did we consider adding this variable to our model?
According to the data from the microcensus of 2013, information of which can be found on Wikipedia (https://en.wikipedia.org/wiki/Smoking_in_Germany), 24.5% (almost fourth) of the total German population aged 15 and over are smokers, where 20.9% smoke regularly, and only 3.6% do so occasionally. It is also known from the statistics of 2013 that in Germany men smoke more than women do (29% and 20% respectively). The graph below shows the proportions of male and female smokers of different age groups in Germany. The average age when german people begin to smoke is 17.8. The most smoking age categories in this country are from 25 to 30, from 20 to 25, and from 30 to 35, while german people over 75 it contains only 8% of male and 3,6 of female smokers. It can be seen, that, as a rule, men smoke more than women with the difference between percentages of smokers in these two age groups (15-20 - 4.9%, 20-25 - 7,1%, 25-30 - 12,1%, 30-35 - 14,2%, 35-40 - 11,1%, 40-45 - 8%, 45-50 - 7,8%, 50-55 - 8,4%, 55-60 - 8.5%, 60-65 - 8,1%, 65-70 - 5,8%, 70-75 - 5%, over 75 - 4,4%) fluctuates from about 4% to 15%.
The article “Prevalence of smoking in the adult population of Germany” written by Thomas Lampert, Elena von der Lippe and Stephan Müters offers the data where categories of people who never smoked, ex-smokers, occasional and daily smokers among 18 to 79-years-old women and men.
What is more, the statistics of heavy smokers is given in the article. So, the most smoking category of German is people from 30 to 44 years, the least - 65-79 year-olds.
Another article (“Smoking Behaviour in Germany – Evidence from the SOEP” written by Daniela Heilert and Ashok Kaul) presents such statistics:
As it can be seen, the blue line (which represents men) is always above the red line representing women.
There were around 60% of men and nearly 40% of women among respondents.
We now proceed to the scatterplots visualizing relationships between a number of smoked cigarettes and our predictor variables. Note, that we do not use centered variables here since we want to interpret actual values of them.
##
## Pearson's product-moment correlation
##
## data: data1$cgtsday and data1$wkhtot
## t = 1.5026, df = 232, p-value = 0.1343
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03045832 0.22360209
## sample estimates:
## cor
## 0.0981712
We start with a total working hours as our predictor variable. Reported pearson’s coefficient is equal to 0.098 which indicates a very weak positive relationship between the variables. By looking at the scatter plot we can make the same conclution. Quite interestingly, the approxiation line for female respondents seems to have a bigger slope coefficient. In other words, while there is a weak positive relationship between number of total working hours and a number of cigarettes one smokes daily, this relationships seems to be a bit stronger for women. One of the possible explanations might be the fact that women have less stress tolerance.
##
## Pearson's product-moment correlation
##
## data: data1$cgtsday and data1$njbspv
## t = 2.0091, df = 232, p-value = 0.04569
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.002568072 0.254744814
## sample estimates:
## cor
## 0.1307711
While, judging by the obtained coefficient (0.1307711), there is a weak positive relationship between the number of cigarettes one smokes daily and a number of people he/she is responsible for, we can see that the approximating line for men is steaper than the one for women. We can also see that there are more males having a lot more people they are responsible for in comparison to women, and more males smoking more than 20 cigarettes daily. One of the possible explanations is, again, stress tolerance.
##
## Pearson's product-moment correlation
##
## data: data1$cgtsday and data1$alcwkdy
## t = 2.7776, df = 232, p-value = 0.005923
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.05236072 0.30072908
## sample estimates:
## cor
## 0.1794021
The correlation coefficient in case of these variables is a bit bigger, but it still indicates a weak positive relationship between the number of cigarettes one smokes each day and grams of alcohol consumed last time during the week days. The approximating lines for males and females gave approximately the same slope. As we expected, people who drink are also more likely to smoke.
##
## Pearson's product-moment correlation
##
## data: data1$cgtsday and data1$eduyrs
## t = -1.905, df = 232, p-value = 0.05802
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.248392415 0.004213512
## sample estimates:
## cor
## -0.1240996
This is the only pair of varibles the relationship between which appears to be weak but negative (-0.1240996). In other words, the bigger the number of years of education one obtained, the less cigarettes he/she smokes. The approxiating line for men seems to have a bigger slope in comparison to women which can be possibly explained by the fact that women are, in general, less likely to start smoking under the influence of the peers no matter for how long one continues her education. Here is a boxplot showing how many cigarettes are smoked by people on each day according to their gender. It is seen that mean number of cigarettes for men is bigger that one for women almost by 5 units. We can also see that there are two women who smoke more than 30 cigarettes each day - those are marked as two outliers.
So! Since we will be constructing quite a lot of models, we would like to give you a giudeline on how to read the following part of the project.
We hope it makes it a bit easier! Let’s start.
Linear assumptions for the model to check for bias:
First things first, additivity of predictors implies that the influence of a predictor variables on dependent variable is not influenced by any other external influence. Considering our examples, we conclude that the effects of different independent variables on expected value of the dependent variable are additive since when we add second independent variables it does not change the relationship between the first independent and the outcome variables (for example, the addition of gender does not change the slope of the line that much).
Talking about linearity of a relationship, we firstly examine bivariate plots (the scatterplots above).They show us straight lines, thus, we are satisfied and the conditions are met. Then we will look at the residuals to check if the model is applicable (We will consider this a bit later, after building the models).
Last but not least, we recall that the data we have consists of independent observations as it comes from the national survey and has been checked by professionals.
Now we proceed to constructing the models which consider two independent variables which constitute to the working conditions of a person: ‘Total hours normally worked per week in main job overtime included (wkhtot)’ and ‘Number of people responsible for in job (njbspv)’.
As we mentioned in the descriptions of the variables included as predictors in the models 1, 2 and 3 that we are going to build right now, both of them are supposed to be linked to the stress a person experiences. We hypothesize that the more people one has in responsibility and the more hours one works the higher is the stress level of this person which makes him smoke more. In other words, we expect to obtain a positive correlation when it comes to these variables and their connection to the number of cigarettes one smokes per day.
model1 <- lm(cgtsday ~ wkhtot_cent, data = data1)
model2 <- lm(cgtsday ~ njbspv_cent, data = data1)
model3 <- lm(cgtsday ~ wkhtot_cent + njbspv_cent + gndr, data = data1)
The first model includes working hours as a predictor of a number of cigarettes smoked per day, the second one - number of people one is responsible for, and the third one - both of them.
We can now summarize the results of the models to get the statistic of each and compare them later.
summary(model1)
##
## Call:
## lm(formula = cgtsday ~ wkhtot_cent, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.5713 -5.2176 -0.2827 5.8944 26.3616
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.86752 0.50898 27.246 <2e-16 ***
## wkhtot_cent 0.06443 0.04288 1.503 0.134
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.786 on 232 degrees of freedom
## Multiple R-squared: 0.009638, Adjusted R-squared: 0.005369
## F-statistic: 2.258 on 1 and 232 DF, p-value: 0.1343
summary(model2)
##
## Call:
## lm(formula = cgtsday ~ njbspv_cent, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.5050 -4.6115 -0.7272 6.1917 26.2756
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.867521 0.507061 27.349 <2e-16 ***
## njbspv_cent 0.005599 0.002787 2.009 0.0457 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.757 on 232 degrees of freedom
## Multiple R-squared: 0.0171, Adjusted R-squared: 0.01286
## F-statistic: 4.036 on 1 and 232 DF, p-value: 0.04569
summary(model3)
##
## Call:
## lm(formula = cgtsday ~ wkhtot_cent + njbspv_cent + gndr, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.8382 -4.9731 -0.1215 5.1321 27.3969
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.864280 0.643339 23.105 <2e-16 ***
## wkhtot_cent 0.031419 0.043803 0.717 0.4739
## njbspv_cent 0.004983 0.002760 1.806 0.0723 .
## gndrFemale -2.650473 1.074797 -2.466 0.0144 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.656 on 230 degrees of freedom
## Multiple R-squared: 0.05059, Adjusted R-squared: 0.03821
## F-statistic: 4.085 on 3 and 230 DF, p-value: 0.007499
According to these, we will be able to construct the equations and also find the best model which explains the most share.The F-test statistics is significant with the third model having the smallest p-value. Thus, we conclude that third model is better than no model at all. We shall move to the adjusted R-squared which indicates how well the regression function fits with empirical data. In case of the first model we can conclude that the model explains about 0.5% of the variation in cigarettes smoked daily - and, again, the last model explains the most (3%). Considering this, we should take up our third model for further
Let’s plot now! Shall we?
plot(model1)
plot(model2)
plot(model3)
This is where we finally see that our residuals are nearing zero, the line in the Residuals vs Fitted graph is almost straight and, thus, our calcualtions are not wrong and we are on the way of getting to the best model!
You can also take a look at the normal probability plot that shows if residuals are distributed normally. As you can see, there are some outliers but this is not the worst case scenario - the line is still quite straight.
In case of the Scale-Location plot, we can check if residuals are spead equally along the ranges of predictors. We can see that the a more or less straight horizontal line with randomly (equally) spread points which is good.
Finally, the last plot is supposed to help us identify the highly influential observations that don?t get along with the majority of other cases. In this case, we should watch out for dots outside of the line of Cook’s distance.
Now we should compare the models with each other and check it by conducting ANOVA test comparison. We can compare our reduced and full models since the former one can be created by getting rid of one of the predictors in the full model. Here are the hypothesis:
anova(model1, model3)
## Analysis of Variance Table
##
## Model 1: cgtsday ~ wkhtot_cent
## Model 2: cgtsday ~ wkhtot_cent + njbspv_cent + gndr
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 232 14064
## 2 230 13482 2 581.58 4.9607 0.007777 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(model2, model3)
## Analysis of Variance Table
##
## Model 1: cgtsday ~ njbspv_cent
## Model 2: cgtsday ~ wkhtot_cent + njbspv_cent + gndr
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 232 13958
## 2 230 13482 2 475.59 4.0566 0.01856 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Looking at this, we can tell that in both cases p-value is less than 0.05 and, thus, we reject the null hypothesis which claims that reduced model is as good and conclude that addition of a predictor variable (number of people one is responsible for) did imporve our model and the full model is better.
Now we want to construct models based on such socially constructed factor as ‘Years of full-time education completed (eduyrs)’ and on such factor as ‘Grams alcohol, last time drinking on a weekday, Monday to Thursday (alcwkdy)’.
According to the data mentioned before, people who are more educated usually smoke less, while people who are addicted to alcohol or at least consume it in relatively large amounts usually smoke more than those who don’t. We include the latter variable in our model 4, the former - in model 5, and both of them in model 6. In this case situation is a bit different: in case of education we expect a negative correlation, in case of alcohol intake - positive.
Let’s proceed to the construction of the models and see what we have got.
model4 <- lm(cgtsday ~ alcwkdy, data = data1)
model5 <- lm(cgtsday ~ eduyrs_cent, data = data1)
model6 <- lm(cgtsday ~ alcwkdy + eduyrs_cent + gndr, data = data1)
summary(model4)
##
## Call:
## lm(formula = cgtsday ~ alcwkdy, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.3180 -5.2075 -0.3666 5.9777 27.1425
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.85755 0.62079 20.712 < 2e-16 ***
## alcwkdy 0.04242 0.01527 2.778 0.00592 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.697 on 232 degrees of freedom
## Multiple R-squared: 0.03219, Adjusted R-squared: 0.02801
## F-statistic: 7.715 on 1 and 232 DF, p-value: 0.005923
summary(model5)
##
## Call:
## lm(formula = cgtsday ~ eduyrs_cent, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.496 -5.194 -0.496 5.504 25.504
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.8675 0.5075 27.325 <2e-16 ***
## eduyrs_cent -0.3587 0.1883 -1.905 0.058 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.763 on 232 degrees of freedom
## Multiple R-squared: 0.0154, Adjusted R-squared: 0.01116
## F-statistic: 3.629 on 1 and 232 DF, p-value: 0.05802
summary(model6)
##
## Call:
## lm(formula = cgtsday ~ alcwkdy + eduyrs_cent + gndr, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.9957 -5.0407 -0.3192 4.8852 27.5576
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.90192 0.76139 18.258 <2e-16 ***
## alcwkdy 0.03768 0.01528 2.466 0.0144 *
## eduyrs_cent -0.36964 0.18394 -2.010 0.0456 *
## gndrFemale -2.47685 1.03800 -2.386 0.0178 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.565 on 230 degrees of freedom
## Multiple R-squared: 0.07312, Adjusted R-squared: 0.06103
## F-statistic: 6.048 on 3 and 230 DF, p-value: 0.0005571
Now we are interested in the F-statistics and p-values numbers again and, as we see, the sixth (the full) model provides the most explanation into our case and the R-squared states that it covers 6,1% of variation.
Again, looking at the residuals plots and the other we see that we are able to trust the results.
plot(model4)
plot(model5)
plot(model6)
Our residuals are nearing zero, the line in the Residuals vs Fitted graph is almost straight.
Looking at the normal probability plot, we can see, there are some outliers but this is not still applicable and the line is quite straight.
On the Scale-Location plot we can see that the a more or less straight horizontal line with randomly spread points which is fine by us.
Last but not least, according to the last plot, we should watch out for dots outside of the line of Cook’s distance?.
Now we again check if the reduced model is as good as the full model or not and use the ANOVA test.
anova(model4, model6)
## Analysis of Variance Table
##
## Model 1: cgtsday ~ alcwkdy
## Model 2: cgtsday ~ alcwkdy + eduyrs_cent + gndr
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 232 13744
## 2 230 13162 2 581.32 5.079 0.006943 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(model5, model6)
## Analysis of Variance Table
##
## Model 1: cgtsday ~ eduyrs_cent
## Model 2: cgtsday ~ alcwkdy + eduyrs_cent + gndr
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 232 13982
## 2 230 13162 2 819.68 7.1615 0.0009612 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Looking at this, we again claim that in both cases p-value is less than 0.05 and, thus, we reject the null hypothesis which claims that reduced model is as good and conclude that the full model is of a better explanation.
Now we know which models of previous two groups are the best, but we cannot compare them with each other by using anova since they use completely different predictors.
But what happens if we take all of those predictors and use them in our final model? Let’s take a look.
model7 <- lm(cgtsday ~ wkhtot_cent + njbspv_cent + alcwkdy + eduyrs_cent + gndr, data = data1)
summary(model7)
##
## Call:
## lm(formula = cgtsday ~ wkhtot_cent + njbspv_cent + alcwkdy +
## eduyrs_cent + gndr, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.090 -4.811 -0.190 4.744 26.753
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.716176 0.765749 17.912 <2e-16 ***
## wkhtot_cent 0.038523 0.043131 0.893 0.3727
## njbspv_cent 0.005421 0.002713 1.998 0.0469 *
## alcwkdy 0.038844 0.015191 2.557 0.0112 *
## eduyrs_cent -0.396836 0.183453 -2.163 0.0316 *
## gndrFemale -2.056666 1.073409 -1.916 0.0566 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.518 on 228 degrees of freedom
## Multiple R-squared: 0.09263, Adjusted R-squared: 0.07273
## F-statistic: 4.655 on 5 and 228 DF, p-value: 0.0004578
anova(model3, model7)
## Analysis of Variance Table
##
## Model 1: cgtsday ~ wkhtot_cent + njbspv_cent + gndr
## Model 2: cgtsday ~ wkhtot_cent + njbspv_cent + alcwkdy + eduyrs_cent +
## gndr
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 230 13482
## 2 228 12886 2 596.95 5.2813 0.005727 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(model6, model7)
## Analysis of Variance Table
##
## Model 1: cgtsday ~ alcwkdy + eduyrs_cent + gndr
## Model 2: cgtsday ~ wkhtot_cent + njbspv_cent + alcwkdy + eduyrs_cent +
## gndr
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 230 13162
## 2 228 12886 2 277.01 2.4508 0.0885 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Having had a look at these, we see that p-value for the third and the last models is less than 0.05, thus, the full model explains our outcome better then the reduced one! Success! Now to the sixth and the full models, since p-value is not significant (0.0885 > 0.05), we can conclude that we should accept H0 and say that adding the new paramentrs did not help in that case.
We can see that we now have a dummy variable gndrFemale. Let’s see why we have women instead of men here.
contrasts(data1$gndr)
## Female
## Male 0
## Female 1
R has actually created a gndrFemale dummy variable that takes on a value of 1 if the sex is Female, and 0 otherwise. This decision to code female as 1 and males as 0 (baseline) is arbitrary, and has no effect on the regression computation, but does alter the interpretation of the coefficients. So, we can change it.
data1 <- data1 %>%
mutate(gndr = relevel(gndr, ref = "Female"))
contrasts(data1$gndr)
## Male
## Female 0
## Male 1
Now, we see that the summary of the model is a bit different
model7 <- lm(cgtsday ~ wkhtot_cent + njbspv_cent + alcwkdy + eduyrs_cent + gndr, data = data1)
summary(model7)
##
## Call:
## lm(formula = cgtsday ~ wkhtot_cent + njbspv_cent + alcwkdy +
## eduyrs_cent + gndr, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.090 -4.811 -0.190 4.744 26.753
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.659510 0.857963 13.590 <2e-16 ***
## wkhtot_cent 0.038523 0.043131 0.893 0.3727
## njbspv_cent 0.005421 0.002713 1.998 0.0469 *
## alcwkdy 0.038844 0.015191 2.557 0.0112 *
## eduyrs_cent -0.396836 0.183453 -2.163 0.0316 *
## gndrMale 2.056666 1.073409 1.916 0.0566 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.518 on 228 degrees of freedom
## Multiple R-squared: 0.09263, Adjusted R-squared: 0.07273
## F-statistic: 4.655 on 5 and 228 DF, p-value: 0.0004578
We start the analysis of our model from the F-statistic. In this case we have the following hypothesis:
As it can be seen, the p-value is extremely small (< 0.05) which implies that the probablity to get such F-statistic as we do if H0 is correcct is quite small, therefore, we reject the null hypothesis and conclude that at least one of regression coefficients is not equal to zero. In other words, having such a model is better than having the model that predicts number of cigarettes a person smokes per day based on the mean value.
We proceed to the R-squared which indicates how well the regression function fits with empirical data. In this case, R-squared is equal to 0.07273 which, in other words, means that our model explains around 7% of the variance in the number of cigarettes people smoke per day.
If we were to write down the regression equation, it woulld look like this:
Predicted number of cigarettes one smokes per day = 11.659510 + 0.038523 * number of working hours + 0.005421 * number of people one is responsible for + 0.038844 * grams of alcohol drunk last time on a weekday - 0.396836 * number of years of education + 2.056666 * 1 (being a man)
By looking at this equation we can definitely see that in case of all x being equal to 0 (female, mean number of working hours, mean number of people he is responsible for, mean amount of alcohol drunk last time on a weekday, mean number of years of education attained) the predicted number of cigarettes smoked per day would be 11.659510.
In plain words, in order to obtain a 1-unit increase in a number of cigarettes one smokes per day, those are the changes that needs to be made for each variable:
Now, we need to take a look at the plots that help us analyze the model.
plot(model7)
‘Residuals VS Fitted’ plot that shows if residuals are distributed linearly or not. We know that residuals represent the difference between number of cigarettes smoked per day predicted by our model and real numbers, and we know that residuals are distributed linearly if the points are equally spread around a straight horizontal line and there is no clear non-linear pattern forming there. The line on the plot is not perfectly straight, but it is still fine to conclude that residuals are distributed more or less linearly (especially compared to the plots of previous models).
The normal probability plot shows if residuals are distributed normally. As you can see, there are some outliers, namely, 125 and 172, but this is not the worst case scenario and apart from those - the line is still quite straight.
In case of the Spread-Location plot, we can check if residuals are spead equally along the ranges of predictors. The line we obrain is not perfectly horizontal - it has a concave curve in the middle and the points are spread more densily in that area, but, overall, it is not as bad.
Finally, the last plot is supposed to help us identify the highly influential observations that don’t get along with the majority of other cases. In this case, we should watch out for dots outside of the line of ‘Cook’s distance’, and the only point that is beyond the Cook’s distance is the outlier 51.
model3 <- lm(cgtsday ~ wkhtot_cent + njbspv_cent + gndr, data = data1)
model6 <- lm(cgtsday ~ alcwkdy + eduyrs_cent + gndr, data = data1)
model7 <- lm(cgtsday ~ wkhtot_cent + njbspv_cent + alcwkdy + eduyrs_cent + gndr, data = data1)
tab_model(model3, model6, model7)
| cgtsday | cgtsday | cgtsday | |||||||
|---|---|---|---|---|---|---|---|---|---|
| Predictors | Estimates | CI | p | Estimates | CI | p | Estimates | CI | p |
| (Intercept) | 12.21 | 10.57 – 13.85 | <0.001 | 11.43 | 9.77 – 13.08 | <0.001 | 11.66 | 9.98 – 13.34 | <0.001 |
| wkhtot cent | 0.03 | -0.05 – 0.12 | 0.474 | 0.04 | -0.05 – 0.12 | 0.373 | |||
| njbspv cent | 0.00 | -0.00 – 0.01 | 0.072 | 0.01 | 0.00 – 0.01 | 0.047 | |||
| Male | 2.65 | 0.54 – 4.76 | 0.014 | 2.48 | 0.44 – 4.51 | 0.018 | 2.06 | -0.05 – 4.16 | 0.057 |
| alcwkdy | 0.04 | 0.01 – 0.07 | 0.014 | 0.04 | 0.01 – 0.07 | 0.011 | |||
| eduyrs cent | -0.37 | -0.73 – -0.01 | 0.046 | -0.40 | -0.76 – -0.04 | 0.032 | |||
| Observations | 234 | 234 | 234 | ||||||
| R2 / adjusted R2 | 0.051 / 0.038 | 0.073 / 0.061 | 0.093 / 0.073 | ||||||
Now, let’s sumarize everything we’ve done here.
Our team compared models and concluded that the best one is the final model containing ‘Total hours worked per week in main job overtime included’, ‘Number of people responsible for in job’, ‘Grams alcohol last time drinking on a weekday, Monday to Thursday’, ‘Years of full-time education completed’, and ‘Gender’. It can be seen from the table that the final model is really the best one on the basis of adjusted R squared (adjusted R squared in model 3 is equal to 0.073, in model 6 it is 0.061, and in model 7 it is 0.038) . We compared the final model and the sixth one and concluded that addition of variables related to stress on work did not improve our model significantly. While in case of the third model, addition of other variables worked out and improved the model.
In our work, we looked into to different papers and found out that one of the most common causes of cigarette consumption is stress relief. After checking all the models, we understood that the model that contains not only stress (‘Total hours normally worked per week in main job overtime included (wkhtot)’ and ‘Number of people responsible for in job (njbspv)’) but all the predictors at once works better than any other. That is why when we add ‘stress’ predictors to our second model (which is based on socially constructed factors), the model only slightly improves.
We create a new model that feautures the age of a respondent as one of the predictors, and make gender play a role of a moderator. In other words, we want to check if the coefficient between X (age of respondent) and Y (number of cigarettes one smokes per day) depends on the level of Z (gender).
describe(data1$agea)
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 234 45.17 13.29 47 44.99 14.83 19 77 58 0.08 -0.75
## se
## X1 0.87
Since we see that the minimal age of a respondent is 19 and not 0 we want to make our variable mean-centred in orer to make the interpretation of the coefficients later on.
data1$agea_cent <- data1$agea - mean(data1$agea)
Then, we proceed to comparing the additive (model 8) and interaction model (model 9) to see which of them has a better fit.
model8 <- lm(cgtsday ~ wkhtot_cent + njbspv_cent + alcwkdy + eduyrs_cent + agea_cent + gndr, data = data1)
model9 <- lm(cgtsday ~ wkhtot_cent + njbspv_cent + alcwkdy + eduyrs_cent + agea_cent * gndr, data = data1)
anova(model8, model9)
## Analysis of Variance Table
##
## Model 1: cgtsday ~ wkhtot_cent + njbspv_cent + alcwkdy + eduyrs_cent +
## agea_cent + gndr
## Model 2: cgtsday ~ wkhtot_cent + njbspv_cent + alcwkdy + eduyrs_cent +
## agea_cent * gndr
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 227 12588
## 2 226 12366 1 222.15 4.0602 0.04509 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can observe that p-value is equal to 0.04509 which is smaller than 0.05 meaning that the interaction model is actually beter - it has a better fit. We should interpret it then!
tab_model(model8)
| cgtsday | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 11.74 | 10.07 – 13.41 | <0.001 |
| wkhtot cent | 0.04 | -0.05 – 0.12 | 0.401 |
| njbspv cent | 0.01 | -0.00 – 0.01 | 0.055 |
| alcwkdy | 0.03 | 0.00 – 0.06 | 0.029 |
| eduyrs cent | -0.38 | -0.73 – -0.02 | 0.040 |
| agea cent | 0.09 | 0.01 – 0.16 | 0.021 |
| Male | 2.14 | 0.05 – 4.22 | 0.046 |
| Observations | 234 | ||
| R2 / adjusted R2 | 0.114 / 0.090 | ||
tab_model(model9)
| cgtsday | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 11.72 | 10.07 – 13.38 | <0.001 |
| wkhtot cent | 0.04 | -0.04 – 0.12 | 0.356 |
| njbspv cent | 0.01 | 0.00 – 0.01 | 0.049 |
| alcwkdy | 0.04 | 0.01 – 0.07 | 0.019 |
| eduyrs cent | -0.37 | -0.72 – -0.02 | 0.042 |
| agea cent | 0.19 | 0.07 – 0.31 | 0.003 |
| Male | 2.08 | 0.01 – 4.15 | 0.050 |
| agea_cent:gndrMale | -0.16 | -0.31 – -0.00 | 0.045 |
| Observations | 234 | ||
| R2 / adjusted R2 | 0.129 / 0.102 | ||
By looking at adjusted R squared we can conclude that our interaction model explains 10% of the variance in number of cigarettes smoked per day while the additive model explained 9%. In short - our new model is better!
The intercept is 11.72 meaning that in case of all predictors being equal to 0 (in case of continious variables - equal to mean since they are mean centred; in case of categorical variable - being Female) the number of sigarettes smoked per day is equal to 11.72. P-value associated with working hours and gender is not significant which is why we do not interpret them. One-unit increase in an age of a respondent is associated with a 0.19 increase in number of cigarettes a person smokes daily (in plain words, if the age of a person is 5 units higher than the mean, a number of cigarettes this person smokes is higher by 1). If a person drinks 25 more grams of alcohol than the mean he smokes 1 cigarette more per day; if a person has 2,7 less years of education than the mean then he smokes 1 cigarette more per day; if a person has 100 more people he is responsible for in a job than the mean he smokes 1 cigarette more per day.
Let’s now take a look at the interaction plots.
plot_model(model9, type = "int", title = "Predicted values of number of cigarettes smoked per day", axis.title = c("Age", "Number of cigarettes smoked per day"), legend.title = "Gender", axis.lim = c(0, 60))
## Scale for 'y' is already present. Adding another scale for 'y', which
## will replace the existing scale.
Judging by this plot, the interaction effect is increasing in strength the older a person is in case of female respondents (colored red) and staying exactly the same in case of male respondents. If a woman’s age is equal to the mean (45 yrs) then she smokes around 12 cigarettes per day and 15 cigarettes per day in case of male respondents. Both in case of males and females the number of cigarettes one smokes per day is approximately 15 cigarettes if a person’s age is 1o units higher than the mean (55 yrs). Since from that point of the confidence intervals overlap we know that the interaction effect is not statistically significant (they were not competely separated even in the begininng but they were at least not overlapping). We can observe that p-value is equal to 0.03 which is smaller than 0.05 meaning that the interaction model is actually beter - it has a better fit. We should interpret it then!
sjPlot::plot_model(model9, type = "pred")
## $wkhtot_cent
##
## $njbspv_cent
##
## $alcwkdy
##
## $eduyrs_cent
##
## $agea_cent
##
## $gndr
To be precise, a female respondent is expected to smoke between 11 and 14 (about 13) cigarettes per day, male respondent - from 13 to 15 (about 14,5).
But now we want to report marginal effects of predictors.
As we know, marginal effects show the expected change in the outcome (number of cigarettes smoked per day) associated with 1-unit change of a particular predictor, taking into account all the other predictors. In other words, it is the “pure” effect of predictors.
margins(model9)
## Average marginal effects
## lm(formula = cgtsday ~ wkhtot_cent + njbspv_cent + alcwkdy + eduyrs_cent + agea_cent * gndr, data = data1)
## wkhtot_cent njbspv_cent alcwkdy eduyrs_cent agea_cent gndrMale
## 0.03932 0.00529 0.03572 -0.3703 0.09064 2.08
The highest marginal effect is observed in case of the gender: 2.08 which implies that this variable is the most “important” in comparison with other predictors. Years of education is the second most important one (-0.3703).
To summarize everything, we would like to highlight our main findings. It was discovered that males in Germany have their own strong opinion about political programs and movements connected with environmental issues. Analysis of data shows greater involvement and awareness of German men: educated people spend more time on political news (and we know that men are more educated) , they are against or support increasing taxation because they know a lot about it. Women in Germany, on the contrary, seem to be neutral and less involved in these issues. We assume that such a gap exists because of differences in education between gender. However, we also found that males smoke much more than females. So, despite their political positions about environment, they bring more harm to nature. Total hours of work, number of people one is responsible for, drinking patterns, age and education - those are the factors that contribute to it. That is why, government needs to take into account such information in order to reduce number of hard-smoking people in Germany.