Data Science project

By Denis Jackman (student number:23509489)

Part 1; Affect of subliminal messages on test scores?

A group of students who had failed a maths exam were enlisted for this experiment. They were randomly assigned to receive one of two messages.

The treatment group received the message, “Each day I am getting better at mathematics.” The control group saw a boring message that had nothing to do with maths.

All students in both groups took an initial test (pre-test), went to a maths course during the summer, and then took another test after (post-test). The researchers claim that the positive message improves test scores. Test their claim using the change in test score (post-test score pretest score) for the treatment and control message group

My null and alternative Hypotheses

Null Hypothesis: Positive Subliminal messages don’t improve exam scores

Alternative Hypothesis: Positive subliminal messages do improve exam scores

Sidenote: I chose the Welch two-sample t-test due to the variance in the data. Essentially, there was unequal variance, so I used the Welch test instead due to its robust nature. The standard two-sample t-test assumes equal variance and homoskedacisity.

Summary Statistics
Group_Type Mean Min Max SD
C 7.43 4 12 3.10
T 11.40 6 16 3.17

T-test Results of a one-sided test: Treatment vs. Control
t.statistic p.value
t 2.575 0.011

Conclusion: Part 1

I concluded from theoretical theory that there was no reason to perform a two sided test because from real life experience i have a very strong inclination that a positive subliminal message will only have a positive effect and therefore would be unnecessary to test for the negative effect.

To determine whether our results have Significance we must set a goalpost to judge how significant they are. For this sample the goalpost is a 99% confidence interval, meaning to get a FALSE null hypothesis the P-value must indicate a significance value lower than 0.01. The P-value obtained was 0.023 but since I’m only doing a one-sided test it is 0.011, meaning that the null hypothesis is barely true at the significance level 0.01. However, if we were more lenient and changed the goal post to a significance level of 0.05 (95% confidence level) the null hypothesis would be FALSE and the alternative would be true. If this were the case we could state that subliminal messages improve exam scores with 95% confidence we are not wrong.

Part 2: Econometrics on Morality Data

Regressions picked, Key variables (the reasons for them) and Statistical tools:

Regression chosen;

The regressions I chose are the logistical regression and the OLS regression, a binary variable as the dependent variable bounds the results possible between 1 and 0, so I thought the logistical regression would be efficient and appropriate as it is intended for variables that are bound between 1 and 0, meaning my estimates would be probabilities. I picked OLS for its more intuitive interpretations of the interactions between variables, mostly to use for easier interpretations of the relationship between death and one other variable.

After carrying out multiple preliminary logistical regressions I concluded that it would be easier to interpret the results if I started with a base regression and compared AIC, VIF, PSEUDO[Mcfadden] and estimate results after adding another or swapping out variables. Also, I chose to do some preliminary research on certain variables to grasp their possible efficiency in predicting death and to understand better which terms might cause multicollinearity or other statistical problems. After some research, I concluded that some changes are needed to better prepare my variables before running regressions to minimize multicollinearity and to improve the accuracy of my estimators at predicting death. My results from that process are these variables;

  1. pre_age and age; I decided to square each observation of age as i have a strong theoretical belief that the older someone is the more likely the other variables like death, pack_years and alcholfreq will affect the probability of dying. Since the mean age was 43.6 I realised that to get a true indicator of death I’d have to add more weight to older ages since we not only experience these variables more at an older age but we are more likely to die from them. I also chose just age because I used tidyverse on it to pull only observations above the mean (43.6 years old) for some regressions.

  2. log(Income); the reason behind this is that theoretically the more income you have the more you can spend on healthcare, safety, food and education which means that you are more likely to live longer. I took the log of it because some outliers and non-normal residuals were skewing the data, so I used log to tidy up the data of income.

  3. Pack_years; This variable was calculated by dividing smokeintensity by 20 and multiplying it by smokeyrs. It’s a standard medical term to estimate the effect of smoking on longevity.

  4. alcoholfreq; Alcohol if consumed daily has negative health effects due to its damaging qualities to human organs which would affect life expectancy.

  5. school: Education in general provides useful knowledge applicable to everyday life for example healthy diet, unhealthy habits to avoid, etc and the pursuit of further education is of course linked with higher income ( understandably the predictor school possibly is correlated with the other variables but it is still intuitively a good predictor of death )

## 
## =============================================
##                       Dependent variable:    
##                   ---------------------------
##                              death           
## ---------------------------------------------
## pre_age                    0.001***          
##                            (0.0001)          
##                                              
## log(income)                -1.734***         
##                             (0.505)          
##                                              
## pack_years                 0.012***          
##                             (0.004)          
##                                              
## alcoholfreq                 -0.089           
##                             (0.058)          
##                                              
## school                     -0.056**          
##                             (0.027)          
##                                              
## Constant                     1.395           
##                             (1.391)          
##                                              
## ---------------------------------------------
## Observations                 1,507           
## Log Likelihood             -537.236          
## Akaike Inf. Crit.          1,086.472         
## =============================================
## Note:             *p<0.1; **p<0.05; ***p<0.01
## 
## =================================================
## pre_age log(income) pack_years alcoholfreq school
## -------------------------------------------------
## 1.118      1.298      1.103       1.078    1.309 
## -------------------------------------------------

The Pseudo R^2 fit test

Pseudo-r2 McFadden 0.2558981

A good fit of a McFadden GLM regression is between 0.2 and 0.4. I ran multiple preliminary regressions and compared the results of each, and then chose my Final model which had a good fit according to the McFadden R^2 test.

Summary statistics of the variables of my final_model;

This sample data-set from 1971 to 1992 unfortunately has a mean of 43.65 years old due to it being normally distributed, meaning that every age above 56 is one SD from the mean giving way less observations where things like diseases, smoking side-effects and even age can really take affect on the probability of dying. Intuitively the predictors for death of the ages before the mean won’t really be efficient predictors because its very unlikely to die before 43. Essentially, even though we might have over 1500 observations, the truly useful observations on predicting good estimates for are variables on death are past the 1st SD on the right-hand tail (56 years old). So, this means we have greatly reduced the amount of observations we have to find significance. I tried to combat this of course by giving more weight to the age variable by squaring it, meaning a bigger age equals a exponentially bigger value.

Summary Statistics of the Variables from final Model logistical regression
Variable Min Max Mean SD
Age 25.000000 74.000000 43.659642 11.9898144
Pre_age 625.000000 5476.000000 2049.828225 1092.3262865
log(Income) 2.397895 3.091042 2.877719 0.1618178
Pack_Years 0.100000 156.000000 25.595147 20.4265489
Alcohol_Freq 0.000000 5.000000 1.913155 1.3037643
School 0.000000 17.000000 11.173052 3.0704302

Impact of smoking; Probability of death of smokers vs non-smokers;

Impact of smoking on the probability of death is significant, choosing to smoke increases your chances of dying by around 10% than if you chose not to smoke or to quit. I squared each observation of age due to the reasoning that most people die not only later in life but are also affected more later in life by their previous years of smoking because they are more vulnerable to the side-effects of smoking. Clearly, smoking is a dangerous habit that leads to a earlier death so to decrease your probability of dying people should not smoke.

The scatterplot of Pack years Vs Death has heteroskedacisity variance but also indicates that a higher amount of cigerettes consumed over a lifespan is linked with a positive relationship with death, meaning the greater intensity of smoking each year the greater your odds of death.

## `geom_smooth()` using formula = 'y ~ x'

Income Vs Death

the relationship between the probability of dying and a higher income is negative, indicating that as your salary grows yours chances of dying go down. This is interpreted as income gives more opportunities to stay healthy through better access to healthcare, to buy more nutritional food, access professional health consultancy, less stressful financial relationship dynamic and more security. All these things are intuitively correlated with less probability of dying.

## `geom_smooth()` using formula = 'y ~ x'

Relationship between school and death;

Clearly below is a big negative relationship in an extra year of schooling and death, meaning every extra year of schooling your probability of dying reduces significantly by the log of 0.027. If you continue on with education until grade 17 which would be 4 years plus in college your probability of dying drops near %50, of course this is correlated with income but there are other factors aswell that contribute to this like more knowledge and a better system for decision making cultivated from possible research practices. Also, the estimator of school only reached a significance level of 95% so it’s reliability is somewhat limited.

Conclusion; Part 2

I agree with government policies aiming at reducing the consumption of tobacco because my data indicated that there is a positive relationship between the consumption of tobacco and death, so to reduce that probability of death would be a great benefit to society. Also, government funding of education is a very efficient policy at reducing the odds of death and increasing welfare accoding to my final_model. I would advocate for even more funding in education has its affect is substantial and could improve the welfare of society even further. Potentially, the government’s low corporate tax policy which attracts big MNCs that increase wages is a good policy due to logincome’s negative relationship with the probability of death. Another, suggestion would be to reduce income tax so that people can retain more of their earning which would lead to less probability of death, but the reduction of taxes should not decrease the funding of healthcare because then there would be conflicting results.

The Alcohol frequency variable has a estimate that contradicts my initial theoretical belief of the affect of alcohol on the probability of death, this could be caused by a rare sample or a more complex relationship between alcohol and human behaviour. Humans are social animals and there are links between social interactions and improved mental health which could help prevent depression or other symptoms of exclusion which might explain the odd negative relationship.Also, there lacked a significance level below 0.1 limiting the reliability of the estimator. For that reason I would’nt change any policies on alcohol consumption based on the results of my final model.

AI policy statement;

I used AI for help on formating and styling my summary statistics;

summary_stats %>% kable(caption = “Summary Statistics”, digits = 3, format = “markdown”) %>% kable_styling(bootstrap_options = c(“striped”, “hover”, “condensed”),position = “center”, font_size = 18)

test_results %>% kable(caption = “T-test Results of a one-sided test: Treatment vs. Control”, digits = 3, format = “markdown”) %>% kable_styling(bootstrap_options = c(“striped”, “hover”, “condensed”),position = “center”, font_size = 20, full_width = FALSE) %>% column_spec(1, width = “10cm”) %>% column_spec(2, width = “10cm”)

summary_stats %>% kable(caption = “Summary Statistics of the Variables from final Model logistical regression”) %>% kable_styling(bootstrap_options = c(“striped”, “hover”, “condensed”), full_width = TRUE, font_size = 18)

I used AI For help on styling my regression functions ggplots;

labs(title = “Impact of Smoking on Death Probability Over Age”,x = “Age (pre_age)”,y = “Probability of Death”,color = “Smoking Status”) + facet_wrap(~ qsmk, labeller = labeller(qsmk = c(“0” = “Non-Smokers”, “1” = “Smokers”))) + theme_minimal()

R code;

{r, echo=FALSE} library(haven) my_data <- readRDS(“C:/Users/denis/Downloads/mortality.rds”) my_data\(pre_age <- my_data\)age^2 my_data\(pack_years <- (my_data\)smokeintensity/20)*my_data\(smokeyrs my_data\)logincome <- log(my_data$income)

{r, echo=FALSE, warning=FALSE, message=FALSE} library(dplyr) library(kableExtra)

data <- data.frame(Group = 1:17, Score1 = c(18, 18, 21, 18, 18, 20, 23, 23, 21, 17, 18, 24, 20, 18, 22, 15, 19), Score2 = c(24, 25, 33, 29, 33, 36, 34, 36, 34, 27, 29, 29, 24, 26, 27, 22, 31), Difference = c(6, 7, 12, 11, 15, 16, 11, 13, 13, 10, 11, 5, 4, 8, 5, 7, 12), Group_Type = c(“T”, “T”, “T”, “T”, “T”, “T”, “T”, “T”, “T”, “T”, “C”, “C”, “C”, “C”, “C”, “C”, “C”))

summary_stats <- data %>% group_by(Group_Type) %>% summarise( Mean = round(mean(Difference), 2), Min = min(Difference), Max = max(Difference), SD = round(sd(Difference), 2))

summary_stats %>% kable(caption = “Summary Statistics”, digits = 3, format = “markdown”) %>% kable_styling(bootstrap_options = c(“striped”, “hover”, “condensed”),position = “center”, font_size = 18)

{r echo=FALSE, message=FALSE, warning=FALSE} library(ggplot2) data <- data.frame(Group = 1:17, Score1 = c(18, 18, 21, 18, 18, 20, 23, 23, 21, 17, 18, 24, 20, 18, 22, 15, 19), Score2 = c(24, 25, 33, 29, 33, 36, 34, 36, 34, 27, 29, 29, 24, 26, 27, 22, 31), Difference = c(6, 7, 12, 11, 15, 16, 11, 13, 13, 10, 11, 5, 4, 8, 5, 7, 12), Group_Type = c(“T”, “T”, “T”, “T”, “T”, “T”, “T”, “T”, “T”, “T”, “C”, “C”, “C”, “C”, “C”, “C”, “C”)) group_t <- data\(Difference[data\)Group_Type == “T”] group_c <- data\(Difference[data\)Group_Type == “C”]

ggplot(data, aes(x = Group_Type, y = Difference, fill = Group_Type)) +geom_boxplot() +labs(title = “Difference in Scores by Group Type”, subtitle = “Comparison of Treatment vs Control”,x = “Group Type”, y = “Difference in Scores”)

{r, echo=FALSE, warning=FALSE} library(kableExtra) data <- data.frame(Group = 1:17, Score1 = c(18, 18, 21, 18, 18, 20, 23, 23, 21, 17, 18, 24, 20, 18, 22, 15, 19), Score2 = c(24, 25, 33, 29, 33, 36, 34, 36, 34, 27, 29, 29, 24, 26, 27, 22, 31), Difference = c(6, 7, 12, 11, 15, 16, 11, 13, 13, 10, 11, 5, 4, 8, 5, 7, 12), Group_Type = c(“T”, “T”, “T”, “T”, “T”, “T”, “T”, “T”, “T”, “T”, “C”, “C”, “C”, “C”, “C”, “C”, “C”))

group_t <- data\(Difference[data\)Group_Type == “T”] group_c <- data\(Difference[data\)Group_Type == “C”]

t_test_result <- t.test(group_t, group_c, alternative = “greater”) t_statistic <- t_test_result\(statistic p_value <- t_test_result\)p.value

test_results <- data.frame( “t-statistic” = round(t_statistic, 3), “p-value” = round(p_value, 3))

test_results %>% kable(caption = “T-test Results of a one-sided test: Treatment vs. Control”, digits = 3, format = “markdown”) %>% kable_styling(bootstrap_options = c(“striped”, “hover”, “condensed”),position = “center”, font_size = 20, full_width = FALSE) %>% column_spec(1, width = “10cm”) %>% column_spec(2, width = “10cm”)

{r, echo=FALSE, warning=FALSE, message=FALSE} library(ggplot2) mean <- 0 sd <- 1 x <- seq(-4, 4, length = 1000) y <- dnorm(x, mean, sd) data <- data.frame(x, y) alpha <- 0.05 z_alpha <- qnorm(1 - alpha, mean, sd) ggplot(data, aes(x, y)) + geom_line(color = “blue”, size = 1) + geom_area(data = subset(data, x >= z_alpha), aes(x, y), fill = “blue”, alpha = 0.4) + geom_vline(xintercept = z_alpha, color = “blue”, linetype = “dashed”, size = 0.8) + annotate(“text”, x = z_alpha + 0.2, y = 0.02, label = expression(alpha == 0.05), color = “blue”, size = 5, hjust = 0) + labs(title = “Normal Distribution”, font_size = 19, x = “Z-Score”, y = ““) + theme_minimal()

{r include=FALSE, message=FALSE, warning=FALSE} library(car) library(pscl)

base_model <- glm(death ~ pre_age + income + pack_years + alcoholfreq + school, family = binomial(link = “logit”), data = my_data) summary(base_model) vif(base_model) print(vif) pR2(base_model)[“McFadden”]

new_model_1 <- glm(death ~ pre_age + income + pack_years + alcoholfreq + school, family = binomial(link = “logit”), data = my_data) summary(new_model_1) vif(new_model_1) print(vif) pR2(base_model)[“McFadden”]

new_model_2 <- glm(death ~ pre_age + log(income) + pack_years + alcoholfreq + school + marital, family = binomial(link = “logit”), data = my_data) summary(new_model_2) vif(new_model_2) print(vif) pR2(base_model)[“McFadden”]

new_model_3<- glm(death ~ pre_age + income + pack_years + alcoholfreq + school + active, family = binomial(link = “logit”), data = my_data) summary(new_model_3) vif(new_model_3) print(vif) pR2(base_model)[“McFadden”]

new_model_4 <- glm(death ~ pre_age + income + pack_years + alcoholfreq + school + marital, family = binomial(link = “logit”), data = my_data) summary(new_model_4) vif(new_model_4) print(vif) pR2(base_model)[“McFadden”]

final_model <- glm(death ~ pre_age + log(income) + pack_years + alcoholfreq + school, family = binomial(link = “logit”), data = my_data) summary(final_model) vif(final_model) print(vif) pR2(final_model)[“McFadden”]

{r, echo=FALSE, warning=FALSE, message=FALSE} library(stargazer) stargazer(final_model, type = “text”)

{r, echo=FALSE} stargazer(vif(final_model), type = “text”)

{r, echo=FALSE} library(kableExtra) summary_stats <- data.frame(Variable = c(“Age”, “Pre_age”, “log(Income)”, “Pack_Years”, “Alcohol_Freq”, “School”), Min = c(min(my_data\(age, na.rm = TRUE),min(my_data\)pre_age, na.rm = TRUE),min(my_data\(logincome, na.rm = TRUE), min(my_data\)pack_years, na.rm = TRUE),min(my_data\(alcoholfreq, na.rm = TRUE ),min(my_data\)school, na.rm = TRUE)), Max = c(max(my_data\(age, na.rm = TRUE),max(my_data\)pre_age, na.rm = TRUE),max(my_data\(logincome, na.rm = TRUE), max(my_data\)pack_years, na.rm = TRUE),max(my_data\(alcoholfreq, na.rm = TRUE),max(my_data\)school, na.rm = TRUE)), Mean = c(mean(my_data\(age, na.rm = TRUE),mean(my_data\)pre_age, na.rm = TRUE),mean(my_data\(logincome, na.rm = TRUE), mean(my_data\)pack_years, na.rm = TRUE),mean(my_data\(alcoholfreq, na.rm = TRUE),mean(my_data\)school, na.rm = TRUE)), SD = c(sd(my_data\(age, na.rm = TRUE),sd(my_data\)pre_age, na.rm = TRUE),sd(my_data\(logincome, na.rm = TRUE), sd(my_data\)pack_years, na.rm = TRUE),sd(my_data\(alcoholfreq, na.rm = TRUE), sd(my_data\)school, na.rm = TRUE)))

summary_stats %>% kable(caption = “Summary Statistics of the Variables from final Model logistical regression”) %>% kable_styling(bootstrap_options = c(“striped”, “hover”, “condensed”), full_width = TRUE, font_size = 18)

{r, echo=FALSE} library(ggplot2) ols_model <- lm(death ~ qsmk * pre_age, data = my_data) ols_summary <- summary(ols_model) ggplot(my_data, aes(x = pre_age, y = death, color = qsmk)) + geom_point(alpha = 0.6) + geom_smooth(method = “lm”, se = TRUE) + labs(title = “Impact of Smoking on Death Probability Over Age”,x = “Age (pre_age)”,y = “Probability of Death”,color = “Smoking Status”) + facet_wrap(~ qsmk, labeller = labeller(qsmk = c(“0” = “Non-Smokers”, “1” = “Smokers”))) + theme_minimal()

{r, echo=FALSE} library(ggplot2) ggplot(my_data, aes(x = age, y = pack_years, color = factor(death))) + geom_point() +labs(title = “Pack Years vs Death”,x = “Pack Years”,y = “Age”,color = “Death Status”) + theme_minimal()

{r, echo=FALSE, warning=FALSE} library(ggplot2) ggplot(my_data, aes(x = logincome, y = death)) + geom_point(alpha = 0.7, color = “blue”) +
geom_smooth(method = “lm”, se = TRUE, color = “red”, linetype = “dashed”) + labs(title = “Relationship between Income and Death Status”,x = “Log(Income)”, y = “Death Status (0 = No, 1 = Yes)”) + theme_minimal()

{r, echo=FALSE, message=FALSE} library(ggplot2) ggplot(my_data, aes(x = school, y = death)) + geom_point(alpha = 0.7, color = “blue”) +
geom_smooth(method = “lm”, se = TRUE, color = “red”, linetype = “dashed”) + labs(title = “Relationship between school and Death”,x = “school”, y = “Death Status (0 = No, 1 = Yes)”) + theme_minimal() + theme(plot.title = element_text(hjust = 0.5),axis.title = element_text(size = 12), axis.text = element_text(size = 10) )