(1.10, p. 20) A survey was conducted to study the smoking habits of UK residents. Below is a data matrix displaying a portion of the data collected in this survey. Note that “£” stands for British Pounds Sterling, “cig” stands for cigarettes, and “N/A” refers to a missing component of the data.
# loading the data from my github repro
theUrl <- "https://raw.githubusercontent.com/forhadakbar/data606fall2019stat/master/SmokingData.csv"
smoking<- read.csv(url(theUrl), header = TRUE, sep = ",")Answer: Each row represents a single observation in this dataset. Each row means the data of a UK resident. The Data represents the surveyed person sex, age, Marital status, Gross income (exact or approximate), number of cigarretes smoked during the weekend and weekdays. Inaddition, i found few extra field as Highest Qualification, Nationality, Ethnicity, region they live when i load the data file. For example, first row in the table represet a uk female resident of 42 years of age who earn Under £2,600 and smokes 12 cigarestes everyday who is not married or single.
## [1] 1693
Answer: According to table index provided in home work there is 1691 rows or participents. But when i use the data file given and run nrow command i see there is 1693 rows or participents.
## 'data.frame': 1693 obs. of 14 variables:
## $ Sex : Factor w/ 2 levels "Female","Male": 2 1 2 1 1 1 2 2 2 1 ...
## $ Age : int 38 42 40 40 39 37 53 44 40 41 ...
## $ Marital.Status : Factor w/ 5 levels "Divorced","Married",..: 1 4 2 2 2 2 2 4 4 2 ...
## $ Highest.Qualification: Factor w/ 9 levels "99","A Levels",..: 7 7 3 3 5 5 3 3 4 7 ...
## $ Nationality : Factor w/ 8 levels "British","English",..: 1 1 2 2 1 1 1 2 2 2 ...
## $ Ethnicity : Factor w/ 7 levels "Asian","Black",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ Gross.Income : Factor w/ 10 levels "£10400 to less than £15600",..: 4 8 5 1 4 2 6 1 4 7 ...
## $ Region : Factor w/ 7 levels "London","Midlands & East Anglia",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ Smoke. : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 2 1 2 2 ...
## $ Amount.Weekends : Factor w/ 26 levels "0","1","10","12",..: 26 4 26 26 26 26 20 26 23 5 ...
## $ Amount.Weekdays : Factor w/ 26 levels "0","1","10","12",..: 26 4 26 26 26 26 21 26 23 4 ...
## $ Type : Factor w/ 5 levels "Both/Mainly Hand-Rolled",..: 4 5 4 4 4 4 5 4 3 5 ...
## $ X : logi NA NA NA NA NA NA ...
## $ X.1 : logi NA NA NA NA NA NA ...
## [1] "data.frame"
## [1] 1693 14
## [1] "Sex" "Age"
## [3] "Marital.Status" "Highest.Qualification"
## [5] "Nationality" "Ethnicity"
## [7] "Gross.Income" "Region"
## [9] "Smoke." "Amount.Weekends"
## [11] "Amount.Weekdays" "Type"
## [13] "X" "X.1"
## [1] FALSE
No null found in the data frame
## [1] TRUE
Missing values found in the data frame
(1.14, p. 29) Exercise 1.5 introduces a study where researchers studying the relationship between honesty, age, and self-control conducted an experiment on 160 children between the ages of 5 and 151. The researchers asked each child to toss a fair coin in private and to record the outcome (white or black) on a paper sheet, and said they would only reward children who report white. Half the students were explicitly told not to cheat and the others were not given any explicit instructions. Differences were observed in the cheating rates in the instruction and no instruction groups, as well as some differences across children’s characteristics within each group.
Answer: The population of interest are Children between ages 5 and 15. The sample size is 160.
Answer:
It is unclear whether 160 children is a large enough sample size or if there was a selection bias (e.g. all children from the same school) or random sampling was done. Additionally, these results could only be generalized to children between the ages of 5 and 15.
Since this is an experiment, and not an observational study hence causal relationships could possibly be determined. However, the experiment deals directly with cheating (e.g. honesty) and it does not appear that the study has a way to separate the factor of self-control. Therefore, it is unclear if a students reporting incorrect numbers are dishonest or honest with a low level of self-control due to the reward. Possibly, if the reward were varied then data could be collected determining how often a child incorrectly reported results between reward types, which would show how many children would lie for something they wanted rather than just in general.
Reading the paper. (1.28, p. 31) Below are excerpts from two articles published in the NY Times:
“Researchers analyzed data from 23,123 health plan members who participated in a voluntary exam and health behavior survey from 1978 to 1985, when they were 50-60 years old. 23 years later, about 25% of the group had dementia, including 1,136 with Alzheimer’s disease and 416 with vascular dementia. After adjusting for other factors, the researchers concluded that pack-a- day smokers were 37% more likely than nonsmokers to develop dementia, and the risks went up with increased smoking; 44% for one to two packs a day; and twice the risk for more than two packs.”
Based on this study, can we conclude that smoking causes dementia later in life? Explain your reasoning.
Answer: No, we cannot conclude that smoking causes dementia later in life because the data was only from health plan members and not a presentation of all population. This does not take into account people who do not have health plan memberships. Not only that smoking might kill people even before someone reach the age to get dimensia. so, if smoking kills someone before they show signs of dementia, it is impossible to accurately count that person.
Additionally, it is an observational study and can therefore cannot be used to establish causal relationships.
“The University of Michigan study, collected survey data from parents on each child’s sleep habits and asked both parents and teachers to assess behavioral concerns. About a third of the students studied were identified by parents or teachers as having problems with disruptive behavior or bullying. The researchers found that children who had behavioral issues and those who were identified as bullies were twice as likely to have shown symptoms of sleep disorders.”
A friend of yours who read the article says, “The study shows that sleep disorders lead to bullying in school children.” Is this statement justified? If not, how best can you describe the conclusion that can be drawn from this study?
Answer: This statement is not justified as there could be many other confounding variables between sleep habits and bullying. It is possible that the relationship is actually opposite - maybe being a bully is interfering with a child’s sleep. It can’t be justified that only becasue there is some association with two variable they have causation. Furthermore, it is an observational study and hence cannot be used to establish causal relationships.
We can only conclude what is said in the article, that “children who had behavioral issues and those who were identified as bullies were twice as likely to have shown symptoms of sleep disorders.” The more accurate conclusion is that there is association between the two variables.
(1.34, p. 35) A researcher is interested in the effects of exercise on mental health and he proposes the following study: Use stratified random sampling to ensure rep- resentative proportions of 18-30, 31-40 and 41-55 year olds from the population. Next, randomly assign half the subjects from each age group to exercise twice a week, and instruct the rest not to exercise. Conduct a mental health exam at the beginning and at the end of the study, and compare the results.
Answer: This is an experiment
Answer: “Treatment group” is the part of the group that exercises and “control group” is the people who are told not to exercise.
Answer: Yes, the blocking variable is age groups.
Answer: When researchers keep the patients uninformed about their treatment, the study is said to be blind. No, there is no way to use a placebo in this study - we can’t make the control group believe they are exercising as well.
Answer: Yes, this is an experiment and it can be used to establish a causal relationship. It can be used to generalize populaton at large since a stratified random sampling was used.
Answer: The study seems to be reasonable, I might question the age groups used for the stratified sample. Why was that particular breakdown used? Is that the best way to break down subjects into groups? Also, there might be influence on mental health for work hours, sleeping habits, diet etc
Alessandro Bucciol and Marco Piovesan. “Luck or cheating? A field experiment on honesty with children”. In: Journal of Economic Psychology 32.1 (2011), pp. 73–78. Available at https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1307694↩