Exercise 1.8 Smoking habits of UK residents
smokingUK <- read.csv('http://raw.githubusercontent.com/Jbryer/DATA606Spring2017/master/Data/Data%20from%20openintro.org/Ch%201%20Exercise%20Data/smoking.csv')
What does each row of the data matrix represent? Each row represents a single case: one UK resident whose smoking habits were recorded in the survey.
smokingUK[1:3,1:12]
## gender age maritalStatus highestQualification nationality ethnicity
## 1 Male 38 Divorced No Qualification British White
## 2 Female 42 Single No Qualification British White
## 3 Male 40 Married Degree English White
## grossIncome region smoke amtWeekends amtWeekdays type
## 1 2,600 to 5,200 The North No NA NA
## 2 Under 2,600 The North Yes 12 12 Packets
## 3 28,600 to 36,400 The North No NA NA
How many participants were included in the survey?
dim(smokingUK)
## [1] 1691 12
Indicate whether each variable in the study is numerical or categorical. If numerical, identify it as continuous or discrete. If categorical, indicate whether the variable is ordinal.
names(smokingUK)
## [1] "gender" "age" "maritalStatus"
## [4] "highestQualification" "nationality" "ethnicity"
## [7] "grossIncome" "region" "smoke"
## [10] "amtWeekends" "amtWeekdays" "type"
Numerical variables (discrete): age (recorded in whole years), amtWeekends, amtWeekdays
Categorical variables (ordinal): highestQualification, grossIncome
Categorical variables (nominal, not ordinal): gender, maritalStatus, nationality, ethnicity, region, smoke, type
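The classification can be partly checked in R. The sketch below is only an illustration: `toy` is a hypothetical data frame holding the three rows printed above (four columns only), and the ordered factor lists just the three income bands visible in that output, while the full data set presumably has more.

```r
# Hypothetical toy data: the three rows printed above, four columns only.
toy <- data.frame(
  gender      = c("Male", "Female", "Male"),
  age         = c(38, 42, 40),
  grossIncome = c("2,600 to 5,200", "Under 2,600", "28,600 to 36,400"),
  amtWeekends = c(NA, 12, NA),
  stringsAsFactors = FALSE
)

# Columns stored as numbers are candidates for numerical variables.
is_num <- sapply(toy, is.numeric)
print(is_num)

# An ordinal variable can be stored as an ordered factor.
# Assumption: only the three levels seen in the printed rows are listed here.
toy$grossIncome <- factor(
  toy$grossIncome,
  levels  = c("Under 2,600", "2,600 to 5,200", "28,600 to 36,400"),
  ordered = TRUE
)

# Ordered factors support comparisons, which nominal variables do not.
print(toy$grossIncome[2] < toy$grossIncome[1])
```

Storage type is only a hint: whether a numeric column is discrete or continuous, and whether a text column is ordinal, still has to be judged from what the variable measures.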
Exercise 1.10 Cheaters? Scope of inference pg.58
a) Identify the population of interest and the sample in the study.
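The sample is 160 children between the ages of 5 and 15: 80 children in the group that was instructed not to cheat and 80 children in the group that received no instructions. The population of interest is all children between the ages of 5 and 15.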
Because the study examines the relationship between honesty, age, and conduct, the results may show a pattern of conduct that depends on the age of the participants and on the group they were in. However, tossing a coin does not determine the children's characteristics, so the relationship might be causal.
Exercise 1.28 Reading the paper pg.62
a) Can we conclude that smoking causes dementia later in life? Explain your reasoning.
Studies indicate that smoking's impact on dementia risk depends on dose; that is, the more you smoke, the greater your risk. Former smokers are at a higher risk of dementia over time, pack-a-day smokers were 37% more likely to develop dementia, and smoking more than two packs a day doubled the risk.
The statement is not justified. The study only shows an association between students' behavior and sleep disorders, and another factor related to both variables might create the pattern seen in the children who bully.
Exercise 1.36 Exercise and mental health.
a) What type of study is this? An experiment: subjects are selected by random sampling and then assigned to a treatment group or a control group.
b) What are the treatment and control groups in this study?
Treatment group: Patients instructed to exercise twice a week. Control group: Patients instructed not to exercise.
c) Does this study make use of blocking? If so, what is the blocking variable?
Yes. Blocking is used: subjects are first grouped into blocks, and the cases within each block are then randomized to the treatment groups (pg.24).
The blocking variable is age.
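Randomizing within blocks can be sketched in R as follows. This is a minimal illustration, not the researcher's procedure: the `subjects` data frame, the block size of four, and the three age-group labels are hypothetical, and the seed is arbitrary.

```r
set.seed(36)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical subjects: 12 people in three age blocks of 4.
subjects <- data.frame(
  id        = 1:12,
  age_group = rep(c("18-30", "31-40", "41-55"), each = 4),
  stringsAsFactors = FALSE
)

# Within each age block, randomly assign half to each condition.
subjects$group <- NA_character_
for (block in unique(subjects$age_group)) {
  in_block <- subjects$age_group == block
  labels <- rep(c("exercise", "no exercise"), length.out = sum(in_block))
  subjects$group[in_block] <- sample(labels)
}

# Every block contributes equally to both conditions.
assignment <- table(subjects$age_group, subjects$group)
print(assignment)
```

Because the shuffle happens separately inside each block, the two conditions end up with the same age composition, so age cannot explain a difference between them.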
d) Does this study make use of blinding? “Blinding is the practice of not telling subjects whether they are receiving a placebo.”
No, this study does not make use of blinding. All patients in both groups know whether or not they have to exercise.
e) Comment on whether or not the results of the study can be used to establish a causal relationship between exercise and mental health, and indicate whether or not the conclusions can be generalized to the population at large.
The study does not indicate how many participants were included, which is an important property of this type of study, so I am not able to determine whether the results can be generalized to the population at large.
f) Suppose you are given the task of determining if this proposed study should get funding. Would you have any reservations about the study proposal?
The assessment plan for the treatment group is not clear or well designed: neither the length nor the intensity of the exercise is specified.
Random sampling studies require detailed knowledge of the population characteristics, and for this particular study relevant information is missing.
Exercise 1.48 Stats Scores
statsscore <- read.csv('https://raw.githubusercontent.com/jbryer/DATA606Spring2017/master/Data/Data%20from%20openintro.org/Ch%201%20Exercise%20Data/stats_scores.csv')
Create a box plot of the distribution of these scores
summary(statsscore$scores)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 57.00 72.75 78.50 77.70 82.25 94.00
boxplot(statsscore$scores, ylab = "Score")
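The summary() output above already hints at what the box plot will show. As a rough check, the sketch below applies the common 1.5 × IQR rule to the printed quartiles; the variable names are made up for illustration, and boxplot() itself works from hinges (see fivenum()), which can differ slightly from these quantiles.

```r
# Values copied from the summary() output above.
lowest  <- 57.00
q1      <- 72.75
q3      <- 82.25
highest <- 94.00

# 1.5 * IQR rule for flagging potential outliers.
iqr         <- q3 - q1
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

print(c(iqr = iqr, lower_fence = lower_fence, upper_fence = upper_fence))

# Is the minimum below the lower fence? Is the maximum above the upper one?
print(c(low_outlier = lowest < lower_fence, high_outlier = highest > upper_fence))
```

Under this rule the minimum score of 57 falls just below the lower fence of 58.5, while the maximum of 94 stays inside the upper fence of 96.5.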
Exercise 1.50 Mix and Match
Describe the distribution in the histograms below and match them to the box plots.
Exercise 1.56 Distribution and appropriate statistics
a) Housing prices in a country where 25% of the houses cost below $350,000, 50% of the houses cost below $450,000, 75% of the houses cost below $1,000,000 and there are a meaningful number of houses that cost more than $6,000,000.
Housing prices: 25% cost below $350,000, 50% cost below $450,000, 75% cost below $1,000,000, and a meaningful number of houses cost more than $6,000,000.
These prices are the quartiles: Q1 = $350,000 (25th percentile), Q2 = $450,000 (median), and Q3 = $1,000,000 (75th percentile). The data are better described by the median and the interquartile range: most of the housing prices are below $1,000,000, so the houses that cost more than $6,000,000 are outliers. In a box plot this distribution is skewed to the right, and the houses that cost more than $6,000,000 form a long right tail similar to that of an exponential distribution.
b) Housing prices in a country where 25% of the houses cost below $300,000, 50% of the houses cost below $600,000, 75% of the houses cost below $900,000 and very few houses that cost more than $1,200,000.
Here the quartiles are Q1 = $300,000, Q2 = $600,000, and Q3 = $900,000, so they are evenly spaced and the distribution is roughly symmetric. The few houses above $1,200,000 are close to Q3, which suggests that the mean and standard deviation are sufficient to describe the data.
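One way to put a number on the shape of the two housing distributions is a quartile-based skewness coefficient. The sketch below uses Bowley's coefficient, (Q3 + Q1 - 2·Q2) / (Q3 - Q1), which is not part of the exercise and is brought in here only as an illustration; `bowley_skew` is a hypothetical helper name.

```r
# Bowley (quartile) skewness: 0 for symmetric quartiles,
# positive when the upper half is more spread out (right skew).
bowley_skew <- function(q1, q2, q3) {
  (q3 + q1 - 2 * q2) / (q3 - q1)
}

# Quartiles taken from parts (a) and (b) of the exercise.
skew_a <- bowley_skew(350000, 450000, 1000000)
skew_b <- bowley_skew(300000, 600000, 900000)

print(round(c(part_a = skew_a, part_b = skew_b), 4))
```

Part (a) gives about 0.69, a clear right skew that favors the median and IQR, while part (b) gives exactly 0, consistent with a symmetric distribution for which the mean and standard deviation work well.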
c) Number of alcoholic drinks consumed by college students in a given week. Assume that most of these students don’t drink since they are under 21 years old, and only a few drink excessively.
The distribution is skewed to the right and resembles an exponential distribution, because most students do not drink and only a few drink excessively. The median and IQR are therefore the appropriate statistics.
d) Annual salaries of the employees at a Fortune 500 company where only a few high level executives earn much higher salaries than all the other employees.
The distribution is right skewed, considering that a few high-level executives might make well over $100,000 while most employees might make less than $100,000. The effect of these outliers can be controlled by using the median and IQR.
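The effect of a few very high salaries on the mean can be seen with made-up numbers. The payroll below is entirely hypothetical (95 employees at $60,000 and 5 executives at $2,000,000) and serves only to show why the median is the more robust center.

```r
# Hypothetical payroll, not data from the exercise.
salaries <- c(rep(60000, 95), rep(2000000, 5))

mean_salary   <- mean(salaries)
median_salary <- median(salaries)
print(c(mean = mean_salary, median = median_salary))

# Dropping the five executives changes the mean a lot, the median not at all.
staff_only <- salaries[salaries < 1000000]
print(c(mean = mean(staff_only), median = median(staff_only)))
```

With these numbers the mean is $157,000 even though 95% of employees earn $60,000, while the median stays at $60,000 with or without the executives.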
Exercise 1.70 Heart transplants.
The Stanford University Heart Transplant Study was conducted to determine whether an experimental heart transplant program increased lifespan. Each patient entering the program was designated an official heart transplant candidate, meaning that he was gravely ill and would most likely benefit from a new heart. Some patients got a transplant and some did not. The variable transplant indicates which group the patients were in; patients in the treatment group got a transplant and those in the control group did not. Another variable called survived was used to indicate whether or not the patient was alive at the end of the study. Of the 34 patients in the control group, 30 died. Of the 69 people in the treatment group, 45 died.
Based on the mosaic plot, survival appears to depend on whether or not the patient received a transplant.
The box plots suggest that patients in the treatment group survived longer than those in the control group.
Proportion of patients who died: control group 30/34 = 0.8824; treatment group 45/69 = 0.6522.
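The two proportions can be reproduced from a small contingency table. In the sketch below the counts of patients who died come from the exercise text, the counts of survivors are derived as 34 - 30 = 4 and 69 - 45 = 24, and the object names are chosen only for illustration.

```r
# Rows: study group; columns: outcome at the end of the study.
counts <- matrix(
  c( 4, 30,    # control:   34 - 30 alive, 30 dead
    24, 45),   # treatment: 69 - 45 alive, 45 dead
  nrow = 2, byrow = TRUE,
  dimnames = list(group = c("control", "treatment"),
                  outcome = c("alive", "dead"))
)

# Row proportions: share of each group that died.
died <- prop.table(counts, margin = 1)[, "dead"]
print(round(died, 4))

# Difference in proportions, treatment minus control.
diff_died <- died[["treatment"]] - died[["control"]]
print(round(diff_died, 4))
```

The difference, treatment minus control, is about -0.2302, meaning the share of patients who died was roughly 23 percentage points lower in the transplant group.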
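What are the claims being tested? The claim being tested is that transplanted patients were more likely to survive than non-transplanted patients; the competing (null) claim is that survival is independent of whether the patient received a transplant.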
The paragraph below describes the setup for such an approach, if we were to do it without using statistical software. Fill in the blanks with a number or phrase, whichever is appropriate.
We write alive on 28 cards representing patients who were alive at the end of the study, and dead on 75 cards representing patients who were not. Then, we shuffle these cards and split them into two groups: one group of size 69 representing treatment, and another group of size 34 representing control. We calculate the difference between the proportion of dead cards in the treatment and control groups (treatment - control) and record this value. We repeat this 100 times to build a distribution centered at 0. Lastly, we calculate the fraction of simulations where the simulated differences in proportions are less than or equal to the observed difference, 45/69 - 30/34 = -0.2302. If this fraction is low, we conclude that it is unlikely to have observed such an outcome by chance and that the null hypothesis should be rejected in favor of the alternative.
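The card-shuffling procedure can also be carried out in software. The following is a minimal sketch under stated assumptions: the deck holds 28 alive and 75 dead cards (30 + 45 deaths among 103 patients), the seed is arbitrary, and `simulate_diff` is a hypothetical helper written for this illustration.

```r
set.seed(170)  # arbitrary seed, only so the sketch is reproducible

# The deck: one card per patient, labelled with the outcome.
cards <- c(rep("alive", 28), rep("dead", 75))
n_treatment <- 69  # the remaining 34 cards form the control group

# Observed difference in the proportion who died (treatment - control).
observed_diff <- 45 / 69 - 30 / 34

# One shuffle: deal 69 cards to treatment, the rest to control,
# and return the difference in the proportion of dead cards.
simulate_diff <- function(cards, n_treatment) {
  shuffled  <- sample(cards)
  treatment <- shuffled[seq_len(n_treatment)]
  control   <- shuffled[-seq_len(n_treatment)]
  mean(treatment == "dead") - mean(control == "dead")
}

# Repeat 100 times, as in the paragraph above.
sim_diffs <- replicate(100, simulate_diff(cards, n_treatment))

# Fraction of shuffles at least as extreme as the observed difference.
p_value <- mean(sim_diffs <= observed_diff)
print(p_value)
```

With only 100 repetitions the estimated fraction is coarse (it moves in steps of 0.01), so a larger number of shuffles would give a steadier estimate.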