Loading the libraries
library(RCurl)
## Loading required package: bitops
library(plyr)
(a) What does each row of the data matrix represent?
Answer:
Each row of the data matrix represents one survey participant and records that person's gender, age, marital status, highest qualification, nationality, ethnicity, gross income, region, whether they smoke, how much they smoke on weekends and on weekdays, and what type of cigarette they smoke.
url<-getURL("https://raw.githubusercontent.com/jbryer/DATA606Fall2017/master/Data/Data%20from%20openintro.org/Ch%201%20Exercise%20Data/smoking.csv")
df<-data.frame(read.csv(text=url,header=TRUE))
head(df)
## gender age maritalStatus highestQualification nationality ethnicity
## 1 Male 38 Divorced No Qualification British White
## 2 Female 42 Single No Qualification British White
## 3 Male 40 Married Degree English White
## 4 Female 40 Married Degree English White
## 5 Female 39 Married GCSE/O Level British White
## 6 Female 37 Married GCSE/O Level British White
## grossIncome region smoke amtWeekends amtWeekdays type
## 1 2,600 to 5,200 The North No NA NA
## 2 Under 2,600 The North Yes 12 12 Packets
## 3 28,600 to 36,400 The North No NA NA
## 4 10,400 to 15,600 The North No NA NA
## 5 2,600 to 5,200 The North No NA NA
## 6 15,600 to 20,800 The North No NA NA
(b) How many participants were included in the survey?
A total of 1,691 participants were included in the survey.
(c) Indicate whether each variable included in the survey is numerical or categorical. If numerical, identify it as continuous or discrete. If categorical, indicate whether the variable is ordinal.
Gender: categorical
Age: numerical and discrete (recorded in whole years)
Marital Status: categorical
HighestQualification: categorical and ordinal
Nationality: categorical
Ethnicity: categorical
GrossIncome: categorical and ordinal (recorded as income ranges such as "2,600 to 5,200", not as exact amounts)
Region: categorical
Smoke: categorical
AmtWeekends: numerical and discrete
AmtWeekdays: numerical and discrete
Type: categorical
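The classification above can be cross-checked in R by looking at how each column is stored. The sketch below is only an illustration: it uses a few toy rows modeled on the `head(df)` output rather than the full survey, and the names `smoking`, `income_levels`, and `var_types` are made up for this example. Gross income is stored as text ranges, so it is converted to an ordered factor here.

```r
# Toy rows modeled on the head(df) output above; not the full survey data.
smoking <- data.frame(
  gender      = c("Male", "Female", "Male"),
  age         = c(38L, 42L, 40L),
  grossIncome = c("2,600 to 5,200", "Under 2,600", "28,600 to 36,400"),
  amtWeekends = c(NA, 12L, NA),
  stringsAsFactors = FALSE
)

# Income is recorded as ranges, so treat it as an ordered (ordinal) factor.
# Only the three ranges present in the toy rows are listed here.
income_levels <- c("Under 2,600", "2,600 to 5,200", "28,600 to 36,400")
smoking$grossIncome <- factor(smoking$grossIncome,
                              levels = income_levels, ordered = TRUE)

# Label each column by how R stores it.
var_types <- sapply(smoking, function(col) {
  if (is.ordered(col)) {
    "categorical (ordinal)"
  } else if (is.numeric(col)) {
    "numerical"
  } else {
    "categorical"
  }
})
print(var_types)
```

Whether a numerical variable is discrete or continuous still has to be judged from how it was measured; R's storage type alone does not settle that.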
(a) Identify the population of interest and the sample in this study.
Answer: The population of interest is children between the ages of 5 and 15. The sample is 160 children.
(b) Comment on whether or not the results of the study can be generalized to the population, and if the findings of the study can be used to establish causal relationships.
Answer: I believe that the results of the study cannot be generalized, because a sample of 160 children is too small to represent the population as a whole. For the same reason, I believe the study cannot be used to establish causal relationships either.
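(a) The study shows a strong association between smoking and dementia: smokers were 37% more likely to develop dementia, and the figure rises to 44% with heavier smoking. A difference this large is unlikely to be due to chance alone. However, smoking cannot be randomly assigned, so this is an observational study, and the association by itself does not establish that smoking causes dementia.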
(b) I would describe it by saying that children with sleep disorders are twice as likely to have behavioral issues such as bullying, but that not all bullying is caused by sleep disorders.
(a) What type of study is this?
This study is a randomized experiment.
(b) What are the treatment and control groups in this study?
The treatment group consists of the people who exercise twice a week.
The control group consists of the people who do not exercise.
(c) Does this study make use of blocking? If so, what is the blocking variable?
Answer: Yes, this study makes use of blocking, and the blocking variable is age. People between the ages of 18 and 30 are likely to be in better mental health even if they don’t exercise, so they are at low risk; those aged 31-40 are at medium risk, and those aged 41-55 are at high risk. Age can therefore play a role in this study.
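To make the idea of blocking concrete, here is a minimal sketch of a blocked random assignment in R. It is an illustration under assumptions, not the study's actual procedure: the 12 participants, the seed, and the names `participants`, `block`, and `rows` are all hypothetical. Only the three age ranges come from the answer above.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical participants, four in each age block.
participants <- data.frame(
  id        = 1:12,
  age_group = rep(c("18-30", "31-40", "41-55"), each = 4),
  stringsAsFactors = FALSE
)

# Randomize separately inside each age block so that every block
# contributes equally to the treatment and control groups.
participants$group <- NA_character_
for (block in unique(participants$age_group)) {
  rows   <- which(participants$age_group == block)
  labels <- rep(c("treatment", "control"), length.out = length(rows))
  participants$group[rows] <- sample(labels)
}

print(table(participants$age_group, participants$group))
```

Because the shuffle happens within each block, age is balanced across the two groups by construction, so a difference in outcomes cannot be explained by one group simply being older.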
(d) Does this experiment use blinding?
No, this experiment does not use blinding, because the individuals in the control group know that they are not receiving any treatment. They are simply told not to exercise.
(e) Comment on whether or not the results of the study can be used to establish a causal relationship between exercise and mental health, and indicate whether or not the conclusions can be generalized to the population as a whole.
It will be difficult to establish a causal relationship between exercise and mental health, because this is a stratified population and there may be other factors contributing to mental health that are not considered here, for example sleeping patterns, diet, and environmental, professional, and personal issues.
(f) Suppose you are given the task of determining if this proposed study should get funding. Would you have any reservations about the study proposal?
Yes, I would have reservations, because I am not convinced that this study would provide any meaningful results. I would be more interested in a study of how sleeping habits relate to mental health.
url<-getURL("https://raw.githubusercontent.com/jbryer/DATA606Fall2017/master/Data/Data%20from%20openintro.org/Ch%201%20Exercise%20Data/stats_scores.csv")
df<-data.frame(read.csv(text=url,header=TRUE))
df
## scores
## 1 79
## 2 83
## 3 57
## 4 82
## 5 94
## 6 83
## 7 72
## 8 74
## 9 73
## 10 71
## 11 66
## 12 89
## 13 78
## 14 81
## 15 78
## 16 81
## 17 88
## 18 69
## 19 77
## 20 79
summary(df$scores)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 57.00 72.75 78.50 77.70 82.25 94.00
boxplot(df$scores)
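The box plot can be checked by hand with the usual 1.5 x IQR rule. The sketch below retypes the 20 scores from the printed data frame so that it runs without the download; the variable names are made up for this example. Note that `boxplot()` places its box edges at hinges, which can differ slightly from the quartiles returned by `quantile()`, so the drawn whiskers may not match these fences exactly.

```r
# Scores copied from the printed data frame above.
scores <- c(79, 83, 57, 82, 94, 83, 72, 74, 73, 71,
            66, 89, 78, 81, 78, 81, 88, 69, 77, 79)

q   <- quantile(scores, c(0.25, 0.75))  # Q1 and Q3, R's default method
iqr <- IQR(scores)                      # Q3 - Q1

# Observations beyond 1.5 * IQR from the quartiles are flagged as outliers.
lower_fence <- q[[1]] - 1.5 * iqr
upper_fence <- q[[2]] + 1.5 * iqr
outliers    <- scores[scores < lower_fence | scores > upper_fence]

print(c(IQR = iqr, lower = lower_fence, upper = upper_fence))
print(outliers)
```

With the quartiles from the summary above (72.75 and 82.25), the IQR is 9.5 and the fences are 58.5 and 96.5, so the score of 57 is the only value flagged.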
(a) matches 2
(b) matches 3
(c) matches 1
(a) Housing prices in a country where 25% of the houses cost below $350,000, 50% cost below $450,000, 75% cost below $1,000,000, and there are a meaningful number of houses that cost more than $6,000,000.
Answer:
The distribution is right skewed: relatively few houses cost more than $6,000,000, and some houses cost between $1,000,000 and $6,000,000. The houses above $1,000,000 form a long tail on the right side of the histogram.
I feel that the median and IQR are the appropriate way to describe the typical observation and the variability in the data, because they are not pulled toward the expensive houses in the right tail the way the mean is. The IQR also gives the lower and upper limits of the middle 50% of prices.
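A small numerical example shows why skew matters when choosing a summary. The prices below are hypothetical values in thousands of dollars, invented only to mimic a right-skewed market; they are not data from the exercise.

```r
# Hypothetical house prices in thousands of dollars (made up for illustration).
prices <- c(300, 340, 400, 450, 500, 700, 1000, 2500, 6500)

center <- c(mean = mean(prices), median = median(prices))
spread <- c(sd = sd(prices), IQR = IQR(prices))

print(center)  # the mean is dragged toward the few very expensive houses
print(spread)  # the standard deviation is inflated by the same houses
```

In this toy sample the mean comes out well above the median, because the two most expensive houses dominate the average while the median stays with the bulk of the data.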
(b)
Answer:
The distribution is symmetric for the most part, because the house prices are spread evenly, with only a few outliers above $1,200,000.
We could use the mean to get some meaningful information.
(c)
Right skewed, because only a small number of students drink heavily. Most students have few or no drinks, and fewer and fewer students appear as the number of drinks rises, forming a tail on the right.
The IQR would be a good way to get some meaningful information out of the data. Looking at the outliers would also be interesting here.
(d)
Symmetric distribution with a few outliers.
We could use the mean to get meaningful information.
url<-getURL("https://raw.githubusercontent.com/jbryer/DATA606Fall2017/master/Data/Data%20from%20openintro.org/Ch%201%20Exercise%20Data/heartTr.csv")
df<-data.frame(read.csv(text=url,header=TRUE))
head(df)
## id acceptyear age survived survtime prior transplant wait
## 1 15 68 53 dead 1 no control NA
## 2 43 70 43 dead 2 no control NA
## 3 61 71 52 dead 2 no control NA
## 4 75 72 52 dead 2 no control NA
## 5 6 68 54 dead 3 no control NA
## 6 42 70 36 dead 3 no control NA
mosaicplot(table(df$transplant,df$survived))
(a) Based on the mosaic plot, is survival independent of whether or not the patient got a transplant? Explain your reasoning.
Survival does not appear to be independent of whether or not a patient got a transplant: the proportion of patients who survived in the treatment group is more than twice the proportion in the control group.
boxplot(df$survtime ~ df$transplant)
(b) What do the box plots suggest about the efficacy (effectiveness) of the heart transplant treatment?
Answer:
The box plots suggest that the transplants were effective, because the patients in the treatment group generally survived longer than those in the control group.
(c) What proportion of patients in the treatment group and what proportion of the patients in the control group died?
Answer:
Control Group
nRowsControl=nrow(subset(df,transplant=="control"))
nRowsControlDied=nrow(subset(df,transplant=="control" & survived=="dead"))
The proportion of patients who died in the control group is 0.8823529.
Treatment Group
nRowsTreatment=nrow(subset(df,transplant=="treatment"))
nRowsTreatmentDied=nrow(subset(df,transplant=="treatment" & survived=="dead"))
The proportion of patients who died in the treatment group is 0.6521739.
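Both proportions can also be obtained in one step from a two-way table. The sketch below does not download the file. It rebuilds a stand-in data frame from group counts that are assumed here because they reproduce the proportions reported above (30 of 34 in the control group, 45 of 69 in the treatment group). The label "alive" for survivors and the names `heart`, `counts`, and `died` are likewise assumptions.

```r
# Stand-in data: counts assumed so that they match the reported proportions.
heart <- data.frame(
  transplant = rep(c("control", "treatment"), times = c(34, 69)),
  survived   = c(rep(c("dead", "alive"), times = c(30, 4)),
                 rep(c("dead", "alive"), times = c(45, 24))),
  stringsAsFactors = FALSE
)

# Two-way table of group by outcome, then row proportions.
counts <- table(heart$transplant, heart$survived)
died   <- prop.table(counts, margin = 1)[, "dead"]

print(counts)
print(round(died, 7))
```

With the real data, replacing `heart` with `df` should give the same table layout, and the row proportions make the comparison between the two groups direct.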