Data-606 Homework 1

Introduction To Data

Loading the libraries

library(RCurl)

## Loading required package: bitops

library(plyr)

Problem 1.8

(a) What does each row of the data matrix represent?

Answer:

Each row of the data matrix represents one survey participant, recording that person's gender, age, marital status, highest qualification, nationality, ethnicity, gross income, region, whether they smoke, how much they smoke on weekends and on weekdays, and what type they smoke.

url<-getURL("https://raw.githubusercontent.com/jbryer/DATA606Fall2017/master/Data/Data%20from%20openintro.org/Ch%201%20Exercise%20Data/smoking.csv")

df<-data.frame(read.csv(text=url,header=TRUE))
head(df)

##   gender age maritalStatus highestQualification nationality ethnicity
## 1   Male  38      Divorced     No Qualification     British     White
## 2 Female  42        Single     No Qualification     British     White
## 3   Male  40       Married               Degree     English     White
## 4 Female  40       Married               Degree     English     White
## 5 Female  39       Married         GCSE/O Level     British     White
## 6 Female  37       Married         GCSE/O Level     British     White
##        grossIncome    region smoke amtWeekends amtWeekdays    type
## 1   2,600 to 5,200 The North    No          NA          NA        
## 2      Under 2,600 The North   Yes          12          12 Packets
## 3 28,600 to 36,400 The North    No          NA          NA        
## 4 10,400 to 15,600 The North    No          NA          NA        
## 5   2,600 to 5,200 The North    No          NA          NA        
## 6 15,600 to 20,800 The North    No          NA          NA

(b) How many participants were included in the survey?

A total of 1,691 participants were included in the survey.

(c) Indicate whether each variable included in the survey is numerical or categorical. If numerical, identify it as continuous or discrete. If categorical, indicate whether the variable is ordinal.

gender: Categorical

age: Numerical and discrete

maritalStatus: Categorical

highestQualification: Categorical and ordinal

nationality: Categorical

ethnicity: Categorical

grossIncome: Categorical and ordinal, because it is recorded as income ranges (for example "2,600 to 5,200") rather than as exact amounts

region: Categorical

smoke: Categorical

amtWeekends: Numerical and discrete

amtWeekdays: Numerical and discrete

type: Categorical
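As a quick check on the classification above, the column types can be inspected directly in R. The sketch below is a minimal illustration and not part of the original homework: it rebuilds four columns from the six rows printed by head(df) above (the name `smoking_head` is made up for this example), and it assumes that the grossIncome ranges sort from lowest to highest so that they can be stored as an ordered factor. The full data set has more income levels than the five that appear in these rows.

```r
# Six rows copied from the head(df) output above; `smoking_head` is a
# hypothetical name used only for this illustration.
smoking_head <- data.frame(
  age         = c(38, 42, 40, 40, 39, 37),
  grossIncome = c("2,600 to 5,200", "Under 2,600", "28,600 to 36,400",
                  "10,400 to 15,600", "2,600 to 5,200", "15,600 to 20,800"),
  smoke       = c("No", "Yes", "No", "No", "No", "No"),
  amtWeekends = c(NA, 12, NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Numerical columns are stored as numbers, categorical ones as text.
column_classes <- sapply(smoking_head, class)
column_classes

# Assumption: the income ranges have a natural low-to-high order, so they
# can be stored as an ordered factor (only the levels seen above are listed).
income_levels <- c("Under 2,600", "2,600 to 5,200", "10,400 to 15,600",
                   "15,600 to 20,800", "28,600 to 36,400")
smoking_head$grossIncome <- factor(smoking_head$grossIncome,
                                   levels = income_levels, ordered = TRUE)

# Ordered factors support comparisons between levels.
below_mid <- smoking_head$grossIncome < "10,400 to 15,600"
below_mid
```

On the full data frame, the same idea would apply to df after listing every income range in order.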

Problem 1.10

(a) Identify the population of interest and the sample in this study.

Answer: The population of interest is children between the ages of 5 and 15. The sample is 160 children.

(b) Comment on whether or not the results of the study can be generalized to the population, and if the findings of the study can be used to establish causal relationships.

Answer: I believe that the results of the study cannot be generalized because the sample is too small to apply the results to the population as a whole. For the same reason, I believe that it cannot be used to establish causal relationships either.

Problem 1.28

(a) Based on the study, we can conclude that smoking causes dementia because smokers are 37% more likely to get dementia. This is a big number, and it is unlikely to happen by chance. With heavier smoking, the increased likelihood rises to 44%.

(b) I would describe it by saying that children with sleep disorders are twice as likely to have behavioural issues and to bully, but not all bullying is caused by sleep problems.

Problem 1.36

(a)What type of study is this?

This type of study is a randomized experiment.

(b) What are the treatment and control groups in this study?

The treatment group is the people who will exercise twice a week.

The control group is the people who will not be exercising.

(c) Does this study make use of blocking? If so, what is the blocking variable?

Answer: Yes, this study makes use of blocking. The blocking variable is age. People between the ages of 18 and 30 have a chance of being in better mental health even if they don’t exercise, so they are at low risk; those aged 31-40 are at medium risk, and those aged 41-55 are at high risk. I would say age can play a role in this study.
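To illustrate what blocking on age means in practice, here is a minimal sketch of random assignment carried out separately within each age group. It is not taken from the study: the `participants` data frame, its 12 rows, and the column names are hypothetical, and only the three age ranges come from the answer above.

```r
set.seed(606)  # fixed seed so the illustration is reproducible

# Hypothetical participants: four people in each age block.
participants <- data.frame(
  id        = 1:12,
  age_group = rep(c("18-30", "31-40", "41-55"), each = 4),
  stringsAsFactors = FALSE
)

# Randomize to treatment or control separately inside each block,
# so every age group contributes equally to both arms.
participants$group <- NA_character_
for (block in unique(participants$age_group)) {
  rows <- which(participants$age_group == block)
  labels <- rep(c("treatment", "control"), length.out = length(rows))
  participants$group[rows] <- sample(labels)
}

# Each age group should be split evenly between the two arms.
assignment_table <- table(participants$age_group, participants$group)
assignment_table
```

Because the shuffle happens inside each block, age cannot end up unevenly distributed between the exercise and no-exercise groups.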

(d) Does this experiment use blinding?

No, this experiment does not use blinding because the individuals in the control group know that they are not receiving any treatment. They are simply told not to exercise.

(e) Comment on whether or not the results of the study can be used to establish a causal relationship between exercise and mental health, and indicate whether or not the conclusions can be generalized to the population as a whole.

It will be difficult to establish a causal relationship between exercise and mental health because this is a stratified population and there may be other factors that contribute to mental health that are not considered here, for example sleeping patterns, diet, and environmental, professional, and personal issues.

(f) Suppose you are given the task of determining if this proposed study should get funding. Would you have any reservations about the study proposal?

Yes, I would have reservations because I am not convinced that this study would provide any meaningful results. I would be more interested in a study of how sleeping habits relate to mental health.

Problem 1.48

Stats Scores

url<-getURL("https://raw.githubusercontent.com/jbryer/DATA606Fall2017/master/Data/Data%20from%20openintro.org/Ch%201%20Exercise%20Data/stats_scores.csv")

df<-data.frame(read.csv(text=url,header=TRUE))
df

##    scores
## 1      79
## 2      83
## 3      57
## 4      82
## 5      94
## 6      83
## 7      72
## 8      74
## 9      73
## 10     71
## 11     66
## 12     89
## 13     78
## 14     81
## 15     78
## 16     81
## 17     88
## 18     69
## 19     77
## 20     79

summary(df$scores)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   57.00   72.75   78.50   77.70   82.25   94.00

boxplot(df$scores)
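The summary above gives the quartiles, and the box plot flags points that lie far from them. As a rough sketch of that rule, the code below recomputes the IQR from the 20 scores printed above and applies the conventional 1.5 × IQR fences. The variable names are chosen for this example. Note that boxplot() builds its box from hinges, which can differ slightly from the quantile() values used here.

```r
# The 20 exam scores printed above.
scores <- c(79, 83, 57, 82, 94, 83, 72, 74, 73, 71,
            66, 89, 78, 81, 78, 81, 88, 69, 77, 79)

q1  <- quantile(scores, 0.25, names = FALSE)  # first quartile
q3  <- quantile(scores, 0.75, names = FALSE)  # third quartile
iqr <- q3 - q1                                # same value as IQR(scores)

# Conventional 1.5 * IQR fences for flagging potential outliers.
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Scores that fall outside the fences.
outliers <- scores[scores < lower_fence | scores > upper_fence]
outliers
```

With these scores, the only value outside the fences is the minimum, 57.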

Problem 1.50

(a) matches (2)

(b) matches (3)

(c) matches (1)

Problem 1.56

(a) Housing prices in a country where 25% of the houses cost below $350,000, 50% cost below $450,000, 75% cost below $1,000,000, and there are a meaningful number of houses that cost more than $6,000,000.

Answer:

The distribution is right skewed: relatively few houses cost more than $6,000,000, but some houses cost between $1,000,000 and $6,000,000. The houses above $1,000,000 form a long tail on the right side of the histogram.

I feel that the IQR is the appropriate way to summarize this data because it gives the lower and upper limits of the middle half of the prices and shows how far the data spread around the center.

(b)

Answer:

The distribution is going to be symmetric for the most part because the house prices will be distributed evenly, with only a few outliers above $1,200,000.

We could use the mean to get meaningful information.

(c)

Right skewed, because only a very small number of students drink a large amount; most students have few drinks, and the number of students falls off as the number of drinks increases, making a tail at the end.

The IQR would be a good way to get meaningful information out of the data; looking at the outliers would also be interesting here.

(d)

Symmetric distribution with a few outliers.

We could use the mean to get meaningful information.
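The choice between the mean and the IQR in the answers above depends on how strongly extreme values affect each summary. The following sketch uses a small set of made-up house prices (invented for illustration, not taken from the exercise) to show how one very expensive house pulls the mean to the right while the median and IQR stay with the bulk of the data.

```r
# Hypothetical house prices in dollars, invented for illustration only.
prices <- c(300000, 340000, 400000, 450000, 500000, 900000, 6500000)

price_mean   <- mean(prices)    # pulled upward by the one very expensive house
price_median <- median(prices)  # stays with the bulk of the prices
price_iqr    <- IQR(prices)     # spread of the middle half of the prices

c(mean = price_mean, median = price_median, IQR = price_iqr)

# Raising the extreme price even further changes the mean but not the IQR.
prices_extreme <- replace(prices, 7, 20000000)
IQR(prices_extreme) == price_iqr
```

For a roughly symmetric distribution the mean and median would be close, which is why the mean is a reasonable summary in those cases.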

Problem 1.70

url<-getURL("https://raw.githubusercontent.com/jbryer/DATA606Fall2017/master/Data/Data%20from%20openintro.org/Ch%201%20Exercise%20Data/heartTr.csv")

df<-data.frame(read.csv(text=url,header=TRUE))
head(df)

##   id acceptyear age survived survtime prior transplant wait
## 1 15         68  53     dead        1    no    control   NA
## 2 43         70  43     dead        2    no    control   NA
## 3 61         71  52     dead        2    no    control   NA
## 4 75         72  52     dead        2    no    control   NA
## 5  6         68  54     dead        3    no    control   NA
## 6 42         70  36     dead        3    no    control   NA

mosaicplot(table(df$transplant,df$survived))

(a) Based on the mosaic plot, is survival independent of whether or not the patient got a transplant? Explain your reasoning.

Survival is dependent on whether or not a person got a transplant, because there appear to be more than twice as many surviving patients in the treatment group as in the control group.

boxplot(df$survtime ~ df$transplant)

(b) What do the box plots suggest about the efficacy (effectiveness) of the heart transplant treatment?

Answer:
The box plots suggest that the transplants were effective because the patients in the treatment group survived longer than those in the control group.

(c) What Proportion of patients in the treatment group and what proportion of the patients in the control group died?

Answer:
Control Group

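nRowsControl <- nrow(subset(df, transplant == "control"))
nRowsControlDied <- nrow(subset(df, transplant == "control" & survived == "dead"))
# Proportion of control-group patients who died
nRowsControlDied / nRowsControl

The proportion of patients who died in the control group is 0.8823529.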

Treatment Group

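nRowsTreatment <- nrow(subset(df, transplant == "treatment"))
nRowsTreatmentDied <- nrow(subset(df, transplant == "treatment" & survived == "dead"))
# Proportion of treatment-group patients who died
nRowsTreatmentDied / nRowsTreatment

The proportion of patients who died in the treatment group is 0.6521739.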
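Both proportions can also be read off a single two-way table. The sketch below is an alternative to the subset() approach, under one loud assumption: the counts (4 alive and 30 dead in the control group, 24 alive and 45 dead in the treatment group) are not printed anywhere above and are entered by hand here because they reproduce the reported proportions 30/34 and 45/69.

```r
# Assumed counts (not shown in the output above); they are consistent with
# the reported proportions 0.8823529 and 0.6521739.
outcomes <- matrix(c(4, 30,
                     24, 45),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(group = c("control", "treatment"),
                                   survived = c("alive", "dead")))

# margin = 1 divides each row by its row total, giving within-group proportions.
death_props <- prop.table(outcomes, margin = 1)[, "dead"]
round(death_props, 7)
```

With the loaded data, the same table could be built with table(df$transplant, df$survived) instead of typing the counts.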

Umais Siddiqui

September 4, 2017
