Homework #1

This is Homework #1 for the DATA 606 course. It solves the following exercises from the OpenIntro Statistics book using R: 1.8, 1.10, 1.28, 1.36, 1.48, 1.50, 1.56, and 1.70.

The openintro library is loaded, and the heartTr data is used in this R Markdown document.

Exercise 1.8

1.8 (a) What does each row of the data matrix represent?

Each row represents one UK resident who took part in the survey, with that person's gender, age, marital status, income, whether or not they smoke, and how much they smoke on weekends and weekdays.

1.8 (b) How many participants were included in the survey?

df <- read.csv(url('https://raw.githubusercontent.com/vepark/Datasets/mushroom/SmokingData.csv'))
nrow(df)
## [1] 1693

1.8 (c) Indicate whether each variable in the study is numerical or categorical. If numerical, identify as continuous or discrete. If categorical, indicate if the variable is ordinal.

#Tried a few things below to come to conclusion about the variable type
names(df)
##  [1] "Sex"                   "Age"                  
##  [3] "Marital.Status"        "Highest.Qualification"
##  [5] "Nationality"           "Ethnicity"            
##  [7] "Gross.Income"          "Region"               
##  [9] "Smoke."                "Amount.Weekends"      
## [11] "Amount.Weekdays"       "Type"                 
## [13] "X"                     "X.1"
typeof(df)
## [1] "list"
eapply(.GlobalEnv,typeof)
## $df
## [1] "list"
sapply(df, class)
##                   Sex                   Age        Marital.Status 
##              "factor"             "integer"              "factor" 
## Highest.Qualification           Nationality             Ethnicity 
##              "factor"              "factor"              "factor" 
##          Gross.Income                Region                Smoke. 
##              "factor"              "factor"              "factor" 
##       Amount.Weekends       Amount.Weekdays                  Type 
##              "factor"              "factor"              "factor" 
##                     X                   X.1 
##             "logical"             "logical"

Categorical variables: 1) Sex 2) Marital.Status 3) Highest.Qualification (Ordinal) 4) Nationality 5) Ethnicity 6) Gross.Income (Ordinal, since income is recorded in ranges) 7) Region 8) Smoke. 9) Type

Numerical variables: 1) Age (Discrete) 2) Amount.Weekends (Discrete) 3) Amount.Weekdays (Discrete)
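The class output above reports the two Amount columns as factors even though they hold counts. One plausible reason, which is an assumption here and not something checked against the file, is that some entries are non-numeric (for example "N/A" for non-smokers). A minimal sketch on an invented toy data frame shows how such a column can be inspected and converted:

```r
# Hypothetical miniature of the survey data; the values are invented
# purely for illustration and are not taken from SmokingData.csv.
toy <- data.frame(
  Age = c(38L, 42L, 40L),
  Amount.Weekends = c("N/A", "12", "20"),
  stringsAsFactors = TRUE  # mimic the factor columns seen in the output above
)

# One non-numeric entry is enough to make the whole column a factor
sapply(toy, class)

# Convert through character first: as.numeric() applied directly to a
# factor returns the internal level codes, not the printed values.
weekends <- suppressWarnings(as.numeric(as.character(toy$Amount.Weekends)))
weekends  # the "N/A" entry becomes NA, the rest become numbers
```

After the conversion the column behaves as a discrete numerical variable, which matches the classification given above.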

1.10 (a) Identify the population of interest and the sample in this study.

Population of interest = children between 5 and 15 years old. Sample = 160 children.

1.10 (b) Comment on whether or not the results of the study can be generalized to the population, and if the findings of the study can be used to establish causal relationships.

No, the results of this study can’t be generalized because of the small sample size. Association does not always imply causation.

1.28 Reading the paper:

  1. No, the study is observational, and all participants were volunteers, which introduces bias. An experimental study is needed to conclude that smoking causes dementia.

  2. No, the statement is not justified. Sleep disorders might have some negative effect on children, but they are not the only cause of bullying behavior in children.

1.36

  1. This is a designed experimental study.

  2. Treatment group = people who exercise twice a week. Control group = people who do not exercise.

  3. Yes, this study uses blocking. Age group is the blocking variable.

  4. No, this is not a blind study, since the subjects know which treatment they are receiving.

  5. Yes, this experiment can be used to establish a causal relationship, since subjects are randomly assigned to the treatment and control groups. Because the subjects are a random sample, the conclusions can also be generalized to the population.

  6. I would be reluctant to fund this project as it stands, since the control and treatment groups will be affected by several other factors, so the study may need a larger sample and many replications.
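The blocking described in item 3 can be sketched in R as random assignment carried out separately within each age group. The subject IDs, age-group labels, group sizes, and seed below are all hypothetical and chosen only for illustration:

```r
set.seed(606)  # arbitrary seed so the illustration is reproducible

# Hypothetical subjects: four people in each of three age groups (blocks)
subjects <- data.frame(
  id = 1:12,
  age_group = rep(c("18-30", "31-40", "41-55"), each = 4),
  stringsAsFactors = FALSE
)

# Within each block, shuffle an equal number of treatment/control labels
subjects$group <- NA_character_
for (block in unique(subjects$age_group)) {
  rows <- which(subjects$age_group == block)
  labels <- rep(c("treatment", "control"), length.out = length(rows))
  subjects$group[rows] <- sample(labels)
}

# Every age group ends up with the same split between the two arms
table(subjects$age_group, subjects$group)
```

Because the shuffle happens inside each block, age is balanced across the two arms by construction and cannot be confounded with the treatment.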

1.48 Stat scores:

statScores <- c(57,66,69,71,72,73,74,77,78,78,79,79,81,81,82,83,83,88,89,94)
boxplot(statScores)
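
As a numeric companion to the box plot, the following sketch computes the five-number summary and the points that the default whisker rule treats as outliers. It uses only base R and the same scores as above:

```r
statScores <- c(57, 66, 69, 71, 72, 73, 74, 77, 78, 78,
                79, 79, 81, 81, 82, 83, 83, 88, 89, 94)

# Minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(statScores)
five

# Points beyond 1.5 * (upper hinge - lower hinge) from the box,
# which is the rule boxplot() applies by default
outliers <- boxplot.stats(statScores)$out
outliers
```

The summary is 57, 72.5, 78.5, 82.5, 94. The lower fence is 72.5 - 1.5 * 10 = 57.5, so the score of 57 is drawn as an outlier.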

1.50 Mix and match:

  1. symmetric matches with boxplot 2
  2. bimodal matches with boxplot 3
  3. right skewed distribution matches with boxplot 1

1.56 Distributions and appropriate statistics:

  1. This is right skewed, since a meaningful number of houses cost more than $6 million. The median and IQR would be the appropriate statistics.

  2. This must be a symmetric distribution. The mean and standard deviation would describe the center and spread well.

  3. This would be right skewed, since most students don’t drink, with the exception of a few. The median and IQR would be good measures.

  4. This might be symmetric with slight right skew. Mean and median could be used.
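
To illustrate why the median and IQR are preferred for the right-skewed cases above, here is a small sketch with invented numbers. The drinks-per-week values are hypothetical and do not come from the exercise:

```r
# Hypothetical drinks per week for ten students: most report none,
# and a few report many (a right-skewed shape)
drinks <- c(0, 0, 0, 0, 0, 0, 1, 2, 4, 13)

center_mean   <- mean(drinks)    # pulled toward the long right tail
center_median <- median(drinks)  # stays at the typical student

spread_sd  <- sd(drinks)   # inflated by the few large values
spread_iqr <- IQR(drinks)  # based only on the middle half of the data

c(mean = center_mean, median = center_median,
  sd = spread_sd, IQR = spread_iqr)
```

Here the mean is 2 while the median is 0, so the mean describes almost none of the students. For a roughly symmetric distribution the two would be close, and the mean and standard deviation would be reasonable choices.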

1.70 Heart transplants:

hTr <- read.csv(url("https://raw.githubusercontent.com/vepark/Datasets/mushroom/heart_transplant.csv"))
mosaicplot(table(hTr$transplant,hTr$survived))

(a) No, the mosaic plot shows that a larger share of the treatment group survived than of the control group. Therefore, survival does not appear to be independent of the treatment.

boxplot(hTr$survtime ~ hTr$transplant)

(b) The boxplot shows that patients in the treatment group generally survived longer than those in the control group, which suggests that the treatment is effective.

# percent control group died
(nrow(subset(hTr,transplant=="control" & survived=="dead")) / nrow(subset(hTr,transplant=="control")) ) * 100
## [1] 88.23529
# percent treatment group died
(nrow(subset(hTr,transplant=="treatment" & survived=="dead")) / nrow(subset(hTr,transplant=="treatment")) ) * 100
## [1] 65.21739

(c) About 88% of the control group died, compared with about 65% of the treatment group, a difference of roughly 23 percentage points.
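
The same percentages can be read off a single table of row proportions. The sketch below is self-contained, so it rebuilds the two-way table from counts instead of downloading the file. The counts (4 alive and 30 dead in control, 24 alive and 45 dead in treatment) are an assumption taken from the textbook's heart transplant data; they reproduce the 30/34 and 45/69 percentages printed above.

```r
# Assumed counts, consistent with the percentages in the output above
counts <- matrix(
  c(4, 30,
    24, 45),
  nrow = 2, byrow = TRUE,
  dimnames = list(transplant = c("control", "treatment"),
                  survived = c("alive", "dead"))
)

# margin = 1 divides each row by its row total, so every row sums to 1
props <- prop.table(counts, margin = 1)

pct_dead <- round(100 * props[, "dead"], 2)
pct_dead

# Difference in percentage points between the two groups
diff_pct <- unname(pct_dead["control"] - pct_dead["treatment"])
diff_pct
```

With the data frame loaded as above, `prop.table(table(hTr$transplant, hTr$survived), margin = 1)` should give the same proportions in one call.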

(d)-i The experiment tests the hypothesis that a heart transplant helps patients survive longer than no transplant. (d)-ii (d)-iii