library(tidyverse) # load the libraries needed for this assignment
library(readxl)
library(plyr)
library(dplyr)
library(DBI)
library(dbplyr)
library(data.table)
library(rstudioapi)
library(RJDBC)
library(odbc)
library(RSQLite)
library(readr)
library(RCurl)
library(stringr)
Below are the final exam scores of twenty introductory statistics students.
57, 66, 69, 71, 72, 73, 74, 77, 78, 78, 79, 79, 81, 81, 82, 83, 83, 88, 89, 94
Create a box plot of the distribution of these scores. The five number summary provided below may be useful.
# Let's create a dataframe for the given data
stat_data <- data.frame("stats_scores" = c(57, 66, 69, 71, 72, 73, 74, 77, 78, 78, 79, 79, 81, 81, 82, 83, 83, 88, 89, 94))
summary(stat_data)
## stats_scores
## Min. :57.00
## 1st Qu.:72.75
## Median :78.50
## Mean :77.70
## 3rd Qu.:82.25
## Max. :94.00
Box plot of the distribution of the statistics scores:
boxplot(stat_data$stats_scores)
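As a rough check on what the box plot draws, the quartiles, the IQR, and the usual 1.5 x IQR fences can be computed by hand. This is a minimal sketch, assuming the default `quantile()` method (the one `summary()` uses); `boxplot()` itself works from Tukey's hinges, so its box edges can differ slightly. The variable names are illustrative only.

```r
# Final exam scores from the exercise
scores <- c(57, 66, 69, 71, 72, 73, 74, 77, 78, 78,
            79, 79, 81, 81, 82, 83, 83, 88, 89, 94)

# First and third quartiles (same method as summary())
q1 <- quantile(scores, 0.25, names = FALSE)
q3 <- quantile(scores, 0.75, names = FALSE)
iqr <- q3 - q1

# Conventional outlier fences: 1.5 * IQR beyond each quartile
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Scores falling outside the fences would be drawn as individual points
outliers <- scores[scores < lower_fence | scores > upper_fence]
outliers
```

Under these assumptions the lowest score, 57, falls below the lower fence, which is why it appears as a separate point on the plot.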
Describe the distribution in the histograms below and match them to the box plots. A distribution with a single mode or peak is said to be unimodal. A distribution with more than one mode is said to be bimodal, trimodal, etc., or in general, multimodal.
Box plot (1) matches histogram (c), the right-skewed (positively skewed) distribution, in which the mean is higher than the median.
Box plot (2) matches histogram (a), the symmetric unimodal distribution.
Box plot (3) matches histogram (b), the multimodal distribution.
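The matching above can be illustrated with simulated data. The sketch below does not use the exercise's data: the three samples, their parameters, and the seed are all assumptions chosen only to produce a symmetric, a right-skewed, and a multimodal shape.

```r
set.seed(42)  # assumed seed, for reproducibility only

# Three simulated samples with different shapes (parameters are arbitrary)
symmetric    <- rnorm(1000, mean = 60, sd = 3)
right_skewed <- rexp(1000, rate = 1 / 20)
multimodal   <- c(rnorm(500, mean = 5, sd = 1), rnorm(500, mean = 15, sd = 1))

shapes <- list(symmetric = symmetric,
               right_skewed = right_skewed,
               multimodal = multimodal)

# Histograms on the top row, matching box plots on the bottom row
par(mfrow = c(2, 3))
for (name in names(shapes)) hist(shapes[[name]], main = name, xlab = "")
for (name in names(shapes)) boxplot(shapes[[name]], horizontal = TRUE, main = name)

# Mean minus median: clearly positive for the right-skewed sample
sapply(shapes, function(x) mean(x) - median(x))
```

In the right-skewed sample the long upper whisker and the positive gap between mean and median go together; a box plot cannot show the two peaks of the multimodal sample, which is why the histogram is needed to identify it.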
For each of the following, state whether you expect the distribution to be symmetric, right skewed, or left skewed. Also specify whether the mean or median would best represent a typical observation in the data, and whether the variability of observations would be best represented using the standard deviation or IQR. Explain your reasoning.
(a) Housing prices in a country where 25% of the houses cost below $350,000, 50% of the houses cost below $450,000, 75% of the houses cost below $1,000,000 and there are a meaningful number of houses that cost more than $6,000,000.
(b) Housing prices in a country where 25% of the houses cost below $300,000, 50% of the houses cost below $600,000, 75% of the houses cost below $900,000 and very few houses that cost more than $1,200,000.
(c) Number of alcoholic drinks consumed by college students in a given week. Assume that most of these students don’t drink since they are under 21 years old, and only a few drink excessively.
(d) Annual salaries of the employees at a Fortune 500 company where only a few high level executives earn much higher salaries than all the other employees.
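The choice between mean and median, and between standard deviation and IQR, comes down to how sensitive each statistic is to extreme values. The sketch below uses made-up salaries (in thousands of dollars); the numbers and the helper `spread_summary()` are hypothetical and are not taken from the exercise.

```r
# Hypothetical salaries in thousands of dollars (illustrative values only)
typical   <- c(48, 52, 55, 58, 60, 61, 63, 65, 70, 72)
with_exec <- c(typical, 2500)  # add one very highly paid executive

# Hypothetical helper: centre and spread measured two ways each
spread_summary <- function(x) {
  c(mean = mean(x), median = median(x), sd = sd(x), iqr = IQR(x))
}

# Compare the two samples side by side
round(rbind(typical = spread_summary(typical),
            with_exec = spread_summary(with_exec)), 1)
```

A single extreme salary moves the mean and the standard deviation a great deal while the median and IQR barely change, which is the usual argument for preferring median and IQR when a distribution is strongly skewed.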
The Stanford University Heart Transplant Study was conducted to determine whether an experimental heart transplant program increased lifespan. Each patient entering the program was designated an official heart transplant candidate, meaning that he was gravely ill and would most likely benefit from a new heart. Some patients got a transplant and some did not. The variable transplant indicates which group the patients were in; patients in the treatment group got a transplant and those in the control group did not. Of the 34 patients in the control group, 30 died. Of the 69 people in the treatment group, 45 died. Another variable called survived was used to indicate whether or not the patient was alive at the end of the study.
# (a) Based on the mosaic plot, survival is not independent of whether the patient received a transplant: a larger share of the treatment group was alive at the end of the study, and a larger share of the control group died.
# (b) The box plots show that the median survival time is lower in the control group than in the treatment group, which suggests that patients who got a transplant tended to live longer.
# (c) In the treatment group, 45 of 69 patients died (45/69 ≈ 0.652); in the control group, 30 of 34 died (30/34 ≈ 0.882).
# (d) i. The claims being tested are:
#        H0: survival does not depend on whether the patient got a transplant
#        HA: survival depends on whether the patient got a transplant
# ii. 24 patients were alive in the treatment group (69 - 45 = 24) and 4 in the control group (34 - 30 = 4), so the blanks are: 28 alive (24 + 4), 75 dead (45 + 30), 69 in the treatment group, and 34 in the control group.
# Randomization distributions are always centered around the null hypothesized value
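The randomization procedure described in (d) can be sketched in base R. This is a minimal illustration built only from the counts stated in the problem; the seed, the number of shuffles (`n_sims`), and the helper `diff_prop()` are assumptions introduced here, not part of the original study or exercise.

```r
set.seed(2024)  # assumed seed, for reproducibility only

# Rebuild the data from the stated counts:
# treatment: 45 dead, 24 alive; control: 30 dead, 4 alive
group   <- rep(c("treatment", "control"), times = c(69, 34))
outcome <- c(rep(c("dead", "alive"), times = c(45, 24)),
             rep(c("dead", "alive"), times = c(30, 4)))

# Hypothetical helper: proportion dead in treatment minus proportion dead in control
diff_prop <- function(outcome, group) {
  mean(outcome[group == "treatment"] == "dead") -
    mean(outcome[group == "control"] == "dead")
}

observed_diff <- diff_prop(outcome, group)

# Shuffle the outcomes across the fixed group labels, as if survival
# were independent of the transplant, and recompute the difference
n_sims <- 1000  # assumed number of shuffles
sim_diffs <- replicate(n_sims, diff_prop(sample(outcome), group))

# Fraction of shuffles at least as extreme as the observed difference
p_value <- mean(sim_diffs <= observed_diff)

hist(sim_diffs, main = "Randomization distribution",
     xlab = "Difference in proportion dead (treatment - control)")
abline(v = observed_diff, lty = 2)  # dashed line marks the observed difference
```

The histogram should be centered near 0, the value implied by the independence claim, and `p_value` estimates how often chance alone would produce a difference as low as the observed one.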