The Blob

Introduction

The activity uses a gridded scale drawing of The Blob. Students generate random numbers, determine lengths on a grid, multiply by a scaling factor, and find area estimates for squares. These area estimates are the data. Students summarize these data numerically and graphically with tools such as a histogram, boxplot, five-number summary (minimum; first quartile, i.e., the 25th percentile; median; third quartile, i.e., the 75th percentile; maximum), mean, and standard deviation. Each student creates her own confidence interval estimate of the area of a cross-section of The Blob. The students' confidence interval estimates are then compared, which leads to a discussion of sampling variability and the confidence level.

Summary

We may not be able to stop The Blob (pretty much nobody can), but we can figure out how big it is. Attached on a separate sheet of paper is a Gridded Scale Drawing of The Blob. Each grid square on the drawing is equivalent to 2.2 yards. Our goal is to estimate the size of The Blob. Notice from the Gridded Scale Drawing that The Blob has a width and height that are roughly equal.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8538   11892   15198   14606   17213   20449

The distribution of area estimates of the blob is bimodal, with a center of 15198 square yards (median). The area estimates range from 8538 to 20449 square yards.

Hypothesis

The one-sample t-test is run below.

Ho: mu = 15000. The area of the Blob is equal to 15000.

Ha: mu ≠ 15000. The area of the Blob is not equal to 15000.

## 
##  One Sample t-test
## 
## data:  blob
## t = 13.012, df = 11, p-value = 5.038e-08
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  12135.14 17076.07
## sample estimates:
## mean of x 
##  14605.61

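I am 95% confident that the TRUE area of the blob is between 12135.14 and 17076.07 square yards. The estimate for the mean is 14605.61 square yards with a margin of error of about + or - 2,470.5 square yards. Because 15000 lies inside this interval, the data do not give evidence against Ho at the 5% level.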
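The printed output reports "true mean is not equal to 0", which suggests the test was run against R's default null value of 0 rather than 15000. The sketch below recovers a test of Ho: mu = 15000 from the printed summary numbers alone. It assumes the printed mean, interval, and degrees of freedom are precise enough to back out the standard error; the variable names are illustrative and not from the original lab.

```r
# Values copied from the t-test output above
xbar <- 14605.61                 # sample mean
ci   <- c(12135.14, 17076.07)    # 95 percent confidence interval
df   <- 11                       # degrees of freedom (n - 1)
mu0  <- 15000                    # null value from Ho

margin  <- diff(ci) / 2                 # half-width of the interval
se      <- margin / qt(0.975, df)       # standard error implied by the interval
t_stat  <- (xbar - mu0) / se            # t statistic against mu0 = 15000
p_value <- 2 * pt(-abs(t_stat), df)     # two-sided p-value

c(margin = margin, se = se, t = t_stat, p = p_value)
```

With the raw data available, `t.test(blob, mu = 15000)` would run the same test directly. Under the assumptions above the t statistic is small in magnitude and the p-value is well above 0.05, which agrees with 15000 falling inside the interval.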

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     818   11389   16565   15095   19825   21727

The distribution of ALL area estimates of the blob is unimodal and skewed left. The measures of center are 16565 square yards (median) and 15095 square yards (mean). The estimates range from 818 square yards to 21727 square yards. According to the Central Limit Theorem, even if a population's distribution is skewed, the distribution of sample means from that population will be unimodal and relatively symmetric. Therefore, we will draw many samples and look at their means to get a better estimate of the TRUE mean.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1239    9191   15198   13480   18463   21083

One sample of 12 area estimates produces a distribution that is bimodal. The measures of center are 15198 square yards (median) and 13480 square yards (mean). The area estimates in this sample range from 1239 to 21083 square yards. Because the sample is drawn at random, these values change every time the code is run.

Drawing Many Samples

To see the Central Limit Theorem at work, we need to draw many samples. The command below will draw 100 samples of 12 and store their means in a new vector called sample_means12.
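The command itself did not survive in this copy. A minimal sketch of what it could look like follows. Only the name `sample_means12` comes from the text; the vector `all_areas` and its simulated values are placeholders standing in for the class data.

```r
set.seed(42)  # makes the random draws reproducible

# Placeholder population: NOT the class data, just stand-in values
# spanning the range reported in the summary of all area estimates.
all_areas <- runif(200, min = 818, max = 21727)

# Draw 100 samples of size 12 and keep the mean of each sample
sample_means12 <- replicate(100, mean(sample(all_areas, size = 12)))

summary(sample_means12)
```

`replicate()` is used here for brevity; an explicit `for` loop that fills a vector would give the same kind of result.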

Simulations and the Central Limit Theorem: Part 3

The purpose of today's lab (Part 3, R Markdown) is to look at the Central Limit Theorem from a computational simulation perspective. In lecture we saw the theoretical result; simulation provides a powerful way to investigate how well the theory works in practice.

Simulation: NHANES lipid data

Applications of CLT

As part of the NHANES study, the triglyceride levels of 3,026 adult women were measured. Triglycerides, the main constituent of both vegetable oil and animal fat, have been linked to atherosclerosis, heart disease, and stroke. Let's consider this whole group of women the population for the purpose of the simulation. We are going to study the distribution of triglycerides in our population and in a sample:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    19.0    68.0    98.0   116.9   147.0   399.0

Population distribution

The distribution of triglyceride levels for 3026 women is unimodal and skewed right. The center of the distribution is 98 mg/dL (median). The middle 50% of women have levels between 68 and 147 mg/dL, so about 25% of the population has levels of 147 mg/dL or higher.

Taking One Sample from the Population

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    33.0    70.0    92.0   108.7   114.0   314.0

Sample Distribution

The single sample of women drawn from the population is unimodal and skewed right. The center is at 92 mg/dL (median) and the range is from 33 to 314 mg/dL. According to the outlier rule, there is at least one high outlier. In addition, there is a gap in the data, making 4 measures in the sample seem extreme.
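The outlier rule is not spelled out in the text. The sketch below assumes it is the usual 1.5 x IQR fence rule and applies it to the five-number summary printed just above this paragraph; the variable names are illustrative.

```r
# Five-number summary values copied from the sample output above
sample_min <- 33
q1         <- 70
q3         <- 114
sample_max <- 314

iqr         <- q3 - q1          # interquartile range
upper_fence <- q3 + 1.5 * iqr   # values above this count as high outliers
lower_fence <- q1 - 1.5 * iqr   # values below this count as low outliers

has_high_outlier <- sample_max > upper_fence
has_low_outlier  <- sample_min < lower_fence

c(upper_fence = upper_fence, lower_fence = lower_fence)
```

Under that assumption the maximum lies above the upper fence while the minimum stays above the lower fence, so the sample has at least one high outlier and no low outliers.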

## [1] 116.9451

## [1] 108.68

## [1] 67.94322

The mean of the sample (108.68 mg/dL) is lower than that of the population (116.95 mg/dL).

It is worth noting that (a) the distribution of triglycerides in the population is clearly right-skewed, (b) the sample looks representative of the population, as it should because it was drawn at random, and (c) the sample mean and the population mean are close, but the sample mean is clearly off a bit as an estimate of the population mean.

This is just one sample; the means of other random samples might be much further from or closer to the population mean. To see that distribution, we'll have to repeat the sampling process many times and obtain the sample means.

Using a Loop

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   90.12  105.77  113.56  115.12  123.64  147.40
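The loop that produced this summary is missing from this copy. A minimal sketch of such a loop follows. The NHANES values are not reproduced here, so `population` is a simulated right-skewed placeholder, and all object names are illustrative rather than the original ones. The sample size of 25 and the 100 repetitions follow the description in the next section.

```r
set.seed(2024)  # makes the random draws reproducible

# Placeholder population: right-skewed stand-in values, NOT the NHANES data
population <- rgamma(3026, shape = 3, rate = 3 / 117)

n_samples <- 100   # number of samples to draw
n         <- 25    # women per sample

sample_means <- numeric(n_samples)   # pre-allocate the result vector
for (i in seq_len(n_samples)) {
  sample_means[i] <- mean(sample(population, size = n))
}

summary(sample_means)
```

Because the placeholder data differ from the real measurements, the printed numbers will not match the summary above; the point is only the structure of the loop.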

Describing the Distribution of 100 Sample Means

The distribution of mean triglyceride levels created from 100 samples of 25 randomly selected women in the NHANES study is approximately Normal with a mean of about 115 mg/dL.

Applying the Normal Model to Sampling Distribution of Sample Means

The distribution of 100 sample means is unimodal and approximately symmetric without any outliers. Therefore, we can model it as Normal, centered at the population mean of 117 mg/dL, with a standard error equal to the population standard deviation (68 mg/dL) divided by the square root of the sample size (n = 25). The results of this computation are shown below.

## [1] 11.74973

## [1] 13.58864

## [1] 115.1168
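The second printed value (13.58864) is the theoretical standard error. It can be reproduced from the population standard deviation printed earlier, as the sketch below shows. The first and third values appear to be the standard deviation and mean of the 100 simulated sample means; that reading is an assumption, since the code is not shown. Variable names are illustrative.

```r
pop_mean <- 116.9451   # population mean, from the earlier output
pop_sd   <- 67.94322   # population standard deviation, from the earlier output
n        <- 25         # sample size

# Standard error of the sample mean: sigma / sqrt(n)
se <- pop_sd / sqrt(n)
se
```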

Our model

The sampling distribution of the sample means of triglyceride levels is approximately Normal with a mean of about 117 mg/dL and a standard error of 13.6 mg/dL.

Modeling the Distribution with Z-scores

Defining the z-score formula to suit the sampling distribution of the means from above will give us the following code:
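The code is missing from this copy. One way to write it, as a sketch, is a small helper that standardizes a sample mean against the sampling distribution. The function name `z_score` and its argument names are hypothetical.

```r
pop_mean <- 116.9451             # population mean, from the earlier output
se       <- 67.94322 / sqrt(25)  # standard error for samples of 25

# z-score of a sample mean relative to the sampling distribution of means
z_score <- function(xbar, mu = pop_mean, sigma = se) {
  (xbar - mu) / sigma
}

z_score(100)   # how many standard errors 100 mg/dL lies from the mean
```

A sample mean of 100 mg/dL sits roughly 1.25 standard errors below the population mean.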

Using the relative distribution of means to calculate probabilities based on z-scores

So, though not a perfect Normal model, the approximation seems pretty good. Given this, we can make some distributional predictions about sample means of triglyceride levels for samples of 25 women. Remember, this is not a prediction about an individual woman's triglyceride level and its relation to the population mean. Instead, it is the probability of observing a given mean in a sample of 25 and how that mean relates to the center of the sampling distribution. Note: an individual value is more likely to deviate from the population mean than a sample mean is to deviate from the mean of the sampling distribution. We can use this information for inference testing.

Probability that a sample mean is less than 100 mg/dL if true mean is 117

## [1] 0.1061973

There is a 10.6% chance that a sample mean will be less than 100 if the true mean is 117.
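The printed probability is reproduced by a lower-tail Normal calculation that uses the unrounded population mean (116.9451) rather than 117. A sketch, with illustrative variable names:

```r
mu <- 116.9451             # population mean, from the earlier output
se <- 67.94322 / sqrt(25)  # standard error for samples of 25

# P(sample mean < 100) under the Normal model for sample means
p_below_100 <- pnorm(100, mean = mu, sd = se)
p_below_100
```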

The sample mean triglyceride level representing the 90th percentile - top 10%

## [1] 134.3597

A sample mean triglyceride level of 134.3597 mg/dL represents the cut-off for the top 10%.
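The cut-off is the 0.90 quantile of the same Normal model, which `qnorm()` returns directly. A sketch under the same assumptions (population mean and standard deviation taken from the earlier output):

```r
mu <- 116.9451             # population mean, from the earlier output
se <- 67.94322 / sqrt(25)  # standard error for samples of 25

# Sample mean that separates the top 10% from the bottom 90%
cutoff_90 <- qnorm(0.90, mean = mu, sd = se)
cutoff_90
```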

The middle 50% of sample means of triglyceride levels vs the middle 50% of the population’s triglyceride levels.

## [1] 18.3308

## [1] 91.65401

The middle 50% of sample means of triglyceride levels span only 18.33 mg/dL, while the population's middle 50% (by individual) spans 91.65 mg/dL. This is consistent with the Central Limit Theorem: sample means become more Normal and less variable as sample size increases.
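Both printed widths match the distance between the 25th and 75th percentiles of a Normal model, once with the standard error and once with the population standard deviation. The sketch below reproduces them; the helper `middle50_width` is a hypothetical name. Note that the population figure applies a Normal model to a right-skewed population, so it differs from the empirical spread in the population summary (147 - 68 = 79 mg/dL).

```r
pop_sd <- 67.94322         # population standard deviation, from the earlier output
se     <- pop_sd / sqrt(25)  # standard error for samples of 25

# Width of the middle 50% of a Normal distribution with the given sd
middle50_width <- function(sd) {
  qnorm(0.75, mean = 0, sd = sd) - qnorm(0.25, mean = 0, sd = sd)
}

width_means       <- middle50_width(se)      # sample means
width_individuals <- middle50_width(pop_sd)  # individuals, Normal model

c(means = width_means, individuals = width_individuals)
```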

Would it be unusual for a sample mean to be greater than 140 mg/dL? Would it be unusual for an individual to have a triglyceride level greater than 140 mg/dL?

## [1] 0.04488361

## [1] 0.3671823

It would be unusual to see a sample mean greater than 140 mg/dL. We would expect a sample mean this large only about 4.5% of the time. However, seeing an individual above 140 mg/dL is much more likely. We would see this result 36.7% of the time.
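The two printed probabilities are upper-tail areas at 140 mg/dL, one using the standard error and one using the population standard deviation. A sketch with illustrative names; as before, the individual-level figure assumes a Normal model for a population that is actually right-skewed.

```r
mu     <- 116.9451          # population mean, from the earlier output
pop_sd <- 67.94322          # population standard deviation
se     <- pop_sd / sqrt(25) # standard error for samples of 25

# P(sample mean > 140) versus P(individual > 140)
p_mean_above       <- pnorm(140, mean = mu, sd = se,     lower.tail = FALSE)
p_individual_above <- pnorm(140, mean = mu, sd = pop_sd, lower.tail = FALSE)

c(sample_mean = p_mean_above, individual = p_individual_above)
```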

Application - A new medication is undergoing experimental trials. The experimental group taking the medication has an average triglyceride level of 97 mg/dL. Is this evidence that the experimental medication is effective at lowering triglyceride levels in women?

The Hypotheses:

Null Hypothesis: mean = 117. The average triglyceride level is 117 mg/dL

Alternative Hypothesis: mean < 117. The average triglyceride level is less than 117 mg/dL

## [1] 0.07108196

Conclusion:

Though the difference is considerable, we expect to see a sample mean of 97 mg/dL or lower 7.1% of the time if the null hypothesis is true. This is higher than the standard significance level of 5%, so we RETAIN THE NULL. There is not enough evidence that the experimental drug lowers triglyceride levels in women.
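The printed p-value (0.07108196) matches a one-sided, lower-tail Normal probability at 97 mg/dL. The sketch below assumes the experimental group contains 25 women; the text does not state the group size, but the standard error for n = 25 is what reproduces the printed value. Variable names are illustrative.

```r
mu    <- 116.9451             # null value: population mean
se    <- 67.94322 / sqrt(25)  # standard error, ASSUMING a group of 25
xbar  <- 97                   # observed mean of the experimental group
alpha <- 0.05                 # significance level

# One-sided p-value for Ha: mean < 117
p_value <- pnorm(xbar, mean = mu, sd = se)
reject_null <- p_value < alpha

c(p_value = p_value, reject_null = reject_null)
```

Since the p-value exceeds 0.05, `reject_null` is `FALSE`, in line with the conclusion above.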