DS401 week2

Week 2 Quiz question 1

Consider the values below. Using the boxplot rule, is the value 200 an extreme outlier? Note: an extreme outlier is defined as a value that falls outside the boundaries Q1 - 3 IQR and Q3 + 3 IQR. By default, R uses Q1 - 1.5 IQR and Q3 + 1.5 IQR to identify outliers.

80, 121, 132, 145, 151, 119, 133, 134, 200, 195, 90, 121, 132, 123, 145, 151, 119, 133, 134, 151, 168

x = c(80, 121, 132, 145, 151, 119, 133, 134, 200, 195, 90, 121, 132, 123, 145, 151, 119, 133, 134, 151, 168)
boxplot.stats(x, coef = 3)

## $stats
## [1]  80 121 133 151 200
## 
## $n
## [1] 21
## 
## $conf
## [1] 122.6565 143.3435
## 
## $out
## numeric(0)

raw calculation

range(x)

## [1]  80 200

quantile(x, type=2)

##   0%  25%  50%  75% 100% 
##   80  121  133  151  200

q1 = 121
q3 = 151
iqr = 151-121
c(q1-3*iqr, q3+3*iqr)

## [1]  31 241
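
The same fences can be computed without hard-coding the quartiles. The following is a minimal sketch, assuming the type 2 quantile definition used in the raw calculation; the names `q`, `fences`, and `is_extreme` are illustrative and not part of the quiz.

```r
x <- c(80, 121, 132, 145, 151, 119, 133, 134, 200, 195, 90,
       121, 132, 123, 145, 151, 119, 133, 134, 151, 168)

# Quartiles with the same quantile definition (type = 2) as the raw calculation
q <- quantile(x, probs = c(0.25, 0.75), type = 2, names = FALSE)
iqr <- q[2] - q[1]

# Extreme-outlier fences: Q1 - 3*IQR and Q3 + 3*IQR
fences <- c(lower = q[1] - 3 * iqr, upper = q[2] + 3 * iqr)
fences

# TRUE only if 200 lies outside the fences
is_extreme <- 200 < fences[["lower"]] || 200 > fences[["upper"]]
is_extreme
```

Since 200 lies inside the interval from 31 to 241, it is not an extreme outlier under this rule, which agrees with the empty `$out` returned by `boxplot.stats(x, coef = 3)`.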

Question 2

If a number represents the geographic location of a business using the zip code, then the level of data represented by the number is probably

Nominal

Question 3

Here are some scores from a quiz. Find the mode and range for the given data.

41,37,27,41,18,48,41,50,54,37,25

x = c(41,37,27,41,18,48,41,50,54,37,25)
max(x)-min(x)

## [1] 36

table(x)

## x
## 18 25 27 37 41 48 50 54 
##  1  1  1  2  3  1  1  1
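
The frequency table can also be reduced to the mode programmatically. This is a small sketch; the names `counts` and `modes` are illustrative, and the filter keeps every tied value in case a data set has more than one mode.

```r
x <- c(41, 37, 27, 41, 18, 48, 41, 50, 54, 37, 25)

counts <- table(x)
# Keep every value whose count equals the highest count (handles ties)
modes <- as.numeric(names(counts)[counts == max(counts)])
modes

# Range as a single number: largest value minus smallest value
diff(range(x))
```

For these scores the most frequent value is 41 (three occurrences), and the range is 36.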

Question 4

Find the sample standard deviation for the given sample data. Round your answer to one more decimal place than is present in the original data.

19,18,26,9,15,23,10,5,15,17,8,22,12

x = c(19,18,26,9,15,23,10,5,15,17,8,22,12)
sd(x)

## [1] 6.329621
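
As a cross-check, the sample standard deviation can be computed from its definition (divisor n - 1) and then rounded to one decimal place, as the question asks. This is a sketch; `s_manual` is an illustrative name.

```r
x <- c(19, 18, 26, 9, 15, 23, 10, 5, 15, 17, 8, 22, 12)
n <- length(x)

# Sample standard deviation from the definition: divide by n - 1
s_manual <- sqrt(sum((x - mean(x))^2) / (n - 1))

# Should agree with the built-in sd()
all.equal(s_manual, sd(x))

# The data are whole numbers, so report one decimal place
round(sd(x), 1)
```

The rounded answer is 6.3.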

Question 5

Find the median for the given sample data. The normal monthly precipitation (in inches) for August is listed for 20 different U.S. cities.

3.8, 1.8, 2.4, 3.7, 4.1, 3.9, 1.2, 3.6, 4.2, 3.4, 3.7, 2.2, 1.6, 4.2, 3.5, 2.6, 0.4, 3.7, 2.0, 3.6

x = c(3.8, 1.8, 2.4, 3.7, 4.1, 3.9, 1.2, 3.6, 4.2, 3.4, 3.7, 2.2, 1.6, 4.2, 3.5, 2.6, 0.4, 3.7, 2.0, 3.6)
median(x)

## [1] 3.55

quantile(x, type=2)

##   0%  25%  50%  75% 100% 
## 0.40 2.10 3.55 3.75 4.20
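
With an even number of observations, the median is the mean of the two middle sorted values. The sketch below checks that by hand and also places the `type = 2` quartiles next to R's default (`type = 7`) for comparison. The names `med_manual`, `q_type2`, and `q_type7` are illustrative.

```r
x <- c(3.8, 1.8, 2.4, 3.7, 4.1, 3.9, 1.2, 3.6, 4.2, 3.4,
       3.7, 2.2, 1.6, 4.2, 3.5, 2.6, 0.4, 3.7, 2.0, 3.6)
n <- length(x)
xs <- sort(x)

# n is even here, so average the two middle order statistics
med_manual <- mean(xs[c(n / 2, n / 2 + 1)])
all.equal(med_manual, median(x))

# type = 2 averages at discontinuities; the default (type = 7) interpolates
q_type2 <- quantile(x, type = 2)
q_type7 <- quantile(x)
rbind(type2 = q_type2, type7 = q_type7)
```

Both definitions give the same median, but the quartiles can differ slightly, so it is worth stating which type was used when reporting Q1 and Q3.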

Question 6

Find the variance for the given sample data.

The normal monthly precipitation (in inches) for August is listed for 20 different U.S. cities. Find the variance of the data.

3.8, 1.8, 2.4, 3.7, 4.1, 3.9, 1.2, 3.6, 4.2, 3.4, 3.7, 2.2, 1.6, 4.2, 3.5, 2.6, 0.4, 3.7, 2.0, 3.6

x = c(3.8, 1.8, 2.4, 3.7, 4.1, 3.9, 1.2, 3.6, 4.2, 3.4, 3.7, 2.2, 1.6, 4.2, 3.5, 2.6, 0.4, 3.7, 2.0, 3.6)
var(x)

## [1] 1.246947

Question 7

Solve the problem. The heights of the adults in one town have a mean of 67.1 inches and a standard deviation of 3.5 inches. What can you conclude from Chebyshev’s theorem about the percentage of adults in the town whose heights are between 60.1 and 74.1 inches? (Hint-study the section in Business Statistics that deals with this.)

60.1 - 67.1 # = -7, i.e. 7 inches below the mean

## [1] -7

74.1 - 67.1 # =7

## [1] 7

k = 7/3.5
1-1/k^2

## [1] 0.75
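
The calculation can be wrapped in a small helper. This is a sketch under the assumption that the interval is symmetric about the mean, which Chebyshev's theorem in this form requires; `chebyshev_bound` is a hypothetical name, not a base R function.

```r
# Lower bound on the proportion of values within k standard deviations
# of the mean, for an interval symmetric about the mean (k > 1).
chebyshev_bound <- function(lower, upper, mu, sigma) {
  # Distance from the mean to each endpoint, in standard deviations
  k_lower <- (mu - lower) / sigma
  k_upper <- (upper - mu) / sigma
  # The interval must be symmetric and wider than one standard deviation
  stopifnot(isTRUE(all.equal(k_lower, k_upper)), k_lower > 1)
  1 - 1 / k_lower^2
}

chebyshev_bound(60.1, 74.1, mu = 67.1, sigma = 3.5)
```

Here the endpoints are k = 2 standard deviations from the mean, so at least 75% of the adults have heights between 60.1 and 74.1 inches.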

Question 8

Solve the problem.

The data below consists of the heights (in inches) of 20 randomly selected women. Find the 20% trimmed mean of the data set. Round to two decimal places. (Do not use Excel. Use R.)

69, 68, 64, 61, 65, 64, 71, 67, 62, 63, 61, 64, 75, 67, 60, 59, 64, 69, 65, 72

x = c(69, 68, 64, 61, 65, 64, 71, 67, 62, 63, 61, 64, 75, 67, 60, 59, 64, 69, 65, 72)
mean(x, trim=.2)

## [1] 65.16667
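
To see what `trim = .2` does, the trimmed mean can be reproduced by hand: sort the data, drop 20% of the observations from each end, and average the rest. This is a sketch; `g`, `xs`, and `trimmed` are illustrative names.

```r
x <- c(69, 68, 64, 61, 65, 64, 71, 67, 62, 63,
       61, 64, 75, 67, 60, 59, 64, 69, 65, 72)
n <- length(x)

# Number of observations to drop from each end (4 of 20 here)
g <- floor(n * 0.2)
xs <- sort(x)
trimmed <- xs[(g + 1):(n - g)]

# Mean of the 12 remaining values; should match mean(x, trim = 0.2)
mean(trimmed)

# Rounded to two decimal places, as the question asks
round(mean(x, trim = 0.2), 2)
```

The rounded answer is 65.17.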

Question 9

Find the standard deviation of the sample data summarized in the given frequency distribution.

The test scores of 40 students are summarized in the frequency distribution below. Find the standard deviation. Use the procedure in Business Statistics Section 3.3.

Score    Students
50-59    5
60-69    7
70-79    9
80-89    10
90-99    9

Sample question

Initial data setup

freq = c(5, 12, 14, 15, 8, 4)
mid = seq(3, 13, 2)
fx = freq*mid
fx2 = freq*(mid^2)
tbl = data.frame(freq, mid, fx, fx2)
tbl

##   freq mid  fx  fx2
## 1    5   3  15   45
## 2   12   5  60  300
## 3   14   7  98  686
## 4   15   9 135 1215
## 5    8  11  88  968
## 6    4  13  52  676

sample mean

avg = sum(fx)/sum(freq)
avg

## [1] 7.724138

Question 32, Sample variance

v = sum((mid-avg)^2*freq)/(sum(freq)-1)
v

## [1] 7.5366

Question 33, population standard deviation

v2 = sum((mid-avg)^2*freq)/(sum(freq))
v2

## [1] 7.406659

s2 = v2^.5
s2

## [1] 2.721518

Real question

freq = c(5, 7, 9, 10, 9)
freq

## [1]  5  7  9 10  9

mid = seq((59-50)/2+50, (99-90)/2+90, 10)
mid

## [1] 54.5 64.5 74.5 84.5 94.5

fx = freq*mid
fx2 = freq*(mid^2)
tbl = data.frame(freq, mid, fx, fx2)
tbl

##   freq  mid    fx      fx2
## 1    5 54.5 272.5 14851.25
## 2    7 64.5 451.5 29121.75
## 3    9 74.5 670.5 49952.25
## 4   10 84.5 845.0 71402.50
## 5    9 94.5 850.5 80372.25

# mean
avg = sum(fx)/sum(freq)
avg

## [1] 77.25

# Sample var and sd
v = sum((mid-avg)^2*freq)/(sum(freq)-1)
v

## [1] 179.4231

s = v^.5
s

## [1] 13.39489
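
The `fx2` column is computed in the table but not used in the calculation. A common computational form of the grouped sample variance uses it directly: (sum(f*x^2) - (sum(f*x))^2 / n) / (n - 1). The sketch below applies that form and cross-checks it by expanding each midpoint by its frequency; whether this matches the textbook's exact procedure is an assumption, and the names are illustrative.

```r
freq <- c(5, 7, 9, 10, 9)
mid <- c(54.5, 64.5, 74.5, 84.5, 94.5)  # class midpoints
n <- sum(freq)

sum_fx <- sum(freq * mid)
sum_fx2 <- sum(freq * mid^2)

# Shortcut (computational) form of the grouped sample variance
v_shortcut <- (sum_fx2 - sum_fx^2 / n) / (n - 1)
s_shortcut <- sqrt(v_shortcut)

# Cross-check: treat every score as equal to its class midpoint
s_expanded <- sd(rep(mid, times = freq))

c(shortcut = s_shortcut, expanded = s_expanded)
```

Both values should agree with the 13.39489 obtained from the deviation form.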

Question 10

Find the range, variance, and standard deviation for each of the two samples, then compare the two sets of results.

When investigating times required for drive-through service, the following results (in seconds) were obtained.

Restaurant A: 120 67 89 97 124 68 72 96
Restaurant B: 115 126 49 56 98 76 78 95

A = c(120, 67, 89, 97, 124, 68, 72, 96)
B = c(115, 126, 49, 56, 98, 76, 78, 95)

stat = function(v) {
  return(c(max(v)-min(v), var(v), sd(v)))
}
stat(A)

## [1]  57.00000 493.98214  22.22571

stat(B)

## [1]  77.00000 727.98214  26.98114
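
The question also asks for a comparison. One way to line the results up is a small labelled table; this is a sketch, and `summarise_times` and `results` are illustrative names rather than part of the original solution.

```r
A <- c(120, 67, 89, 97, 124, 68, 72, 96)
B <- c(115, 126, 49, 56, 98, 76, 78, 95)

# Same three statistics as stat(), but with names attached
summarise_times <- function(v) {
  c(range = max(v) - min(v), variance = var(v), sd = sd(v))
}

results <- rbind(A = summarise_times(A), B = summarise_times(B))
round(results, 2)

# Is Restaurant B larger than Restaurant A on each measure?
results["B", ] > results["A", ]
```

Restaurant B has the larger range, variance, and standard deviation, so its drive-through times vary more than those of Restaurant A.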
