Completed 14/15

1.

This is how the final grade will be calculated:

  • 4 assignments, each worth 3.75% of the final grade

  • 16 lab/minilab reports, collectively worth 10% of the final grade

  • 2 projects: Project 1 worth 3.3% and Project 2 worth 6.6% of the final grade, collectively worth 10%

  • In-class quizzes, worth 10% of the final grade

  • Chapter quizzes, worth 5% of the final grade

  • Mid-term exams, worth 20% of the final grade

  • Final exam, worth 30% of the final grade

  • The grading scale is as follows:

    • A = 90%–100%
    • B = 80%–89.9%
    • C = 60%–79.9%
    • D = 50%–59.9%
    • F = below 50%
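The weights sum to 100% (4 × 3.75 + 10 + 10 + 10 + 5 + 20 + 30 = 100), so the final grade is a weighted average of the component scores. A minimal sketch in R of how the weighting works; the component averages below are hypothetical, invented only for illustration:

scores  <- c(assignments = 92, labs = 88, projects = 85, inclass = 90,
             chapter = 75, midterms = 82, final = 80)      # hypothetical averages (0-100)
weights <- c(assignments = 0.15, labs = 0.10, projects = 0.10, inclass = 0.10,
             chapter = 0.05, midterms = 0.20, final = 0.30) # weights from the list above
sum(scores * weights)  # weighted final grade: 84.25, a B on the scale above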

2.

ddt <- read.csv("DDT-1.csv")
m=with(ddt, as.numeric(factor(MILE))) #A
length(unique(m)) #B
## [1] 17

(a)

m = with(ddt, as.numeric(factor(MILE)))
coplot(LENGTH~WEIGHT|RIVER*SPECIES, data=ddt, col=m, rows=1)

# pch = ifelse(m>=1 & m<=5, 1, ifelse(m>5 & m<=10, 2, 3))  # attempted symbol mapping, left unused

(b)

The lower left three conditional plots show Length vs. Weight for the three different species of fish found in the Tennessee River, with each data point color-coded according to the subsetting vector m.

(c)

Line A first converts the values under the name MILE into a factor, whose levels are the distinct mile values arranged in ascending order. The as.numeric() command then replaces each input value with the integer code of its level, and with() evaluates this inside ddt; the resulting numeric vector is stored as m.

(d)

Line B removes any duplicate values from m using the unique() command. Then, using the length() command, it outputs the number of elements in the resulting vector, i.e., the number of distinct mile values (17).
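As a toy illustration of lines A and B, consider a hypothetical vector x of mile values (x and its values are made up here):

x <- c(5, 1, 5, 3)                     # hypothetical mile values
as.numeric(factor(x))                  # 3 1 3 2: the levels are 1, 3, 5 in ascending order
length(unique(as.numeric(factor(x))))  # 3 distinct mile values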

(e)

The top six plots are empty because no largemouth bass (LMBASS) or smallmouth buffalo (SMBUFFALO) were caught in the FCM, LCM, or SCM rivers.

with(ddt, ddt[(SPECIES=="LMBASS" | SPECIES=="SMBUFFALO") & (RIVER=="FCM" | RIVER=="LCM" | RIVER=="SCM"),])
## [1] RIVER   MILE    SPECIES LENGTH  WEIGHT  DDT    
## <0 rows> (or 0-length row.names)

(f)

catfish <- ddt[ddt$SPECIES=="CCATFISH" & ddt$RIVER=="FCM",]  # renamed to avoid masking base cat()
catddt <- catfish$DDT
mean(catddt)
## [1] 45

The mean DDT value found in catfish in the FCM river was 45.

3.

National Bridge Inventory. All highway bridges in the United States are inspected periodically for structural deficiency by the Federal Highway Administration (FHWA). Data from the FHWA inspections are compiled into the National Bridge Inventory (NBI). Several of the nearly 100 variables maintained by the NBI are listed below. Classify each variable as quantitative or qualitative.

  a. Length of maximum span (feet)
  b. Number of vehicle lanes
  c. Toll bridge (yes or no)
  d. Average daily traffic
  e. Condition of deck (good, fair, or poor)
  f. Bypass or detour length (miles)
  g. Route type (interstate, U.S., state, county, or city)

Solution:

Length of maximum span, number of vehicle lanes, average daily traffic, and bypass or detour length are all quantitative variables.

Toll bridge, condition of deck, and route type are all qualitative variables.

4.

(a)

The four random sampling designs are simple random sampling, stratified random sampling, cluster sampling, and systematic sampling.

(b)

Simple random sampling works like a random number generator: every unit in the population has an equal chance of being selected. Stratified random sampling divides the population into strata of units with similar characteristics and then draws a random sample from each stratum. Cluster sampling divides the population into clusters, randomly selects a set of clusters, and collects data from every unit within the chosen clusters. Systematic sampling selects every kth element from the population.
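A minimal sketch of the four designs in R, using a hypothetical population of 100 unit IDs (the strata, clusters, and k below are invented for illustration):

pop <- 1:100                                    # hypothetical population of unit IDs
simple <- sample(pop, 10)                       # simple random sample of size 10
strata <- rep(c("A", "B"), each = 50)           # two hypothetical strata
stratified <- c(sample(pop[strata == "A"], 5),  # a random sample from each stratum
                sample(pop[strata == "B"], 5))
clusters <- rep(1:10, each = 10)                # ten hypothetical clusters of size 10
keep <- sample(1:10, 2)                         # randomly choose 2 whole clusters
cluster.samp <- pop[clusters %in% keep]         # every unit in the chosen clusters
systematic <- pop[seq(sample(1:10, 1), 100, by = 10)]  # every k-th unit, k = 10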

5.

mtbe=read.csv("MTBE.csv", header=TRUE)
head(mtbe)
##     pH SpConduct DissOxy RoadsPct IndPct UrbanPct DevPct WellClass Aquifier
## 1 7.87     290.0    0.58     1.34      0    17.77  17.77   Private  Bedrock
## 2 8.63     225.9    0.84     0.72      0    10.43  10.43   Private  Bedrock
## 3 7.11     157.4    8.37     1.92      0    29.62  50.01   Private  Bedrock
## 4 7.98     723.6    0.41     2.76      0    41.65  41.65   Private  Bedrock
## 5 7.88     148.7    1.44     3.51      0    51.21  51.21   Private  Bedrock
## 6 8.36     198.2    0.18     1.48      0    22.49  29.19   Private  Bedrock
##     Depth SafeYld Distance MTBE.Detect MTBE.Level HouseDen PopDen
## 1  60.960      NA  2386.29 Below Limit        0.2   131.16  43.19
## 2  36.576      NA  3667.69 Below Limit        0.2    47.69  24.52
## 3 152.400      NA  2324.15 Below Limit        0.2    85.16  33.32
## 4      NA      NA   788.88 Below Limit        0.2   134.77  46.61
## 5  91.440      NA  1337.88 Below Limit        0.2   198.54  78.58
## 6 115.824      NA  2396.74 Below Limit        0.2   206.73  59.47
dim(mtbe)
## [1] 223  16
ind=sample(1:223,5,replace=FALSE)
mtbe[ind,]
##       pH SpConduct DissOxy RoadsPct IndPct UrbanPct DevPct WellClass Aquifier
## 170 7.87    304.20    1.57     0.07   0.00     4.54   4.54    Public  Bedrock
## 78  7.59     11.36    1.02     1.28   0.00    22.78  28.49   Private  Bedrock
## 183 7.78    352.30    4.79     3.17   8.42    25.58  25.58    Public  Bedrock
## 197 8.31    208.50    0.59     3.80   0.00    52.14  55.38    Public  Bedrock
## 121 8.08    387.80    0.71     2.86   4.21    62.71  62.71    Public  Bedrock
##       Depth   SafeYld Distance MTBE.Detect MTBE.Level HouseDen PopDen
## 170 129.540  56.77517  1150.91 Below Limit       0.20   102.33  35.70
## 78   91.440        NA  2624.08 Below Limit       0.20   108.81  40.99
## 183 182.880  32.17260   333.35      Detect       2.61   141.84  68.93
## 197  91.440  11.35503  1703.94 Below Limit       0.20   706.51 297.05
## 121 155.448 227.10068  1379.81 Below Limit       0.20   913.53 393.52

(a)

(i)

mtbeo=na.omit(mtbe)

(ii)

bedrock=mtbeo[mtbeo$Aquifier=="Bedrock",]
sd(bedrock$Depth)
## [1] 56.45357

6.

eq <- read.csv("EARTHQUAKE.csv",header=TRUE)
v <- sample(1:2929,30,replace=FALSE)
eq[v,]
##      YEAR MONTH DAY HOUR MINUTE MAGNITUDE
## 194  1994     1  18    0     30       2.7
## 938  1994     1  21   13     16       1.6
## 1753 1994     1  25   22     44       2.1
## 859  1994     1  21    7     10       1.8
## 1040 1994     1  22    0     25       1.8
## 1602 1994     1  25    1      7       2.0
## 642  1994     1  20    8      3       2.4
## 1850 1994     1  26   12     41       2.4
## 2073 1994     1  28    2     25       1.6
## 1269 1994     1  23    3     17       2.3
## 2536 1994     2   2   16     25       2.2
## 683  1994     1  20   11     32       2.2
## 2865 1994     2   5   14     31       1.6
## 1218 1994     1  22   21     21       2.4
## 502  1994     1  19   18     33       1.9
## 94   1994     1  17   16     22       3.4
## 2295 1994     1  29   21     24       2.3
## 959  1994     1  21   16     17       1.8
## 991  1994     1  21   19      9       2.2
## 1964 1994     1  27    7     27       1.9
## 2736 1994     2   4   10     30       1.7
## 746  1994     1  20   18     27       1.7
## 1830 1994     1  26   10     15       2.0
## 2678 1994     2   3   19      7       1.0
## 186  1994     1  17   23     49       4.0
## 275  1994     1  18   13     24       4.3
## 1613 1994     1  25    3     31       2.2
## 2065 1994     1  28    1     35       1.5
## 2180 1994     1  28   22     25       2.4
## 1316 1994     1  23    9     30       2.7

(a)

(i)

plot(ts(eq$MAGNITUDE))  # spell out the column name; eq$MAG relied on partial matching

(ii)

median(eq$MAGNITUDE)
## [1] 2

7.

(a)

The data collection method was stratified sampling.

(b)

The population was all fish in the Tennessee River and its tributaries.

(c)

River and Species are the only qualitative variables.

8.

  a. What type of graph is used to describe the data?

A Pareto graph is used to describe the data.

  b. Identify the variable measured for each of the 106 robot designs.

The variable measured was robot limb type.

  c. Use the graph to identify the social robot design that is currently used the most.

The design currently used the most is legs only (the Legs0 category), which accounts for 63 of the 106 robots.

  d. Compute class relative frequencies for the different categories shown in the graph.

The relative frequency of None is 0.1415, Both is 0.0755, Legs0 is 0.5943, and Wheels0 is 0.1887.
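These values follow from dividing each class frequency by the 106 total designs; a quick check in R:

freq <- c(None = 15, Both = 8, Legs0 = 63, Wheels0 = 20)
round(freq / sum(freq), 4)  # 0.1415 0.0755 0.5943 0.1887 (sum(freq) is 106)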

  e. Use the results, part d, to construct a Pareto diagram for the data.
freq = c(15, 8, 63, 20)
RL = c("None", "Both", "Legs0", "Wheels0")
l = rep(RL, freq)  # expand the category labels by their frequencies
pareto <- function(x, mn = "Pareto barplot", ...) {  # x is a vector of categories
  x.tab = table(x)
  xx.tab = sort(x.tab, decreasing = TRUE)  # order the bars by decreasing frequency
  cs = cumsum(as.vector(xx.tab))           # cumulative counts
  lenx = length(x.tab)
  bp <- barplot(xx.tab, ylim = c(0, max(cs)), las = 2)
  lb <- seq(0, cs[lenx], length.out = 11)
  axis(side = 4, at = lb, labels = paste(seq(0, 100, length = 11), "%", sep = ""),
       las = 1, line = -1, col = "Blue", col.axis = "Red")
  for (i in 1:(lenx - 1)) {  # draw the cumulative-percentage line
    segments(bp[i], cs[i], bp[i + 1], cs[i + 1], col = i, lwd = 2)
  }
  title(main = mn, ...)
}
pareto(l)

# I tried other Pareto-chart functions several times, but the plot only works with your pareto() function from class.

9.

slices=c(32,6,12)
lbs=c("Windows","Explorer","Office")
pie(slices, labels=lbs,main="Microsoft products with security issues")

Explorer has the lowest proportion of security issues in 2012.
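The claim can be checked by converting the slice counts to proportions (a quick sketch using the slices vector defined above):

round(slices / sum(slices), 2)  # Windows 0.64, Explorer 0.12, Office 0.24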

freq=c(6,8,22,3,11)
RL=c("Denial of Service","Information Disclosure","Remote Code Execution","Spoofing","Priviledge Elevation")
l=rep(RL,freq)
pareto(l)

Microsoft should focus on remote code execution, which accounts for 22 of the 50 reported bugs.

10.

swd=read.csv("SWDEFECTS.csv", header=TRUE)
head(swd)
##   Mloc Mvg Mevg Mivg   Hn  Hvol Hpgmlen Hdiff Hintell Heffort   Hb Htime Hloc
## 1  1.1 1.4  1.4  1.4  1.3   1.3    1.30  1.30    1.30       1 1.30     1    2
## 2  1.0 1.0  1.0  1.0  1.0   1.0    1.00  1.00    1.00       1 1.00     1    1
## 3 24.0 5.0  1.0  3.0 63.0 309.1    0.11  9.50   32.54    2937 0.10   163    1
## 4 20.0 4.0  4.0  2.0 47.0 215.5    0.06 16.00   13.47    3448 0.07   192    0
## 5 24.0 6.0  6.0  2.0 72.0 346.1    0.06 17.33   19.97    6000 0.12   333    0
## 6 24.0 6.0  6.0  2.0 72.0 346.1    0.06 17.33   19.97    6000 0.12   333    0
##   Hcomm Hblank loc.comm uniOp uniOpnd totOp totOpnd brnchcnt defect
## 1     2      2        2   1.2     1.2   1.2     1.2      1.4  FALSE
## 2     1      1        1   1.0     1.0   1.0     1.0      1.0   TRUE
## 3     0      6        0  15.0    15.0  44.0    19.0      9.0  FALSE
## 4     0      3        0  16.0     8.0  31.0    16.0      7.0  FALSE
## 5     0      3        0  16.0    12.0  46.0    26.0     11.0  FALSE
## 6     0      3        0  16.0    12.0  46.0    26.0     11.0  FALSE
##   predict.vg.10 predict.evg.14.5 predict.ivg.9.2 predict.loc.50
## 1            no               no              no             no
## 2            no               no              no             no
## 3            no               no              no             no
## 4            no               no              no             no
## 5            no               no              no             no
## 6            no               no              no             no
library(plotrix)
tab=table(swd$defect)
rtab=tab/sum(tab)
round(rtab,2)
## 
## FALSE  TRUE 
##   0.9   0.1
pie3D(rtab, labels=c("OK","Defective"), main="Pie plot of SWD defects")

Defective software code is not very likely: only about 10% of the sampled modules were defective.

11.

  a. Construct a relative frequency histogram for the voltage readings of the old process.

  b. Construct a stem-and-leaf display for the voltage readings of the old process. Which of the two graphs in parts a and b is more informative about where most of the voltage readings lie?

  c. Construct a relative frequency histogram for the voltage readings of the new process.

  d. Compare the two graphs in parts a and c. (You may want to draw the two histograms on the same graph.) Does it appear that the manufacturing process can be established locally (i.e., is the new process as good as or better than the old)?

  e. Find and interpret the mean, median, and mode for each of the voltage readings data sets. Which is the preferred measure of central tendency? Explain.

  f. Calculate the z-score for a voltage reading of 10.50 at the old location.

  g. Calculate the z-score for a voltage reading of 10.50 at the new location.

  h. Based on the results of parts f and g, at which location is a voltage reading of 10.50 more likely to occur? Explain.

  i. Construct a box plot for the data at the old location. Do you detect any outliers?

  j. Use the method of z-scores to detect outliers at the old location.

  k. Construct a box plot for the data at the new location. Do you detect any outliers?

  l. Use the method of z-scores to detect outliers at the new location.

  m. Compare the distributions of voltage readings at the two locations by placing the box plots side by side.

(a)

volt <- read.csv("VOLTAGE.csv", header=TRUE)
old <- volt[volt$LOCATION=="OLD",]
hist(old$VOLTAGE, xlim = c(8.0, 10.6),
     breaks = c(8.0, 8.29, 8.58, 8.87, 9.16, 9.44, 9.73, 10.02, 10.31, 10.6),
     main = "Relative frequencies of old voltage readings")
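Note that hist() with explicit breaks plots raw counts by default. A minimal sketch for rescaling the y-axis to true relative frequencies (same shape, different axis), assuming the same breaks as above:

h <- hist(old$VOLTAGE, breaks = c(8.0, 8.29, 8.58, 8.87, 9.16, 9.44, 9.73, 10.02, 10.31, 10.6),
          plot = FALSE)
h$counts <- h$counts / sum(h$counts)  # rescale counts to relative frequencies
plot(h, main = "Relative frequencies of old voltage readings", ylab = "Relative frequency")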

(b)

stem(old$VOLTAGE)
## 
##   The decimal point is at the |
## 
##    8 | 1
##    8 | 778
##    9 | 
##    9 | 677888899
##   10 | 0000000011122333
##   10 | 6

The histogram from part a is more informative about where most of the voltage readings lie.

(c)

new <- subset(volt, subset = LOCATION=="NEW")
vtn <- new$VOLTAGE
vtn
##  [1]  9.19  9.63 10.10  9.70 10.09  9.60 10.05 10.12  9.49  9.37 10.01  8.82
## [13]  9.43 10.03  9.85  9.27  8.83  9.39  9.48  9.64  8.82  8.65  8.51  9.14
## [25]  9.75  8.78  9.35  9.54  9.36  8.68
max(vtn)
## [1] 10.12
min(vtn)
## [1] 8.51
lept <- min(vtn) - 0.05  # left endpoint, just below the minimum
rept <- max(vtn) + 0.05  # right endpoint, just above the maximum
rnge <- rept - lept
inc <- rnge / 9          # width of each of the 9 classes
inc
## [1] 0.19
cl <- seq(lept, rept, by = inc)  # class boundaries
cl
##  [1]  8.46  8.65  8.84  9.03  9.22  9.41  9.60  9.79  9.98 10.17
cvtn <- cut(vtn, breaks = cl)  # bin each reading into its class
new.tab = table(cvtn)
barplot(new.tab, space = 0, main = "Frequency Histogram (New)", las = 2)

hist(vtn,nclass=10)

(d)

12.

pipe <- read.csv("ROUGHPIPE.csv",header=TRUE)
Rpipe <- pipe$ROUGH
m <- mean(Rpipe)
stan <- sd(Rpipe)
# By Chebyshev's rule, at least 1 - 1/k^2 of the data lie within k standard
# deviations of the mean; k = 4.72 gives 1 - 1/4.72^2 ~ 0.955, i.e. at least 95%
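# (The smallest sufficient k solves 1 - 1/k^2 >= 0.95, i.e. k >= sqrt(20) ~ 4.472;
#  any k >= 4.472, including the 4.72 used here, guarantees the 95% coverage.)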

intmin <- m-4.72*stan
intmin
## [1] -0.5918618
intmax <- m+4.72*stan
intmax
## [1] 4.353862
intval <- sum(Rpipe >= intmin & Rpipe <= intmax)  # count readings inside the interval
(intval/length(Rpipe)) *100
## [1] 100

Thus, by Chebyshev's rule the interval (-0.5919, 4.3539) is guaranteed to contain at least 95% of the data values; in this sample it contains all of them (100%).

13.

  a. Find the mean, median, and mode for the number of ant species discovered at the 11 sites. Interpret each of these values.
ants <- read.csv("GOBIANTS.csv", header=TRUE)
mean(ants$AntSpecies)
## [1] 12.81818
median(ants$AntSpecies)
## [1] 5
getmode <- function(v) {  # the mode: the most frequently occurring value
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]  # count each value, return the most common
}
getmode(ants$AntSpecies)
## [1] 5
  b. Which measure of central tendency would you recommend to describe the center of the number of ant species distribution? Explain.

I would recommend the mode, because it gives the most frequently occurring value.

  c. Find the mean, median, and mode for the total plant cover percentage at the 5 Dry Steppe sites only.
steppe <- ants[ants$Region=="Dry Steppe",]
mean(steppe$PlantCov)
## [1] 40.4
median(steppe$PlantCov)
## [1] 40
getmode(steppe$PlantCov)
## [1] 40
  d. Find the mean, median, and mode for the total plant cover percentage at the 6 Gobi Desert sites only.
gobi <- ants[ants$Region=="Gobi Desert",]
mean(gobi$PlantCov)
## [1] 28
median(gobi$PlantCov)
## [1] 26
getmode(gobi$PlantCov)
## [1] 30
  e. Based on the results, parts c and d, does the center of the total plant cover percentage distribution appear to be different at the two regions?

Yes. For the Dry Steppe it is at about 40%, whereas for the Gobi Desert it is closer to 30%.

14.

  a. Use a graphical method to describe the velocity distribution of galaxy cluster A1775.
galaxy <- read.csv("GALAXY2.csv", header=TRUE)
hist(galaxy$VELOCITY)

  b. Examine the graph, part a. Is there evidence to support the double cluster theory? Explain.

There does appear to be evidence to support the double cluster theory, because the histogram contains two distinct peaks.

  c. Calculate numerical descriptive measures (e.g., mean and standard deviation) for galaxy velocities in cluster A1775. Depending on your answer to part b, you may need to calculate two sets of numerical descriptive measures, one for each of the clusters (say, A1775A and A1775B) within the double cluster.
A1775A <- galaxy$VELOCITY[galaxy$VELOCITY <= 21000]  # lower-velocity cluster
mean(A1775A)
## [1] 19462.24
median(A1775A)
## [1] 19408
getmode(A1775A)
## [1] 20210
A1775B <- galaxy$VELOCITY[galaxy$VELOCITY > 21000]   # higher-velocity cluster
mean(A1775B)
## [1] 22838.47
median(A1775B)
## [1] 22780
getmode(A1775B)
## [1] 22922
  d. Suppose you observe a galaxy velocity of 20,000 km/s. Is this galaxy likely to belong to cluster A1775A or A1775B? Explain.

It is more likely to belong to cluster A1775A, because that cluster's measures of center (mean 19,462 km/s, median 19,408 km/s) lie near 20,000 km/s, while cluster A1775B is centered near 22,800 km/s.
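As a sketch extending this answer (not part of the original solution), the comparison can be made precise by computing the z-score of 20,000 km/s under each cluster; the smaller absolute z-score marks the more plausible cluster:

zA <- (20000 - mean(A1775A)) / sd(A1775A)  # standard deviations from A1775A's center
zB <- (20000 - mean(A1775B)) / sd(A1775B)  # standard deviations from A1775B's center
c(zA = zA, zB = zB)                        # smaller |z| -> more likely cluster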

15.

library(ggplot2)
b <- ggplot(ddt, aes(x = RIVER, y = LENGTH))
b <- b + geom_boxplot(aes(fill = SPECIES)) +
  theme(axis.text.x = element_text(angle = 0, vjust = 0.6),
        text = element_text(family = "Courier New")) +  # font belongs in theme(), not ggplot()
  labs(title = "Caleb Gray")

b