Completed 14/15
This is how our final grade is calculated.
4 assignments each worth 3.75% of the final grade
16 lab/minilab reports collectively worth 10% of our final grade
2 projects, Project 1 worth 3.3% and Project 2 worth 6.6% of final grade. Collectively worth 10% of final
In class quizzes worth 10% of final
chapter quizzes worth 5% of final
Mid-term exams 20% of final
Final exam 30%
the grading scale is as follows
ddt<- read.csv("DDT-1.csv")
m=with(ddt, as.numeric(factor(MILE))) #A
length(unique(m)) #B
## [1] 17
coplot(LENGTH~WEIGHT|RIVER*SPECIES,data=ddt,col=m, rows=1)
#pch=ifelse(m>=1 & m<=5, 1, ifelse(m>5 & m<=10, 2, 3)))
The lower left three conditional plots show length vs. weight for the three species of fish found in the Tennessee River. Each data point is color-coded by m, the numeric level code of its MILE value.
Line A converts MILE to a factor, whose levels are the distinct MILE values sorted in ascending order. The as.numeric() command then replaces each observation with the integer index of its level. The with() call evaluates this expression inside ddt, and the resulting vector (one code per row of ddt) is stored as m.
Line B uses the unique() command to drop duplicate values from m, then uses the length() command to count the values that remain. The result, 17, is the number of distinct MILE values.
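To make lines A and B concrete, here is a minimal sketch on a small made-up vector. The `mile` values are hypothetical and are not taken from DDT-1.csv.

```r
# Hypothetical mile markers (not from DDT-1.csv), with repeats
mile <- c(5, 280, 5, 300, 280)

f <- factor(mile)         # levels are the distinct values, sorted: "5" "280" "300"
m <- as.numeric(f)        # each value replaced by the index of its level
print(levels(f))
print(m)                  # 1 2 1 3 2
print(length(unique(m)))  # 3 distinct levels
```

The codes in `m` carry only the rank of each distinct value, which is why they work as color indices in a plot.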
The top six plots are empty because no bass or buffalo fish were caught in the FCM, LCM, or SCM rivers.
#elementwise | and &, with parentheses so that both conditions are tested on every row
with(ddt, ddt[(SPECIES=="LMBASS" | SPECIES=="SMBUFFALO") & (RIVER=="FCM" | RIVER=="LCM" | RIVER=="SCM"),])
## [1] RIVER MILE SPECIES LENGTH WEIGHT DDT
## <0 rows> (or 0-length row.names)
fcm_cat <- ddt[ddt$SPECIES=="CCATFISH" & ddt$RIVER=="FCM",] #renamed so that base::cat is not masked
catddt <- fcm_cat$DDT
mean(catddt)
## [1] 45
The mean value of DDT found in catfish on the FCM river was 45.
National Bridge Inventory. All highway bridges in the United States are inspected periodically for structural deficiency by the Federal Highway Administration (FHWA). Data from the FHWA inspections are compiled into the National Bridge Inventory (NBI). Several of the nearly 100 variables maintained by the NBI are listed below. Classify each variable as quantitative or qualitative.
a. Length of maximum span (feet)
b. Number of vehicle lanes
c. Toll bridge (yes or no)
d. Average daily traffic
e. Condition of deck (good, fair, or poor)
f. Bypass or detour length (miles)
g. Route type (interstate, U.S., state, county, or city)
Solution:
Length of maximum span, number of vehicle lanes, average daily traffic, and bypass or detour length are all quantitative variables.
Toll bridge, condition of deck, and route type are all qualitative variables.
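As a rough illustration of the same classification in R, the sketch below labels columns by storage type. The `bridges` data frame, its column names, and its values are hypothetical. Storage type is only a proxy, since a qualitative variable coded as numbers would be mislabeled.

```r
# Hypothetical bridge records; column names and values are illustrative only
bridges <- data.frame(
  span_ft   = c(120.5, 88.0, 240.3),
  lanes     = c(2, 4, 6),
  toll      = c("no", "yes", "no"),
  deck_cond = c("good", "fair", "poor"),
  stringsAsFactors = FALSE
)

# Numeric columns are treated as quantitative, everything else as qualitative
var_type <- ifelse(sapply(bridges, is.numeric), "quantitative", "qualitative")
print(var_type)
```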
The four random sampling designs are simple random sampling, stratified random sampling, cluster sampling, and systematic sampling.
In simple random sampling, every unit in the population has an equal chance of being selected, as if the units were picked by a random number generator. In stratified random sampling, the population is first divided into groups (strata) whose members share similar characteristics, and a random sample is then drawn from every stratum. In cluster sampling, the population is divided into clusters, a random sample of clusters is selected, and data are collected from the units in the selected clusters. In systematic sampling, every kth element of the population is selected.
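The four designs can be sketched in base R as follows. The population `pop`, its group labels, the sample sizes, and the seed are all hypothetical choices made for illustration.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical population of 20 units in 4 groups of 5
pop <- data.frame(id = 1:20, group = rep(c("A", "B", "C", "D"), each = 5))

# Simple random sample: every unit has the same chance of selection
srs <- pop[sample(nrow(pop), 4), ]

# Stratified: draw one unit at random from every group (stratum)
strat_rows <- unlist(lapply(split(seq_len(nrow(pop)), pop$group),
                            function(rows) rows[sample(length(rows), 1)]))
strat <- pop[strat_rows, ]

# Cluster: pick whole groups at random and keep all of their units
chosen <- sample(unique(pop$group), 2)
clust <- pop[pop$group %in% chosen, ]

# Systematic: random start, then every k-th unit
k <- 5
start <- sample(k, 1)
syst <- pop[seq(start, nrow(pop), by = k), ]
```

The stratified sample always covers every group, while the cluster sample covers only the chosen groups but takes all of their units.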
mtbe=read.csv("MTBE.csv", header=TRUE)
head(mtbe)
## pH SpConduct DissOxy RoadsPct IndPct UrbanPct DevPct WellClass Aquifier
## 1 7.87 290.0 0.58 1.34 0 17.77 17.77 Private Bedrock
## 2 8.63 225.9 0.84 0.72 0 10.43 10.43 Private Bedrock
## 3 7.11 157.4 8.37 1.92 0 29.62 50.01 Private Bedrock
## 4 7.98 723.6 0.41 2.76 0 41.65 41.65 Private Bedrock
## 5 7.88 148.7 1.44 3.51 0 51.21 51.21 Private Bedrock
## 6 8.36 198.2 0.18 1.48 0 22.49 29.19 Private Bedrock
## Depth SafeYld Distance MTBE.Detect MTBE.Level HouseDen PopDen
## 1 60.960 NA 2386.29 Below Limit 0.2 131.16 43.19
## 2 36.576 NA 3667.69 Below Limit 0.2 47.69 24.52
## 3 152.400 NA 2324.15 Below Limit 0.2 85.16 33.32
## 4 NA NA 788.88 Below Limit 0.2 134.77 46.61
## 5 91.440 NA 1337.88 Below Limit 0.2 198.54 78.58
## 6 115.824 NA 2396.74 Below Limit 0.2 206.73 59.47
dim(mtbe)
## [1] 223 16
ind=sample(1:223,5,replace=FALSE)
mtbe[ind,]
## pH SpConduct DissOxy RoadsPct IndPct UrbanPct DevPct WellClass Aquifier
## 170 7.87 304.20 1.57 0.07 0.00 4.54 4.54 Public Bedrock
## 78 7.59 11.36 1.02 1.28 0.00 22.78 28.49 Private Bedrock
## 183 7.78 352.30 4.79 3.17 8.42 25.58 25.58 Public Bedrock
## 197 8.31 208.50 0.59 3.80 0.00 52.14 55.38 Public Bedrock
## 121 8.08 387.80 0.71 2.86 4.21 62.71 62.71 Public Bedrock
## Depth SafeYld Distance MTBE.Detect MTBE.Level HouseDen PopDen
## 170 129.540 56.77517 1150.91 Below Limit 0.20 102.33 35.70
## 78 91.440 NA 2624.08 Below Limit 0.20 108.81 40.99
## 183 182.880 32.17260 333.35 Detect 2.61 141.84 68.93
## 197 91.440 11.35503 1703.94 Below Limit 0.20 706.51 297.05
## 121 155.448 227.10068 1379.81 Below Limit 0.20 913.53 393.52
mtbeo=na.omit(mtbe)
bedrock=mtbeo[mtbeo$Aquifier=="Bedrock",]
sd(bedrock$Depth)
## [1] 56.45357
eq <- read.csv("EARTHQUAKE.csv",header=TRUE)
v <- sample(1:2929,30,replace=FALSE)
eq[v,]
## YEAR MONTH DAY HOUR MINUTE MAGNITUDE
## 194 1994 1 18 0 30 2.7
## 938 1994 1 21 13 16 1.6
## 1753 1994 1 25 22 44 2.1
## 859 1994 1 21 7 10 1.8
## 1040 1994 1 22 0 25 1.8
## 1602 1994 1 25 1 7 2.0
## 642 1994 1 20 8 3 2.4
## 1850 1994 1 26 12 41 2.4
## 2073 1994 1 28 2 25 1.6
## 1269 1994 1 23 3 17 2.3
## 2536 1994 2 2 16 25 2.2
## 683 1994 1 20 11 32 2.2
## 2865 1994 2 5 14 31 1.6
## 1218 1994 1 22 21 21 2.4
## 502 1994 1 19 18 33 1.9
## 94 1994 1 17 16 22 3.4
## 2295 1994 1 29 21 24 2.3
## 959 1994 1 21 16 17 1.8
## 991 1994 1 21 19 9 2.2
## 1964 1994 1 27 7 27 1.9
## 2736 1994 2 4 10 30 1.7
## 746 1994 1 20 18 27 1.7
## 1830 1994 1 26 10 15 2.0
## 2678 1994 2 3 19 7 1.0
## 186 1994 1 17 23 49 4.0
## 275 1994 1 18 13 24 4.3
## 1613 1994 1 25 3 31 2.2
## 2065 1994 1 28 1 35 1.5
## 2180 1994 1 28 22 25 2.4
## 1316 1994 1 23 9 30 2.7
plot(ts(eq$MAGNITUDE)) #full column name instead of relying on partial matching of $MAG
median(eq$MAGNITUDE)
## [1] 2
The data collection method was stratified sampling.
The population was all fish in the Tennessee River and its tributaries.
River and Species are the only qualitative variables.
A Pareto graph is used to describe the data.
The variable measured was robot limb type.
The design that is most used is the one with zero legs.
The relative frequency of None is 0.1415, Both is 0.0755, Legs0 is 0.5943, and Wheels0 is 0.1887.
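The relative frequencies quoted above are each count divided by the total of 106 robots. A minimal sketch of that arithmetic, using the same four counts:

```r
# Counts reported for each robot limb design
freq <- c(None = 15, Both = 8, Legs0 = 63, Wheels0 = 20)

# Relative frequency = class count / total count
rel_freq <- round(freq / sum(freq), 4)
print(rel_freq)
```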
freq=c(15,8,63,20)
RL=c("None","Both","Legs0","Wheels0")
l=rep(RL,freq)
pareto <- function(x, mn="Pareto barplot", ...){ # x is a vector of category labels
  x.tab <- table(x)
  xx.tab <- sort(x.tab, decreasing=TRUE) # bars ordered from most to least frequent
  cs <- cumsum(as.vector(xx.tab)) # cumulative counts
  lenx <- length(x.tab)
  bp <- barplot(xx.tab, ylim=c(0,max(cs)), las=2)
  lb <- seq(0, cs[lenx], length.out=11) # tick positions for 0%, 10%, ..., 100%
  axis(side=4, at=lb, labels=paste(seq(0,100,length.out=11),"%",sep=""), las=1, line=-1, col="Blue", col.axis="Red")
  for(i in seq_len(lenx-1)){ # join consecutive cumulative points
    segments(bp[i], cs[i], bp[i+1], cs[i+1], col=i, lwd=2)
  }
  title(main=mn, ...)
}
pareto(l)
#I tried other Pareto functions several times, but only your pareto function worked.
slices=c(32,6,12)
lbs=c("Windows","Explorer","Office")
pie(slices, labels=lbs,main="Microsoft products with security issues")
Explorer has the lowest proportion of security issues in 2012.
freq=c(6,8,22,3,11)
RL=c("Denial of Service","Information Disclosure","Remote Code Execution","Spoofing","Privilege Elevation")
l=rep(RL,freq)
pareto(l)
Microsoft should focus on Remote Code Execution, which accounts for the most security issues (22 of 50).
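The ranking behind that recommendation can be checked numerically. The sketch below sorts the same five counts and computes the cumulative share that a Pareto plot draws as its line. The names `counts`, `ranked`, and `cum_share` are introduced here for illustration.

```r
# Security issue counts by type, as entered above
counts <- c("Denial of Service" = 6, "Information Disclosure" = 8,
            "Remote Code Execution" = 22, "Spoofing" = 3,
            "Privilege Elevation" = 11)

ranked <- sort(counts, decreasing = TRUE)   # largest category first
cum_share <- cumsum(ranked) / sum(ranked)   # cumulative proportion of all issues
print(round(cum_share, 2))
```

The first two categories together cover about two thirds of all reported issues.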
swd=read.csv("SWDEFECTS.csv", header=TRUE)
head(swd)
## Mloc Mvg Mevg Mivg Hn Hvol Hpgmlen Hdiff Hintell Heffort Hb Htime Hloc
## 1 1.1 1.4 1.4 1.4 1.3 1.3 1.30 1.30 1.30 1 1.30 1 2
## 2 1.0 1.0 1.0 1.0 1.0 1.0 1.00 1.00 1.00 1 1.00 1 1
## 3 24.0 5.0 1.0 3.0 63.0 309.1 0.11 9.50 32.54 2937 0.10 163 1
## 4 20.0 4.0 4.0 2.0 47.0 215.5 0.06 16.00 13.47 3448 0.07 192 0
## 5 24.0 6.0 6.0 2.0 72.0 346.1 0.06 17.33 19.97 6000 0.12 333 0
## 6 24.0 6.0 6.0 2.0 72.0 346.1 0.06 17.33 19.97 6000 0.12 333 0
## Hcomm Hblank loc.comm uniOp uniOpnd totOp totOpnd brnchcnt defect
## 1 2 2 2 1.2 1.2 1.2 1.2 1.4 FALSE
## 2 1 1 1 1.0 1.0 1.0 1.0 1.0 TRUE
## 3 0 6 0 15.0 15.0 44.0 19.0 9.0 FALSE
## 4 0 3 0 16.0 8.0 31.0 16.0 7.0 FALSE
## 5 0 3 0 16.0 12.0 46.0 26.0 11.0 FALSE
## 6 0 3 0 16.0 12.0 46.0 26.0 11.0 FALSE
## predict.vg.10 predict.evg.14.5 predict.ivg.9.2 predict.loc.50
## 1 no no no no
## 2 no no no no
## 3 no no no no
## 4 no no no no
## 5 no no no no
## 6 no no no no
library(plotrix)
tab=table(swd$defect)
rtab=tab/sum(tab)
round(rtab,2)
##
## FALSE TRUE
## 0.9 0.1
pie3D(rtab,labels=list("OK", "Defective"), main="Pie plot of SWD")
Defective software code is relatively uncommon: only about 10% of the records in SWDEFECTS are marked defective.
a. Construct a relative frequency histogram for the voltage readings of the old process.
b. Construct a stem-and-leaf display for the voltage readings of the old process. Which of the two graphs in parts a and b is more informative about where most of the voltage readings lie?
c. Construct a relative frequency histogram for the voltage readings of the new process.
d. Compare the two graphs in parts a and c. (You may want to draw the two histograms on the same graph.) Does it appear that the manufacturing process can be established locally (i.e., is the new process as good as or better than the old)?
e. Find and interpret the mean, median, and mode for each of the voltage readings data sets. Which is the preferred measure of central tendency? Explain.
f. Calculate the z-score for a voltage reading of 10.50 at the old location.
g. Calculate the z-score for a voltage reading of 10.50 at the new location.
h. Based on the results of parts f and g, at which location is a voltage reading of 10.50 more likely to occur? Explain.
i. Construct a box plot for the data at the old location. Do you detect any outliers?
j. Use the method of z-scores to detect outliers at the old location.
k. Construct a box plot for the data at the new location. Do you detect any outliers?
l. Use the method of z-scores to detect outliers at the new location.
m. Compare the distributions of voltage readings at the two locations by placing
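The z-score questions rely on z = (x - mean) / sd. The sketch below is one way to compute it, using the 30 new-location readings printed in this report. The helper `zscore` and the |z| > 3 outlier cutoff are assumptions added for illustration (the cutoff is a common rule of thumb).

```r
# New-location voltage readings, copied from the output printed in this report
vtn <- c(9.19, 9.63, 10.10, 9.70, 10.09, 9.60, 10.05, 10.12, 9.49, 9.37,
         10.01, 8.82, 9.43, 10.03, 9.85, 9.27, 8.83, 9.39, 9.48, 9.64,
         8.82, 8.65, 8.51, 9.14, 9.75, 8.78, 9.35, 9.54, 9.36, 8.68)

# z-score: number of sample standard deviations a value lies from the sample mean
zscore <- function(value, x) (value - mean(x)) / sd(x)

z_new <- zscore(10.50, vtn)
print(z_new)

# Readings with |z| > 3 would be flagged as outliers under the rule of thumb
outliers <- vtn[abs(zscore(vtn, vtn)) > 3]
print(outliers)
```

The same function applies to the old-location readings once they are loaded from VOLTAGE.csv.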
volt <- read.csv("VOLTAGE.csv", header=TRUE)
old <- volt[volt$LOCATION=="OLD",]
hist(old$VOLTAGE, xlim=c(8.0,10.6), breaks=c(8.0,8.29,8.58,8.87,9.16,9.44,9.73,10.02,10.31,10.6), main="Relative frequencies of old voltage readings")
stem(old$VOLTAGE)
##
## The decimal point is at the |
##
## 8 | 1
## 8 | 778
## 9 |
## 9 | 677888899
## 10 | 0000000011122333
## 10 | 6
The plot for part a is more informative.
new <-subset(volt,subset=LOCATION=="NEW")
vtn <- new$VOLTAGE
vtn
## [1] 9.19 9.63 10.10 9.70 10.09 9.60 10.05 10.12 9.49 9.37 10.01 8.82
## [13] 9.43 10.03 9.85 9.27 8.83 9.39 9.48 9.64 8.82 8.65 8.51 9.14
## [25] 9.75 8.78 9.35 9.54 9.36 8.68
max(vtn)
## [1] 10.12
min(vtn)
## [1] 8.51
lept <- min(vtn)-0.05
rept <- max(vtn)+0.05
rnge <- rept-lept
inc <- rnge/9
inc
## [1] 0.19
seq(lept, rept,by=inc) ->cl
cl
## [1] 8.46 8.65 8.84 9.03 9.22 9.41 9.60 9.79 9.98 10.17
cvtn <-cut(vtn,breaks=cl)
new.tab=table(cvtn)
barplot(new.tab,space=0,main="Frequency Histogram(New)",las=2)
hist(vtn,nclass=10)
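The two plots above show class counts. Since part c asks for relative frequencies, one possible variant divides each class count by the number of readings before plotting. This is a sketch that reuses the readings and class limits shown above. The name `rel` is introduced here.

```r
# New-location voltage readings, copied from the vtn output above
vtn <- c(9.19, 9.63, 10.10, 9.70, 10.09, 9.60, 10.05, 10.12, 9.49, 9.37,
         10.01, 8.82, 9.43, 10.03, 9.85, 9.27, 8.83, 9.39, 9.48, 9.64,
         8.82, 8.65, 8.51, 9.14, 9.75, 8.78, 9.35, 9.54, 9.36, 8.68)

cl <- seq(8.46, 10.17, length.out = 10)             # same nine classes as above
rel <- table(cut(vtn, breaks = cl)) / length(vtn)   # class relative frequencies

barplot(rel, space = 0, las = 2, ylab = "Relative frequency",
        main = "Relative frequency histogram (New)")
```

The bar heights now sum to 1, so the plot can be compared directly with a relative frequency histogram for the old location even if the sample sizes differ.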
pipe <- read.csv("ROUGHPIPE.csv",header=TRUE)
Rpipe <- pipe$ROUGH
m <- mean(Rpipe)
stan <- sd(Rpipe)
#Chebyshev's rule guarantees at least 1-1/k^2 of the data within k standard deviations of the mean; 95% needs k>=sqrt(20) (about 4.47), so k=4.72 is sufficient
intmin <- m-4.72*stan
intmin
## [1] -0.5918618
intmax <- m+4.72*stan
intmax
## [1] 4.353862
intval <- sum(Rpipe>=intmin & Rpipe<=intmax) #number of readings inside the interval
(intval/length(Rpipe)) *100
## [1] 100
Thus, the interval (-0.5919,4.3539) contains at least 95% of the data values.
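The multiplier in Chebyshev's rule comes from solving 1 - 1/k^2 = p for k. A short sketch of that arithmetic, with helper names `cheb_k` and `cheb_bound` introduced here for illustration:

```r
# Chebyshev: at least 1 - 1/k^2 of any data set lies within k SDs of the mean
cheb_k <- function(p) sqrt(1 / (1 - p))   # smallest k that guarantees proportion p
cheb_bound <- function(k) 1 - 1 / k^2     # proportion guaranteed for a given k

print(cheb_k(0.95))      # sqrt(20), about 4.47
print(cheb_bound(4.72))  # about 0.955, so k = 4.72 also guarantees at least 95%
```

Any k at or above sqrt(20) satisfies the 95% requirement. A larger k only makes the interval wider than necessary.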
ants <- read.csv("GOBIANTS.csv", header=TRUE)
mean(ants$AntSpecies)
## [1] 12.81818
median(ants$AntSpecies)
## [1] 5
getmode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
getmode(ants$AntSpecies)
## [1] 5
I would recommend the mode, because it identifies the most frequently occurring value.
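For reference, this is how getmode behaves on small made-up vectors (the inputs are hypothetical). Note that when two values tie for the highest count, it returns whichever appears first in the data.

```r
getmode <- function(v) {
  uniqv <- unique(v)                            # distinct values, in order of first appearance
  uniqv[which.max(tabulate(match(v, uniqv)))]   # value with the highest count
}

print(getmode(c(3, 5, 5, 2, 3, 5)))  # 5 occurs three times
print(getmode(c(1, 1, 2, 2)))        # tie: returns the value that appears first, 1
```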
steppe <- ants[ants$Region=="Dry Steppe",]
mean(steppe$PlantCov)
## [1] 40.4
median(steppe$PlantCov)
## [1] 40
getmode(steppe$PlantCov)
## [1] 40
gobi <- ants[ants$Region=="Gobi Desert",]
mean(gobi$PlantCov)
## [1] 28
median(gobi$PlantCov)
## [1] 26
getmode(gobi$PlantCov)
## [1] 30
Yes. Plant cover in the Dry Steppe centers at about 40%, whereas in the Gobi Desert it is closer to 30%.
galaxy <- read.csv("GALAXY2.csv", header=TRUE)
hist(galaxy$VELOCITY)
There does appear to be evidence to support the double cluster, because the histogram contains two distinct peaks.
A1775A <- galaxy$VELOCITY[galaxy$VELOCITY<=21000] #velocities as a numeric vector, so mean() and median() apply directly
mean(A1775A)
## [1] 19462.24
median(A1775A)
## [1] 19408
getmode(A1775A)
## [1] 20210
A1775B <- galaxy$VELOCITY[galaxy$VELOCITY>21000] #velocities as a numeric vector, so mean() and median() apply directly
mean(A1775B)
## [1] 22838.47
median(A1775B)
## [1] 22780
getmode(A1775B)
## [1] 22922
It is more likely to belong to cluster A1775A, because that cluster's mean (19462.24) and median (19408) lie close to 20,000 km/s, while the center of A1775B is near 22,800 km/s.
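The same reasoning can be written as a small nearest-center rule. The sketch below uses made-up velocities (not values from GALAXY2.csv), the 21000 cutoff used above, and hypothetical names `velocity`, `cluster`, `centers`, and `new_velocity`.

```r
# Hypothetical velocities in km/s (illustrative only, not from GALAXY2.csv)
velocity <- c(18900, 19400, 19600, 20100, 22300, 22800, 23100)
cluster <- ifelse(velocity <= 21000, "A1775A", "A1775B")   # same cutoff as above

centers <- tapply(velocity, cluster, median)   # median velocity per cluster
print(centers)

# Assign a new galaxy to the cluster whose median is closest
new_velocity <- 20000
nearest <- names(centers)[which.min(abs(centers - new_velocity))]
print(nearest)
```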
library(ggplot2)
b = ggplot(ddt, aes(x = RIVER, y = LENGTH))
b = b + geom_boxplot(aes(fill= SPECIES)) +
theme(axis.text.x = element_text(angle=0, vjust=0.6)) +
labs(title="Caleb Gray")
b