All students in this course are expected to adhere to University standards of academic integrity.
Cheating, plagiarism, misrepresentation, and other forms of academic dishonesty will not be tolerated.
This includes, but is not limited to, consulting with another person during an exam, turning in written work that was prepared by someone other than you, making minor modifications to the work of someone else and turning it in as your own, or engaging in misrepresentation in seeking a postponement or extension.
For complete information, please go to CWRU Academic Integrity Policy.
.Rmd file to turn in as .pdf
report.Rmd, .pdf to the Canvas Assignment Page
You have a pdf of the OIStats book in the readings folder of your Repo
If the answer to a question part, like a), is in your code block,
# a) to show it in your code
A car insurance company advertises that
A market researcher at a competing insurance discounter
1.a) Are conditions for inference satisfied?
1.b) Perform a hypothesis test and state your conclusion.
1.c) Do you agree with the market researcher
1.d) Calculate a 90% confidence interval
1.e) Do your results from the hypothesis test
sample1 <- 82
mean1 <- 395
sd1 <- 102
prob1 <- pnorm(432, mean1, sd1)
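# b) test statistic for H0: mu = 432 against H1: mu < 432
test1 <- (mean1-432)/(sd1/sqrt(sample1))
test1
## [1] -3.284797
# the alternative is one-sided (mu < 432), so the p-value is the lower tail
pvalue1 <- pt(test1,sample1-1,lower.tail = TRUE)
pvalue1
## [1] 0.0007545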
pe1 <- 395
min1 <- pe1 - sd1/sqrt(sample1)*qnorm(0.95)
max1 <- pe1 + sd1/sqrt(sample1)*qnorm(0.95)
min1
## [1] 376.4723
max1
## [1] 413.5277
H0 = the amount of savings advertised is not an overestimate
H1 = the amount of savings advertised is an overestimate
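The test and interval in parts 1.b) to 1.e) can be cross-checked with a small helper that works directly from the summary statistics. This is only a sketch: the helper name `t_from_summary` is hypothetical, it assumes a one-sided alternative (true mean below the advertised value of 432), and it uses the t distribution for the interval rather than `qnorm()`.

```r
# Hypothetical helper: one-sample t test from summary statistics.
t_from_summary <- function(xbar, s, n, mu0, conf = 0.90) {
  se <- s / sqrt(n)                              # standard error of the mean
  t_stat <- (xbar - mu0) / se                    # test statistic
  p_lower <- pt(t_stat, df = n - 1)              # P(T <= t), for H1: mu < mu0
  t_crit <- qt(1 - (1 - conf) / 2, df = n - 1)   # two-sided critical value
  ci <- c(lower = xbar - t_crit * se, upper = xbar + t_crit * se)
  list(t = t_stat, p_value = p_lower, ci = ci)
}

res1 <- t_from_summary(xbar = 395, s = 102, n = 82, mu0 = 432)
res1$t                          # about -3.28, matching test1 above
res1$p_value < 0.05             # TRUE: reject H0 at the 5% level
unname(res1$ci["upper"]) < 432  # TRUE: the 90% interval lies below 432
```

A 90% two-sided interval corresponds to a one-sided test at the 5% level, so the interval excluding 432 and the small p-value point to the same conclusion.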
A company offering online speed reading courses
A random sample of 100 students yielded
Is there evidence that the company’s claim is false?
2.a) Are conditions for inference satisfied?
2.b) Perform a hypothesis test evaluating
2.c) Calculate a 95% confidence interval
2.d) Do your results from the hypothesis test
# I'm not sure if we should use pnorm() or rnorm() in this situation
mean2 <- 415
sd2 <- 220
t.test(c(rnorm(100,mean2,sd2)), mu = 500)
##
## One Sample t-test
##
## data: c(rnorm(100, mean2, sd2))
## t = -5.0638, df = 99, p-value = 1.904e-06
## alternative hypothesis: true mean is not equal to 500
## 95 percent confidence interval:
## 343.0712 431.4308
## sample estimates:
## mean of x
## 387.251
zvalue2 <- 1.96
min2 <- mean2-zvalue2*sd2/sqrt(100)
max2 <- mean2+zvalue2*sd2/sqrt(100)
min2
## [1] 371.88
max2
## [1] 458.12
H0 = the true average improvement is 500%
H1 = the true average improvement is less than 500%.
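Because `rnorm()` draws a new random sample on every run, the `t.test()` output above is not reproducible, and its sample mean (387.251) differs from the reported mean of 415. A minimal sketch of the same test computed directly from the summary statistics follows. It assumes the one-sided alternative stated in H1; the names `n2`, `se2`, `t2`, `p2`, and `ci2` are hypothetical.

```r
mean2 <- 415; sd2 <- 220; n2 <- 100

se2 <- sd2 / sqrt(n2)          # standard error: 220 / 10 = 22
t2 <- (mean2 - 500) / se2      # test statistic: -85 / 22
p2 <- pt(t2, df = n2 - 1)      # lower-tail p-value for H1: mu < 500

# t-based 95% interval; slightly wider than the z-based one above
ci2 <- mean2 + c(-1, 1) * qt(0.975, df = n2 - 1) * se2

t2
p2
ci2
```

Unlike the simulated sample, these values do not change between runs, since they depend only on the reported mean, standard deviation, and sample size.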
The National Survey of Family Growth conducted by the Centers for Disease Control
One of the variables collected on this survey
The histogram below shows the distribution of ages at first marriage
age at first marriage
Estimate the average age at first marriage of women
sample3 <- 5534
mean3 <- 23.44
sd3 <- 4.72
# the point estimate is the sample mean
pe3 <- mean3
min3 <- pe3 - sd3/sqrt(sample3)*qnorm(0.975)
max3 <- pe3 + sd3/sqrt(sample3)*qnorm(0.975)
min3
## [1] 23.31564
max3
## [1] 23.56436
We are 95% confident that the average age at first marriage of women falls within [23.31564, 23.56436].
This question uses a dataset from a case-control study
This dataset is relatively clean,
Some issues with the data might be left over from the method of data entry
Let’s only deal with bounded values
The question we want to answer is:
Which tobacco group (not including 75+)
We’ll answer this by using \(\frac{ncases}{ncontrols}\),
Calculate occurrence percentage for each age group
data(esoph)
?esoph
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
esoph$alcgp = gsub("g/day","", esoph$alcgp)
esoph$tobgp = gsub("g/day","", esoph$tobgp)
esoph
esoph <- subset(esoph, agegp != "75+")
esoph <- subset(esoph, alcgp != "120+")
esoph <- subset(esoph, tobgp != "30+")
esoph
esoph <- esoph %>%
  group_by(agegp) %>%
  summarise(n = ncases/ncontrols) %>%
  summarise(n = mean(n))
## `summarise()` has grouped output by 'agegp'. You can override using the
## `.groups` argument.
esoph
Do groups with higher levels of daily tobacco consumption have higher occurrences of (o)esophageal cancer?
Which tobacco group (not including 75+) has the highest occurrence of (o)esophageal cancer?
Ans: 10-19
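Since the question is about tobacco groups, one possible approach is to pool cases and controls within each `tobgp` level and compare the ratios. The sketch below is an assumption about how this could be done, not the assignment's reference solution. It starts from a fresh copy of `esoph`, drops the unbounded groups (75+, 120+, 30+) as described above, and uses the hypothetical names `esoph_raw` and `by_tob`. Pooling with `sum()` before dividing avoids dividing by zero in any row where `ncontrols` is 0.

```r
library(dplyr)

esoph_raw <- datasets::esoph          # fresh copy of the base R dataset

by_tob <- esoph_raw %>%
  filter(agegp != "75+",              # keep bounded groups only
         alcgp != "120+",
         tobgp != "30+") %>%
  group_by(tobgp) %>%
  summarise(cases = sum(ncases),
            controls = sum(ncontrols),
            ratio = cases / controls) %>%
  arrange(desc(ratio))                # highest ratio first

by_tob
```

The first row of `by_tob` is then the tobacco group with the highest pooled ratio of cases to controls.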
For this question, we’ll look at some classical
The data consists of 5 experiments,
We want to compare the different experiments
5.a) Create a table reporting summary statistics for each experiment.
And report the following:
5.b) Create visualizations comparing the different experiments
data(morley)
?morley
experiment1 <- subset(morley, Expt == 1, select = Expt:Speed)
experiment2 <- subset(morley, Expt == 2, select = Expt:Speed)
experiment3 <- subset(morley, Expt == 3, select = Expt:Speed)
experiment4 <- subset(morley, Expt == 4, select = Expt:Speed)
experiment5 <- subset(morley, Expt == 5, select = Expt:Speed)
varm1 <- var(experiment1$Speed)
varm2 <- var(experiment2$Speed)
varm3 <- var(experiment3$Speed)
varm4 <- var(experiment4$Speed)
varm5 <- var(experiment5$Speed)
sdm1 <- sd(experiment1$Speed)
sdm2 <- sd(experiment2$Speed)
sdm3 <- sd(experiment3$Speed)
sdm4 <- sd(experiment4$Speed)
sdm5 <- sd(experiment5$Speed)
meanm1 <- mean(experiment1$Speed)
meanm2 <- mean(experiment2$Speed)
meanm3 <- mean(experiment3$Speed)
meanm4 <- mean(experiment4$Speed)
meanm5 <- mean(experiment5$Speed)
maxm1 <- max(experiment1$Speed)
maxm2 <- max(experiment2$Speed)
maxm3 <- max(experiment3$Speed)
maxm4 <- max(experiment4$Speed)
maxm5 <- max(experiment5$Speed)
summary <- matrix(data = c(varm1,varm2,varm3,varm4,varm5,
                           sdm1,sdm2,sdm3,sdm4,sdm5,
                           meanm1,meanm2,meanm3,meanm4,meanm5,
                           maxm1,maxm2,maxm3,maxm4,maxm5),
                  ncol = 5, byrow = TRUE)
colnames(summary) <- c('expt1','expt2','expt3','expt4','expt5')
rownames(summary) <- c('variance','sd','mean','max')
summary <- as.table(summary)
summary
##                expt1       expt2       expt3       expt4       expt5
## variance 11009.47368  3741.05263  6257.89474  3605.00000  2939.73684
## sd         104.92604    61.16414    79.10686    60.04165    54.21934
## mean       909.00000   856.00000   845.00000   820.50000   831.50000
## max       1070.00000   960.00000   970.00000   920.00000   950.00000
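The same table can be built without a separate variable for each experiment and statistic. As a sketch (the name `morley_summary` is hypothetical), grouping `morley` by `Expt` with dplyr computes all four statistics in one step:

```r
library(dplyr)

morley_summary <- datasets::morley %>%
  group_by(Expt) %>%                  # one row per experiment
  summarise(variance = var(Speed),
            sd = sd(Speed),
            mean = mean(Speed),
            max = max(Speed))

morley_summary
```

This returns one row per experiment rather than one column per experiment, which is usually easier to pass on to plotting or further summaries.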
library(ggplot2)
ggplot(data = experiment1, aes(x = Run, y = Speed))+
geom_boxplot()+
labs(title = "Experiment 1: run and speed", x = "run", y = "speed")
ggplot(data = experiment2, aes(x = Run, y = Speed))+
geom_boxplot()+
labs(title = "Experiment 2: run and speed", x = "run", y = "speed")
ggplot(data = experiment3, aes(x = Run, y = Speed))+
geom_boxplot()+
labs(title = "Experiment 3: run and speed", x = "run", y = "speed")
ggplot(data = experiment4, aes(x = Run, y = Speed))+
geom_boxplot()+
labs(title = "Experiment 4: run and speed", x = "run", y = "speed")
ggplot(data = experiment5, aes(x = Run, y = Speed))+
geom_boxplot()+
labs(title = "Experiment 5: run and speed", x = "run", y = "speed")
Variance for each experiment (1-5): 11009.47368 3741.05263 6257.89474 3605.00000 2939.73684
sd for each experiment (1-5): 104.92604 61.16414 79.10686 60.04165 54.21934
mean for each experiment (1-5): 909.00000 856.00000 845.00000 820.50000 831.50000
max for each experiment (1-5): 1070.00000 960.00000 970.00000 920.00000 950.00000
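To compare the experiments directly (5.b), it may help to put all five on one set of axes. A minimal sketch, assuming `Expt` is converted to a factor so that ggplot2 draws one box per experiment; the object name `p_morley` is hypothetical.

```r
library(ggplot2)

p_morley <- ggplot(datasets::morley, aes(x = factor(Expt), y = Speed)) +
  geom_boxplot() +                    # one box per experiment
  labs(title = "Speed by experiment", x = "experiment", y = "speed")

p_morley
```

With a shared y-axis, differences in median and spread between experiments can be read off a single figure instead of five separate ones.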
What do you find most interesting or exciting
Use the structure of a 5 paragraph essay
Data Science: A Brief Reflection
Data science is a commonly used approach for analyzing large amounts of data, both qualitatively and quantitatively. It helps us organize data and visualize them so that we can make full use of the statistics we have collected. It is widely applied in many fields, including AI and machine learning [1].
Specifically, the four major components of a data science project are “Data Strategy”, “Data Engineering”, “Data Analysis and Models”, and “Data Visualization and Operationalization” [2]. Data science basics can also be broken down into five subunits (statistics, domain expertise, data engineering, visualization, and advanced computing) that together largely determine the nature of the subject [3]. Together, these components are considered the elements of a data science tool chain.
A data science pipeline is a set of methods that together process the raw data and conclude with detailed answers to real-life questions. Elements of a data science pipeline include “continuous and scalable processing” of raw data, “cloud-based elasticity and agility”, self-contained and isolated resources, “access to a large amount of data and the ability to self-serve”, and “disaster recovery and high availability” [4]. There are five major stages of a data science pipeline. First, we need to collect data. Second, we need to cleanse and tidy the data. After that, we need to process the data with modeling skills. Then, we need to develop a detailed, comprehensive understanding of the data. Last but not least, we may have to revisit the results of previous steps if necessary.
Elements of a data analytics strategy include “collecting data”, “data analysis”, “reporting results”, “improving processes”, and “building a data-driven culture” [5]. For data analysis specifically, it is critical to keep in mind its three parts: “reporting, insights, and prediction” [6].
One thing I found interesting about data science is how organized it can be. With the assistance of a programming language such as R, we are able to sort our data by different categories. With ggplot(), for example, we can also visualize the different variables that together play an important role in the experiment.
1. Data Science vs. machine learning: What’s the difference? Coursera. (n.d.). Retrieved December 19, 2022, from https://www.coursera.org/articles/data-science-vs-machine-learning
2. 4 components of a data science project. Macadamian. (2019, October 15). Retrieved December 19, 2022, from https://www.macadamian.com/learn/4-components-of-a-data-science-project/
3. Johnson, D. (2022, November 19). What is Data Science? Introduction, basic concepts & process. Guru99. Retrieved December 19, 2022, from https://www.guru99.com/data-science-tutorial.html#3
4. DSouza, D. (2022). Data Science Pipelines: Ultimate guide in 2022. Hevo. Retrieved December 19, 2022, from https://hevodata.com/learn/data-science-pipeline/#21
5. The 5 elements of a data analytics strategy. Domo. (n.d.). Retrieved December 19, 2022, from https://www.domo.com/learn/article/the-5-elements-of-a-data-analytics-strategy
6. The 3 levels of data analysis: A framework for assessing data organization maturity. GitLab. (n.d.). Retrieved December 19, 2022, from https://about.gitlab.com/blog/2019/11/04/three-levels-data-analysis/
This problem will be similar to the LE3 on Degradation of Hard Coat Acrylics.
.csv file of a clean and tidy data set.
.csv datafile is located in the data subfolder of the exam-final
TCO’s are transparent conductive oxides
Here is the abstract of the paper
tco-degr abstract
Here is a mindmap of her data science study
tco-DataStructue
Here is information on the samples studied
tco-samples
Here is information about the exposures she did
tco-exposures
And a table about the exposures
tco-exposures2
Some questions to try to address, showing your results.
table1 <- read.csv("data/1304LemireTCO-Processed-Gok-updated-v0.4.csv",sep = ",")
## Error in file(file, "rt"): cannot open the connection
data %>%
  group_by(MaterialType) %>%
  plot(table1$time,table1$SFEpolar)
## Error in UseMethod("group_by"): no applicable method for 'group_by' applied to an object of class "function"
timepolar <- lm(SFEpolar~time, table1)
## Error in is.data.frame(data): object 'table1' not found
summary(timepolar)
## Error in summary(timepolar): object 'timepolar' not found
abline(timepolar)
## Error in abline(timepolar): object 'timepolar' not found
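The first error comes from `read.csv()` not finding the file, and the second from piping `data` (a base R function) instead of the data frame `table1`. The sketch below shows one way the chunk could be structured once the file is available. Because the real dataset is not reproduced here, it uses a small made-up data frame, `tco_demo`, whose values and material labels ("A", "B") are purely illustrative. Only the column names `time`, `SFEpolar`, and `MaterialType` are taken from the code above.

```r
# Illustrative stand-in for table1; the numbers are invented for this sketch.
tco_demo <- data.frame(
  MaterialType = rep(c("A", "B"), each = 4),
  time = rep(c(0, 100, 200, 300), times = 2),
  SFEpolar = c(30, 27, 25, 22, 28, 27, 25, 24)
)

# In the real analysis, guard the import first (csv_path is hypothetical):
# stopifnot(file.exists(csv_path)); table1 <- read.csv(csv_path)

# Scatter plot, colored by material type
plot(tco_demo$time, tco_demo$SFEpolar,
     col = as.integer(factor(tco_demo$MaterialType)),
     xlab = "time", ylab = "SFEpolar")

# One linear fit per material type, each drawn on the plot
fits <- lapply(split(tco_demo, tco_demo$MaterialType),
               function(d) lm(SFEpolar ~ time, data = d))
for (f in fits) abline(f)

sapply(fits, coef)   # intercept and slope for each material
```

Fitting each material separately keeps the slopes comparable across materials, which a single pooled `lm()` over all rows would hide.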
ANSWERS:
Here we’ll use a base R dataset about vapor pressure
The Dataset contains 19 observations
8.a) Start by plotting the data, temperature (x) vs. vapor pressure (y)
This relationship is clearly not linear.
Let’s use this equation to fit our data with a linear model.
8.f) What are your approximations for \(A\) and \(B\)
data(pressure)
?pressure
# a
plot(pressure$temperature, pressure$pressure, main = "Temperature vs. Pressure", xlab = "temp", ylab = "pressure")
# b
library(dplyr)
logP <- log(x = pressure$pressure)
pressure2 <- cbind(pressure,logP)
fracT <- 1/(pressure$temperature + 273.15)
pressure2 <- cbind(pressure2,fracT)
pressure2
# d
plot(pressure2$fracT, pressure2$logP, main = "log(Pressure) vs. 1/Temperature", xlab = "1/T (1/K)", ylab = "log(pressure)")
lmpressure <- pressure2[-1,]
# c
logP_fracT <- lm(logP~fracT, lmpressure)
# e
summary(logP_fracT)
##
## Call:
## lm(formula = logP ~ fracT, data = lmpressure)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.08900 -0.01946 -0.00134 0.01883 0.14166
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.825e+01 4.968e-02 367.4 <2e-16 ***
## fracT -7.296e+03 2.121e+01 -344.1 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04897 on 16 degrees of freedom
## Multiple R-squared: 0.9999, Adjusted R-squared: 0.9999
## F-statistic: 1.184e+05 on 1 and 16 DF, p-value: < 2.2e-16
abline(logP_fracT)
ANSWERS:
[if an answer is in your code block, put # a) to show it
in your code]
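For 8.f), the fitted coefficients can be read off the model object. The sketch below assumes the linearized form \(\log P = A + B \cdot \frac{1}{T}\) with \(T\) in kelvin, so that \(A\) is the intercept and \(B\) the slope. If the equation in the problem statement uses a different sign convention, \(B\) changes sign accordingly. The names `p8`, `fit8`, `A_hat`, and `B_hat` are hypothetical.

```r
p8 <- datasets::pressure
p8$logP <- log(p8$pressure)                  # natural log of vapor pressure
p8$fracT <- 1 / (p8$temperature + 273.15)    # 1/T with T in kelvin

fit8 <- lm(logP ~ fracT, data = p8[-1, ])    # first row dropped, as above
A_hat <- unname(coef(fit8)["(Intercept)"])   # about 18.25
B_hat <- unname(coef(fit8)["fracT"])         # about -7296

c(A = A_hat, B = B_hat)
```

These match the `Estimate` column of the `summary()` output above.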