Final Exam (worth 20 pts)

  • Will be held Monday 12/19
    • From 12pm to 3pm, in Nord 356, or remote
  • Comprehensive overview of the course

Academic Integrity Policy

All students in this course are expected to adhere to University standards of academic integrity.

Cheating, plagiarism, misrepresentation, and other forms of academic dishonesty will not be tolerated.

This includes, but is not limited to, consulting with another person during an exam, turning in written work that was prepared by someone other than you, making minor modifications to the work of someone else and turning it in as your own, or engaging in misrepresentation in seeking a postponement or extension.

  • Ignorance will not be accepted as an excuse.
  • If you are not sure whether something you plan to submit
    • would be considered either cheating or plagiarism,
    • it is your responsibility to ask for clarification.

For complete information, please go to CWRU Academic Integrity Policy.

Final Exam Format

  • The exam will appear in the prof repo
  • In the /assignments/exam-final folder
  • Done as a .Rmd file, turned in as a .pdf report
  • Submit the final exam .Rmd and .pdf to the Canvas Assignment Page

Types of Questions

  • 8 questions total
  • OpenIntro Statistics (OIStats) problems to do
  • Data Wrangling: tidying and EDA
  • 5 Paragraph Essay Question about Data Science, with citations
    • Citations to literature supporting your discussion
      • These are done as footnotes
      • Format: Author, Title, Source (Journal/Magazine), Page, Year, URL link
  • Data Analysis: modeling using Linear Regression

Points per question

    1. OIS 1 pt
    2. OIS 1 pt
    3. OIS 1 pt
    4. Tidy data wrangling 2 pts
    5. EDA, Summary Stats & Visualization 3 pts
    6. 5 paragraph Essay 4 pts
    7. EDA on Real Dataset problem 4 pts
       • Do an exploratory data analysis on Degradation of Transparent Conductive Oxides
    8. Linear Regression on a dataset 4 pts

You have a pdf of the OIStats book in the readings folder of your repo

  • this is an open-book, open-resource test

If the answer to a question part, like a), is in your code block,

  • put # a) to show it in your code
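
For example, a code block answering part a) might start like this (an illustrative sketch):

# a) conditions for inference
n <- 82
n >= 30   # sample is large enough for the normality condition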

1. Hypothesis Test: Car Insurance (1 pt)

OIStats v2 4.30

A car insurance company advertises that

  • customers switching to their insurance
  • save, on average, $432 on their yearly premiums.

A market researcher at a competing insurance discounter

  • is interested in showing that this value is an overestimate
  • so he can provide evidence to government regulators
    • that the company is falsely advertising their prices.
  • He randomly samples 82 customers who recently switched to this insurance
    • and finds an average savings of $395,
    • with a standard deviation of $102.

1.a) Are conditions for inference satisfied?

1.b) Perform a hypothesis test and state your conclusion.

1.c) Do you agree with the market researcher

  • that the amount of savings advertised is an overestimate?
  • Explain your reasoning.

1.d) Calculate a 90% confidence interval

  • for the average amount of savings
    • of all customers who switch their insurance.

1.e) Do your results from the hypothesis test

  • and the confidence interval agree?
  • Explain.
sample1 <- 82   # n
mean1 <- 395    # sample mean savings ($)
sd1 <- 102      # sample standard deviation ($)

# test statistic for H0: mu = 432 vs. H1: mu < 432
test1 <- (mean1-432)/(sd1/sqrt(sample1))
test1
## [1] -3.284797
# one-sided p-value is the lower tail, since H1 is mu < 432
pvalue1 <- pt(test1, df = sample1 - 1, lower.tail = TRUE)
pvalue1 < 0.05   # reject H0 at the 5% level
## [1] TRUE
# 90% confidence interval (z approximation)
pe1 <- mean1
min1 <- pe1 - sd1/sqrt(sample1)*qnorm(0.95)
max1 <- pe1 + sd1/sqrt(sample1)*qnorm(0.95)
min1
## [1] 376.4723
max1
## [1] 413.5277
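
For comparison, a sketch of the same interval using the t critical value with 81 df; qt(0.95, 81) is slightly larger than qnorm(0.95), so the interval comes out a bit wider:

# 90% CI using the t distribution instead of the z approximation
tcrit <- qt(0.95, df = sample1 - 1)
c(mean1 - tcrit * sd1 / sqrt(sample1),
  mean1 + tcrit * sd1 / sqrt(sample1))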

ANSWER 1.a) Yes. The 82 customers are a random sample (and presumably fewer than 10% of all customers who switched, so independence is reasonable), and n = 82 ≥ 30, so the sampling distribution of the mean is approximately normal.

ANSWER 1.b)

H0: mu = 432 (the advertised savings figure is accurate). H1: mu < 432 (the advertised savings figure is an overestimate). The test statistic is t = -3.28 with 81 df, giving a one-sided p-value of roughly 0.0008, well below 0.05, so we reject H0.

ANSWER 1.c) Yes. Since we reject the null hypothesis in part b), the data provide convincing evidence that the advertised average savings of $432 is an overestimate.

ANSWER 1.d) [376.4723, 413.5277]

ANSWER 1.e) Yes, they agree. The 90% confidence interval [376.47, 413.53] lies entirely below $432, and a two-sided 90% interval corresponds to a one-sided test at the 0.05 level, so both the test and the interval indicate that the advertised savings figure is an overestimate.


2. Speed Reading (1 pt)

OIStats v3 4.23

A company offering online speed reading courses

  • claims that students who take their courses
    • show a 5 times (500%) increase in
    • the number of words they can read in a minute without losing comprehension.

A random sample of 100 students yielded

  • an average increase of 415%
    • with a standard deviation of 220%.

Is there evidence that the company’s claim is false?

2.a) Are conditions for inference satisfied?

2.b) Perform a hypothesis test evaluating

  • if the company’s claim is reasonable
    • or if the true average improvement is less than 500%.
  • Make sure to interpret your response
    • in context of the hypothesis test
    • and the data.
  • Use \(\alpha = 0.025\).

2.c) Calculate a 95% confidence interval

  • for the average increase in the number of words
    • students can read in a minute
    • without losing comprehension.

2.d) Do your results from the hypothesis test

  • and the confidence interval agree?
  • Explain.
# With only summary statistics, compute the t statistic directly;
# there is no need to simulate data with rnorm() (and pnorm() with the
# raw sd would describe single students, not the sample mean)
n2 <- 100
mean2 <- 415
sd2 <- 220
test2 <- (mean2 - 500)/(sd2/sqrt(n2))
test2
## [1] -3.863636
# one-sided p-value, since H1 is mu < 500
pvalue2 <- pt(test2, df = n2 - 1)
pvalue2 < 0.025   # reject H0 at alpha = 0.025
## [1] TRUE
# 95% confidence interval (z approximation)
zvalue2 <- 1.96
min2 <- mean2 - zvalue2*sd2/sqrt(n2)
max2 <- mean2 + zvalue2*sd2/sqrt(n2)
min2
## [1] 371.88
max2
## [1] 458.12
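
A one-sided test at \(\alpha = 0.025\) rejects exactly when 500 falls above the two-sided 95% interval, which is easy to check directly:

# H0 value vs. the 95% CI: 500 above the upper bound -> reject H0
500 > max2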

ANSWER 2.a) Yes. The 100 students are a random sample, so independence is reasonable, and n = 100 ≥ 30, so the sampling distribution of the mean is approximately normal.

ANSWER 2.b)

H0: mu = 500 (the true average improvement is 500%). H1: mu < 500 (the true average improvement is less than 500%). The test statistic is t = -3.86 with 99 df, giving a one-sided p-value of about 0.0001 < 0.025, so we reject H0: the data provide evidence that the average improvement is less than the company claims.

ANSWER 2.c) [371.88,458.12]

ANSWER 2.d) Yes. The 95% confidence interval [371.88, 458.12] lies entirely below 500%, and a two-sided 95% interval corresponds to a one-sided test at \(\alpha = 0.025\), so both the test and the interval indicate that the true average improvement is less than 500%.


3. 95% Confidence Interval with Average Age of First Marriage (1 pt)

OIStats v4 7.56 Age at first marriage

The National Survey of Family Growth conducted by the Centers for Disease Control

  • gathers information on
    • family life,
    • marriage and divorce,
    • pregnancy,
    • infertility,
    • use of contraception,
    • and men’s and women’s health.

One of the variables collected on this survey

  • is the age at first marriage.

The histogram below shows the distribution of ages at first marriage

  • of 5,534 randomly sampled women between 2006 and 2010.
  • The average age at first marriage among these women
    • is 23.44 with a standard deviation of 4.72.

[Histogram: age at first marriage]

Estimate the average age at first marriage of women

  • using a 95% confidence interval,
    • and interpret this interval in context.
  • Discuss any relevant assumptions.
sample3 <- 5534
mean3 <- 23.44
sd3 <- 4.72

pe3 <- mean3   # the point estimate is the sample mean, 23.44
min3 <- pe3 - sd3/sqrt(sample3)*qnorm(0.975)
max3 <- pe3 + sd3/sqrt(sample3)*qnorm(0.975)
min3
## [1] 23.31564
max3
## [1] 23.56436
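
Equivalently, the interval is the point estimate plus or minus a margin of error (a small sketch):

# margin of error = z* times the standard error
se3  <- sd3 / sqrt(sample3)    # ~0.0634
moe3 <- qnorm(0.975) * se3     # ~0.124 years
c(mean3 - moe3, mean3 + moe3)  # same interval as above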

ANSWERS: [23.31564, 23.56436]

We are 95% confident that the average age at first marriage among women who married between 2006 and 2010 is between about 23.32 and 23.56 years. This relies on the 5,534 women being a random sample (and fewer than 10% of the population, so observations are independent); with such a large sample, the skew visible in the histogram does not threaten the normality of the sample mean.


4. Tidy Data Wrangling (2 pts)

This question uses a dataset from a case-control study

  • of (o)esophageal cancer in Ille-et-Vilaine, France.

This dataset is relatively clean,

  • but there are some adjustments that we would like to make
  • to make it more readable for us to answer questions.

Some issues with the data might be left over from the method of data entry

  • remove the inconsistent “g/day” text
    • from each entry that has it
    • without removing the whole observation
    • or that specific predictor

Let’s only deal with bounded values

  • remove the age group 75+
  • remove the alcohol group 120+
  • remove the tobacco group 30+

The question we want to answer is:

  • do groups with higher levels of daily tobacco consumption
    • have higher occurrences of (o)esophageal cancer?

Which tobacco group (not including the removed 30+ group)

  • has the highest occurrence of (o)esophageal cancer?

We’ll answer this by using \(\frac{ncases}{ncontrols}\),

  • which we’ll call occurrence percentage

Calculate occurrence percentage for each tobacco group

  • summarize this in a table
data(esoph)
?esoph
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# strip the leftover "g/day" text from the group labels
esoph$alcgp <- gsub("g/day", "", esoph$alcgp)
esoph$tobgp <- gsub("g/day", "", esoph$tobgp)
esoph
# keep only the bounded groups
esoph <- subset(esoph, agegp != "75+")
esoph <- subset(esoph, alcgp != "120+")
esoph <- subset(esoph, tobgp != "30+")
esoph
# occurrence percentage (ncases/ncontrols) for each tobacco group
esoph_occ <- esoph %>%
  group_by(tobgp) %>%
  summarise(occurrence = sum(ncases) / sum(ncontrols))
esoph_occ
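
A base-R cross-check of the same table (a sketch; aggregate() sums cases and controls within each group before taking the ratio):

# occurrence percentage per tobacco group via base R
agg <- aggregate(cbind(ncases, ncontrols) ~ tobgp, data = esoph, FUN = sum)
agg$occurrence <- agg$ncases / agg$ncontrols
agg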

ANSWERS:

Do groups with higher levels of daily tobacco consumption have higher occurrences of (o)esophageal cancer?

Which tobacco group (not including the removed 30+ group) has the highest occurrence of (o)esophageal cancer? Ans: 10-19


5. EDA, Summary Stats & Visualization (3 pts)

For this question, we’ll look at some classical

  • Michelson data from 1879 on the speed of light.

The data consists of 5 experiments,

  • each with 20 consecutive runs with a speed of light measurement
    • for each run (km/sec, with 299000 subtracted).

We want to compare the different experiments

  • using visualizations
  • and summary statistics.

5.a) Create a table reporting summary statistics for each experiment.

And report the following:

  • variance,
  • standard deviation,
  • mean,
  • maximum

5.b) Create visualizations comparing the different experiments

  • how is the data in each experiment distributed?
    • (justify using at least one plot)
  • use a box and whisker plot to compare the means and distributions
data(morley)
?morley
# split the Speed measurements by experiment (1-5)
experiments <- split(morley$Speed, morley$Expt)

# summary table: variance, sd, mean, and max for each experiment
summary_tab <- sapply(experiments,
                      function(s) c(variance = var(s), sd = sd(s),
                                    mean = mean(s), max = max(s)))
colnames(summary_tab) <- paste0("expt", 1:5)
summary_tab <- as.table(summary_tab)
summary_tab
##                expt1       expt2       expt3       expt4       expt5
## variance 11009.47368  3741.05263  6257.89474  3605.00000  2939.73684
## sd         104.92604    61.16414    79.10686    60.04165    54.21934
## mean       909.00000   856.00000   845.00000   820.50000   831.50000
## max       1070.00000   960.00000   970.00000   920.00000   950.00000
library(ggplot2)
# box-and-whisker comparison of all five experiments
# (Expt must be a factor so each experiment gets its own box)
ggplot(morley, aes(x = factor(Expt), y = Speed)) +
  geom_boxplot() +
  labs(x = "experiment", y = "speed (km/sec - 299000)") +
  ggtitle("Michelson speed-of-light measurements by experiment")

ANSWER 5.a)

  • Variance for each experiment (1-5): 11009.47368, 3741.05263, 6257.89474, 3605.00000, 2939.73684
  • sd for each experiment (1-5): 104.92604, 61.16414, 79.10686, 60.04165, 54.21934
  • mean for each experiment (1-5): 909.0, 856.0, 845.0, 820.5, 831.5
  • max for each experiment (1-5): 1070, 960, 970, 920, 950

ANSWER 5.b) Experiment 1 is skewed to the left, experiments 3 and 5 are skewed to the right, and experiments 2 and 4 are relatively symmetric.


6. What is data science? (5 paragraph essay with citations) (4 pts)

What do you find most interesting or exciting

  • about data science and EDA?
  • What defines data science,
    • and how has it come about?
  • What are its characteristics, and what are the elements of
    • a data science tool chain,
    • a data science pipeline, and
    • a data analysis?

Use the structure of a 5 paragraph essay

  • (Introduction, 3 topic paragraphs, 1 concluding paragraph)
  • with citations/references.


Data Science: A Brief Reflection

Data science is a commonly used approach to both qualitatively and quantitatively measure large amounts of data within a dataset. It helps us organize data and better visualize them in order to make full use of the statistics we have collected. It is widely applied in many fields, including AI and machine learning [1].

Specifically, four major components of data science are "Data Strategy", "Data Engineering", "Data Analysis and Models", and "Data Visualization and Operationalization" [2]. Data science basics can also be broken down into 5 subunits (statistics, domain expertise, data engineering, visualization, and advanced computing) that together significantly determine the nature of such a critical subject [3]. Those components together are considered elements of a data science tool chain.

A data science pipeline is a set of methods that together process the raw data and conclude with detailed answers to real-life questions. Elements of a data science pipeline include "continuous and scalable processing" of raw data, "cloud-based elasticity and agility", self-contained and isolated resources, "access to a large amount of data and the ability to self-serve", and "disaster recovery and high availability" [4]. There are 5 major stages of a data science pipeline. First, we collect data. Second, we cleanse/tidy the data. After that, we process the data with modeling techniques. Then, we develop a detailed, comprehensive understanding of the data. Last but not least, we revise the earlier steps if necessary.

Elements of a data analytics strategy include "collecting data", "data analysis", "reporting results", "improving processes", and "building a data-driven culture" [5]. Specifically, for data analysis, it is critical that we always keep in mind the three parts of data analysis: "reporting, insights, and prediction" [6].

One thing I found interesting about data science is how organized it can be. With the assistance of a programming language such as R, we are able to better sort our data by different categories. With ggplot(), for example, we can also visualize the different variables that together play an important role in an experiment.

References

1. Data science vs. machine learning: What's the difference? Coursera. (n.d.). Retrieved December 19, 2022, from https://www.coursera.org/articles/data-science-vs-machine-learning
2. 4 components of a data science project. Macadamian. (2019, October 15). Retrieved December 19, 2022, from https://www.macadamian.com/learn/4-components-of-a-data-science-project/
3. Johnson, D. (2022, November 19). What is data science? Introduction, basic concepts & process. Guru99. Retrieved December 19, 2022, from https://www.guru99.com/data-science-tutorial.html#3
4. DSouza, D. (2022, April 19). Data science pipelines: Ultimate guide in 2022. Hevo. Retrieved December 19, 2022, from https://hevodata.com/learn/data-science-pipeline/#21
5. The 5 elements of a data analytics strategy. Domo. (n.d.). Retrieved December 19, 2022, from https://www.domo.com/learn/article/the-5-elements-of-a-data-analytics-strategy
6. The 3 levels of data analysis: A framework for assessing data organization maturity. GitLab. (n.d.). Retrieved December 19, 2022, from https://about.gitlab.com/blog/2019/11/04/three-levels-data-analysis/

7. EDA of TCO degradation (4 pts)

This problem will be similar to the LE3 on Degradation of Hard Coat Acrylics.

  • But you are given a .csv file of a clean and tidy data set.
  • You will need to do EDA and make figures and summaries of what you find.
  • And list the insights you can develop from your EDA.
  • The .csv datafile is located in the data subfolder of the exam-final folder

TCOs are transparent conductive oxides

  • Such as ITO, AZO and FTO.
  • Heather Lemire Mirletz did her MS thesis on these
  • and has a journal paper being published.

Here is the abstract of the paper

[Figure: tco-degr abstract]

Here is a mindmap of her data science study

[Figure: tco-DataStructure]

Here is information on the samples studied

[Figure: tco-samples]

Here is information about the exposures she did

[Figure: tco-exposures]

And a table about the exposures

[Figure: tco-exposures2]

Some questions to try to address, showing your results.

  • 7.a) Which type of TCO (ITO, AZO, FTO) is most stable?
  • 7.b) Which type of Exposure is most aggressive?
  • 7.c) How do open vs. encapsulated samples compare.
  • 7.d) What other insights can you identify and demonstrate from your EDA?
# the .csv lives in the data subfolder of the exam-final folder
table1 <- read.csv("data/1304LemireTCO-Processed-Gok-updated-v0.4.csv")
# quick look: polar surface free energy vs. exposure time
plot(table1$time, table1$SFEpolar, xlab = "time", ylab = "SFE polar")
# linear trend of SFEpolar over time
timepolar <- lm(SFEpolar ~ time, data = table1)
summary(timepolar)
abline(timepolar)
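
A sketch of how 7.a) and 7.b) could be approached once the file loads, assuming the CSV has a MaterialType column (used in the original code above); the exposure and encapsulation column names are assumptions, not confirmed against the file:

# stability proxy: slope of SFEpolar over time within each material type;
# smaller absolute slopes suggest more stable TCOs
library(dplyr)
table1 %>%
  group_by(MaterialType) %>%
  summarise(slope = coef(lm(SFEpolar ~ time))[2])
# 7.b): the same idea grouped by the exposure-type column, and 7.c)
# grouped by the open/encapsulated flag, whatever those columns are named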

ANSWERS:

ANSWER 7.a)

ANSWER 7.b)

ANSWER 7.c)

ANSWER 7.d)


8. Linear regression on a dataset (4 pts)

Here we’ll use a base R dataset about vapor pressure

  • to discuss the use of linearity in science.

The Dataset contains 19 observations

  • of temperature (Celsius) vs. vapor pressure (mmHg) for mercury

8.a) Start by plotting the data, temperature (x) vs. vapor pressure (y)

This relationship is clearly not linear.

  • However, we may be able to pull a linear relationship from these two metrics.
  • A simplified form of the “Antoine equation” can be used
    • to model the relationship between temperature and vapor pressure.
  • \(\log{P} = A - \frac{B}{T}\)
  • \(P\) is vapor pressure, \(T\) is temperature
  • \(A\) and \(B\) are constants: \(A\) is the y intercept and \(-B\) is the slope
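
Concretely, substituting \(y = \log{P}\) and \(x = \frac{1}{T}\) turns the Antoine form into the straight line \(y = A - Bx\), which lm() can fit directly; the fitted intercept estimates \(A\) and the fitted slope estimates \(-B\).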

Let’s use this equation to fit our data with a linear model.

  • 8.b) Mutate two new columns from the existing data,
    • for \(\log{P}\)
    • and for \(\frac{1}{T}\)
  • 8.c) Create a linear model using lm()
    • using the new columns as your variables
  • 8.d) Plot your linear model with \(\log{P}\) and \(\frac{1}{T}\) axes
  • 8.e) Report the results of your model (model summary) in a table

8.f) What are your approximations for \(A\) and \(B\)

  • in the simplified Antoine model using this data?
data(pressure)
?pressure

# a) raw data: clearly nonlinear
plot(pressure$temperature, pressure$pressure,
     main = "Temperature vs. Pressure", xlab = "temp", ylab = "pressure")

# b) new columns: log(P) and 1/T (temperature converted to Kelvin)
library(dplyr)
pressure2 <- pressure %>%
  mutate(logP = log(pressure),
         fracT = 1 / (temperature + 273.15))
pressure2

# c) linear model on the transformed variables
lmpressure <- pressure2[-1, ]   # exclude the first (0 degrees C) observation
logP_fracT <- lm(logP ~ fracT, data = lmpressure)

# d) plot on the transformed axes (the fitted line is added below)
plot(pressure2$fracT, pressure2$logP,
     main = "log(P) vs. 1/T for mercury", xlab = "1/T (1/K)", ylab = "log(P)")

# e
summary(logP_fracT)
## 
## Call:
## lm(formula = logP ~ fracT, data = lmpressure)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.08900 -0.01946 -0.00134  0.01883  0.14166 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.825e+01  4.968e-02   367.4   <2e-16 ***
## fracT       -7.296e+03  2.121e+01  -344.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04897 on 16 degrees of freedom
## Multiple R-squared:  0.9999, Adjusted R-squared:  0.9999 
## F-statistic: 1.184e+05 on 1 and 16 DF,  p-value: < 2.2e-16
abline(logP_fracT)

ANSWERS:

[if an answer is in your code block, put # a) to show it in your code]

ANSWER 8.a)

ANSWER 8.b)

ANSWER 8.c)

ANSWER 8.d)

ANSWER 8.e)

ANSWER 8.f) The fitted intercept is 1.825e+01 and the fitted slope is -7.296e+03. Since the model is \(\log{P} = A - \frac{B}{T}\), the slope estimates \(-B\): so A ≈ 18.25 and B ≈ 7296.
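
As a quick sanity check (a sketch using the fitted coefficients; the dataset lists 17.3 mmHg at 200 degrees C):

# reconstruct vapor pressure from the fitted A and B at 200 degrees C
A <- coef(logP_fracT)[["(Intercept)"]]   # ~18.25
B <- -coef(logP_fracT)[["fracT"]]        # ~7296 (slope = -B)
exp(A - B / (200 + 273.15))              # ~17, close to the measured 17.3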