Part 1. Title and Introduction

Data analytics is the discipline of scrutinizing raw data in order to draw inferences from the information it contains. The methods and processes of data analytics have been automated into procedures and algorithms that work over raw data and present it for human consumption. One of its merits is that it helps a business improve its performance. The analysis of data encompasses many different types of analysis, and almost any type of information can be subjected to data analytics techniques to gain insights that can be used to make improvements. Analytics methods can reveal trends and metrics that would otherwise be lost in the bulk of the data, and these can then be used to refine processes and increase the overall efficiency of a system or business. This report will therefore describe the outputs of the analysed data and show how they have been used to test various hypotheses.

Hypothesis Testing

Statistical hypothesis testing, also called confirmatory data analysis, is a formal technique for examining our ideas about the world using statistics. It is commonly used by researchers to test specific predictions, called hypotheses, that arise from theories. It is also used to test whether experimental results carry enough information to cast doubt on conventional wisdom, and it helps scientists determine whether the data from a sample are statistically significant.
The San Jose State University Statistics Department views hypothesis testing as one of the most important concepts in statistics, since it enables us to determine whether something really occurred, whether a certain treatment has a positive effect, whether groups differ from one another, or whether one variable predicts another. Confirmatory data analysis is considered one of the most vital practices for assessing the validity and reliability of results in any systematic investigation.
Example
In a certain city in the US it was believed at one time that people of a certain color or race had lower intelligence than Hispanics. A hypothesis test was carried out to examine the claim that intelligence is not based on color or race. People of various races, colors and cultures were given intelligence tests and the data were analyzed. Statistical hypothesis testing then showed that the results were statistically significant, in the sense that the similar intelligence scores across races were not merely due to sampling error.
Hypothesis testing involves several steps, which include the following (a minimal R sketch of the workflow is given after the list);
i. State the null hypothesis (Ho) and the alternate hypothesis (Ha or H1).
ii. Collect data in a way designed to test the hypothesis.
iii. Carry out an appropriate statistical test.
iv. Decide whether to reject or fail to reject the null hypothesis.
v. Present your findings in the results and discussion sections.
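A minimal R sketch of this workflow follows. The data, the hypothesized mean mu0 and the significance level are all made up for illustration; only the mechanics of the five steps are being shown.

# Step i: state Ho: mu = 50 and Ha: mu != 50 (hypothetical values)
mu0   <- 50
alpha <- 0.05
# Step ii: collect data (a small hypothetical sample)
x <- c(48, 52, 47, 51, 50, 49, 53, 46, 50, 48)
# Step iii: carry out an appropriate statistical test (one-sample t test)
result <- t.test(x, mu = mu0, conf.level = 1 - alpha)
# Step iv: reject Ho when the p-value falls below alpha
if (result$p.value < alpha) "Reject Ho" else "Fail to reject Ho"
# Step v: report the test statistic and p-value in the results section
result$statistic
result$p.value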

Testing differences

In statistics there are many circumstances in which we may wish to compare the means of two samples or even two populations. The technique we use depends entirely on the type of data we have and how it is grouped. Comparing the means of two samples is useful for the following reasons;
i. Comparison-of-means tests help us determine whether the experimental and control groups we are testing have similar means.
ii. They provide a way to test the hypothesis that the control and experimental groups differ from each other.
Example
Is the night-shift production less than the day-shift production? Are the rates of return from fixed-asset investments different from those from common-stock investments? In each case, any difference observed between the two sample means will depend on both the means themselves and the sample standard deviations.
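A small R sketch of such a comparison is shown below, using made-up summary figures for the night-shift and day-shift example; the sample sizes, means and standard deviations are purely illustrative.

# Hypothetical summary statistics for the two shifts
n1 <- 40; xbar1 <- 512; s1 <- 24   # night shift
n2 <- 40; xbar2 <- 525; s2 <- 21   # day shift
# Two-sample z statistic: difference in means divided by its standard error
z <- (xbar1 - xbar2) / sqrt(s1^2 / n1 + s2^2 / n2)
# Two-tailed p-value for the observed difference
p_value <- 2 * pnorm(-abs(z))
z; p_value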
REFERENCES
Bluman, A. G. (2009). Elementary statistics: A step by step approach. New York: McGraw-Hill Higher Education.
Brandt, S. (1998). Data analysis. Springer-Verlag.
Rozeboom, W. W. (1960). The fallacy of the null-hypothesis significance test. Psychological Bulletin, 57(5), 416.
Woronow, A., & Love, K. M. (1990). Quantifying and testing differences among means of compositional data suites. Mathematical Geology, 22(7), 837-852.

Part 2. Analysis section

Task 1. Hypothesis test using z test (n>30).

## [1] 15  6  9 15 19 18
##  [1] 15 13 15 14 15 20 15 20  8  9 17 20 15 18 13 11 19 23 16 16 13 17 20 20 22
## [26] 13 14 15 19 21 22 21 13 22 19 16 20 15 15 11 12 16 18 22 21 26 10 18 14 14
## [51] 17 16  9 13 16 16 14 18 16 20 28 16 18 17 22

Test your hypothesis using your Z test value

## [1] 3.583226

Our Z value of 3.583 is greater than the critical value (CV) of 2.575.
We use the right-tailed z test for this analysis to reach our conclusion.
Since the Z value of 3.583 exceeds the CV of 2.575, we reject Ho and conclude that the true mean of the sample is not equal to 16.4.
We reject the null hypothesis because the hand calculation and the R output agree.
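As a rough sketch, a z statistic of this kind can be computed in R as follows. Sample_1 is a hypothetical name for the sample printed above, 16.4 is the hypothesized mean from the conclusion, and the population standard deviation is approximated by the sample standard deviation.

# z statistic for Ho: mu = 16.4
mu0 <- 16.4
z   <- (mean(Sample_1) - mu0) / (sd(Sample_1) / sqrt(length(Sample_1)))
z
# Critical value at the 0.01 level (two-tailed), roughly 2.575
qnorm(0.995)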
REFERENCES
Gaddis, G. M., & Gaddis, M. L. (1990). Introduction to biostatistics: Part 5, Statistical inference techniques for hypothesis testing with nonparametric data. Annals of emergency medicine, 19(9), 1054-1059.
Khademi, A. (2015). Statistical Hypothesis Testing with SAS and R. Journal of Statistical Software, 68, 1-4.

Test your hypothesis using your p value

## [1] 0.1674469
## [1] 3.583226

Looking at the table shown earlier, we can see that 3.583226 is larger than the critical value of 1.96 that would be required for significance at α = .05, and also larger than the value of 2.58 that would be required for significance at α = .01. Therefore, we can conclude that the effect is significant and hence reject Ho.
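The p-value route can be reproduced with the z.test function from the BSDA package; Sample_1 and the hypothesized mean 16.4 are assumptions carried over from the previous step, and the sample standard deviation again stands in for the population value.

library(BSDA)
# z test of Ho: mu = 16.4 against a two-sided alternative
zt <- z.test(Sample_1, mu = 16.4, sigma.x = sd(Sample_1),
             alternative = "two.sided", conf.level = 0.99)
zt$statistic   # z value
zt$p.value     # p-value to compare against the chosen alpha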

REFERENCES
Biau, D. J., Jolles, B. M., & Porcher, R. (2010). P value and the theory of hypothesis testing: an explanation for new researchers. Clinical Orthopaedics and Related Research®, 468(3), 885-892.
Zhang, S., Chen, H. S., & Pfeiffer, R. M. (2013). A combined p-value test for multiple hypothesis testing. Journal of Statistical Planning and Inference, 143(4), 764-770.

Task 2. Hypothesis testing using t test (n<30).

## [1] 0.14864
## [1] -2.580665

From the t table, the critical value at the 0.01 level of significance is 2.58, which is greater than the calculated t value of -2.58; therefore we fail to reject Ho and conclude that 16.40943 is consistent with the true mean of the sample.
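A minimal sketch of the corresponding one-sample t test in R is given below; Sample_2 is a hypothetical name for the small (n < 30) sample, and 16.40943 is taken from the conclusion above as the hypothesized mean.

# One-sample t test of Ho: mu = 16.40943 at the 0.01 significance level
tt <- t.test(Sample_2, mu = 16.40943, conf.level = 0.99)
tt$statistic   # t value, compared with the tabled critical value
tt$p.value     # equivalent p-value decision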

REFERENCES
Chatzipantsiou, C., Dimitriadis, M., Papadakis, M., & Tsagris, M. (2018). Extremely efficient permutation and bootstrap hypothesis tests using R. arXiv preprint arXiv:1806.10947.
Savin, N. E. (1984). Multiple hypothesis testing. Handbook of econometrics, 2, 827-879.

Task 3. Compare two means using z test.
x, y: The datasets used in the test.

alternative: The alternative hypothesis for the test. It can be 'greater', 'less' or 'two.sided', depending on the alternative hypothesis.

mu: The value of the mean, or of the difference in means, specified by the null hypothesis.

sigma.x: The population standard deviation for the x sample.

sigma.y: The population standard deviation for the y sample.

conf.level: The confidence level of the interval.

## [1] -0.0859772

Since the p-value (-0.0859772) is less than the level of significance (α = 0.05), we reject the null hypothesis.
This means we have sufficient evidence to say that the two means are different.
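With the arguments described above, a two-sample z test in BSDA could look like the sketch below; x_sample and y_sample are hypothetical vectors, and the population standard deviations are approximated by the sample ones.

library(BSDA)
# Two-sample z test of Ho: the two population means are equal (mu = 0)
zt2 <- z.test(x = x_sample, y = y_sample,
              alternative = "two.sided", mu = 0,
              sigma.x = sd(x_sample), sigma.y = sd(y_sample),
              conf.level = 0.95)
zt2$p.value   # compared against alpha = 0.05 to reach the conclusion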
REFERENCES
Meng, X. L., Rosenthal, R., & Rubin, D. B. (1992). Comparing correlated correlation coefficients. Psychological bulletin, 111(1), 172.
Afthanorhan, A., Ahmad, N., & Sabri, A. (2015). A parametric approach using z-test for comparing 2 means to multi-group analysis in partial least square structural equation modeling (PLS-SEM). British Journal of Applied Science & Technology, 6(2), 194.

Task 4. Compare two independent means using t test.

## [1] 2.155
## [1] 0.6120064
## [1] 2.412143
## [1] 0.3042094
## 
##  One Sample t-test
## 
## data:  Sample_4a
## t = 18.632, df = 27, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 99 percent confidence interval:
##  1.834547 2.475453
## sample estimates:
## mean of x 
##     2.155
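The output above is a one-sample t test on Sample_4a. For comparing two independent means, R's t.test can be called with both samples; in the sketch below, Sample_4a and Sample_4b are assumed to be the two independent samples.

# Welch two-sample t test of Ho: the two group means are equal
t.test(Sample_4a, Sample_4b,
       alternative = "two.sided",
       var.equal   = FALSE,   # Welch correction, the default in R
       conf.level  = 0.99)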

REFERENCES
Elliott, A. C., & Woodward, W. A. (2007). Comparing one or two means using the t-test. Statistical Analysis and Quick Reference Guide Book. Thousand Oaks.
Park, H. M. (2009). Comparing group means: t-tests and one-way ANOVA using Stata, SAS, R, and SPSS.

Task 5. Correlation and regression analysis

FAITHFUL DATASET

The "faithful" dataset contains 272 observations of geyser eruptions which occurred during October 1980. The duration of each eruption and the waiting time between eruptions are provided in the dataset, both measured in minutes.
The data describe Old Faithful, a cone geyser located in Yellowstone National Park in Wyoming. Since 2000 it has erupted every 44 to 125 minutes, spewing 3,700 to 8,400 US gallons of boiling water to a height of 106 to 185 feet; the average height of an eruption is about 145 feet.

## [1] 0.9008112

The correlation coefficient can be defined as a statistical measure of the strength of the association or relationship between the movements of two variables.

Coefficient of determination

## [1] 0.8114608

The coefficient of determination is a measure used in statistical analysis to quantify the proportion of variance explained by a model, or to assess how well the model explains and forecasts future results.

The 95% prediction interval of the eruption duration for a waiting time of 80 minutes is between 3.1961 and 5.1564 minutes.

The slope of the best-fit line and its sign tell us how the dependent variable, eruptions, changes on average for every one-unit increase in the independent variable, waiting.
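These quantities come from the built-in faithful dataset, and a minimal sketch of how they can be reproduced in R is shown below.

data(faithful)
# Correlation coefficient and coefficient of determination
r <- cor(faithful$waiting, faithful$eruptions)
r; r^2
# Simple linear regression of eruption duration on waiting time
model <- lm(eruptions ~ waiting, data = faithful)
coef(model)   # intercept and slope of the best-fit line
# 95% prediction interval for a waiting time of 80 minutes
predict(model, newdata = data.frame(waiting = 80),
        interval = "prediction", level = 0.95)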

REFERENCES
John, B. (2017). A Dataset of Gaze Behavior in VR Faithful to Natural Statistics. Rochester Institute of Technology.
Cao, Z., Wei, F., Li, W., & Li, S. (2018, April). Faithful to the original: Fact aware neural abstractive summarization. In thirty-second AAAI conference on artificial intelligence.