Throughout our textbook we have used several different test data sets, so I am not sure which one this question refers to. What we usually did was generate a sample from a normal distribution (using the randn() function) with a specified mean and standard deviation. Some examples used only one such sample; others used two samples, where the second was the first plus additional normally distributed noise, so the two were not independent. For the examples in the "Significance Tests" chapter, we generated two samples independently, each normally distributed with its own specified mean and standard deviation. Because these independent samples were generated with different means, our t-test, ANOVA, and other methods rejected the null hypothesis.
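As a rough sketch of that pattern (the seed, sample sizes, means, and standard deviations below are illustrative, not the book's exact values):
import numpy as np
np.random.seed(1)
# a single Gaussian sample with a chosen mean and standard deviation
data1 = 5 * np.random.randn(100) + 50
# a dependent second sample: the first sample plus more Gaussian noise
data2 = data1 + (10 * np.random.randn(100) + 50)
# two independently generated samples, as in the significance-test examples
sample1 = 5 * np.random.randn(100) + 50
sample2 = 5 * np.random.randn(100) + 51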
Define your distribution parameters using
sample_space = arange(-8, 8, 0.001)
dof = len(sample_space) - 1
then go on to calculate the pdf and the cdf, and plot the Student's t pdf and cdf.
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as ss
sample_space = np.arange(-8,8,0.001)
dof = len(sample_space) - 1
# Calculate PDF
pdf = ss.t.pdf(sample_space,dof)
# plot PDF
plt.plot(sample_space,pdf)
plt.title("Student's t PDF")
plt.xlabel('x')
plt.ylabel('f(x)')
plt.show()
# Calculate CDF
cdf = ss.t.cdf(sample_space,dof)
# plot CDF
plt.plot(sample_space,cdf)
plt.title("Student's t CDF")
plt.xlabel('x')
plt.ylabel('F(x)')
plt.show()
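As a quick sanity check on this code (my addition, not part of the exercise), the distribution's ppf gives the familiar two-tailed critical value; with this many degrees of freedom the Student's t distribution is practically the standard normal:
import numpy as np
import scipy.stats as ss
dof = len(np.arange(-8,8,0.001)) - 1
# two-tailed 5% critical values: both print roughly 1.96
print(ss.t.ppf(0.975, dof))
print(ss.norm.ppf(0.975))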
(Upload your chunk of code with explanation in Canvas.)
Generate three independent samples
data1 = 6 * randn(100000) + 60
data2 = 6 * randn(100000) + 60
data3 = 6 * randn(100000) + 64
Use alpha = 0.05 and print your results.
import numpy as np
import scipy.stats as ss
# seed the random number generator
np.random.seed(1)
# generate 3 independent samples
data1 = 6 * np.random.randn(100000) + 60
data2 = 6 * np.random.randn(100000) + 60
data3 = 6 * np.random.randn(100000) + 64
# compare samples
stat, p = ss.f_oneway(data1,data2,data3)
print('Statistics= %.3f, p= %.5f' %(stat,p))
## Statistics= 14543.767, p= 0.00000
# interpret
alpha = 0.05
if p > alpha:
    print('Same distributions (fail to reject H0)')
else:
    print('Different distributions (reject H0)')
## Different distributions (reject H0)
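ANOVA only tells us that at least one group mean differs. As a follow-up sketch (not part of the original exercise), pairwise Student's t-tests on the same samples show which pair is responsible; in practice one would also correct for multiple comparisons:
import numpy as np
import scipy.stats as ss
np.random.seed(1)
data1 = 6 * np.random.randn(100000) + 60
data2 = 6 * np.random.randn(100000) + 60
data3 = 6 * np.random.randn(100000) + 64
# pairwise two-sample t-tests on the same three samples
for name, a, b in [('data1 vs data2', data1, data2),
                   ('data1 vs data3', data1, data3),
                   ('data2 vs data3', data2, data3)]:
    stat, p = ss.ttest_ind(a, b)
    print('%s: stat=%.3f, p=%.5f' % (name, stat, p))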
Define your distribution parameters using
sample_space = arange(0, 70, 0.001)
dof = 20
then go on to calculate the pdf, cdf and plot the Chi-squared distribution’s pdf and cdf.
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as ss
sample_space = np.arange(0, 70, 0.001)
dof = 20
# Calculate PDF
pdf = ss.chi2.pdf(sample_space,dof)
# plot PDF
plt.plot(sample_space,pdf)
plt.title('Chi-Squared PDF')
plt.xlabel('x')
plt.ylabel('f(x)')
plt.show()
# Calculate CDF
cdf = ss.chi2.cdf(sample_space,dof)
# plot CDF
plt.plot(sample_space,cdf)
plt.title('Chi-Squared CDF')
plt.xlabel('x')
plt.ylabel('F(x)')
plt.show()
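As a small usage sketch (my addition, not part of the exercise), the same scipy.stats object also gives the critical value that a chi-squared test with 20 degrees of freedom would use at alpha = 0.05:
import scipy.stats as ss
dof = 20
alpha = 0.05
# chi-squared statistics above this value are significant at the 5% level
critical = ss.chi2.ppf(1 - alpha, dof)
print(critical)                      # roughly 31.41
print(ss.chi2.cdf(critical, dof))    # recovers 0.95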
## 5. Who is Francois Chollet?
He is a young French software engineer working in the AI field (currently at Google). He created the Keras deep-learning library in 2015, when he was 25 years old. He has authored many technical papers, including "Xception: Deep Learning with Depthwise Separable Convolutions", which is among the top ten most cited papers in CVPR proceedings, with more than 18,000 citations. He is also the author of the book Deep Learning with Python and, together with Joseph J. Allaire, the co-author of Deep Learning with R.
Analytics is the systematic computational analysis of data or statistics. Its main purpose is the discovery, interpretation, and communication of meaningful patterns in data. Analytics has many applications, such as:
marketing
management
finance
information security
health care
This field requires extensive computation and advanced techniques to work with big data. It is a multidisciplinary field that draws on computer skills, mathematics, and statistics. It uses descriptive techniques and predictive models (including supervised and unsupervised machine learning techniques) to extract valuable knowledge from data.
The statistical foundations of this field were laid by statisticians and mathematicians such as Pascal, Gauss, and Pearson between the 17th and 19th centuries. The first half of the 20th century brought significant advances in mathematical statistics, and the invention of computers sped this progress up further. SPSS and SAS, two important statistical software packages developed in the 1960s and 1970s, have been in use ever since. The introduction of relational databases in the 1970s was another milestone, leading to modern database management systems.
In the 1980s, the concept of data warehousing was introduced to store and analyze large volumes of data.
The early 21st century brought the big data revolution and social media, followed by major advances in machine learning algorithms and computational power. More recently, artificial intelligence and deep learning have produced breakthroughs in image processing, language processing, health care, and other areas.
Cloud computing and technologies like Apache Kafka and Apache Flink now enable real-time data processing and analytics.
Coverage proportion and confidence:
Coverage is the proportion of the population covered by the interval.
Confidence is the degree of confidence with which the interval achieves the specified coverage, i.e., the probabilistic confidence that the interval covers that proportion of the population.
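A minimal sketch of how these two quantities enter a parametric (Gaussian) tolerance interval, using the standard chi-squared-based approximation for the two-sided tolerance factor; the sample, coverage, and confidence values below are assumptions for illustration:
import numpy as np
import scipy.stats as ss
np.random.seed(1)
data = 5 * np.random.randn(100) + 50          # illustrative Gaussian sample
n = len(data)
dof = n - 1
coverage = 0.95      # proportion of the population the interval should cover
confidence = 0.99    # confidence that the interval achieves that coverage
# standard approximation for the two-sided Gaussian tolerance factor k
gauss_critical = ss.norm.ppf((1.0 + coverage) / 2.0)
chi_critical = ss.chi2.ppf(1.0 - confidence, dof)
k = np.sqrt(dof * (1.0 + 1.0 / n) * gauss_critical**2 / chi_critical)
lower = data.mean() - k * data.std(ddof=1)
upper = data.mean() + k * data.std(ddof=1)
print('Tolerance interval: [%.3f, %.3f]' % (lower, upper))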