descriptive statistics: they summarize the important characteristics of both data samples and distributions.
location statistics: they describe where the data is located. They include the mean, the median and the mode.
** mean of a sample: \(\overline{X} = \frac{1}{n}\sum_{i=1}^{n}X_i\), with \(n\) being the number of data points in the sample.
** mean of a distribution: \(E(g(X))=\int_{-\infty}^{\infty}g(u)f(u)\,du\), with \(f(x)\) being the density function of the distribution. Taking \(g(x)=x\) gives the mean \(\mu=E(X)=\int_{-\infty}^{\infty}u\,f(u)\,du\).
** median: the value that splits the sorted data into two groups of equal size, so that half of the data lies below it and half above it.
** quantiles: a quantile \(X_q\) is a point in the data such that \(q\%\) of the data lies to the left of \(X_q\) (at or below it).
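The location statistics above can be computed with Python's standard `statistics` module. This is only a minimal sketch: the sample values are made up for illustration, and the `"inclusive"` quantile method is one of several conventions, chosen here as an assumption.

```python
import statistics

# Hypothetical sample, chosen only for illustration.
sample = [2, 3, 3, 5, 7, 8, 9, 11, 12]

mean = statistics.mean(sample)      # (1/n) * sum of the values
median = statistics.median(sample)  # middle value of the sorted sample
mode = statistics.mode(sample)      # most frequent value

# Quartiles: the 25%, 50% and 75% quantiles. The "inclusive" method
# treats the sample as containing its own minimum and maximum.
q1, q2, q3 = statistics.quantiles(sample, n=4, method="inclusive")

print(mean, median, mode, (q1, q2, q3))
```

The second quartile coincides with the median, which is a quick sanity check on whichever quantile convention is used.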
dispersion statistics: they describe the spread of the data, that is, how widely the values in the dataset are scattered. They include the variance, the median of the absolute deviations and the interquartile range.
** variance & standard deviation for a sample: \(Var(X)=s_x^2=\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\overline{x})^2\); the standard deviation is \(s_x=\sqrt{s_x^2}\).
** variance & standard deviation for a distribution: \(VAR(X)=\sigma^2=E((X-\mu)^2)=\int_{-\infty}^{\infty}(u-\mu)^2f(u)\,du\); the standard deviation is \(\sigma=\sqrt{\sigma^2}\).
** median of the absolute deviations (MAD): the median of the absolute differences between each data point and the median of the whole sample: \(mad(X)=median_i(|x_i-median_j(x_j)|)\)
** interquartile range (IQR): the difference between the third quartile (75th percentile) and the first quartile (25th percentile).
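A short sketch of the dispersion statistics above, again with the standard `statistics` module. The sample is hypothetical, and the quartile convention (`method="inclusive"`) is an assumption; other conventions give slightly different IQR values.

```python
import statistics

# Hypothetical sample, chosen only for illustration.
sample = [2, 3, 3, 5, 7, 8, 9, 11, 12]

variance = statistics.variance(sample)  # sample variance, divides by n - 1
std_dev = statistics.stdev(sample)      # square root of the sample variance

# Median of the absolute deviations from the median.
med = statistics.median(sample)
mad = statistics.median([abs(x - med) for x in sample])

# Interquartile range: third quartile minus first quartile.
q1, _, q3 = statistics.quantiles(sample, n=4, method="inclusive")
iqr = q3 - q1

print(variance, std_dev, mad, iqr)
```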
skewness: this value measures the asymmetry of a distribution. If a distribution is symmetrical, then it has no skewness. If the skew is negative, the distribution has a longer tail to the left and most of the data is grouped towards the right. If the skew is positive, the longer tail is on the right and most of the data is grouped towards the left.
kurtosis: this parameter measures how heavily populated the tails of a distribution are. For a normal distribution its value is 3.
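These notes give no formula for skewness or kurtosis. The sketch below assumes the usual moment-based definitions, \(m_3/m_2^{3/2}\) and \(m_4/m_2^2\), where \(m_k\) is the k-th central moment computed with a \(1/n\) factor. The helper names and the samples are hypothetical.

```python
def central_moment(data, k):
    """k-th central moment of the data, using a 1/n factor."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** k for x in data) / n

def skewness(data):
    # Third central moment scaled by the variance to the power 3/2.
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def kurtosis(data):
    # Fourth central moment scaled by the squared variance
    # (not the "excess" kurtosis, so a normal distribution gives 3).
    return central_moment(data, 4) / central_moment(data, 2) ** 2

symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 2, 2, 3, 10]          # one large value: long right tail
left_tailed = [-x for x in right_tailed]    # mirror image: long left tail

print(skewness(symmetric), skewness(right_tailed), skewness(left_tailed))
print(kurtosis(symmetric))
```

With these definitions the symmetric sample has zero skewness, the sample with a long right tail has positive skewness, and its mirror image has negative skewness.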
robustness: a robust statistic is barely affected when a small number of measurements change. The median, IQR and MAD are all robust. The mean, variance, skewness and kurtosis are all non-robust statistics.
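A quick way to see the difference is to replace a single value by an outlier and compare how far the mean and the median move. The data here is made up for illustration.

```python
import statistics

clean = [10, 11, 12, 13, 14]
contaminated = [10, 11, 12, 13, 1000]  # one value replaced by an outlier

# How much each statistic moves when a single measurement changes.
mean_shift = statistics.mean(contaminated) - statistics.mean(clean)
median_shift = statistics.median(contaminated) - statistics.median(clean)

print(mean_shift, median_shift)
```

The median does not move at all in this example, while the mean is dragged far away from the bulk of the data.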
PROPERTIES: * E(X+Y)=E(X)+E(Y) * VAR(X+Y)=VAR(X)+VAR(Y)+2Cov(X,Y)
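Both properties also hold exactly for sample means and sample variances, as long as the variance and the covariance use the same \(n-1\) divisor. A minimal numerical check, with hypothetical paired data and helper names:

```python
def mean(values):
    return sum(values) / len(values)

def cov(a, b):
    """Sample covariance with an n - 1 divisor; cov(a, a) is the variance."""
    mean_a, mean_b = mean(a), mean(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)

# Hypothetical paired observations.
x = [1.0, 2.0, 4.0, 7.0]
y = [3.0, 1.0, 5.0, 2.0]
s = [a + b for a, b in zip(x, y)]  # element-wise sum X + Y

var_of_sum = cov(s, s)
sum_of_parts = cov(x, x) + cov(y, y) + 2 * cov(x, y)

print(mean(s), mean(x) + mean(y))
print(var_of_sum, sum_of_parts)
```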
Uniform distribution: a step that goes from x=a to x=b, with density \(\frac{1}{b-a}\) on that interval. Symmetric, and with no negative values when \(a \ge 0\).
Normal distribution: the larger the sample size, the closer the sample is to the expected theoretical shape. Symmetric, and takes negative values.
Chi-square distribution: the sum of the squares of n independent variables drawn from a standard normal distribution. It has n degrees of freedom. The mean is n and the variance is 2n. It is used to compute confidence intervals for the variance of a normal population. Asymmetric, with no negative values.
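The construction can be checked by simulation: sum n squared standard normal draws many times and compare the sample mean and variance with n and 2n. The seed, n and the number of draws are arbitrary choices for this sketch.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is reproducible

n = 4          # degrees of freedom (arbitrary choice)
draws = 20000  # number of simulated chi-square values

# Each value is the sum of n squared standard normal draws.
samples = [
    sum(random.gauss(0.0, 1.0) ** 2 for _ in range(n))
    for _ in range(draws)
]

sample_mean = statistics.fmean(samples)    # should be close to n
sample_var = statistics.variance(samples)  # should be close to 2n

print(sample_mean, sample_var)
```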
Student's t distribution: the ratio of a standard normal variable to the square root of an independent chi-square variable divided by its n degrees of freedom (d.o.f.). The mean is 0 and the variance is \(\frac{n}{n-2}\) (for \(n>2\)). It is used to estimate the mean when the variance is unknown, and also to find confidence intervals for the mean of normal populations. The more d.o.f. it has, the more it resembles a standard normal. Symmetric, and takes negative values.
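A simulation sketch of this construction: divide a standard normal draw by the square root of a chi-square draw over its degrees of freedom, then compare the sample mean and variance with 0 and \(\frac{n}{n-2}\). The helper name, seed, n and number of draws are assumptions made for illustration.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is reproducible

def student_t(dof):
    """One draw built as Z / sqrt(U / dof), with Z standard normal and
    U a chi-square with dof degrees of freedom, drawn independently."""
    z = random.gauss(0.0, 1.0)
    u = sum(random.gauss(0.0, 1.0) ** 2 for _ in range(dof))
    return z / (u / dof) ** 0.5

n = 10  # degrees of freedom (arbitrary choice, must exceed 2)
t_samples = [student_t(n) for _ in range(20000)]

t_mean = statistics.fmean(t_samples)    # should be close to 0
t_var = statistics.variance(t_samples)  # should be close to n / (n - 2)

print(t_mean, t_var)
```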
F distribution: the ratio of two chi-square variables (U1 and U2) with different degrees of freedom (d1, d2), each divided by its own d.o.f. Asymmetric, with no negative values. It is related to the ANOVA test.
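The same simulation idea works for the F distribution. The sketch below builds F values from two independent chi-square draws and checks two things: no value is negative, and the mean lies above the median, as expected for a right-skewed distribution. The comparison value \(\frac{d_2}{d_2-2}\) for the mean is a standard result that is not stated in these notes; the helper name and parameters are hypothetical.

```python
import random
import statistics

random.seed(2)  # fixed seed so the run is reproducible

def chi_square(dof):
    """One chi-square draw: sum of dof squared standard normal draws."""
    return sum(random.gauss(0.0, 1.0) ** 2 for _ in range(dof))

d1, d2 = 5, 10  # degrees of freedom (arbitrary choice, d2 > 2)
draws = 20000

# Ratio of two independent chi-squares, each divided by its own d.o.f.
f_samples = [
    (chi_square(d1) / d1) / (chi_square(d2) / d2)
    for _ in range(draws)
]

f_mean = statistics.fmean(f_samples)     # theoretical mean: d2 / (d2 - 2)
f_median = statistics.median(f_samples)  # below the mean for a right skew

print(f_mean, f_median)
```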
Bernoulli: the discrete probability distribution of a random variable that takes the value 1 with probability p and the value 0 with probability q=1−p. It is the probability distribution of any single experiment that asks a yes-no question, e.g. tossing a coin.
Binomial: sum of n independent Bernoulli experiments. Assume that we carry out an experiment whose result can be “success” or “failure”. The probability of “success” in one experiment is p, so the probability of “failure” is q=1−p. We carry out the experiment n times. The number of successes k then follows a binomial distribution: \(P(X=k)=\binom{n}{k}p^k q^{n-k}\), for \(k=0,1,\dots,n\).
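A minimal sketch of the standard binomial probability mass function, \(P(X=k)=\binom{n}{k}p^k(1-p)^{n-k}\), using `math.comb`. The function name and the parameter values are illustrative assumptions.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials,
    each with success probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 10, 0.5  # e.g. ten tosses of a fair coin
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

print(sum(pmf))               # probabilities over k = 0..n add up to 1
print(binomial_pmf(5, n, p))  # chance of exactly five successes
```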
Hypergeometric: there are N balls, R of which are red, and we pick n balls without replacement. Let X be the number of red balls in the sample; then the distribution of X is the hypergeometric distribution.
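These notes do not write out the probabilities. The sketch below assumes the standard hypergeometric formula for drawing without replacement, \(P(X=k)=\binom{R}{k}\binom{N-R}{n-k}/\binom{N}{n}\), where k counts the balls of the marked colour (R of the N balls) among the n drawn. The function name and the numbers are hypothetical.

```python
from math import comb

def hypergeom_pmf(k, N, R, n):
    """Probability of k marked balls in a sample of n drawn without
    replacement from N balls, R of which are marked."""
    return comb(R, k) * comb(N - R, n - k) / comb(N, n)

N, R, n = 20, 7, 5  # illustrative values
pmf = [hypergeom_pmf(k, N, R, n) for k in range(min(R, n) + 1)]

expected = sum(k * q for k, q in enumerate(pmf))

print(sum(pmf))   # probabilities add up to 1
print(expected)   # matches n * R / N
```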
Poisson: counts the number of occurrences of an event in a certain time period or in a certain region of space, where the event occurs completely at random. It is often used to describe the occurrence of rare events.
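The notes do not state the formula. Assuming the standard Poisson probability mass function, \(P(X=k)=\frac{\lambda^k e^{-\lambda}}{k!}\), with \(\lambda\) the average number of occurrences per period, a minimal sketch looks like this (the rate and the truncation point are arbitrary choices):

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of exactly k occurrences when the average number
    of occurrences per period is lam."""
    return lam ** k * exp(-lam) / factorial(k)

lam = 2.0  # illustrative rate: two events per period on average

# Truncate the infinite support at k = 49; the remaining tail is negligible.
pmf = [poisson_pmf(k, lam) for k in range(50)]
expected = sum(k * q for k, q in enumerate(pmf))

print(sum(pmf))   # very close to 1
print(expected)   # very close to lam
```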
Condition that two independent events A and B must satisfy: P(A|B)=P(A) and P(B|A)=P(B)
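The condition can be checked by enumeration on a small example. The sketch below uses two fair dice, with A = "the first die is even" and B = "the second die shows 5 or 6". Both events and the helper names are made up for illustration, and exact fractions avoid rounding issues.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event given as a predicate on an outcome."""
    favourable = sum(1 for o in outcomes if event(o))
    return Fraction(favourable, len(outcomes))

def event_a(o):
    return o[0] % 2 == 0  # first die is even

def event_b(o):
    return o[1] > 4       # second die shows 5 or 6

p_a = prob(event_a)
p_b = prob(event_b)
p_ab = prob(lambda o: event_a(o) and event_b(o))

# Conditional probabilities from the definition P(A|B) = P(A and B) / P(B).
p_a_given_b = p_ab / p_b
p_b_given_a = p_ab / p_a

print(p_a_given_b == p_a, p_b_given_a == p_b)
```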