Introduction to Statistics

Statistics is the science of collecting, analyzing, interpreting, presenting, and organizing data. In this course, we will cover essential statistical concepts and methods using R and the WHO dataset.

Descriptive statistics

We distinguish between the mean and the median. When outliers are present, we use the median instead of the mean and the MAD (median absolute deviation) instead of the sd, because outliers do not affect the median or the MAD.

So, for a skewed distribution we use:

  1. the median
  2. Spearman correlation

For a normal distribution we use the mean, Pearson correlation, and the standard deviation.

To find out whether a distribution is normal or skewed, we use visual techniques or formal tests.

library(ggplot2)
WHO <- read.csv("WHO.csv")
ggplot(WHO, aes(x = LifeExpectancy)) + geom_density() # the density curve shows the skew

Testing for normality

Before interpreting the Shapiro-Wilk test, we need to know what is being tested, that is, what the null hypothesis is. The null hypothesis of the Shapiro-Wilk test is that the data come from a normal distribution. If the p-value is below 0.05, we reject the null hypothesis at the 5% significance level. In this case we conclude that the distribution is not normal (here, skewed).

shapiro.test(WHO$LifeExpectancy)
## 
##  Shapiro-Wilk normality test
## 
## data:  WHO$LifeExpectancy
## W = 0.93077, p-value = 5.696e-08
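
To get a feel for how the test behaves, the following sketch runs shapiro.test() on simulated data rather than on the WHO file. The seed, the sample sizes, and the variable names are assumptions chosen only for illustration.

```r
# Sketch with simulated data (not the WHO dataset)
set.seed(1)                                      # arbitrary seed, for reproducibility
normal_sample <- rnorm(200, mean = 70, sd = 9)   # drawn from a normal distribution
skewed_sample <- rexp(200, rate = 1 / 10)        # strongly right-skewed

p_normal <- shapiro.test(normal_sample)$p.value
p_skewed <- shapiro.test(skewed_sample)$p.value

# A small p-value is evidence against the null hypothesis of normality
p_skewed < 0.05
p_normal
```

For the skewed sample the p-value should be far below 0.05, so normality is rejected, just as for LifeExpectancy above.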

The simplest way to get descriptive statistics for LifeExpectancy is:

summary(WHO$LifeExpectancy)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   47.00   64.00   72.50   70.01   76.00   83.00
sd (WHO$LifeExpectancy) 
## [1] 9.259075

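Note that summary() reports neither the MAD nor the sd. The sd, which is appropriate for a normal distribution, was computed separately above; the MAD is still missing.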

median != mean

For example, take the data vector 57, 40, 103, 234, 93, 53, 116, 98, 108, 121, 22. The mean is computed by summing all the values and dividing by the number of observations:

sum(c(57,40,103,234,93,53,116,98,108,121,22))/11 # so the mean is 95
## [1] 95

The median works a little differently: first we sort the data from smallest to largest: 22, 40, 53, 57, 93, 98, 103, 108, 116, 121, 234.

The median is the middle value. With 11 elements, the middle value is the sixth element, 98. So the median is 98.

median(c(57,40,103,234,93,53,116,98,108,121,22))
## [1] 98

The range is another descriptive measure: the difference between the largest and the smallest value, in this case 234 - 22 = 212. In R, range() returns the minimum and the maximum:

range(c(57,40,103,234,93,53,116,98,108,121,22))
## [1]  22 234
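
Since range() returns the two extremes rather than their difference, a minimal sketch of getting the single-number range from the same vector could look like this (the name `x` is just a convenience for the example):

```r
x <- c(57, 40, 103, 234, 93, 53, 116, 98, 108, 121, 22)

diff(range(x))    # max - min
## [1] 212
max(x) - min(x)   # the same result, written out
## [1] 212
```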

The interquartile range (IQR) is the distance between the first and third quartiles, that is, the middle 50% of the distribution (Q3 - Q1). Here Q1 is the median of the values lying between the minimum and the median, where the minimum and the median themselves are not counted. In this example Q1 is:

(53+57)/2
## [1] 55

By the same logic, Q3 is:

(108+116)/2
## [1] 112

Let us check in R:

summary(c(57,40,103,234,93,53,116,98,108,121,22))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      22      55      98      95     112     234

Outliers

The general formula to detect outliers is based on the Interquartile Range (IQR) method. Here’s how it’s done:

  1. Calculate the IQR: \[ IQR = Q3 - Q1 \] Where:

    • \(Q1\) is the first quartile (25th percentile)
    • \(Q3\) is the third quartile (75th percentile)
  2. Determine the lower and upper bounds for outliers: \[ \text{Lower Bound} = Q1 - 1.5 \times IQR \] \[ \text{Upper Bound} = Q3 + 1.5 \times IQR \]

  3. Outliers are data points that fall below the lower bound or above the upper bound: \[ \text{Outliers} = \{ x \, | \, x < Q1 - 1.5 \times IQR \text{ or } x > Q3 + 1.5 \times IQR \} \]

55 - 1.5 * (112-55) # the lower bound is -30.5, so there are no outliers on the low side
## [1] -30.5
1.5 * (112-55) + 112 # anything above 197.5 is an outlier, which includes the value 234
## [1] 197.5
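
The three steps of the IQR rule can be wrapped into a small function. This is only a sketch: `iqr_outliers` is a hypothetical helper written for these notes, not a base R function, and it relies on the default quantile() definition of the quartiles.

```r
# Hypothetical helper (not part of base R): flag values outside the 1.5 * IQR fences
iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]                      # step 1: IQR = Q3 - Q1
  lower <- q[1] - k * iqr                 # step 2: lower bound
  upper <- q[2] + k * iqr                 #         upper bound
  list(lower = lower,
       upper = upper,
       outliers = x[!is.na(x) & (x < lower | x > upper)])  # step 3
}

x <- c(57, 40, 103, 234, 93, 53, 116, 98, 108, 121, 22)
res <- iqr_outliers(x)
res$lower      # -30.5
res$upper      # 197.5
res$outliers   # 234
```

boxplot() computes its fences from hinges rather than from quantile(), so on other data the two approaches can occasionally disagree for values that sit very close to a bound.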

In R we can check this visually and have the outliers returned at the same time:

boxplot(c(57,40,103,234,93,53,116,98,108,121,22))$out

## [1] 234

In the same way, we can return to our WHO example and check for outliers (try this yourself):

boxplot(WHO$LifeExpectancy)$out

## numeric(0)
boxplot(WHO$ChildMortality)$out

## [1] 163.5 128.6 149.8 145.7 129.1 181.6 147.4

Variance (\(s^2\)): the average squared deviation of each value from the mean.

varijansa = 32246/10  #3224.6 

That is, the distance of each element from the mean is squared; for example, for 22:

(22-95)^2 #5329
## [1] 5329
# and so on for every element, (40-95)^2, ...; the squares are summed and divided by n-1, i.e. 11-1 = 10
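
The same calculation can be written out for the whole vector at once. This is a minimal sketch; the names `x`, `squared_dev`, and `s2` are introduced here only for the example.

```r
x <- c(57, 40, 103, 234, 93, 53, 116, 98, 108, 121, 22)
n <- length(x)

squared_dev <- (x - mean(x))^2     # squared distance of each element from the mean
sum(squared_dev)                   # 32246
s2 <- sum(squared_dev) / (n - 1)   # sample variance: divide by n - 1, not n
s2                                 # 3224.6
```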

The standard deviation is used more often; it is the square root of the variance:

sqrt (3224.6) #56.79
## [1] 56.78556

# check in R

var(c(57,40,103,234,93,53,116,98,108,121,22))
## [1] 3224.6
sd (c(57,40,103,234,93,53,116,98,108,121,22))
## [1] 56.78556
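
The earlier remark that outliers do not affect the MAD can be illustrated on this same vector. The sketch below compares sd() and mad() with and without the value 234; `x_clean` is a name made up for the example.

```r
x <- c(57, 40, 103, 234, 93, 53, 116, 98, 108, 121, 22)
x_clean <- x[x != 234]   # the same data without the outlier

# sd reacts strongly to the single outlier
sd(x)
sd(x_clean)

# the MAD is essentially unchanged
mad(x)
mad(x_clean)

# mad() multiplies the raw median absolute deviation by 1.4826 by default,
# which makes it comparable to the sd for normally distributed data
median(abs(x - median(x))) * 1.4826
```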

Before computing a correlation we must visualize the data, because the relationship between two variables can be parabolic, and in that case the correlation coefficient can be 0 even though the variables are clearly related.

cor(WHO$FertilityRate, WHO$ChildMortality,use='complete.obs') 
## [1] 0.8640376
# the correlation between x and y is 0.864; the "use=" argument handles NA values by keeping only complete pairs
plot(WHO$FertilityRate, WHO$ChildMortality)

plot(WHO$GNI,WHO$LiteracyRate)

plot(log(WHO$GNI),WHO$LiteracyRate)
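
The warning about parabolic relationships can be reproduced with a tiny made-up example. Here `y` is completely determined by `x`, yet the correlation is zero; the data are invented for illustration and have nothing to do with the WHO file.

```r
x <- -5:5
y <- x^2                         # perfectly determined by x, but not linearly

cor(x, y)                        # Pearson: 0 (up to floating-point error)
cor(x, y, method = "spearman")   # the rank correlation is also 0 here
plot(x, y)                       # the plot reveals the relationship immediately
```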

summary(WHO$FertilityRate)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.260   1.835   2.400   2.941   3.905   7.580      11
quantile(WHO$FertilityRate, na.rm = TRUE) # quantiles (percentiles)
##    0%   25%   50%   75%  100% 
## 1.260 1.835 2.400 3.905 7.580
var(WHO$FertilityRate, na.rm = TRUE) # variance
## [1] 2.193315
sd(WHO$FertilityRate, na.rm = TRUE) # standard deviation; note that sd = sqrt(var), here sqrt(2.193315)
## [1] 1.480984

2. Central Limit Theorem

The Central Limit Theorem states that the distribution of sample means approaches a normal distribution as the sample size increases, regardless of the original population distribution.

# Example: Sampling distribution of the mean for Life Expectancy
set.seed(123) # For reproducibility
sample_means <- replicate(1000, mean(sample(WHO$LifeExpectancy, 30, replace = TRUE)))
hist(sample_means, breaks = 30, main = "Sampling Distribution of the Mean", xlab = "Mean Life Expectancy")
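
The "regardless of the original population distribution" part can be checked on a population that is clearly not normal. The sketch below is an assumption-laden illustration: it draws from an exponential distribution with rate 1 (mean 1, standard deviation 1) instead of the WHO data, and the seed and number of replications are arbitrary.

```r
set.seed(42)                       # arbitrary seed, for reproducibility
n <- 30                            # size of each sample
sample_means <- replicate(5000, mean(rexp(n, rate = 1)))

mean(sample_means)   # close to the population mean, 1
sd(sample_means)     # close to sigma / sqrt(n) = 1 / sqrt(30), about 0.18
```

A histogram of `sample_means` should look roughly bell-shaped even though the exponential population itself is strongly right-skewed.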

3. Normal Distribution

The normal distribution is a probability distribution that is symmetric about the mean, representing many real-world phenomena.

# Plotting the normal distribution
x <- seq(40, 90, by = 0.1)
y <- dnorm(x, mean = mean(WHO$LifeExpectancy, na.rm = TRUE), sd = sd(WHO$LifeExpectancy, na.rm = TRUE))
plot(x, y, type = "l", main = "Normal Distribution of Life Expectancy", xlab = "Life Expectancy", ylab = "Density")
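
Two properties of the normal distribution can be checked numerically: its symmetry about the mean, and the share of probability within 1, 2, and 3 standard deviations. The values of `mu` and `sigma` below are illustrative assumptions, not estimates from the WHO data.

```r
mu <- 70      # illustrative mean (assumption)
sigma <- 9    # illustrative standard deviation (assumption)

# Symmetry: the density is the same at equal distances on either side of the mean
all.equal(dnorm(mu - 5, mean = mu, sd = sigma),
          dnorm(mu + 5, mean = mu, sd = sigma))

# Probability of falling within 1, 2 and 3 standard deviations of the mean
k <- 1:3
within_k <- pnorm(mu + k * sigma, mu, sigma) - pnorm(mu - k * sigma, mu, sigma)
round(within_k, 4)   # 0.6827 0.9545 0.9973
```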

The CLT states that the sampling distribution of a statistic will be close to normal when the sample size is large enough. As a rough guide, the sampling distribution is approximately normal under any of the following conditions:

  1. the population distribution is normal; or
  2. the population distribution is symmetric and the sample size is n <= 15; or
  3. the population distribution is moderately skewed and 16 <= n <= 30; or
  4. the sample size is n > 30 and there are no outliers.

4. T-Tests

T-tests are used to compare the means of two groups. We can perform an independent t-test to compare life expectancy between two regions.

Test statistic = signal/noise = variance explained by the model / variance not explained by the model = effect/error. The larger the absolute value of t (or of another test statistic), the more likely you are to reject H0, since there is more signal than noise.

# T-test between two regions
t.test(LifeExpectancy ~ Region, data = WHO[WHO$Region %in% c("Europe", "Africa"), ])
## 
##  Welch Two Sample t-test
## 
## data:  LifeExpectancy by Region
## t = -15.478, df = 82.606, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Africa and group Europe is not equal to 0
## 95 percent confidence interval:
##  -21.19264 -16.36601
## sample estimates:
## mean in group Africa mean in group Europe 
##             57.95652             76.73585

Interpretation of the Welch Two Sample T-Test Results

  1. Test Statistics:
    • The t-statistic is -15.478. This large negative value indicates that the mean life expectancy in Africa is significantly lower than in Europe.
  2. Degrees of Freedom:
    • The degrees of freedom (df) for this test is approximately 82.606. This value is used to determine the critical value for the t-distribution.
  3. P-Value:
    • The p-value is reported as < 2.2e-16. This value is extremely low, suggesting that there is a very strong statistical significance in the difference between the two groups. Typically, a p-value less than 0.05 indicates that we can reject the null hypothesis.
  4. Alternative Hypothesis:
    • The alternative hypothesis states that the true difference in means between life expectancy in Africa and Europe is not equal to zero. Given the results, we reject the null hypothesis (which posits that there is no difference) in favor of the alternative.
  5. Confidence Interval:
    • The 95% confidence interval for the difference in means is (-21.19264, -16.36601). This means we are 95% confident that the true difference in life expectancy between the two regions lies between approximately 16.37 years and 21.19 years, with Africa having the lower mean.
  6. Sample Estimates:
    • The mean life expectancy in Africa is 57.96 years, while in Europe it is 76.74 years. This indicates a substantial difference of about 18.78 years in favor of Europe.

Conclusion

The results of this t-test provide strong evidence that there is a significant difference in life expectancy between Africa and Europe, with Africa having a notably lower mean life expectancy. The confidence interval confirms that the difference is statistically significant and likely reflects a real disparity in health outcomes between these two regions.

Example with a 90% confidence level

t.test(LifeExpectancy ~ Region, data = WHO[WHO$Region %in% c("Europe", "Africa"), ],conf.level = 0.90)
## 
##  Welch Two Sample t-test
## 
## data:  LifeExpectancy by Region
## t = -15.478, df = 82.606, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Africa and group Europe is not equal to 0
## 90 percent confidence interval:
##  -20.79761 -16.76104
## sample estimates:
## mean in group Africa mean in group Europe 
##             57.95652             76.73585
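
The signal/noise idea and the effect of conf.level can both be reproduced by hand. The sketch below uses two simulated groups, so the group sizes, means, standard deviations, and seed are assumptions; the formula is the standard Welch statistic, the difference in means divided by its standard error.

```r
set.seed(7)                           # arbitrary seed
a <- rnorm(40, mean = 58, sd = 9)     # simulated group A (assumption)
b <- rnorm(50, mean = 77, sd = 5)     # simulated group B (assumption)

signal <- mean(a) - mean(b)                               # difference in means
noise  <- sqrt(var(a) / length(a) + var(b) / length(b))   # standard error of the difference
t_manual <- signal / noise

fit95 <- t.test(a, b)                       # Welch test is the default (var.equal = FALSE)
fit90 <- t.test(a, b, conf.level = 0.90)

all.equal(t_manual, unname(fit95$statistic))   # TRUE
diff(fit90$conf.int) < diff(fit95$conf.int)    # TRUE: lower confidence, narrower interval
```

Changing conf.level only changes the width of the interval; the t statistic, the degrees of freedom, and the p-value stay the same, as the two WHO outputs above also show.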

5. Chi-Square Tests

Chi-square tests assess whether there is a significant association between categorical variables. We’ll create a new categorical variable for child mortality and then perform a chi-square test.

Assumptions for the Chi-Square Test

  1. Random sample
  2. Independent observations in the sample (one observation per subject). This means one person cannot be in both groups
  3. All expected counts are greater than 1 in each cell
  4. No more than 20% of cells have expected counts less than five

The Chi-Squared test assesses how expectations compare to actual observed data. It is commonly used in the following contexts:

  1. Goodness-of-Fit Test: Determines if a sample matches a population.
  2. Test for Independence: Evaluates whether two categorical variables are independent of each other.

Key Concepts

  1. Observed Frequencies: The actual counts from your data.
  2. Expected Frequencies: The counts you would expect if there were no association between the variables. This is calculated based on the marginal totals of the contingency table.
  3. Degrees of Freedom (df): This is calculated as \((\text{number of rows} - 1) \times (\text{number of columns} - 1)\) for tests of independence.

The Chi-Squared Statistic

The formula to calculate the Chi-Squared statistic (\(X^2\)) is:

\[ X^2 = \sum \frac{(O - E)^2}{E} \]

Where:

  • \(O\) = observed frequency
  • \(E\) = expected frequency

The summation is done over all categories in the table.
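
As a worked illustration of the formula, the sketch below computes the expected counts and the statistic by hand and compares them with chisq.test(). The 2 x 3 table is hypothetical, invented only for this example.

```r
# Hypothetical 2 x 3 table of observed counts (illustration only)
observed <- matrix(c(20, 30, 50,
                     30, 30, 40),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(group = c("A", "B"),
                                   answer = c("no", "maybe", "yes")))

n <- sum(observed)
expected <- outer(rowSums(observed), colSums(observed)) / n   # row total * column total / N
x2 <- sum((observed - expected)^2 / expected)                 # sum of (O - E)^2 / E
df <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(x2, df = df, lower.tail = FALSE)

c(X2 = x2, df = df, p = p_value)
chisq.test(observed)   # should report the same statistic, df and p-value
```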

Interpretation of Results

  1. Null Hypothesis (\(H_0\)): Assumes that there is no association between the two categorical variables.
  2. Alternative Hypothesis (\(H_a\)): Assumes that there is an association between the two variables.

After calculating the Chi-Squared statistic, you compare it to a critical value from the Chi-Squared distribution table, based on the chosen significance level (commonly 0.05) and the degrees of freedom.

  • p-value: If the p-value is less than the significance level (e.g., \(p < 0.05\)), you reject the null hypothesis, suggesting that an association exists.

Example Interpretation

In the example output below:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Create a categorical variable for Child Mortality
WHO <- WHO %>%
  mutate(ChildMortalityCategory = cut(ChildMortality, breaks = c(-Inf, 20, 50, 100, Inf), labels = c("Low", "Moderate", "High", "Very High")))

table(WHO$ChildMortalityCategory, WHO$Region)
##            
##             Africa Americas Eastern Mediterranean Europe South-East Asia
##   Low            3       23                    12     48               3
##   Moderate       3       11                     3      3               5
##   High          26        1                     5      2               3
##   Very High     14        0                     2      0               0
##            
##             Western Pacific
##   Low                    12
##   Moderate               12
##   High                    3
##   Very High               0
# Chi-square test between Child Mortality Category and Region
chisq.test(table(WHO$ChildMortalityCategory, WHO$Region))
## Warning in chisq.test(table(WHO$ChildMortalityCategory, WHO$Region)):
## Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  table(WHO$ChildMortalityCategory, WHO$Region)
## X-squared = 142.11, df = 15, p-value < 2.2e-16
  • X-squared = 142.11: This is the calculated Chi-Squared statistic.
  • df = 15: This indicates there are 15 degrees of freedom.
  • p-value < 2.2e-16: A very small p-value indicates strong evidence against the null hypothesis.

The chisq.test() function in R does not have a conf.level parameter like the t.test() function. The Chi-squared test primarily provides the test statistic and p-value without direct calculation of confidence intervals.
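
The warning "Chi-squared approximation may be incorrect" is related to the assumptions about expected counts listed earlier. A hedged sketch of how one might inspect this: the counts below are copied from the contingency table above so that the block runs on its own, and the simulated p-value is one possible alternative, not necessarily the approach intended in these notes.

```r
# Counts copied from the contingency table above
tab <- matrix(c( 3, 23, 12, 48, 3, 12,
                 3, 11,  3,  3, 5, 12,
                26,  1,  5,  2, 3,  3,
                14,  0,  2,  0, 0,  0),
              nrow = 4, byrow = TRUE,
              dimnames = list(
                c("Low", "Moderate", "High", "Very High"),
                c("Africa", "Americas", "Eastern Mediterranean",
                  "Europe", "South-East Asia", "Western Pacific")))

expected <- suppressWarnings(chisq.test(tab))$expected
mean(expected < 5)   # share of cells with an expected count below 5
min(expected)        # smallest expected count

# Monte Carlo p-value, which does not rely on the chi-squared approximation
set.seed(1)          # arbitrary seed
p_sim <- chisq.test(tab, simulate.p.value = TRUE, B = 2000)$p.value
p_sim
```

Here well over 20% of the cells have an expected count below 5 and at least one is below 1, which explains the warning. The simulated p-value still points to a clear association.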

To see which groups deviate from the expected counts:

chisq.test(table(WHO$ChildMortalityCategory, WHO$Region))$residuals
## Warning in chisq.test(table(WHO$ChildMortalityCategory, WHO$Region)):
## Chi-squared approximation may be incorrect
##            
##                 Africa   Americas Eastern Mediterranean     Europe
##   Low       -4.2806846  1.1193971             0.1614481  3.8849552
##   Moderate  -1.9491146  1.6738873            -0.5838146 -2.2357570
##   High       5.3626905 -2.3141016             0.2178213 -2.7007171
##   Very High  5.2399292 -1.6989991             0.1377623 -2.0907257
##            
##             South-East Asia Western Pacific
##   Low            -1.1394566      -0.5485667
##   Moderate        2.0035968       3.0188489
##   High            0.4860278      -1.0879692
##   Very High      -0.9524791      -1.4922480

This output shows the Pearson residuals from a chi-squared test comparing the Child Mortality Category across different WHO Regions. The residuals provide insight into the strength and direction of the relationship between the two variables, indicating where the observed counts differ most from the expected counts.

Pearson residuals help us understand the difference between observed and expected counts in a contingency table. The formula for calculating residuals is:

\[ \text{residual} = \frac{\text{observed count} - \text{expected count}}{\sqrt{\text{expected count}}} \]
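
A Pearson residual is the difference between the observed and expected count divided by the square root of the expected count. As a check, the residuals can be recomputed by hand; the sketch below copies the counts from the contingency table above so that it runs without the WHO file.

```r
# Counts copied from the contingency table above
tab <- matrix(c( 3, 23, 12, 48, 3, 12,
                 3, 11,  3,  3, 5, 12,
                26,  1,  5,  2, 3,  3,
                14,  0,  2,  0, 0,  0),
              nrow = 4, byrow = TRUE,
              dimnames = list(
                c("Low", "Moderate", "High", "Very High"),
                c("Africa", "Americas", "Eastern Mediterranean",
                  "Europe", "South-East Asia", "Western Pacific")))

expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)   # row total * column total / N
residuals_manual <- (tab - expected) / sqrt(expected)      # (O - E) / sqrt(E)

round(residuals_manual["Low", "Africa"], 2)   # -4.28, as in the output above
sum(residuals_manual^2)                       # the squared residuals add up to X-squared
```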

Key results from the test:

  • Chi-squared statistic: 142.11
  • Degrees of freedom: 15
  • p-value: less than 2.2e-16, meaning the null hypothesis (that child mortality is independent of region) can be rejected. There is a strong association between child mortality categories and region.

Interpreting the Pearson residuals:

  • Positive residuals mean the observed count is higher than expected.
  • Negative residuals mean the observed count is lower than expected.
  • Larger absolute values mark the cells that contribute most to the chi-squared statistic; as a rule of thumb, residuals above 2 or below -2 are treated as notable.

Region-by-Region Interpretation:

Africa:

  • Low: Residual of -4.28 (strong negative residual), meaning there are fewer low child mortality cases than expected.
  • Moderate: Residual of -1.95 (slightly negative), meaning there are somewhat fewer moderate child mortality cases.
  • High: Residual of 5.36 (strong positive residual), meaning there are far more high child mortality cases than expected.
  • Very High: Residual of 5.24 (strong positive residual), meaning there are far more very high child mortality cases than expected.

Africa stands out with higher-than-expected child mortality in the High and Very High categories.

Americas:

  • Low: Residual of 1.12, meaning there are slightly more low child mortality cases than expected.
  • Moderate: Residual of 1.67, meaning there are more moderate child mortality cases than expected.
  • High: Residual of -2.31, meaning there are fewer high child mortality cases than expected.
  • Very High: Residual of -1.70, meaning there are fewer very high child mortality cases than expected.

The Americas show a trend of fewer high and very high mortality rates than expected.

Eastern Mediterranean:

  • Residuals across categories are mostly small, indicating the observed child mortality levels are close to expected levels.

Europe:

  • Low: Residual of 3.88, meaning there are more low child mortality cases than expected.
  • Moderate: Residual of -2.24, meaning there are fewer moderate child mortality cases.
  • High: Residual of -2.70, meaning there are fewer high child mortality cases.
  • Very High: Residual of -2.09, meaning there are fewer very high child mortality cases.

Europe has fewer moderate, high, and very high mortality cases, and more low mortality cases than expected.

South-East Asia:

  • Low: Residual of -1.14, meaning fewer low child mortality cases than expected.
  • Moderate: Residual of 2.00, meaning more moderate child mortality cases than expected.
  • High: Residual of 0.49, close to expected.
  • Very High: Residual of -0.95, slightly fewer than expected.

Western Pacific:

  • Low: Residual of -0.55, slightly fewer low child mortality cases than expected.
  • Moderate: Residual of 3.02, meaning there are more moderate child mortality cases than expected.
  • High: Residual of -1.09, slightly fewer high child mortality cases.
  • Very High: Residual of -1.49, fewer very high child mortality cases.

Summary:

  • Africa and Europe stand out the most. Africa has more high and very high child mortality cases than expected, while Europe has fewer moderate, high, and very high cases and more low child mortality cases.
  • The Americas also show fewer high and very high cases, while the Western Pacific has more moderate mortality cases.

These residuals highlight the differences between regions in terms of child mortality patterns.

Practical Steps in R

  1. Create a Contingency Table: Use table() to summarize the counts of observations.
  2. Run the Test: Use chisq.test() to perform the Chi-Squared test.
  3. Interpret Results: Look at the Chi-Squared statistic and p-value.