Statistics is the science of collecting, analyzing, interpreting, presenting, and organizing data. In this course, we will cover essential statistical concepts and methods using R and the WHO dataset.
Descriptive statistics
The mean is the arithmetic average. When the data contain outliers, we use the median instead of the mean and the MAD (median absolute deviation) instead of the standard deviation (sd), because the median and the MAD are robust to outliers.
So, for a skewed distribution we use:
1. the median
2. the Spearman correlation
For a normal distribution we use the mean, the Pearson correlation, and the standard deviation.
To find out whether a distribution is normal or skewed, we use visual techniques or statistical tests.
library(ggplot2)
WHO <- read.csv("WHO.csv")
ggplot(WHO, aes(x = LifeExpectancy)) + geom_density() # the density plot shows the skewness
Before interpreting the Shapiro-Wilk test, we need to know what is being tested, that is, what the null hypothesis is. The null hypothesis of the Shapiro-Wilk test is that the data come from a normal distribution. If the p-value is below 0.05, we reject the null hypothesis at the 5% significance level; in this case we conclude that the distribution is not normal (here, skewed).
shapiro.test(WHO$LifeExpectancy)
##
## Shapiro-Wilk normality test
##
## data: WHO$LifeExpectancy
## W = 0.93077, p-value = 5.696e-08
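To see how the p-value behaves in both situations, here is a minimal sketch on simulated data rather than the WHO file. The seed, the sample size, and the choice of an exponential distribution as the skewed example are illustrative assumptions.

```r
# Simulated data (an assumption for illustration; not the WHO dataset)
set.seed(42)
normal_sample <- rnorm(200, mean = 70, sd = 9)  # drawn from a normal distribution
skewed_sample <- rexp(200, rate = 1 / 10)       # strongly right-skewed

p_normal <- shapiro.test(normal_sample)$p.value
p_skewed <- shapiro.test(skewed_sample)$p.value

# A p-value below 0.05 leads us to reject the null hypothesis of normality
p_skewed < 0.05
p_normal
```

For the skewed sample the test rejects normality; for the normal sample the p-value depends on the particular draw.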
To get descriptive statistics for LifeExpectancy, the simplest approach is summary():
summary(WHO$LifeExpectancy)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 47.00 64.00 72.50 70.01 76.00 83.00
sd (WHO$LifeExpectancy)
## [1] 9.259075
Note that summary() does not report a measure of spread: the standard deviation (for a normal distribution) is computed separately with sd(), and the MAD (for a skewed distribution) with mad().
For example, take the data vector 57, 40, 103, 234, 93, 53, 116, 98, 108, 121, 22. The mean is the sum of all values divided by the number of observations:
sum(c(57,40,103,234,93,53,116,98,108,121,22))/11 # so the mean is 95
## [1] 95
The median works a little differently: first we sort the data from smallest to largest: 22, 40, 53, 57, 93, 98, 103, 108, 116, 121, 234.
The median is the middle value. With 11 elements, the middle value is the sixth element, 98, so the median is 98.
median(c(57,40,103,234,93,53,116,98,108,121,22))
## [1] 98
The range is another descriptive measure: the difference between the largest and the smallest value, here 234 - 22 = 212. In R, range() returns the minimum and the maximum:
range(c(57,40,103,234,93,53,116,98,108,121,22))
## [1] 22 234
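Since range() returns two values, the single-number range can be obtained by combining it with diff(). A small sketch; the name x is just a label for the example vector.

```r
x <- c(57, 40, 103, 234, 93, 53, 116, 98, 108, 121, 22)

range(x)         # minimum and maximum: 22 234
diff(range(x))   # maximum minus minimum: 212
max(x) - min(x)  # the same value computed directly
```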
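The interquartile range (IQR) is the distance between the first and third quartiles, that is, the middle 50% of the distribution (Q3 - Q1). Here Q1 is taken as the median of the elements that lie strictly between the minimum and the median (the minimum and the median themselves are not counted), i.e. of 40, 53, 57, 93. In this example Q1 is: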
(53+57)/2
## [1] 55
By the same logic, Q3 is the median of 103, 108, 116, 121:
(108+116)/2
## [1] 112
summary(c(57,40,103,234,93,53,116,98,108,121,22))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 22 55 98 95 112 234
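The quartiles and the IQR can also be computed directly. The sketch below uses quantile() and IQR() with R's default quantile method, which for this particular vector gives the same values as the hand calculation; other quantile definitions can give slightly different quartiles.

```r
x <- c(57, 40, 103, 234, 93, 53, 116, 98, 108, 121, 22)

q <- quantile(x, probs = c(0.25, 0.75))  # Q1 and Q3
q
q[["75%"]] - q[["25%"]]                  # IQR by subtraction: 57
IQR(x)                                   # built-in equivalent: 57
```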
Outliers
The general formula to detect outliers is based on the Interquartile Range (IQR) method. Here’s how it’s done:
Calculate the IQR: \[ IQR = Q3 - Q1 \] where \(Q1\) is the first quartile (25th percentile) and \(Q3\) is the third quartile (75th percentile).
Determine the lower and upper bounds for outliers: \[ \text{Lower Bound} = Q1 - 1.5 \times IQR \] \[ \text{Upper Bound} = Q3 + 1.5 \times IQR \]
Outliers are data points that fall below the lower bound or above the upper bound: \[ \text{Outliers} = \{ x \, | \, x < Q1 - 1.5 \times IQR \text{ or } x > Q3 + 1.5 \times IQR \} \]
55 - 1.5 * (112-55) # the lower bound is -30.5, so there are no outliers on the low side
## [1] -30.5
1.5 * (112-55) + 112 # everything above 197.5 is an outlier, so the value 234 is one
## [1] 197.5
In R we can check this visually and have the outliers printed at the same time:
boxplot(c(57,40,103,234,93,53,116,98,108,121,22))$out
## [1] 234
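The 1.5 × IQR rule can also be written as a small function. This is a sketch only: iqr_outliers is a hypothetical helper name, not a base R function, and it uses quantile() while boxplot() uses hinges, so the two can differ slightly for some sample sizes.

```r
# Hypothetical helper (not part of base R): returns values outside the 1.5 * IQR fences
iqr_outliers <- function(x) {
  x <- x[!is.na(x)]                        # ignore missing values
  q1 <- quantile(x, 0.25, names = FALSE)
  q3 <- quantile(x, 0.75, names = FALSE)
  fence <- 1.5 * (q3 - q1)
  x[x < q1 - fence | x > q3 + fence]
}

x <- c(57, 40, 103, 234, 93, 53, 116, 98, 108, 121, 22)
iqr_outliers(x)  # 234
```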
In the same way, we can return to the WHO data and check for outliers (try this yourself):
boxplot(WHO$LifeExpectancy)$out
## numeric(0)
boxplot(WHO$ChildMortality)$out
## [1] 163.5 128.6 149.8 145.7 129.1 181.6 147.4
Variance (s^2): the average squared deviation of each score from the mean.
varijansa = 32246/10 # 3224.6, the sum of squared deviations divided by n - 1
That is, for each element (e.g. 22) we take its distance from the mean and square it:
(22-95)^2 # 5329
## [1] 5329
# and so on for every element, (40-95)^2, ...; all of these are summed and divided by n - 1, i.e. 11 - 1 = 10
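The full hand calculation can be sketched in a few lines. The names x, sq_dev, and s2 are illustrative labels.

```r
x <- c(57, 40, 103, 234, 93, 53, 116, 98, 108, 121, 22)

sq_dev <- (x - mean(x))^2            # squared deviation of each element from the mean
sum(sq_dev)                          # 32246
s2 <- sum(sq_dev) / (length(x) - 1)  # divide by n - 1
s2                                   # 3224.6
```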
The standard deviation is used more often; it is the square root of the variance:
sqrt(3224.6) # 56.79
## [1] 56.78556
# check this in R as well
var(c(57,40,103,234,93,53,116,98,108,121,22))
## [1] 3224.6
sd (c(57,40,103,234,93,53,116,98,108,121,22))
## [1] 56.78556
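The variance and standard deviation of this vector are inflated by the outlier 234. As a sketch of the robust alternative mentioned earlier, the MAD can be computed by hand or with mad(); note that mad() multiplies the raw value by 1.4826 by default so that it is comparable to the standard deviation for normal data.

```r
x <- c(57, 40, 103, 234, 93, 53, 116, 98, 108, 121, 22)

median(abs(x - median(x)))  # raw MAD: 23
mad(x, constant = 1)        # the same raw MAD via mad()
mad(x)                      # default scaling: 1.4826 * 23
sd(x[x != 234])             # sd without the outlier, for comparison
```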
Before computing a correlation, we must visualize the data. The relationship between two variables can be parabolic, in which case the correlation coefficient can be close to 0 even though the variables are clearly related.
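A minimal sketch of this situation, using made-up values rather than the WHO data: y is completely determined by x, yet the Pearson correlation is essentially zero.

```r
# Hypothetical data: a perfect parabola
x <- seq(-10, 10, by = 1)
y <- x^2

cor(x, y)   # essentially 0: there is no linear relationship
plot(x, y)  # the plot reveals the parabolic relationship
```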
cor(WHO$FertilityRate, WHO$ChildMortality,use='complete.obs')
## [1] 0.8640376
# the correlation between FertilityRate and ChildMortality is 0.864; the "use" argument drops observations with NA
plot(WHO$FertilityRate, WHO$ChildMortality)
plot(WHO$GNI,WHO$LiteracyRate)
plot(log(WHO$GNI),WHO$LiteracyRate)
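When a relationship is monotone but not linear, the Spearman correlation recommended above for skewed data can be requested through the method argument of cor(). The sketch below uses made-up values, not the WHO variables.

```r
# Hypothetical data: monotone but strongly non-linear
x <- 1:20
y <- exp(x / 2)

pearson  <- cor(x, y)                       # below 1, the relationship is not linear
spearman <- cor(x, y, method = "spearman")  # 1, the ranks agree perfectly
c(pearson = pearson, spearman = spearman)
```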
summary(WHO$FertilityRate)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.260 1.835 2.400 2.941 3.905 7.580 11
quantile(WHO$FertilityRate, na.rm = TRUE) # quartiles (0%, 25%, 50%, 75%, 100% quantiles)
## 0% 25% 50% 75% 100%
## 1.260 1.835 2.400 3.905 7.580
var(WHO$FertilityRate, na.rm = TRUE) # variance
## [1] 2.193315
sd(WHO$FertilityRate, na.rm = TRUE) # standard deviation; note that sd = sqrt(var), here sqrt(2.193315)
## [1] 1.480984
The Central Limit Theorem states that the distribution of sample means approaches a normal distribution as the sample size increases, regardless of the original population distribution.
# Example: Sampling distribution of the mean for Life Expectancy
set.seed(123) # For reproducibility
sample_means <- replicate(1000, mean(sample(WHO$LifeExpectancy, 30, replace = TRUE)))
hist(sample_means, breaks = 30, main = "Sampling Distribution of the Mean", xlab = "Mean Life Expectancy")
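The same idea can be checked on a population whose shape is known. In this sketch the population is assumed to be exponential with rate 1 (mean 1, standard deviation 1), so the sample means should centre on 1 with a spread close to 1/sqrt(n); the seed and the number of replications are arbitrary choices.

```r
# Simulated skewed population (an assumption for illustration)
set.seed(123)
n <- 30
sample_means <- replicate(5000, mean(rexp(n, rate = 1)))

mean(sample_means)  # close to the population mean, 1
sd(sample_means)    # close to the theoretical standard error
1 / sqrt(n)         # theoretical standard error: sd / sqrt(n)

hist(sample_means, breaks = 30,
     main = "Sample means from an exponential population", xlab = "Sample mean")
```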
The normal distribution is a probability distribution that is symmetric about the mean, representing many real-world phenomena.
# Plotting the normal distribution
x <- seq(40, 90, by = 0.1)
y <- dnorm(x, mean = mean(WHO$LifeExpectancy, na.rm = TRUE), sd = sd(WHO$LifeExpectancy, na.rm = TRUE))
plot(x, y, type = "l", main = "Normal Distribution of Life Expectancy", xlab = "Life Expectancy", ylab = "Density")
The CLT states that the sampling distribution of a statistic will be close to normal when the sample size is large enough. As a rough guide, a roughly normal sampling distribution can be expected under any of the following conditions:
1. the population distribution is normal; or
2. the sample data are symmetric and the sample size is n <= 15; or
3. the sample data are moderately skewed and the sample size is 16 <= n <= 30; or
4. the sample size is n > 30 and there are no outliers.
T-tests are used to compare the means of two groups. We can perform an independent t-test to compare life expectancy between two regions.
Test statistic = signal/noise = variance explained by the model / variance not explained by the model = effect/error. The larger the absolute value of t (or of another test statistic), the more likely you are to reject H0, since there is more signal than noise.
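For the Welch two-sample t-test, which is the default in t.test(), the signal is the difference in group means and the noise is its standard error. A sketch with made-up measurements; the vectors a and b are hypothetical and not taken from the WHO data.

```r
# Hypothetical measurements for two groups (illustration only)
a <- c(55, 60, 58, 62, 57, 59)
b <- c(75, 78, 74, 80, 77, 76, 79)

signal <- mean(a) - mean(b)                              # effect
noise  <- sqrt(var(a) / length(a) + var(b) / length(b))  # standard error of the difference
t_manual <- signal / noise

t_builtin <- unname(t.test(a, b)$statistic)              # Welch t-test statistic
c(manual = t_manual, builtin = t_builtin)
```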
# T-test between two regions
t.test(LifeExpectancy ~ Region, data = WHO[WHO$Region %in% c("Europe", "Africa"), ])
##
## Welch Two Sample t-test
##
## data: LifeExpectancy by Region
## t = -15.478, df = 82.606, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Africa and group Europe is not equal to 0
## 95 percent confidence interval:
## -21.19264 -16.36601
## sample estimates:
## mean in group Africa mean in group Europe
## 57.95652 76.73585
The results of this t-test provide strong evidence that there is a significant difference in life expectancy between Africa and Europe, with Africa having a notably lower mean life expectancy. The confidence interval confirms that the difference is statistically significant and likely reflects a real disparity in health outcomes between these two regions.
t.test(LifeExpectancy ~ Region, data = WHO[WHO$Region %in% c("Europe", "Africa"), ],conf.level = 0.90)
##
## Welch Two Sample t-test
##
## data: LifeExpectancy by Region
## t = -15.478, df = 82.606, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Africa and group Europe is not equal to 0
## 90 percent confidence interval:
## -20.79761 -16.76104
## sample estimates:
## mean in group Africa mean in group Europe
## 57.95652 76.73585
Chi-square tests assess whether there is a significant association between categorical variables. We’ll create a new categorical variable for child mortality and then perform a chi-square test.
### Assumptions for the Chi-Square Test
The Chi-Squared test assesses how expectations compare to actual observed data. It is commonly used in the following contexts:
The formula to calculate the Chi-Squared statistic (\(X^2\)) is:
\[ X^2 = \sum \frac{(O - E)^2}{E} \]
Where: - \(O\) = observed frequency - \(E\) = expected frequency
The summation is done over all categories in the table.
After calculating the Chi-Squared statistic, you compare it to a critical value from the Chi-Squared distribution table, based on the chosen significance level (commonly 0.05) and the degrees of freedom.
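The calculation can be sketched by hand on a small table. The counts below are hypothetical and chosen so that the arithmetic is easy to follow: every expected count is 25, which gives a statistic of 4 with 2 degrees of freedom.

```r
# Hypothetical 2 x 3 table of observed counts (illustration only)
observed <- matrix(c(20, 30, 25,
                     30, 20, 25),
                   nrow = 2, byrow = TRUE)

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

x2 <- sum((observed - expected)^2 / expected)          # chi-squared statistic: 4
df <- (nrow(observed) - 1) * (ncol(observed) - 1)      # degrees of freedom: 2

critical <- qchisq(0.95, df)                           # critical value at the 0.05 level
p_value  <- pchisq(x2, df, lower.tail = FALSE)
c(x2 = x2, critical = critical, p_value = p_value)
```

Here the statistic is below the critical value, so the null hypothesis of independence would not be rejected at the 0.05 level.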
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Create a categorical variable for Child Mortality
WHO <- WHO %>%
  mutate(ChildMortalityCategory = cut(ChildMortality,
                                      breaks = c(-Inf, 20, 50, 100, Inf),
                                      labels = c("Low", "Moderate", "High", "Very High")))
table(WHO$ChildMortalityCategory, WHO$Region)
##
## Africa Americas Eastern Mediterranean Europe South-East Asia
## Low 3 23 12 48 3
## Moderate 3 11 3 3 5
## High 26 1 5 2 3
## Very High 14 0 2 0 0
##
## Western Pacific
## Low 12
## Moderate 12
## High 3
## Very High 0
# Chi-square test between Child Mortality Category and Region
chisq.test(table(WHO$ChildMortalityCategory, WHO$Region))
## Warning in chisq.test(table(WHO$ChildMortalityCategory, WHO$Region)):
## Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: table(WHO$ChildMortalityCategory, WHO$Region)
## X-squared = 142.11, df = 15, p-value < 2.2e-16
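The warning "Chi-squared approximation may be incorrect" appears when some expected counts are small (below 5). A sketch of how to inspect the expected counts and, as one possible alternative, request a simulated p-value; the table below is hypothetical and the seed and B = 2000 are arbitrary choices.

```r
# Hypothetical sparse table (illustration only)
counts <- matrix(c(12, 3, 1,
                    2, 9, 4),
                 nrow = 2, byrow = TRUE)

res <- suppressWarnings(chisq.test(counts))
res$expected           # expected counts under independence
any(res$expected < 5)  # TRUE here, which is what triggers the warning

# One option for sparse tables: a Monte Carlo p-value instead of the chi-squared approximation
set.seed(1)
p_sim <- chisq.test(counts, simulate.p.value = TRUE, B = 2000)$p.value
p_sim
```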
The chisq.test() function in R does not have a conf.level parameter like the t.test() function. The Chi-squared test primarily provides the test statistic and p-value without direct calculation of confidence intervals.
chisq.test(table(WHO$ChildMortalityCategory, WHO$Region))$residuals
## Warning in chisq.test(table(WHO$ChildMortalityCategory, WHO$Region)):
## Chi-squared approximation may be incorrect
##
## Africa Americas Eastern Mediterranean Europe
## Low -4.2806846 1.1193971 0.1614481 3.8849552
## Moderate -1.9491146 1.6738873 -0.5838146 -2.2357570
## High 5.3626905 -2.3141016 0.2178213 -2.7007171
## Very High 5.2399292 -1.6989991 0.1377623 -2.0907257
##
## South-East Asia Western Pacific
## Low -1.1394566 -0.5485667
## Moderate 2.0035968 3.0188489
## High 0.4860278 -1.0879692
## Very High -0.9524791 -1.4922480
This output shows the Pearson residuals from a chi-squared test comparing the Child Mortality Category across different WHO Regions. The residuals provide insight into the strength and direction of the relationship between the two variables, indicating where the observed counts differ most from the expected counts.
Pearson residuals help us understand the difference between observed and expected counts in a contingency table. The formula for calculating residuals is:
\[ \text{residual} = \frac{O - E}{\sqrt{E}} \]
where \(O\) is the observed count and \(E\) is the expected count.
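A Pearson residual is the observed count minus the expected count, divided by the square root of the expected count. The sketch below computes the residuals by hand on a hypothetical table and compares them with the residuals component returned by chisq.test(); the group and category labels are made up.

```r
# Hypothetical table of observed counts (illustration only)
observed <- matrix(c(20, 30, 25,
                     30, 20, 25),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Group = c("A", "B"),
                                   Category = c("Low", "Mid", "High")))

expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
manual   <- (observed - expected) / sqrt(expected)  # Pearson residuals by hand
builtin  <- chisq.test(observed)$residuals          # residuals reported by R

manual
all.equal(manual, builtin, check.attributes = FALSE)
```

Positive residuals mark cells with more observations than expected, negative residuals mark cells with fewer.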
Africa stands out with higher-than-expected child mortality in the High and Very High categories.
The Americas show a trend of fewer high and very high mortality rates than expected.
Europe has fewer moderate, high, and very high mortality cases, and more low mortality cases than expected.
These residuals highlight the differences between regions in terms of child mortality patterns.
In R, table() is used to summarize the counts of observations, and chisq.test() to perform the Chi-Squared test.