PSM visuals

Olumide Adeola

PSM Tutorial Course Visuals

https://chasfatprojects.netlify.app/

Discrete Probability Distributions

Binomial Distributions

Binomial Standard Deviation: 1.581139 

Probability of getting exactly 6 heads in 10 flips: 0.07160367 
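
A minimal R sketch of these computations; the parameters are assumptions, since the original call is not shown (n = 10, p = 0.5 reproduces the quoted SD, though not the "exactly 6 heads" value, which was evidently computed with a different prob):

    n <- 10; p <- 0.5                  # assumed: 10 flips of a fair coin
    sqrt(n * p * (1 - p))              # Binomial SD: sqrt(2.5) = 1.581139
    dbinom(6, size = n, prob = p)      # P(exactly 6 heads) = 0.2050781 when p = 0.5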

Bernoulli Distribution

Bernoulli Standard Deviation: 0.4582576 
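
The quoted SD is consistent with p = 0.3 (or 0.7), since the Bernoulli SD is sqrt(p(1 - p)) = sqrt(0.21); a sketch assuming p = 0.3:

    p <- 0.3                 # assumed success probability
    sqrt(p * (1 - p))        # Bernoulli SD: 0.4582576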

Hypergeometric Distribution

Hypergeometric Standard Deviation: 0.9787004 
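
One parameter set consistent with the quoted SD (an assumption; the original values are not shown) is N = 30 items, K = 10 of them successes, and n = 5 draws without replacement:

    N <- 30; K <- 10; n <- 5                           # assumed parameters
    sqrt(n * (K/N) * (1 - K/N) * (N - n) / (N - 1))    # Hypergeometric SD: 0.9787004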

Negative Binomial Distribution

Negative Binomial Standard Deviation: 4.330127 
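
The quoted SD matches r = 5 target successes with success probability p = 0.4 (an assumption), using the variance r(1 - p)/p^2 for the number of failures before the r-th success:

    r <- 5; p <- 0.4              # assumed parameters
    sqrt(r * (1 - p) / p^2)       # Negative Binomial SD: 4.330127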

Conditions Governing the Poisson Distribution

For a situation to be modeled by a Poisson distribution, the following conditions should be satisfied:

  1. Independence: The occurrences of events are independent of each other. The occurrence of one event does not affect the probability of another event occurring.

  2. Constant Mean Rate: The average rate (λ) of occurrence of events must be constant over the observed interval. This means that if you know the average number of events that occur in a day, this average should remain stable over time.

  3. Discrete Events: The number of events counted is a discrete variable. You can count occurrences like the number of emails received in an hour, the number of phone calls at a call center, etc.

  4. Rare Events: The Poisson distribution is particularly useful for modeling rare events. If the number of trials (or the time frame) is large, but the actual event count is small relative to that (like the number of accidents at a specific intersection over a year), it fits well.

Implications of the Poisson Distribution

  1. Modeling Count Data: The Poisson distribution is ideal for modeling the number of times an event occurs in a fixed period. Examples include:

    • Number of cars passing through a toll booth in an hour.

    • Number of emails received per hour.

    • Number of decay events per unit time from a radioactive source.

  2. Applications in Various Fields:

    • Healthcare: Modeling the number of patients arriving at an emergency department.

    • Telecommunications: Analyzing the number of phone calls received at a call center.

    • Traffic Engineering: Predicting the number of accidents at intersections.

  3. Relation to Other Distributions:

    • If the number of trials is large and the probability of success is small, the Binomial distribution can be approximated by the Poisson distribution with λ = n·p.

    • This makes it easier to compute probabilities in situations with many trials and low probabilities.
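
As a quick check in R, the Poisson mass function reproduces the figure below, using the call-center rate λ = 5 introduced in the next example:

    # P(exactly 3 events) for a Poisson process with rate lambda = 5
    dpois(3, lambda = 5)    # 0.1403739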

Probability of receiving exactly 3 calls: 0.1403739 

Example Scenario: Call Center

Scenario: A call center receives an average of 5 calls per hour (λ = 5). We want to analyze the probability of receiving a certain number of calls in an hour.

Understanding Rarity

  • Rarity: In this context, if we look at the probabilities for receiving 0, 1, or 2 calls, we might find that these probabilities are relatively high compared to receiving 10 or more calls, which would be considered rare events given the average rate.

  • A higher k (like 10 or more calls) represents an outcome that is less likely (rare) compared to lower values (0-5 calls), where most outcomes are clustered around the mean.
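
Tabulating the mass function in R makes this rarity concrete (λ = 5 as in the scenario):

    round(dpois(0:10, lambda = 5), 4)           # P(X = k) for k = 0..10
    ppois(9, lambda = 5, lower.tail = FALSE)    # P(X >= 10), the rare tail event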

Continuous Distributions

Normal Distribution

Characteristics:

  • Shape: Bell-shaped curve.

  • Parameters: Mean (μ) and Standard Deviation (σ).

  • Use Cases: Heights, test scores, measurement errors—many natural phenomena are approximately normally distributed due to the Central Limit Theorem.
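
A sketch of how such a density curve is drawn in base R (μ = 0, σ = 1 are assumed here); the same d-prefix pattern applies to the distributions that follow (dexp, dunif, dgamma, dchisq, dt, dweibull):

    # Standard normal PDF over +/- 4 standard deviations
    curve(dnorm(x, mean = 0, sd = 1), from = -4, to = 4,
          ylab = "Density", main = "Normal distribution")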

Exponential Distribution

Characteristics:

  • Shape: Right-skewed.

  • Parameter: Rate (λ).

  • Use Cases: Time until an event occurs (e.g., waiting time in queues, lifetime of devices).

Uniform Distribution

Characteristics:

  • Shape: Flat, rectangular.

  • Parameters: Minimum (a) and Maximum (b).

  • Use Cases: Modeling situations where all outcomes are equally likely (e.g., rolling a fair die).

Gamma Distribution

Characteristics:

  • Shape: Can be right-skewed or resemble a normal distribution depending on parameters.

  • Parameters: Shape (k) and Scale (θ).

  • Use Cases: Modeling waiting times, reliability data.

Chi-Square Distribution

Characteristics:

  • Shape: Right-skewed, with the skewness decreasing as the degrees of freedom increase.

  • Parameter: Degrees of freedom (df).

  • Use Cases: Commonly used in hypothesis testing, particularly in tests of independence and goodness-of-fit.

Student’s t-Distribution

Characteristics:

  • Shape: Bell-shaped, similar to the normal distribution but with heavier tails. The shape depends on the degrees of freedom.

  • Parameter: Degrees of freedom (df).

  • Use Cases: Used in hypothesis testing, especially when the sample size is small and the population standard deviation is unknown.

Weibull Distribution

Characteristics:

  • Shape: Can take on different shapes depending on its parameters, typically right-skewed.

  • Parameters: Shape parameter (k) and scale parameter (λ).

  • Use Cases: Used in reliability analysis and survival studies.

Key Assumptions of the Student’s t-Test

  1. Normality:

    • The data in each group should be approximately normally distributed. This assumption is crucial, especially for small sample sizes (typically less than 30). For larger samples, the Central Limit Theorem suggests that the distribution of the sample mean will be approximately normal.
  2. Independence:

    • The observations must be independent of each other. This means that the data points in one group should not influence those in another group.
  3. Equal Variances (for Independent t-test):

    • The variances of the two groups should be approximately equal (homogeneity of variance). This can be tested using Levene’s test or similar methods.

Types of Student’s t-Tests

  1. Independent t-test:

    • Compares means from two different groups.

    • Example: Comparing test scores of students from two different classes.

  2. Paired t-test:

    • Compares means from the same group at two different times.

    • Example: Measuring blood pressure before and after treatment in the same group of patients.

  3. One-sample t-test:

    • Compares the mean of a single group to a known value (e.g., population mean).

Explanation of the Visualization Above

  • The plot shows the probability density function (PDF) of the Student’s t-distribution for different degrees of freedom.

  • As the degrees of freedom increase, the t-distribution approaches the standard normal distribution. This is because, with larger sample sizes, the sample mean becomes a better estimator of the population mean.
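
A minimal sketch of such a plot; the degrees of freedom chosen here are assumptions:

    dfs <- c(1, 3, 10, 30)    # assumed degrees of freedom
    curve(dnorm(x), from = -4, to = 4, lty = 2, ylab = "Density",
          main = "Student's t vs. standard normal")
    for (i in seq_along(dfs))
      curve(dt(x, df = dfs[i]), add = TRUE, col = i + 1)
    legend("topright", legend = c("N(0,1)", paste("df =", dfs)),
           lty = c(2, rep(1, 4)), col = c(1, 2:5))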


    Welch Two Sample t-test

data:  group1 and group2
t = -1.0742, df = 37.082, p-value = 0.2897
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -8.863821  2.721440
sample estimates:
mean of x mean of y 
 51.41624  54.48743 
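
The output above has the form produced by a call like the sketch below; group1 and group2 are assumed numeric vectors (R's t.test uses Welch's unequal-variances test by default):

    t.test(group1, group2)    # var.equal = FALSE (Welch) is the default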

Types of t-Tests

  1. Independent t-test:

    • Purpose: Compares the means of two independent groups to determine if they are significantly different from each other.

    • Example: Comparing the test scores of students from two different classes.

    • Assumptions: The samples must be independent, and the data should be normally distributed with equal variances.

  2. Paired t-test:

    • Purpose: Compares the means of two related groups (the same subjects measured at two different times) to see if there is a significant difference.

    • Example: Measuring the blood pressure of patients before and after treatment.

    • Assumptions: The differences between pairs should be normally distributed.

  3. One-sample t-test:

    • Purpose: Compares the mean of a single group to a known value (such as a population mean).

    • Example: Testing whether the average height of a sample of students is significantly different from the national average.

    • Assumptions: The sample should be normally distributed.

Statistical Tests and Degrees of Freedom

    • When you conduct a statistical test (like a t-test), degrees of freedom help define how many independent values are used to estimate the variability in the data.

    • For example, in an independent t-test comparing two groups, the degrees of freedom are the total number of scores minus the number of groups (n1 + n2 - 2); with 25 scores in one class and 30 in the other, df = 25 + 30 - 2 = 53.

Why is it Important?

  • Influences Critical Values: Degrees of freedom affect the shape of the distribution used in statistical tests. This, in turn, affects the critical values that determine whether your test results are significant.

  • Informs Sample Size: More degrees of freedom typically mean you have a larger sample size, which can lead to more reliable estimates.

  • Power of the Test: Higher degrees of freedom can increase the power of the test, meaning a better chance of detecting a true effect if it exists.
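
The paired output shown next has the form produced by a call like this sketch (the raw instrument readings are not included in the notes):

    # first_instrument and second_instrument: paired readings on the same subjects
    t.test(first_instrument, second_instrument, paired = TRUE)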


    Paired t-test

data:  first_instrument and second_instrument
t = -1.7928, df = 9, p-value = 0.1066
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 -2.261771  0.261771
sample estimates:
mean difference 
             -1 
Mean of differences: -1 
Standard deviation of differences: 1.763834 
Standard Error of the Mean (SEM): 0.5577734 
Degrees of Freedom: 9 
t-Statistic: -1.792843 


Question

A random sample of 6 patients with ischemic heart disease was treated with clofibrate, and the concentration of their plasma fibrinogen was determined as follows:

Patient no.:  1    2    3    4    5    6
Pre-value:    379  351  420  303  346  370
Post-value:   325  333  391  275  311  323

Does the treatment have a statistically significant effect?
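
In R, the question is answered with a paired t-test; this sketch reproduces the output below:

    pre_values  <- c(379, 351, 420, 303, 346, 370)
    post_values <- c(325, 333, 391, 275, 311, 323)
    t.test(pre_values, post_values, paired = TRUE)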


    Paired t-test

data:  pre_values and post_values
t = 6.4974, df = 5, p-value = 0.001289
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 21.25356 49.07977
sample estimates:
mean difference 
       35.16667 
Mean of differences (pre - post): 35.16667 
Standard deviation of differences: 13.2577 
Standard Error of the Mean (SEM): 5.412434 
Degrees of Freedom: 5 
t-Statistic: 6.497385 

Test Results Summary

  • t-Statistic: 6.4974

  • Degrees of Freedom (df): 5

  • p-value: 0.001289

  • Mean Difference (pre - post): 35.16667

  • 95% Confidence Interval: (21.25356, 49.07977)

Conclusion

  1. Statistical Significance:

    • The p-value of 0.001289 is significantly less than the common alpha level of 0.05. This indicates that we reject the null hypothesis, which states that there is no difference in fibrinogen levels before and after treatment with clofibrate.
  2. Mean Difference:

    • The mean difference of 35.17 (pre minus post) indicates that, on average, post-treatment fibrinogen levels are about 35.17 units lower than pre-treatment levels. This is a clinically meaningful reduction.
  3. Confidence Interval:

    • The 95% confidence interval for the mean difference is (21.25356, 49.07977). This means we are 95% confident that the true mean difference in fibrinogen levels lies between approximately 21.25 and 49.08 units. Since the interval does not include zero, it further supports the conclusion that the treatment had a significant effect.

Independent t-test

Let’s say we have two groups of patients with different treatments for ischemic heart disease, and we want to compare their plasma fibrinogen levels.

Sample Data

  • Group A (Treatment 1): 379, 351, 420, 303, 346, 370

  • Group B (Treatment 2): 325, 333, 391, 275, 311, 323

Steps for Independent Samples t-Test

  1. Formulate Hypotheses:

    • Null Hypothesis (H0): The means of the two groups are equal (no difference).

    • Alternative Hypothesis (H1): The means of the two groups are not equal (there is a difference).

  2. Perform the t-Test.

  3. Visualize the Data.

  4. Interpret the Results.

R Code
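
A sketch using the sample data above; t.test with default settings yields the Welch output shown below:

    # Plasma fibrinogen by treatment group
    group_A <- c(379, 351, 420, 303, 346, 370)
    group_B <- c(325, 333, 391, 275, 311, 323)
    t.test(group_A, group_B)    # Welch's test by default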


    Welch Two Sample t-test

data:  group_A and group_B
t = 1.5896, df = 9.99, p-value = 0.143
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -14.13316  84.46649
sample estimates:
mean of x mean of y 
 361.5000  326.3333 

Chi-Square Test Overview

  1. Hypotheses:

    • Null Hypothesis (H0): There is no association between the variables (they are independent).

    • Alternative Hypothesis (H1): There is an association between the variables (they are dependent).
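
A minimal sketch of the call behind the output below; the actual contingency table is not shown in the notes, so the counts here are hypothetical:

    # Hypothetical 2x2 contingency table (the notes' real counts are not shown)
    data <- matrix(c(30, 10, 12, 28), nrow = 2,
                   dimnames = list(Exposure = c("Yes", "No"),
                                   Outcome  = c("Yes", "No")))
    chisq.test(data)    # Yates' continuity correction is the default for 2x2 tables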


    Pearson's Chi-squared test with Yates' continuity correction

data:  data
X-squared = 15.042, df = 1, p-value = 0.0001052

Correlation

Correlation is a statistical measure that describes the strength and direction of a relationship between two variables. It helps determine how closely related the variables are and whether an increase in one variable corresponds to an increase or decrease in another.

Key Concepts in Correlation

  1. Correlation Coefficient:

    • The most common measure of correlation is the Pearson correlation coefficient (r), which ranges from -1 to 1.

      • r = 1: Perfect positive correlation (as one variable increases, the other also increases).

      • r = -1: Perfect negative correlation (as one variable increases, the other decreases).

      • r = 0: No correlation (no linear relationship between the variables).

    • There are also other types of correlation coefficients, such as Spearman’s rank correlation, a non-parametric measure used for ordinal data or for relationships that are monotonic but not linear.

  2. Interpreting r:

    • 0.1 to 0.3: Weak correlation

    • 0.3 to 0.5: Moderate correlation

    • 0.5 to 0.7: Strong correlation

    • 0.7 to 0.9: Very strong correlation

    • 0.9 to 1.0: Extremely strong correlation

  3. Scatter Plot:

    • A scatter plot is often used to visualize the relationship between two variables. Each point represents an observation in the dataset, with one variable plotted on the x-axis and the other on the y-axis.

Calculating Correlation in R
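
A sketch assuming x and y are numeric vectors (the data behind the value below is not shown):

    r <- cor(x, y)                    # Pearson correlation (the default method)
    cor(x, y, method = "spearman")    # Spearman's rank correlation
    plot(x, y)                        # scatter plot of the relationship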

Correlation coefficient (r): 0.9965217 

Regression

Regression analysis is a statistical technique used to model and analyze the relationships between a dependent variable (outcome) and one or more independent variables (predictors). It helps in predicting the value of the dependent variable based on the values of the independent variables.

Key Concepts in Regression

  1. Types of Regression:

    • Linear Regression: Models the relationship between two variables by fitting a linear equation (line) to the observed data.

    • Multiple Linear Regression: Extends linear regression to include multiple independent variables.

    • Logistic Regression: Used when the dependent variable is categorical (e.g., binary outcomes like yes/no).

    • Polynomial Regression: Models the relationship as a polynomial equation, allowing for curves in the data.
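
The summary below has the form produced by a call like this sketch; data is assumed to be a data frame with numeric columns X and Y:

    model <- lm(Y ~ X, data = data)    # simple linear regression of Y on X
    summary(model)                     # coefficients, R-squared, F-statistic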

    
    Call:
    lm(formula = Y ~ X, data = data)
    
    Residuals:
        Min      1Q  Median      3Q     Max 
    -3.3333 -1.6667  0.1667  1.1667  4.0000 
    
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
    (Intercept)   1.3333     1.5899   0.839    0.426    
    X             8.6667     0.2562  33.823 6.38e-10 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    Residual standard error: 2.327 on 8 degrees of freedom
    Multiple R-squared:  0.9931,    Adjusted R-squared:  0.9922 
    F-statistic:  1144 on 1 and 8 DF,  p-value: 6.377e-10

Blood Pressure Data for Patients
Systolic Blood Pressure (SBP) and Diastolic Blood Pressure (DBP)

Patient   SBP (mmHg)   DBP (mmHg)
1         110          65
2         124          70
3         116          75
4         120          80
5         135          85
6         148          90
7         136          95
8         165          100
9         152          105
10        172          110
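
A sketch using the table above; regressing DBP on SBP reproduces the summary below:

    data <- data.frame(
      SBP = c(110, 124, 116, 120, 135, 148, 136, 165, 152, 172),
      DBP = c(65, 70, 75, 80, 85, 90, 95, 100, 105, 110)
    )
    model <- lm(DBP ~ SBP, data = data)
    summary(model)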

Call:
lm(formula = DBP ~ SBP, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.3153 -4.2159 -0.4493  3.7626  8.6980 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -4.21439   13.48479  -0.313 0.762630    
SBP          0.66556    0.09685   6.872 0.000128 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.111 on 8 degrees of freedom
Multiple R-squared:  0.8551,    Adjusted R-squared:  0.837 
F-statistic: 47.23 on 1 and 8 DF,  p-value: 0.0001281